Fix section_text last_paragraph start index off by separator length#1996
Open
Chessing234 wants to merge 1 commit intoMIT-LCP:mainfrom
Open
Conversation
When no 'impression'/'findings' section is present and the final section
contains '\n \n', section_text splits the final section and appends a
synthetic 'last_paragraph' section:
if '\n \n' in sections[-1]:
sections.append('\n \n'.join(sections[-1].split('\n \n')[1:]))
sections[-2] = sections[-2].split('\n \n')[0]
section_names.append('last_paragraph')
section_idx.append(section_idx[-1] + len(sections[-2]))
The new entry in section_idx is computed from the kept first part only,
but the second part actually starts after that first part *plus* the
three-character '\n \n' separator. Callers that use section_idx to
slice back into the original text -- which is what the docstring
advertises it for ('list of start indices of the text in the section')
-- therefore get a last_paragraph offset that points 3 characters early,
into the trailing separator of the previous section.
Adding len('\n \n') accounts for the separator that split() removes, so
the new offset lands on the actual first character of the last paragraph
and matches the semantics of the indices already produced for the other
sections (all of which are set to s.end() of the section header match,
i.e. past the separator).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Bug
section_parser.section_textreturns asection_idxlist documented as "list of start indices of the text in the section" — offsets back into the original report string. When the report has noimpression/findingssection and the final section contains'\n \n', the function synthesises a'last_paragraph'section by splitting on that separator:The computed offset for the new
'last_paragraph'section issection_idx[-1] + len(sections[-2]), butsections[-2]at that point holds only the first part of the split — the original section up to (but not including) the'\n \n'separator.str.splitremoves the separator, so the second part (which becomes the new'last_paragraph') actually startslen('\n \n')(3) characters later in the original text.Root cause
The offset math omits the length of the separator that
split()consumed. For every other entry insection_idx, the value is the end of the section-header regex match (s.end()) — i.e. already past the:\sseparator — so using raw offsets relative to the original text is the invariant callers rely on. This branch breaks that invariant.Why the fix is correct
Adding
len('\n \n')puts the offset at the first character of the last paragraph in the original text, matching whattext[section_idx[i]:]gives for every other produced index:sections[-2]is the first part (unchanged by this commit).section_idx[-1]is still the original section's start index at the point of append (not yet mutated).len('\n \n')is the separator length thatstr.split('\n \n')drops.No other behavioural change: the returned
sectionslist is identical,section_namesis identical, only the previously-incorrectsection_idx[-1]for thelast_paragraphentry moves forward by 3 to the correct position.