Skip to content

Fix section_text last_paragraph start index off by separator length#1996

Open
Chessing234 wants to merge 1 commit intoMIT-LCP:mainfrom
Chessing234:fix/section-parser-last-paragraph-idx-missing-separator
Open

Fix section_text last_paragraph start index off by separator length#1996
Chessing234 wants to merge 1 commit intoMIT-LCP:mainfrom
Chessing234:fix/section-parser-last-paragraph-idx-missing-separator

Conversation

@Chessing234
Copy link
Copy Markdown

Bug

section_parser.section_text returns a section_idx list documented as "list of start indices of the text in the section" — offsets back into the original report string. When the report has no impression/findings section and the final section contains '\n \n', the function synthesises a 'last_paragraph' section by splitting on that separator:

if '\n \n' in sections[-1]:
    sections.append('\n \n'.join(sections[-1].split('\n \n')[1:]))
    sections[-2] = sections[-2].split('\n \n')[0]
    section_names.append('last_paragraph')
    section_idx.append(section_idx[-1] + len(sections[-2]))

The computed offset for the new 'last_paragraph' section is section_idx[-1] + len(sections[-2]), but sections[-2] at that point holds only the first part of the split — the original section up to (but not including) the '\n \n' separator. str.split removes the separator, so the second part (which becomes the new 'last_paragraph') actually starts len('\n \n') (3) characters later in the original text.

Root cause

The offset math omits the length of the separator that split() consumed. For every other entry in section_idx, the value is the end of the section-header regex match (s.end()) — i.e. already past the :\s separator — so using raw offsets relative to the original text is the invariant callers rely on. This branch breaks that invariant.

Why the fix is correct

Adding len('\n \n') puts the offset at the first character of the last paragraph in the original text, matching what text[section_idx[i]:] gives for every other produced index:

-            section_idx.append(section_idx[-1] + len(sections[-2]))
+            section_idx.append(section_idx[-1] + len(sections[-2]) + len('\n \n'))
  • sections[-2] is the first part (unchanged by this commit).
  • section_idx[-1] is still the original section's start index at the point of append (not yet mutated).
  • len('\n \n') is the separator length that str.split('\n \n') drops.

No other behavioural change: the returned sections list is identical, section_names is identical, only the previously-incorrect section_idx[-1] for the last_paragraph entry moves forward by 3 to the correct position.

When no 'impression'/'findings' section is present and the final section
contains '\n \n', section_text splits the final section and appends a
synthetic 'last_paragraph' section:

    if '\n \n' in sections[-1]:
        sections.append('\n \n'.join(sections[-1].split('\n \n')[1:]))
        sections[-2] = sections[-2].split('\n \n')[0]
        section_names.append('last_paragraph')
        section_idx.append(section_idx[-1] + len(sections[-2]))

The new entry in section_idx is computed from the kept first part only,
but the second part actually starts after that first part *plus* the
three-character '\n \n' separator. Callers that use section_idx to
slice back into the original text -- which is what the docstring
advertises it for ('list of start indices of the text in the section')
-- therefore get a last_paragraph offset that points 3 characters early,
into the trailing separator of the previous section.

Adding len('\n \n') accounts for the separator that split() removes, so
the new offset lands on the actual first character of the last paragraph
and matches the semantics of the indices already produced for the other
sections (all of which are set to s.end() of the section header match,
i.e. past the separator).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant