Fix section_text last_paragraph start index off by separator length by Chessing234 · Pull Request #1996 · MIT-LCP/mimic-code

Chessing234 · 2026-04-20T05:26:32Z

Bug

section_parser.section_text returns a section_idx list documented as "list of start indices of the text in the section" — offsets back into the original report string. When the report has no impression/findings section and the final section contains '\n \n', the function synthesises a 'last_paragraph' section by splitting on that separator:

if '\n \n' in sections[-1]:
    sections.append('\n \n'.join(sections[-1].split('\n \n')[1:]))
    sections[-2] = sections[-2].split('\n \n')[0]
    section_names.append('last_paragraph')
    section_idx.append(section_idx[-1] + len(sections[-2]))

The computed offset for the new 'last_paragraph' section is section_idx[-1] + len(sections[-2]), but sections[-2] at that point holds only the first part of the split — the original section up to (but not including) the '\n \n' separator. str.split removes the separator, so the second part (which becomes the new 'last_paragraph') actually starts len('\n \n') (3) characters later in the original text.

Root cause

The offset math omits the length of the separator that split() consumed. For every other entry in section_idx, the value is the end of the section-header regex match (s.end()) — i.e. already past the :\s separator — so using raw offsets relative to the original text is the invariant callers rely on. This branch breaks that invariant.

Why the fix is correct

Adding len('\n \n') puts the offset at the first character of the last paragraph in the original text, matching what text[section_idx[i]:] gives for every other produced index:

-            section_idx.append(section_idx[-1] + len(sections[-2]))
+            section_idx.append(section_idx[-1] + len(sections[-2]) + len('\n \n'))

sections[-2] is the first part (unchanged by this commit).
section_idx[-1] is still the original section's start index at the point of append (not yet mutated).
len('\n \n') is the separator length that str.split('\n \n') drops.

No other behavioural change: the returned sections list is identical, section_names is identical, only the previously-incorrect section_idx[-1] for the last_paragraph entry moves forward by 3 to the correct position.

When no 'impression'/'findings' section is present and the final section contains '\n \n', section_text splits the final section and appends a synthetic 'last_paragraph' section: if '\n \n' in sections[-1]: sections.append('\n \n'.join(sections[-1].split('\n \n')[1:])) sections[-2] = sections[-2].split('\n \n')[0] section_names.append('last_paragraph') section_idx.append(section_idx[-1] + len(sections[-2])) The new entry in section_idx is computed from the kept first part only, but the second part actually starts after that first part *plus* the three-character '\n \n' separator. Callers that use section_idx to slice back into the original text -- which is what the docstring advertises it for ('list of start indices of the text in the section') -- therefore get a last_paragraph offset that points 3 characters early, into the trailing separator of the previous section. Adding len('\n \n') accounts for the separator that split() removes, so the new offset lands on the actual first character of the last paragraph and matches the semantics of the indices already produced for the other sections (all of which are set to s.end() of the section header match, i.e. past the separator).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix section_text last_paragraph start index off by separator length#1996

Fix section_text last_paragraph start index off by separator length#1996
Chessing234 wants to merge 1 commit intoMIT-LCP:mainfrom
Chessing234:fix/section-parser-last-paragraph-idx-missing-separator

Chessing234 commented Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Chessing234 commented Apr 20, 2026

Bug

Root cause

Why the fix is correct

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant