
feat(stdlib): add ChunkingStrategy ABC and built-in chunkers #923

Open
planetf1 wants to merge 5 commits into generative-computing:main from planetf1:feat/899-chunking-strategy


Adds mellea/stdlib/chunking.py with chunking infrastructure for streaming validation.

Changes

  • ChunkingStrategy ABC with split(accumulated_text: str) -> list[str]
  • SentenceChunker — splits on ., !, ? boundaries
  • WordChunker — splits on whitespace
  • ParagraphChunker — splits on \n\n
  • Unit tests in test/stdlib/test_chunking.py (27 tests, all passing)
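
As a rough illustration of the interface listed above, the following is a minimal sketch, not the code from this PR. The class and method names come from the description; the hold-back rule in `WordChunker` (an unterminated last word is withheld) is an assumption based on the commit message's note about trailing fragments.

```python
from abc import ABC, abstractmethod


class ChunkingStrategy(ABC):
    """Illustrative stand-in for the ABC listed above (not the PR's code)."""

    @abstractmethod
    def split(self, accumulated_text: str) -> list[str]:
        """Return the complete chunks found in accumulated_text."""


class WordChunker(ChunkingStrategy):
    """Whitespace splitter; assumes an unterminated last word is held back."""

    def split(self, accumulated_text: str) -> list[str]:
        words = accumulated_text.split()
        # If the text does not end in whitespace, the last word may still be
        # growing, so it is left out of the result.
        if words and not accumulated_text[-1].isspace():
            words.pop()
        return words


chunker = WordChunker()
partial = chunker.split("streaming valid")         # last word still arriving
complete = chunker.split("streaming validation ")  # trailing space closes it
```

Under these assumptions, `partial` contains only the first word, while `complete` contains both.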

Test plan

  • uv run pytest test/stdlib/test_chunking.py -v passes (27/27)
  • uv run ruff check clean
  • uv run mypy mellea/stdlib/chunking.py clean
  • Pre-commit hooks pass (ruff, mypy, codespell)

Closes #899
Part of #891

Adds mellea/stdlib/chunking.py with ChunkingStrategy ABC and three
built-in implementations: SentenceChunker, WordChunker, ParagraphChunker.
split(accumulated_text) returns complete chunks, holding back trailing
fragments for the next call.

Closes generative-computing#899
Part of generative-computing#891

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
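
The hold-back behaviour described in this commit message can be sketched with a paragraph splitter that is called repeatedly as a stream grows. This is an assumption-laden illustration: `split_paragraphs` and the `_PARA_BOUNDARY` pattern below are hypothetical stand-ins, not the PR's `ParagraphChunker`.

```python
import re

# Hypothetical module-level pattern: a blank line separates paragraphs.
_PARA_BOUNDARY = re.compile(r"\n\n")


def split_paragraphs(accumulated_text: str) -> list[str]:
    """Return paragraphs followed by a boundary; hold back the rest."""
    parts = _PARA_BOUNDARY.split(accumulated_text)
    # The last element is whatever follows the final boundary (possibly ""),
    # so it is treated as an incomplete fragment and withheld.
    return [p for p in parts[:-1] if p]


# Each call receives the full text accumulated so far, not only the new delta.
accumulated = ""
seen = []
for delta in ["First para", "graph.\n\nSecond ", "paragraph.\n\nThird"]:
    accumulated += delta
    seen.append(split_paragraphs(accumulated))
```

After the last delta, `"Third"` is still withheld because no boundary has followed it yet.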
planetf1 requested a review from a team as a code owner April 24, 2026 08:58
planetf1 requested review from nrfulton and psschwei April 24, 2026 08:58
github-actions (bot) added the enhancement label Apr 24, 2026
Pyright flagged the pytest import as unaccessed; the file contains no pytest.* calls.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
planetf1 marked this pull request as draft April 24, 2026 09:21
…hunkingStrategy

- Fix SentenceChunker whitespace leak: lstrip() after match.end() so
  double-space / tab separators don't bleed into the next chunk as
  leading whitespace
- Add end-of-stream contract to ABC docstring (callers responsible for
  trailing fragment after stream terminates)
- Fix incorrect comment "end-of-string" → "whitespace"
- Compile _WHITESPACE / _PARA_BOUNDARY / _PARA_BOUNDARY_END at module
  level (consistent with _SENTENCE_BOUNDARY; avoids per-call recompile)
- Expand SentenceChunker char class to include right curly double/single
  quotes (U+201D / U+2019) for common LLM output patterns
- Document CRLF limitation on ParagraphChunker
- Re-export ChunkingStrategy + chunkers from mellea.stdlib.__init__
- Add __all__ to chunking.py
- Add tests: closing paren, double-space separator, tab separator,
  abbreviation edge case (known-bad split), WordChunker leading-whitespace

Assisted-by: Claude Code
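
The whitespace fix and the end-of-stream contract in this commit can be sketched as follows. This is a hedged illustration, not the PR's `SentenceChunker`: the pattern is a simplified assumption limited to ASCII closing characters, and `split_sentences` is a hypothetical helper that returns the held-back fragment alongside the chunks so the caller-side flush is visible (the PR's `split` returns only `list[str]`).

```python
import re

# Assumed pattern: sentence punctuation, optional ASCII closing quotes or
# parenthesis, then one whitespace character.
_SENTENCE_BOUNDARY = re.compile(r'[.!?]["\')]*\s')


def split_sentences(accumulated_text: str) -> tuple[list[str], str]:
    """Return (complete sentences, held-back trailing fragment)."""
    chunks: list[str] = []
    rest = accumulated_text
    while True:
        match = _SENTENCE_BOUNDARY.search(rest)
        if match is None:
            break
        # Keep the punctuation, drop the one whitespace the pattern matched.
        chunks.append(rest[: match.end() - 1])
        # lstrip() discards any further separator whitespace (double space,
        # tab) so it does not lead the next chunk.
        rest = rest[match.end():].lstrip()
    return chunks, rest


chunks, fragment = split_sentences("One.  Two!\tThree")
# End of stream: the caller decides what to do with the trailing fragment.
final_chunks = chunks + ([fragment] if fragment else [])

# The abbreviation case is split in the wrong place by this simple rule.
abbrev_chunks, _ = split_sentences("Dr. Smith arrived. ")
```

The abbreviation input shows why the commit labels that test a known-bad split: a period followed by whitespace is treated as a boundary regardless of context.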
…acy and curly-quote test

- Fix misleading comment on _SENTENCE_BOUNDARY: was "processed by re engine
  as \u escapes" but the file contained literal Unicode chars. Now uses
  chr(0x201d) + chr(0x2019) for Python 3.12 compatibility (U+2019 is treated
  as a string delimiter in single-quoted raw strings on 3.12).
- Add test_sentence_chunker_curly_quotes to verify U+201D/U+2019 matching.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
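
A minimal sketch of the `chr()` approach this commit describes, under stated assumptions: `_CLOSERS` and `first_boundary_end` are hypothetical names, and the pattern is a simplified guess at the shape of `_SENTENCE_BOUNDARY`, not the PR's actual regex.

```python
import re

# Closing characters that may follow sentence punctuation. The two curly
# quotes are built with chr() so this source stays pure ASCII.
_CLOSERS = "\"')" + chr(0x201D) + chr(0x2019)
_SENTENCE_BOUNDARY = re.compile("[.!?][" + re.escape(_CLOSERS) + r"]*\s")


def first_boundary_end(text: str) -> int:
    """Return the index just past the first sentence boundary, or -1."""
    match = _SENTENCE_BOUNDARY.search(text)
    return match.end() if match else -1


# A quoted sentence closed by U+201D, as often seen in LLM output.
curly = "\u201cStop.\u201d she said"
end = first_boundary_end(curly)
```

Here the boundary match covers the period, the closing curly quote, and the following space, so the quote stays attached to its sentence.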
…mputing#923)

- Simplify _SENTENCE_BOUNDARY regex to use \u escapes instead of chr()
  concatenation (cleaner, same semantics, Python 3.12-safe)
- Document that SentenceChunker discards inter-sentence whitespace via lstrip()
- Add test_chunking_strategy_is_abstract to document the extension-point contract

Assisted-by: Claude Code
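
The two points in this commit can be checked with a short sketch. It assumes nothing about the PR's code beyond the names in the message: the patterns below are reduced to the two curly quotes, the `ChunkingStrategy` class is a stand-in, and `is_abstract` is a hypothetical helper.

```python
import re
from abc import ABC, abstractmethod

# In a raw string, \u201d reaches the re module as literal backslash-u text;
# re itself resolves \uXXXX escapes in str patterns.
escaped = re.compile(r"[\u201d\u2019]")
concatenated = re.compile("[" + chr(0x201D) + chr(0x2019) + "]")


class ChunkingStrategy(ABC):
    """Stand-in ABC with a single abstract method, as in the PR description."""

    @abstractmethod
    def split(self, accumulated_text: str) -> list[str]:
        """Return the complete chunks found in accumulated_text."""


def is_abstract(cls: type) -> bool:
    """Report whether instantiating cls directly raises TypeError."""
    try:
        cls()
    except TypeError:
        return True
    return False


same_for_double = bool(escaped.fullmatch("\u201d")) == bool(concatenated.fullmatch("\u201d"))
same_for_single = bool(escaped.fullmatch("\u2019")) == bool(concatenated.fullmatch("\u2019"))
abstract = is_abstract(ChunkingStrategy)
```

Both pattern forms accept the same two characters, and instantiating the ABC directly fails, which is the extension-point contract the new test documents: subclasses must implement `split`.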
