Propagate element metadata to chunks in MEDI chunkers#7516
Propagate element metadata to chunks in MEDI chunkers#7516luisquintanilla wants to merge 3 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses MEDI ingestion metadata loss by propagating IngestionDocumentElement.Metadata into IngestionChunk<string>.Metadata across the built-in chunkers, so downstream components (e.g., VectorStoreWriter) can persist element-derived metadata on produced chunks.
Changes:
- Added element-metadata accumulation/application logic to
ElementsChunker(affectingSectionChunker,HeaderChunker, andSemanticSimilarityChunker). - Added similar metadata accumulation/application to
DocumentTokenChunkerduring element iteration and chunk finalization. - Introduced a new test suite validating metadata propagation behavior for several chunkers and scenarios.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/ChunkerMetadataPropagationTests.cs | Adds tests asserting element metadata is propagated to chunk metadata under various conditions. |
| src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/ElementsChunker.cs | Accumulates element metadata while building chunks and applies it when committing chunks. |
| src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs | Accumulates element metadata during token chunking and applies it when finalizing chunks. |
| AccumulateMetadata(element, ref accumulatedMetadata); | ||
|
|
||
| int elementTokenCount = CountTokens(semanticContent.AsSpan()); | ||
| if (elementTokenCount + totalTokenCount <= _maxTokensPerChunk) | ||
| { |
There was a problem hiding this comment.
Accepted and fixed.
Good catch — this was a real timing bug. In the original code, AccumulateMetadata ran unconditionally before the branch logic, so when a table pre-commit or overflow triggered FinalizeCurrentChunk, the element's metadata attached to the previous chunk instead of the one receiving the content.
Fix (commit 8bfe9bd): Moved AccumulateMetadata into each of the 3 branches with flag-based deferred accumulation:
- Fits branch (L72-78): Accumulate immediately before append — straightforward, element stays in current chunk.
- Table branch (L79-163):
tableMetadataAccumulatedflag — defer until the first actualAppendNewLineAndSpancall to_currentChunk. This handles pre-commit (header doesn't fit),rowIndex==1edge case (first data row doesn't fit), and mid-table row splits. - Non-table overflow (L164-213):
elementMetadataAccumulatedflag — accumulate only whenindex > 0(content actually appended to a chunk).
| { | ||
| continue; | ||
| } | ||
|
|
||
| AccumulateMetadata(element, ref accumulatedMetadata); | ||
|
|
||
| int contentToProcessTokenCount = _tokenizer.CountTokens(elementContent!, considerNormalization: false); | ||
| ReadOnlyMemory<char> contentToProcess = elementContent.AsMemory(); | ||
| while (stringBuilderTokenCount + contentToProcessTokenCount >= _maxTokensPerChunk) | ||
| { | ||
| int index = _tokenizer.GetIndexByTokenCount( | ||
| text: contentToProcess.Span, | ||
| maxTokenCount: _maxTokensPerChunk - stringBuilderTokenCount, | ||
| out string? _, | ||
| out int _, | ||
| considerNormalization: false); | ||
|
|
||
| unsafe | ||
| { | ||
| fixed (char* ptr = &MemoryMarshal.GetReference(contentToProcess.Span)) | ||
| { | ||
| _ = stringBuilder.Append(ptr, index); | ||
| } | ||
| } | ||
| yield return FinalizeChunk(); | ||
| yield return FinalizeChunk(ref accumulatedMetadata); | ||
|
|
There was a problem hiding this comment.
Accepted and fixed.
Correct — in the original code, AccumulateMetadata ran at line 86 before the while loop. If index == 0 and the buffer was already full, the while loop would call FinalizeCurrentChunk first, attaching the element's metadata to the wrong (previous) chunk.
Fix (commit 8bfe9bd): Introduced elementMetadataAccumulated flag:
- Inside the while loop: accumulate only when
index > 0(meaning content has actually been appended to a chunk from this element) - After the while loop: if the flag is still false (remaining content goes to buffer), accumulate then
This ensures metadata always follows the content, regardless of whether a chunk boundary is crossed at the start of the element.
| private static IngestionChunker<string> CreateDocumentTokenChunker(int maxTokensPerChunk = 2_000) | ||
| { | ||
| var tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o"); | ||
| return new DocumentTokenChunker(new(tokenizer) { MaxTokensPerChunk = maxTokensPerChunk, OverlapTokens = 0 }); | ||
| } | ||
|
|
||
| [Fact] |
There was a problem hiding this comment.
Partially accepted.
Agree that test coverage should be broader. Here's what was added:
Boundary/overlap tests (commit 8bfe9bd — 6 tests):
ElementsChunker_TablePreCommit_MetadataGoesToCorrectChunk— table element triggers pre-commit; metadata follows content to new chunkElementsChunker_ExactFill_MetadataStaysOnCurrentChunk— element exactly fills remaining capacity; metadata stays on current chunkDocumentTokenChunker_OverlapTokens_MetadataOnlyOnOriginalChunks— overlap content doesn't duplicate metadataDocumentTokenChunker_ExactFill_MetadataAttachesToCorrectChunk— boundary precisionElementsChunker_TableSplit_MetadataGoesToFirstTableChunk— large table split across chunksElementsChunker_NonTableOverflow_MetadataGoesToNewChunk— non-table overflow triggers new chunk
SemanticSimilarityChunker tests (commit bd31e18 — 2 tests):
SemanticSimilarityChunker_SingleElementWithMetadata_PropagatesMetadata— basic metadata flowSemanticSimilarityChunker_MultipleElementsDifferentKeys_AllKeysAppear— per-element metadata preserved across chunks
All 22 metadata propagation tests pass across net8.0, net9.0, net10.0, and net462.
|
MEDI is an acronym regularly used to reference Microsoft.Extensions.DependencyInjection - using it for this library is another reason why these AI technology packages should not be in the root Microsoft.Extensions namespace. |
|
@adamsitnik I confirmed this wasn't a design choice. Can I please get a review before merging. Thanks! |
🎉 Good job! The coverage increased 🎉
Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=1454486&view=codecoverage-tab |
6c9bd82 to
8bfe9bd
Compare
🎉 Good job! The coverage increased 🎉
Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=1454834&view=codecoverage-tab |
Fix #7465: All four IngestionChunker implementations (SectionChunker, HeaderChunker, SemanticSimilarityChunker, DocumentTokenChunker) now propagate IngestionDocumentElement.Metadata to IngestionChunk.Metadata. Design decisions: - First-wins merge strategy (TryAdd) for conflicting keys - Null metadata values skipped (element allows object?, chunk requires object) - Split elements: metadata goes to the first chunk only - Lazy allocation: dictionary only created when elements have metadata ElementsChunker (fixes SectionChunker, HeaderChunker, SemanticSimilarityChunker): - Added AccumulateMetadata/ApplyMetadata static helpers - Accumulates metadata as elements are processed - Applies to chunk on commit, then clears accumulator DocumentTokenChunker: - Added AccumulateMetadata static helper - Accumulates metadata during element iteration - Applies in FinalizeChunk, then clears accumulator Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix metadata accumulation timing bugs in ElementsChunker and DocumentTokenChunker where AccumulateMetadata was called before determining which chunk the element's content contributes to. When a Commit/FinalizeChunk happens before the new element adds content (table pre-commit, non-table overflow, exact-fill boundary), the metadata was incorrectly applied to the previous chunk. ElementsChunker fixes: - Branch 1 (fits): accumulate right before appending - Branch 2 (table): use flag, accumulate before first table content append to _currentChunk, after any pre-commit or row-level commit - Branch 3 (non-table too big): use flag, accumulate when index > 0 (first content contribution in the while loop) DocumentTokenChunker fixes: - Use flag to defer accumulation until first content contribution - In while loop: accumulate only when index > 0 - After while loop: accumulate if not yet done (element fits entirely) New boundary tests (6 tests): - Previous element fills chunk, next element metadata on new chunk - Non-table element too large, metadata on correct chunks - Table pre-commit: table metadata not on pre-committed chunk - DocumentTokenChunker boundary with large filler element - DocumentTokenChunker with overlap enabled - Table split across chunks: first chunk gets metadata Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add 2 tests covering SemanticSimilarityChunker metadata flow: - Single element with metadata propagates to chunk - Multiple elements with different keys each carry metadata Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bd31e18 to
8bac32d
Compare
🎉 Good job! The coverage increased 🎉
Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=1454942&view=codecoverage-tab |
adamsitnik
left a comment
There was a problem hiding this comment.
The chunkers never read element metadata, so any metadata attached to document elements (e.g., page numbers, source URIs, element types) was silently dropped during chunking. This meant VectorStoreWriter - which already correctly persists chunk metadata - had nothing to write.
This was done on purpose, as I expect that this information may be required during chunking/processing, but it's not always desired to be persisted in the vector store.
Sample metadata produced when parsing (and not persisted): page numbers/page size
Sample metadata produced when processing (and persisted): sentiment/categories/keyword/summary
I am not against changing this approach, but I would like to know the rationale, especially since persisting metadata now will require an additional setup on the user side (#7396). Let's chat offline about this.
adamsitnik
left a comment
There was a problem hiding this comment.
One more thing to consider: how metadata should be aggregated when we are using multiple elements to build a single chunk. For example: 3 paragraphs, each from a different page (1, 2, 3). How should this be expressed? Most likely as "page numbers: 1-3". But in order to perform such "smart" aggregation we would need to recognize certain metadata keys and this would require some kind of standardization (right now we don't enforce the readers to use "PageNumber", "page" or any other name)
Summary
Fixes #7465
All four built-in
IngestionChunkerimplementations (SectionChunker,HeaderChunker,SemanticSimilarityChunker,DocumentTokenChunker) now propagateIngestionDocumentElement.MetadatatoIngestionChunk<T>.Metadata.Problem
The chunkers never read element metadata, so any metadata attached to document elements (e.g., page numbers, source URIs, element types) was silently dropped during chunking. This meant
VectorStoreWriter- which already correctly persists chunk metadata - had nothing to write.Solution
ElementsChunker (internal, fixes 3 public chunkers)
AccumulateMetadata/ApplyMetadatastatic helpersDocumentTokenChunker (independent chunker)
AccumulateMetadatastatic helperFinalizeChunk, then clears the accumulatorDesign Decisions
Testing
ChunkerMetadataPropagationTestscovering all scenariosMicrosoft Reviewers: Open in CodeFlow