Propagate element metadata to chunks in MEDI chunkers by luisquintanilla · Pull Request #7516 · dotnet/extensions

luisquintanilla · 2026-05-07T14:13:48Z

Summary

All four built-in IngestionChunker implementations (SectionChunker, HeaderChunker, SemanticSimilarityChunker, DocumentTokenChunker) now propagate IngestionDocumentElement.Metadata to IngestionChunk<T>.Metadata.

Problem

The chunkers never read element metadata, so any metadata attached to document elements (e.g., page numbers, source URIs, element types) was silently dropped during chunking. This meant VectorStoreWriter - which already correctly persists chunk metadata - had nothing to write.

Solution

ElementsChunker (internal, fixes 3 public chunkers)

Added AccumulateMetadata / ApplyMetadata static helpers
As elements are processed, their metadata is accumulated into a lazily-allocated dictionary
When a chunk is committed, accumulated metadata is applied to the chunk and the accumulator is cleared

DocumentTokenChunker (independent chunker)

Added AccumulateMetadata static helper
Accumulates metadata during element iteration
Applies metadata in FinalizeChunk, then clears the accumulator

Design Decisions

Decision	Rationale
First-wins merge (TryAdd)	When multiple elements share a key, the first element's value prevails - predictable and deterministic
Null values skipped	Element metadata allows object? but chunk metadata requires object - nulls are meaningless for downstream consumers
Split elements -> first chunk only	When an element is split across chunks, metadata goes to the first chunk. This avoids duplication and matches the semantic intent
Lazy allocation	Dictionary only allocated when the first element with metadata is encountered
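
The design decisions above can be illustrated with a minimal sketch. This is not the merged code: the helper names come from the PR description, but the signatures, the `MetadataSketch` class, and the choice to reset the accumulator to null (rather than call `Clear()`) are assumptions made for illustration. It also assumes a target framework where `Dictionary<TKey, TValue>.TryAdd` is available.

```csharp
#nullable enable
using System.Collections.Generic;

// Hypothetical stand-in for the internal helpers named in the PR description.
public static class MetadataSketch
{
    // Merge one element's metadata into the accumulator for the chunk being built.
    public static void AccumulateMetadata(
        IReadOnlyDictionary<string, object?>? elementMetadata,
        ref Dictionary<string, object>? accumulated)
    {
        if (elementMetadata is null || elementMetadata.Count == 0)
        {
            return; // nothing to merge, so nothing is allocated
        }

        foreach (KeyValuePair<string, object?> pair in elementMetadata)
        {
            // Null values are skipped: chunk metadata holds non-null objects only.
            if (pair.Value is object value)
            {
                accumulated ??= new Dictionary<string, object>(); // lazy allocation
                accumulated.TryAdd(pair.Key, value);              // first-wins merge
            }
        }
    }

    // Copy the accumulated metadata onto a chunk, then reset the accumulator.
    public static void ApplyMetadata(
        IDictionary<string, object> chunkMetadata,
        ref Dictionary<string, object>? accumulated)
    {
        if (accumulated is null)
        {
            return;
        }

        foreach (KeyValuePair<string, object> pair in accumulated)
        {
            chunkMetadata[pair.Key] = pair.Value;
        }

        accumulated = null; // the next chunk starts with an empty accumulator
    }
}
```

With two elements that both carry a "page" key, the first element's value is the one that reaches the chunk, and a key whose value is null never appears.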

Testing

14 new tests in ChunkerMetadataPropagationTests covering all scenarios
All 128 existing DataIngestion tests pass with no regressions
Verified on net8.0 and net9.0 (builds clean on all 5 TFMs)

Copilot

Pull request overview

This PR addresses MEDI ingestion metadata loss by propagating IngestionDocumentElement.Metadata into IngestionChunk<string>.Metadata across the built-in chunkers, so downstream components (e.g., VectorStoreWriter) can persist element-derived metadata on produced chunks.

Changes:

Added element-metadata accumulation/application logic to ElementsChunker (affecting SectionChunker, HeaderChunker, and SemanticSimilarityChunker).
Added similar metadata accumulation/application to DocumentTokenChunker during element iteration and chunk finalization.
Introduced a new test suite validating metadata propagation behavior for several chunkers and scenarios.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/ChunkerMetadataPropagationTests.cs	Adds tests asserting element metadata is propagated to chunk metadata under various conditions.
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/ElementsChunker.cs	Accumulates element metadata while building chunks and applies it when committing chunks.
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs	Accumulates element metadata during token chunking and applies it when finalizing chunks.

luisquintanilla · 2026-06-08T18:43:01Z

+            AccumulateMetadata(element, ref accumulatedMetadata);
+
            int elementTokenCount = CountTokens(semanticContent.AsSpan());
            if (elementTokenCount + totalTokenCount <= _maxTokensPerChunk)
            {


Accepted and fixed.

Good catch — this was a real timing bug. In the original code, AccumulateMetadata ran unconditionally before the branch logic, so when a table pre-commit or overflow triggered FinalizeCurrentChunk, the element's metadata attached to the previous chunk instead of the one receiving the content.

Fix (commit 8bfe9bd): Moved AccumulateMetadata into each of the 3 branches with flag-based deferred accumulation:

Fits branch (L72-78): Accumulate immediately before append — straightforward, element stays in current chunk.

Table branch (L79-163): tableMetadataAccumulated flag — defer until the first actual AppendNewLineAndSpan call to _currentChunk. This handles pre-commit (header doesn't fit), rowIndex==1 edge case (first data row doesn't fit), and mid-table row splits.

Non-table overflow (L164-213): elementMetadataAccumulated flag — accumulate only when index > 0 (content actually appended to a chunk).

luisquintanilla · 2026-06-08T18:43:14Z

                {
                    continue;
                }

+                AccumulateMetadata(element, ref accumulatedMetadata);
+
                int contentToProcessTokenCount = _tokenizer.CountTokens(elementContent!, considerNormalization: false);
                ReadOnlyMemory<char> contentToProcess = elementContent.AsMemory();
                while (stringBuilderTokenCount + contentToProcessTokenCount >= _maxTokensPerChunk)
                {
                    int index = _tokenizer.GetIndexByTokenCount(
                        text: contentToProcess.Span,
                        maxTokenCount: _maxTokensPerChunk - stringBuilderTokenCount,
                        out string? _,
                        out int _,
                        considerNormalization: false);

                    unsafe
                    {
                        fixed (char* ptr = &MemoryMarshal.GetReference(contentToProcess.Span))
                        {
                            _ = stringBuilder.Append(ptr, index);
                        }
                    }
-                    yield return FinalizeChunk();
+                    yield return FinalizeChunk(ref accumulatedMetadata);



Accepted and fixed.

Correct. In the original code, AccumulateMetadata ran at line 86, before the while loop. If index == 0 and the buffer was already full, the while loop would call FinalizeChunk first, attaching the element's metadata to the wrong (previous) chunk.

Fix (commit 8bfe9bd): Introduced elementMetadataAccumulated flag:

Inside the while loop: accumulate only when index > 0 (meaning content has actually been appended to a chunk from this element)

After the while loop: if the flag is still false (remaining content goes to buffer), accumulate then

This ensures metadata always follows the content, regardless of whether a chunk boundary is crossed at the start of the element.
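
The deferred accumulation described above can be modeled in a few lines. This is a simplified sketch, not the DocumentTokenChunker code: character counts stand in for token counts, the `DeferredAccumulationSketch` class, `Split`, `Chunk`, and the single key/value pair per element are all hypothetical, and the loop uses a strict `>` comparison so that an exactly full buffer (the case where nothing from the new element fits) can occur.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Text;

public static class DeferredAccumulationSketch
{
    public sealed record Chunk(string Content, IReadOnlyDictionary<string, object> Metadata);

    public static List<Chunk> Split(
        IEnumerable<(string Text, string Key, object Value)> elements, int maxChars)
    {
        var chunks = new List<Chunk>();
        var buffer = new StringBuilder();
        var pending = new Dictionary<string, object>();

        // Emit the buffered content with the metadata accumulated so far, then reset both.
        void FinalizeChunk()
        {
            chunks.Add(new Chunk(buffer.ToString(), new Dictionary<string, object>(pending)));
            buffer.Clear();
            pending.Clear();
        }

        foreach (var (text, key, value) in elements)
        {
            int offset = 0;
            bool metadataAccumulated = false;

            while (buffer.Length + (text.Length - offset) > maxChars)
            {
                int take = maxChars - buffer.Length; // 0 when the buffer is already full
                buffer.Append(text, offset, take);
                offset += take;

                // Accumulate only once this element has contributed content to the chunk.
                if (take > 0 && !metadataAccumulated)
                {
                    pending.TryAdd(key, value);
                    metadataAccumulated = true;
                }

                FinalizeChunk();
            }

            // The remainder stays in the buffer; accumulate here if not done yet.
            if (offset < text.Length)
            {
                buffer.Append(text, offset, text.Length - offset);
                if (!metadataAccumulated)
                {
                    pending.TryAdd(key, value);
                }
            }
        }

        if (buffer.Length > 0)
        {
            FinalizeChunk();
        }

        return chunks;
    }
}
```

For the inputs ("aaaa", "page", 1) and ("bbbbbb", "source", "b") with maxChars = 4, the first chunk carries only "page", the second carries only "source", and the trailing remainder chunk carries no metadata, which matches the "first chunk only" rule for split elements.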

luisquintanilla · 2026-06-08T18:43:39Z

+    private static IngestionChunker<string> CreateDocumentTokenChunker(int maxTokensPerChunk = 2_000)
+    {
+        var tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");
+        return new DocumentTokenChunker(new(tokenizer) { MaxTokensPerChunk = maxTokensPerChunk, OverlapTokens = 0 });
+    }
+
+    [Fact]


Partially accepted.

Agree that test coverage should be broader. Here's what was added:

Boundary/overlap tests (commit 8bfe9bd — 6 tests):

ElementsChunker_TablePreCommit_MetadataGoesToCorrectChunk — table element triggers pre-commit; metadata follows content to new chunk

ElementsChunker_ExactFill_MetadataStaysOnCurrentChunk — element exactly fills remaining capacity; metadata stays on current chunk

DocumentTokenChunker_OverlapTokens_MetadataOnlyOnOriginalChunks — overlap content doesn't duplicate metadata

DocumentTokenChunker_ExactFill_MetadataAttachesToCorrectChunk — boundary precision

ElementsChunker_TableSplit_MetadataGoesToFirstTableChunk — large table split across chunks

ElementsChunker_NonTableOverflow_MetadataGoesToNewChunk — non-table overflow triggers new chunk

SemanticSimilarityChunker tests (commit bd31e18 — 2 tests):

SemanticSimilarityChunker_SingleElementWithMetadata_PropagatesMetadata — basic metadata flow

SemanticSimilarityChunker_MultipleElementsDifferentKeys_AllKeysAppear — per-element metadata preserved across chunks

All 22 metadata propagation tests pass across net8.0, net9.0, net10.0, and net462.

CZEMacLeod · 2026-05-08T05:59:02Z

MEDI is an acronym regularly used to reference Microsoft.Extensions.DependencyInjection - using it for this library is another reason why these AI technology packages should not be in the root Microsoft.Extensions namespace.

luisquintanilla · 2026-06-08T02:43:26Z

@adamsitnik I confirmed this wasn't a design choice. Can I please get a review before merging? Thanks!

dotnet-comment-bot · 2026-06-08T16:59:46Z

‼️ Found issues ‼️

Project	Coverage Type	Expected	Actual
Microsoft.Extensions.Diagnostics.Testing	Line	99	98.65 🔻
Microsoft.Extensions.Telemetry	Line	93	91.95 🔻
Microsoft.Extensions.AI	Line	89	88.51 🔻
Microsoft.Extensions.AI	Branch	89	88.53 🔻
Microsoft.Extensions.AI.OpenAI	Line	75	62.62 🔻
Microsoft.Extensions.AI.OpenAI	Branch	75	49.63 🔻
Microsoft.Extensions.DataIngestion.MarkItDown	Line	75	4.46 🔻
Microsoft.Extensions.DataIngestion.MarkItDown	Branch	75	0 🔻
Microsoft.Extensions.Diagnostics.ResourceMonitoring	Line	99	96.03 🔻
Microsoft.Extensions.Diagnostics.ResourceMonitoring	Branch	99	94.39 🔻
Microsoft.Extensions.Diagnostics.ResourceMonitoring.Kubernetes	Line	99	97.73 🔻
Microsoft.Extensions.ServiceDiscovery.Dns	Line	75	69.93 🔻
Microsoft.Extensions.ServiceDiscovery.Abstractions	Line	75	42.11 🔻
Microsoft.Extensions.ServiceDiscovery.Abstractions	Branch	75	42.86 🔻
Microsoft.Extensions.ServiceDiscovery	Line	75	67.51 🔻
Microsoft.Extensions.ServiceDiscovery	Branch	75	71.43 🔻
Microsoft.Extensions.ServiceDiscovery.Yarp	Line	75	73.85 🔻
Microsoft.Extensions.ServiceDiscovery.Yarp	Branch	75	70 🔻
Microsoft.Extensions.VectorData.Abstractions	Line	75	37.39 🔻
Microsoft.Extensions.VectorData.Abstractions	Branch	75	22.73 🔻

🎉 Good job! The coverage increased 🎉
Update MinCodeCoverage in the project files.

Project	Expected	Actual
Microsoft.Gen.BuildMetadata	97	100
Microsoft.Gen.MetadataExtractor	57	73
Microsoft.Gen.MetricsReports	67	69
Microsoft.Extensions.AI.Abstractions	82	85
Microsoft.Extensions.AI.Evaluation.NLP	0	78
Microsoft.Extensions.Caching.Hybrid	82	89
Microsoft.Extensions.DataIngestion.Abstractions	75	91
Microsoft.Extensions.DataIngestion	75	89
Microsoft.Extensions.DataIngestion.Markdig	75	90
Microsoft.Extensions.Http.Resilience	97	100

Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=1454486&view=codecoverage-tab

dotnet-comment-bot · 2026-06-08T19:43:54Z

‼️ Found issues ‼️

Project	Coverage Type	Expected	Actual
Microsoft.Extensions.Diagnostics.Testing	Line	99	98.65 🔻
Microsoft.Extensions.Telemetry	Line	93	91.95 🔻
Microsoft.Extensions.AI	Line	89	88.59 🔻
Microsoft.Extensions.AI	Branch	89	88.53 🔻
Microsoft.Extensions.AI.OpenAI	Line	75	62.62 🔻
Microsoft.Extensions.AI.OpenAI	Branch	75	49.63 🔻
Microsoft.Extensions.DataIngestion.MarkItDown	Line	75	4.46 🔻
Microsoft.Extensions.DataIngestion.MarkItDown	Branch	75	0 🔻
Microsoft.Extensions.Diagnostics.ResourceMonitoring	Line	99	96.03 🔻
Microsoft.Extensions.Diagnostics.ResourceMonitoring	Branch	99	94.39 🔻
Microsoft.Extensions.Diagnostics.ResourceMonitoring.Kubernetes	Line	99	97.73 🔻
Microsoft.Extensions.ServiceDiscovery.Dns	Line	75	68.32 🔻
Microsoft.Extensions.ServiceDiscovery.Abstractions	Line	75	42.11 🔻
Microsoft.Extensions.ServiceDiscovery.Abstractions	Branch	75	42.86 🔻
Microsoft.Extensions.ServiceDiscovery	Line	75	67.21 🔻
Microsoft.Extensions.ServiceDiscovery	Branch	75	71.43 🔻
Microsoft.Extensions.ServiceDiscovery.Yarp	Line	75	73.85 🔻
Microsoft.Extensions.ServiceDiscovery.Yarp	Branch	75	70 🔻
Microsoft.Extensions.VectorData.Abstractions	Line	75	37.39 🔻
Microsoft.Extensions.VectorData.Abstractions	Branch	75	22.73 🔻

🎉 Good job! The coverage increased 🎉
Update MinCodeCoverage in the project files.

Project	Expected	Actual
Microsoft.Gen.BuildMetadata	97	100
Microsoft.Gen.MetadataExtractor	57	73
Microsoft.Gen.MetricsReports	67	69
Microsoft.Extensions.AI.Abstractions	82	85
Microsoft.Extensions.AI.Evaluation.NLP	0	78
Microsoft.Extensions.Caching.Hybrid	82	84
Microsoft.Extensions.DataIngestion.Abstractions	75	91
Microsoft.Extensions.DataIngestion	75	89
Microsoft.Extensions.DataIngestion.Markdig	75	90
Microsoft.Extensions.Http.Resilience	97	100

Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=1454834&view=codecoverage-tab

Fix #7465: All four IngestionChunker implementations (SectionChunker, HeaderChunker, SemanticSimilarityChunker, DocumentTokenChunker) now propagate IngestionDocumentElement.Metadata to IngestionChunk.Metadata.

Design decisions:
- First-wins merge strategy (TryAdd) for conflicting keys
- Null metadata values skipped (element allows object?, chunk requires object)
- Split elements: metadata goes to the first chunk only
- Lazy allocation: dictionary only created when elements have metadata

ElementsChunker (fixes SectionChunker, HeaderChunker, SemanticSimilarityChunker):
- Added AccumulateMetadata/ApplyMetadata static helpers
- Accumulates metadata as elements are processed
- Applies to chunk on commit, then clears accumulator

DocumentTokenChunker:
- Added AccumulateMetadata static helper
- Accumulates metadata during element iteration
- Applies in FinalizeChunk, then clears accumulator

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Fix metadata accumulation timing bugs in ElementsChunker and DocumentTokenChunker where AccumulateMetadata was called before determining which chunk the element's content contributes to. When a Commit/FinalizeChunk happens before the new element adds content (table pre-commit, non-table overflow, exact-fill boundary), the metadata was incorrectly applied to the previous chunk.

ElementsChunker fixes:
- Branch 1 (fits): accumulate right before appending
- Branch 2 (table): use flag, accumulate before first table content append to _currentChunk, after any pre-commit or row-level commit
- Branch 3 (non-table too big): use flag, accumulate when index > 0 (first content contribution in the while loop)

DocumentTokenChunker fixes:
- Use flag to defer accumulation until first content contribution
- In while loop: accumulate only when index > 0
- After while loop: accumulate if not yet done (element fits entirely)

New boundary tests (6 tests):
- Previous element fills chunk, next element metadata on new chunk
- Non-table element too large, metadata on correct chunks
- Table pre-commit: table metadata not on pre-committed chunk
- DocumentTokenChunker boundary with large filler element
- DocumentTokenChunker with overlap enabled
- Table split across chunks: first chunk gets metadata

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add 2 tests covering SemanticSimilarityChunker metadata flow:
- Single element with metadata propagates to chunk
- Multiple elements with different keys each carry metadata

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

dotnet-comment-bot · 2026-06-08T21:16:42Z

‼️ Found issues ‼️

Project	Coverage Type	Expected	Actual
Microsoft.Extensions.Diagnostics.Testing	Line	99	98.65 🔻
Microsoft.Extensions.Telemetry	Line	93	91.95 🔻
Microsoft.Extensions.AI	Line	89	88.57 🔻
Microsoft.Extensions.AI	Branch	89	88.53 🔻
Microsoft.Extensions.AI.OpenAI	Line	75	62.62 🔻
Microsoft.Extensions.AI.OpenAI	Branch	75	49.63 🔻
Microsoft.Extensions.DataIngestion.MarkItDown	Line	75	4.46 🔻
Microsoft.Extensions.DataIngestion.MarkItDown	Branch	75	0 🔻
Microsoft.Extensions.Diagnostics.ResourceMonitoring	Line	99	96.03 🔻
Microsoft.Extensions.Diagnostics.ResourceMonitoring	Branch	99	94.39 🔻
Microsoft.Extensions.Diagnostics.ResourceMonitoring.Kubernetes	Line	99	97.73 🔻
Microsoft.Extensions.ServiceDiscovery.Dns	Line	75	69.93 🔻
Microsoft.Extensions.ServiceDiscovery.Abstractions	Line	75	42.11 🔻
Microsoft.Extensions.ServiceDiscovery.Abstractions	Branch	75	42.86 🔻
Microsoft.Extensions.ServiceDiscovery	Line	75	68.88 🔻
Microsoft.Extensions.ServiceDiscovery	Branch	75	71.43 🔻
Microsoft.Extensions.ServiceDiscovery.Yarp	Line	75	73.85 🔻
Microsoft.Extensions.ServiceDiscovery.Yarp	Branch	75	70 🔻
Microsoft.Extensions.VectorData.Abstractions	Line	75	37.39 🔻
Microsoft.Extensions.VectorData.Abstractions	Branch	75	22.73 🔻

🎉 Good job! The coverage increased 🎉
Update MinCodeCoverage in the project files.

Project	Expected	Actual
Microsoft.Gen.BuildMetadata	97	100
Microsoft.Gen.MetadataExtractor	57	73
Microsoft.Gen.MetricsReports	67	69
Microsoft.Extensions.AI.Abstractions	82	85
Microsoft.Extensions.AI.Evaluation.NLP	0	78
Microsoft.Extensions.Caching.Hybrid	82	89
Microsoft.Extensions.DataIngestion.Abstractions	75	91
Microsoft.Extensions.DataIngestion	75	89
Microsoft.Extensions.DataIngestion.Markdig	75	90
Microsoft.Extensions.Http.Resilience	97	100

Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=1454942&view=codecoverage-tab

adamsitnik

> The chunkers never read element metadata, so any metadata attached to document elements (e.g., page numbers, source URIs, element types) was silently dropped during chunking. This meant VectorStoreWriter - which already correctly persists chunk metadata - had nothing to write.

This was done on purpose, as I expect that this information may be required during chunking/processing, but it's not always desired to be persisted in the vector store.

Sample metadata produced when parsing (and not persisted): page numbers/page size
Sample metadata produced when processing (and persisted): sentiment/categories/keyword/summary

I am not against changing this approach, but I would like to know the rationale, especially since persisting metadata now will require an additional setup on the user side (#7396). Let's chat offline about this.

adamsitnik

One more thing to consider: how metadata should be aggregated when we are using multiple elements to build a single chunk. For example: 3 paragraphs, each from a different page (1, 2, 3). How should this be expressed? Most likely as "page numbers: 1-3". But to perform such "smart" aggregation we would need to recognize certain metadata keys, which would require some kind of standardization (right now we don't require readers to use "PageNumber", "page", or any other name).
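
To make the aggregation idea concrete, here is a rough sketch of what a key-aware merge could look like. Everything in it is hypothetical: the `PageRangeSketch` class, the `Aggregate` method, the "PageNumber" key, and the assumption that page values are stored as `int` are all illustrative, since the comment above notes that no such key is standardized today.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Linq;

public static class PageRangeSketch
{
    // Collapse the page numbers found on several elements into one display value.
    // Returns null when no element carries an int under the given key.
    public static string? Aggregate(
        IEnumerable<IReadOnlyDictionary<string, object?>> elementMetadata,
        string key = "PageNumber") // hypothetical key name
    {
        List<int> pages = elementMetadata
            .Select(metadata => metadata.TryGetValue(key, out object? value) ? value : null)
            .OfType<int>() // drops missing keys, nulls, and non-int values
            .ToList();

        if (pages.Count == 0)
        {
            return null;
        }

        int min = pages.Min();
        int max = pages.Max();

        // Note: this reports the span from lowest to highest page,
        // even if some pages in between contributed no element.
        return min == max ? min.ToString() : $"{min}-{max}";
    }
}
```

A merge like this only works when the chunker knows which keys are page numbers, which is the standardization question raised in the comment.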

Copilot AI review requested due to automatic review settings May 7, 2026 14:13

github-actions Bot added the area-telemetry label May 7, 2026

dotnet-policy-service Bot assigned luisquintanilla May 7, 2026

Copilot started reviewing on behalf of luisquintanilla May 7, 2026 14:15

Copilot AI reviewed May 7, 2026

luisquintanilla requested a review from adamsitnik June 8, 2026 02:43

luisquintanilla force-pushed the fix/chunker-metadata-propagation branch from 6c9bd82 to 8bfe9bd June 8, 2026 18:18

luisquintanilla and others added 3 commits June 8, 2026 16:23

luisquintanilla force-pushed the fix/chunker-metadata-propagation branch from bd31e18 to 8bac32d June 8, 2026 20:23

adamsitnik reviewed Jun 11, 2026

adamsitnik reviewed Jun 15, 2026

luisquintanilla wants to merge 3 commits into main from fix/chunker-metadata-propagation

5 participants
