fix/index: preserve skipped-file category through shard paths #1073
Merged
ShardBuilder.Add still determined language before file category for callers that bypass Builder.Add. In that path we replace skipped content with the not-indexed marker, so doing language first left category detection operating on synthetic content instead of the original file and missed the cheaper skip-aware language path discussed in review.

Determine the file category before rewriting skipped content, then infer language afterward so direct ShardBuilder callers follow the same behavior as Builder.Add and keep content-aware categorization for skipped documents.

Test Plan: go test ./index -run 'TestShardBuilderAddDeterminesCategoryBeforeReplacingSkippedContent|TestDetermineLanguageIfUnknown|TestFileRank|TestDetermineFileCategory'

Amp-Thread-ID: https://ampcode.com/threads/T-019d7686-51be-707a-a017-6ddd291e18e3
Co-authored-by: Amp <amp@ampcode.com>

Merged shards reconstruct documents through ShardBuilder, so relying on category inference can discard metadata that the original indexer already stored. This is especially fragile for skipped generated files where the content used for categorization may not survive reconstruction.

Carry the stored category into the rebuilt document so merging remains metadata-preserving.

Amp-Thread-ID: https://ampcode.com/threads/T-019ed48e-abd7-73cc-877e-1866db891594
Co-authored-by: Amp <amp@ampcode.com>
stefanhengl
approved these changes
Jun 17, 2026
burmudar
approved these changes
Jun 17, 2026
    }
    // Preserve the original content for category detection in callers that
    // bypass Builder.Add and pass skipped documents directly.
    DetermineFileCategory(&doc)
Contributor
Minor, but why don't we do the bytes.IndexByte check above inside DetermineFileCategory? Would that make it expensive?
Member
Author
For some callers doc.Content is empty, so we rely on doc.SkipReason being set in those cases. Also, DetermineFileCategory only mutates doc.Category, not doc.SkipReason.
But I think you might be on to something around making this easier to understand. It isn't crystal clear in my head why things are done exactly as they are.
ShardBuilder.Add still determined language before file category for callers that bypass Builder.Add. In that path skipped content can be replaced with the not-indexed marker, so doing language first leaves category detection operating on synthetic content instead of the original file and misses the cheaper skip-aware language path.
This changes ShardBuilder.Add to determine the file category before rewriting skipped content, then infer language afterward so direct ShardBuilder callers follow the same behavior as Builder.Add and keep content-aware categorization for skipped documents.
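The ordering described above can be illustrated with a minimal, self-contained sketch. It is not the actual ShardBuilder.Add implementation: the `Document` struct, the `notIndexedMarker` value, the string-valued categories, and the toy `determineCategory` / `determineLanguage` helpers are all hypothetical stand-ins chosen for illustration. Only the order of the steps is taken from this PR.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// Document is a hypothetical stand-in for the indexer's document type.
type Document struct {
	Name       string
	Content    []byte
	SkipReason string // empty means the file is indexed normally
	Category   string
	Language   string
}

// notIndexedMarker is an assumed placeholder, not the real marker text.
const notIndexedMarker = "NOT-INDEXED: "

// determineCategory is a toy classifier that inspects the content. It only
// works while d.Content still holds the original file bytes.
func determineCategory(d *Document) {
	if d.Category != "" {
		return // already known, nothing to infer
	}
	if bytes.Contains(d.Content, []byte("Code generated")) {
		d.Category = "generated"
		return
	}
	d.Category = "source"
}

// determineLanguage is a toy detector that uses only the file name, which
// stands in for a cheap path that needs no content for skipped files.
func determineLanguage(d *Document) {
	if d.Language == "" && strings.HasSuffix(d.Name, ".go") {
		d.Language = "Go"
	}
}

// add mirrors the order of operations described in the PR.
func add(d Document) Document {
	// Callers that bypass the higher-level builder may not have set a skip
	// reason yet; a NUL byte is treated as binary content in this sketch.
	if d.SkipReason == "" && bytes.IndexByte(d.Content, 0) >= 0 {
		d.SkipReason = "binary"
	}
	// 1. Categorize while the original content is still available.
	determineCategory(&d)
	// 2. Only then replace skipped content with the synthetic marker.
	if d.SkipReason != "" {
		d.Content = []byte(notIndexedMarker + d.SkipReason)
	}
	// 3. Infer language last.
	determineLanguage(&d)
	return d
}

func main() {
	d := add(Document{
		Name:       "api.pb.go",
		Content:    []byte("// Code generated by protoc. DO NOT EDIT.\n"),
		SkipReason: "too large",
	})
	fmt.Println(d.Category, d.Language, string(d.Content))
}
```

If steps 1 and 2 were swapped, the toy classifier would see only the marker text and fall back to "source", which is the failure mode this change avoids.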
Shard merging also reconstructs documents through ShardBuilder, so this PR carries the category already stored in the source shard into the rebuilt document. That keeps merge metadata-preserving instead of forcing category inference to rediscover information the original indexer already knew.
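A rough sketch of why carrying the stored category matters during a merge follows. The `storedDoc` and `Document` types, the `rebuild` function, and its `carryCategory` switch are hypothetical and exist only to contrast the two behaviors. The sketch also assumes that category inference leaves an already-set category alone, which this PR does not state explicitly.

```go
package main

import (
	"bytes"
	"fmt"
)

// storedDoc is a hypothetical view of a document as read back from a shard.
// For a skipped file the content is only the synthetic marker.
type storedDoc struct {
	Name     string
	Content  []byte
	Category string // category recorded by the original indexer
}

// Document is a hypothetical stand-in for the document handed to the builder.
type Document struct {
	Name     string
	Content  []byte
	Category string
}

// inferCategory is a toy classifier. Assumption: it does not overwrite a
// category that is already set.
func inferCategory(d *Document) {
	if d.Category != "" {
		return
	}
	if bytes.Contains(d.Content, []byte("Code generated")) {
		d.Category = "generated"
		return
	}
	d.Category = "source"
}

// rebuild reconstructs a document from a shard entry. With carryCategory
// set, the stored category is copied across before inference runs.
func rebuild(s storedDoc, carryCategory bool) Document {
	d := Document{Name: s.Name, Content: s.Content}
	if carryCategory {
		d.Category = s.Category
	}
	inferCategory(&d)
	return d
}

func main() {
	s := storedDoc{
		Name:     "api.pb.go",
		Content:  []byte("NOT-INDEXED: too large"), // original bytes are gone
		Category: "generated",
	}
	fmt.Println("carried: ", rebuild(s, true).Category)
	fmt.Println("inferred:", rebuild(s, false).Category)
}
```

In this sketch the inferred result differs from the stored one because the marker content no longer contains the evidence the classifier needs; copying the stored value sidesteps re-inference entirely.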