fix/index: preserve skipped-file category through shard paths#1073

Merged
keegancsmith merged 2 commits into main from k/shardbuilder-category-before-language on Jun 17, 2026

keegancsmith (Member) commented Jun 17, 2026

ShardBuilder.Add still determined language before file category for callers that bypass Builder.Add. In that path skipped content can be replaced with the not-indexed marker, so doing language first leaves category detection operating on synthetic content instead of the original file and misses the cheaper skip-aware language path.

This changes ShardBuilder.Add to determine the file category before rewriting skipped content, then infer language afterward so direct ShardBuilder callers follow the same behavior as Builder.Add and keep content-aware categorization for skipped documents.
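The ordering described above can be sketched roughly as follows. This is a minimal illustration, not the actual index package code: the Document struct, the marker text, and the helpers determineCategory, determineLanguage, and add are all hypothetical stand-ins, and the toy classifiers only approximate what the real functions might inspect.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// Document is a hypothetical, pared-down stand-in for the indexer's document type.
type Document struct {
	Name       string
	Content    []byte
	SkipReason string
	Category   string
	Language   string
}

// notIndexedMarker is an assumed placeholder prefix; the real marker text may differ.
const notIndexedMarker = "NOT-INDEXED: "

// determineCategory is a toy classifier that inspects the name and the content.
// It only writes doc.Category.
func determineCategory(doc *Document) {
	switch {
	case bytes.Contains(doc.Content, []byte("Code generated")):
		doc.Category = "generated"
	case strings.HasSuffix(doc.Name, "_test.go"):
		doc.Category = "test"
	default:
		doc.Category = "default"
	}
}

// determineLanguage is a toy detector that uses only the file name,
// standing in for a cheaper path that does not need the content.
func determineLanguage(doc *Document) {
	if doc.Language != "" {
		return
	}
	if strings.HasSuffix(doc.Name, ".go") {
		doc.Language = "Go"
	} else {
		doc.Language = "unknown"
	}
}

// add mirrors the order described in the PR text:
// category first, then content rewrite for skipped documents, then language.
func add(doc Document) Document {
	determineCategory(&doc) // sees the original content
	if doc.SkipReason != "" {
		doc.Content = []byte(notIndexedMarker + doc.SkipReason)
	}
	determineLanguage(&doc)
	return doc
}

func main() {
	doc := add(Document{
		Name:       "api.pb.go",
		Content:    []byte("// Code generated by protoc. DO NOT EDIT.\n"),
		SkipReason: "file too large",
	})
	fmt.Println(doc.Category, doc.Language, string(doc.Content))
}
```

If the rewrite ran before determineCategory in this sketch, the classifier would only see the marker text and the document would fall into the default category.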

Shard merging also reconstructs documents through ShardBuilder, so this PR carries the category already stored in the source shard into the rebuilt document. That keeps merge metadata-preserving instead of forcing category inference to rediscover information the original indexer already knew.
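The merge behavior can be illustrated with a similar sketch. Everything here is hypothetical: storedDoc, Document, inferCategory, and rebuild are illustrative names, and the sketch assumes that category inference leaves an already-set category alone, which the PR text does not spell out.

```go
package main

import (
	"bytes"
	"fmt"
)

// storedDoc is a hypothetical view of a document as read back from a source shard.
type storedDoc struct {
	Name     string
	Content  []byte
	Category string
}

// Document is a hypothetical, pared-down stand-in for the builder's input type.
type Document struct {
	Name     string
	Content  []byte
	Category string
}

// inferCategory is a toy classifier. Assumption: it only runs when no
// category has been set yet.
func inferCategory(doc *Document) {
	if doc.Category != "" {
		return
	}
	if bytes.Contains(doc.Content, []byte("Code generated")) {
		doc.Category = "generated"
		return
	}
	doc.Category = "default"
}

// rebuild reconstructs a document for the merged shard. With carryCategory
// set, the stored category is copied over instead of being re-inferred.
func rebuild(src storedDoc, carryCategory bool) Document {
	doc := Document{Name: src.Name, Content: src.Content}
	if carryCategory {
		doc.Category = src.Category
	}
	inferCategory(&doc)
	return doc
}

func main() {
	// A skipped generated file: the shard holds only the marker text,
	// so the original "Code generated" header is no longer available.
	src := storedDoc{
		Name:     "api.pb.go",
		Content:  []byte("NOT-INDEXED: file too large"),
		Category: "generated",
	}
	fmt.Println("re-inferred:", rebuild(src, false).Category)
	fmt.Println("carried:    ", rebuild(src, true).Category)
}
```

In this toy setup, re-inference on the marker text yields the default category, while carrying the stored value keeps the document marked as generated.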

keegancsmith and others added 2 commits April 10, 2026 10:38
ShardBuilder.Add still determined language before file category for callers that bypass Builder.Add. In that path we replace skipped content with the not-indexed marker, so doing language first left category detection operating on synthetic content instead of the original file and missed the cheaper skip-aware language path discussed in review.

Determine the file category before rewriting skipped content, then infer language afterward so direct ShardBuilder callers follow the same behavior as Builder.Add and keep content-aware categorization for skipped documents.

Test Plan: go test ./index -run 'TestShardBuilderAddDeterminesCategoryBeforeReplacingSkippedContent|TestDetermineLanguageIfUnknown|TestFileRank|TestDetermineFileCategory'

Amp-Thread-ID: https://ampcode.com/threads/T-019d7686-51be-707a-a017-6ddd291e18e3
Co-authored-by: Amp <amp@ampcode.com>
Merged shards reconstruct documents through ShardBuilder, so relying on category inference can discard metadata that the original indexer already stored. This is especially fragile for skipped generated files where the content used for categorization may not survive reconstruction. Carry the stored category into the rebuilt document so merging remains metadata-preserving.

Amp-Thread-ID: https://ampcode.com/threads/T-019ed48e-abd7-73cc-877e-1866db891594
Co-authored-by: Amp <amp@ampcode.com>
@keegancsmith keegancsmith changed the title from "fix/index: preserve skipped-file categorization before language fallback" to "fix/index: preserve skipped-file category through shard paths" Jun 17, 2026
@keegancsmith keegancsmith marked this pull request as ready for review June 17, 2026 10:54
Comment thread index/shard_builder.go
}
// Preserve the original content for category detection in callers that
// bypass Builder.Add and pass skipped documents directly.
DetermineFileCategory(&doc)

Contributor

Minor, but why don't we do the bytes.IndexByte check above inside DetermineFileCategory? Would that make it expensive?

Member Author

For some callers doc.Content is empty, so we rely on doc.SkipReason being set in those cases. Also, DetermineFileCategory only mutates doc.Category, not doc.SkipReason.

But I think you might be onto something about making this easier to understand. It isn't crystal clear in my head why things are done exactly as they are.

@keegancsmith keegancsmith merged commit 2d47455 into main Jun 17, 2026
14 checks passed
@keegancsmith keegancsmith deleted the k/shardbuilder-category-before-language branch June 17, 2026 11:17