fix/index: preserve skipped-file category through shard paths by keegancsmith · Pull Request #1073 · sourcegraph/zoekt

keegancsmith · 2026-06-17T10:43:14Z

ShardBuilder.Add still determined language before file category for callers that bypass Builder.Add. In that path skipped content can be replaced with the not-indexed marker, so determining language first leaves category detection operating on synthetic content instead of the original file, and it also misses the cheaper skip-aware language path.

This changes ShardBuilder.Add to determine the file category before rewriting skipped content, then infer language afterward so direct ShardBuilder callers follow the same behavior as Builder.Add and keep content-aware categorization for skipped documents.
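The ordering described above can be sketched as follows. This is a minimal illustration, not zoekt's code: the `document` type, the `notIndexedMarker` value, the skip-reason strings, and the `categorize`, `inferLanguage`, and `add` helpers are all hypothetical stand-ins, and the classification rules are toy heuristics chosen only to make the ordering visible.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// document is a hypothetical stand-in for an indexed file.
// It is not zoekt's Document type.
type document struct {
	Name       string
	Content    []byte
	SkipReason string
	Category   string
	Language   string
}

// notIndexedMarker is an assumed placeholder prefix for skipped content.
const notIndexedMarker = "NOT-INDEXED: "

// categorize is a toy content-aware classifier. It only sets Category.
func categorize(d *document) {
	switch {
	case bytes.Contains(d.Content, []byte("Code generated")):
		d.Category = "generated"
	case strings.HasSuffix(d.Name, "_test.go"):
		d.Category = "test"
	default:
		d.Category = "code"
	}
}

// inferLanguage is a toy name-based detector, standing in for a cheap
// path that does not need to read the (possibly replaced) content.
func inferLanguage(d *document) {
	if d.Language == "" && strings.HasSuffix(d.Name, ".go") {
		d.Language = "Go"
	}
}

// add mirrors the ordering from the description: categorize while the
// original content is still present, then replace skipped content,
// then infer the language.
func add(d document) document {
	if d.SkipReason == "" && bytes.IndexByte(d.Content, 0) >= 0 {
		d.SkipReason = "binary"
	}
	categorize(&d) // sees the original bytes, not the marker
	if d.SkipReason != "" {
		d.Content = []byte(notIndexedMarker + d.SkipReason)
	}
	inferLanguage(&d)
	return d
}

func main() {
	got := add(document{
		Name:       "api.pb.go",
		Content:    []byte("// Code generated by protoc. DO NOT EDIT.\npackage api\n"),
		SkipReason: "too large",
	})
	fmt.Println(got.Category, got.Language, string(got.Content))
}
```

If `categorize` ran after the content replacement in this sketch, it would only see the marker text and would fall through to the default category, which is the failure mode the description refers to.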

Shard merging also reconstructs documents through ShardBuilder, so this PR carries the category already stored in the source shard into the rebuilt document. That keeps merge metadata-preserving instead of forcing category inference to rediscover information the original indexer already knew.
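A rough sketch of the merge behavior described above, under stated assumptions: `storedDoc`, `rebuiltDoc`, `inferCategory`, and `rebuild` are hypothetical names, not zoekt APIs, and the fallback to inference when no category was stored is an assumption added for illustration rather than something the PR states.

```go
package main

import (
	"bytes"
	"fmt"
)

// storedDoc is a hypothetical view of per-document metadata read back
// from a source shard during a merge.
type storedDoc struct {
	Name     string
	Category string // category recorded by the original indexer, if any
}

// rebuiltDoc is a hypothetical document handed to the shard builder.
type rebuiltDoc struct {
	Name     string
	Content  []byte
	Category string
}

// inferCategory is a toy content-based classifier.
func inferCategory(content []byte) string {
	if bytes.Contains(content, []byte("Code generated")) {
		return "generated"
	}
	return "code"
}

// rebuild carries the stored category into the rebuilt document and
// infers one only when nothing was stored (assumed fallback).
func rebuild(s storedDoc, content []byte) rebuiltDoc {
	d := rebuiltDoc{Name: s.Name, Content: content, Category: s.Category}
	if d.Category == "" {
		d.Category = inferCategory(content)
	}
	return d
}

func main() {
	// A skipped generated file: the shard holds only placeholder content,
	// so inference alone could not recover the "generated" category.
	src := storedDoc{Name: "api.pb.go", Category: "generated"}
	d := rebuild(src, []byte("NOT-INDEXED: too large"))
	fmt.Println(d.Name, d.Category)
}
```

The point of the sketch is that the placeholder content no longer contains the signal used for categorization, so the stored value is the only reliable source during reconstruction.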

ShardBuilder.Add still determined language before file category for callers that bypass Builder.Add. In that path we replace skipped content with the not-indexed marker, so doing language first left category detection operating on synthetic content instead of the original file and missed the cheaper skip-aware language path discussed in review.

Determine the file category before rewriting skipped content, then infer language afterward so direct ShardBuilder callers follow the same behavior as Builder.Add and keep content-aware categorization for skipped documents.

Test Plan: go test ./index -run 'TestShardBuilderAddDeterminesCategoryBeforeReplacingSkippedContent|TestDetermineLanguageIfUnknown|TestFileRank|TestDetermineFileCategory'

Amp-Thread-ID: https://ampcode.com/threads/T-019d7686-51be-707a-a017-6ddd291e18e3
Co-authored-by: Amp <amp@ampcode.com>

Merged shards reconstruct documents through ShardBuilder, so relying on category inference can discard metadata that the original indexer already stored. This is especially fragile for skipped generated files where the content used for categorization may not survive reconstruction.

Carry the stored category into the rebuilt document so merging remains metadata-preserving.

Amp-Thread-ID: https://ampcode.com/threads/T-019ed48e-abd7-73cc-877e-1866db891594
Co-authored-by: Amp <amp@ampcode.com>

burmudar · 2026-06-17T11:08:00Z

 		}
+		// Preserve the original content for category detection in callers that
+		// bypass Builder.Add and pass skipped documents directly.
+		DetermineFileCategory(&doc)


Minor, but why don't we do the bytes.IndexByte check above inside DetermineFileCategory? Would that make it expensive?

For some callers doc.Content is empty, so we rely on doc.SkipReason being set in those cases. Also, DetermineFileCategory only mutates doc.Category, not doc.SkipReason.

But I think you might be on to something around making this easier to understand. It isn't crystal clear in my head why things are done exactly as they are.

keegancsmith and others added 2 commits April 10, 2026 10:38

keegancsmith mentioned this pull request Jun 17, 2026

fix/index: preserve category when merging shards #1072

Merged

keegancsmith changed the title ~~fix/index: preserve skipped-file categorization before language fallback~~ fix/index: preserve skipped-file category through shard paths Jun 17, 2026

keegancsmith requested review from burmudar and stefanhengl June 17, 2026 10:54

keegancsmith marked this pull request as ready for review June 17, 2026 10:54

stefanhengl approved these changes Jun 17, 2026

burmudar approved these changes Jun 17, 2026

keegancsmith merged commit 2d47455 into main Jun 17, 2026
14 checks passed

keegancsmith deleted the k/shardbuilder-category-before-language branch June 17, 2026 11:17

keegancsmith merged 2 commits into main from k/shardbuilder-category-before-language
