[common] Introduce PrefixFileIndex for prefix query optimization #7750
Open
xuzifu666 wants to merge 3 commits into apache:master
Purpose
In real-world analytics scenarios, prefix queries on high-cardinality string columns (for example, predicates of the form LIKE 'prefix%') are very common.
Existing file indexes in Paimon, such as BloomFilter and Bitmap, excel at equality lookups but cannot efficiently handle prefix matching. BloomFilter only checks whether an exact value exists; Bitmap Index maps each distinct value to a bitmap, so it cannot determine which values share a common prefix without scanning all entries.
When no suitable index exists, the query engine must perform a full file scan, reading the entire data file (often tens of MBs) just to discover that no rows match the prefix predicate. This becomes prohibitively expensive at scale.
This PR introduces PrefixFileIndex, a new pluggable file-level index that accelerates prefix queries through a lightweight inverted index structure.
Prefix File Index is an inverted index that maps prefix strings to row-number bitmaps. Unlike Bitmap Index, which indexes exact values, it extracts the first N characters from each string value and groups rows by that prefix.
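As a rough illustration of the structure described above, here is a minimal in-memory sketch. It is not the implementation in this PR: the class name `PrefixIndexSketch`, its methods, and the use of `java.util.BitSet` in place of a compressed bitmap are all assumptions made for the example, and "first N characters" is approximated with UTF-16 code units.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: maps the first N characters of each value to a bitmap of row numbers. */
public class PrefixIndexSketch {
    private final int prefixLength;
    private final Map<String, BitSet> rowsByPrefix = new HashMap<>();
    private int rowCount = 0;

    public PrefixIndexSketch(int prefixLength) {
        if (prefixLength <= 0) {
            throw new IllegalArgumentException("prefixLength must be positive");
        }
        this.prefixLength = prefixLength;
    }

    /** Truncates a value to at most prefixLength characters (the index key). */
    private String keyOf(String value) {
        return value.length() <= prefixLength ? value : value.substring(0, prefixLength);
    }

    /** Appends the next row. Null values are simply not indexed in this sketch. */
    public void add(String value) {
        int row = rowCount++;
        if (value != null) {
            rowsByPrefix.computeIfAbsent(keyOf(value), k -> new BitSet()).set(row);
        }
    }

    /**
     * Returns the rows that may satisfy startsWith(queryPrefix).
     * An empty result means no row can match, so the file can be skipped.
     */
    public BitSet candidates(String queryPrefix) {
        BitSet result = new BitSet();
        if (queryPrefix.length() >= prefixLength) {
            // The query is at least as long as the indexed prefix: only one key can match.
            // Rows under that key are candidates and may still need to be re-checked.
            BitSet rows = rowsByPrefix.get(queryPrefix.substring(0, prefixLength));
            if (rows != null) {
                result.or(rows);
            }
        } else {
            // The query is shorter than the indexed prefix: union every key that starts with it.
            for (Map.Entry<String, BitSet> entry : rowsByPrefix.entrySet()) {
                if (entry.getKey().startsWith(queryPrefix)) {
                    result.or(entry.getValue());
                }
            }
        }
        return result;
    }
}
```

In this sketch, a reader would call `candidates(prefix)` before opening the data file and skip the file when the returned bitmap is empty. When the query is longer than the indexed prefix, the result is a superset of the matching rows, so the predicate still has to be evaluated on those rows.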
Benchmark results:
Test Environment
Data format: {category}_{id}, 5 categories
Benchmark module: paimon-benchmark/paimon-micro-benchmarks
1. Index Size Comparison
Key Finding: Prefix Index size is independent of data cardinality and depends only on the number of distinct prefixes. Even at cardinality 10000, the index remains at ~649 KB.
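The reasoning behind this finding can be checked by counting index keys directly. The sketch below is hypothetical: the category names `cat0` through `cat4`, the prefix length of 4, and the class `PrefixEntryCount` are assumptions chosen for illustration, and it counts only the number of distinct keys, not the serialized index size.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: compares the key count of a prefix index with that of an exact-value index. */
public class PrefixEntryCount {

    /** Builds the i-th value in the assumed {category}_{id} format, e.g. "cat2_7". */
    static String valueOf(int id, int categories) {
        return "cat" + (id % categories) + "_" + id;
    }

    /** Number of distinct prefix keys when indexing the first prefixLength characters. */
    public static int distinctPrefixes(int cardinality, int categories, int prefixLength) {
        Set<String> keys = new HashSet<>();
        for (int id = 0; id < cardinality; id++) {
            String value = valueOf(id, categories);
            keys.add(value.length() <= prefixLength ? value : value.substring(0, prefixLength));
        }
        return keys.size();
    }

    /** Number of distinct full values, which is what an exact-value index would key on. */
    public static int distinctValues(int cardinality, int categories) {
        Set<String> keys = new HashSet<>();
        for (int id = 0; id < cardinality; id++) {
            keys.add(valueOf(id, categories));
        }
        return keys.size();
    }
}
```

Under these assumptions, `distinctPrefixes(100, 5, 4)` and `distinctPrefixes(10000, 5, 4)` return the same count, while `distinctValues` grows with the cardinality. This holds only while the prefix length does not reach into the id portion of the value; a longer prefix would produce more keys.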
2. Index Build Time Comparison
3. Query Performance — Skip Scenario (Core Value)
Querying a non-existent prefix. Without an index, the scan must check all 1 million rows to confirm that nothing matches:
4. Production Scenario Inference
The above tests were conducted in memory, without accounting for disk I/O. In production:
Conclusion
Tests
PrefixFileIndexTest
PrefixIndexBenchmark