[common] Introduce PrefixFileIndex for prefix query optimization by xuzifu666 · Pull Request #7750 · apache/paimon

xuzifu666 · 2026-04-30T09:13:40Z

Purpose

In real-world analytics scenarios, prefix queries on high-cardinality string columns are very common. For example:

WHERE url LIKE '/api/v1/%'
WHERE order_id LIKE 'ORD2024%'

Existing file indexes in Paimon, such as BloomFilter and Bitmap, excel at equality lookups but cannot efficiently handle prefix matching. BloomFilter only checks exact value existence; Bitmap Index maps each distinct value to a bitmap, making it impossible to determine which values share a common prefix without scanning all entries.When no suitable index exists, the query engine must perform a full file scan — reading the entire data file (often tens of MBs) just to discover that no rows match the prefix predicate. This becomes prohibitively expensive at scale.

This PR introduces PrefixFileIndex, a new pluggable file-level index that accelerates prefix queries through a lightweight inverted index structure.

Prefix File Index is an inverted index that maps prefix strings to row number bitmaps. Unlike Bitmap Index which indexes exact values, it extracts the first N characters from each string value and groups rows by their prefix.

According to benchmark test result：

Test Environment

CPU: Apple M4
JVM: Java HotSpot 17.0.12
Data Volume: 1 million string rows ({category}_{id} format, 5 categories)
Test Module: paimon-benchmark/paimon-micro-benchmarks

1. Index Size Comparison

Cardinality	PrefixLen=2	PrefixLen=3	PrefixLen=4	BitmapIndex	Raw Data	Prefix Space Saving
100	649 KB	649 KB	649 KB	2.0 MB	13.1 MB	20x vs data
1000	649 KB	649 KB	649 KB	2.7 MB	14.7 MB	23x vs data
10000	649 KB	649 KB	649 KB	7.7 MB	15.7 MB	24x vs data

Key Finding: Prefix Index size is independent of data cardinality, depending only on the number of prefix types. Even at cardinality 10000, the index remains at ~649KB.

2. Index Build Time Comparison

Cardinality	PrefixLen=2	PrefixLen=3	PrefixLen=4	BitmapIndex	Prefix Build Speedup
100	126 ms	114 ms	110 ms	163 ms	1.3-1.5x
1000	112 ms	112 ms	110 ms	246 ms	2.2x
10000	111 ms	110 ms	113 ms	633 ms	5.6x

3. Query Performance — Skip Scenario (Core Value)

Querying a non-existing prefix; no-index scan must check all 1 million rows to confirm no match:

Cardinality	PrefixIndex	BitmapIndex	No-Index-Full-Scan	Prefix Index Speedup
100	~1.3 μs	~2.1 μs	12.558 μs	~9.8x
1000	~1.1 μs	~2.1 μs	13.147 μs	~11.9x
10000	~1.2 μs	~1.6 μs	13.262 μs	~11.3x

4. Production Scenario Inference

The above tests were conducted in memory, without accounting for disk I/O. In production:

Scenario	No Index	Prefix Index	Inferred Speedup
Data file size	~15 MB (1M rows)	~649 KB	-
Disk read time	~50-200 ms	~0.1 ms (cache hit)	500-2000x
Skip decision time	Must read all data	Returns SKIP in 1 μs	Tens of thousands x

Conclusion

Dimension	Prefix Index Advantage
Index Size	Only 1/20 of raw data, 3-12x smaller than Bitmap Index
Build Speed	Up to 5.6x faster in high-cardinality scenarios
Skip Performance	10-12x faster in memory, hundreds to thousands x in real disk I/O

Tests

PrefixFileIndexTest
PrefixIndexBenchmark

xuzifu666 added 3 commits April 30, 2026 15:51

[core] Introduce PrefixFileIndex for prefix query optimization

560eb1e

added

7ce0e7d

added

c346975

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[common] Introduce PrefixFileIndex for prefix query optimization#7750

[common] Introduce PrefixFileIndex for prefix query optimization#7750
xuzifu666 wants to merge 3 commits intoapache:masterfrom
xuzifu666:prefix_file_index_support

xuzifu666 commented Apr 30, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

xuzifu666 commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Test Environment

1. Index Size Comparison

2. Index Build Time Comparison

3. Query Performance — Skip Scenario (Core Value)

4. Production Scenario Inference

Conclusion

Tests

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

xuzifu666 commented Apr 30, 2026 •

edited

Loading