Add in-map index for O(log n) jsonb field access #37085
Draft
antiguru wants to merge 1 commit into
Accessing a field of a `jsonb` value scanned the underlying `DatumMap` linearly, so a "JSON to columns" query pulling `k` fields from an `n`-key object did O(n*k) work per row. Maps are already key-sorted, so add a small, deterministic index to the in-memory map encoding and binary search it in a new `DatumMap::get`. Single field access drops to O(log n) and the `->`, `->>`, `#>`, `#>>` operators use it transparently. The index is built in `push_dict_with` and skipped by `iter`, so equality, hashing, ordering, columnar, and proto paths are unaffected. This relies on three properties of the Tag-based Row encoding: it is not persisted (no migration), Row sort order is implementation-defined, and Row equality is byte equality (the index is a deterministic function of the sorted entries, so equal maps still encode identically). Adds a `JsonbToColumns` feature benchmark and a design doc. https://claude.ai/code/session_018oZXhtehttk4y1RtuSd8jD
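To make the cost difference concrete, here is a rough sketch that counts key probes for a linear scan versus a binary search over sorted keys. It is not code from this PR: keys are modelled as plain integers, and `linear_probes` and `binary_probes` are hypothetical helpers that only illustrate the O(n*k) versus O(k log n) argument.

```rust
use std::cmp::Ordering;

/// Number of keys inspected by a front-to-back scan for `key`.
/// Hypothetical helper, not part of the PR.
fn linear_probes(keys: &[u32], key: u32) -> usize {
    let mut probes = 0;
    for k in keys {
        probes += 1;
        if *k == key {
            break;
        }
    }
    probes
}

/// Number of keys inspected by a binary search for `key`.
/// Assumes `keys` is sorted ascending, as map entries are at pack time.
fn binary_probes(keys: &[u32], key: u32) -> usize {
    let (mut lo, mut hi) = (0usize, keys.len());
    let mut probes = 0;
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        probes += 1;
        match keys[mid].cmp(&key) {
            Ordering::Equal => break,
            Ordering::Less => lo = mid + 1,
            Ordering::Greater => hi = mid,
        }
    }
    probes
}
```

With 50 sorted keys and one lookup per key, the linear scan inspects 1 + 2 + ... + 50 = 1275 keys in total, while the binary search needs at most 6 probes per lookup, so at most 300. This counts comparisons only and says nothing about measured runtime.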
Motivation
Accessing a field of a `jsonb` value currently requires a linear scan of the underlying `DatumMap`. A "JSON to columns" query that pulls `k` fields out of an object with `n` keys does `O(n * k)` work per row. Since the map entries are already sorted by key at pack time, we can exploit this to enable binary search.
Description
This change adds a small, deterministic index to the in-memory `Row` encoding of maps, enabling `O(log n)` single-key lookup instead of `O(n)` linear scan.
Key changes:
Map payload layout (`src/repr/src/row.rs`):
- `[count: u32][offset_1: u32]...[offset_{n-1}: u32][entries...]`
- Entries are `(key, value)` datum pairs
Index construction (`finish_dict` function):
- Built in `RowPacker::push_dict_with` before the length is fixed
New `DatumMap` methods:
- `len()` / `is_empty()`: Report map cardinality
- `get(key)`: Binary search for a key, returning its value in `O(log n)`
- `entries()`: Helper to skip the header when iterating
- `entry_offset(i)`: Helper to locate entry `i` via the index
Existing code updates:
- `DatumMap::iter()` now skips the header, so all iter-based code (equality, hashing, ordering, columnar encoding) is unaffected
- `jsonb_get_string` in `src/expr/src/scalar/func.rs` now uses `dict.get(k)` instead of `iter().find()`
Design doc (`doc/developer/design/20260616_jsonb_map_index.md`):
Benchmark scenario (`misc/python/materialize/feature_benchmark/scenarios/benchmark_main.py`):
- `JsonbToColumns` scenario that reads 50 fields from a 50-key object per row
Why this is safe:
- `Tag` byte layout is never persisted (only `ProtoRow` and Arrow/Parquet are durable)
- `Row` equality is byte-equality, and the index is a pure, deterministic function of the sorted entries, so equal maps still produce identical bytes
- `Row` sort order is implementation-defined, so changing the map layout has no correctness impact
Verification
- `test_datum_map_get` covering maps of various sizes (0, 1, 2, 3, 7, 16, 50 entries), verifying that `get()` agrees with linear scan for both present keys and misses, and that `len()` / `is_empty()` report correct cardinality
- `get()`
- `iter()` skips the header
https://claude.ai/code/session_018oZXhtehttk4y1RtuSd8jD
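
For readers who want to see the header layout in action, the following is a minimal, self-contained sketch of a `[count][offset_1]...[offset_{n-1}][entries...]` payload with a binary-search lookup. It is not the code in `src/repr/src/row.rs`. The `pack` function, the `MapView` type, the `entry_at` helper, the little-endian `u32` fields, the length-prefixed string keys, the `i64` values, and offsets measured from the start of the entry region are all assumptions made for illustration; the real encoding stores `Datum`s behind `Tag` bytes.

```rust
use std::cmp::Ordering;

/// Reads a little-endian u32 at byte position `at`.
fn read_u32(buf: &[u8], at: usize) -> u32 {
    let mut b = [0u8; 4];
    b.copy_from_slice(&buf[at..at + 4]);
    u32::from_le_bytes(b)
}

/// Hypothetical packer: sorts entries by key, then writes
/// [count][offset_1..offset_{n-1}][entries...]. Keys are assumed unique.
fn pack(input: &[(&str, i64)]) -> Vec<u8> {
    let mut sorted = input.to_vec();
    sorted.sort_by_key(|entry| entry.0);
    let (mut entries, mut offsets) = (Vec::new(), Vec::new());
    for (i, (key, value)) in sorted.iter().enumerate() {
        if i > 0 {
            // Entry 0 always starts at offset 0, so it needs no slot.
            offsets.push(entries.len() as u32);
        }
        entries.extend_from_slice(&(key.len() as u32).to_le_bytes());
        entries.extend_from_slice(key.as_bytes());
        entries.extend_from_slice(&value.to_le_bytes());
    }
    let mut out = (sorted.len() as u32).to_le_bytes().to_vec();
    for off in &offsets {
        out.extend_from_slice(&off.to_le_bytes());
    }
    out.extend_from_slice(&entries);
    out
}

/// Hypothetical read-only view over a packed map payload.
#[derive(Clone, Copy)]
struct MapView<'a> {
    buf: &'a [u8],
}

impl<'a> MapView<'a> {
    fn len(self) -> usize {
        read_u32(self.buf, 0) as usize
    }
    fn is_empty(self) -> bool {
        self.len() == 0
    }
    /// Entry region: everything after the count and the n-1 offsets.
    fn entries(self) -> &'a [u8] {
        &self.buf[4 + 4 * self.len().saturating_sub(1)..]
    }
    /// Start of entry `i` within the entry region.
    fn entry_offset(self, i: usize) -> usize {
        if i == 0 { 0 } else { read_u32(self.buf, 4 * i) as usize }
    }
    /// Decodes the (key, value) pair stored at index `i`.
    fn entry_at(self, i: usize) -> (&'a str, i64) {
        let (e, at) = (self.entries(), self.entry_offset(i));
        let key_len = read_u32(e, at) as usize;
        let key = std::str::from_utf8(&e[at + 4..at + 4 + key_len]).unwrap();
        let mut v = [0u8; 8];
        v.copy_from_slice(&e[at + 4 + key_len..at + 12 + key_len]);
        (key, i64::from_le_bytes(v))
    }
    /// Binary search over the offset index: O(log n) entry decodes.
    fn get(self, key: &str) -> Option<i64> {
        let (mut lo, mut hi) = (0, self.len());
        while lo < hi {
            let mid = lo + (hi - lo) / 2;
            let (k, v) = self.entry_at(mid);
            match k.cmp(key) {
                Ordering::Equal => return Some(v),
                Ordering::Less => lo = mid + 1,
                Ordering::Greater => hi = mid,
            }
        }
        None
    }
    /// Yields entries in key order without exposing the header.
    fn iter(self) -> impl Iterator<Item = (&'a str, i64)> + 'a {
        (0..self.len()).map(move |i| self.entry_at(i))
    }
}
```

Because `pack` derives the offsets purely from the sorted entries, packing the same pairs in a different input order yields the same bytes in this sketch, which mirrors the byte-equality argument above. For example, `MapView { buf: &pack(&[("b", 2), ("a", 1)]) }.get("a")` returns `Some(1)`, and a lookup for an absent key returns `None`.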