Add in-map index for O(log n) jsonb field access by antiguru · Pull Request #37085 · MaterializeInc/materialize

antiguru · 2026-06-16T20:52:26Z

Motivation

Accessing a field of a jsonb value currently requires a linear scan of the underlying DatumMap. A "JSON to columns" query that pulls k fields out of an object with n keys does O(n * k) work per row. Since the map entries are already sorted by key at pack time, we can exploit this to enable binary search.

Description

This change adds a small, deterministic index to the in-memory Row encoding of maps, enabling O(log n) single-key lookup instead of an O(n) linear scan.

Key changes:

Map payload layout (src/repr/src/row.rs):
- Non-empty maps now have a header: [count: u32][offset_1: u32]...[offset_{n-1}: u32][entries...]
- The header stores the entry count and byte offsets of entries 1..n (entry 0 is always at offset 0)
- Empty maps remain empty (no header) to preserve canonical encoding
- Entries remain unchanged: sorted (key, value) datum pairs
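
The layout above can be sketched as follows. This is a simplified illustration, not the code in src/repr/src/row.rs: the entry encoding (`[key_len: u8][key][val_len: u8][val]`), the little-endian byte order, the choice to measure offsets from the start of the entries region, and the names `encode_entry` and `encode_map` are all assumptions made for the example. Keys and values are assumed to be shorter than 256 bytes.

```rust
/// Hypothetical entry encoding used only in this sketch:
/// [key_len: u8][key bytes][val_len: u8][val bytes].
fn encode_entry(key: &str, val: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(2 + key.len() + val.len());
    out.push(key.len() as u8);
    out.extend_from_slice(key.as_bytes());
    out.push(val.len() as u8);
    out.extend_from_slice(val);
    out
}

/// Lays out a map payload as [count][offset_1]...[offset_{n-1}][entries...].
/// `sorted` must already be sorted by key. Offsets are measured from the
/// start of the entries region (an assumption), so entry 0 is implicitly at 0.
fn encode_map(sorted: &[(&str, &[u8])]) -> Vec<u8> {
    // Empty maps stay empty: no header at all.
    if sorted.is_empty() {
        return Vec::new();
    }
    let entries: Vec<Vec<u8>> = sorted.iter().map(|(k, v)| encode_entry(k, v)).collect();

    let mut out = Vec::new();
    out.extend_from_slice(&(entries.len() as u32).to_le_bytes());

    // Offsets of entries 1..n: running sum of the preceding entry sizes.
    let mut offset = 0u32;
    for e in &entries[..entries.len() - 1] {
        offset += e.len() as u32;
        out.extend_from_slice(&offset.to_le_bytes());
    }
    for e in &entries {
        out.extend_from_slice(e);
    }
    out
}
```

The header costs exactly 4 bytes per entry: 4 for the count plus 4 for each of the n - 1 stored offsets.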

Index construction (finish_dict function):
- Called at the end of RowPacker::push_dict_with before the length is fixed
- Walks the just-written entries to compute offsets, then splices the header in front
- Uses in-place memmove to avoid extra allocations
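
A minimal sketch of the splice step, under the same simplifying assumptions as before: entries use a hypothetical `[key_len: u8][key][val_len: u8][val]` encoding, integers are little-endian, and offsets are relative to the first entry. The names `finish_map` and `next_entry` are illustrative and are not the PR's `finish_dict`. `Vec::resize` followed by `copy_within` plays the role of the in-place memmove.

```rust
/// Returns the position just past the entry starting at `pos`
/// (hypothetical encoding: [key_len: u8][key][val_len: u8][val]).
fn next_entry(buf: &[u8], pos: usize) -> usize {
    let key_len = buf[pos] as usize;
    let val_len = buf[pos + 1 + key_len] as usize;
    pos + 2 + key_len + val_len
}

/// Splices an index header in front of the entries stored in `buf[start..]`.
fn finish_map(buf: &mut Vec<u8>, start: usize) {
    // Pass 1: count the entries that were just written.
    let mut count = 0usize;
    let mut pos = start;
    while pos < buf.len() {
        pos = next_entry(buf.as_slice(), pos);
        count += 1;
    }
    // Empty maps get no header.
    if count == 0 {
        return;
    }

    // Grow the buffer and shift the entries right by the header size.
    // The header is 4 bytes for the count plus 4 bytes per stored offset.
    let header_len = 4 * count;
    let entries_len = buf.len() - start;
    buf.resize(buf.len() + header_len, 0);
    buf.copy_within(start..start + entries_len, start + header_len);

    // Write the count, then walk the shifted entries to fill in offsets 1..n.
    buf[start..start + 4].copy_from_slice(&(count as u32).to_le_bytes());
    let base = start + header_len;
    let mut pos = base;
    for i in 1..count {
        pos = next_entry(buf.as_slice(), pos);
        let at = start + 4 * i;
        let off = (pos - base) as u32;
        buf[at..at + 4].copy_from_slice(&off.to_le_bytes());
    }
}
```

Walking the entries twice trades a second pass for not needing a temporary offsets vector; the only allocation is the possible growth of the buffer itself.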

New DatumMap methods:
- len() / is_empty(): Report map cardinality
- get(key): Binary search for a key, returning its value in O(log n)
- entries(): Helper to skip the header when iterating
- entry_offset(i): Helper to locate entry i via the index
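
To illustrate how a lookup could use the index, here is a sketch of a binary search over a payload in the `[count][offset_1]...[offset_{n-1}][entries...]` layout. It is not the actual `DatumMap::get`: it operates on raw bytes, assumes a hypothetical `[key_len: u8][key][val_len: u8][val]` entry encoding with little-endian integers and offsets relative to the first entry, and assumes keys are sorted bytewise.

```rust
use std::cmp::Ordering;

/// Reads a little-endian u32 at byte position `at`.
fn read_u32(bytes: &[u8], at: usize) -> usize {
    u32::from_le_bytes([bytes[at], bytes[at + 1], bytes[at + 2], bytes[at + 3]]) as usize
}

/// Byte offset of entry `i` within the entries region. Entry 0 is implicit;
/// offset_i for i >= 1 is stored at byte 4 * i of the header.
fn entry_offset(payload: &[u8], i: usize) -> usize {
    if i == 0 { 0 } else { read_u32(payload, 4 * i) }
}

/// Splits the entry starting at `pos` into (key, value).
fn entry_at(entries: &[u8], pos: usize) -> (&[u8], &[u8]) {
    let key_len = entries[pos] as usize;
    let key = &entries[pos + 1..pos + 1 + key_len];
    let val_pos = pos + 1 + key_len;
    let val_len = entries[val_pos] as usize;
    (key, &entries[val_pos + 1..val_pos + 1 + val_len])
}

/// Binary search for `key`, touching O(log n) entries.
fn get<'a>(payload: &'a [u8], key: &str) -> Option<&'a [u8]> {
    // Empty maps have no header.
    if payload.is_empty() {
        return None;
    }
    let count = read_u32(payload, 0);
    // Skip the header: 4 bytes for the count plus 4 per stored offset.
    let entries = &payload[4 * count..];

    let (mut lo, mut hi) = (0, count);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        let (k, v) = entry_at(entries, entry_offset(payload, mid));
        match k.cmp(key.as_bytes()) {
            Ordering::Less => lo = mid + 1,
            Ordering::Greater => hi = mid,
            Ordering::Equal => return Some(v),
        }
    }
    None
}
```

Each probe decodes only the key of the entry it lands on, so a miss and a hit both cost at most about log2(n) + 1 key comparisons.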

Existing code updates:
- DatumMap::iter() now skips the header, so all iter-based code (equality, hashing, ordering, columnar encoding) is unaffected
- jsonb_get_string in src/expr/src/scalar/func.rs now uses dict.get(k) instead of iter().find()

Design doc (doc/developer/design/20260616_jsonb_map_index.md):
- Explains why this is safe (in-memory encoding, not persisted; byte-equality preserved; sort order is implementation-defined)
- Discusses alternatives considered and why this approach was chosen

Benchmark scenario (misc/python/materialize/feature_benchmark/scenarios/benchmark_main.py):
- New JsonbToColumns scenario that reads 50 fields from a 50-key object per row
- Exercises the exact workload this optimization targets

Why this is safe:

- The Tag byte layout is never persisted (only ProtoRow and Arrow/Parquet are durable)
- Row equality is byte-equality, and the index is a pure, deterministic function of the sorted entries, so equal maps still produce identical bytes
- Row sort order is implementation-defined, so changing the map layout has no correctness impact

Verification

- Added unit test test_datum_map_get covering maps with 0, 1, 2, 3, 7, 16, and 50 entries. It verifies that get() agrees with a linear scan for both present and missing keys, and that len()/is_empty() report the correct cardinality
- The test also verifies that nested values (lists) are returned intact via get()
- Existing tests remain unaffected since iter() skips the header

https://claude.ai/code/session_018oZXhtehttk4y1RtuSd8jD

Accessing a field of a `jsonb` value scanned the underlying `DatumMap` linearly, so a "JSON to columns" query pulling `k` fields from an `n`-key object did O(n*k) work per row.

Maps are already key-sorted, so add a small, deterministic index to the in-memory map encoding and binary search it in a new `DatumMap::get`. Single field access drops to O(log n) and the `->`, `->>`, `#>`, `#>>` operators use it transparently. The index is built in `push_dict_with` and skipped by `iter`, so equality, hashing, ordering, columnar, and proto paths are unaffected.

This relies on three properties of the Tag-based Row encoding: it is not persisted (no migration), Row sort order is implementation-defined, and Row equality is byte equality (the index is a deterministic function of the sorted entries, so equal maps still encode identically).

Adds a `JsonbToColumns` feature benchmark and a design doc.

https://claude.ai/code/session_018oZXhtehttk4y1RtuSd8jD

antiguru wants to merge 1 commit into main from claude/loving-goldberg-e93dh2
