Skip to content

Add in-map index for O(log n) jsonb field access#37085

Draft
antiguru wants to merge 1 commit into
mainfrom
claude/loving-goldberg-e93dh2
Draft

Add in-map index for O(log n) jsonb field access#37085
antiguru wants to merge 1 commit into
mainfrom
claude/loving-goldberg-e93dh2

Conversation

@antiguru

Copy link
Copy Markdown
Member

Motivation

Accessing a field of a jsonb value currently requires a linear scan of the underlying DatumMap. A "JSON to columns" query that pulls k fields out of an object with n keys does O(n * k) work per row. Since the map entries are already sorted by key at pack time, we can exploit this to enable binary search.

Description

This change adds a small, deterministic index to the in-memory Row encoding of maps, enabling O(log n) single-key lookup instead of O(n) linear scan.

Key changes:

  1. Map payload layout (src/repr/src/row.rs):

    • Non-empty maps now have a header: [count: u32][offset_1: u32]...[offset_{n-1}: u32][entries...]
    • The header stores the entry count and byte offsets of entries 1..n (entry 0 is always at offset 0)
    • Empty maps remain empty (no header) to preserve canonical encoding
    • Entries remain unchanged: sorted (key, value) datum pairs
  2. Index construction (finish_dict function):

    • Called at the end of RowPacker::push_dict_with before the length is fixed
    • Walks the just-written entries to compute offsets, then splices the header in front
    • Uses in-place memmove to avoid extra allocations
  3. New DatumMap methods:

    • len() / is_empty(): Report map cardinality
    • get(key): Binary search for a key, returning its value in O(log n)
    • entries(): Helper to skip the header when iterating
    • entry_offset(i): Helper to locate entry i via the index
  4. Existing code updates:

    • DatumMap::iter() now skips the header, so all iter-based code (equality, hashing, ordering, columnar encoding) is unaffected
    • jsonb_get_string in src/expr/src/scalar/func.rs now uses dict.get(k) instead of iter().find()
  5. Design doc (doc/developer/design/20260616_jsonb_map_index.md):

    • Explains why this is safe (in-memory encoding, not persisted; byte-equality preserved; sort order is implementation-defined)
    • Discusses alternatives considered and why this approach was chosen
  6. Benchmark scenario (misc/python/materialize/feature_benchmark/scenarios/benchmark_main.py):

    • New JsonbToColumns scenario that reads 50 fields from a 50-key object per row
    • Exercises the exact workload this optimization targets

Why this is safe:

  • The Tag byte layout is never persisted (only ProtoRow and Arrow/Parquet are durable)
  • Row equality is byte-equality, and the index is a pure, deterministic function of the sorted entries, so equal maps still produce identical bytes
  • Row sort order is implementation-defined, so changing the map layout has no correctness impact

Verification

  • Added comprehensive unit test test_datum_map_get covering maps of various sizes (0, 1, 2, 3, 7, 16, 50 entries), verifying that get() agrees with linear scan for both present keys and misses, and that len()/is_empty() report correct cardinality
  • Test also verifies nested values (lists) are returned intact via get()
  • Existing tests remain unaffected since iter() skips the header

https://claude.ai/code/session_018oZXhtehttk4y1RtuSd8jD

Accessing a field of a `jsonb` value scanned the underlying `DatumMap`
linearly, so a "JSON to columns" query pulling `k` fields from an `n`-key
object did O(n*k) work per row.

Maps are already key-sorted, so add a small, deterministic index to the
in-memory map encoding and binary search it in a new `DatumMap::get`. Single
field access drops to O(log n) and the `->`, `->>`, `#>`, `#>>` operators use
it transparently. The index is built in `push_dict_with` and skipped by
`iter`, so equality, hashing, ordering, columnar, and proto paths are
unaffected.

This relies on three properties of the Tag-based Row encoding: it is not
persisted (no migration), Row sort order is implementation-defined, and Row
equality is byte equality (the index is a deterministic function of the
sorted entries, so equal maps still encode identically).

Adds a `JsonbToColumns` feature benchmark and a design doc.

https://claude.ai/code/session_018oZXhtehttk4y1RtuSd8jD
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants