perf: accelerate ordered string scans with a regex byte-class prefilter #51
Merged
…refilter

Ordered string comparisons (>, <, >=, <=, between) had no C-level shortcut like the EXACT path's bytes.find: they stepped every byte in Python and decoded a window per offset, so a single search_by_value_between over a process address space took minutes.

A string window can only satisfy an ordered comparison when its first byte falls in a known range (strings compare big-endian, so the first byte dominates the order). Locate those candidates with a regex byte class, whose C engine skips the long NUL runs of reserved/zeroed memory; accept a candidate outright unless its first byte ties a bound, where the full window is decoded and checked.

Measured: 31x on a sparse 150 MB region, and still 1.24x (no regression) on a pathological full-range scan that matches every byte. test_search_by_string_between drops from ~300s to ~13s.

Correctness:
- Width is guaranteed: value_to_bytes pads the target to exactly bufflength bytes, so the first-byte-dominance shortcut is sound.
- Reversed VALUE_BETWEEN (start > end) would compile to a '[hi-lo]' class and raise re.error; guard it to return empty, matching the byte-by-byte loop.
- Compiled with no flags (notably never re.IGNORECASE, which folds ASCII case inside a class and would over-match).
- Output verified byte-for-byte against the reference over 20k randomized cases (incl. reversed ranges, regex-special bytes, sizes 1-20), plus deterministic and hypothesis property tests.
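A minimal sketch of the prefilter idea described in this commit message, restricted to the inclusive "between" comparison. The function name `scan_between` and its signature are hypothetical and are not taken from the project's `scan_memory`; the sketch assumes both bounds have already been padded to the same width.

```python
import re


def scan_between(buf: bytes, lo: bytes, hi: bytes) -> list[int]:
    """Return every offset whose fixed-width window w satisfies lo <= w <= hi.

    Hypothetical helper: assumes lo and hi are already padded to equal width.
    """
    width = len(lo)
    if width == 0 or len(hi) != width:
        raise ValueError("lo and hi must be non-empty and equally wide")

    lo_byte, hi_byte = lo[0], hi[0]
    if lo_byte > hi_byte:
        # Reversed range: nothing can match, and "[hi-lo]" would not compile.
        return []

    # Byte class over the candidate first bytes; endpoints are escaped so
    # regex-special bytes such as "]" or "-" are taken literally. No flags.
    first_byte_class = re.compile(
        b"[" + re.escape(bytes([lo_byte])) + b"-" + re.escape(bytes([hi_byte])) + b"]"
    )

    # Only offsets that leave room for a full window are searched.
    end_of_starts = max(0, len(buf) - width + 1)
    hits = []
    for match in first_byte_class.finditer(buf, 0, end_of_starts):
        offset = match.start()
        first = buf[offset]
        if lo_byte < first < hi_byte:
            # Strictly inside the first-byte range: the first byte alone
            # decides the comparison, so accept without reading the window.
            hits.append(offset)
            continue
        # First byte ties a bound: compare the full window.
        window = buf[offset:offset + width]
        if lo <= window <= hi:
            hits.append(offset)
    return hits


print(scan_between(b"\x00\x00cat\x00dog\x00", b"cat", b"dog"))  # [2, 6]
```

The sketch compares `bytes` objects directly, which for equal-width windows gives the same order as the big-endian integer comparison the commit message refers to.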
5c604ec to ea432f6
The string optimization made two claims in docs/guide/searching.md stale:
- the ordered-comparison loop is no longer uniformly pure-Python: ordered string scans (>, <, between) now run through the regex byte-class prefilter in C, independent of the NumPy speed extra;
- the 'str/bytes scans: pure-Python loop' table row over-generalized.

Also document the string ordering semantics that the fast path relies on: str compares UTF-8 bytes lexicographically (big-endian), the shorter value is NUL-padded to bufflength, and a reversed VALUE_BETWEEN range matches nothing.
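The ordering semantics listed above can be illustrated with a short sketch. `pad_to_width` and `in_range` are hypothetical stand-ins written for this example, not the project's `value_to_bytes`; in particular, raising on an over-long value is an assumption made here.

```python
def pad_to_width(text: str, width: int) -> bytes:
    """Encode as UTF-8 and NUL-pad to exactly `width` bytes (hypothetical helper)."""
    raw = text.encode("utf-8")
    if len(raw) > width:
        # Assumption for this sketch only: over-long values are rejected.
        raise ValueError("value is wider than the buffer")
    return raw.ljust(width, b"\x00")


def in_range(window: bytes, lo: bytes, hi: bytes) -> bool:
    """Inclusive range check on equal-width byte strings."""
    return lo <= window <= hi


ab = pad_to_width("AB", 20)
abc = pad_to_width("ABC", 20)

# The shorter value is padded with NUL bytes up to the buffer length.
assert ab == b"AB" + b"\x00" * 18

# With equal widths, lexicographic byte order and big-endian integer order agree.
assert (ab < abc) == (int.from_bytes(ab, "big") < int.from_bytes(abc, "big"))

# A reversed range (start > end) admits no window at all.
assert not in_range(ab, abc, ab)
assert not in_range(abc, abc, ab)
```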
Problem
Ordered string comparisons (`BIGGER_THAN`, `SMALLER_THAN`, `*_OR_EXACT_VALUE`, and `VALUE_BETWEEN`) had no C-level shortcut. Unlike the `EXACT_VALUE` path (which uses `bytes.find`), they fell into the pure-Python fallback in `scan_memory`, stepping every byte (`step = 1` for strings) and decoding a window with `int.from_bytes` per offset.

Over a real process's address space that's ~1 billion Python iterations, so a single `search_by_value_between(str, …)` took minutes. In the test suite, `test_search_by_string_between` alone was ~300s and gated the whole macOS run (xdist `--dist=loadfile` pins `test_editor.py` to one worker).

Fix
Strings compare big-endian, so a fixed-width window can only satisfy an ordered comparison when its first byte lies in a known range. The fast path:
- Compiles a regex byte class `[lo-hi]` for the candidate first bytes and locates them with `re.finditer`, whose C engine skips the long NUL runs of reserved/zeroed memory, exactly like `bytes.find` does for EXACT.
- Accepts a candidate outright unless its first byte ties a bound, in which case the full window is decoded and checked.

`NOT_EXACT_VALUE` / `NOT_VALUE_BETWEEN` keep the byte-by-byte loop (their match set is dense, so a prefilter wouldn't help), and numerics with unusual sizes (3/6/7 bytes, little-endian) are untouched.

Results
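- Sparse 150 MB region: 31x faster.
- Pathological full-range scan that matches every byte: 1.24x faster (no regression).
- `test_search_by_string_between`: ~300s down to ~13s.

Even the pathological full-range case is faster, because the "accept outright" path skips the per-window `int.from_bytes` the old loop always paid, so no density guard is needed.

Correctness review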
The first-byte-dominance shortcut is only valid under two preconditions, both verified:
- Width is guaranteed. `value_to_bytes` builds a ctypes buffer of exactly `bufflength` bytes and NUL-pads, so the target reaching `scan_memory` is always exactly `target_value_size` wide (searching `"AB"` with `bufflength=20` → `b"AB" + b"\x00"*18`). With equal width, a strictly-greater/smaller first byte determines the full comparison: no false positives or negatives.
- `lo <= hi`. A reversed `VALUE_BETWEEN` (start > end) would compile to a `[hi-lo]` class and raise `re.error: bad character range`. The byte-by-byte loop returns `[]` for start > end, so the fast path now guards `lo_byte > hi_byte` and returns empty to match. (Caught during review; covered by a regression test.)

The pattern is compiled with no flags, notably never `re.IGNORECASE`, which folds ASCII case inside a class and would over-match a range overlapping `A-Z`/`a-z`. Documented inline so it isn't introduced later.

Output is byte-for-byte identical to the previous loop:
- 20k randomized cases (incl. `0x00`/`0xff` extremes, boundary ties, reversed ranges, and regex-special bytes `[ ] ^ - \ & ~ |`), with `FutureWarning` treated as an error.
- `re.escape` confirmed to yield an exact inclusive ordinal class for all 256 byte values.
- Deterministic tests in `test_scan.py` (incl. reversed-range and regex-special bounds) + hypothesis property tests in `test_scan_properties.py`.

References consulted: Python `re` docs (bytes patterns are matched by ordinal and are locale-independent without `re.LOCALE`; `IGNORECASE` ASCII-folds inside classes; `DOTALL`/`MULTILINE` don't affect class membership), CPython issue #74534 / bpo-30349 (the nested-set `FutureWarning`: not scheduled to become an error, and avoided entirely by escaping endpoints).

348 passed, 13 skipped; flake8 and mypy clean. Independent of the dyld-shared-cache fix (#50): different files, composes cleanly.
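For reviewers who want to reproduce the regex-class claims from the correctness review locally, here is a small standalone sketch using only the standard library. `byte_class` and `matched_bytes` are hypothetical names for this example, and the loop checks a sample of endpoint pairs rather than the full test matrix described above.

```python
import re
import warnings

ALL_BYTES = bytes(range(256))


def byte_class(lo_byte: int, hi_byte: int, flags: int = 0) -> "re.Pattern[bytes]":
    """Compile an inclusive single-byte class with escaped endpoints."""
    return re.compile(
        b"[" + re.escape(bytes([lo_byte])) + b"-" + re.escape(bytes([hi_byte])) + b"]",
        flags,
    )


def matched_bytes(pattern: "re.Pattern[bytes]") -> list[int]:
    """Ordinals of every byte value the class accepts."""
    return [m[0] for m in pattern.findall(ALL_BYTES)]


# 1. A reversed range does not compile, so the caller has to guard it.
try:
    byte_class(ord("z"), ord("a"))
    reversed_compiles = True
except re.error:
    reversed_compiles = False

# 2. Escaped endpoints give an exact inclusive ordinal range, with
#    FutureWarning promoted to an error while compiling.
with warnings.catch_warnings():
    warnings.simplefilter("error", FutureWarning)
    exact = all(
        matched_bytes(byte_class(lo, hi)) == list(range(lo, hi + 1))
        for lo in range(256)
        for hi in (lo, 255)
    )

# 3. re.IGNORECASE folds ASCII case inside the class and over-matches.
folded = matched_bytes(byte_class(ord("A"), ord("C"), re.IGNORECASE))

print(reversed_compiles, exact, folded)
```

The third check is the reason the fast path compiles its pattern without flags: a class that should accept only `A` through `C` also accepts the lowercase letters once case folding is enabled.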