refactor(init): migrate grep/glob tools to src/lib/scan/ #797
Open
Replace the rg → git grep → fs walk fallback chain in the init-wizard grep and glob tools with direct calls to the pure-TS `collectGrep` / `collectGlob` helpers from `src/lib/scan/` (shipped in PR #791). The Mastra wire contract is preserved verbatim on all existing fields.

## Changes

- **`src/lib/init/tools/grep.ts`**: 299 → 114 LOC. Drops `rgGrepSearch`, `gitGrepSearch`, `fsGrepSearch`, `parseRgGrepOutput`, `parseGrepOutput`, `findRegexMatches`, `readSearchableFile`, `truncateMatchLine`, `limitMatches`, `compilePattern`. Replaces with a thin adapter that forwards each search to `collectGrep`, strips the `absolutePath` field (not part of the wire contract), and catches `ValidationError` from bad regexes so a single bad pattern doesn't abort the whole payload.
- **`src/lib/init/tools/glob.ts`**: 145 → 73 LOC. Drops `rgGlobSearch`, `gitLsFiles`, `fsGlobSearch`. Replaces with a thin adapter that calls `collectGlob` per pattern (preserving per-pattern attribution, which `collectGlob`'s unioning would lose).
- **`src/lib/init/tools/search-utils.ts`**: DELETED (146 LOC). Zero callers after the grep/glob rewrites.
- **`src/lib/init/types.ts::GrepSearch`**: two optional fields added: `caseInsensitive?: boolean` and `multiline?: boolean`. No current Mastra server invocation sets them; both default to what the scan engine defaults to (case-sensitive, line-boundary anchoring, i.e. rg semantics). Future-proofing.
- **`src/lib/scan/glob.ts`**: updated a stale doc comment that referenced the deleted `search-utils.ts::matchGlob`.

## Test changes

- Drop 3 obsolete subprocess-fallback tests that shadowed `rg` in `PATH` to force the fallback chain. With the pure-TS adapter there's no fallback chain to exercise and the tests had become tautological.
- Drop the `writeExecutable`, `setPath`, `helperBinDir`, `savedPath` scaffolding those tests depended on.
- Keep 3 pre-existing wire-behavior/sandbox tests unchanged.
- Add 3 adapter-specific tests:
  - `grep result matches MUST NOT include absolutePath`: pins the adapter's strip behavior (scan returns it; Mastra must not see it).
  - `grep bad regex yields empty matches without crashing the payload`: documents the `ValidationError` catch contract.
  - `grep caseInsensitive flag enables case-insensitive matching`: end-to-end coverage of the new wire field.

## Behavior changes (intentional, from scan module defaults)

Three behavior shifts land for the pre-PR-791 fs-walk fallback (users without `rg` or `git` installed):

- **Nested `.gitignore` now honored.** Old fs-walk fallback ignored gitignore entirely; scan respects cumulative gitignore semantics (matches what `rg` / git grep already did).
- **Wider skip-dir list.** Scan skips `.next`, `target`, `vendor`, `coverage`, `.cache`, `.turbo` in addition to the old skip set, matching rg's built-in skips.
- **Binary files filtered.** Scan runs an 8 KB NUL-byte sniff before emitting a file to grep; binary matches (e.g. inside a `.png`) no longer appear. Again, matches `rg`'s default.

Users with `rg` installed see zero change; they never took the fallback path anyway. Users without rg/git get rg-like behavior instead of the old naive fs walk.

## Net LOC

Before: 299 + 145 + 146 = 590 LOC across three files.
After: 114 + 73 + 0 (+17 on types.ts) = 204 LOC across two files.
**Net: −386 LOC of production code.**

## Test plan

- [x] `bunx tsc --noEmit`: clean
- [x] `bun run lint`: clean (1 pre-existing warning in `markdown.ts`)
- [x] `bun test --timeout 15000 test/lib test/commands test/types`: **5507 pass, 0 fail** (+1 net: −3 obsolete, +4 new adapter tests, +absolutePath regression)
- [x] `bun test test/isolated`: 138 pass
- [x] Manual: `bun test test/lib/init/tools/`: 30 pass, wire contract preserved end-to-end via `executeTool`

Follow-up to PR #791.
Semver Impact of This PR: 🟢 Patch (bug fixes)

Changelog Preview: Internal Changes 🔧, Init
Codecov Results 📊

✅ 138 passed | Total: 138 | Pass Rate: 100% | Execution Time: 0ms

📊 Comparison with Base Branch: ✨ No test changes detected. All tests are passing successfully.

✅ Patch coverage is 100.00%. Project has 1738 uncovered lines.

Coverage diff:

| Metric | main | #PR | +/- |
|---|---:|---:|---:|
| Coverage | 95.66% | 95.63% | -0.03% |
| Files | 280 | 279 | -1 |
| Lines | 40060 | 39794 | -266 |
| Branches | 0 | 0 | 0 |
| Hits | 38322 | 38056 | -266 |
| Misses | 1738 | 1738 | 0 |
| Partials | 0 | 0 | 0 |

Generated by Codecov Action
betegon (Member) approved these changes on Apr 21, 2026 and left a comment:

thanks for this. works great with init!
BYK added a commit that referenced this pull request on Apr 21, 2026:
## Summary

Two regex-level optimizations to narrow the perf gap with ripgrep on our pure-TS `collectGrep`/`grepFiles`. Follow-up to PR #791 and #797.

- **Literal prefilter** (ripgrep-style): extract a literal substring from the regex source (e.g., `import` from `import.*from`), scan the buffer with `indexOf` to locate candidate lines, only invoke the regex engine on lines that contain the literal. V8's `indexOf` is roughly SIMD-speed; skipping the regex engine on non-candidate lines is where most of the win comes from.
- **Lazy line counting**: swapped `charCodeAt`-walk for `indexOf("\n", cursor)` hops. 2-5× faster on the line-counting sub-loop because V8 implements `indexOf` in C++ without per-iteration JS interop.

## Perf impact (synthetic/large, 10k files, Bun 1.3.11, 4-core)

| Op | Before | After | Δ |
|---|---:|---:|---:|
| `scan.grepFiles` (DSN pattern) | 370 ms | **318 ms** | **−14%** |
| `detectAllDsns.cold` | 363 ms | **313 ms** | **−14%** |
| `detectDsn.cold` | 7.73 ms | **5.61 ms** | **−27%** |
| `scanCodeForFirstDsn` | 2.91 ms | **2.13 ms** | **−27%** |
| `scanCodeForDsns` | 342 ms | 333 ms | −3% (noise-equivalent) |
| `import.*from` uncapped (bench) | 1489 ms | **1178 ms** | **−21%** |

The DSN workloads improve because `DSN_PATTERN` extracts `http` as its literal: most source files don't contain `http` at all, so the prefilter short-circuits before the regex runs.

No regressions on any benchmark. Pure-literal patterns (e.g., `SENTRY_DSN`, `NONEXISTENT_TOKEN_XYZ`) continue through the whole-buffer path unchanged.

## What changed

### New file: `src/lib/scan/literal-extract.ts` (~300 LOC)

Conservative literal extractor. Walks a regex source looking for the longest contiguous run of literal bytes that every match must contain. Bails out safely on top-level alternation, character classes, groups, lookarounds, quantifiers, and escape classes. Handles escaped metacharacters intelligently: `Sentry\.init` yields `Sentry.init` (extracted via literal `\.` → `.`), while `\bfoo\b` yields `foo` (escape `\b` is an anchor, not a literal `b`).

Exports:

- `extractInnerLiteral(source, flags)`: returns the literal, or null if no safe extraction possible. Honors `/i` by lowercasing.
- `isPureLiteral(source, flags)`: true when the pattern IS a bare literal with no metacharacters. Used by the grep pipeline to route pure-literals to the whole-buffer path (V8's regex engine is hyper-optimized for pure-literal patterns; the prefilter adds overhead without benefit there).

### Modified: `src/lib/scan/grep.ts` (~240 LOC changes)

Three-way dispatch in `readAndGrep` based on the extracted literal:

1. **`grepByLiteralPrefilter`** (new): regex with extractable literal + `multiline: true`. Uses `indexOf(literal)` to find candidate lines, runs the regex engine only on those. This is the main perf win.
2. **`grepByWholeBuffer`**: existing path, used for:
   - Pure-literal patterns (V8 handles them optimally)
   - Patterns with no extractable literal (complex regex, top-level alternation)
   - `multiline: false` mode (the fast path requires per-line semantics)

Also: replaced the `charCodeAt`-walk that counted newlines char-by-char with an `indexOf("\n", cursor)` hop loop. Extracted `buildMatch(ctx, bounds)` as a shared helper to bundle the match-construction arguments.

### Tests added

- `test/lib/scan/literal-extract.test.ts`: **39 tests** covering the extractor's rules (escape handling, quantifier drop, alternation bail, case-insensitive, minimum length).
- `test/lib/scan/grep.test.ts`: **7 new tests** for the prefilter fast path: correctness vs whole-buffer, escaped-literal extraction, case-insensitive flag, zero-literal-hit short-circuit, routing of pure literals to whole-buffer, and alternation routing.
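The literal-prefilter idea from this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in `src/lib/scan/grep.ts`: `grepWithLiteralPrefilter` and `LineMatch` are hypothetical names, and the sketch assumes the caller has already extracted a non-empty literal that contains no newline and that every match must contain.

```typescript
// Hypothetical sketch of a literal prefilter; names are illustrative only.
type LineMatch = { line: number; text: string };

/**
 * Returns the lines of `content` that match `regex`, running the regex
 * only on lines that contain `literal`. Assumes every regex match must
 * contain `literal` and that `literal` has no newline in it.
 */
function grepWithLiteralPrefilter(
  content: string,
  regex: RegExp,
  literal: string
): LineMatch[] {
  const matches: LineMatch[] = [];
  // An empty literal filters nothing; a real caller would use another path.
  if (literal.length === 0) return matches;

  let lineNo = 1; // 1-based number of the line that starts at lineStart
  let lineStart = 0;
  let hit = content.indexOf(literal, 0);

  while (hit !== -1) {
    // Hop newline-to-newline until the current line contains the hit.
    let nl = content.indexOf("\n", lineStart);
    while (nl !== -1 && nl < hit) {
      lineNo += 1;
      lineStart = nl + 1;
      nl = content.indexOf("\n", lineStart);
    }
    const lineEnd = nl === -1 ? content.length : nl;
    const text = content.slice(lineStart, lineEnd);

    regex.lastIndex = 0; // keep global/sticky regexes stateless per line
    if (regex.test(text)) matches.push({ line: lineNo, text });

    if (nl === -1) break; // that was the last line
    // Skip the rest of this line so it is tested at most once.
    lineNo += 1;
    lineStart = nl + 1;
    hit = content.indexOf(literal, lineStart);
  }
  return matches;
}
```

Lines without the literal never reach the regex engine, which is where the commit message locates most of the win. The sketch makes no performance claim of its own.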
## Why this approach

From the ripgrep research (attached to PR #791): rg's central perf trick is extracting a literal from each regex and prefiltering with SIMD memchr. V8 doesn't expose SIMD directly but its `String.prototype.indexOf` is compiled to a tight byte-level loop with internal SIMD on x64, functionally equivalent for our use case.

Three of the five techniques in the Loggly regex-perf guide were evaluated:

- **Character classes over `.*`**: `DSN_PATTERN` already uses `[a-z0-9]+`, no change needed.
- **Alternation order**: `DSN_PATTERN`'s `(?:\.[a-z]+|:[0-9]+)` is already correctly ordered (`.` more common than `:` in DSN hosts); swapping regressed perf by noise.
- **Anchors/word boundaries**: adding `\b` to `DSN_PATTERN` *regressed* perf 2.8× on our workload. V8's existing fast character-mismatch rejection on the first byte outperforms the boundary check overhead.

The remaining gap with rg is now primarily orchestration overhead (async/await, `mapFilesConcurrent`, walker correctness features) rather than regex speed. A worker-pool exploration may follow.

## Test plan

- [x] `bunx tsc --noEmit`: clean
- [x] `bun run lint`: clean (1 pre-existing warning in `src/lib/formatters/markdown.ts` unrelated to this PR)
- [x] `bun test --timeout 15000 test/lib test/commands test/types`: **5610 pass, 0 fail** (+58 new)
- [x] `bun test test/isolated`: 138 pass, 0 fail
- [x] `bun run bench --size large --runs 5`: all scan ops at or below previous baseline
- [x] Manually verified semantic parity: `collectGrep` returns identical `GrepMatch[]` on prefilter vs whole-buffer paths for patterns where the prefilter fires

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
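The line-counting change mentioned in this commit message (a `charCodeAt` walk replaced by `indexOf("\n", cursor)` hops) can be illustrated with the two equivalent loops below. Both function names are hypothetical, and the sketch only shows that the two forms count the same thing; it does not reproduce the commit's measurements.

```typescript
// Hypothetical sketch: two ways to count newlines in text[0, end).

/** Char-by-char walk: inspects every code unit before `end`. */
function countNewlinesByCharWalk(text: string, end: number): number {
  let count = 0;
  for (let i = 0; i < end && i < text.length; i++) {
    if (text.charCodeAt(i) === 10) count += 1; // 10 is "\n"
  }
  return count;
}

/** indexOf hops: jumps from one newline to the next. */
function countNewlinesByIndexOf(text: string, end: number): number {
  let count = 0;
  let cursor = text.indexOf("\n", 0);
  while (cursor !== -1 && cursor < end) {
    count += 1;
    cursor = text.indexOf("\n", cursor + 1);
  }
  return count;
}
```

A 1-based line number for a match at offset `end` is then the newline count plus one.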
## Summary
Follow-up to PR #791. Replaces the init-wizard's `rg → git grep → fs walk` fallback chain in `src/lib/init/tools/grep.ts` + `glob.ts` with thin adapters over the pure-TS `collectGrep` / `collectGlob` helpers from `src/lib/scan/`.

The Mastra wire contract is preserved byte-identical on all existing fields. Two optional fields (`caseInsensitive`, `multiline`) are added to `GrepSearch` for future-proofing; no current server invocation sets them.

## What changed
### Adapters (net −386 LOC of production code)

- `src/lib/init/tools/grep.ts`
- `src/lib/init/tools/glob.ts`
- `src/lib/init/tools/search-utils.ts`
- `src/lib/init/types.ts`

The adapters now just:

1. […] `search.path` via `safePath`.
2. Call `collectGrep` / `collectGlob` with the wire-level constants (`maxResults`, `maxLineLength`) plumbed through.
3. Strip `absolutePath` from each `GrepMatch`; the Mastra wire has never included it.
4. Catch `ValidationError` from bad regex so a single bad pattern surfaces as an empty per-search row rather than aborting the whole payload.

### New optional `GrepSearch` fields

No current Mastra server invocation sets these. Adding them now means the server can start sending them without a CLI update. The underlying scan engine natively supports both.
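The adapter behavior described above (strip `absolutePath`, turn a `ValidationError` into an empty per-search row) might look roughly like the sketch below. Everything here is a stand-in: the `ScanMatch` fields other than `absolutePath`, the `CollectGrep` signature, and the local `ValidationError` class are assumptions for illustration, and the real helpers may well be async. Only the `{ pattern, matches, truncated }` row shape and the two optional `GrepSearch` fields come from this PR description.

```typescript
// Hypothetical stand-ins; the real types live in src/lib/scan/ and src/lib/init/.
class ValidationError extends Error {}

type ScanMatch = { path: string; absolutePath: string; lineNum: number; line: string };
type WireMatch = Omit<ScanMatch, "absolutePath">;

type GrepSearch = {
  pattern: string;
  path?: string;
  caseInsensitive?: boolean; // optional field added by this PR
  multiline?: boolean; // optional field added by this PR
};

type WireRow = { pattern: string; matches: WireMatch[]; truncated: boolean };

// Assumed shape of the scan helper; synchronous here to keep the sketch short.
type CollectGrep = (search: GrepSearch) => { matches: ScanMatch[]; truncated: boolean };

function runGrepSearch(search: GrepSearch, collectGrep: CollectGrep): WireRow {
  try {
    const { matches, truncated } = collectGrep(search);
    return {
      pattern: search.pattern,
      // Drop absolutePath so it never reaches the wire.
      matches: matches.map(({ absolutePath: _dropped, ...wire }) => wire),
      truncated,
    };
  } catch (err) {
    // A bad regex yields an empty row instead of aborting the payload.
    if (err instanceof ValidationError) {
      return { pattern: search.pattern, matches: [], truncated: false };
    }
    throw err; // anything else is a genuine failure
  }
}
```

Catching only `ValidationError` keeps unexpected failures visible while letting the remaining searches in the payload proceed.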
### Tests

- Drop 3 obsolete subprocess-fallback tests that shadowed `rg` in `PATH` to force the fallback chain. With the pure-TS adapter there's no fallback chain to exercise; the tests had become tautological (they passed whether or not they actually force-exercised any specific code path, because the pure-TS implementation doesn't care about `PATH`).
- Drop the `writeExecutable`, `setPath`, `helperBinDir`, `savedPath` scaffolding, only used by the deleted tests.
- Add 3 adapter-specific tests:
  - `grep result matches MUST NOT include absolutePath`: pins the strip behavior so `absolutePath` never leaks to the Mastra agent.
  - `grep bad regex yields empty matches without crashing the payload`: documents the `ValidationError` catch contract so a regression is caught here.
  - `grep caseInsensitive flag enables case-insensitive matching`: end-to-end coverage for the new wire field.

## Behavior changes (intentional, only affects users without `rg`/`git`)

Before PR #791, the init-wizard fs-walk fallback was naive: no `.gitignore` handling, narrow skip list, no binary detection. Users with `rg` or `git` installed never took this path. After this PR, every user gets rg-like behavior via the pure-TS scanner:

- Nested `.gitignore` honored (cumulative semantics, matching git + rg).
- Wider skip list: `.next`, `target`, `vendor`, `coverage`, `.cache`, `.turbo` in addition to the old skip set.
- Binary files filtered (`.png`, `.zip`, etc.).

Users with `rg` installed see zero change. Users without it get the same rg-like behavior instead of the old naive fs walk.

## Benchmarks
TL;DR: For the init wizard's actual workload (`maxResults: 100` on patterns with matches) the new adapter is dramatically faster, but that's entirely because early-exit + concurrent fan-out reach 100 matches by scanning fewer files. For exhaustive scans the new adapter is slower than the old fs-walk fallback (correctness tax: nested `.gitignore`, binary sniff, extension filter). This is an acceptable trade for the init wizard but worth flagging for other potential `collectGrep` consumers.

- Fixture: 10k files (~80 MB), 3 monorepo packages, mix of text + binary.
- Config: `maxLineLength: 2000`, 5 runs (after 2 warmup).
- Machine: linux/x64, 4 CPUs, Bun 1.3.11. Pre-warmed OS page cache.
Four implementations:

- `rg`: gold standard, ripgrep 14.1.0 subprocess
- OLD fs-walk: no `.gitignore`, narrow skip list, no binary detection
- NEW: `collectGrep` adapter
- […]

### Apples-to-apples: uncapped, no early-exit
Every impl scans the whole tree and returns all matches. This isolates per-file throughput — no early-exit can skew the numbers.
- `import.*from` (215k matches)
- `SENTRY_DSN` (677 matches)
- `NONEXISTENT_…`

NEW is ~1.6× slower than OLD fs-walk on the many-matches case. The extra cost is per-match emission overhead from the whole-buffer `regex.exec` loop (vs OLD's simpler `split("\n") + regex.test`). Rare/zero-match cases are at parity.

Neither OLD nor NEW can match `rg`'s raw speed; ripgrep's SIMD + Rust gives it ~4-8× headroom no pure-TS implementation will close. We're not competing with `rg` on throughput; we're replacing a subprocess dependency with an in-process one so users without `rg` installed get rg-like behavior (nested gitignore, binary skipping, expanded skip list).

### Realistic init wizard: capped at 100
The Mastra agent always caps at 100 matches. Early-exit kicks in. The `files` column counts how many files each impl actually read before stopping.

- `import.*from`
- `SENTRY_DSN`

The `files` column is the real story: `import.*from` has 215k matches spread across 9k files. The first 9 files the NEW walker yields already contain 100+ matches each; workers fan out concurrently, whichever finishes first wins, early-exit fires. OLD fs-walk walks serially and needs 55 files. `rg` doesn't stop at 100 unless told (which would add subprocess-pipe plumbing the old adapter didn't do). `SENTRY_DSN` is rare enough that both impls scan most of the tree; NEW and OLD do comparable work.

The 160× headline from the earlier version of this PR body was misleading: it implied raw grep speed. The honest claim is "NEW reaches 100 matches in 2.3 ms because early-exit + concurrent fan-out beats serial walk to the cap; per-file regex work is ~1.6× slower than OLD fs-walk."
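To make the "files read before stopping" effect concrete, here is a minimal sketch of a capped scan that reports how many files it touched. It covers only the early-exit half of the story: the concurrent fan-out is deliberately left out, files are scanned serially from an in-memory array, and `grepWithCap`, `FileEntry`, and `CappedResult` are hypothetical names rather than anything from `src/lib/scan/`.

```typescript
// Hypothetical sketch of early-exit at a match cap; serial, in-memory only.
type FileEntry = { path: string; content: string };
type CappedResult = { matches: string[]; filesRead: number; truncated: boolean };

function grepWithCap(
  files: FileEntry[],
  regex: RegExp,
  maxResults: number
): CappedResult {
  const matches: string[] = [];
  let filesRead = 0;
  for (const file of files) {
    filesRead += 1;
    for (const line of file.content.split("\n")) {
      regex.lastIndex = 0; // keep global regexes stateless per line
      if (!regex.test(line)) continue;
      // One match beyond the cap proves the result set is truncated.
      if (matches.length === maxResults) {
        return { matches, filesRead, truncated: true };
      }
      matches.push(`${file.path}: ${line}`);
    }
  }
  return { matches, filesRead, truncated: false };
}
```

With a dense pattern the cap is hit after very few files, so `filesRead` stays small; with a rare pattern the loop walks most of the input, which is the contrast the paragraph above draws between `import.*from` and `SENTRY_DSN`.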
## Why ship this?
Init wizard grep has exactly one caller: the Mastra server, which always sends `maxResults: 100` against patterns it expects to find. Early-exit always fires. For the one workload that matters, the NEW adapter is 5× faster than the OLD fs-walk (2.3 ms vs 12 ms) and completely sidesteps the subprocess-dependency problem:

- Users without `rg` no longer fall back to a naive walk that ignores gitignore and scans binaries.
- `GrepSearch` gains `caseInsensitive` + `multiline`: passthrough to the scan engine's native support.

The correctness tax (nested `.gitignore` respected, wider skip list, binaries filtered) is paid once per scan regardless of match density. On the init wizard's capped workload it's noise; on exhaustive scans it's a ~270 ms cost on 10k files. Acceptable for the use case; callers who need exhaustive speed should pick a different tool.

## Test plan
- `bunx tsc --noEmit`: clean
- `bun run lint`: clean (1 pre-existing warning in `src/lib/formatters/markdown.ts:281`, not touched by this PR)
- `bun test --timeout 15000 test/lib test/commands test/types`: 5507 pass, 0 fail
- `bun test test/isolated`: 138 pass
- `bun test test/lib/init/tools/`: 30 pass (wire contract preserved end-to-end via `executeTool`)

## What this PR does NOT change

- `GrepPayload` / `GlobPayload`: unchanged structurally, only `GrepSearch` gains optional fields
- `src/lib/init/tools/registry.ts`: tool dispatch unchanged
- `src/lib/init/tools/shared.ts`: `safePath` + `validateToolSandbox` unchanged
- `{ ok: true, data: { results: [{pattern, matches, truncated}] } }`: identical

🤖 Generated with Claude Code
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>