Skip to content

refactor(init): migrate grep/glob tools to src/lib/scan/#797

Open
BYK wants to merge 1 commit intomainfrom
byk/feat/ripgrep-init-wizard
Open

refactor(init): migrate grep/glob tools to src/lib/scan/#797
BYK wants to merge 1 commit intomainfrom
byk/feat/ripgrep-init-wizard

Conversation

@BYK
Copy link
Copy Markdown
Member

@BYK BYK commented Apr 21, 2026

Summary

Follow-up to PR #791. Replaces the init-wizard's rg → git grep → fs walk fallback chain in src/lib/init/tools/grep.ts + glob.ts with thin adapters over the pure-TS collectGrep / collectGlob helpers from src/lib/scan/.

The Mastra wire contract is preserved byte-identical on all existing fields. Two optional fields (caseInsensitive, multiline) are added to GrepSearch for future-proofing — no current server invocation sets them.

What changed

Adapters (net −386 LOC of production code)

File Before After
src/lib/init/tools/grep.ts 299 LOC 114 LOC
src/lib/init/tools/glob.ts 145 LOC 73 LOC
src/lib/init/tools/search-utils.ts 146 LOC deleted
src/lib/init/types.ts - +17 LOC (two optional fields)

The adapters now just:

  1. Sandbox the user-supplied search.path via safePath.
  2. Forward each search/pattern to collectGrep / collectGlob with the wire-level constants (maxResults, maxLineLength) plumbed through.
  3. Strip absolutePath from each GrepMatch — the Mastra wire has never included it.
  4. Catch ValidationError from bad regex so a single bad pattern surfaces as an empty per-search row rather than aborting the whole payload.

New optional GrepSearch fields

export type GrepSearch = {
  pattern: string;
  path?: string;
  include?: string;
  caseInsensitive?: boolean;  // NEW
  multiline?: boolean;         // NEW
};

No current Mastra server invocation sets these. Adding them now means the server can start sending them without a CLI update. The underlying scan engine natively supports both.

Tests

  • Deleted 3 obsolete tests that shadowed rg in PATH to force the fallback chain. With the pure-TS adapter there's no fallback chain to exercise; the tests had become tautological (they passed whether or not they actually force-exercised any specific code path, because the pure-TS implementation doesn't care about PATH).
  • Deleted scaffolding: writeExecutable, setPath, helperBinDir, savedPath — only used by the deleted tests.
  • Kept 3 pre-existing tests unchanged: include filters, multi-pattern glob, sandbox rejection.
  • Added 3 new adapter-specific tests:
    • grep result matches MUST NOT include absolutePath — pins the strip behavior so absolutePath never leaks to the Mastra agent.
    • grep bad regex yields empty matches without crashing the payload — documents the ValidationError catch contract so a regression is caught here.
    • grep caseInsensitive flag enables case-insensitive matching — end-to-end coverage for the new wire field.

Behavior changes (intentional, only affects users without rg/git)

Before PR #791, the init-wizard fs-walk fallback was naive: no .gitignore handling, narrow skip list, no binary detection. Users with rg or git installed never took this path. After this PR, every user gets rg-like behavior via the pure-TS scanner:

  • Nested .gitignore honored (cumulative semantics, matching git + rg).
  • Wider skip-dir list — scan excludes .next, target, vendor, coverage, .cache, .turbo in addition to the old skip set.
  • Binary files filtered — scan runs an 8 KB NUL-byte sniff before grep-ing (no more regex matches inside .png, .zip, etc.).

Users with rg installed see zero change. Users without it get the same rg-like behavior instead of the old naive fs walk.

Benchmarks

TL;DR: For the init wizard's actual workload (maxResults: 100 on patterns with matches) the new adapter is dramatically faster — but that's entirely because early-exit + concurrent fan-out reach 100 matches by scanning fewer files. For exhaustive scans the new adapter is slower than the old fs-walk fallback (correctness tax: nested .gitignore, binary sniff, extension filter). This is an acceptable trade for the init wizard but worth flagging for other potential collectGrep consumers.

Fixture: 10k files (~80 MB), 3 monorepo packages, mix of text + binary.
Config: maxLineLength: 2000, 5 runs (after 2 warmup).
Machine: linux/x64, 4 CPUs, Bun 1.3.11. Pre-warmed OS page cache.

Four implementations:

  • (A) raw rg — gold standard, ripgrep 14.1.0 subprocess
  • (B) OLD init adapter (rg path) — rg + output parsing (omitted below since it tracks (A) closely)
  • (C) OLD init adapter (fs-walk fallback) — naive walk, no .gitignore, narrow skip list, no binary detection
  • (D) NEW init adapter — this PR's pure-TS collectGrep adapter

Apples-to-apples: uncapped, no early-exit

Every impl scans the whole tree and returns all matches. This isolates per-file throughput — no early-exit can skew the numbers.

Workload (A) rg (C) OLD fs-walk (D) NEW NEW vs OLD
import.*from (215k matches) 372 ms 830 ms 1351 ms 1.63× slower
SENTRY_DSN (677 matches) 85 ms 628 ms 651 ms 1.04× (parity)
no matches (NONEXISTENT_…) 84 ms 641 ms 633 ms 0.99× (parity)

NEW is ~1.6× slower than OLD fs-walk on the many-matches case. The extra cost is per-match emission overhead from the whole-buffer regex.exec loop (vs OLD's simpler split("\n") + regex.test). Rare/zero-match cases are at parity.

Neither OLD nor NEW can match rg's raw speed; ripgrep's SIMD + Rust gives it ~4-8× headroom no pure-TS implementation will close. We're not competing with rg on throughput — we're replacing a subprocess dependency with an in-process one so users without rg installed get rg-like behavior (nested gitignore, binary skipping, expanded skip list).

Realistic init wizard: capped at 100

The Mastra agent always caps at 100 matches. Early-exit kicks in. The files column counts how many files each impl actually read before stopping.

Workload (A) rg (C) OLD fs-walk (D) NEW Files NEW / OLD
import.*from 383 ms 12 ms (55 files) 2.3 ms (9 files) 9 vs 55
SENTRY_DSN 86 ms 86 ms (1303 files) 102 ms (1389 files) 1389 vs 1303
no matches 76 ms 642 ms (all 10k) 640 ms (all 9k) full scan

The files column is the real story:

  • import.*from has 215k matches spread across 9k files. The first 9 files the NEW walker yields already contain 100+ matches each — workers fan out concurrently, whichever finishes first wins, early-exit fires. OLD fs-walk walks serially and needs 55 files. rg doesn't stop at 100 unless told (which would add subprocess-pipe plumbing the old adapter didn't do).
  • SENTRY_DSN is rare enough that both impls scan most of the tree; NEW and OLD do comparable work.
  • No matches: every impl must touch every file — no early-exit possible.

The 160× headline from the earlier version of this PR body was misleading: it implied raw grep speed. The honest claim is "NEW reaches 100 matches in 2.3ms because early-exit + concurrent fan-out beats serial walk to the cap; per-file regex work is ~1.6× slower than OLD fs-walk."

Why ship this?

Init wizard grep has exactly one caller: the Mastra server, which always sends maxResults: 100 against patterns it expects to find. Early-exit always fires. For the one workload that matters, the NEW adapter is 5× faster than the OLD fs-walk (2.3 ms vs 12 ms) and completely sidesteps the subprocess-dependency problem:

  • Users without rg no longer fall back to a naive walk that ignores gitignore and scans binaries.
  • No spawning, no stderr draining, no exit-code-decoding, no parse-out-pipe-delimited-output.
  • GrepSearch gains caseInsensitive + multiline — passthrough to the scan engine's native support.

The correctness tax (nested .gitignore respected, wider skip list, binaries filtered) is paid once per scan regardless of match density. On the init wizard's capped workload it's noise; on exhaustive scans it's a ~270 ms cost on 10k files. Acceptable for the use case; callers who need exhaustive speed should pick a different tool.

Test plan

  • bunx tsc --noEmit — clean
  • bun run lint — clean (1 pre-existing warning in src/lib/formatters/markdown.ts:281, not touched by this PR)
  • bun test --timeout 15000 test/lib test/commands test/types5507 pass, 0 fail
  • bun test test/isolated — 138 pass
  • bun test test/lib/init/tools/ — 30 pass (wire contract preserved end-to-end via executeTool)

What this PR does NOT change

  • GrepPayload / GlobPayload — unchanged structurally, only GrepSearch gains optional fields
  • src/lib/init/tools/registry.ts — tool dispatch unchanged
  • src/lib/init/tools/shared.tssafePath + validateToolSandbox unchanged
  • Response shape — { ok: true, data: { results: [{pattern, matches, truncated}] } } identical

🤖 Generated with Claude Code

Co-authored-by: Claude Opus 4.7 (1M context) noreply@anthropic.com

Replace the rg → git grep → fs walk fallback chain in the init-wizard
grep and glob tools with direct calls to the pure-TS `collectGrep` /
`collectGlob` helpers from `src/lib/scan/` (shipped in PR #791). The
Mastra wire contract is preserved verbatim on all existing fields.

## Changes

- **`src/lib/init/tools/grep.ts`**: 299 → 114 LOC. Drops
  `rgGrepSearch`, `gitGrepSearch`, `fsGrepSearch`, `parseRgGrepOutput`,
  `parseGrepOutput`, `findRegexMatches`, `readSearchableFile`,
  `truncateMatchLine`, `limitMatches`, `compilePattern`. Replaces with
  a thin adapter that forwards each search to `collectGrep`, strips
  the `absolutePath` field (not part of the wire contract), and
  catches `ValidationError` from bad regexes so a single bad pattern
  doesn't abort the whole payload.

- **`src/lib/init/tools/glob.ts`**: 145 → 73 LOC. Drops
  `rgGlobSearch`, `gitLsFiles`, `fsGlobSearch`. Replaces with a thin
  adapter that calls `collectGlob` per pattern (preserving
  per-pattern attribution, which `collectGlob`'s unioning would
  lose).

- **`src/lib/init/tools/search-utils.ts`**: DELETED (146 LOC). Zero
  callers after the grep/glob rewrites.

- **`src/lib/init/types.ts::GrepSearch`**: two optional fields added:
  `caseInsensitive?: boolean` and `multiline?: boolean`. No current
  Mastra server invocation sets them; both default to what the scan
  engine defaults to (case-sensitive, line-boundary anchoring — rg
  semantics). Future-proofing.

- **`src/lib/scan/glob.ts`**: updated a stale doc comment that
  referenced the deleted `search-utils.ts::matchGlob`.

## Test changes

- Drop 3 obsolete subprocess-fallback tests that shadowed `rg` in
  `PATH` to force the fallback chain. With the pure-TS adapter there's
  no fallback chain to exercise and the tests had become tautological.
- Drop the `writeExecutable`, `setPath`, `helperBinDir`, `savedPath`
  scaffolding those tests depended on.
- Keep 3 pre-existing wire-behavior/sandbox tests unchanged.
- Add 3 adapter-specific tests:
  - `grep result matches MUST NOT include absolutePath` — pins the
    adapter's strip behavior (scan returns it; Mastra must not see
    it).
  - `grep bad regex yields empty matches without crashing the payload`
    — documents the `ValidationError` catch contract.
  - `grep caseInsensitive flag enables case-insensitive matching` —
    end-to-end coverage of the new wire field.

## Behavior changes (intentional, from scan module defaults)

Three behavior shifts land for the pre-PR-791 fs-walk fallback
(users without `rg` or `git` installed):

- **Nested `.gitignore` now honored.** Old fs-walk fallback ignored
  gitignore entirely; scan respects cumulative gitignore semantics
  (matches what `rg` / git grep already did).
- **Wider skip-dir list.** Scan skips `.next`, `target`, `vendor`,
  `coverage`, `.cache`, `.turbo` in addition to the old skip set —
  matches rg's built-in skips.
- **Binary files filtered.** Scan runs an 8 KB NUL-byte sniff before
  emitting a file to grep; binary matches (e.g. inside a `.png`) no
  longer appear. Again, matches `rg`'s default.

Users with `rg` installed see zero change — they never took the
fallback path anyway. Users without rg/git get rg-like behavior
instead of the old naive fs walk.

## Net LOC

Before: 299 + 145 + 146 = 590 LOC across three files.
After: 114 + 73 + 0 (+17 on types.ts) = 204 LOC across two files.
**Net: −386 LOC of production code.**

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing warning in `markdown.ts`)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5507 pass, 0 fail** (+1 net: −3 obsolete, +4 new adapter tests, +absolutePath regression)
- [x] `bun test test/isolated` — 138 pass
- [x] Manual: `bun test test/lib/init/tools/` — 30 pass, wire contract preserved end-to-end via `executeTool`

Follow-up to PR #791.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 21, 2026

Semver Impact of This PR

🟢 Patch (bug fixes)

📋 Changelog Preview

This is how your changes will appear in the changelog.
Entries from this PR are highlighted with a left border (blockquote style).


New Features ✨

Bug Fixes 🐛

Documentation 📚

  • Fix auth token precedence, update stale architecture tree, and documentation audit report by cursor in #783

Internal Changes 🔧

Init

  • Migrate grep/glob tools to src/lib/scan/ by BYK in #797
  • Trim deprecated --features help entries by MathurAditya724 in #781

Other

  • (issue) Skip redundant API lookups via project+issue-org caches by BYK in #794
  • Regenerate docs by github-actions[bot] in 58a84035

🤖 This preview updates automatically when you update the PR.

@github-actions
Copy link
Copy Markdown
Contributor

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://cli.sentry.dev/_preview/pr-797/

Built to branch gh-pages at 2026-04-21 11:47 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@BYK BYK requested a review from betegon April 21, 2026 11:48
@github-actions
Copy link
Copy Markdown
Contributor

Codecov Results 📊

138 passed | Total: 138 | Pass Rate: 100% | Execution Time: 0ms

📊 Comparison with Base Branch

Metric Change
Total Tests
Passed Tests
Failed Tests
Skipped Tests

✨ No test changes detected

All tests are passing successfully.

✅ Patch coverage is 100.00%. Project has 1738 uncovered lines.
❌ Project coverage is 95.63%. Comparing base (base) to head (head).

Coverage diff
@@            Coverage Diff             @@
##          main       #PR       +/-##
==========================================
- Coverage    95.66%    95.63%    -0.03%
==========================================
  Files          280       279        -1
  Lines        40060     39794      -266
  Branches         0         0         —
==========================================
+ Hits         38322     38056      -266
- Misses        1738      1738         —
- Partials         0         0         —

Generated by Codecov Action

Copy link
Copy Markdown
Member

@betegon betegon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for this. works great with init!

BYK added a commit that referenced this pull request Apr 21, 2026
## Summary

Two regex-level optimizations to narrow the perf gap with ripgrep on our
pure-TS `collectGrep`/`grepFiles`. Follow-up to PR #791 and #797.

- **Literal prefilter** — ripgrep-style: extract a literal substring
from the regex source (e.g., `import` from `import.*from`), scan the
buffer with `indexOf` to locate candidate lines, only invoke the regex
engine on lines that contain the literal. V8's `indexOf` is roughly
SIMD-speed; skipping the regex engine on non-candidate lines is where
most of the win comes from.
- **Lazy line counting** — swapped `charCodeAt`-walk for `indexOf("\n",
cursor)` hops. 2-5× faster on the line-counting sub-loop because V8
implements `indexOf` in C++ without per-iteration JS interop.

## Perf impact (synthetic/large, 10k files, Bun 1.3.11, 4-core)

| Op | Before | After | Δ |
|---|---:|---:|---:|
| `scan.grepFiles` (DSN pattern) | 370 ms | **318 ms** | **−14%** |
| `detectAllDsns.cold` | 363 ms | **313 ms** | **−14%** |
| `detectDsn.cold` | 7.73 ms | **5.61 ms** | **−27%** |
| `scanCodeForFirstDsn` | 2.91 ms | **2.13 ms** | **−27%** |
| `scanCodeForDsns` | 342 ms | 333 ms | −3% (noise-equivalent) |
| `import.*from` uncapped (bench) | 1489 ms | **1178 ms** | **−21%** |

The DSN workloads improve because `DSN_PATTERN` extracts `http` as its
literal — most source files don't contain `http` at all, so the
prefilter short-circuits before the regex runs.

No regressions on any benchmark. Pure-literal patterns (e.g.,
`SENTRY_DSN`, `NONEXISTENT_TOKEN_XYZ`) continue through the whole-buffer
path unchanged.

## What changed

### New file: `src/lib/scan/literal-extract.ts` (~300 LOC)

Conservative literal extractor. Walks a regex source looking for the
longest contiguous run of literal bytes that every match must contain.
Bails out safely on top-level alternation, character classes, groups,
lookarounds, quantifiers, and escape classes.

Handles escaped metacharacters intelligently: `Sentry\.init` yields
`Sentry.init` (extracted via literal `\.` → `.`), while `\bfoo\b` yields
`foo` (escape `\b` is an anchor, not a literal `b`).

Exports:
- `extractInnerLiteral(source, flags)` — returns the literal, or null if
no safe extraction possible. Honors `/i` by lowercasing.
- `isPureLiteral(source, flags)` — true when the pattern IS a bare
literal with no metacharacters. Used by the grep pipeline to route
pure-literals to the whole-buffer path (V8's regex engine is
hyper-optimized for pure-literal patterns; the prefilter adds overhead
without benefit there).

### Modified: `src/lib/scan/grep.ts` (~240 LOC changes)

Three-way dispatch in `readAndGrep` based on the extracted literal:

1. **`grepByLiteralPrefilter`** (new) — regex with extractable literal +
`multiline: true`. Uses `indexOf(literal)` to find candidate lines, runs
the regex engine only on those. This is the main perf win.
2. **`grepByWholeBuffer`** — existing path, used for:
   - Pure-literal patterns (V8 handles them optimally)
- Patterns with no extractable literal (complex regex, top-level
alternation)
   - `multiline: false` mode (the fast path requires per-line semantics)

Also: replaced the `charCodeAt`-walk that counted newlines char-by-char
with an `indexOf("\n", cursor)` hop loop. Extracted `buildMatch(ctx,
bounds)` as a shared helper to bundle the match-construction arguments.

### Tests added

- `test/lib/scan/literal-extract.test.ts` — **39 tests** covering the
extractor's rules (escape handling, quantifier drop, alternation bail,
case-insensitive, minimum length).
- `test/lib/scan/grep.test.ts` — **7 new tests** for the prefilter fast
path: correctness vs whole-buffer, escaped-literal extraction,
case-insensitive flag, zero-literal-hit short-circuit, routing of pure
literals to whole-buffer, and alternation routing.

## Why this approach

From the ripgrep research (attached to PR #791): rg's central perf trick
is extracting a literal from each regex and prefiltering with SIMD
memchr. V8 doesn't expose SIMD directly but its
`String.prototype.indexOf` is compiled to a tight byte-level loop with
internal SIMD on x64 — functionally equivalent for our use case.

Three of the five techniques in the Loggly regex-perf guide were
evaluated:
- **Character classes over `.*`** — `DSN_PATTERN` already uses
`[a-z0-9]+`, no change needed.
- **Alternation order** — `DSN_PATTERN`'s `(?:\.[a-z]+|:[0-9]+)` is
already correctly ordered (`.` more common than `:` in DSN hosts);
swapping regressed perf by noise.
- **Anchors/word boundaries** — adding `\b` to `DSN_PATTERN` *regressed*
perf 2.8× on our workload. V8's existing fast character-mismatch
rejection on the first byte outperforms the boundary check overhead.

The remaining gap with rg is now primarily orchestration overhead
(async/await, `mapFilesConcurrent`, walker correctness features) rather
than regex speed. A worker-pool exploration may follow.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing warning in
`src/lib/formatters/markdown.ts` unrelated to this PR)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` —
**5610 pass, 0 fail** (+58 new)
- [x] `bun test test/isolated` — 138 pass, 0 fail
- [x] `bun run bench --size large --runs 5` — all scan ops at or below
previous baseline
- [x] Manually verified semantic parity: `collectGrep` returns identical
`GrepMatch[]` on prefilter vs whole-buffer paths for patterns where the
prefilter fires

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants