refactor(init): migrate grep/glob tools to src/lib/scan/ by BYK · Pull Request #797 · getsentry/cli

BYK · 2026-04-21T11:46:28Z

Summary

Follow-up to PR #791. Replaces the init-wizard's rg → git grep → fs walk fallback chain in src/lib/init/tools/grep.ts + glob.ts with thin adapters over the pure-TS collectGrep / collectGlob helpers from src/lib/scan/.

The Mastra wire contract is preserved byte-identical on all existing fields. Two optional fields (caseInsensitive, multiline) are added to GrepSearch for future-proofing — no current server invocation sets them.

What changed

Adapters (net −386 LOC of production code)

File	Before	After
`src/lib/init/tools/grep.ts`	299 LOC	114 LOC
`src/lib/init/tools/glob.ts`	145 LOC	73 LOC
`src/lib/init/tools/search-utils.ts`	146 LOC	deleted
`src/lib/init/types.ts`	-	+17 LOC (two optional fields)

The adapters now just:

Sandbox the user-supplied search.path via safePath.
Forward each search/pattern to collectGrep / collectGlob with the wire-level constants (maxResults, maxLineLength) plumbed through.
Strip absolutePath from each GrepMatch — the Mastra wire has never included it.
Catch ValidationError from bad regex so a single bad pattern surfaces as an empty per-search row rather than aborting the whole payload.

New optional `GrepSearch` fields

export type GrepSearch = {
  pattern: string;
  path?: string;
  include?: string;
  caseInsensitive?: boolean;  // NEW
  multiline?: boolean;         // NEW
};

No current Mastra server invocation sets these. Adding them now means the server can start sending them without a CLI update. The underlying scan engine natively supports both.

Tests

Deleted 3 obsolete tests that shadowed rg in PATH to force the fallback chain. With the pure-TS adapter there's no fallback chain to exercise; the tests had become tautological (they passed whether or not they actually force-exercised any specific code path, because the pure-TS implementation doesn't care about PATH).
Deleted scaffolding: writeExecutable, setPath, helperBinDir, savedPath — only used by the deleted tests.
Kept 3 pre-existing tests unchanged: include filters, multi-pattern glob, sandbox rejection.
Added 3 new adapter-specific tests:
- grep result matches MUST NOT include absolutePath — pins the strip behavior so absolutePath never leaks to the Mastra agent.
- grep bad regex yields empty matches without crashing the payload — documents the ValidationError catch contract so a regression is caught here.
- grep caseInsensitive flag enables case-insensitive matching — end-to-end coverage for the new wire field.

Behavior changes (intentional, only affects users without `rg`/`git`)

Before PR #791, the init-wizard fs-walk fallback was naive: no .gitignore handling, narrow skip list, no binary detection. Users with rg or git installed never took this path. After this PR, every user gets rg-like behavior via the pure-TS scanner:

Nested .gitignore honored (cumulative semantics, matching git + rg).
Wider skip-dir list — scan excludes .next, target, vendor, coverage, .cache, .turbo in addition to the old skip set.
Binary files filtered — scan runs an 8 KB NUL-byte sniff before grep-ing (no more regex matches inside .png, .zip, etc.).

Users with rg installed see zero change. Users without it get the same rg-like behavior instead of the old naive fs walk.

Benchmarks

TL;DR: For the init wizard's actual workload (maxResults: 100 on patterns with matches) the new adapter is dramatically faster — but that's entirely because early-exit + concurrent fan-out reach 100 matches by scanning fewer files. For exhaustive scans the new adapter is slower than the old fs-walk fallback (correctness tax: nested .gitignore, binary sniff, extension filter). This is an acceptable trade for the init wizard but worth flagging for other potential collectGrep consumers.

Fixture: 10k files (~80 MB), 3 monorepo packages, mix of text + binary.
Config: maxLineLength: 2000, 5 runs (after 2 warmup).
Machine: linux/x64, 4 CPUs, Bun 1.3.11. Pre-warmed OS page cache.

Four implementations:

(A) raw rg — gold standard, ripgrep 14.1.0 subprocess
(B) OLD init adapter (rg path) — rg + output parsing (omitted below since it tracks (A) closely)
(C) OLD init adapter (fs-walk fallback) — naive walk, no .gitignore, narrow skip list, no binary detection
(D) NEW init adapter — this PR's pure-TS collectGrep adapter

Apples-to-apples: uncapped, no early-exit

Every impl scans the whole tree and returns all matches. This isolates per-file throughput — no early-exit can skew the numbers.

Workload	(A) rg	(C) OLD fs-walk	(D) NEW	NEW vs OLD
`import.*from` (215k matches)	372 ms	830 ms	1351 ms	1.63× slower
`SENTRY_DSN` (677 matches)	85 ms	628 ms	651 ms	1.04× (parity)
no matches (`NONEXISTENT_…`)	84 ms	641 ms	633 ms	0.99× (parity)

NEW is ~1.6× slower than OLD fs-walk on the many-matches case. The extra cost is per-match emission overhead from the whole-buffer regex.exec loop (vs OLD's simpler split("\n") + regex.test). Rare/zero-match cases are at parity.

Neither OLD nor NEW can match rg's raw speed; ripgrep's SIMD + Rust gives it ~4-8× headroom no pure-TS implementation will close. We're not competing with rg on throughput — we're replacing a subprocess dependency with an in-process one so users without rg installed get rg-like behavior (nested gitignore, binary skipping, expanded skip list).

Realistic init wizard: capped at 100

The Mastra agent always caps at 100 matches. Early-exit kicks in. The files column counts how many files each impl actually read before stopping.

Workload	(A) rg	(C) OLD fs-walk	(D) NEW	Files NEW / OLD
`import.*from`	383 ms	12 ms (55 files)	2.3 ms (9 files)	9 vs 55
`SENTRY_DSN`	86 ms	86 ms (1303 files)	102 ms (1389 files)	1389 vs 1303
no matches	76 ms	642 ms (all 10k)	640 ms (all 9k)	full scan

The files column is the real story:

import.*from has 215k matches spread across 9k files. The first 9 files the NEW walker yields already contain 100+ matches each — workers fan out concurrently, whichever finishes first wins, early-exit fires. OLD fs-walk walks serially and needs 55 files. rg doesn't stop at 100 unless told (which would add subprocess-pipe plumbing the old adapter didn't do).
SENTRY_DSN is rare enough that both impls scan most of the tree; NEW and OLD do comparable work.
No matches: every impl must touch every file — no early-exit possible.

The 160× headline from the earlier version of this PR body was misleading: it implied raw grep speed. The honest claim is "NEW reaches 100 matches in 2.3ms because early-exit + concurrent fan-out beats serial walk to the cap; per-file regex work is ~1.6× slower than OLD fs-walk."

Why ship this?

Init wizard grep has exactly one caller: the Mastra server, which always sends maxResults: 100 against patterns it expects to find. Early-exit always fires. For the one workload that matters, the NEW adapter is 5× faster than the OLD fs-walk (2.3 ms vs 12 ms) and completely sidesteps the subprocess-dependency problem:

Users without rg no longer fall back to a naive walk that ignores gitignore and scans binaries.
No spawning, no stderr draining, no exit-code-decoding, no parse-out-pipe-delimited-output.
GrepSearch gains caseInsensitive + multiline — passthrough to the scan engine's native support.

The correctness tax (nested .gitignore respected, wider skip list, binaries filtered) is paid once per scan regardless of match density. On the init wizard's capped workload it's noise; on exhaustive scans it's a ~270 ms cost on 10k files. Acceptable for the use case; callers who need exhaustive speed should pick a different tool.

Test plan

bunx tsc --noEmit — clean
bun run lint — clean (1 pre-existing warning in src/lib/formatters/markdown.ts:281, not touched by this PR)
bun test --timeout 15000 test/lib test/commands test/types — 5507 pass, 0 fail
bun test test/isolated — 138 pass
bun test test/lib/init/tools/ — 30 pass (wire contract preserved end-to-end via executeTool)

What this PR does NOT change

GrepPayload / GlobPayload — unchanged structurally, only GrepSearch gains optional fields
src/lib/init/tools/registry.ts — tool dispatch unchanged
src/lib/init/tools/shared.ts — safePath + validateToolSandbox unchanged
Response shape — { ok: true, data: { results: [{pattern, matches, truncated}] } } identical

🤖 Generated with Claude Code

Co-authored-by: Claude Opus 4.7 (1M context) noreply@anthropic.com

Replace the rg → git grep → fs walk fallback chain in the init-wizard grep and glob tools with direct calls to the pure-TS `collectGrep` / `collectGlob` helpers from `src/lib/scan/` (shipped in PR #791). The Mastra wire contract is preserved verbatim on all existing fields. ## Changes - **`src/lib/init/tools/grep.ts`**: 299 → 114 LOC. Drops `rgGrepSearch`, `gitGrepSearch`, `fsGrepSearch`, `parseRgGrepOutput`, `parseGrepOutput`, `findRegexMatches`, `readSearchableFile`, `truncateMatchLine`, `limitMatches`, `compilePattern`. Replaces with a thin adapter that forwards each search to `collectGrep`, strips the `absolutePath` field (not part of the wire contract), and catches `ValidationError` from bad regexes so a single bad pattern doesn't abort the whole payload. - **`src/lib/init/tools/glob.ts`**: 145 → 73 LOC. Drops `rgGlobSearch`, `gitLsFiles`, `fsGlobSearch`. Replaces with a thin adapter that calls `collectGlob` per pattern (preserving per-pattern attribution, which `collectGlob`'s unioning would lose). - **`src/lib/init/tools/search-utils.ts`**: DELETED (146 LOC). Zero callers after the grep/glob rewrites. - **`src/lib/init/types.ts::GrepSearch`**: two optional fields added: `caseInsensitive?: boolean` and `multiline?: boolean`. No current Mastra server invocation sets them; both default to what the scan engine defaults to (case-sensitive, line-boundary anchoring — rg semantics). Future-proofing. - **`src/lib/scan/glob.ts`**: updated a stale doc comment that referenced the deleted `search-utils.ts::matchGlob`. ## Test changes - Drop 3 obsolete subprocess-fallback tests that shadowed `rg` in `PATH` to force the fallback chain. With the pure-TS adapter there's no fallback chain to exercise and the tests had become tautological. - Drop the `writeExecutable`, `setPath`, `helperBinDir`, `savedPath` scaffolding those tests depended on. - Keep 3 pre-existing wire-behavior/sandbox tests unchanged. - Add 3 adapter-specific tests: - `grep result matches MUST NOT include absolutePath` — pins the adapter's strip behavior (scan returns it; Mastra must not see it). - `grep bad regex yields empty matches without crashing the payload` — documents the `ValidationError` catch contract. - `grep caseInsensitive flag enables case-insensitive matching` — end-to-end coverage of the new wire field. ## Behavior changes (intentional, from scan module defaults) Three behavior shifts land for the pre-PR-791 fs-walk fallback (users without `rg` or `git` installed): - **Nested `.gitignore` now honored.** Old fs-walk fallback ignored gitignore entirely; scan respects cumulative gitignore semantics (matches what `rg` / git grep already did). - **Wider skip-dir list.** Scan skips `.next`, `target`, `vendor`, `coverage`, `.cache`, `.turbo` in addition to the old skip set — matches rg's built-in skips. - **Binary files filtered.** Scan runs an 8 KB NUL-byte sniff before emitting a file to grep; binary matches (e.g. inside a `.png`) no longer appear. Again, matches `rg`'s default. Users with `rg` installed see zero change — they never took the fallback path anyway. Users without rg/git get rg-like behavior instead of the old naive fs walk. ## Net LOC Before: 299 + 145 + 146 = 590 LOC across three files. After: 114 + 73 + 0 (+17 on types.ts) = 204 LOC across two files. **Net: −386 LOC of production code.** ## Test plan - [x] `bunx tsc --noEmit` — clean - [x] `bun run lint` — clean (1 pre-existing warning in `markdown.ts`) - [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5507 pass, 0 fail** (+1 net: −3 obsolete, +4 new adapter tests, +absolutePath regression) - [x] `bun test test/isolated` — 138 pass - [x] Manual: `bun test test/lib/init/tools/` — 30 pass, wire contract preserved end-to-end via `executeTool` Follow-up to PR #791.

github-actions · 2026-04-21T11:46:43Z

Semver Impact of This PR

🟢 Patch (bug fixes)

📋 Changelog Preview

This is how your changes will appear in the changelog.
Entries from this PR are highlighted with a left border (blockquote style).

New Features ✨

(auth) Hint when env token is shadowed by stored OAuth (CLI UX: cache invalidation, env var discoverability, and error message improvements #785) by BYK in #790
(help) Add Environment Variables section to branded help (CLI UX: cache invalidation, env var discoverability, and error message improvements #785) by BYK in #786
(issue) Add resolve, unresolve (reopen), and merge commands by BYK in #778
(scan) Pure-TS ripgrep-compatible scanner + DSN migration + perf overhaul by BYK in #791

Bug Fixes 🐛

(arg-parsing) Throw ValidationError (not raw Error) for format errors by BYK in #793
(error-reporting) Fall back to message prefix for ValidationError without field by BYK in #776
(errors) Surface real API errors and granular scopes in 403s (CLI UX: cache invalidation, env var discoverability, and error message improvements #785) by BYK in #789
(hex-id) Auto-recover malformed hex IDs in view commands (CLI-16G) by BYK in #777
(init) Force process exit after wizard completes by betegon in #782
(ux) Rate-limit update banner and silence /api/0/ auto-fix (CLI UX: cache invalidation, env var discoverability, and error message improvements #785) by BYK in #787

Documentation 📚

Fix auth token precedence, update stale architecture tree, and documentation audit report by cursor in #783

Internal Changes 🔧

Init

Migrate grep/glob tools to src/lib/scan/ by BYK in #797

Trim deprecated --features help entries by MathurAditya724 in #781

Other

(issue) Skip redundant API lookups via project+issue-org caches by BYK in #794
Regenerate docs by github-actions[bot] in 58a84035

_{🤖 This preview updates automatically when you update the PR.}

github-actions · 2026-04-21T11:47:05Z

PR Preview Action v1.8.1
🚀 View preview at https://cli.sentry.dev/_preview/pr-797/
Built to branch `gh-pages` at 2026-04-21 11:47 UTC. Preview will be ready when the GitHub Pages deployment is complete.

github-actions · 2026-04-21T11:48:45Z

Codecov Results 📊

✅ 138 passed | Total: 138 | Pass Rate: 100% | Execution Time: 0ms

📊 Comparison with Base Branch

Metric	Change
Total Tests	—
Passed Tests	—
Failed Tests	—
Skipped Tests	—

✨ No test changes detected

All tests are passing successfully.

✅ Patch coverage is 100.00%. Project has 1738 uncovered lines.
❌ Project coverage is 95.63%. Comparing base (base) to head (head).

Coverage diff

@@            Coverage Diff             @@
##          main       #PR       +/-##
==========================================
- Coverage    95.66%    95.63%    -0.03%
==========================================
  Files          280       279        -1
  Lines        40060     39794      -266
  Branches         0         0         —
==========================================
+ Hits         38322     38056      -266
- Misses        1738      1738         —
- Partials         0         0         —

Generated by Codecov Action

betegon

thanks for this. works great with init!

## Summary Two regex-level optimizations to narrow the perf gap with ripgrep on our pure-TS `collectGrep`/`grepFiles`. Follow-up to PR #791 and #797. - **Literal prefilter** — ripgrep-style: extract a literal substring from the regex source (e.g., `import` from `import.*from`), scan the buffer with `indexOf` to locate candidate lines, only invoke the regex engine on lines that contain the literal. V8's `indexOf` is roughly SIMD-speed; skipping the regex engine on non-candidate lines is where most of the win comes from. - **Lazy line counting** — swapped `charCodeAt`-walk for `indexOf("\n", cursor)` hops. 2-5× faster on the line-counting sub-loop because V8 implements `indexOf` in C++ without per-iteration JS interop. ## Perf impact (synthetic/large, 10k files, Bun 1.3.11, 4-core) | Op | Before | After | Δ | |---|---:|---:|---:| | `scan.grepFiles` (DSN pattern) | 370 ms | **318 ms** | **−14%** | | `detectAllDsns.cold` | 363 ms | **313 ms** | **−14%** | | `detectDsn.cold` | 7.73 ms | **5.61 ms** | **−27%** | | `scanCodeForFirstDsn` | 2.91 ms | **2.13 ms** | **−27%** | | `scanCodeForDsns` | 342 ms | 333 ms | −3% (noise-equivalent) | | `import.*from` uncapped (bench) | 1489 ms | **1178 ms** | **−21%** | The DSN workloads improve because `DSN_PATTERN` extracts `http` as its literal — most source files don't contain `http` at all, so the prefilter short-circuits before the regex runs. No regressions on any benchmark. Pure-literal patterns (e.g., `SENTRY_DSN`, `NONEXISTENT_TOKEN_XYZ`) continue through the whole-buffer path unchanged. ## What changed ### New file: `src/lib/scan/literal-extract.ts` (~300 LOC) Conservative literal extractor. Walks a regex source looking for the longest contiguous run of literal bytes that every match must contain. Bails out safely on top-level alternation, character classes, groups, lookarounds, quantifiers, and escape classes. Handles escaped metacharacters intelligently: `Sentry\.init` yields `Sentry.init` (extracted via literal `\.` → `.`), while `\bfoo\b` yields `foo` (escape `\b` is an anchor, not a literal `b`). Exports: - `extractInnerLiteral(source, flags)` — returns the literal, or null if no safe extraction possible. Honors `/i` by lowercasing. - `isPureLiteral(source, flags)` — true when the pattern IS a bare literal with no metacharacters. Used by the grep pipeline to route pure-literals to the whole-buffer path (V8's regex engine is hyper-optimized for pure-literal patterns; the prefilter adds overhead without benefit there). ### Modified: `src/lib/scan/grep.ts` (~240 LOC changes) Three-way dispatch in `readAndGrep` based on the extracted literal: 1. **`grepByLiteralPrefilter`** (new) — regex with extractable literal + `multiline: true`. Uses `indexOf(literal)` to find candidate lines, runs the regex engine only on those. This is the main perf win. 2. **`grepByWholeBuffer`** — existing path, used for: - Pure-literal patterns (V8 handles them optimally) - Patterns with no extractable literal (complex regex, top-level alternation) - `multiline: false` mode (the fast path requires per-line semantics) Also: replaced the `charCodeAt`-walk that counted newlines char-by-char with an `indexOf("\n", cursor)` hop loop. Extracted `buildMatch(ctx, bounds)` as a shared helper to bundle the match-construction arguments. ### Tests added - `test/lib/scan/literal-extract.test.ts` — **39 tests** covering the extractor's rules (escape handling, quantifier drop, alternation bail, case-insensitive, minimum length). - `test/lib/scan/grep.test.ts` — **7 new tests** for the prefilter fast path: correctness vs whole-buffer, escaped-literal extraction, case-insensitive flag, zero-literal-hit short-circuit, routing of pure literals to whole-buffer, and alternation routing. ## Why this approach From the ripgrep research (attached to PR #791): rg's central perf trick is extracting a literal from each regex and prefiltering with SIMD memchr. V8 doesn't expose SIMD directly but its `String.prototype.indexOf` is compiled to a tight byte-level loop with internal SIMD on x64 — functionally equivalent for our use case. Three of the five techniques in the Loggly regex-perf guide were evaluated: - **Character classes over `.*`** — `DSN_PATTERN` already uses `[a-z0-9]+`, no change needed. - **Alternation order** — `DSN_PATTERN`'s `(?:\.[a-z]+|:[0-9]+)` is already correctly ordered (`.` more common than `:` in DSN hosts); swapping regressed perf by noise. - **Anchors/word boundaries** — adding `\b` to `DSN_PATTERN` *regressed* perf 2.8× on our workload. V8's existing fast character-mismatch rejection on the first byte outperforms the boundary check overhead. The remaining gap with rg is now primarily orchestration overhead (async/await, `mapFilesConcurrent`, walker correctness features) rather than regex speed. A worker-pool exploration may follow. ## Test plan - [x] `bunx tsc --noEmit` — clean - [x] `bun run lint` — clean (1 pre-existing warning in `src/lib/formatters/markdown.ts` unrelated to this PR) - [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5610 pass, 0 fail** (+58 new) - [x] `bun test test/isolated` — 138 pass, 0 fail - [x] `bun run bench --size large --runs 5` — all scan ops at or below previous baseline - [x] Manually verified semantic parity: `collectGrep` returns identical `GrepMatch[]` on prefilter vs whole-buffer paths for patterns where the prefilter fires 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

BYK requested a review from betegon April 21, 2026 11:48

betegon approved these changes Apr 21, 2026

View reviewed changes

BYK mentioned this pull request Apr 21, 2026

perf(scan): literal prefilter + lazy line counting in readAndGrep #804

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

refactor(init): migrate grep/glob tools to src/lib/scan/#797

refactor(init): migrate grep/glob tools to src/lib/scan/#797
BYK wants to merge 1 commit intomainfrom
byk/feat/ripgrep-init-wizard

BYK commented Apr 21, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Apr 21, 2026 •

edited

Loading

New Features ✨

Bug Fixes 🐛

Documentation 📚

Internal Changes 🔧

Init

Other

Uh oh!

github-actions Bot commented Apr 21, 2026

Built to branch `gh-pages` at 2026-04-21 11:47 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Uh oh!

github-actions Bot commented Apr 21, 2026

Uh oh!

betegon left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

BYK commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed

Adapters (net −386 LOC of production code)

New optional GrepSearch fields

Tests

Behavior changes (intentional, only affects users without rg/git)

Benchmarks

Apples-to-apples: uncapped, no early-exit

Realistic init wizard: capped at 100

Why ship this?

Test plan

What this PR does NOT change

Uh oh!

github-actions Bot commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Semver Impact of This PR

New Features ✨

Bug Fixes 🐛

Documentation 📚

Internal Changes 🔧

Init

Other

Uh oh!

github-actions Bot commented Apr 21, 2026

Built to branch gh-pages at 2026-04-21 11:47 UTC. Preview will be ready when the GitHub Pages deployment is complete.

Uh oh!

github-actions Bot commented Apr 21, 2026

Codecov Results 📊

📊 Comparison with Base Branch

Uh oh!

betegon left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

BYK commented Apr 21, 2026 •

edited

Loading

New optional `GrepSearch` fields

Behavior changes (intentional, only affects users without `rg`/`git`)

github-actions Bot commented Apr 21, 2026 •

edited

Loading

Built to branch `gh-pages` at 2026-04-21 11:47 UTC.
Preview will be ready when the GitHub Pages deployment is complete.