From d2305ba1fff3f62a2a2e064fb6786a08bf5b3bf2 Mon Sep 17 00:00:00 2001 From: Burak Yigit Kaya Date: Tue, 21 Apr 2026 22:00:32 +0000 Subject: [PATCH 1/9] perf(scan): parallel grep via worker pool with binary-transferable matches MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit `scan.grepFiles` on 10k files drops from 322ms → 191ms p50 (−41%). Follow-up to PR #791 (literal prefilter), #804 (simplify prefilter to file gate), #805 (walker hot-path rewrite). The prior PRs left the walker at 231ms and the overall pipeline at ~1200ms for uncapped scans — the gap was dominated by per-file async orchestration in `mapFilesConcurrent` rather than actual work. Prototypes showed that running the per-file work synchronously inside a small pool of workers (4 on this box) collapses the orchestration cost without changing the public API. ## Architecture ### `src/lib/scan/worker-pool.ts` (~300 LOC) Lazy-initialized singleton pool (size `min(8, max(2, availableParallelism()))`). Feature-gated on `isWorkerSupported()` — runtime must expose `Worker`, `Blob`, and `URL.createObjectURL`. On runtimes without Workers (e.g., Node dist without `--experimental-worker-threads-*` flags), `grepFilesInternal` falls back to the existing `mapFilesConcurrent` path. Runtime escape hatch: `SENTRY_SCAN_DISABLE_WORKERS=1`. ### `src/lib/scan/grep-worker-source.ts` Worker source inlined as a JS string constant and loaded via `Blob` + `URL.createObjectURL`. **This is required for `bun build --compile` single-file binaries** — filesystem-based `new Worker(url)` hangs at runtime because the URL can't be resolved against the compiled binary's module map. Blob URLs sidestep that entirely. The worker is self-contained, uses `require("node:fs")` CommonJS-style, and has zero imports from local modules. ### `src/lib/scan/grep.ts` integration `grepFilesInternal` dispatches to one of two sub-generators: 1. `grepViaWorkers` — producer/consumer streaming. 
Walker runs on the main thread; paths accumulate into batches of 200; each full batch is dispatched to the least-loaded worker; the worker returns matches as `{ ints: Uint32Array, linePool: string }` via `postMessage` with `transfer: [ints.buffer]` (zero-copy). The main thread decodes into `GrepMatch[]` lazily and yields.

2. `grepViaAsyncMain` — the existing `mapFilesConcurrent` path, used when workers are unavailable or disabled.

### Binary match encoding

Structured clone of 215k `GrepMatch` objects costs ~200ms on this box. A transferable ArrayBuffer plus a shared `linePool` string drops that to ~2-3ms. Each match is 4 u32s: `[pathIdx, lineNum, lineOffset, lineLength]`. Line text is appended to a single accumulator string per batch; offsets into it are rebuilt into `GrepMatch.line` on the main thread via `linePool.slice(lineOffset, lineOffset + lineLength)`.

## Bugs caught + fixed during implementation

### FIFO queue race (message handler mismatch)

The initial design used `addEventListener` per dispatch. Because Web Workers deliver every `result` message to EVERY attached listener, multiple concurrent dispatches to the same worker would all fire together on the first result — resolving the wrong promise with the wrong batch's data. And when a handler shifted state to mark itself "done", subsequent dispatches would find their handlers already removed.

**Fix:** a single `onmessage` per worker plus a per-worker FIFO `pending` queue of `{resolve, reject}` slots. Each `result` message shifts one slot. Workers process `postMessage` FIFO (both Bun and Node guarantee this within a single worker), so FIFO-order resolution is sound.

### Consumer wake race (lost notifications)

`wakeConsumer()` fired immediately if `notify` was set, else did nothing. When a batch settled before the consumer entered its `await new Promise(resolve => notify = resolve)` block, the notification was dropped and the consumer hung forever.

**Fix:** added a `notifyPending` flag.
`wakeConsumer()` sets it when no consumer is waiting; consumer checks the flag before awaiting, and the await's executor also checks it to close the window between check and assignment. ### `worker.unref()` deadlock Initial design called `.unref()` on each worker for clean CLI exit. But when the main thread's only event-loop work was waiting on a worker result, unref'd workers didn't process messages on idle ticks, deadlocking the pipeline. The symptom was visible in the bench — after the walker finished and all in-flight batches except 2 returned, the last 2 never completed. **Fix:** removed `.unref()`. The CLI calls `process.exit()` at the end so worker cleanup isn't needed for process termination. For library embeddings, callers can use `terminatePool()` explicitly. ## Perf (synthetic/large, 10k files, p50) | Op | Before (PR #805) | After | Δ | |---|---:|---:|---:| | `scan.grepFiles` | 322ms | **191ms** | **−41%** | | `detectAllDsns.cold` | 308ms | 297ms | −4% | | `scanCodeForDsns` | 313ms | 293ms | −6% | | `detectAllDsns.warm` | 27.0ms | 27.5ms | — | | `scan.walk` | 231ms | 263ms | noise | `scan.grepFiles` is the primary target. `scanCodeForDsns` doesn't use the worker pool (it uses the DSN scanner directly), so the ~20ms improvement there is walker variance. 
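The decode side of the binary match encoding described above is mechanical. A minimal sketch, with assumptions flagged: `decodeBatch` and the `GrepMatch` fields other than `line` are illustrative names (the patch's actual decoder is `decodeWorkerMatches` in `worker-pool.ts`, whose exact shape isn't shown in this diff):

```typescript
// Sketch of main-thread decoding for the [pathIdx, lineNum, lineOffset,
// lineLength] quad encoding. Hypothetical names; the real decoder is
// decodeWorkerMatches in src/lib/scan/worker-pool.ts.
type GrepMatch = { path: string; lineNumber: number; line: string };

function decodeBatch(
  paths: string[],    // the batch of file paths sent to the worker
  ints: Uint32Array,  // 4 u32s per match, transferred zero-copy
  linePool: string    // all line texts concatenated by the worker
): GrepMatch[] {
  const matches: GrepMatch[] = [];
  for (let i = 0; i < ints.length; i += 4) {
    const pathIdx = ints[i];
    const lineNumber = ints[i + 1];
    const lineOffset = ints[i + 2];
    const lineLength = ints[i + 3];
    matches.push({
      path: paths[pathIdx],
      lineNumber,
      // Rebuild line text from the shared pool: one slice per match
      // instead of one structured-cloned string per match.
      line: linePool.slice(lineOffset, lineOffset + lineLength),
    });
  }
  return matches;
}
```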
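The consumer wake fix above is a general single-producer/single-consumer pattern. A condensed sketch of the protocol as the commit describes it; `makeWakeChannel` is an illustrative helper, not the PR's actual API:

```typescript
// Minimal wake protocol matching the described fix: a one-shot `notify`
// resolver plus a `notifyPending` flag that latches wake-ups arriving
// while no consumer is waiting.
function makeWakeChannel() {
  let notify: (() => void) | null = null;
  let notifyPending = false;

  return {
    // Producer side: called when a batch settles.
    wake(): void {
      if (notify) {
        const n = notify;
        notify = null;
        n();
      } else {
        // No consumer waiting yet: latch the notification so the
        // consumer's next wait resolves instead of hanging forever.
        notifyPending = true;
      }
    },
    // Consumer side: await the next wake-up.
    wait(): Promise<void> {
      if (notifyPending) {
        notifyPending = false;
        return Promise.resolve(); // wake arrived before we waited
      }
      return new Promise<void>((resolve) => {
        // The executor runs synchronously; re-checking here closes the
        // window between the check above and the assignment below.
        if (notifyPending) {
          notifyPending = false;
          resolve();
          return;
        }
        notify = resolve;
      });
    },
  };
}
```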
## Test plan - [x] `bunx tsc --noEmit` — clean - [x] `bun run lint` — clean (1 pre-existing warning in `src/lib/formatters/markdown.ts` unrelated) - [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5640 pass, 0 fail** - [x] `bun test test/isolated` — 138 pass - [x] `bun test test/lib/scan/grep.test.ts` — 38 pass (including `AbortSignal fires mid-iteration` which exercises the signal propagation through the producer/consumer pipeline) - [x] `bun test test/lib/dsn/code-scanner.test.ts` — 52 pass - [x] `SENTRY_SCAN_DISABLE_WORKERS=1 bun run bench` — verifies the fallback path still works end-to-end (360ms p50, matches pre-change behavior) --- AGENTS.md | 29 +-- src/lib/scan/grep-worker-source.ts | 118 +++++++++++ src/lib/scan/grep.ts | 281 +++++++++++++++++++++++- src/lib/scan/worker-pool.ts | 328 +++++++++++++++++++++++++++++ 4 files changed, 738 insertions(+), 18 deletions(-) create mode 100644 src/lib/scan/grep-worker-source.ts create mode 100644 src/lib/scan/worker-pool.ts diff --git a/AGENTS.md b/AGENTS.md index 7e2eac62a..03e65a36d 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -1000,6 +1000,9 @@ mock.module("./some-module", () => ({ * **DSN cache invalidation uses two-level mtime tracking (sourceMtimes + dirMtimes)**: \*\*DSN cache invalidation uses two-level mtime tracking\*\*: \`sourceMtimes\` (DSN-bearing files, catches in-place edits) + \`dirMtimes\` (every walked dir, catches new files) + root mtime fast-path + 24h TTL. Dropping either map is a correctness regression. Walker emits mtimes via \`onDirectoryVisit\` hook + \`recordMtimes\` option. \*\*Error-path invariant\*\*: \`scanDirectory\` catch block MUST return empty \`dirMtimes: {}\` — NOT partial map from \`onDirectoryVisit\` callback, which would silently bless unvisited dirs. Empty forces full rescan. \`ConfigError\` re-throws instead. \`scan.walk.dsnParity\` bench op imports \`dsnScanOptions()\` from \`dsn/scan-options.ts\` to keep \`scan/\` policy-free. 
+ +* **Grep worker pool: binary-transferable matches + streaming dispatch in src/lib/scan/**: Grep worker pool in \`src/lib/scan/worker-pool.ts\` + \`grep-worker-source.ts\`: lazy singleton pool (size \`min(8, max(2, availableParallelism()))\`) for \`collectGrep\`/\`grepFiles\`. Matches encoded as \`Uint32Array\` of \`\[pathIdx, lineNum, lineOffset, lineLength]\` quads + single \`linePool\` string, transferred via \`postMessage(msg, \[ints.buffer])\` — drops structuredClone cost 198ms→2-3ms. Worker source inlined as string, loaded via \`Blob\` + \`URL.createObjectURL\` (URL-based workers hang in \`bun build --compile\` binaries). Main-thread producer batches walker paths (size 200), dispatches least-loaded; consumer yields decoded matches with AbortSignal polling. Disable via \`SENTRY\_SCAN\_DISABLE\_WORKERS=1\`; falls back to \`grepViaAsyncMain\` on Node. Perf: 10k files 1300ms → ~191ms p50 (faster than rg's 339ms); walker now bottleneck (~250ms of 275ms). + * **Input validation layer: src/lib/input-validation.ts guards CLI arg parsing**: CLI arg input validation (\`src/lib/input-validation.ts\`): Four validators — \`rejectControlChars\` (ASCII < 0x20), \`rejectPreEncoded\` (%XX), \`validateResourceId\` (rejects ?, #, %, whitespace), \`validateEndpoint\` (rejects \`..\` traversal). Applied in \`parseSlashOrgProject\`, \`parseOrgProjectArg\`, \`parseIssueArg\`, \`normalizeEndpoint\`. NOT applied in \`parseSlashSeparatedArg\` for plain IDs (may contain structural separators). Env vars and DB cache values are trusted. @@ -1007,13 +1010,13 @@ mock.module("./some-module", () => ({ * **Magic @ selectors resolve issues dynamically via sort-based list API queries**: Magic @ selectors resolve issues dynamically: \`@latest\`, \`@most\_frequent\` in \`parseIssueArg\` detected before \`validateResourceId\` (@ not in forbidden charset). \`SELECTOR\_MAP\` provides case-insensitive matching. 
\`resolveSelector\` maps to \`IssueSort\` values, calls \`listIssuesPaginated\` with \`perPage: 1\`, \`query: 'is:unresolved'\`. Supports org-prefixed: \`sentry/@latest\`. Unrecognized \`@\`-prefixed strings fall through. \`ParsedIssueArg\` union includes \`{ type: 'selector' }\`. -* **Sentry SDK uses @sentry/node-core/light instead of @sentry/bun to avoid OTel overhead**: Sentry SDK uses @sentry/node-core/light to avoid OTel overhead: Avoids \`@sentry/bun\` which loads full OpenTelemetry (~150ms, 24MB). \`@sentry/core\` barrel patched via \`bun patch\` to remove ~32 unused exports. Key gotcha: \`LightNodeClient\` hardcodes \`runtime: { name: 'node' }\` AFTER spreading options — fix by patching \`client.getOptions().runtime\` post-init (mutable ref). Transport falls back to Node \`http\` instead of native \`fetch\`. Upstream: getsentry/sentry-javascript#19885, #19886. +* **Sentry SDK uses @sentry/node-core/light instead of @sentry/bun to avoid OTel overhead**: Sentry SDK uses \`@sentry/node-core/light\` instead of \`@sentry/bun\` to avoid OpenTelemetry overhead (~150ms, 24MB). \`@sentry/core\` barrel patched via \`bun patch\` to remove ~32 unused exports. Gotcha: \`LightNodeClient\` hardcodes \`runtime: { name: 'node' }\` AFTER spreading options — fix by patching \`client.getOptions().runtime\` post-init (mutable ref). Transport uses Node \`http\` instead of native \`fetch\`. Upstream: getsentry/sentry-javascript#19885, #19886. 
-* **src/lib/scan/ module: policy-free file walker with IgnoreStack**: \*\*src/lib/scan/ module: policy-free walker + grep/glob engine.\*\* Files: \`walker.ts\` (DFS async gen; \`minDepth\`/\`maxDepth\` cap DIRECTORY DESCENT not file yield; \`timeBudgetMs\`, \`AbortSignal\`, \`onDirectoryVisit\`, \`recordMtimes\`), \`ignore.ts\` (IgnoreStack two-tier: \`#rootIg\`+\`#nestedByRelDir\`; fast-path ~0.27µs, slow-path ~1.93µs), \`binary.ts\` (8KB NUL sniff + ext fast-path), \`concurrent.ts\`, \`regex.ts\` (translates \`(?i)\`/\`(?im)\`/\`(?U)\` → JS flags), \`grep.ts\`/\`glob.ts\` (picomatch \`{dot:true}\`). DSN policy lives in \`src/lib/dsn/scan-options.ts\`. Gotchas: (1) \`classifyFile\` short-circuits when \`cfg.extensions\` set. (2) \`buildWalkOptions\` MUST forward \`descentHook\`+\`alwaysSkipDirs\`+\`followSymlinks\`+\`recordMtimes\`+\`clock\`. (3) NEVER \`pLimit.clearQueue()\` (hangs); use \`onResult\` returning \`{done:true}\`. (4) \`path.join\`/\`relative\`/\`extname\` ~10× slower than manual string ops in hot walker loop — use \`abs.slice(cwdPrefixLen)\` and \`lastIndexOf('.')+slice+toLowerCase\`; on Windows use \`path.sep\`. +* **src/lib/scan/ module: policy-free file walker with IgnoreStack**: \`src/lib/scan/\` — policy-free walker + grep/glob engine. Files: \`walker.ts\` (DFS async gen; \`minDepth\`/\`maxDepth\` cap DESCENT not file yield; \`timeBudgetMs\`, \`AbortSignal\`, \`onDirectoryVisit\`, \`recordMtimes\`), \`ignore.ts\` (IgnoreStack two-tier \`#rootIg\`+\`#nestedByRelDir\`), \`binary.ts\` (8KB NUL sniff + ext fast-path), \`concurrent.ts\`, \`regex.ts\` (\`(?i)\`/\`(?im)\`/\`(?U)\` → JS flags), \`grep.ts\`/\`glob.ts\` (picomatch \`{dot:true}\`). DSN policy lives in \`src/lib/dsn/scan-options.ts\`. Gotchas: (1) \`classifyFile\` short-circuits when \`cfg.extensions\` set. (2) \`buildWalkOptions\` MUST forward \`descentHook\`+\`alwaysSkipDirs\`+\`followSymlinks\`+\`recordMtimes\`+\`clock\`. 
(3) NEVER \`pLimit.clearQueue()\` (hangs); use \`onResult\` returning \`{done:true}\`. (4) \`path.join\`/\`relative\`/\`extname\` ~10× slower than manual string ops in hot loop — use \`abs.slice(cwdPrefixLen)\` and \`lastIndexOf('.')+slice+toLowerCase\`. -* **Telemetry opt-out is env-var-only — no persistent preference or DO\_NOT\_TRACK**: Telemetry opt-out priority: (1) \`SENTRY\_CLI\_NO\_TELEMETRY=1\`, (2) \`DO\_NOT\_TRACK=1\`, (3) \`metadata.defaults.telemetry\`, (4) default on. DB read try/catch wrapped (runs before DB init). Schema v13 merged \`defaults\` table into \`metadata\` KV with keys \`defaults.{org,project,telemetry,url}\`; getters/setters in \`src/lib/db/defaults.ts\`. \`sentry cli defaults\` uses variadic \`\[key, value?]\`: no args → show all; 1 arg → show key; 2 args → set; \`--clear\` without args → clear all (guarded); \`--clear key\` → clear specific. Canonical keys: \`org\`/\`organization\`, \`project\`, \`telemetry\`, \`url\`. \`computeTelemetryEffective()\` returns resolved source for display. +* **Telemetry opt-out is env-var-only — no persistent preference or DO\_NOT\_TRACK**: Telemetry opt-out priority: (1) \`SENTRY\_CLI\_NO\_TELEMETRY=1\`, (2) \`DO\_NOT\_TRACK=1\`, (3) \`metadata.defaults.telemetry\`, (4) default on. DB read try/catch wrapped (runs before DB init). Schema v13 merged \`defaults\` table into \`metadata\` KV with keys \`defaults.{org,project,telemetry,url}\`; getters/setters in \`src/lib/db/defaults.ts\`. \`sentry cli defaults\` uses variadic \`\[key, value?]\`: no args → show all; 1 arg → show key; 2 args → set; \`--clear\` without args → clear all (guarded); \`--clear key\` → clear specific. \`computeTelemetryEffective()\` returns resolved source for display. * **Zod schema on OutputConfig enables self-documenting JSON fields in help and SKILL.md**: Zod schema on OutputConfig enables self-documenting JSON fields: List commands register \`schema?: ZodType\` on \`OutputConfig\\`. 
\`extractSchemaFields()\` produces \`SchemaFieldInfo\[]\` from Zod shapes. \`buildFieldsFlag()\` enriches \`--fields\` brief; \`enrichDocsWithSchema()\` appends fields to \`fullDescription\`. Schema exposed as \`\_\_jsonSchema\` on built commands — \`introspect.ts\` reads it into \`CommandInfo.jsonFields\`, \`help.ts\` and \`generate-skill.ts\` render it. For \`buildOrgListCommand\`/\`dispatchOrgScopedList\`, pass \`schema\` via \`OrgListConfig\`. @@ -1023,34 +1026,37 @@ mock.module("./some-module", () => ({ * **All view subcommands should use \ \ positional pattern**: All \`\* view\` subcommands use \`\ \\` positional pattern (Intent-First Correction UX): target is optional \`org/project\`. Use opportunistic arg swapping with \`log.warn()\` when args are wrong order — when intent is unambiguous, do what they meant. Normalize at command level, keep parsers pure. Model after \`gh\` CLI. Exception: \`auth\` uses \`defaultCommand: "status"\` (no viewable entity). Routes without defaults: \`cli\`, \`sourcemap\`, \`repo\`, \`team\`, \`trial\`, \`release\`, \`dashboard/widget\`. + +* **grep orchestration overhead is ~500-900ms; sync main-thread beats worker-pool on many-match workloads**: grep perf evolution on 10k-file/215k-match \`import.\*from\`: baseline 1300ms → sync main-thread (readFileSync + serial regex) 324-395ms (~rg's 339ms) → binary-transferable streaming worker pool \*\*191ms p50\*\* (faster than rg). Object-payload workers lose to structuredClone (216ms for 215k matches: JSON.stringify 101ms / parse 168ms). Walker is now bottleneck (~250ms of 275ms end-to-end). SharedArrayBuffer/mmap would save only 10-30ms more, not worth complexity. See worker-pool entry for impl. 
+ * **Sentry-derived terminal color palette tuned for dual-background contrast**: Terminal color palette tuned for dual-background contrast: 10-color chart palette derived from Sentry's categorical hues (\`static/app/utils/theme/scraps/tokens/color.tsx\`), adjusted to mid-luminance for ≥3:1 contrast on both dark and light backgrounds. Adjustments: orange #FF9838→#C06F20, green #67C800→#3D8F09, yellow #FFD00E→#9E8B18, purple #5D3EB2→#8B6AC8, indigo #50219C→#7B50D0; blurple/pink/magenta unchanged; teal #228A83 added. Hex preferred over ANSI 16-color for guaranteed contrast. - -* **Walker uses sync readdir + statSync — readdirSync p95=24µs is fine to block on**: Walker perf: sync readdir + statSync chosen over async. readdirSync p50=11µs, p95=24µs — below setTimeout(0) (~4ms). 10k-file walk: async serial 105-167ms → sync 43-45ms. statSync slightly faster than Bun.file().size. CLI is single-threaded; async overhead unjustified. Walker remains async-generator externally; internals sync. Workers rejected: bootstrap 50ms/worker for scan imports, structuredClone dominates many-match workloads (215k × ~100B = 20MB), cancellation hard (only \`worker.terminate()\`), single-bundle polyfill story ugly. 4-core ceiling 1.7-2.3× eaten by cloning. Walker is the bottleneck, not regex — gitignore layer adds ~135ms overhead in \`ignore\` package. - ### Gotcha * **AuthError constructor takes reason first, message second**: \`AuthError(reason: AuthErrorReason, message?: string)\` where \`AuthErrorReason\` is \`"not\_authenticated" | "expired" | "invalid"\`. Easy to accidentally swap args as \`new AuthError("Token expired", "expired")\` — the string \`"Token expired"\` gets assigned as \`reason\` (invalid enum value). Tests aren't type-checked (tsconfig excludes them), so TypeScript won't catch this. Correct: \`new AuthError("expired", "Token expired")\`. Default messages exist for each reason, so the second arg is often unnecessary. 
+ +* **Bun worker \`new Worker(url)\` fails in compiled binaries — use Blob URL**: Bun \`new Worker(new URL('./worker.ts', import.meta.url))\` works in \`bun run\` but \*\*hangs in \`bun build --compile\` single-file binaries\*\* — URL can't be resolved at runtime. Fix: inline worker source as a string, wrap in \`new Blob(\[src], { type: 'application/javascript' })\`, pass \`URL.createObjectURL(blob)\` to \`new Worker()\`. Works across Bun compiled binary and esbuild Node CJS bundle. Worker source must be self-contained (no imports from user modules); CommonJS-style \`require('node:fs')\` works inside. + -* **mapFilesConcurrent skips null but not empty arrays — callers must return null for no-op**: Two off-by-one gotchas in scan/formatters: (1) \`mapFilesConcurrent\` in \`src/lib/scan/concurrent.ts\` filters \`null\` but NOT empty arrays — pushes and fires \`onResult\` for each. Stream variant \`mapFilesConcurrentStream\` filters both. Callers like \`processEntry\` in \`src/lib/dsn/code-scanner.ts\` must return \`null\` (not \`\[]\`) for no-op files — on 10k-file walks where ~99% have no DSNs, \`\[]\` fires ~9900 wasted callbacks. (2) \`collectGlob\`/\`collectGrep\` must NOT forward \`maxResults\` to the underlying iterator — collector drains uncapped, breaks when \`files.length >= maxResults\`, sets \`truncated=true\` on overshoot. Forwarding makes iterator stop at exactly N without \`truncated\`. (3) \`filterFields\` in \`src/lib/formatters/json.ts\` dot-notation ambiguous for keys containing dots — property tests must use \`\[a-zA-Z0-9\_]\` charset. +* **mapFilesConcurrent skips null but not empty arrays — callers must return null for no-op**: Scan/formatter off-by-ones: (1) \`mapFilesConcurrent\` in \`src/lib/scan/concurrent.ts\` filters \`null\` but NOT empty arrays — pushes and fires \`onResult\` per empty result. Stream variant filters both. 
Callers like \`processEntry\` in \`dsn/code-scanner.ts\` must return \`null\` (not \`\[]\`) for no-op files; on 10k-file walks ~99% yield empties, firing thousands of wasted callbacks. (2) \`collectGlob\`/\`collectGrep\` must NOT forward \`maxResults\` to the underlying iterator — collector drains uncapped, breaks when \`files.length >= maxResults\`, sets \`truncated=true\` on overshoot. Forwarding makes iterator stop at exactly N without \`truncated\`. (3) \`filterFields\` in \`formatters/json.ts\` uses dot-notation — property tests must use \`\[a-zA-Z0-9\_]\` charset to avoid ambiguous keys with dots. -* **process.stdin.isTTY unreliable in Bun — use isatty(0) and backfill for clack**: \`process.stdin.isTTY\` unreliable in Bun — use \`isatty(0)\`: Bun's single-file binary can leave \`process.stdin.isTTY === undefined\` on TTY fds inherited via redirects like \`exec … \ * **Spinner stdout/stderr collision: log messages inside withProgress appear on spinner line**: Spinner stdout/stderr collision: \`withProgress\` in \`src/lib/polling.ts\` writes to stdout with \`\r\x1b\[K\` (no newline); consola writes to stderr. Any \`log.info()\` called \*\*inside\*\* the callback appears on the spinner line because stderr doesn't know stdout's carriage-return position. Fix: propagate data out via return value, call \`log.info()\` \*\*after\*\* \`withProgress\` completes (finally block clears the line). Affected \`downloadBinaryToTemp\` in \`upgrade.ts\`. -* **Stricli rejects unknown flags — pre-parsed global flags must be consumed from argv**: Stricli flag parsing traps: (1) Unknown \`--flag\`s rejected — global flags parsed in \`bin.ts\` MUST be spliced from argv (check \`--flag value\` AND \`--flag=value\`). (2) \`FLAG\_NAME\_PATTERN\` requires 2+ chars after \`--\`; single-char flags like \`--x\` silently become positionals — use aliases (\`-x\` → longer name). Bit dashboard widget \`--x\`/\`--y\`. 
(3) Generated SKILL.md filters hidden flags: \`FlagDef\` includes \`hidden?: boolean\`; \`extractFlags\` propagates so \`generateCommandDoc\` filters hidden flags alongside \`help\`/\`helpAll\`. Hidden flags (\`--log-level\`, \`--verbose\`) appear globally in \`docs/src/content/docs/commands/index.md\` Global Options. +* **Stricli rejects unknown flags — pre-parsed global flags must be consumed from argv**: Stricli flag parsing traps: (1) Unknown \`--flag\`s rejected — global flags parsed in \`bin.ts\` MUST be spliced from argv (check both \`--flag value\` and \`--flag=value\`). (2) \`FLAG\_NAME\_PATTERN\` requires 2+ chars after \`--\`; single-char flags like \`--x\` silently become positionals — use aliases (\`-x\` → longer name). Bit \`dashboard widget --x\`/\`--y\`. (3) \`FlagDef.hidden\` is propagated by \`extractFlags\` so \`generateCommandDoc\` filters hidden flags alongside \`help\`/\`helpAll\`; hidden \`--log-level\`/\`--verbose\` appear only in global options docs. * **test:unit glob only picks up test/lib, test/commands, test/types, test/isolated**: \`bun test\` script globs in \`package.json\` are narrow: \`test:unit\` = \`test/lib test/commands test/types\`, \`test:isolated\` = \`test/isolated\`, \`test:e2e\` = \`test/e2e\`. Tests placed under \`test/fixtures/\`, \`test/scripts/\`, or \`test/script/\` are NOT picked up by any standard CI script despite being valid Bun test files. Place new tests under \`test/lib/\/\` to inherit coverage. Note: \`test/script/\` (singular) exists with script tests but is also outside the globs. -* **Whole-buffer matchAll slower than split+test when aggregated over many files**: grep perf tradeoffs in \`src/lib/scan/grep.ts::readAndGrep\`: (1) Whole-buffer \`regex.exec\` is 12× faster on single files but ~1.6× SLOWER than \`split("\n") + regex.test\` aggregated over 10k files / 215k matches — match-emission cost dominates. 
160× faster-than-rg headline comes entirely from \`mapFilesConcurrent.onResult\` early-exit at maxResults=100. (2) Literal prefilter is FILE-LEVEL gate only (cheap \`indexOf\` → skip file if absent), then whole-buffer \`regex.exec\` on survivors. Per-line-verify broke cross-newline patterns and Unicode length-changing \`toLowerCase\` (Turkish \`İ\`→\`i̇\`). Extractor handles multi-char escapes (\`\x41\`,\`\u0041\`,\`\cA\`,\`\k\\`,\`\p{L}\`) via \`escapeSequenceLength\`. \`hasTopLevelAlternation\`+\`skipGroup\` must call \`skipCharacterClass\` (PCRE \`\[]abc]\` ≠ JS). Extract from \`regex.source\` not user input. (3) \`GrepOptions.multiline\` defaults to \`true\` (rg/git-grep semantics) — thread through \`PerFileOptions\`; \`compilePattern\`'s \`multiline\` default stays \`false\`. +* **Whole-buffer matchAll slower than split+test when aggregated over many files**: grep perf traps in \`src/lib/scan/grep.ts::readAndGrep\`: (1) Whole-buffer \`regex.exec\` is 12× faster per-file but ~1.6× SLOWER than \`split('\n')+regex.test\` aggregated over 10k files/215k matches — match-emission dominates. Early-exit at \`maxResults\` via \`mapFilesConcurrent.onResult\` is what wins. (2) Literal prefilter is FILE-LEVEL gate only (\`indexOf\` → skip), then whole-buffer \`regex.exec\`. Per-line-verify broke cross-newline patterns and Unicode length-changing \`toLowerCase\` (Turkish \`İ\`→\`i̇\`). Extractor handles multi-char escapes via \`escapeSequenceLength\`; \`hasTopLevelAlternation\`+\`skipGroup\` must call \`skipCharacterClass\` (PCRE \`\[]abc]\` ≠ JS). Extract from \`regex.source\`. (3) \`GrepOptions.multiline\` defaults to \`true\` (rg/git-grep); \`compilePattern\`'s default stays \`false\`. ### Pattern @@ -1068,7 +1074,4 @@ mock.module("./some-module", () => ({ * **PR review workflow: Cursor Bugbot + Seer + human cycle**: PR review workflow (Cursor Bugbot + Seer + human): \`gh pr checks \ --json state,link\` + \`gh run view --log-failed\` for CI. 
Unresolved threads via \`gh api graphql\` with \`reviewThreads\` query filtering \`isResolved:false\`+\`isMinimized:false\`. After fixes: reply via \`gh api repos/.../pulls/comments/\/replies\` then resolve via \`resolveReviewThread\` mutation. Bots auto-resolve on detected fix. Statuses: \`UNSTABLE\`=non-blocking bots running, \`BLOCKED\`=required CI pending, \`CLEAN\`+\`MERGEABLE\`=ready. Repo is squash-merge only. Expect 4-6 rounds on subtle regex/Unicode PRs. Bugbot findings usually real but occasionally assume PCRE/Python semantics (e.g. \`\[]abc]\` is class-with-\`]\` in PCRE but empty class in JS) — verify with reproduction. Dispatch self-review subagent between rounds. - - -* **URL resolution chain and persistent default integration point**: CLI init sequence in \`startCli()\` (\`cli.ts\`): (1) fast-path \`\_\_complete\` dispatch, (2) \`preloadProjectContext(cwd)\` finds project root + \`.sentryclirc\`, calls \`applySentryCliRcEnvShim()\`, (3) persistent defaults applied, (4) \`runCli(args)\` with \`withTelemetry()\`. Heavy modules use \`await import()\` for fast completion. \`preloadProjectContext\` is try/catch wrapped. Sentry URL resolution priority: \`SENTRY\_HOST\` > \`SENTRY\_URL\` > \`.sentryclirc \[defaults] url\` > metadata \`defaults.url\` > \`sentry.io\`. Persistent URL default written to \`env.SENTRY\_URL\` only if both \`SENTRY\_HOST\` and \`SENTRY\_URL\` unset. diff --git a/src/lib/scan/grep-worker-source.ts b/src/lib/scan/grep-worker-source.ts new file mode 100644 index 000000000..147e1fa23 --- /dev/null +++ b/src/lib/scan/grep-worker-source.ts @@ -0,0 +1,118 @@ +/** + * Inline source for the grep worker, delivered to `new Worker()` via + * `Blob` + `URL.createObjectURL`. + * + * Why inline-as-string and not a separate `.ts` file? 
+ * + * - `bun build --compile` (our single-file CLI binary) can't resolve + * `new Worker(new URL("./file.ts", import.meta.url).href)` at + * runtime — the worker URL points to a source that's not part of + * the compiled binary's module map. The worker spawn hangs. + * + * - A Blob URL sidesteps that: we construct the worker source at + * runtime from a string bundled into the main module, so there's + * no runtime file resolution. + * + * The worker is deliberately self-contained: + * + * - No imports from local modules (those wouldn't be available in + * the worker's module registry anyway). + * - Uses `require("node:fs")` (CJS-style) inside the worker function + * because Bun and Node both expose `node:fs` to workers via + * `require` and `import`, but the CJS form avoids top-level + * await parsing issues in the string template. + * + * The worker receives a batch of file paths, a regex pattern, and + * options. It reads each file synchronously (fast in-process), + * runs the regex, and returns matches packed as a `Uint32Array` + + * a single `linePool` string. The caller then re-hydrates + * `GrepMatch` objects from the packed data — this avoids the + * structured-clone cost of ~216ms that a plain `GrepMatch[]` + * would incur for typical large-result workloads. + * + * Match encoding — each match is 4 consecutive `u32`s in `ints`: + * [0] pathIdx index into the input `paths` array + * [1] lineNum 1-based line number + * [2] lineOffset character offset into `linePool` + * [3] lineLength character length of the line (post-truncation) + */ + +/** + * Worker source. Injected into a `Blob` at runtime. + * + * Kept as plain JS (not TS) so it can be embedded verbatim without + * a compile step inside the worker. 
+ */ +export const GREP_WORKER_SOURCE = ` +const { readFileSync } = require("node:fs"); + +self.onmessage = (event) => { + const { paths, patternSource, flags, maxLineLength, maxMatchesPerFile, literal } = event.data; + const regex = new RegExp(patternSource, flags); + // Packed match data: 4 u32s per match — [pathIdx, lineNum, lineOffset, lineLength] + const ints = []; + let linePool = ""; + + for (let pathIdx = 0; pathIdx < paths.length; pathIdx++) { + const p = paths[pathIdx]; + let content; + try { + content = readFileSync(p, "utf-8"); + } catch { + continue; + } + + // File-level literal gate (matches main-thread readAndGrep behavior). + if (literal !== null) { + const isCaseInsensitive = regex.flags.includes("i"); + const haystack = isCaseInsensitive ? content.toLowerCase() : content; + if (haystack.indexOf(literal) === -1) continue; + } + + regex.lastIndex = 0; + let m = regex.exec(content); + let lineNum = 1; + let cursor = 0; + let fileMatches = 0; + + while (m !== null) { + const matchIndex = m.index; + // Advance line counter to match position via indexOf hops. + let nl = content.indexOf("\\n", cursor); + while (nl !== -1 && nl < matchIndex) { + lineNum++; + nl = content.indexOf("\\n", nl + 1); + } + cursor = matchIndex; + + const lineStart = content.lastIndexOf("\\n", matchIndex) + 1; + const lineEndRaw = content.indexOf("\\n", matchIndex); + const lineEnd = lineEndRaw === -1 ? content.length : lineEndRaw; + let line = content.slice(lineStart, lineEnd); + if (line.length > maxLineLength) { + line = line.slice(0, maxLineLength - 1) + "\\u2026"; + } + + // Append line text to pool, record offset+length + const lineOffset = linePool.length; + linePool += line; + ints.push(pathIdx, lineNum, lineOffset, line.length); + + fileMatches++; + if (fileMatches >= maxMatchesPerFile) break; + if (lineEndRaw === -1) break; + regex.lastIndex = lineEnd + 1; + m = regex.exec(content); + } + } + + // Pack into Uint32Array with transferable backing buffer. 
+ const packed = new Uint32Array(ints); + self.postMessage( + { type: "result", ints: packed, linePool }, + { transfer: [packed.buffer] } + ); +}; + +self.postMessage({ type: "ready" }); +`; diff --git a/src/lib/scan/grep.ts b/src/lib/scan/grep.ts index 313cb0c9c..745b51103 100644 --- a/src/lib/scan/grep.ts +++ b/src/lib/scan/grep.ts @@ -69,6 +69,11 @@ import type { WalkOptions, } from "./types.js"; import { walkFiles } from "./walker.js"; +import { + decodeWorkerMatches, + getWorkerPool, + isWorkerSupported, +} from "./worker-pool.js"; /** Default line-truncation length for the init-wizard wire shape. */ const DEFAULT_MAX_LINE_LENGTH = 2000; @@ -393,15 +398,68 @@ type GrepPipelineOptions = { }; /** - * The internal pipeline. Separated from the public entry point so - * `collectGrep` can share it. + * Batch size for worker dispatch. Each batch is a chunk of N file + * paths the worker will read + grep in one go. Benched on + * synthetic/large (10k files): 100 = 420ms, 200 = 275ms (streamed), + * 400 = 290ms. The plateau is around 200 — too small and postMessage + * overhead dominates; too large and we lose per-worker parallelism + * because the walker finishes yielding before the first worker + * returns. + */ +const WORKER_BATCH_SIZE = 200; + +/** + * The internal pipeline. Separated from the public entry points so + * `grepFiles` and `collectGrep` share it. * - * `async *` is intentional even though we `yield*` into the - * `for await` — the whole pipeline is async and callers expect - * an `AsyncGenerator`. + * Two execution strategies: + * + * 1. **Worker pool** (default when the runtime supports `new Worker` + * + `Blob` + `URL.createObjectURL`): walker streams paths on the + * main thread, paths are batched into chunks of + * `WORKER_BATCH_SIZE`, each batch is dispatched round-robin to + * a lazy-initialized pool of workers. Workers return matches + * packed as `Uint32Array` + a shared string pool (transferable, + * zero-copy). Main thread decodes and yields. 
+ * + * 2. **Async fallback** (when workers aren't available — e.g. some + * Node library embeddings): falls back to `mapFilesConcurrent` + * on the main thread. Same result, ~4× slower on large workloads. + * + * Worker disablement for tests or debugging: set + * `SENTRY_SCAN_DISABLE_WORKERS=1`. */ +// biome-ignore lint/suspicious/useAwait: async generator that dispatches to one of two sub-generators via `yield*`; no `await` needed in the dispatcher itself async function* grepFilesInternal( opts: GrepPipelineOptions +): AsyncGenerator { + if (shouldUseWorkers()) { + yield* grepViaWorkers(opts); + return; + } + yield* grepViaAsyncMain(opts); +} + +/** + * Returns true when the worker path should be used. Gated on the + * runtime capability check (`isWorkerSupported`) and the + * `SENTRY_SCAN_DISABLE_WORKERS` env var (for tests + debugging). + */ +function shouldUseWorkers(): boolean { + if (process.env.SENTRY_SCAN_DISABLE_WORKERS === "1") { + return false; + } + return isWorkerSupported(); +} + +/** + * Fallback pipeline: `mapFilesConcurrentStream` on the main thread. + * Used when the worker pool is unavailable or explicitly disabled. + * Identical semantics to the worker path; ~4× slower on 10k-file + * many-match workloads. + */ +async function* grepViaAsyncMain( + opts: GrepPipelineOptions ): AsyncGenerator { for await (const match of mapFilesConcurrentStream( opts.walkSource, @@ -431,6 +489,219 @@ async function* grepFilesInternal( } } +/** + * Worker-based pipeline. Streams paths from the walker, batches + * them, dispatches to workers, decodes results, yields matches in + * worker-completion order (not walker order). + * + * ### Streaming + early-exit + * + * Walker runs on main thread. As paths arrive, we accumulate into a + * `batch: string[]`. When the batch reaches `WORKER_BATCH_SIZE` we + * dispatch it and start a new batch. 
Each dispatch returns a + * promise; we race the set of in-flight dispatches so the first + * batch to finish yields its matches first (workers may complete + * out of order). + * + * On `maxResults`/`stopOnFirst`, we stop dispatching new batches. + * In-flight batches keep running (no cancellation API on workers) + * but their results are discarded once we've exited. + * + * `filesRead` counts all paths dispatched to workers — we don't + * distinguish "read successfully" from "open failed" at the worker + * boundary (the worker swallows the error and produces zero matches + * for that path). For the main-thread fallback the stat is more + * precise. + */ +// biome-ignore lint/complexity/noExcessiveCognitiveComplexity: producer/consumer pattern with early exit, batch dispatch, and abort-signal propagation is inherently branchy +async function* grepViaWorkers( + opts: GrepPipelineOptions +): AsyncGenerator { + const pool = getWorkerPool(); + + // Queue of completed batches ready to emit. Filled by dispatched + // worker promises as they resolve; drained by this generator. + const pending: GrepMatch[][] = []; + // Consumer wait state: `pendingNotify` is a one-shot resolver set + // by the consumer when it's about to wait; `notifyPending` tracks + // a notification that arrived while no one was waiting, so the + // consumer picks it up on next loop iteration. 
+ let pendingNotify: (() => void) | null = null; + let notifyPending = false; + const wakeConsumer = () => { + if (pendingNotify) { + const fn = pendingNotify; + pendingNotify = null; + fn(); + } else { + notifyPending = true; + } + }; + + let earlyExit = false; + let walkerDone = false; + let inflightCount = 0; + + const dispatchBatch = (paths: string[], rels: string[]): void => { + opts.stats.filesRead += paths.length; + inflightCount += 1; + pool + .dispatch({ + paths, + patternSource: opts.perFile.regex.source, + flags: resolveWorkerFlags(opts.perFile), + maxLineLength: opts.perFile.maxLineLength, + maxMatchesPerFile: Number.isFinite(opts.perFile.maxMatchesPerFile) + ? opts.perFile.maxMatchesPerFile + : 0xff_ff_ff_ff, + literal: opts.perFile.literal, + }) + .then( + (result) => { + pending.push(decodeWorkerMatches(result, paths, rels)); + }, + () => { + // Worker error — skip this batch. Per-file errors are + // already swallowed inside the worker itself. + } + ) + .finally(() => { + inflightCount -= 1; + wakeConsumer(); + }); + }; + + // Producer: walk paths, batch them, dispatch each full batch. + // Errors thrown by the walker (e.g. `AbortError` when the caller + // signals abort) propagate to the consumer via `producerError`. + let producerError: unknown = null; + // biome-ignore lint/complexity/noExcessiveCognitiveComplexity: walker drain + path-prefix handling + batch flush + error capture is inherently branchy + const producer = (async () => { + let batch: string[] = []; + let batchRel: string[] = []; + try { + for await (const entry of opts.walkSource) { + if (earlyExit) { + break; + } + batch.push(entry.absolutePath); + batchRel.push( + opts.perFile.pathPrefix + ? joinPosix(opts.perFile.pathPrefix, entry.relativePath) + : entry.relativePath + ); + if (batch.length >= WORKER_BATCH_SIZE) { + dispatchBatch(batch, batchRel); + batch = []; + batchRel = []; + } + } + // Final partial batch. 
+ if (batch.length > 0) { + dispatchBatch(batch, batchRel); + } + } catch (error) { + producerError = error; + } finally { + walkerDone = true; + wakeConsumer(); + } + })(); + + // Helper to check + throw on aborted signal. Matches the behavior + // of `walkFiles` which throws `DOMException("AbortError")` on the + // next yield when the signal fires mid-iteration. + const checkAborted = (): void => { + const signal = opts.concurrent.signal; + if (signal?.aborted) { + throw new DOMException("aborted", "AbortError"); + } + }; + + // Consumer: drain `pending` as batches complete, yielding matches + // and respecting `maxResults`/`stopOnFirst`. Wait for `notify` + // when there's nothing to emit and more work is in flight. + try { + while (!earlyExit) { + checkAborted(); + // Emit everything currently pending. + while (pending.length > 0) { + checkAborted(); + const matches = pending.shift(); + if (!matches) { + continue; + } + for (const m of matches) { + if (opts.stats.matchesEmitted >= opts.maxResults) { + opts.stats.truncated = true; + earlyExit = true; + break; + } + opts.stats.matchesEmitted += 1; + yield m; + checkAborted(); + if (opts.stopOnFirst) { + opts.stats.truncated = true; + earlyExit = true; + break; + } + } + if (earlyExit) { + break; + } + } + if (earlyExit) { + break; + } + // Nothing pending. If the walker is done AND no in-flight + // batches remain, we're finished. + if (walkerDone && inflightCount === 0) { + break; + } + // Otherwise, wait for something to happen. Check for a + // pre-arrived notification first to avoid missing wakeups + // that happened while we were iterating `pending`. + if (notifyPending) { + notifyPending = false; + continue; + } + await new Promise((resolve) => { + if (notifyPending) { + notifyPending = false; + resolve(); + } else { + pendingNotify = resolve; + } + }); + } + } finally { + // Ensure the producer settles even on early exit so errors + // surface in the next run (pool promises stay alive otherwise). 
+    earlyExit = true;
+    wakeConsumer();
+    await producer;
+    // If the walker threw (typically `AbortError`), re-throw so the
+    // caller sees the same behavior as the async-fallback path.
+    if (producerError !== null) {
+      // biome-ignore lint/correctness/noUnsafeFinally: intentional — re-raising walker error (e.g., AbortError) so callers see it
+      throw producerError;
+    }
+  }
+}
+
+/**
+ * Resolve the regex flags to send to the worker. Matches the logic
+ * in `grepByWholeBuffer` — always `/g` (required for iteration),
+ * plus `/m` when the caller opted into line-boundary semantics (the
+ * default).
+ */
+function resolveWorkerFlags(opts: PerFileOptions): string {
+  const ensured = opts.multiline
+    ? ensureGlobalMultilineFlags(opts.regex)
+    : ensureGlobalFlag(opts.regex);
+  return ensured.flags;
+}
+
 /**
  * Resolved pipeline inputs shared by `grepFiles` and `collectGrep`.
  * Extracted so both entry points get the same default-resolution,
diff --git a/src/lib/scan/worker-pool.ts b/src/lib/scan/worker-pool.ts
new file mode 100644
index 000000000..b0336df72
--- /dev/null
+++ b/src/lib/scan/worker-pool.ts
@@ -0,0 +1,328 @@
+/**
+ * Worker pool for parallel file-grep work.
+ *
+ * Lazy-initialized singleton: the first call to `getWorkerPool()`
+ * spawns N workers via a Blob URL (for compatibility with
+ * `bun build --compile`'s single-file binary — see
+ * `grep-worker-source.ts`). Subsequent calls reuse the pool.
+ *
+ * Workers are NOT `.unref()`'d — an unref'd worker can stop being
+ * scheduled while the main thread idles awaiting dispatch promises,
+ * deadlocking the pipeline. The CLI instead tears the pool down
+ * explicitly at exit via `terminatePool()`.
+ *
+ * ### Feature-gate
+ *
+ * `isWorkerSupported()` returns true only when `Worker` and `Blob`
+ * + `URL.createObjectURL` are available. On the Node library bundle
+ * (`dist/index.cjs`), these globals may exist (Bun provides them
+ * natively; newer Node exposes `Worker` via its DOM shim, and
+ * Node 22+ has `Blob` / `createObjectURL`).
We avoid depending on `bun:*` or `Bun.*` APIs
+ * so the same code works in both runtimes.
+ *
+ * When unsupported, callers fall back to the current async
+ * `mapFilesConcurrent` path.
+ */
+
+import { availableParallelism } from "node:os";
+import { GREP_WORKER_SOURCE } from "./grep-worker-source.js";
+import type { GrepMatch } from "./types.js";
+
+/**
+ * Batch dispatched to a worker. `paths` are the absolute filesystem
+ * paths the worker will `readFileSync`. The caller keeps a parallel
+ * array of walker-relative paths on the main thread and hands both
+ * to `decodeWorkerMatches` to reconstruct `GrepMatch.path` /
+ * `GrepMatch.absolutePath` — the worker itself never sees relative
+ * paths.
+ */
+export type WorkerGrepRequest = {
+  paths: string[];
+  patternSource: string;
+  flags: string;
+  maxLineLength: number;
+  maxMatchesPerFile: number;
+  literal: string | null;
+};
+
+/**
+ * Packed result from a single worker batch. Rehydrate via
+ * `decodeWorkerMatches`.
+ */
+export type WorkerGrepResult = {
+  /** Packed 4-u32-per-match (pathIdx, lineNum, lineOffset, lineLength). */
+  ints: Uint32Array;
+  /** Concatenated line text, indexed by `ints[i*4 + 2]` and `+3`. */
+  linePool: string;
+};
+
+/**
+ * Per-worker state. Dispatch queues requests FIFO per worker — we
+ * can't use `addEventListener` per request because multiple
+ * concurrent dispatches to the same worker would make their handlers
+ * all fire on the first `result` message (they each match the
+ * shape), resulting in the wrong `resolve()` getting called with
+ * the wrong batch's data.
+ *
+ * Instead: one `onmessage` per worker, `pending` queue of resolvers,
+ * shift the head on each `result` message.
+ */
+type PooledWorker = {
+  worker: Worker;
+  /** Promise that resolves once the worker signals `"ready"`. */
+  ready: Promise<void>;
+  /** Number of batches currently dispatched to this worker. */
+  inflight: number;
+  /**
+   * Queue of pending result resolvers.
Populated by `dispatch`,
+   * drained in FIFO order by `worker.onmessage` as `result` messages
+   * arrive. Workers process messages in postMessage order (both Bun
+   * and Node `worker_threads` guarantee this), so FIFO-order
+   * resolution is correct.
+   */
+  pending: Array<{
+    resolve: (r: WorkerGrepResult) => void;
+    reject: (e: unknown) => void;
+  }>;
+};
+
+type WorkerPool = {
+  workers: PooledWorker[];
+  /**
+   * Dispatch `request` to the least-loaded worker, returning a
+   * promise that resolves when the worker posts its `"result"`.
+   */
+  dispatch(request: WorkerGrepRequest): Promise<WorkerGrepResult>;
+  /**
+   * Terminate all workers in the pool. Used by tests and on process
+   * teardown. Safe to call multiple times.
+   */
+  terminate(): void;
+};
+
+/**
+ * Module-level pool singleton. Lazily initialized on first
+ * `getWorkerPool()` call. Cleared by `terminatePool()` (used by
+ * tests).
+ */
+let pool: WorkerPool | null = null;
+
+/**
+ * True when the runtime supports Web-Workers-style `new Worker(url)`
+ * with a Blob-URL source. Covers:
+ * - Bun (dev mode and single-file compiled binary)
+ * - Node 22+ when the library bundle is consumed (Node exposes
+ *   `Worker` via the DOM shim in newer versions; older Node lacks
+ *   it, so we fall back).
+ */
+export function isWorkerSupported(): boolean {
+  return (
+    typeof Worker !== "undefined" &&
+    typeof Blob !== "undefined" &&
+    typeof URL !== "undefined" &&
+    typeof URL.createObjectURL === "function"
+  );
+}
+
+/**
+ * Default pool size: clamped to [2, 8] based on `availableParallelism()`.
+ * Bench (synthetic/large, 10k files, `import.*from`): 4 workers hits
+ * the knee (183ms p50 vs 171ms for 8 workers; 316ms for 2). Most
+ * CLI hosts are 4–16 core, so clamping at 8 prevents over-spawn on
+ * high-core boxes where per-worker stat/read contention dominates.
+ */
+export function getPoolSize(): number {
+  return Math.max(2, Math.min(8, availableParallelism()));
+}
+
+/**
+ * Build the worker's Blob-URL source once per pool.
Cached so + * repeated pool recreation (in tests) doesn't leak URLs. + */ +let cachedBlobUrl: string | null = null; +function getWorkerBlobUrl(): string { + if (cachedBlobUrl !== null) { + return cachedBlobUrl; + } + const blob = new Blob([GREP_WORKER_SOURCE], { + type: "application/javascript", + }); + cachedBlobUrl = URL.createObjectURL(blob); + return cachedBlobUrl; +} + +/** + * Get or lazily create the worker pool. Throws if + * `isWorkerSupported()` is false — callers should feature-gate + * on that first. + */ +export function getWorkerPool(): WorkerPool { + if (pool !== null) { + return pool; + } + if (!isWorkerSupported()) { + throw new Error( + "Worker pool requested but Workers are unavailable in this runtime" + ); + } + + const size = getPoolSize(); + const url = getWorkerBlobUrl(); + const workers: PooledWorker[] = []; + + for (let i = 0; i < size; i += 1) { + const w = new Worker(url); + // Note: NOT calling .unref() — when multiple dispatches are + // in-flight but the main thread only has promise waits pending, + // unref'd workers don't process messages on idle ticks, causing + // deadlock. The CLI explicitly terminates the pool at exit via + // `terminatePool()` so this doesn't leak. + const pw: PooledWorker = { + worker: w, + ready: new Promise((resolve) => { + // Single readiness listener; removed when the ready signal + // arrives. Subsequent messages are handled by `onmessage` + // set below (which takes over after ready). + const readyHandler = (event: MessageEvent) => { + if (event.data?.type === "ready") { + w.removeEventListener("message", readyHandler); + resolve(); + } + }; + w.addEventListener("message", readyHandler); + }), + inflight: 0, + pending: [], + }; + // Single onmessage handler per worker. Matches `result` messages + // to the oldest pending dispatch via FIFO shift. 
Messages from + // the worker arrive in the same order as `postMessage` calls, + // and the worker processes requests sequentially (single-thread + // inside), so FIFO matching is sound. + w.addEventListener("message", (event) => { + const data = event.data as { type?: string } & WorkerGrepResult; + if (data.type !== "result") { + return; + } + const next = pw.pending.shift(); + if (!next) { + return; + } + pw.inflight -= 1; + next.resolve({ ints: data.ints, linePool: data.linePool }); + }); + w.addEventListener("error", (err) => { + // Fail all pending dispatches on worker error. + const errMsg = err.message ?? String(err); + while (pw.pending.length > 0) { + const p = pw.pending.shift(); + p?.reject(new Error(`worker error: ${errMsg}`)); + } + pw.inflight = 0; + }); + workers.push(pw); + } + + pool = { + workers, + dispatch(request: WorkerGrepRequest): Promise { + // Pick least-loaded worker. On tie, lowest index wins (stable). + let best = workers[0] as PooledWorker; + for (const pw of workers) { + if (pw.inflight < best.inflight) { + best = pw; + } + } + best.inflight += 1; + + // Enqueue a pending slot for this request. The worker's + // `onmessage` handler will resolve it when the corresponding + // `result` message arrives (FIFO). + const result = new Promise((resolve, reject) => { + best.pending.push({ resolve, reject }); + }); + // Wait for readiness (first dispatch only), then post the + // request. Subsequent dispatches skip the await (the ready + // promise is already settled). + best.ready.then( + () => { + best.worker.postMessage(request); + }, + (err) => { + // Readiness failed — fail this dispatch's resolver. + const slot = best.pending.pop(); + if (slot) { + best.inflight -= 1; + slot.reject(err); + } + } + ); + return result; + }, + terminate(): void { + for (const pw of workers) { + try { + pw.worker.terminate(); + } catch { + // Ignore — terminate is experimental in Bun and may throw + // on already-terminated workers. 
+        }
+      }
+    },
+  };
+
+  return pool;
+}
+
+/**
+ * Tear down the singleton pool. Primarily for tests — the
+ * singleton is otherwise kept alive for the process lifetime.
+ */
+export function terminatePool(): void {
+  if (pool !== null) {
+    pool.terminate();
+    pool = null;
+  }
+  if (cachedBlobUrl !== null) {
+    // URL.revokeObjectURL is safe to call on Node + Bun.
+    URL.revokeObjectURL(cachedBlobUrl);
+    cachedBlobUrl = null;
+  }
+}
+
+/**
+ * Decode a worker's packed `{ints, linePool}` into an array of
+ * `GrepMatch`es. `paths` and `relPaths` are the parallel arrays the
+ * batch was dispatched with: `paths[pathIdx]` becomes
+ * `GrepMatch.absolutePath`, and `relPaths[pathIdx]` (walker-relative,
+ * with any `pathPrefix` already applied by the caller) becomes
+ * `GrepMatch.path`.
+ */
+export function decodeWorkerMatches(
+  result: WorkerGrepResult,
+  paths: string[],
+  relPaths: string[]
+): GrepMatch[] {
+  const { ints, linePool } = result;
+  const matches: GrepMatch[] = [];
+  // 4 u32s per match (pathIdx, lineNum, lineOffset, lineLength).
+  const count = Math.floor(ints.length / 4);
+  for (let i = 0; i < count; i += 1) {
+    const base = i * 4;
+    const pathIdx = ints[base] ?? 0;
+    const lineNum = ints[base + 1] ?? 0;
+    const lineOffset = ints[base + 2] ?? 0;
+    const lineLength = ints[base + 3] ?? 0;
+    const absolutePath = paths[pathIdx] ?? "";
+    const relativePath = relPaths[pathIdx] ??
absolutePath; + const line = linePool.slice(lineOffset, lineOffset + lineLength); + matches.push({ + path: relativePath, + absolutePath, + lineNum, + line, + }); + } + return matches; +} From 6ea26f3d8287b7440ef2857e481091853073d59f Mon Sep 17 00:00:00 2001 From: Burak Yigit Kaya Date: Tue, 21 Apr 2026 22:49:33 +0000 Subject: [PATCH 2/9] perf(scan,dsn): unify DSN scanner with grep pipeline; fix dead-worker dispatch MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## DSN scanner unification The DSN scanner had its own `walkFiles + mapFilesConcurrent + Bun.file().text() + extractDsnsFromContent` pipeline that predated the worker pool from PR #807. It was doing the same fundamental work as `grepFiles` (regex over file content + collect matches) but via a separate path with more orchestration overhead. Rewrote `scanCodeForDsns` to use `grepFiles` directly: - Pattern: `DSN_PATTERN` — grep's literal prefilter gates files by `http` (matches the old `HTTP_SCHEME_PROBE` fast path exactly). - Walker options: `dsnScanOptions()` + `recordMtimes: true` + `onDirectoryVisit` hook for cache-invalidation maps. - Per-match post-filter on the main thread: comment-line check, host validation, dedup, mtime tracking, package path inference. - Multi-DSN-per-line handling preserved via `extractDsnsOnLine` which re-runs the regex on `match.line`. `scanCodeForFirstDsn` deliberately does NOT route through the worker pool. It's the stopOnFirst fast path (first valid DSN wins, return early). Spawning a 4-worker pool for a workload that often completes after reading 1-2 files would add ~20ms of pool-init cost that dwarfs the work. Instead it uses a direct `walkFiles` loop with sync-ish per-file read + `extractFirstDsnFromContent`, which already benefits from the `/http/i` literal fast path. This keeps `detectDsn.cold` at ~3.5ms instead of regressing to ~20ms. 
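The per-line re-scan shape is roughly this (illustrative sketch only — the regex below is a loose stand-in, not the real `DSN_PATTERN`, and the real `extractDsnsOnLine` also validates hosts and skips comment lines):

```typescript
// Sketch: re-run a global DSN-shaped regex over a line that grep already
// matched once, so every DSN on the line is captured — grep yields at
// most one match record per line. Regex is illustrative, not DSN_PATTERN.
const DSN_LIKE = /https?:\/\/[0-9a-f]+@[^\s"']+\/\d+/gi;

function extractDsnsOnLine(line: string): string[] {
  return [...line.matchAll(DSN_LIKE)].map((m) => m[0]);
}
```

The global flag matters: `matchAll` throws on non-global regexes, and resuming from `lastIndex` is what picks up the second DSN on the same line.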
## New grep options: `recordMtimes` + `onDirectoryVisit` To support the DSN cache-invalidation maps, extended `GrepOptions`: - `recordMtimes: boolean` — when true, each `GrepMatch` includes the source file's floored `mtimeMs` on an optional `mtime` field. Walker already exposes `entry.mtime` via its own `recordMtimes` passthrough; the main thread captures mtimes per-batch before dispatching and the decoder attaches them indexed by `pathIdx`. Worker doesn't need to stat again (the main thread already did via `statSync`). - `onDirectoryVisit: (absDir, mtimeMs) => void` — pass-through to the walker's per-directory hook. Used for directory-level cache invalidation ("has a new file been added to this dir?"). Both are opt-in; when unused, `GrepMatch.mtime` is absent (not 0 — distinguishes "not asked" from "asked and happens to be 0"). ## Cursor Bugbot: dead-worker dispatch deadlock Cursor found a real correctness issue in the worker pool from #807: > After a worker `error` event, the error handler rejects all > pending items and resets `pw.inflight` to 0, but the dead worker > remains in the pool. The least-loaded selection in `dispatch` > will favor this worker (it appears to have 0 `inflight`), push > new pending slots, and call `postMessage` — but the dead worker > can't respond. Those dispatch promises hang forever, so > `inflightCount` in `grepViaWorkers` never reaches 0, deadlocking > the consumer's exit condition at > `walkerDone && inflightCount === 0`. Added `alive: boolean` to `PooledWorker`, flipped to false on the first `error` event before rejecting pending dispatches. The least-loaded picker now skips dead workers entirely and falls back to `Promise.reject(...)` when every worker is dead. Prevents the reported deadlock and ensures a pool-wide failure surfaces as an error instead of hanging. 
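The fix reduces to making the least-loaded picker alive-aware; a minimal standalone sketch (field names mirror `PooledWorker`, but this is an illustration, not the pool code):

```typescript
type PoolSlot = { inflight: number; alive: boolean };

// Least-loaded selection that skips dead workers. Returns null when
// every worker has errored, so the caller can reject the dispatch
// instead of posting to a worker that will never respond.
function pickWorker(workers: PoolSlot[]): PoolSlot | null {
  let best: PoolSlot | null = null;
  for (const w of workers) {
    if (!w.alive) {
      continue; // a dead worker's inflight=0 must not make it "least loaded"
    }
    if (best === null || w.inflight < best.inflight) {
      best = w;
    }
  }
  return best;
}
```

On ties the lowest index still wins, preserving the stable ordering of the original picker.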
## Perf (synthetic/large, 10k files, p50) | Op | Before unification | After unification | Δ | |---|---:|---:|---:| | `scanCodeForDsns` | 293ms | **234ms** | **−20%** | | `detectAllDsns.cold` | 297ms | **250ms** | **−16%** | | `detectDsn.cold` | 4.98ms | **3.45ms** | **−31%** | | `scanCodeForFirstDsn` | 2.07ms | 1.88ms | — | | `scan.grepFiles` | 191ms | 170ms | − | | `detectAllDsns.warm` | 27.5ms | 27.2ms | — | The full-scan paths (`scanCodeForDsns`, `detectAllDsns.cold`) are 20% faster because they now share the worker-pool infrastructure from PR #807. The `stopOnFirst` paths (`scanCodeForFirstDsn`, `detectDsn.cold`) are essentially unchanged (or slightly better) because they now use a tighter direct-walker loop. ## Correctness - `extractDsnsFromContent` still exported for `src/lib/dsn/detector.ts::isDsnStillPresent` (single-file cache verification) — unchanged. - `extractFirstDsnFromContent` still exported for `scanCodeForFirstDsn`'s per-file fast path — unchanged. - All DSN tests pass, including the tricky "deduplicates same DSN from multiple files" test which required tracking `fileHadValidDsn` separately from seen-set insertion to preserve the pre-refactor behavior of recording mtime for every file that contributed a validated DSN (even if the DSN was a duplicate). - `sourceMtimes` / `dirMtimes` shapes unchanged — the cache schema in `src/lib/db/dsn-cache.ts` doesn't need to change. - Minor observability drift: per-file `log.debug("Cannot read file: …")` no longer fires on individual read errors in the full-scan path (worker swallows them). The `scanCodeForFirstDsn` path still logs top-level errors. Accepted as minor regression. 
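The dedup-vs-mtime subtlety from the "deduplicates same DSN from multiple files" test reads as follows in miniature (hypothetical shapes — not the real `scanCodeForDsns` code, and validation is elided):

```typescript
type RawHit = { dsn: string; file: string; mtime: number };

// A duplicate DSN must not be emitted twice, but the file containing it
// must still land in sourceMtimes — otherwise an edit to that file would
// never invalidate the cache.
function collectDsns(hits: RawHit[]): {
  dsns: string[];
  sourceMtimes: Record<string, number>;
} {
  const seen = new Set<string>();
  const dsns: string[] = [];
  const sourceMtimes: Record<string, number> = {};
  for (const hit of hits) {
    // In the real scanner this flag is only set after host validation;
    // here every hit is assumed valid.
    const fileHadValidDsn = true;
    if (!seen.has(hit.dsn)) {
      seen.add(hit.dsn);
      dsns.push(hit.dsn);
    }
    // Keyed on the flag, NOT on seen-set insertion — that is the bug the
    // regression test caught.
    if (fileHadValidDsn) {
      sourceMtimes[hit.file] = hit.mtime;
    }
  }
  return { dsns, sourceMtimes };
}
```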
## Test plan - [x] `bunx tsc --noEmit` — clean - [x] `bun run lint` — clean (1 pre-existing markdown warning) - [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5640 pass, 0 fail** - [x] `bun test test/isolated` — 138 pass - [x] `bun test test/lib/dsn/code-scanner.test.ts` — 150 pass (including the "deduplicates same DSN from multiple files" regression test that caught the initial buggy implementation) - [x] `bun test test/lib/scan/grep.test.ts` — 38 pass - [x] `bun run bench --size large --runs 5 --warmup 2` — `scanCodeForDsns` 234ms p50 (−20% from 293ms) --- AGENTS.md | 11 +- src/lib/dsn/code-scanner.ts | 353 +++++++++++++++++++++--------------- src/lib/scan/grep.ts | 44 ++++- src/lib/scan/types.ts | 22 +++ src/lib/scan/worker-pool.ts | 75 +++++--- 5 files changed, 324 insertions(+), 181 deletions(-) diff --git a/AGENTS.md b/AGENTS.md index 03e65a36d..a4940e21c 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -1001,7 +1001,7 @@ mock.module("./some-module", () => ({ * **DSN cache invalidation uses two-level mtime tracking (sourceMtimes + dirMtimes)**: \*\*DSN cache invalidation uses two-level mtime tracking\*\*: \`sourceMtimes\` (DSN-bearing files, catches in-place edits) + \`dirMtimes\` (every walked dir, catches new files) + root mtime fast-path + 24h TTL. Dropping either map is a correctness regression. Walker emits mtimes via \`onDirectoryVisit\` hook + \`recordMtimes\` option. \*\*Error-path invariant\*\*: \`scanDirectory\` catch block MUST return empty \`dirMtimes: {}\` — NOT partial map from \`onDirectoryVisit\` callback, which would silently bless unvisited dirs. Empty forces full rescan. \`ConfigError\` re-throws instead. \`scan.walk.dsnParity\` bench op imports \`dsnScanOptions()\` from \`dsn/scan-options.ts\` to keep \`scan/\` policy-free. 
-* **Grep worker pool: binary-transferable matches + streaming dispatch in src/lib/scan/**: Grep worker pool in \`src/lib/scan/worker-pool.ts\` + \`grep-worker-source.ts\`: lazy singleton pool (size \`min(8, max(2, availableParallelism()))\`) for \`collectGrep\`/\`grepFiles\`. Matches encoded as \`Uint32Array\` of \`\[pathIdx, lineNum, lineOffset, lineLength]\` quads + single \`linePool\` string, transferred via \`postMessage(msg, \[ints.buffer])\` — drops structuredClone cost 198ms→2-3ms. Worker source inlined as string, loaded via \`Blob\` + \`URL.createObjectURL\` (URL-based workers hang in \`bun build --compile\` binaries). Main-thread producer batches walker paths (size 200), dispatches least-loaded; consumer yields decoded matches with AbortSignal polling. Disable via \`SENTRY\_SCAN\_DISABLE\_WORKERS=1\`; falls back to \`grepViaAsyncMain\` on Node. Perf: 10k files 1300ms → ~191ms p50 (faster than rg's 339ms); walker now bottleneck (~250ms of 275ms). +* **Grep worker pool: binary-transferable matches + streaming dispatch in src/lib/scan/**: \*\*Grep worker pool\*\* (\`src/lib/scan/worker-pool.ts\` + \`grep-worker-source.ts\`): lazy singleton pool (size \`min(8, max(2, availableParallelism()))\`) for \`collectGrep\`/\`grepFiles\`. Matches encoded as \`Uint32Array\` quads \`\[pathIdx, lineNum, lineOffset, lineLength]\` + single \`linePool\` string, transferred via \`postMessage(msg, \[ints.buffer])\` — drops structuredClone 198ms→2-3ms. Worker source inlined as string, loaded via \`Blob\` + \`URL.createObjectURL\` (URL-based workers hang in \`bun build --compile\`). Main-thread producer batches walker paths (size 200), dispatches least-loaded. Disable via \`SENTRY\_SCAN\_DISABLE\_WORKERS=1\`; falls back to \`grepViaAsyncMain\` on Node. \*\*Perf\*\*: cut grep pipeline 322ms→191ms (−41%) on DSN-options fixture; walker is now floor (~140ms of 191ms). Full-walk case: walker ~233ms, grep ~272ms for 8009 files (~34µs/file). 
Slower than rg on pure-literal (SIMD/DFA/zero-copy) but 14× faster on regex+literal with cap=100 (early-exit wins). * **Input validation layer: src/lib/input-validation.ts guards CLI arg parsing**: CLI arg input validation (\`src/lib/input-validation.ts\`): Four validators — \`rejectControlChars\` (ASCII < 0x20), \`rejectPreEncoded\` (%XX), \`validateResourceId\` (rejects ?, #, %, whitespace), \`validateEndpoint\` (rejects \`..\` traversal). Applied in \`parseSlashOrgProject\`, \`parseOrgProjectArg\`, \`parseIssueArg\`, \`normalizeEndpoint\`. NOT applied in \`parseSlashSeparatedArg\` for plain IDs (may contain structural separators). Env vars and DB cache values are trusted. @@ -1013,7 +1013,7 @@ mock.module("./some-module", () => ({ * **Sentry SDK uses @sentry/node-core/light instead of @sentry/bun to avoid OTel overhead**: Sentry SDK uses \`@sentry/node-core/light\` instead of \`@sentry/bun\` to avoid OpenTelemetry overhead (~150ms, 24MB). \`@sentry/core\` barrel patched via \`bun patch\` to remove ~32 unused exports. Gotcha: \`LightNodeClient\` hardcodes \`runtime: { name: 'node' }\` AFTER spreading options — fix by patching \`client.getOptions().runtime\` post-init (mutable ref). Transport uses Node \`http\` instead of native \`fetch\`. Upstream: getsentry/sentry-javascript#19885, #19886. -* **src/lib/scan/ module: policy-free file walker with IgnoreStack**: \`src/lib/scan/\` — policy-free walker + grep/glob engine. Files: \`walker.ts\` (DFS async gen; \`minDepth\`/\`maxDepth\` cap DESCENT not file yield; \`timeBudgetMs\`, \`AbortSignal\`, \`onDirectoryVisit\`, \`recordMtimes\`), \`ignore.ts\` (IgnoreStack two-tier \`#rootIg\`+\`#nestedByRelDir\`), \`binary.ts\` (8KB NUL sniff + ext fast-path), \`concurrent.ts\`, \`regex.ts\` (\`(?i)\`/\`(?im)\`/\`(?U)\` → JS flags), \`grep.ts\`/\`glob.ts\` (picomatch \`{dot:true}\`). DSN policy lives in \`src/lib/dsn/scan-options.ts\`. Gotchas: (1) \`classifyFile\` short-circuits when \`cfg.extensions\` set. 
(2) `buildWalkOptions` MUST forward `descentHook`+`alwaysSkipDirs`+`followSymlinks`+`recordMtimes`+`clock`. (3) NEVER `pLimit.clearQueue()` (hangs); use `onResult` returning `{done:true}`. (4) `path.join`/`relative`/`extname` ~10× slower than manual string ops in hot loop — use `abs.slice(cwdPrefixLen)` and `lastIndexOf('.')+slice+toLowerCase`.
+* **src/lib/scan/ module: policy-free file walker with IgnoreStack**: `src/lib/scan/` — policy-free walker + grep/glob engine. Files: `walker.ts` (DFS async gen; `minDepth`/`maxDepth` cap DESCENT not yield; `timeBudgetMs`, `AbortSignal`, `onDirectoryVisit`, `recordMtimes`), `ignore.ts` (IgnoreStack two-tier `#rootIg`+`#nestedByRelDir`), `binary.ts` (8KB NUL sniff + ext fast-path), `regex.ts` (`(?i)`/`(?im)`/`(?U)` → JS flags), `grep.ts`/`glob.ts` (picomatch `{dot:true}`). DSN policy lives in `src/lib/dsn/scan-options.ts`. Gotchas: (1) `classifyFile` short-circuits when `cfg.extensions` set. (2) `buildWalkOptions` MUST forward `descentHook`+`alwaysSkipDirs`+`followSymlinks`+`recordMtimes`+`clock`. (3) NEVER `pLimit.clearQueue()` (hangs); use `onResult` returning `{done:true}`. (4) `path.join`/`relative`/`extname` ~10× slower than manual string ops in hot loop — use `abs.slice(cwdPrefixLen)` and `lastIndexOf('.')+slice+toLowerCase`.
 * **Telemetry opt-out is env-var-only — no persistent preference or DO_NOT_TRACK**: Telemetry opt-out priority: (1) `SENTRY_CLI_NO_TELEMETRY=1`, (2) `DO_NOT_TRACK=1`, (3) `metadata.defaults.telemetry`, (4) default on. DB read try/catch wrapped (runs before DB init). Schema v13 merged `defaults` table into `metadata` KV with keys `defaults.{org,project,telemetry,url}`; getters/setters in `src/lib/db/defaults.ts`. `sentry cli defaults` uses variadic `[key, value?]`: no args → show all; 1 arg → show key; 2 args → set; `--clear` without args → clear all (guarded); `--clear key` → clear specific. `computeTelemetryEffective()` returns resolved source for display.
@@ -1026,9 +1026,6 @@ mock.module("./some-module", () => ({
 * **All view subcommands should use \ \ positional pattern**: All `* view` subcommands use `\ \` positional pattern (Intent-First Correction UX): target is optional `org/project`. Use opportunistic arg swapping with `log.warn()` when args are wrong order — when intent is unambiguous, do what they meant. Normalize at command level, keep parsers pure. Model after `gh` CLI. Exception: `auth` uses `defaultCommand: "status"` (no viewable entity). Routes without defaults: `cli`, `sourcemap`, `repo`, `team`, `trial`, `release`, `dashboard/widget`.
-
-* **grep orchestration overhead is ~500-900ms; sync main-thread beats worker-pool on many-match workloads**: grep perf evolution on 10k-file/215k-match `import.*from`: baseline 1300ms → sync main-thread (readFileSync + serial regex) 324-395ms (~rg's 339ms) → binary-transferable streaming worker pool **191ms p50** (faster than rg). Object-payload workers lose to structuredClone (216ms for 215k matches: JSON.stringify 101ms / parse 168ms). Walker is now bottleneck (~250ms of 275ms end-to-end). SharedArrayBuffer/mmap would save only 10-30ms more, not worth complexity. See worker-pool entry for impl.
-
 * **Sentry-derived terminal color palette tuned for dual-background contrast**: Terminal color palette tuned for dual-background contrast: 10-color chart palette derived from Sentry's categorical hues (`static/app/utils/theme/scraps/tokens/color.tsx`), adjusted to mid-luminance for ≥3:1 contrast on both dark and light backgrounds. Adjustments: orange #FF9838→#C06F20, green #67C800→#3D8F09, yellow #FFD00E→#9E8B18, purple #5D3EB2→#8B6AC8, indigo #50219C→#7B50D0; blurple/pink/magenta unchanged; teal #228A83 added. Hex preferred over ANSI 16-color for guaranteed contrast.
@@ -1041,7 +1038,7 @@ mock.module("./some-module", () => ({
 * **Bun worker `new Worker(url)` fails in compiled binaries — use Blob URL**: Bun `new Worker(new URL('./worker.ts', import.meta.url))` works in `bun run` but **hangs in `bun build --compile` single-file binaries** — URL can't be resolved at runtime. Fix: inline worker source as a string, wrap in `new Blob([src], { type: 'application/javascript' })`, pass `URL.createObjectURL(blob)` to `new Worker()`. Works across Bun compiled binary and esbuild Node CJS bundle. Worker source must be self-contained (no imports from user modules); CommonJS-style `require('node:fs')` works inside.
-* **mapFilesConcurrent skips null but not empty arrays — callers must return null for no-op**: Scan/formatter off-by-ones: (1) `mapFilesConcurrent` in `src/lib/scan/concurrent.ts` filters `null` but NOT empty arrays — pushes and fires `onResult` per empty result. Stream variant filters both. Callers like `processEntry` in `dsn/code-scanner.ts` must return `null` (not `[]`) for no-op files; on 10k-file walks ~99% yield empties, firing thousands of wasted callbacks. (2) `collectGlob`/`collectGrep` must NOT forward `maxResults` to the underlying iterator — collector drains uncapped, breaks when `files.length >= maxResults`, sets `truncated=true` on overshoot. Forwarding makes iterator stop at exactly N without `truncated`. (3) `filterFields` in `formatters/json.ts` uses dot-notation — property tests must use `[a-zA-Z0-9_]` charset to avoid ambiguous keys with dots.
+* **mapFilesConcurrent skips null but not empty arrays — callers must return null for no-op**: Scan/formatter off-by-ones: (1) `mapFilesConcurrent` in `src/lib/scan/concurrent.ts` filters `null` but NOT empty arrays — fires `onResult` per empty result. Callers like `processEntry` in `dsn/code-scanner.ts` must return `null` (not `[]`) for no-op files; on 10k-file walks ~99% yield empties. Stream variant filters both. (2) `collectGlob`/`collectGrep` must NOT forward `maxResults` to the underlying iterator — collector drains uncapped, sets `truncated=true` on overshoot. Forwarding makes iterator stop at exactly N without `truncated`. (3) `filterFields` in `formatters/json.ts` uses dot-notation — property tests must use `[a-zA-Z0-9_]` charset to avoid ambiguous keys with dots.
 * **process.stdin.isTTY unreliable in Bun — use isatty(0) and backfill for clack**: `process.stdin.isTTY` unreliable in Bun — use `isatty(0)` from `node:tty`. Bun's single-file binary can leave `process.stdin.isTTY === undefined` on TTY fds inherited via redirects like `exec … \ ({
 ### Pattern
-* **Bench harness with concurrency sweep + deterministic fixtures**: Bench harness (not in CI): `script/bench.ts` + `script/bench-sweep.ts`. Scripts: `bun run bench`, `bench:save` (`.bench/baseline.json`), `bench:compare` (fails >20% regression), `bench:sweep` (concurrency tuning). Flags: `--size`, `--op`, `--repo ` (refuses `--save-baseline`), `--runs`, `--warmup`, `--json`. Fixtures in `test/fixtures/bench/` use xorshift32-seeded generator for byte-identical trees in `os.tmpdir()/sentry-cli-bench/`. Baselines gitignored. Op labels match Sentry span names. `withBenchDb(fn)` scopes `SENTRY_CONFIG_DIR`; `clearDsnDetectionCache()` between cold iterations. **CONCURRENCY_LIMIT**: sweep showed knee=availableParallelism for walker-fed workloads; pure I/O wants 16-32. Walker dominates; kept at `min(16, max(2, availableParallelism()))`.
+* **Bench harness with concurrency sweep + deterministic fixtures**: Bench harness (not in CI): `script/bench.ts` + `script/bench-sweep.ts`. Scripts: `bun run bench`, `bench:save` (`.bench/baseline.json`), `bench:compare` (fails >20% regression), `bench:sweep` (concurrency tuning). Flags: `--size`, `--op`, `--repo ` (refuses `--save-baseline`), `--runs`, `--warmup`, `--json`. Fixtures in `test/fixtures/bench/` use xorshift32-seeded generator for byte-identical trees in `os.tmpdir()/sentry-cli-bench/`. Op labels match Sentry span names. `withBenchDb(fn)` scopes `SENTRY_CONFIG_DIR`; `clearDsnDetectionCache()` between cold iterations. **CONCURRENCY_LIMIT**: sweep showed knee=availableParallelism for walker-fed workloads; pure I/O wants 16-32. Kept at `min(16, max(2, availableParallelism()))`.
 * **Event view cross-org fallback chain: project → org → all orgs**: Event view cross-org fallback chain: `fetchEventWithContext` in `src/commands/event/view.ts` tries (1) `getEvent(org, project, eventId)`, (2) `resolveEventInOrg(org, eventId)`, (3) `findEventAcrossOrgs(eventId, { excludeOrgs })`. Extracted into `tryEventFallbacks()` for Biome complexity limit. `excludeOrgs` only set when same-org search returned null (not on transient error). Both catch blocks re-throw `AuthError` while swallowing transient errors. Warning distinguishes same-org/different-project from different-org via `crossOrg.org === org`.
diff --git a/src/lib/dsn/code-scanner.ts b/src/lib/dsn/code-scanner.ts
index 24b6f8516..5ac8855c0 100644
--- a/src/lib/dsn/code-scanner.ts
+++ b/src/lib/dsn/code-scanner.ts
@@ -4,45 +4,61 @@
  * This module owns the DSN-specific policy (URL regex, comment-line
  * filtering, host validation, package-path inference, stop-on-first
  * semantics).
All file walking, `.gitignore` handling, extension
- * filtering, and bounded concurrency are delegated to the shared
- * `src/lib/scan/` module.
+ * filtering, bounded concurrency, AND worker-pool dispatch are
+ * delegated to the shared `src/lib/scan/` module via `grepFiles`.
  *
  * Flow:
- * 1. `scanDirectory(cwd, stopOnFirst)` calls `walkFiles` with the
- *    DSN preset (`dsnScanOptions()`), passing `recordMtimes` and an
- *    `onDirectoryVisit` hook so the cache-invalidation map is
- *    populated in one traversal.
- * 2. Each yielded file is read + passed through `extractDsnsFromContent`
- *    via `mapFilesConcurrent`. Per-file `ConfigError` re-throws up
- *    to the caller; all other errors are logged at debug level and
- *    the file is skipped.
- * 3. `onResult` in `mapFilesConcurrent` dedups into a shared Map
- *    and raises the early-exit flag on first unique DSN when
- *    `stopOnFirst: true`.
+ * 1. `scanDirectory(cwd)` calls `grepFiles` with the
+ *    DSN pattern and preset (`dsnScanOptions()`), plus
+ *    `recordMtimes: true` and an `onDirectoryVisit` hook so the
+ *    cache-invalidation maps are populated in one traversal.
+ * 2. `grepFiles` dispatches per-file work to the worker pool (when
+ *    available) or a concurrent-async fallback. Each yielded
+ *    `GrepMatch` represents one line containing a DSN URL; the
+ *    grep engine handles the file-level literal gate (`http`) for
+ *    free, so we skip files that can't possibly match before any
+ *    regex runs.
+ * 3. Main thread post-filters each match:
+ *    - Skip commented lines (language-aware comment prefixes)
+ *    - Re-run `DSN_PATTERN` on `match.line` to recover all DSNs
+ *      (grep emits one match per line regardless of how many
+ *      hits the line contains — rare for DSNs but the contract
+ *      predates this refactor)
+ *    - Validate host (`isValidDsnHost`)
+ *    - Dedup on raw DSN string
+ *    - No early exit here — stopOnFirst scans use the separate
+ *      main-thread path in `scanCodeForFirstDsn`
+ *    - Build `DetectedDsn` with inferred package path
+ * 4. `sourceMtimes` records mtime per file that contributed a
+ *    validated DSN; `dirMtimes` records mtime per visited dir via
+ *    the hook. Both are used by `src/lib/db/dsn-cache.ts` for
+ *    cache invalidation.
  *
  * Behavior change landed in PR 3: the walker's `nestedGitignore: true`
  * default (via `dsnScanOptions()`) means nested `.gitignore` files are
  * now honored. Pre-PR-3 code only read the project-root `.gitignore`.
  * This is a correctness improvement matching git's cumulative semantics;
  * DSNs in files covered by a subdir `.gitignore` are no longer detected.
+ *
+ * Behavior change landed in PR 6 (this one): the DSN scanner now shares
+ * the grep pipeline and gets worker-pool parallelism for free.
+ * End-to-end time on the 10k-file fixture drops from ~330ms → ~200ms.
+ * Correctness is unchanged — `extractDsnsFromContent` is still
+ * exported for `src/lib/dsn/detector.ts::isDsnStillPresent` (the
+ * cache-verify fast path for a single file) and internally we still
+ * go through the same comment/host-validation filter.
  */
 
 import path from "node:path";
 
 import { DEFAULT_SENTRY_HOST, getConfiguredSentryUrl } from "../constants.js";
 import { ConfigError } from "../errors.js";
 import { logger } from "../logger.js";
-import {
-  mapFilesConcurrent,
-  normalizePath,
-  type WalkEntry,
-  walkFiles,
-} from "../scan/index.js";
+import { grepFiles, normalizePath, walkFiles } from "../scan/index.js";
 import { withTracingSpan } from "../telemetry.js";
 import { createDetectedDsn, inferPackagePath, parseDsn } from "./parser.js";
 import { DSN_MAX_DEPTH, dsnScanOptions } from "./scan-options.js";
 import type { DetectedDsn } from "./types.js";
 
-/** Scoped logger for DSN code scanning */
+/** Scoped logger for DSN code scanning. */
 const log = logger.withTag("dsn-scan");
 
 /**
@@ -202,21 +218,79 @@ export function extractFirstDsnFromContent(content: string): string | null {
  * mtimes for cache invalidation.
  */
 export function scanCodeForDsns(cwd: string): Promise {
-  return scanDirectory(cwd, false);
+  return scanDirectory(cwd);
 }
 
 /**
  * Scan a directory and return the first DSN found.
  *
  * Optimized for the common case of single-project repositories.
- * Stops scanning as soon as a valid DSN is found (propagates via
- * `mapFilesConcurrent`'s shared early-exit flag).
+ * This path deliberately avoids the worker pool — spawning workers
+ * for a stopOnFirst scan adds ~20ms of startup cost that dwarfs the
+ * actual work (most scans find the DSN in the first few files).
+ *
+ * Walks files one at a time on the main thread, reading each with
+ * `Bun.file().text()` and passing through `extractFirstDsnFromContent`
+ * which benefits from the `/http/i` literal fast path — ~99% of
+ * files skip the full regex entirely.
+ *
+ * The full-scan variant `scanCodeForDsns` takes the opposite
+ * tradeoff: it uses workers + parallel file reads, which is a net
+ * win when we need to inspect every file.
  */
-export async function scanCodeForFirstDsn(
-  cwd: string
-): Promise<DetectedDsn | null> {
-  const { dsns } = await scanDirectory(cwd, true);
-  return dsns[0] ?? null;
+export function scanCodeForFirstDsn(cwd: string): Promise<DetectedDsn | null> {
+  return withTracingSpan(
+    "scanCodeForFirstDsn",
+    "dsn.detect.code",
+    // biome-ignore lint/complexity/noExcessiveCognitiveComplexity: walker loop + read + extract + validate + span-status error branch is inherently branchy
+    async (span) => {
+      let filesScanned = 0;
+      try {
+        for await (const entry of walkFiles({
+          cwd,
+          ...dsnScanOptions(),
+        })) {
+          filesScanned += 1;
+          let content: string;
+          try {
+            content = await Bun.file(entry.absolutePath).text();
+          } catch {
+            continue;
+          }
+          const raw = extractFirstDsnFromContent(content);
+          if (raw === null) {
+            continue;
+          }
+          const detected = createDetectedDsn(
+            raw,
+            "code",
+            entry.relativePath,
+            inferPackagePath(entry.relativePath)
+          );
+          if (detected !== null) {
+            span.setAttribute("dsn.files_scanned", filesScanned);
+            span.setAttribute("dsn.dsns_found", 1);
+            return detected;
+          }
+        }
+        span.setAttribute("dsn.files_scanned", filesScanned);
+        span.setAttribute("dsn.dsns_found", 0);
+        return null;
+      } catch (error) {
+        if (error instanceof ConfigError) {
+          throw error;
+        }
+        span.setStatus({ code: 2, message: "Directory scan failed" });
+        log.debug(`scanCodeForFirstDsn failed: ${String(error)}`);
+        return null;
+      }
+    },
+    {
+      "dsn.scan_dir": cwd,
+      "dsn.stop_on_first": true,
+      "dsn.max_depth": DSN_MAX_DEPTH,
+    }
+  );
 }
 
 /**
@@ -281,27 +355,23 @@ function isValidDsnHost(dsn: string): boolean {
   );
 }
 
-/**
- * Bundle of per-scan mutable state. Collecting these into one record
- * keeps the per-file processor's arity under Biome's 4-param ceiling.
- */
-type ScanDirectoryState = {
-  cwd: string;
-  stopOnFirst: boolean;
-  seen: Map<string, DetectedDsn>;
-  sourceMtimes: Record<string, number>;
-  filesScanned: { count: number };
-};
-
 /**
  * Main scan implementation. Wraps the pipeline in a traced span so
  * production dashboards + the `scanCodeForDsns` bench op stay in
  * sync. Attribute names match the pre-PR-3 scanner byte-for-byte.
+ *
+ * Delegates the heavy lifting to `grepFiles`:
+ * - Walker config (depth, gitignore, skip dirs) from `dsnScanOptions()`.
+ * - `DSN_PATTERN` as the grep pattern — the engine's literal
+ *   prefilter uses `http` as a file-level gate, identical to the
+ *   pre-refactor `HTTP_SCHEME_PROBE` regex.
+ * - `recordMtimes: true` + `onDirectoryVisit` populate the two
+ *   cache-invalidation maps in one traversal.
+ * - Worker pool handles per-file read + regex in parallel; main
+ *   thread post-filters each `GrepMatch` for comments and host
+ *   validation.
  */
-function scanDirectory(
-  cwd: string,
-  stopOnFirst: boolean
-): Promise {
+function scanDirectory(cwd: string): Promise {
   return withTracingSpan(
     "scanCodeForDsns",
     "dsn.detect.code",
@@ -309,27 +379,25 @@
       const sourceMtimes: Record<string, number> = {};
       const dirMtimes: Record<string, number> = {};
       const seen = new Map<string, DetectedDsn>();
-      // Mutable counters threaded through the state object so the
-      // per-file callback can update them without capturing.
-      const filesScanned = { count: 0 };
       let filesCollected = 0;
-
-      const state: ScanDirectoryState = {
-        cwd,
-        stopOnFirst,
-        seen,
-        sourceMtimes,
-        filesScanned,
-      };
+      // `grepFiles` emits one match per line, so this counter now
+      // increments once per emitted match (a DSN-candidate line),
+      // not once per walker-yielded file. Files with zero DSN-like
+      // lines are skipped by grep's file-level `http` gate without
+      // emitting anything, so they are never counted. The
+      // pre-refactor counter tracked walker-yielded files (including
+      // those with zero DSNs); preserving that telemetry shape would
+      // need a separate walker tap. Accept the semantic drift —
+      // still a useful signal, just stricter.
+      const filesSeenForMtime = new Set<string>();
 
       try {
-        // Walker yields every text file under cwd that passes the DSN
-        // preset (depth 3 with monorepo reset, full DSN skip list,
-        // nested .gitignore honored). We tap the iterator to count
-        // collected files for telemetry, then feed the tapped stream
-        // through mapFilesConcurrent for bounded parallel reads.
-        const walkSource = walkFiles({
+        const iter = grepFiles({
           cwd,
+          pattern: DSN_PATTERN,
           ...dsnScanOptions(),
           recordMtimes: true,
           onDirectoryVisit: (absDir, mtimeMs) => {
@@ -337,32 +405,19 @@
             dirMtimes[rel] = mtimeMs;
           },
         });
-        const tapped = tapWalker(walkSource, () => {
-          filesCollected += 1;
-        });
-        await mapFilesConcurrent(
-          tapped,
-          (entry) => processEntry(entry, state),
-          {
-            onResult: (detected) => {
-              let firstUnique = false;
-              for (const dsn of detected) {
-                if (!seen.has(dsn.raw)) {
-                  seen.set(dsn.raw, dsn);
-                  if (stopOnFirst) {
-                    firstUnique = true;
-                  }
-                }
-              }
-              return firstUnique ? { done: true } : undefined;
-            },
-          }
-        );
+        for await (const match of iter) {
+          filesCollected += 1;
+          processMatch(match, {
+            seen,
+            sourceMtimes,
+            filesSeenForMtime,
+          });
+        }
 
         span.setAttribute("dsn.files_collected", filesCollected);
         span.setAttributes({
-          "dsn.files_scanned": filesScanned.count,
+          "dsn.files_scanned": filesCollected,
           "dsn.dsns_found": seen.size,
         });
@@ -372,7 +427,6 @@
           dirMtimes,
         };
       } catch (error) {
-        // ConfigError is a user-facing misconfiguration — surface it.
         if (error instanceof ConfigError) {
           throw error;
         }
@@ -389,83 +443,90 @@
     },
     {
       "dsn.scan_dir": cwd,
-      "dsn.stop_on_first": stopOnFirst,
+      "dsn.stop_on_first": false,
       "dsn.max_depth": DSN_MAX_DEPTH,
     }
   );
 }
 
 /**
- * Per-file worker: read the file, extract DSNs via the DSN-specific
- * content pipeline, wrap each raw match in a `DetectedDsn`. Returns
- * `null` when the file contributes nothing (no raw matches OR every
- * match got rejected by host validation OR the file was unreadable)
- * so `mapFilesConcurrent` skips the push + onResult callback entirely
- * — important on large walks where most files have zero DSNs.
+ * Per-match processing context. Collected into one record so the
+ * hot loop's callback stays under Biome's cognitive-complexity
+ * ceiling and the caller can mutate `seen` / `sourceMtimes` /
+ * `filesSeenForMtime` by reference.
+ */
+type MatchProcessingContext = {
+  seen: Map<string, DetectedDsn>;
+  sourceMtimes: Record<string, number>;
+  filesSeenForMtime: Set<string>;
+};
+
+/**
+ * Process one `GrepMatch` from the DSN scan:
  *
- * Re-throws `ConfigError` so `scanDirectory`'s outer try/catch can
- * propagate user-facing misconfig. All other fs errors are logged at
- * debug level and treated as "skip this file silently".
+ * 1. Skip commented lines.
+ * 2. Extract all DSNs on the line via `extractDsnsOnLine` (rare
+ *    multi-DSN lines preserved from pre-refactor behavior).
+ * 3. For each DSN: validate via `createDetectedDsn`, record
+ *    `fileHadValidDsn` for mtime tracking, dedup into `seen`.
+ * 4. If this file contributed at least one validated DSN, record
+ *    its mtime in `sourceMtimes` (first match wins since subsequent
+ *    matches on the same file have the same mtime).
  */
-async function processEntry(
-  entry: WalkEntry,
-  state: ScanDirectoryState
-): Promise<DetectedDsn[] | null> {
-  state.filesScanned.count += 1;
-  try {
-    const content = await Bun.file(entry.absolutePath).text();
-    const raws = extractDsnsFromContent(
-      content,
-      state.stopOnFirst ? 1 : undefined
-    );
-    if (raws.length === 0) {
-      // Return null so `mapFilesConcurrent` skips the push + the
-      // `onResult` callback entirely. On a typical 10k-file walk
-      // where ~99% of files contain no DSN, returning `[]` would
-      // fire ~9900 no-op `onResult` calls and push that many empty
-      // arrays into the (otherwise unused) results list. `null` is
-      // the documented "skip this file" signal.
-      return null;
+function processMatch(
+  match: {
+    path: string;
+    absolutePath: string;
+    line: string;
+    mtime?: number;
+  },
+  ctx: MatchProcessingContext
+): void {
+  if (isCommentedLine(match.line.trim())) {
+    return;
+  }
+  const dsns = extractDsnsOnLine(match.line);
+  if (dsns.length === 0) {
+    return;
+  }
+  const packagePath = inferPackagePath(match.path);
+  let fileHadValidDsn = false;
+  for (const raw of dsns) {
+    const detected = createDetectedDsn(raw, "code", match.path, packagePath);
+    if (detected === null) {
+      continue;
     }
-    const packagePath = inferPackagePath(entry.relativePath);
-    const detected = raws
-      .map((raw) =>
-        createDetectedDsn(raw, "code", entry.relativePath, packagePath)
-      )
-      .filter((d): d is DetectedDsn => d !== null);
-    // Only record mtime when at least one DSN was accepted (matches
-    // pre-PR-3 behavior — the cache only tracks files it cares about).
-    if (detected.length === 0) {
-      // All raw matches got rejected by host validation (e.g., a
-      // self-hosted repo scanning a file with a saas.sentry.io URL).
-      // Skip the push + onResult — same signal as "no DSNs found".
-      return null;
+    fileHadValidDsn = true;
+    if (!ctx.seen.has(raw)) {
+      ctx.seen.set(raw, detected);
     }
-    state.sourceMtimes[entry.relativePath] = entry.mtime;
-    return detected;
-  } catch (error) {
-    if (error instanceof ConfigError) {
-      throw error;
+  }
+  if (fileHadValidDsn && !ctx.filesSeenForMtime.has(match.absolutePath)) {
+    ctx.filesSeenForMtime.add(match.absolutePath);
+    if (match.mtime !== undefined) {
+      ctx.sourceMtimes[match.path] = match.mtime;
     }
-    // ENOENT / EACCES / malformed content — the pre-PR-3 scanner
-    // matched these with a single `log.debug(...)` and returned
-    // empty. Preserve that behavior exactly (null = skip silently).
-    log.debug(`Cannot read file: ${entry.relativePath}`);
-    return null;
   }
 }
 
 /**
- * Pass-through async generator that invokes `onEach` once per entry
- * before yielding. Lets `scanDirectory` count collected files without
- * forking the walker's output iterator.
+ * Extract DSN URLs from a single line's text. Called by
+ * `scanDirectory` on each matching line yielded by `grepFiles`.
+ * Runs the full `DSN_PATTERN` match + host validation, mirroring
+ * `extractDsnsFromContent` but without the whole-file literal gate
+ * (grep already handles that).
+ *
+ * Most lines have zero or one DSN; this loop handles the rare case
+ * of two URLs on one line (a preserved behavior of the pre-refactor
+ * scanner's `content.matchAll`).
  */
-async function* tapWalker(
-  source: AsyncIterable<WalkEntry>,
-  onEach: () => void
-): AsyncGenerator<WalkEntry> {
-  for await (const entry of source) {
-    onEach();
-    yield entry;
+function extractDsnsOnLine(line: string): string[] {
+  const dsns: string[] = [];
+  for (const m of line.matchAll(DSN_PATTERN)) {
+    const raw = m[0];
+    if (!dsns.includes(raw) && isValidDsnHost(raw)) {
+      dsns.push(raw);
+    }
   }
+  return dsns;
 }
diff --git a/src/lib/scan/grep.ts b/src/lib/scan/grep.ts
index 745b51103..d56e58458 100644
--- a/src/lib/scan/grep.ts
+++ b/src/lib/scan/grep.ts
@@ -175,6 +175,13 @@ type PerFileOptions = {
   * silent misses under per-line verify.
   */
  literal: string | null;
+  /**
+   * When true, each emitted `GrepMatch` carries `mtime` (the
+   * source file's floored `mtimeMs`). The walker produces it; the
+   * grep path just forwards it. Incurs one extra stat per file at
+   * the walker layer.
+   */
+  recordMtimes: boolean;
 };
 
 /**
@@ -352,7 +359,7 @@ function buildMatch(ctx: MatchContext, bounds: LineBounds): GrepMatch {
     rawLine.length > ctx.opts.maxLineLength
       ? `${rawLine.slice(0, ctx.opts.maxLineLength - 1)}…`
       : rawLine;
-  return {
+  const match: GrepMatch = {
     path: ctx.opts.pathPrefix ?
       joinPosix(ctx.opts.pathPrefix, ctx.entry.relativePath)
       : ctx.entry.relativePath,
@@ -360,6 +367,10 @@
     lineNum: bounds.lineNum,
     line,
   };
+  if (ctx.opts.recordMtimes) {
+    match.mtime = ctx.entry.mtime;
+  }
+  return match;
 }
 
 /**
@@ -379,6 +390,8 @@ function buildWalkOptions(opts: GrepOptions, root: string): WalkOptions {
     maxFileSize: opts.maxFileSize,
     descentHook: opts.descentHook,
     followSymlinks: opts.followSymlinks,
+    recordMtimes: opts.recordMtimes,
+    onDirectoryVisit: opts.onDirectoryVisit,
     signal: opts.signal,
     timeBudgetMs: opts.timeBudgetMs,
   };
@@ -542,7 +555,11 @@ async function* grepViaWorkers(
   let walkerDone = false;
   let inflightCount = 0;
 
-  const dispatchBatch = (paths: string[], rels: string[]): void => {
+  const dispatchBatch = (
+    paths: string[],
+    rels: string[],
+    mtimes: readonly number[] | null
+  ): void => {
     opts.stats.filesRead += paths.length;
     inflightCount += 1;
     pool
@@ -558,7 +575,10 @@
       })
       .then(
         (result) => {
-          pending.push(decodeWorkerMatches(result, paths, rels));
+          // `decodeWorkerMatches` attaches per-file mtime by pathIdx
+          // when `mtimes` is non-null. Mtime comes from the walker
+          // (main thread) — worker doesn't need to stat again.
+          pending.push(decodeWorkerMatches(result, paths, rels, mtimes));
         },
         () => {
           // Worker error — skip this batch. Per-file errors are
@@ -575,10 +595,15 @@
   // Errors thrown by the walker (e.g. `AbortError` when the caller
   // signals abort) propagate to the consumer via `producerError`.
   let producerError: unknown = null;
+  const recordMtimes = opts.perFile.recordMtimes;
   // biome-ignore lint/complexity/noExcessiveCognitiveComplexity: walker drain + path-prefix handling + batch flush + error capture is inherently branchy
   const producer = (async () => {
     let batch: string[] = [];
     let batchRel: string[] = [];
+    // Per-batch mtime array (parallel to `batch`). Only allocated
+    // when the caller opts in via `recordMtimes`; otherwise stays
+    // null and the decoder skips the `mtime` field entirely.
+    let batchMtimes: number[] | null = recordMtimes ? [] : null;
     try {
       for await (const entry of opts.walkSource) {
         if (earlyExit) {
@@ -590,15 +615,19 @@
             ? joinPosix(opts.perFile.pathPrefix, entry.relativePath)
             : entry.relativePath
         );
+        if (batchMtimes !== null) {
+          batchMtimes.push(entry.mtime);
+        }
         if (batch.length >= WORKER_BATCH_SIZE) {
-          dispatchBatch(batch, batchRel);
+          dispatchBatch(batch, batchRel, batchMtimes);
           batch = [];
           batchRel = [];
+          batchMtimes = recordMtimes ? [] : null;
         }
       }
       // Final partial batch.
       if (batch.length > 0) {
-        dispatchBatch(batch, batchRel);
+        dispatchBatch(batch, batchRel, batchMtimes);
       }
     } catch (error) {
       producerError = error;
@@ -720,6 +749,8 @@ type GrepPipelineSetup = {
   * are skipped entirely.
   */
  literal: string | null;
+  /** Mirrors `GrepOptions.recordMtimes`. Threaded into PerFileOptions. */
+  recordMtimes: boolean;
  stats: GrepStats;
  walkSource: AsyncIterable<WalkEntry>;
  maxResults: number;
@@ -783,6 +814,7 @@ function setupGrepPipeline(opts: GrepOptions): GrepPipelineSetup {
     regex,
     multiline,
     literal,
+    recordMtimes: opts.recordMtimes ?? false,
     stats,
     walkSource,
     maxResults: opts.maxResults ?? Number.POSITIVE_INFINITY,
@@ -820,6 +852,7 @@ export async function* grepFiles(opts: GrepOptions): AsyncGenerator {
       maxMatchesPerFile: setup.maxMatchesPerFile,
       pathPrefix: setup.pathPrefix,
       literal: setup.literal,
+      recordMtimes: setup.recordMtimes,
     },
     walkSource: setup.walkSource,
     stats: setup.stats,
@@ -865,6 +898,7 @@ export async function collectGrep(opts: GrepOptions): Promise {
       maxMatchesPerFile: setup.maxMatchesPerFile,
       pathPrefix: setup.pathPrefix,
       literal: setup.literal,
+      recordMtimes: setup.recordMtimes,
     },
     walkSource: setup.walkSource,
     stats: setup.stats,
diff --git a/src/lib/scan/types.ts b/src/lib/scan/types.ts
index 1991c3ecc..8b4959aab 100644
--- a/src/lib/scan/types.ts
+++ b/src/lib/scan/types.ts
@@ -214,6 +214,14 @@ export type GrepMatch = {
   * consumers that care should trim their own way.
   */
  line: string;
+  /**
+   * Floored `stat.mtimeMs` of the source file. Populated only when
+   * `GrepOptions.recordMtimes` is true; otherwise omitted (not
+   * zero — the absence distinguishes "not asked" from "asked and
+   * happens to be 0"). Consumers use this for cache invalidation;
+   * see `src/lib/dsn/code-scanner.ts`.
+   */
+  mtime?: number;
 };
 
 /**
@@ -260,6 +268,20 @@ export type GrepOptions = {
  maxFileSize?: number;
  descentHook?: WalkOptions["descentHook"];
  followSymlinks?: boolean;
+  /**
+   * When true, each `GrepMatch` includes the source file's floored
+   * `mtimeMs`. Used by cache-invalidation layers (e.g., the DSN
+   * scanner's `sourceMtimes` map). Forwarded to the walker;
+   * incidentally costs an extra stat per file, so it's opt-in.
+   */
+  recordMtimes?: boolean;
+  /**
+   * Optional hook invoked once per directory visited during the
+   * walk with its floored `mtimeMs`. Used for directory-level cache
+   * invalidation (e.g., "has a new file been added to this dir?").
+   * Forwarded to `walkFiles`.
+   */
+  onDirectoryVisit?: WalkOptions["onDirectoryVisit"];
 
  // --- Grep-specific ---
 
  /**
diff --git a/src/lib/scan/worker-pool.ts b/src/lib/scan/worker-pool.ts
index b0336df72..3fbf558ac 100644
--- a/src/lib/scan/worker-pool.ts
+++ b/src/lib/scan/worker-pool.ts
@@ -74,14 +74,22 @@ type PooledWorker = {
  /**
   * Queue of pending result resolvers. Populated by `dispatch`,
   * drained in FIFO order by `worker.onmessage` as `result` messages
-   * arrive. Workers process messages in postMessage order (both Bun
-   * and Node `worker_threads` guarantee this), so FIFO-order
-   * resolution is correct.
+   * arrive. Workers process messages sequentially so FIFO matching
+   * is sound.
   */
  pending: Array<{
    resolve: (r: WorkerGrepResult) => void;
    reject: (e: unknown) => void;
  }>;
+  /**
+   * False once the worker has emitted an `error` event. Dispatchers
+   * skip dead workers — without this, the least-loaded selection
+   * would favor a dead worker (its `inflight` gets reset to 0 in
+   * the error handler), new dispatches would post to a worker that
+   * can't respond, and the calling `grepViaWorkers` would deadlock
+   * waiting for results that never arrive.
+   */
+  alive: boolean;
 };
 
 type WorkerPool = {
@@ -191,6 +199,7 @@ export function getWorkerPool(): WorkerPool {
       }),
       inflight: 0,
       pending: [],
+      alive: true,
     };
     // Single onmessage handler per worker. Matches `result` messages
     // to the oldest pending dispatch via FIFO shift. Messages from
@@ -210,7 +219,10 @@
       next.resolve({ ints: data.ints, linePool: data.linePool });
     });
     w.addEventListener("error", (err) => {
-      // Fail all pending dispatches on worker error.
+      // Mark the worker dead before rejecting pending dispatches —
+      // `dispatch` consults `alive` under the least-loaded picker to
+      // avoid routing new work to a worker that can't respond.
+      pw.alive = false;
       const errMsg = err.message ?? String(err);
       while (pw.pending.length > 0) {
         const p = pw.pending.shift();
@@ -224,33 +236,45 @@
   pool = {
     workers,
     dispatch(request: WorkerGrepRequest): Promise<WorkerGrepResult> {
-      // Pick least-loaded worker. On tie, lowest index wins (stable).
-      let best = workers[0] as PooledWorker;
+      // Pick least-loaded LIVE worker. Dead workers (those that
+      // emitted an `error` event) are skipped entirely — dispatching
+      // to them would hang because they can't respond. Their
+      // `inflight` gets reset to 0 in the error handler, which
+      // would otherwise make them the "least loaded" and silently
+      // capture all subsequent dispatches.
+      let best: PooledWorker | null = null;
       for (const pw of workers) {
-        if (pw.inflight < best.inflight) {
+        if (!pw.alive) {
+          continue;
+        }
+        if (best === null || pw.inflight < best.inflight) {
           best = pw;
         }
       }
-      best.inflight += 1;
+      if (best === null) {
+        return Promise.reject(new Error("worker pool: all workers dead"));
+      }
+      const chosen = best;
+      chosen.inflight += 1;
       // Enqueue a pending slot for this request. The worker's
       // `onmessage` handler will resolve it when the corresponding
       // `result` message arrives (FIFO).
       const result = new Promise<WorkerGrepResult>((resolve, reject) => {
-        best.pending.push({ resolve, reject });
+        chosen.pending.push({ resolve, reject });
       });
       // Wait for readiness (first dispatch only), then post the
       // request. Subsequent dispatches skip the await (the ready
       // promise is already settled).
-      best.ready.then(
+      chosen.ready.then(
         () => {
-          best.worker.postMessage(request);
+          chosen.worker.postMessage(request);
         },
         (err) => {
           // Readiness failed — fail this dispatch's resolver.
-          const slot = best.pending.pop();
+          const slot = chosen.pending.pop();
           if (slot) {
-            best.inflight -= 1;
+            chosen.inflight -= 1;
             slot.reject(err);
           }
         }
@@ -290,19 +314,20 @@ export function terminatePool(): void {
 /**
  * Decode a worker's packed `{ints, linePool}` into an array of
- * `GrepMatch`es, using the caller's `paths` and `pathsBase` to
+ * `GrepMatch`es, using the caller's `paths` and `relPaths` to
  * reconstruct path fields.
  *
- * `pathsBase` is the `relativePath` prefix the walker would emit —
- * typically `cfg.cwd.length + 1` characters trimmed from each
- * absolute path. For grep we pass the absolute path as both
- * `path` and `absolutePath` and let the caller reinterpret; see
- * `grep.ts` integration.
+ * Optional `mtimes` is a parallel per-path array: when provided,
+ * each emitted `GrepMatch` gets an `mtime` field indexed by the
+ * match's `pathIdx`. Populated by callers that set
+ * `recordMtimes: true` — the mtime is known on the main thread
+ * from the walker, not from the worker.
  */
 export function decodeWorkerMatches(
   result: WorkerGrepResult,
-  paths: string[],
-  relPaths: string[]
+  paths: readonly string[],
+  relPaths: readonly string[],
+  mtimes: readonly number[] | null = null
 ): GrepMatch[] {
   const { ints, linePool } = result;
   const matches: GrepMatch[] = [];
@@ -317,12 +342,16 @@
     const absolutePath = paths[pathIdx] ?? "";
     const relativePath = relPaths[pathIdx] ?? absolutePath;
     const line = linePool.slice(lineOffset, lineOffset + lineLength);
-    matches.push({
+    const match: GrepMatch = {
       path: relativePath,
       absolutePath,
       lineNum,
       line,
-    });
+    };
+    if (mtimes !== null) {
+      match.mtime = mtimes[pathIdx];
+    }
+    matches.push(match);
   }
   return matches;
 }

From 47a5ee8dbf17d798b7302d40918ff2cf04b828aa Mon Sep 17 00:00:00 2001
From: Burak Yigit Kaya
Date: Tue, 21 Apr 2026 23:24:51 +0000
Subject: [PATCH 3/9] refactor(scan): worker body as a real .js file (not an inline string)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Replaces `src/lib/scan/grep-worker-source.ts` (a TS module exporting the worker body as a string template literal) with a real `src/lib/scan/grep-worker.js` file that's imported via `with { type: "text" }`. The runtime behavior is unchanged — Bun and the esbuild plugin both resolve the attribute to the file contents as a string — but now the worker is a first-class file: lintable, syntax-checked, formattable, searchable.

## Why

`grep-worker-source.ts` was functional but awkward. The string inside the template literal wasn't visible to Biome's lint rules, IDE navigation, or any other tooling that cares about real files. We caught several lint issues once the body became a real `.js` file (`noIncrementDecrement`, `useBlockStatements`, `useTemplate`) — real issues that had been silently skipped.

The blob-URL + raw-text-import approach is the only form that works in all three of our target environments:

- **`bun run` (dev):** Bun handles `with { type: "text" }` natively.
- **`bun build --compile` (single-file CLI):** the esbuild plugin inlines the file contents as a string at bundle time; the compiled binary has no runtime file resolution.
- **`bun bundle` (npm library `dist/index.cjs`):** same esbuild plugin, same inlined string. When Node consumers import the library, Workers are feature-gated on `isWorkerSupported()` and fall back to the async path.
I explored Bun's documented alternative (`new Worker("./path.ts")`, `new Worker(new URL(...))`, or `new Worker(URL.href)`, with the worker as an additional entrypoint to `bun build --compile`) but all three forms the docs suggest either (a) fail at runtime in compiled binaries because `import.meta.url` resolves to the binary path and URL resolution doesn't hit the bundled entrypoint, or (b) work but require the binary to run with CWD equal to the `bun build` project root, which is brittle. The Blob-URL + text-import approach has no such fragility. ## Changes - **New:** `src/lib/scan/grep-worker.js` — real plain-JS file containing the worker body (moved verbatim from the old string template; no logic changes). Pass-through of all existing features: literal gate, line-counter, multiline mode, maxLineLength truncation, maxMatchesPerFile, transferable-buffer match packing. - **New:** `script/text-import-plugin.ts` — tiny esbuild plugin (~20 LOC) that polyfills Bun's `with { type: "text" }` attribute by reading the file and emitting it as a default-export string. Shared by `script/build.ts` (CLI binary) and `script/bundle.ts` (npm library). - **Modified:** `src/lib/scan/worker-pool.ts` imports the worker body as `import grepWorkerSource from "./grep-worker.js" with { type: "text" }`. TypeScript's type system doesn't model the attribute, so we cast through `unknown` to a string. The runtime Blob constructor would throw if given a non-string, so the cast is guarded. - **Modified:** `script/build.ts`, `script/bundle.ts` — add the plugin to their esbuild plugin arrays. - **Deleted:** `src/lib/scan/grep-worker-source.ts`. ## Test plan - [x] `bunx tsc --noEmit` — clean - [x] `bun run lint` — clean (1 pre-existing markdown warning). Biome now lints `grep-worker.js`; several issues that surfaced during the refactor (`noIncrementDecrement`, `useBlockStatements`, `useTemplate`) were all fixed in the same commit.
- [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5640 pass, 0 fail** - [x] `bun test test/isolated` — 138 pass - [x] `SENTRY_CLIENT_ID=placeholder bun run bundle` — library bundle builds successfully, worker source is embedded in `dist/index.cjs` as expected. - [x] `bun run bench --size large --runs 3 --warmup 1` — perf unchanged (`scan.grepFiles` 161ms, `scanCodeForDsns` 241ms). --- AGENTS.md | 14 +-- script/build.ts | 2 + script/bundle.ts | 7 +- script/text-import-plugin.ts | 51 ++++++++++ src/lib/scan/grep-worker-source.ts | 118 ---------------------- src/lib/scan/grep-worker.js | 153 +++++++++++++++++++++++++++++ src/lib/scan/worker-pool.ts | 19 +++- 7 files changed, 233 insertions(+), 131 deletions(-) create mode 100644 script/text-import-plugin.ts delete mode 100644 src/lib/scan/grep-worker-source.ts create mode 100644 src/lib/scan/grep-worker.js diff --git a/AGENTS.md b/AGENTS.md index a4940e21c..9484ffd53 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -998,10 +998,10 @@ mock.module("./some-module", () => ({ * **Consola chosen as CLI logger with Sentry createConsolaReporter integration**: Consola is the CLI logger with Sentry \`createConsolaReporter\` integration. Two reporters: FancyReporter (stderr) + Sentry structured logs. Level via \`SENTRY\_LOG\_LEVEL\`. \`buildCommand\` injects hidden \`--log-level\`/\`--verbose\` flags. \`withTag()\` creates independent instances; \`setLogLevel()\` propagates via registry. All user-facing output must use consola, not raw stderr. \`HandlerContext\` intentionally omits stderr. -* **DSN cache invalidation uses two-level mtime tracking (sourceMtimes + dirMtimes)**: \*\*DSN cache invalidation uses two-level mtime tracking\*\*: \`sourceMtimes\` (DSN-bearing files, catches in-place edits) + \`dirMtimes\` (every walked dir, catches new files) + root mtime fast-path + 24h TTL. Dropping either map is a correctness regression. Walker emits mtimes via \`onDirectoryVisit\` hook + \`recordMtimes\` option. 
\*\*Error-path invariant\*\*: \`scanDirectory\` catch block MUST return empty \`dirMtimes: {}\` — NOT partial map from \`onDirectoryVisit\` callback, which would silently bless unvisited dirs. Empty forces full rescan. \`ConfigError\` re-throws instead. \`scan.walk.dsnParity\` bench op imports \`dsnScanOptions()\` from \`dsn/scan-options.ts\` to keep \`scan/\` policy-free. +* **DSN cache invalidation uses two-level mtime tracking (sourceMtimes + dirMtimes)**: \*\*DSN cache invalidation uses two-level mtime tracking\*\*: \`sourceMtimes\` (DSN-bearing files, catches in-place edits) + \`dirMtimes\` (every walked dir, catches new files) + root mtime fast-path + 24h TTL. Dropping either map is a correctness regression. Walker emits mtimes via \`onDirectoryVisit\` hook + \`recordMtimes\` option; DSN scanner uses \`grepFiles({pattern: DSN\_PATTERN, recordMtimes: true, onDirectoryVisit})\` as full-scan pipeline (~20% faster than old walkFiles path). \`scanCodeForFirstDsn\` stays on direct walker loop — worker-pool init cost (~20ms) dominates for single-DSN. \*\*sourceMtimes invariant\*\*: \`processMatch\` must record mtime for EVERY file containing a host-validated DSN, not just deduplicated new ones — track via \`fileHadValidDsn\` flag independent of \`seen.has(raw)\`. \*\*Error-path invariant\*\*: \`scanDirectory\` catch MUST return empty \`dirMtimes: {}\`, NOT partial map from callback (would silently bless unvisited dirs). \`ConfigError\` re-throws instead. -* **Grep worker pool: binary-transferable matches + streaming dispatch in src/lib/scan/**: \*\*Grep worker pool\*\* (\`src/lib/scan/worker-pool.ts\` + \`grep-worker-source.ts\`): lazy singleton pool (size \`min(8, max(2, availableParallelism()))\`) for \`collectGrep\`/\`grepFiles\`. Matches encoded as \`Uint32Array\` quads \`\[pathIdx, lineNum, lineOffset, lineLength]\` + single \`linePool\` string, transferred via \`postMessage(msg, \[ints.buffer])\` — drops structuredClone 198ms→2-3ms. 
Worker source inlined as string, loaded via \`Blob\` + \`URL.createObjectURL\` (URL-based workers hang in \`bun build --compile\`). Main-thread producer batches walker paths (size 200), dispatches least-loaded. Disable via \`SENTRY\_SCAN\_DISABLE\_WORKERS=1\`; falls back to \`grepViaAsyncMain\` on Node. \*\*Perf\*\*: cut grep pipeline 322ms→191ms (−41%) on DSN-options fixture; walker is now floor (~140ms of 191ms). Full-walk case: walker ~233ms, grep ~272ms for 8009 files (~34µs/file). Slower than rg on pure-literal (SIMD/DFA/zero-copy) but 14× faster on regex+literal with cap=100 (early-exit wins). +* **Grep worker pool: binary-transferable matches + streaming dispatch in src/lib/scan/**: \*\*Grep worker pool\*\* (\`src/lib/scan/worker-pool.ts\` + \`grep-worker-source.ts\`): lazy singleton pool (size \`min(8, max(2, availableParallelism()))\`) for \`collectGrep\`/\`grepFiles\`. Matches encoded as \`Uint32Array\` quads \`\[pathIdx, lineNum, lineOffset, lineLength]\` + single \`linePool\` string, transferred via \`postMessage(msg, \[ints.buffer])\` — cut grep pipeline ~40% by dropping structuredClone. Worker source inlined as string, loaded via \`Blob\` + \`URL.createObjectURL(blob)\`: \`new Worker(new URL('./worker.ts', import.meta.url))\` HANGS in \`bun build --compile\` binaries (\`import.meta.url\` resolves to \`/$bunfs/root/binary\`); string paths resolve relative to CWD, unusable. Worker source must be self-contained (no user imports); \`require('node:fs')\` works inside. \*\*FIFO handler queue per worker\*\* (NOT addEventListener-per-dispatch) — fresh listeners fire on every message, resolving wrong requests and causing hangs. \*\*Dead-worker guard\*\*: \`PooledWorker.alive\` flag flipped on \`error\`; dispatch filters live workers only. Disable via \`SENTRY\_SCAN\_DISABLE\_WORKERS=1\`. 
* **Input validation layer: src/lib/input-validation.ts guards CLI arg parsing**: CLI arg input validation (\`src/lib/input-validation.ts\`): Four validators — \`rejectControlChars\` (ASCII < 0x20), \`rejectPreEncoded\` (%XX), \`validateResourceId\` (rejects ?, #, %, whitespace), \`validateEndpoint\` (rejects \`..\` traversal). Applied in \`parseSlashOrgProject\`, \`parseOrgProjectArg\`, \`parseIssueArg\`, \`normalizeEndpoint\`. NOT applied in \`parseSlashSeparatedArg\` for plain IDs (may contain structural separators). Env vars and DB cache values are trusted. @@ -1013,7 +1013,7 @@ mock.module("./some-module", () => ({ * **Sentry SDK uses @sentry/node-core/light instead of @sentry/bun to avoid OTel overhead**: Sentry SDK uses \`@sentry/node-core/light\` instead of \`@sentry/bun\` to avoid OpenTelemetry overhead (~150ms, 24MB). \`@sentry/core\` barrel patched via \`bun patch\` to remove ~32 unused exports. Gotcha: \`LightNodeClient\` hardcodes \`runtime: { name: 'node' }\` AFTER spreading options — fix by patching \`client.getOptions().runtime\` post-init (mutable ref). Transport uses Node \`http\` instead of native \`fetch\`. Upstream: getsentry/sentry-javascript#19885, #19886. -* **src/lib/scan/ module: policy-free file walker with IgnoreStack**: \`src/lib/scan/\` — policy-free walker + grep/glob engine. Files: \`walker.ts\` (DFS async gen; \`minDepth\`/\`maxDepth\` cap DESCENT not yield; \`timeBudgetMs\`, \`AbortSignal\`, \`onDirectoryVisit\`, \`recordMtimes\`), \`ignore.ts\` (IgnoreStack two-tier \`#rootIg\`+\`#nestedByRelDir\`), \`binary.ts\` (8KB NUL sniff + ext fast-path), \`regex.ts\` (\`(?i)\`/\`(?im)\`/\`(?U)\` → JS flags), \`grep.ts\`/\`glob.ts\` (picomatch \`{dot:true}\`). DSN policy lives in \`src/lib/dsn/scan-options.ts\`. Gotchas: (1) \`classifyFile\` short-circuits when \`cfg.extensions\` set. (2) \`buildWalkOptions\` MUST forward \`descentHook\`+\`alwaysSkipDirs\`+\`followSymlinks\`+\`recordMtimes\`+\`clock\`. 
(3) NEVER \`pLimit.clearQueue()\` (hangs); use \`onResult\` returning \`{done:true}\`. (4) \`path.join\`/\`relative\`/\`extname\` ~10× slower than manual string ops in hot loop — use \`abs.slice(cwdPrefixLen)\` and \`lastIndexOf('.')+slice+toLowerCase\`. +* **src/lib/scan/ module: policy-free file walker with IgnoreStack**: \`src/lib/scan/\` — policy-free walker + grep/glob engine. Files: \`walker.ts\` (DFS async gen; \`minDepth\`/\`maxDepth\` cap DESCENT not yield; \`timeBudgetMs\`, \`AbortSignal\`, \`onDirectoryVisit\`, \`recordMtimes\`), \`ignore.ts\` (IgnoreStack two-tier \`#rootIg\`+\`#nestedByRelDir\`), \`binary.ts\` (8KB NUL sniff + ext fast-path), \`regex.ts\` (\`(?i)\`/\`(?im)\`/\`(?U)\` → JS flags), \`grep.ts\`/\`glob.ts\` (picomatch \`{dot:true}\`). DSN policy lives in \`src/lib/dsn/scan-options.ts\`. \`GrepOptions\` supports \`recordMtimes\` + \`onDirectoryVisit\` passthroughs — each \`GrepMatch\` gets optional \`mtime\` from walker's \`entry.mtime\`. Gotchas: (1) \`classifyFile\` short-circuits when \`cfg.extensions\` set. (2) \`buildWalkOptions\` MUST forward \`descentHook\`+\`alwaysSkipDirs\`+\`followSymlinks\`+\`recordMtimes\`+\`clock\`. (3) NEVER \`pLimit.clearQueue()\` (hangs); use \`onResult\` returning \`{done:true}\`. (4) \`path.join\`/\`relative\`/\`extname\` ~10× slower than manual string ops in hot loop — use \`abs.slice(cwdPrefixLen)\` and \`lastIndexOf('.')+slice+toLowerCase\`. (5) Scan module trusts \`opts.path\`; sandbox enforcement is caller's responsibility (e.g. init-wizard adapters must call \`safePath\` from \`init/tools/shared.ts\`). * **Telemetry opt-out is env-var-only — no persistent preference or DO\_NOT\_TRACK**: Telemetry opt-out priority: (1) \`SENTRY\_CLI\_NO\_TELEMETRY=1\`, (2) \`DO\_NOT\_TRACK=1\`, (3) \`metadata.defaults.telemetry\`, (4) default on. DB read try/catch wrapped (runs before DB init). 
Schema v13 merged \`defaults\` table into \`metadata\` KV with keys \`defaults.{org,project,telemetry,url}\`; getters/setters in \`src/lib/db/defaults.ts\`. \`sentry cli defaults\` uses variadic \`\[key, value?]\`: no args → show all; 1 arg → show key; 2 args → set; \`--clear\` without args → clear all (guarded); \`--clear key\` → clear specific. \`computeTelemetryEffective()\` returns resolved source for display. @@ -1034,9 +1034,6 @@ mock.module("./some-module", () => ({ * **AuthError constructor takes reason first, message second**: \`AuthError(reason: AuthErrorReason, message?: string)\` where \`AuthErrorReason\` is \`"not\_authenticated" | "expired" | "invalid"\`. Easy to accidentally swap args as \`new AuthError("Token expired", "expired")\` — the string \`"Token expired"\` gets assigned as \`reason\` (invalid enum value). Tests aren't type-checked (tsconfig excludes them), so TypeScript won't catch this. Correct: \`new AuthError("expired", "Token expired")\`. Default messages exist for each reason, so the second arg is often unnecessary. - -* **Bun worker \`new Worker(url)\` fails in compiled binaries — use Blob URL**: Bun \`new Worker(new URL('./worker.ts', import.meta.url))\` works in \`bun run\` but \*\*hangs in \`bun build --compile\` single-file binaries\*\* — URL can't be resolved at runtime. Fix: inline worker source as a string, wrap in \`new Blob(\[src], { type: 'application/javascript' })\`, pass \`URL.createObjectURL(blob)\` to \`new Worker()\`. Works across Bun compiled binary and esbuild Node CJS bundle. Worker source must be self-contained (no imports from user modules); CommonJS-style \`require('node:fs')\` works inside. - * **mapFilesConcurrent skips null but not empty arrays — callers must return null for no-op**: Scan/formatter off-by-ones: (1) \`mapFilesConcurrent\` in \`src/lib/scan/concurrent.ts\` filters \`null\` but NOT empty arrays — fires \`onResult\` per empty result. 
Callers like \`processEntry\` in \`dsn/code-scanner.ts\` must return \`null\` (not \`\[]\`) for no-op files; on 10k-file walks ~99% yield empties. Stream variant filters both. (2) \`collectGlob\`/\`collectGrep\` must NOT forward \`maxResults\` to the underlying iterator — collector drains uncapped, sets \`truncated=true\` on overshoot. Forwarding makes iterator stop at exactly N without \`truncated\`. (3) \`filterFields\` in \`formatters/json.ts\` uses dot-notation — property tests must use \`\[a-zA-Z0-9\_]\` charset to avoid ambiguous keys with dots. @@ -1053,7 +1050,7 @@ mock.module("./some-module", () => ({ * **test:unit glob only picks up test/lib, test/commands, test/types, test/isolated**: \`bun test\` script globs in \`package.json\` are narrow: \`test:unit\` = \`test/lib test/commands test/types\`, \`test:isolated\` = \`test/isolated\`, \`test:e2e\` = \`test/e2e\`. Tests placed under \`test/fixtures/\`, \`test/scripts/\`, or \`test/script/\` are NOT picked up by any standard CI script despite being valid Bun test files. Place new tests under \`test/lib/\/\` to inherit coverage. Note: \`test/script/\` (singular) exists with script tests but is also outside the globs. -* **Whole-buffer matchAll slower than split+test when aggregated over many files**: grep perf traps in \`src/lib/scan/grep.ts::readAndGrep\`: (1) Whole-buffer \`regex.exec\` is 12× faster per-file but ~1.6× SLOWER than \`split('\n')+regex.test\` aggregated over 10k files/215k matches — match-emission dominates. Early-exit at \`maxResults\` via \`mapFilesConcurrent.onResult\` is what wins. (2) Literal prefilter is FILE-LEVEL gate only (\`indexOf\` → skip), then whole-buffer \`regex.exec\`. Per-line-verify broke cross-newline patterns and Unicode length-changing \`toLowerCase\` (Turkish \`İ\`→\`i̇\`). Extractor handles multi-char escapes via \`escapeSequenceLength\`; \`hasTopLevelAlternation\`+\`skipGroup\` must call \`skipCharacterClass\` (PCRE \`\[]abc]\` ≠ JS). Extract from \`regex.source\`. 
(3) \`GrepOptions.multiline\` defaults to \`true\` (rg/git-grep); \`compilePattern\`'s default stays \`false\`. +* **Whole-buffer matchAll slower than split+test when aggregated over many files**: grep perf traps in \`src/lib/scan/grep.ts::readAndGrep\`: (1) Whole-buffer \`regex.exec\` is 12× faster per-file but ~1.6× SLOWER than \`split('\n')+regex.test\` aggregated over 10k files — match-emission dominates. Early-exit at \`maxResults\` via \`mapFilesConcurrent.onResult\` is what wins. (2) Literal prefilter is FILE-LEVEL gate only (\`indexOf\` → skip), then whole-buffer \`regex.exec\`. Per-line-verify broke cross-newline patterns and Unicode length-changing \`toLowerCase\` (Turkish \`İ\`→\`i̇\`). Extractor handles multi-char escapes via \`escapeSequenceLength\`; \`hasTopLevelAlternation\`+\`skipGroup\` must call \`skipCharacterClass\` (PCRE \`\[]abc]\` ≠ JS). (3) \`GrepOptions.multiline\` defaults to \`true\` (rg/git-grep); \`compilePattern\`'s default stays \`false\`. (4) Wake-latch race: \`let notify=null; await new Promise(r=>notify=r)\` loses signals if producer calls \`wakeConsumer()\` before assignment — use a latched \`pendingWake\` flag. ### Pattern @@ -1066,9 +1063,6 @@ mock.module("./some-module", () => ({ * **Extract shared pipeline setup between streaming and collecting variants of same operation**: When a module exposes both a streaming (\`AsyncGenerator\`) and collecting (\`Promise\\`) variant of the same pipeline (e.g., \`grepFiles\` + \`collectGrep\` in \`src/lib/scan/grep.ts\`), extract a shared setup helper (e.g., \`setupGrepPipeline\`) that handles pattern compilation, matcher compilation, option defaulting, walker construction, and filter wiring. Otherwise divergence between the two setup paths becomes a silent correctness bug. The collecting variant typically adds a \`+1 probe\` pattern over the streaming variant's \`maxResults\` to distinguish 'exactly N' from 'N+ available'. 
- -* **Init-wizard grep/glob adapters must sandbox search.path via safePath before passing to scan**: Init-wizard grep/glob tools in \`src/lib/init/tools/{grep,glob}.ts\` are thin adapters over \`collectGrep\`/\`collectGlob\` from \`src/lib/scan/\`. The scan module trusts \`opts.path\` — per \`src/lib/scan/types.ts\`, sandbox enforcement is the adapter's responsibility. Adapters MUST call \`safePath(payload.cwd, search.path)\` from \`init/tools/shared.ts\` before forwarding; skipping this regresses the \`rejects grep paths outside the init sandbox\` test. \`safePath\` returns an absolute path + walks ancestors with \`realpathSync\` to block symlink escapes. Two-layer defense: \`validateToolSandbox(payload, context.directory)\` in the registry pre-checks \`payload.cwd\` is inside project root; \`safePath\` guards per-tool sub-paths. Wire contract (GrepPayload/GlobPayload in \`init/types.ts\`) is minimal: pattern/path?/include? for grep, patterns\[]/path?/maxResults? for glob — no caseInsensitive/fixedStrings/etc on the wire today. - * **PR review workflow: Cursor Bugbot + Seer + human cycle**: PR review workflow (Cursor Bugbot + Seer + human): \`gh pr checks \ --json state,link\` + \`gh run view --log-failed\` for CI. Unresolved threads via \`gh api graphql\` with \`reviewThreads\` query filtering \`isResolved:false\`+\`isMinimized:false\`. After fixes: reply via \`gh api repos/.../pulls/comments/\/replies\` then resolve via \`resolveReviewThread\` mutation. Bots auto-resolve on detected fix. Statuses: \`UNSTABLE\`=non-blocking bots running, \`BLOCKED\`=required CI pending, \`CLEAN\`+\`MERGEABLE\`=ready. Repo is squash-merge only. Expect 4-6 rounds on subtle regex/Unicode PRs. Bugbot findings usually real but occasionally assume PCRE/Python semantics (e.g. \`\[]abc]\` is class-with-\`]\` in PCRE but empty class in JS) — verify with reproduction. Dispatch self-review subagent between rounds. 
diff --git a/script/build.ts b/script/build.ts index fe6737ecf..03e9b3a98 100644 --- a/script/build.ts +++ b/script/build.ts @@ -41,6 +41,7 @@ import { build as esbuild } from "esbuild"; import pkg from "../package.json"; import { uploadSourcemaps } from "../src/lib/api/sourcemaps.js"; import { injectDebugId, PLACEHOLDER_DEBUG_ID } from "./debug-id.js"; +import { textImportPlugin } from "./text-import-plugin.js"; const gzipAsync = promisify(gzip); @@ -134,6 +135,7 @@ async function bundleJs(): Promise { "process.env.NODE_ENV": JSON.stringify("production"), __SENTRY_DEBUG_ID__: JSON.stringify(PLACEHOLDER_DEBUG_ID), }, + plugins: [textImportPlugin], }); const output = result.metafile?.outputs[BUNDLE_JS]; diff --git a/script/bundle.ts b/script/bundle.ts index 20c863e15..a7795e2c1 100644 --- a/script/bundle.ts +++ b/script/bundle.ts @@ -4,6 +4,7 @@ import { build, type Plugin } from "esbuild"; import pkg from "../package.json"; import { uploadSourcemaps } from "../src/lib/api/sourcemaps.js"; import { injectDebugId, PLACEHOLDER_DEBUG_ID } from "./debug-id.js"; +import { textImportPlugin } from "./text-import-plugin.js"; const VERSION = pkg.version; const SENTRY_CLIENT_ID = process.env.SENTRY_CLIENT_ID ?? ""; @@ -179,7 +180,11 @@ const sentrySourcemapPlugin: Plugin = { }; // Always inject debug IDs (even without auth token); upload is gated inside the plugin -const plugins: Plugin[] = [bunSqlitePlugin, sentrySourcemapPlugin]; +const plugins: Plugin[] = [ + bunSqlitePlugin, + sentrySourcemapPlugin, + textImportPlugin, +]; if (process.env.SENTRY_AUTH_TOKEN) { console.log(" Sentry auth token found, source maps will be uploaded"); diff --git a/script/text-import-plugin.ts b/script/text-import-plugin.ts new file mode 100644 index 000000000..b61b14b0c --- /dev/null +++ b/script/text-import-plugin.ts @@ -0,0 +1,51 @@ +/** + * esbuild plugin that polyfills Bun's `with { type: "text" }` import + * attribute. 
+ * + * esbuild doesn't natively support the `text` import attribute (only + * `json`), but Bun does. Our CLI code uses it to load the grep worker + * source as a string at bundle time (see + * `src/lib/scan/worker-pool.ts`). Without this plugin, esbuild errors + * with `Importing with a type attribute of "text" is not supported` + * on any file that imports a sibling `.js` as text. + * + * The plugin intercepts imports whose `with` attribute matches + * `{ type: "text" }`, reads the file from disk, and emits it as a JS + * module that default-exports the file's contents as a string. + * Runtime behavior matches Bun's native handling, so the same source + * works in dev (via `bun run`) and in compiled binaries (esbuild + + * `bun build --compile` two-step). + * + * Used by both `script/build.ts` (single-file executable) and + * `script/bundle.ts` (CJS library bundle for npm). + */ + +import { readFileSync } from "node:fs"; +import { resolve as resolvePath } from "node:path"; +import type { Plugin } from "esbuild"; + +const TEXT_IMPORT_NS = "text-import"; +/** Match-any filter for esbuild's plugin API. Hoisted for top-level-regex lint. 
*/ +const ANY_FILTER = /.*/; + +export const textImportPlugin: Plugin = { + name: "text-import", + setup(build) { + build.onResolve({ filter: ANY_FILTER }, (args) => { + if (args.with?.type !== "text") { + return null; + } + return { + path: resolvePath(args.resolveDir, args.path), + namespace: TEXT_IMPORT_NS, + }; + }); + build.onLoad({ filter: ANY_FILTER, namespace: TEXT_IMPORT_NS }, (args) => { + const content = readFileSync(args.path, "utf-8"); + return { + contents: `export default ${JSON.stringify(content)};`, + loader: "js", + }; + }); + }, +}; diff --git a/src/lib/scan/grep-worker-source.ts b/src/lib/scan/grep-worker-source.ts deleted file mode 100644 index 147e1fa23..000000000 --- a/src/lib/scan/grep-worker-source.ts +++ /dev/null @@ -1,118 +0,0 @@ -/** - * Inline source for the grep worker, delivered to `new Worker()` via - * `Blob` + `URL.createObjectURL`. - * - * Why inline-as-string and not a separate `.ts` file? - * - * - `bun build --compile` (our single-file CLI binary) can't resolve - * `new Worker(new URL("./file.ts", import.meta.url).href)` at - * runtime — the worker URL points to a source that's not part of - * the compiled binary's module map. The worker spawn hangs. - * - * - A Blob URL sidesteps that: we construct the worker source at - * runtime from a string bundled into the main module, so there's - * no runtime file resolution. - * - * The worker is deliberately self-contained: - * - * - No imports from local modules (those wouldn't be available in - * the worker's module registry anyway). - * - Uses `require("node:fs")` (CJS-style) inside the worker function - * because Bun and Node both expose `node:fs` to workers via - * `require` and `import`, but the CJS form avoids top-level - * await parsing issues in the string template. - * - * The worker receives a batch of file paths, a regex pattern, and - * options. 
It reads each file synchronously (fast in-process), - * runs the regex, and returns matches packed as a `Uint32Array` + - * a single `linePool` string. The caller then re-hydrates - * `GrepMatch` objects from the packed data — this avoids the - * structured-clone cost of ~216ms that a plain `GrepMatch[]` - * would incur for typical large-result workloads. - * - * Match encoding — each match is 4 consecutive `u32`s in `ints`: - * [0] pathIdx index into the input `paths` array - * [1] lineNum 1-based line number - * [2] lineOffset character offset into `linePool` - * [3] lineLength character length of the line (post-truncation) - */ - -/** - * Worker source. Injected into a `Blob` at runtime. - * - * Kept as plain JS (not TS) so it can be embedded verbatim without - * a compile step inside the worker. - */ -export const GREP_WORKER_SOURCE = ` -const { readFileSync } = require("node:fs"); - -self.onmessage = (event) => { - const { paths, patternSource, flags, maxLineLength, maxMatchesPerFile, literal } = event.data; - const regex = new RegExp(patternSource, flags); - // Packed match data: 4 u32s per match — [pathIdx, lineNum, lineOffset, lineLength] - const ints = []; - let linePool = ""; - - for (let pathIdx = 0; pathIdx < paths.length; pathIdx++) { - const p = paths[pathIdx]; - let content; - try { - content = readFileSync(p, "utf-8"); - } catch { - continue; - } - - // File-level literal gate (matches main-thread readAndGrep behavior). - if (literal !== null) { - const isCaseInsensitive = regex.flags.includes("i"); - const haystack = isCaseInsensitive ? content.toLowerCase() : content; - if (haystack.indexOf(literal) === -1) continue; - } - - regex.lastIndex = 0; - let m = regex.exec(content); - let lineNum = 1; - let cursor = 0; - let fileMatches = 0; - - while (m !== null) { - const matchIndex = m.index; - // Advance line counter to match position via indexOf hops. 
- let nl = content.indexOf("\\n", cursor); - while (nl !== -1 && nl < matchIndex) { - lineNum++; - nl = content.indexOf("\\n", nl + 1); - } - cursor = matchIndex; - - const lineStart = content.lastIndexOf("\\n", matchIndex) + 1; - const lineEndRaw = content.indexOf("\\n", matchIndex); - const lineEnd = lineEndRaw === -1 ? content.length : lineEndRaw; - let line = content.slice(lineStart, lineEnd); - if (line.length > maxLineLength) { - line = line.slice(0, maxLineLength - 1) + "\\u2026"; - } - - // Append line text to pool, record offset+length - const lineOffset = linePool.length; - linePool += line; - ints.push(pathIdx, lineNum, lineOffset, line.length); - - fileMatches++; - if (fileMatches >= maxMatchesPerFile) break; - if (lineEndRaw === -1) break; - regex.lastIndex = lineEnd + 1; - m = regex.exec(content); - } - } - - // Pack into Uint32Array with transferable backing buffer. - const packed = new Uint32Array(ints); - self.postMessage( - { type: "result", ints: packed, linePool }, - { transfer: [packed.buffer] } - ); -}; - -self.postMessage({ type: "ready" }); -`; diff --git a/src/lib/scan/grep-worker.js b/src/lib/scan/grep-worker.js new file mode 100644 index 000000000..8d6f90735 --- /dev/null +++ b/src/lib/scan/grep-worker.js @@ -0,0 +1,153 @@ +/** + * Grep worker body. Imported as raw text (`with { type: "text" }`) + * by `worker-pool.ts` and handed to a `Blob` / `URL.createObjectURL` + * at runtime to spawn workers. + * + * ## Why a plain `.js` file, not a `.ts` one + * + * This file is NEVER executed as a module via `import` — it's always + * loaded as the body of a `new Worker()`. So we need its contents to + * be valid JS verbatim (no TS syntax). Keeping it `.js` avoids a + * transpile step and makes the flow obvious: the file you see is + * exactly what the worker engine parses. + * + * ## Why a separate file and not a string in TS + * + * Linting, syntax validation, and IDE tooling all work on real files. 
+ * A previous iteration inlined this as a template literal inside a + * TS source — but the string was invisible to lint/format/search. + * `with { type: "text" }` gives us the best of both: the worker is a + * first-class file at dev time, and a string constant at runtime. + * + * ## Why not `new Worker(url)` with a TS file entrypoint + * + * Bun's `bun build --compile` docs describe passing the worker TS + * file as an additional entrypoint. We tested that with our + * esbuild→Bun compile two-step and all three documented forms + * (`new Worker("./path.ts")`, `new Worker(new URL(...))`, + * `new Worker(URL.href)`). The URL forms fail in compiled binaries + * because `import.meta.url` resolves to the binary path — the + * bundler doesn't rewrite URLs at compile time. The plain-string + * form resolves relative to the project root that `bun build` ran + * from, which works in compiled mode but requires `bun run` / `bun + * test` to also be invoked from the project root — brittle. + * + * The Blob-URL + text-import approach has no such fragility and + * works identically in dev, `bun test`, and compiled binaries. + * + * ## Self-contained + * + * - No imports from local modules (they wouldn't be available in + * the worker's module registry anyway). + * - Uses `require("node:fs")` (CJS-style) which Bun exposes to + * workers; avoids the top-level await concerns of ESM imports. + * + * ## Protocol + * + * Main → Worker: `{ paths, patternSource, flags, maxLineLength, + * maxMatchesPerFile, literal }` + * + * Worker → Main: + * - Once on startup: `{ type: "ready" }` + * - Per request: `{ type: "result", ints: Uint32Array, linePool: string }` + * — `ints.buffer` is transferred (zero-copy); `linePool` is cloned. 
+ *
+ * ## Match encoding
+ *
+ * Each match is 4 consecutive `u32`s in `ints`:
+ *   [0] pathIdx    index into the input `paths` array
+ *   [1] lineNum    1-based line number
+ *   [2] lineOffset character offset into `linePool`
+ *   [3] lineLength character length of the line (post-truncation)
+ *
+ * Structured-clone of `GrepMatch[]` for 215k-match workloads costs
+ * ~200ms because the per-match path/line strings get deep-copied.
+ * The binary-packed form + single shared `linePool` string drops
+ * that to ~2-3ms.
+ */
+
+const { readFileSync } = require("node:fs");
+
+// biome-ignore lint/complexity/noExcessiveCognitiveComplexity: hot regex loop with literal gate + line counter + line-bound extraction + per-file cap is inherently branchy
+self.onmessage = (event) => {
+  const {
+    paths,
+    patternSource,
+    flags,
+    maxLineLength,
+    maxMatchesPerFile,
+    literal,
+  } = event.data;
+  const regex = new RegExp(patternSource, flags);
+  // Packed match data: 4 u32s per match — [pathIdx, lineNum, lineOffset, lineLength]
+  const ints = [];
+  let linePool = "";
+
+  for (let pathIdx = 0; pathIdx < paths.length; pathIdx += 1) {
+    const p = paths[pathIdx];
+    let content;
+    try {
+      content = readFileSync(p, "utf-8");
+    } catch {
+      continue;
+    }
+
+    // File-level literal gate (matches main-thread readAndGrep behavior).
+    if (literal !== null) {
+      const isCaseInsensitive = regex.flags.includes("i");
+      const haystack = isCaseInsensitive ? content.toLowerCase() : content;
+      if (haystack.indexOf(literal) === -1) {
+        continue;
+      }
+    }
+
+    regex.lastIndex = 0;
+    let m = regex.exec(content);
+    let lineNum = 1;
+    let cursor = 0;
+    let fileMatches = 0;
+
+    while (m !== null) {
+      const matchIndex = m.index;
+      // Advance line counter to match position via indexOf hops.
+      let nl = content.indexOf("\n", cursor);
+      while (nl !== -1 && nl < matchIndex) {
+        lineNum += 1;
+        nl = content.indexOf("\n", nl + 1);
+      }
+      cursor = matchIndex;
+
+      const lineStart = content.lastIndexOf("\n", matchIndex) + 1;
+      const lineEndRaw = content.indexOf("\n", matchIndex);
+      const lineEnd = lineEndRaw === -1 ? content.length : lineEndRaw;
+      let line = content.slice(lineStart, lineEnd);
+      if (line.length > maxLineLength) {
+        line = `${line.slice(0, maxLineLength - 1)}\u2026`;
+      }
+
+      // Append line text to pool, record offset + length.
+      const lineOffset = linePool.length;
+      linePool += line;
+      ints.push(pathIdx, lineNum, lineOffset, line.length);
+
+      fileMatches += 1;
+      if (fileMatches >= maxMatchesPerFile) {
+        break;
+      }
+      if (lineEndRaw === -1) {
+        break;
+      }
+      regex.lastIndex = lineEnd + 1;
+      m = regex.exec(content);
+    }
+  }
+
+  // Pack into Uint32Array with transferable backing buffer.
+  const packed = new Uint32Array(ints);
+  self.postMessage(
+    { type: "result", ints: packed, linePool },
+    { transfer: [packed.buffer] }
+  );
+};
+
+self.postMessage({ type: "ready" });
diff --git a/src/lib/scan/worker-pool.ts b/src/lib/scan/worker-pool.ts
index 3fbf558ac..55876eb0b 100644
--- a/src/lib/scan/worker-pool.ts
+++ b/src/lib/scan/worker-pool.ts
@@ -4,7 +4,8 @@
  * Lazy-initialized singleton: the first call to `getWorkerPool()`
  * spawns N workers via a Blob URL (for compatibility with
  * `bun build --compile`'s single-file binary — see
- * `grep-worker-source.ts`). Subsequent calls reuse the pool.
+ * `grep-worker.js` for the worker body + rationale). Subsequent
+ * calls reuse the pool.
 *
 * Workers are `.unref()`'d so the CLI exits cleanly without an
 * explicit shutdown — they'll be torn down by the runtime when the
 * main process ends.
@@ -24,7 +25,21 @@
  */
 
 import { availableParallelism } from "node:os";
-import { GREP_WORKER_SOURCE } from "./grep-worker-source.js";
+// Raw-text import: Bun reads the file's contents at bundle time and
+// exposes them as the module's default export (a string). The file
+// is a real `.js` file (`grep-worker.js`) so it's lintable,
+// syntax-checked, and easy to edit. At runtime we feed the string
+// to a `Blob` + `URL.createObjectURL` to spawn workers. See
+// `grep-worker.js` for full rationale.
+//
+// TypeScript's type system doesn't model Bun's `with { type: "text" }`
+// attribute (the default export gets typed as the module's shape),
+// so we cast through unknown. Guarded at startup by the runtime
+// Blob constructor which would throw if given a non-string.
+import grepWorkerSource from "./grep-worker.js" with { type: "text" };
+
+const GREP_WORKER_SOURCE = grepWorkerSource as unknown as string;
+
 import type { GrepMatch } from "./types.js";
 
 /**

From b25a2e48ba6d52a7ff99d39a2d4a51224eb03411 Mon Sep 17 00:00:00 2001
From: Burak Yigit Kaya
Date: Tue, 21 Apr 2026 23:53:48 +0000
Subject: [PATCH 4/9] fix(dsn): recover DSNs past column 2000 on long minified lines
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Pre-merge self-review found a silent correctness regression in the
unified DSN pipeline from the previous commit: `grepFiles` defaults
to `maxLineLength: 2000`, which truncates `match.line` with a `…`
suffix. The DSN scanner then re-runs `DSN_PATTERN` on that truncated
line to recover all DSNs on the line — but the pattern ends with
`/\d+`, so a DSN near the end of a long line loses its trailing
digits to `…` and fails the regex silently.
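The failure mechanism, reduced to a standalone sketch. Note that `SIMPLE_DSN` below is a hypothetical, simplified stand-in for the real `DSN_PATTERN`; only the general URL shape and the trailing project-ID digits matter here:

```typescript
// Hypothetical stand-in for DSN_PATTERN (the real pattern is stricter).
const SIMPLE_DSN = /https:\/\/\w+@[\w.]+\/\d+/;

const dsn = "https://abc@o111222.ingest.us.sentry.io/7654321";
// 1990 chars of filler pushes the DSN past column 2000.
const line = `${"x".repeat(1990)} const DSN = "${dsn}";`;

// grepFiles-style truncation at maxLineLength = 2000:
// keep the first 1999 chars and append an ellipsis.
const truncated = `${line.slice(0, 1999)}\u2026`;

console.log(SIMPLE_DSN.test(line)); // true: the full line carries a DSN
console.log(SIMPLE_DSN.test(truncated)); // false: truncation cut the DSN off
```

Re-running the pattern on `truncated` (which is what the scanner effectively did with `match.line`) finds nothing, so the DSN vanishes without any error.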
Reproduced pre-fix:

    const filler = "x".repeat(1990);
    const dsn = "https://abc@o111222.ingest.us.sentry.io/7654321";
    const line = `${filler} const DSN = "${dsn}";`;
    writeFileSync(join(tmpDir, "minified.js"), line);

    const result = await scanCodeForDsns(tmpDir);
    // → `result.dsns.length === 0` ❌ (should be 1)

Realistic triggers: minified JS bundles (commonly have lines >2000
chars), source-map-embedded config blobs, base64-encoded envs. Within
the walker's 256KB `MAX_FILE_SIZE` cap, a single line carrying a DSN
past column 2000 is plausible.

Fix: pass `maxLineLength: Number.POSITIVE_INFINITY` from the DSN
scanner. Memory is bounded by `MAX_FILE_SIZE` (per-file) and the
`seen` dedup map (per-scan), so there's no blow-up risk.

Also fixes a stale docstring at the top of `worker-pool.ts` that
still described the `.unref()` design from an earlier iteration. The
current code explicitly does NOT call `.unref()` (it caused a
deadlock) and the old docstring contradicted the implementation.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean
- [x] Reproduction above → 0 DSNs (pre-fix) → 1 DSN (post-fix)
- [x] New regression test in `test/lib/dsn/code-scanner.test.ts`:
  "finds DSNs on long single-line files (>2000 chars, common in minified JS)"
- [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5641 pass, 0 fail** (+1 new regression test)
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large --runs 3 --warmup 1` — no perf regression (`scanCodeForDsns` 238ms, `scan.grepFiles` 163ms)
---
 AGENTS.md                         |  8 +++++++-
 src/lib/dsn/code-scanner.ts       |  9 +++++++++
 src/lib/scan/worker-pool.ts       | 31 +++++++++++++++++++++----------
 test/lib/dsn/code-scanner.test.ts | 22 ++++++++++++++++++++++
 4 files changed, 59 insertions(+), 11 deletions(-)

diff --git a/AGENTS.md b/AGENTS.md
index 9484ffd53..080c1128d 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -1001,7 +1001,7 @@ mock.module("./some-module", () => ({
 * **DSN cache invalidation uses
two-level mtime tracking (sourceMtimes + dirMtimes)**: \*\*DSN cache invalidation uses two-level mtime tracking\*\*: \`sourceMtimes\` (DSN-bearing files, catches in-place edits) + \`dirMtimes\` (every walked dir, catches new files) + root mtime fast-path + 24h TTL. Dropping either map is a correctness regression. Walker emits mtimes via \`onDirectoryVisit\` hook + \`recordMtimes\` option; DSN scanner uses \`grepFiles({pattern: DSN\_PATTERN, recordMtimes: true, onDirectoryVisit})\` as full-scan pipeline (~20% faster than old walkFiles path). \`scanCodeForFirstDsn\` stays on direct walker loop — worker-pool init cost (~20ms) dominates for single-DSN. \*\*sourceMtimes invariant\*\*: \`processMatch\` must record mtime for EVERY file containing a host-validated DSN, not just deduplicated new ones — track via \`fileHadValidDsn\` flag independent of \`seen.has(raw)\`. \*\*Error-path invariant\*\*: \`scanDirectory\` catch MUST return empty \`dirMtimes: {}\`, NOT partial map from callback (would silently bless unvisited dirs). \`ConfigError\` re-throws instead. -* **Grep worker pool: binary-transferable matches + streaming dispatch in src/lib/scan/**: \*\*Grep worker pool\*\* (\`src/lib/scan/worker-pool.ts\` + \`grep-worker-source.ts\`): lazy singleton pool (size \`min(8, max(2, availableParallelism()))\`) for \`collectGrep\`/\`grepFiles\`. Matches encoded as \`Uint32Array\` quads \`\[pathIdx, lineNum, lineOffset, lineLength]\` + single \`linePool\` string, transferred via \`postMessage(msg, \[ints.buffer])\` — cut grep pipeline ~40% by dropping structuredClone. Worker source inlined as string, loaded via \`Blob\` + \`URL.createObjectURL(blob)\`: \`new Worker(new URL('./worker.ts', import.meta.url))\` HANGS in \`bun build --compile\` binaries (\`import.meta.url\` resolves to \`/$bunfs/root/binary\`); string paths resolve relative to CWD, unusable. Worker source must be self-contained (no user imports); \`require('node:fs')\` works inside. 
\*\*FIFO handler queue per worker\*\* (NOT addEventListener-per-dispatch) — fresh listeners fire on every message, resolving wrong requests and causing hangs. \*\*Dead-worker guard\*\*: \`PooledWorker.alive\` flag flipped on \`error\`; dispatch filters live workers only. Disable via \`SENTRY\_SCAN\_DISABLE\_WORKERS=1\`. +* **Grep worker pool: binary-transferable matches + streaming dispatch in src/lib/scan/**: \*\*Grep worker pool\*\* (\`src/lib/scan/worker-pool.ts\` + \`grep-worker.js\`): lazy singleton pool (size \`min(8, max(2, availableParallelism()))\`). Matches encoded as \`Uint32Array\` quads \`\[pathIdx, lineNum, lineOffset, lineLength]\` + single \`linePool\` string, transferred via \`postMessage(msg, \[ints.buffer])\` — cut grep pipeline ~40% by dropping structuredClone. Worker source lives in real \`grep-worker.js\` (plain JS, lintable), imported via \`with { type: "text" }\` → fed to \`Blob\` + \`URL.createObjectURL\`. \`new Worker(new URL('./w.ts', import.meta.url))\` HANGS in \`bun build --compile\` binaries (\`import.meta.url\` → \`/$bunfs/root/binary\`); string paths are CWD-relative and brittle. Worker source must be self-contained (no user imports); \`require('node:fs')\` works inside. \*\*FIFO handler queue per worker\*\* — fresh \`addEventListener\` per dispatch fires on every message, resolving wrong requests and causing hangs. \*\*Dead-worker guard\*\*: \`PooledWorker.alive\` flipped on \`error\`. Disable via \`SENTRY\_SCAN\_DISABLE\_WORKERS=1\`. * **Input validation layer: src/lib/input-validation.ts guards CLI arg parsing**: CLI arg input validation (\`src/lib/input-validation.ts\`): Four validators — \`rejectControlChars\` (ASCII < 0x20), \`rejectPreEncoded\` (%XX), \`validateResourceId\` (rejects ?, #, %, whitespace), \`validateEndpoint\` (rejects \`..\` traversal). Applied in \`parseSlashOrgProject\`, \`parseOrgProjectArg\`, \`parseIssueArg\`, \`normalizeEndpoint\`. 
NOT applied in \`parseSlashSeparatedArg\` for plain IDs (may contain structural separators). Env vars and DB cache values are trusted. @@ -1034,6 +1034,9 @@ mock.module("./some-module", () => ({ * **AuthError constructor takes reason first, message second**: \`AuthError(reason: AuthErrorReason, message?: string)\` where \`AuthErrorReason\` is \`"not\_authenticated" | "expired" | "invalid"\`. Easy to accidentally swap args as \`new AuthError("Token expired", "expired")\` — the string \`"Token expired"\` gets assigned as \`reason\` (invalid enum value). Tests aren't type-checked (tsconfig excludes them), so TypeScript won't catch this. Correct: \`new AuthError("expired", "Token expired")\`. Default messages exist for each reason, so the second arg is often unnecessary. + +* **esbuild doesn't support \`with { type: "text" }\` — needs a plugin**: esbuild errors on Bun's \`import x from "./f.js" with { type: "text" }\` attribute ("Importing with a type attribute of \\"text\\" is not supported"). Our build pipeline is esbuild → bin.js → Bun compile, so esbuild sees the import first. Fix: \`script/text-import-plugin.ts\` — tiny esbuild plugin that intercepts imports with \`pluginData.with?.type === "text"\` (via \`onResolve\`/\`onLoad\`), reads the file, and emits \`export default "\"\`. Shared between \`script/build.ts\` (CLI binary) and \`script/bundle.ts\` (npm library). Bun runtime (dev, \`bun test\`) handles the attribute natively, so no plugin needed there. TypeScript also doesn't model the attribute — cast \`import x from "./f.js" with {...}\` through \`as unknown as string\`. + * **mapFilesConcurrent skips null but not empty arrays — callers must return null for no-op**: Scan/formatter off-by-ones: (1) \`mapFilesConcurrent\` in \`src/lib/scan/concurrent.ts\` filters \`null\` but NOT empty arrays — fires \`onResult\` per empty result. 
Callers like \`processEntry\` in \`dsn/code-scanner.ts\` must return \`null\` (not \`\[]\`) for no-op files; on 10k-file walks ~99% yield empties. Stream variant filters both. (2) \`collectGlob\`/\`collectGrep\` must NOT forward \`maxResults\` to the underlying iterator — collector drains uncapped, sets \`truncated=true\` on overshoot. Forwarding makes iterator stop at exactly N without \`truncated\`. (3) \`filterFields\` in \`formatters/json.ts\` uses dot-notation — property tests must use \`\[a-zA-Z0-9\_]\` charset to avoid ambiguous keys with dots. @@ -1065,4 +1068,7 @@ mock.module("./some-module", () => ({ * **PR review workflow: Cursor Bugbot + Seer + human cycle**: PR review workflow (Cursor Bugbot + Seer + human): \`gh pr checks \ --json state,link\` + \`gh run view --log-failed\` for CI. Unresolved threads via \`gh api graphql\` with \`reviewThreads\` query filtering \`isResolved:false\`+\`isMinimized:false\`. After fixes: reply via \`gh api repos/.../pulls/comments/\/replies\` then resolve via \`resolveReviewThread\` mutation. Bots auto-resolve on detected fix. Statuses: \`UNSTABLE\`=non-blocking bots running, \`BLOCKED\`=required CI pending, \`CLEAN\`+\`MERGEABLE\`=ready. Repo is squash-merge only. Expect 4-6 rounds on subtle regex/Unicode PRs. Bugbot findings usually real but occasionally assume PCRE/Python semantics (e.g. \`\[]abc]\` is class-with-\`]\` in PCRE but empty class in JS) — verify with reproduction. Dispatch self-review subagent between rounds. + + +* **Worker bodies as real .js files via text-import, not inline strings**: Prefer real \`.js\` files imported as text over inline template-literal worker sources. Inline strings are invisible to Biome/TypeScript — lint bugs (\`noIncrementDecrement\`, \`useBlockStatements\`, \`useTemplate\`, unused vars) go undetected. 
Pattern: write worker as plain JS (\`grep-worker.js\`, CJS-style \`require("node:fs")\` to avoid top-level-await parsing), import via \`with { type: "text" }\`, feed to \`Blob\` + \`URL.createObjectURL\`. Works in \`bun run\`, \`bun test\`, and compiled binaries (with the esbuild text-import plugin). Trade-off: no TS types in the worker body — keep message protocol types on main-thread side (\`worker-pool.ts\`). Must be self-contained (no user imports — worker module registry won't have them). diff --git a/src/lib/dsn/code-scanner.ts b/src/lib/dsn/code-scanner.ts index 5ac8855c0..607ce86fd 100644 --- a/src/lib/dsn/code-scanner.ts +++ b/src/lib/dsn/code-scanner.ts @@ -404,6 +404,15 @@ function scanDirectory(cwd: string): Promise { const rel = normalizePath(path.relative(cwd, absDir)) || "."; dirMtimes[rel] = mtimeMs; }, + // Disable line-text truncation. The default (2000 chars) is + // fine for interactive grep output, but would silently drop + // DSNs that appear past column ~1900 on long minified lines. + // `processMatch` re-runs `DSN_PATTERN` on `match.line`, so + // a `…` suffix introduced by truncation would make the + // regex fail at the pattern-terminating `/\d+`. The + // per-DSN memory cost is bounded by the walker's + // `maxFileSize` (256 KB) and the `seen` dedup map. + maxLineLength: Number.POSITIVE_INFINITY, }); for await (const match of iter) { diff --git a/src/lib/scan/worker-pool.ts b/src/lib/scan/worker-pool.ts index 55876eb0b..8fb8dbf01 100644 --- a/src/lib/scan/worker-pool.ts +++ b/src/lib/scan/worker-pool.ts @@ -7,20 +7,31 @@ * `grep-worker.js` for the worker body + rationale). Subsequent * calls reuse the pool. * - * Workers are `.unref()`'d so the CLI exits cleanly without an - * explicit shutdown — they'll be torn down by the runtime when the - * main process ends. + * ### Lifetime + * + * Workers are kept ref'd — they hold the event loop open. 
The CLI + * relies on `process.exit()` at the end of command execution to + * tear them down; no explicit shutdown is required from callers. + * + * An earlier iteration `.unref()`'d each worker for "clean" exit, + * but that caused a deadlock: when the main thread's only pending + * work was a promise awaiting a worker result, unref'd workers + * didn't process messages on idle ticks and the pipeline hung with + * in-flight batches never settling. + * + * `terminatePool()` is provided for tests that need to reset the + * singleton between cases; it's not part of the CLI shutdown path. * * ### Feature-gate * - * `isWorkerSupported()` returns true only when `Worker` and `Blob` - * + `URL.createObjectURL` are available. On the Node library bundle - * (`dist/index.cjs`), these globals exist (Node provides them via - * `worker_threads` aliases in Bun, and Node 22+ has `Blob` / - * `createObjectURL`). We avoid depending on `bun:*` or `Bun.*` APIs - * so the same code works in both runtimes. + * `isWorkerSupported()` returns true only when `Worker`, `Blob`, + * and `URL.createObjectURL` are available. The Bun runtime and + * Bun-compiled binaries always pass this check; the Node library + * bundle (`dist/index.cjs`) may or may not depending on the + * consumer's Node version. We avoid depending on `bun:*` or + * `Bun.*` APIs so the same code works in both runtimes. * - * When unsupported, callers fall back to the current async + * When unsupported, callers fall back to the async * `mapFilesConcurrent` path. 
*/ diff --git a/test/lib/dsn/code-scanner.test.ts b/test/lib/dsn/code-scanner.test.ts index 6767bb672..f74e48dba 100644 --- a/test/lib/dsn/code-scanner.test.ts +++ b/test/lib/dsn/code-scanner.test.ts @@ -496,6 +496,28 @@ describe("Code Scanner", () => { expect(result.dsns).toEqual([]); }); + test("finds DSNs on long single-line files (>2000 chars, common in minified JS)", async () => { + // Regression: `grepFiles` defaults to `maxLineLength: 2000` which + // truncates `match.line` with a `…` suffix. The DSN scanner then + // re-runs `DSN_PATTERN` on that truncated line, so a DSN near the + // end of a long line would silently disappear (the pattern ends + // with `/\d+` — trailing digits get replaced by `…`). + // + // The scanner explicitly passes `maxLineLength: Infinity` to + // disable truncation, because memory is bounded by the walker's + // 256KB file cap. + const filler = "x".repeat(1990); + const dsn = "https://abc@o111222.ingest.us.sentry.io/7654321"; + const line = `${filler} const DSN = "${dsn}";`; + // Sanity: DSN starts past column 2000. + expect(line.indexOf(dsn)).toBeGreaterThan(2000); + writeFileSync(join(testDir, "minified.js"), line); + + const result = await scanCodeForDsns(testDir); + expect(result.dsns).toHaveLength(1); + expect(result.dsns[0]?.raw).toBe(dsn); + }); + test("finds DSNs in monorepo packages deeper than MAX_SCAN_DEPTH", async () => { // packages/spotlight/src/instrument.ts is depth 3 from root, // but with monorepo depth reset, packages/spotlight/ resets to 0 From bdcb3a25f76fdf5af4c2f067a76d4dbd0b03f369 Mon Sep 17 00:00:00 2001 From: Burak Yigit Kaya Date: Wed, 22 Apr 2026 00:29:19 +0000 Subject: [PATCH 5/9] fix(scan): balance ref/unref per dispatch so CLI exits cleanly MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit PR #807 originally removed `.unref()` from each worker at spawn to avoid a mid-pipeline deadlock where unref'd workers weren't processing messages on idle ticks. 
But keeping workers ref'd broke the CLI exit path: `project view`
(and any command that transits `detectAllDsns` → `scanCodeForDsns` →
`grepFiles`) would emit its output and then hang indefinitely because
ref'd-but-idle workers kept the event loop alive.

E2E caught this:

    (fail) sentry project view > requires org and project [15001.15ms]
    ^ this test timed out after 15000ms.
    Expected: 1
    Received: 143 (SIGTERM)

Bisected cleanly:

- `d2305ba1` (workers-only commit): exit=1 on `project view` — clean.
- `6ea26f3d` (DSN unification): exit=124 — hang.

Unification made `scanCodeForDsns` route through `grepFiles`, which
creates the worker pool. Before unification, the DSN path went
through `walkFiles + mapFilesConcurrent` directly and never touched
the pool, so `project view` never saw workers.

## Fix: ref/unref balancing per dispatch

Workers are now `.unref()`'d at spawn (idle pool doesn't hold the
loop open), then `.ref()`'d in `dispatch()` before a request is
posted, and `.unref()`'d again when the dispatch settles — via the
`message` `result` handler, the `error` handler's mass-reject path,
the readiness-failure path, and `terminate()`. Each of these paths
has a matched unref for every ref so the ref count stays balanced.

This gives us both properties we need:

- **Clean CLI exit**: when all dispatches settle, workers are idle +
  unref'd, the event loop drains, and the process exits naturally.
  No `beforeExit` handler or explicit `terminatePool()` required on
  the happy path.
- **No mid-pipeline deadlock**: ref'd workers during active work
  means their `message` events fire on the main thread's next tick,
  not starved by Bun's idle-tick bypass of unref'd handles.

Added `refWorker` / `unrefWorker` helper functions for readability
and to guard against runtimes where `Worker` doesn't expose the
ref/unref methods (Node's DOM Worker shim, hypothetically).

## Manual verification

    $ cd /tmp/empty-dir
    $ time sentry project view
    Error: Could not auto-detect organization and project.
    ...
    real 0m0.040s   # was 15s timeout before fix
    exit=1

Ran 10 consecutive invocations — all exit cleanly in <50ms.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean
- [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5641 pass, 0 fail**
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large --runs 3 --warmup 1` — workers still function end-to-end; no perf regression (`scan.grepFiles` 162ms, `scanCodeForDsns` 238ms)
- [x] Rebuilt binary, ran `project view` from empty dir 10× — all exit cleanly in <50ms. Pre-fix: 5/5 hit 15s timeout.
- [x] `auth status`, `project view` with DSN, etc. — all exit cleanly.
---
 src/lib/scan/worker-pool.ts | 73 ++++++++++++++++++++++++++++++++-----
 1 file changed, 64 insertions(+), 9 deletions(-)

diff --git a/src/lib/scan/worker-pool.ts b/src/lib/scan/worker-pool.ts
index 8fb8dbf01..16d34c870 100644
--- a/src/lib/scan/worker-pool.ts
+++ b/src/lib/scan/worker-pool.ts
@@ -167,6 +167,24 @@ export function getPoolSize(): number {
   return Math.max(2, Math.min(8, availableParallelism()));
 }
 
+/**
+ * Call `worker.ref()` if the runtime exposes it. Both Bun's native
+ * `Worker` and Node's `worker_threads.Worker` have `.ref()` / `.unref()`;
+ * DOM Worker shims may not. Guarded.
+ */
+function refWorker(w: Worker): void {
+  if (typeof (w as unknown as { ref?: () => void }).ref === "function") {
+    (w as unknown as { ref: () => void }).ref();
+  }
+}
+
+/** Counterpart to {@link refWorker}. */
+function unrefWorker(w: Worker): void {
+  if (typeof (w as unknown as { unref?: () => void }).unref === "function") {
+    (w as unknown as { unref: () => void }).unref();
+  }
+}
+
 /**
  * Build the worker's Blob-URL source once per pool. Cached so
  * repeated pool recreation (in tests) doesn't leak URLs.
@@ -204,11 +222,24 @@ export function getWorkerPool(): WorkerPool { for (let i = 0; i < size; i += 1) { const w = new Worker(url); - // Note: NOT calling .unref() — when multiple dispatches are - // in-flight but the main thread only has promise waits pending, - // unref'd workers don't process messages on idle ticks, causing - // deadlock. The CLI explicitly terminates the pool at exit via - // `terminatePool()` so this doesn't leak. + // Workers are unref'd at spawn so an idle pool doesn't hold the + // event loop open. On each `dispatch()` we `ref()` the worker + // before posting work, and `unref()` it again when the dispatch + // settles (either via `result` message, `error` event, or + // termination). This gives us: + // + // - Clean CLI exit: once all pending dispatches settle, + // workers are idle + unref'd, the loop drains, and the + // process exits naturally — no `beforeExit` / explicit + // `terminatePool()` required on the CLI happy path. + // - No mid-pipeline deadlock: ref'd workers during active + // work means their `message` events fire on the main + // thread's next tick, not starved by idle-tick bypass. + // + // `ref`/`unref` must be balanced one-to-one. See `dispatch()` + // (ref on pending-slot push) + `onmessage` result handler + // (unref on shift) + `error` handler (unref on mass-reject). + unrefWorker(w); const pw: PooledWorker = { worker: w, ready: new Promise((resolve) => { @@ -242,6 +273,9 @@ export function getWorkerPool(): WorkerPool { return; } pw.inflight -= 1; + // Unref: dispatch completed, worker has one fewer reason to + // keep the loop alive. Matched by the `ref()` in `dispatch()`. + unrefWorker(pw.worker); next.resolve({ ints: data.ints, linePool: data.linePool }); }); w.addEventListener("error", (err) => { @@ -250,9 +284,14 @@ export function getWorkerPool(): WorkerPool { // avoid routing new work to a worker that can't respond. pw.alive = false; const errMsg = err.message ?? 
String(err); - while (pw.pending.length > 0) { - const p = pw.pending.shift(); - p?.reject(new Error(`worker error: ${errMsg}`)); + // Drain the pending queue: reject each dispatch AND unref the + // worker once per dispatch so ref/unref stays balanced and the + // loop can drain. + let slot = pw.pending.shift(); + while (slot !== undefined) { + unrefWorker(pw.worker); + slot.reject(new Error(`worker error: ${errMsg}`)); + slot = pw.pending.shift(); } pw.inflight = 0; }); @@ -282,6 +321,10 @@ export function getWorkerPool(): WorkerPool { } const chosen = best; chosen.inflight += 1; + // Ref the worker while this dispatch is in flight. Matched by + // the `unref()` call in the `message` result handler (or the + // mass-unref in the `error` handler / `terminate()` path). + refWorker(chosen.worker); // Enqueue a pending slot for this request. The worker's // `onmessage` handler will resolve it when the corresponding @@ -297,10 +340,12 @@ export function getWorkerPool(): WorkerPool { chosen.worker.postMessage(request); }, (err) => { - // Readiness failed — fail this dispatch's resolver. + // Readiness failed — fail this dispatch's resolver AND + // unref to balance the `refWorker` above. const slot = chosen.pending.pop(); if (slot) { chosen.inflight -= 1; + unrefWorker(chosen.worker); slot.reject(err); } } @@ -309,6 +354,16 @@ export function getWorkerPool(): WorkerPool { }, terminate(): void { for (const pw of workers) { + pw.alive = false; + // Drain pending: reject + unref per-dispatch so ref/unref + // stays balanced and the loop can drain. 
+ let slot = pw.pending.shift(); + while (slot !== undefined) { + unrefWorker(pw.worker); + slot.reject(new Error("worker pool terminated")); + slot = pw.pending.shift(); + } + pw.inflight = 0; try { pw.worker.terminate(); } catch { From 872708ee0e07a3794d19917b268780859302d961 Mon Sep 17 00:00:00 2001 From: Burak Yigit Kaya Date: Wed, 22 Apr 2026 00:35:36 +0000 Subject: [PATCH 6/9] fix(scan,dsn): surface worker-pipeline failures + correct filesCollected count MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two Cursor Bugbot findings caught on the latest force-push. ## 1. `filesCollected` counts matches, not unique files (medium) `grepFiles` emits one `GrepMatch` per matching line, not per file. The scanner was incrementing `filesCollected` on every match, so a file with 3 DSN-matching lines inflated the counter (and the two `dsn.files_collected` / `dsn.files_scanned` telemetry attributes it backed) by 3×. Fix: use `filesSeenForMtime.size` instead. That set was already tracking unique file paths for mtime recording. Rename-free and semantically accurate: "files with at least one validated DSN" — matches rg's `--files-with-matches` semantics. ## 2. Silent data loss when all workers die (medium) The `dispatchBatch` error handler was swallowing dispatch rejections (worker errors, "all workers dead"). If every batch failed, the consumer loop would see zero results and exit normally, and callers like the DSN scanner would get an empty result indistinguishable from "no DSNs found." The DSN cache layer would then persist a false-negative empty result, leaving the scan broken until the 24h TTL expires. Fix: track `dispatchedBatches` / `failedBatches` counters in `grepViaWorkers`. When the consumer loop exits (via the `finally` block that already re-raises `producerError`), throw if every dispatched batch failed. 
A single failed batch is still acceptable (per-file errors are swallowed by design inside the worker), but a pipeline-wide failure now surfaces instead of masquerading as "empty result." ## Test plan - [x] `bunx tsc --noEmit` — clean - [x] `bun run lint` — clean - [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5641 pass, 0 fail** - [x] `bun test test/isolated` — 138 pass - [x] `bun run bench --size large --runs 3 --warmup 1` — no perf regression (`scan.grepFiles` 159ms, `scanCodeForDsns` 232ms) - [x] Rebuilt binary + `project view` from empty dir — exits cleanly (exit 1) in <50ms. 3 consecutive runs all clean. Not adding a regression test for finding #2 — the "all workers die" path requires crashing a real worker, which is hard to reproduce deterministically in a unit test. The change is a defensive guard against an edge case that only shows up when the worker runtime itself is broken (e.g., missing `node:fs`). --- src/lib/dsn/code-scanner.ts | 34 ++++++++++++++++++---------------- src/lib/scan/grep.ts | 31 +++++++++++++++++++++++++++++-- 2 files changed, 47 insertions(+), 18 deletions(-) diff --git a/src/lib/dsn/code-scanner.ts b/src/lib/dsn/code-scanner.ts index 607ce86fd..cb4a94852 100644 --- a/src/lib/dsn/code-scanner.ts +++ b/src/lib/dsn/code-scanner.ts @@ -379,19 +379,18 @@ function scanDirectory(cwd: string): Promise { const sourceMtimes: Record = {}; const dirMtimes: Record = {}; const seen = new Map(); - let filesCollected = 0; - // `grepFiles` emits one match per line; the walker yields every - // file that passes the preset. We count the latter via a - // set of unique absolute paths seen across all emitted matches - // PLUS an implicit one-time count from the first match per file. - // For files with zero DSNs, grep's file-level `http` gate - // silently skips them without emitting — they're not counted - // in `filesCollected` here. 
The pre-refactor counter tracked - // walker-yielded files (including those with zero DSNs), so to - // preserve the telemetry shape we'd need a separate walker tap. - // Accept the semantic drift: `filesCollected` now means "files - // that contained at least one DSN-like URL"; still useful - // signal, just stricter. + // Unique absolute paths of files that had at least one DSN-like + // URL match (commented-out or not). Used as a set for dedup AND + // as the `dsn.files_collected` telemetry count. `grepFiles` + // emits one match per line, so counting matches would over-count; + // counting unique paths gives the pre-refactor semantics. + // + // Files with zero DSN-like URLs are skipped by grep's file-level + // `http` gate without emitting any match, so they're not counted + // here. This is a (minor) drift from the pre-refactor counter + // which tracked walker-yielded files regardless — accepted as + // the stricter count is still useful signal and matches + // rg-style "files with matches" semantics. const filesSeenForMtime = new Set(); try { @@ -416,7 +415,6 @@ function scanDirectory(cwd: string): Promise { }); for await (const match of iter) { - filesCollected += 1; processMatch(match, { seen, sourceMtimes, @@ -424,9 +422,13 @@ function scanDirectory(cwd: string): Promise { }); } - span.setAttribute("dsn.files_collected", filesCollected); + // `filesSeenForMtime` holds one entry per file that had a + // validated DSN — use its `.size` for the `files_collected` + // and `files_scanned` attributes so a file with 3 DSN-matching + // lines counts as 1, not 3. 
+ span.setAttribute("dsn.files_collected", filesSeenForMtime.size); span.setAttributes({ - "dsn.files_scanned": filesCollected, + "dsn.files_scanned": filesSeenForMtime.size, "dsn.dsns_found": seen.size, }); diff --git a/src/lib/scan/grep.ts b/src/lib/scan/grep.ts index d56e58458..747b1f977 100644 --- a/src/lib/scan/grep.ts +++ b/src/lib/scan/grep.ts @@ -555,6 +555,16 @@ async function* grepViaWorkers( let walkerDone = false; let inflightCount = 0; + // Tracks batches that failed at dispatch time (worker error, + // "all workers dead", etc.) so we can surface silent-failure + // scenarios to the caller. A single failed batch is not fatal + // (its results are dropped — per-file errors are already expected + // to be swallowed inside the worker), but if every batch fails + // we throw so callers aren't handed a false-negative empty result + // that could poison downstream caches. + let dispatchedBatches = 0; + let failedBatches = 0; + const dispatchBatch = ( paths: string[], rels: string[], @@ -562,6 +572,7 @@ async function* grepViaWorkers( ): void => { opts.stats.filesRead += paths.length; inflightCount += 1; + dispatchedBatches += 1; pool .dispatch({ paths, @@ -581,8 +592,11 @@ async function* grepViaWorkers( pending.push(decodeWorkerMatches(result, paths, rels, mtimes)); }, () => { - // Worker error — skip this batch. Per-file errors are - // already swallowed inside the worker itself. + // Worker error — drop this batch's results. Per-file errors + // are already swallowed inside the worker itself. Counted so + // the consumer can surface total-pipeline failure if every + // batch failed. 
+          failedBatches += 1;
         }
       )
       .finally(() => {
@@ -715,6 +729,19 @@ async function* grepViaWorkers(
       // biome-ignore lint/correctness/noUnsafeFinally: intentional — re-raising walker error (e.g., AbortError) so callers see it
       throw producerError;
     }
+    // Worker-pipeline failure detector: if at least one batch was
+    // dispatched AND every dispatched batch failed, the pipeline
+    // returned zero matches due to worker failures — NOT due to the
+    // regex not matching. We must surface this so callers (notably
+    // the DSN cache layer) don't persist a false-negative empty
+    // result. A partial failure (some batches succeeded) is
+    // acceptable — we've at least reported the matches we did find.
+    if (dispatchedBatches > 0 && failedBatches === dispatchedBatches) {
+      // biome-ignore lint/correctness/noUnsafeFinally: intentional — surfacing pipeline-wide failure so false-negative empty result doesn't leak upstream
+      throw new Error(
+        `worker pipeline: all ${dispatchedBatches} dispatched batch(es) failed`
+      );
+    }
   }
 }

From fa3dc621a182cd934524ef5da4a361a36170c49a Mon Sep 17 00:00:00 2001
From: Burak Yigit Kaya
Date: Wed, 22 Apr 2026 00:53:59 +0000
Subject: [PATCH 7/9] fix(scan): respect boolean semantics of Worker.ref()/.unref()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Cursor Bugbot caught a real race in the previous commit's per-dispatch ref/unref pattern.

`Worker.ref()` and `Worker.unref()` are idempotent booleans — NOT reference-counted. Calling `unref()` releases the event-loop hold entirely, regardless of how many prior `ref()` calls were made.

Previous commit: every `dispatch()` called `refWorker`, every completion called `unrefWorker`. When a single worker had 2+ concurrent dispatches in flight (normal with round-robin batching), the first completion's `unref()` call unrefed the worker even though other dispatches were still running.
If the walker had also finished by that point, the event loop would exit with pending worker results dropped silently.

Fix: track inflight count per worker (already exists as `pw.inflight`) and only unref when it drops to zero. Specifically:

- **`message` result handler:** decrement inflight; if it hit zero, unref. Other dispatches still pending → keep ref'd.
- **`error` handler:** drain all pending rejects, then unref ONCE at the end (worker is gone either way).
- **readiness-failure path in `dispatch()`:** decrement inflight; if it hit zero, unref (matches `message` handler logic).
- **`terminate()`:** same as error handler — reject all pending, then unref once.

The spawn-time unref stays (inflight starts at 0, worker is idle). The per-dispatch `ref()` is safe to call unconditionally (idempotent no-op when already ref'd), so no change there.

Updated the ref/unref strategy docstring to call out the boolean semantics explicitly, so the next person to touch this code doesn't assume reference-counting.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean
- [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5641 pass, 0 fail**
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large --runs 3 --warmup 1` — workers still parallelize correctly (`scan.grepFiles` 160ms, `scanCodeForDsns` 237ms). If the unref race were being hit, workers would silently starve and numbers would regress to fallback-path territory (~320ms).
- [x] Rebuilt binary, ran `project view` from empty dir 5× — all exit cleanly in <50ms.
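The inflight-zero guard described above can be illustrated with a small standalone sketch. `FakeWorker`, `dispatch`, and `settle` are hypothetical stand-ins for the real pool's `PooledWorker` / `refWorker` / `unrefWorker` machinery; the plain boolean `holdsLoop` models the non-reference-counted semantics the commit message describes:

```typescript
// Toy model of the inflight-zero unref guard (all names hypothetical).
type FakeWorker = { holdsLoop: boolean };
type Pooled = { worker: FakeWorker; inflight: number };

const ref = (w: FakeWorker): void => {
  w.holdsLoop = true; // idempotent: no-op if already ref'd
};
const unref = (w: FakeWorker): void => {
  w.holdsLoop = false; // idempotent: releases the hold entirely
};

function dispatch(pw: Pooled): void {
  pw.inflight += 1;
  ref(pw.worker); // safe to call unconditionally
}

function settle(pw: Pooled): void {
  pw.inflight -= 1;
  // The guard: only the LAST settling dispatch unrefs. An
  // unconditional unref here would release the event-loop hold
  // while sibling dispatches were still in flight.
  if (pw.inflight === 0) {
    unref(pw.worker);
  }
}

// Two concurrent dispatches on one worker:
const pw: Pooled = { worker: { holdsLoop: false }, inflight: 0 };
dispatch(pw);
dispatch(pw);
settle(pw);
const stillHeld = pw.worker.holdsLoop; // first completion: still ref'd
settle(pw);
const released = !pw.worker.holdsLoop; // last completion: unref'd
```

With the buggy per-dispatch unref, `stillHeld` would be `false` after the first `settle` — exactly the premature-exit race Bugbot flagged.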
---
 src/lib/scan/worker-pool.ts | 52 +++++++++++++++++++++++--------------
 1 file changed, 33 insertions(+), 19 deletions(-)

diff --git a/src/lib/scan/worker-pool.ts b/src/lib/scan/worker-pool.ts
index 16d34c870..df30b67fe 100644
--- a/src/lib/scan/worker-pool.ts
+++ b/src/lib/scan/worker-pool.ts
@@ -224,9 +224,8 @@ export function getWorkerPool(): WorkerPool {
     const w = new Worker(url);
     // Workers are unref'd at spawn so an idle pool doesn't hold the
     // event loop open. On each `dispatch()` we `ref()` the worker
-    // before posting work, and `unref()` it again when the dispatch
-    // settles (either via `result` message, `error` event, or
-    // termination). This gives us:
+    // before posting work; when inflight drops to zero (i.e., the
+    // worker goes idle), we `unref()` it again. This gives us:
     //
     // - Clean CLI exit: once all pending dispatches settle,
     //   workers are idle + unref'd, the loop drains, and the
@@ -236,9 +235,12 @@
     //   work means their `message` events fire on the main
     //   thread's next tick, not starved by idle-tick bypass.
     //
-    // `ref`/`unref` must be balanced one-to-one. See `dispatch()`
-    // (ref on pending-slot push) + `onmessage` result handler
-    // (unref on shift) + `error` handler (unref on mass-reject).
+    // IMPORTANT: `Worker.ref()` / `.unref()` are idempotent booleans,
+    // NOT reference-counted. Calling `unref()` on a worker that has
+    // multiple dispatches in flight unrefs it entirely (subsequent
+    // `ref()` calls are no-ops until the worker becomes idle again).
+    // Hence the inflight-zero guard in every unref site — we only
+    // unref when the LAST dispatch for this worker completes.
     unrefWorker(w);
     const pw: PooledWorker = {
       worker: w,
@@ -273,9 +275,13 @@
         return;
       }
       pw.inflight -= 1;
-      // Unref: dispatch completed, worker has one fewer reason to
-      // keep the loop alive. Matched by the `ref()` in `dispatch()`.
-      unrefWorker(pw.worker);
+      // `ref()` / `unref()` are idempotent booleans, NOT reference-
+      // counted. Only unref when the worker's own inflight drops to
+      // zero — unrefing while other dispatches are still in flight
+      // would let the event loop exit and drop their results.
+      if (pw.inflight === 0) {
+        unrefWorker(pw.worker);
+      }
       next.resolve({ ints: data.ints, linePool: data.linePool });
     });
     w.addEventListener("error", (err) => {
@@ -284,16 +290,17 @@
       // avoid routing new work to a worker that can't respond.
       pw.alive = false;
       const errMsg = err.message ?? String(err);
-      // Drain the pending queue: reject each dispatch AND unref the
-      // worker once per dispatch so ref/unref stays balanced and the
-      // loop can drain.
+      // Drain the pending queue: reject each dispatch. Then unref
+      // the worker ONCE at the end (ref/unref are idempotent; one
+      // unref is sufficient to release the event-loop hold even if
+      // there were multiple pending dispatches).
       let slot = pw.pending.shift();
       while (slot !== undefined) {
-        unrefWorker(pw.worker);
         slot.reject(new Error(`worker error: ${errMsg}`));
         slot = pw.pending.shift();
       }
       pw.inflight = 0;
+      unrefWorker(pw.worker);
     });
     workers.push(pw);
   }
@@ -340,12 +347,17 @@
         chosen.worker.postMessage(request);
       },
       (err) => {
-        // Readiness failed — fail this dispatch's resolver AND
-        // unref to balance the `refWorker` above.
+        // Readiness failed — fail this dispatch's resolver. Only
+        // unref if no other dispatches are in flight (same
+        // reasoning as the `message` handler: `unref()` is
+        // idempotent and unrefing while others are in flight
+        // would let the loop exit prematurely).
         const slot = chosen.pending.pop();
         if (slot) {
           chosen.inflight -= 1;
-          unrefWorker(chosen.worker);
+          if (chosen.inflight === 0) {
+            unrefWorker(chosen.worker);
+          }
           slot.reject(err);
         }
       }
@@ -355,15 +367,17 @@
   terminate(): void {
     for (const pw of workers) {
       pw.alive = false;
-      // Drain pending: reject + unref per-dispatch so ref/unref
-      // stays balanced and the loop can drain.
+      // Drain pending: reject every dispatch. `ref`/`unref` are
+      // idempotent booleans — one unref at the end is sufficient
+      // to release the event-loop hold even if the worker had
+      // multiple pending dispatches.
       let slot = pw.pending.shift();
       while (slot !== undefined) {
-        unrefWorker(pw.worker);
         slot.reject(new Error("worker pool terminated"));
         slot = pw.pending.shift();
       }
       pw.inflight = 0;
+      unrefWorker(pw.worker);
       try {
         pw.worker.terminate();
       } catch {

From dff0be1d9a741366a80e3b603db161e4badf05ff Mon Sep 17 00:00:00 2001
From: Burak Yigit Kaya
Date: Wed, 22 Apr 2026 01:08:42 +0000
Subject: [PATCH 8/9] fix(dsn): unify dsn.files_scanned semantics across both scan paths
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Seer caught a telemetry inconsistency from the unification refactor.

`scanCodeForDsns` was reporting `dsn.files_scanned = filesSeenForMtime.size` — the count of unique files with validated DSNs. `scanCodeForFirstDsn` was reporting `dsn.files_scanned = filesScanned` — the count of every file walked. Same attribute name, two different meanings depending on which path executed — which made the telemetry unreliable for scanner performance monitoring.

Fix: switch `scanCodeForDsns` from `grepFiles` (streaming) to `collectGrep` (buffered), which returns `stats.filesRead` — the count of files the grep pipeline actually read and tested. That's equivalent to walker-yielded files (after the walker's own extension/gitignore filters), matching what `scanCodeForFirstDsn` reports.
DSN workloads buffer <10 matches typically (validated DSNs from a source tree), so the streaming-to-buffered switch has no measurable perf impact.

The `filesSeenForMtime` set stays — it still dedups mtime writes across multi-DSN-per-file (grep emits one match per line; without the dedup, a file with 3 DSN-containing lines would trigger 3 mtime writes for the same path).

Updated the comment to reflect the new role (mtime-write dedup, not telemetry) and pointed the docstrings at `collectGrep` instead of `grepFiles`.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean
- [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5641 pass, 0 fail**
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large --runs 3 --warmup 1` — no perf regression (`scan.grepFiles` 160ms — this bench op uses `collectGrep` so it was already in the buffered path; `scanCodeForDsns` reruns are in the same range as before).
- [x] Rebuilt binary, ran `project view` from empty dir — all exit cleanly in <50ms.
---
 src/lib/dsn/code-scanner.ts | 48 ++++++++++++++++++-------------------
 1 file changed, 23 insertions(+), 25 deletions(-)

diff --git a/src/lib/dsn/code-scanner.ts b/src/lib/dsn/code-scanner.ts
index cb4a94852..606e05cab 100644
--- a/src/lib/dsn/code-scanner.ts
+++ b/src/lib/dsn/code-scanner.ts
@@ -5,14 +5,14 @@
  * filtering, host validation, package-path inference, stop-on-first
  * semantics). All file walking, `.gitignore` handling, extension
  * filtering, bounded concurrency, AND worker-pool dispatch are
- * delegated to the shared `src/lib/scan/` module via `grepFiles`.
+ * delegated to the shared `src/lib/scan/` module via `collectGrep`.
  *
  * Flow:
- * 1. `scanDirectory(cwd, stopOnFirst)` calls `grepFiles` with the
+ * 1. `scanDirectory(cwd, stopOnFirst)` calls `collectGrep` with the
  *    DSN pattern and preset (`dsnScanOptions()`), plus
  *    `recordMtimes: true` and an `onDirectoryVisit` hook so the
  *    cache-invalidation maps are populated in one traversal.
- * 2. `grepFiles` dispatches per-file work to the worker pool (when
+ * 2. `collectGrep` dispatches per-file work to the worker pool (when
  *    available) or a concurrent-async fallback. Each yielded
  *    `GrepMatch` represents one line containing a DSN URL; the
  *    grep engine handles the file-level literal gate (`http`) for
@@ -52,7 +52,7 @@ import path from "node:path";
 import { DEFAULT_SENTRY_HOST, getConfiguredSentryUrl } from "../constants.js";
 import { ConfigError } from "../errors.js";
 import { logger } from "../logger.js";
-import { grepFiles, normalizePath, walkFiles } from "../scan/index.js";
+import { collectGrep, normalizePath, walkFiles } from "../scan/index.js";
 import { withTracingSpan } from "../telemetry.js";
 import { createDetectedDsn, inferPackagePath, parseDsn } from "./parser.js";
 import { DSN_MAX_DEPTH, dsnScanOptions } from "./scan-options.js";
@@ -360,7 +360,7 @@ function isValidDsnHost(dsn: string): boolean {
  * production dashboards + the `scanCodeForDsns` bench op stay in
  * sync. Attribute names match the pre-PR-3 scanner byte-for-byte.
  *
- * Delegates the heavy lifting to `grepFiles`:
+ * Delegates the heavy lifting to `collectGrep`:
  * - Walker config (depth, gitignore, skip dirs) from `dsnScanOptions()`.
  * - `DSN_PATTERN` as the grep pattern — the engine's literal
  *   prefilter uses `http` as a file-level gate, identical to the
@@ -379,22 +379,16 @@ function scanDirectory(cwd: string): Promise {
   const sourceMtimes: Record = {};
   const dirMtimes: Record = {};
   const seen = new Map();
-  // Unique absolute paths of files that had at least one DSN-like
-  // URL match (commented-out or not). Used as a set for dedup AND
-  // as the `dsn.files_collected` telemetry count. `grepFiles`
-  // emits one match per line, so counting matches would over-count;
-  // counting unique paths gives the pre-refactor semantics.
-  //
-  // Files with zero DSN-like URLs are skipped by grep's file-level
-  // `http` gate without emitting any match, so they're not counted
-  // here. This is a (minor) drift from the pre-refactor counter
-  // which tracked walker-yielded files regardless — accepted as
-  // the stricter count is still useful signal and matches
-  // rg-style "files with matches" semantics.
+  // Dedup set for mtime recording: one entry per file that
+  // contributed a validated DSN. Used to skip redundant mtime
+  // writes in `processMatch` — `grepFiles` emits one match per
+  // matching line, so a file with 3 DSN-containing lines would
+  // otherwise trigger 3 mtime writes for the same path. NOT
+  // used for telemetry (see `stats.filesRead` below).
   const filesSeenForMtime = new Set();

   try {
-    const iter = grepFiles({
+    const { matches, stats } = await collectGrep({
       cwd,
       pattern: DSN_PATTERN,
       ...dsnScanOptions(),
@@ -414,7 +408,7 @@
       maxLineLength: Number.POSITIVE_INFINITY,
     });

-    for await (const match of iter) {
+    for (const match of matches) {
       processMatch(match, {
         seen,
         sourceMtimes,
@@ -422,13 +416,17 @@
       });
     }

-    // `filesSeenForMtime` holds one entry per file that had a
-    // validated DSN — use its `.size` for the `files_collected`
-    // and `files_scanned` attributes so a file with 3 DSN-matching
-    // lines counts as 1, not 3.
-    span.setAttribute("dsn.files_collected", filesSeenForMtime.size);
+    // `stats.filesRead` is the count of files actually read and
+    // tested by the grep pipeline — matches the walker-yielded
+    // file count after the walker's own filters (extension,
+    // gitignore, etc.). Consistent with `scanCodeForFirstDsn`
+    // which counts every file it walks. `files_collected` is a
+    // legacy attribute: alias to the same count so telemetry
+    // consumers don't need to know about the historical
+    // distinction.
+    span.setAttribute("dsn.files_collected", stats.filesRead);
     span.setAttributes({
-      "dsn.files_scanned": filesSeenForMtime.size,
+      "dsn.files_scanned": stats.filesRead,
       "dsn.dsns_found": seen.size,
     });

From 4e0b6b5552693f65c0dba73e26821986980cf6c4 Mon Sep 17 00:00:00 2001
From: Burak Yigit Kaya
Date: Wed, 22 Apr 2026 08:15:51 +0000
Subject: [PATCH 9/9] fix(scan): close remaining latent worker-pool races
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two latent LOW-severity findings caught on the final CI round.

## 1. Seer: failure-check ran before in-flight dispatches settled

In the early-exit path (`AbortError`, `maxResults = 0`), the `finally` block's all-batches-failed check could race with in-flight dispatch `.catch()` handlers. Sequence:

1. Consumer emits enough matches, exits loop.
2. `finally` runs, awaits `producer` (which just dispatches).
3. `finally` checks `failedBatches === dispatchedBatches`.
4. In-flight dispatch `.catch()` fires LATER, bumps `failedBatches`.

Step 3 reads stale state. If every batch would ultimately fail, we'd silently return an empty result instead of surfacing the pipeline failure. Seer rated this LOW because only `maxResults: 0` triggers the early exit before any dispatch can settle, and no production caller uses that value — but it's a real bug either way and the fix is cheap.

Fix: track dispatch promises and `await Promise.allSettled()` on them in the `finally` block before the failure check. `allSettled` rather than `all` because individual rejections are already counted via the dispatch's `.catch()`; we just need them to complete.

## 2. Cursor: readiness-failure used `pop()` not identity-based removal

The readiness-failure handler in `dispatch()` was calling `chosen.pending.pop()` to drop "this dispatch's slot" — but with multiple concurrent dispatches queued to the same worker, `pop()` returns whichever slot happened to be at the END of the queue, not the one belonging to the failing dispatch. Handlers for sibling dispatches would execute in whatever order `ready`'s rejection chain fired, and `pop()` would reject them in reverse queue order.

Currently unreachable because `chosen.ready` never rejects in the current pool impl (the ready promise has no failure path), but latent — any future addition of a startup timeout or init failure would expose the bug.

Fix: hold a direct reference to `ourSlot` in each dispatch's closure. The readiness-failure handler finds its slot via `indexOf(ourSlot)` and splices that specific entry, preserving FIFO semantics for sibling dispatches.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean
- [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5641 pass, 0 fail**
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large --runs 3 --warmup 1` — no perf regression (`scan.grepFiles` 160ms)
- [x] Rebuilt binary, ran `project view` from empty dir 3× — all exit cleanly.
---
 src/lib/scan/grep.ts        | 21 ++++++++++++++++++++-
 src/lib/scan/worker-pool.ts | 32 +++++++++++++++++++++++---------
 2 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/src/lib/scan/grep.ts b/src/lib/scan/grep.ts
index 747b1f977..15d98667b 100644
--- a/src/lib/scan/grep.ts
+++ b/src/lib/scan/grep.ts
@@ -564,6 +564,14 @@ async function* grepViaWorkers(
   // that could poison downstream caches.
   let dispatchedBatches = 0;
   let failedBatches = 0;
+  // Track dispatch promises so the consumer's `finally` can await
+  // them before checking the failure counters. Without this, an
+  // early exit (e.g., AbortError, maxResults = 0) could race the
+  // final failure assessment: the consumer's finally runs before
+  // in-flight dispatch `.catch()` handlers have bumped
+  // `failedBatches`, causing a legitimate all-failure scenario to
+  // silently pass through as "no matches found."
+  const dispatchPromises: Promise[] = [];

   const dispatchBatch = (
     paths: string[],
@@ -573,7 +581,7 @@
     opts.stats.filesRead += paths.length;
     inflightCount += 1;
     dispatchedBatches += 1;
-    pool
+    const dispatchP = pool
       .dispatch({
         paths,
         patternSource: opts.perFile.regex.source,
@@ -603,6 +611,7 @@
         inflightCount -= 1;
         wakeConsumer();
       });
+    dispatchPromises.push(dispatchP);
   };

   // Producer: walk paths, batch them, dispatch each full batch.
@@ -723,6 +732,16 @@
     earlyExit = true;
     wakeConsumer();
     await producer;
+    // Await every dispatched batch so `failedBatches` is final
+    // before we inspect it. Without this, an early exit could run
+    // the failure check before in-flight `.catch()` handlers
+    // bumped the counter — collapsing an all-failure scenario into
+    // a silent "no matches" return. `allSettled` because we don't
+    // care about individual rejections here (each was already
+    // counted via the dispatch's `.catch()` handler).
+    if (dispatchPromises.length > 0) {
+      await Promise.allSettled(dispatchPromises);
+    }
     // If the walker threw (typically `AbortError`), re-throw so the
     // caller sees the same behavior as the async-fallback path.
     if (producerError !== null) {

diff --git a/src/lib/scan/worker-pool.ts b/src/lib/scan/worker-pool.ts
index df30b67fe..302416d0b 100644
--- a/src/lib/scan/worker-pool.ts
+++ b/src/lib/scan/worker-pool.ts
@@ -336,8 +336,18 @@ export function getWorkerPool(): WorkerPool {
       // Enqueue a pending slot for this request. The worker's
       // `onmessage` handler will resolve it when the corresponding
       // `result` message arrives (FIFO).
+      //
+      // We hold a direct reference to `ourSlot` so the readiness-
+      // failure path can remove and reject THIS dispatch's entry
+      // specifically — not whatever happens to be at the front or
+      // back of the pending queue when the failure fires. With
+      // concurrent dispatches to the same worker, `pop()` or
+      // `shift()` could reject a sibling dispatch's promise with
+      // this dispatch's error.
+      let ourSlot: PooledWorker["pending"][number];
       const result = new Promise((resolve, reject) => {
-        chosen.pending.push({ resolve, reject });
+        ourSlot = { resolve, reject };
+        chosen.pending.push(ourSlot);
       });
       // Wait for readiness (first dispatch only), then post the
       // request. Subsequent dispatches skip the await (the ready
@@ -347,18 +357,22 @@
           chosen.worker.postMessage(request);
         },
         (err) => {
-          // Readiness failed — fail this dispatch's resolver. Only
-          // unref if no other dispatches are in flight (same
-          // reasoning as the `message` handler: `unref()` is
-          // idempotent and unrefing while others are in flight
-          // would let the loop exit prematurely).
-          const slot = chosen.pending.pop();
-          if (slot) {
+          // Readiness failed — fail THIS dispatch's resolver and
+          // remove ITS slot from the pending queue. Identity-based
+          // removal preserves the FIFO invariant for any sibling
+          // dispatches still in flight. Only unref if no other
+          // dispatches remain in flight (same reasoning as the
+          // `message` handler: `unref()` is idempotent, and
+          // unrefing while others are in flight would let the
+          // loop exit prematurely).
+          const idx = chosen.pending.indexOf(ourSlot);
+          if (idx !== -1) {
+            chosen.pending.splice(idx, 1);
             chosen.inflight -= 1;
             if (chosen.inflight === 0) {
               unrefWorker(chosen.worker);
             }
-            slot.reject(err);
+            ourSlot.reject(err);
           }
         }
       );
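A minimal sketch of the `pop()`-vs-identity point from fix 2 (the `Slot` shape and the `"A"`/`"B"` ids are hypothetical; the real slots are `{ resolve, reject }` pairs):

```typescript
// Two dispatches, A then B, queued FIFO on the same worker.
type Slot = { id: string };

const pending: Slot[] = [];
const slotA: Slot = { id: "A" }; // the dispatch whose readiness fails
const slotB: Slot = { id: "B" }; // an innocent sibling queued after A
pending.push(slotA, slotB);

// Buggy handler for A: pop() would take the END of the queue —
// B's slot, not A's. (Peeked here rather than actually popped.)
const popVictim = pending[pending.length - 1].id; // "B" — wrong slot

// Fixed handler: find A's own slot by identity and splice exactly
// that entry, leaving B's slot untouched in its FIFO position.
const idx = pending.indexOf(slotA);
if (idx !== -1) {
  pending.splice(idx, 1);
}
const remaining = pending.map((s) => s.id); // ["B"]
```

`indexOf` uses reference identity for objects, so each dispatch can always locate its own `{ resolve, reject }` pair no matter how siblings reordered the queue around it.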