diff --git a/AGENTS.md b/AGENTS.md index 3a5fd2105..7e2eac62a 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -1009,8 +1009,8 @@ mock.module("./some-module", () => ({ * **Sentry SDK uses @sentry/node-core/light instead of @sentry/bun to avoid OTel overhead**: Sentry SDK uses @sentry/node-core/light to avoid OTel overhead: Avoids \`@sentry/bun\` which loads full OpenTelemetry (~150ms, 24MB). \`@sentry/core\` barrel patched via \`bun patch\` to remove ~32 unused exports. Key gotcha: \`LightNodeClient\` hardcodes \`runtime: { name: 'node' }\` AFTER spreading options — fix by patching \`client.getOptions().runtime\` post-init (mutable ref). Transport falls back to Node \`http\` instead of native \`fetch\`. Upstream: getsentry/sentry-javascript#19885, #19886. - -* **src/lib/scan/ module layout and IgnoreStack two-tier design**: src/lib/scan/ module layout: Policy-free pure-TS ripgrep-compatible scanner. Files: \`types.ts\`, \`constants.ts\`, \`walker.ts\` (DFS with time-budgeted minDepth/maxDepth, onDirectoryVisit hook), \`ignore.ts\` (IgnoreStack: #rootIg + #nestedByRelDir Map, fast-path bypasses ancestor walk when no nested gitignores — 0.27µs/query vs 1.93µs slow), \`binary.ts\` (8KB NUL sniff + ext fast-path), \`concurrent.ts\` (mapFilesConcurrent), \`regex.ts\` (inline flag translation), \`grep.ts\`/\`glob.ts\` (AsyncIterable + collect helpers), \`path-utils.ts\` (picomatch + basename + walkerRoot + joinPosix). DSN policy lives in \`src/lib/dsn/scan-options.ts::dsnScanOptions()\`. + +* **src/lib/scan/ module: policy-free file walker with IgnoreStack**: \*\*src/lib/scan/ module: policy-free walker + grep/glob engine.\*\* Files: \`walker.ts\` (DFS async gen; \`minDepth\`/\`maxDepth\` cap DIRECTORY DESCENT not file yield; \`timeBudgetMs\`, \`AbortSignal\`, \`onDirectoryVisit\`, \`recordMtimes\`), \`ignore.ts\` (IgnoreStack two-tier: \`#rootIg\`+\`#nestedByRelDir\`; fast-path ~0.27µs, slow-path ~1.93µs), \`binary.ts\` (8KB NUL sniff + ext fast-path), \`concurrent.ts\`, \`regex.ts\` (translates \`(?i)\`/\`(?im)\`/\`(?U)\` → JS flags), \`grep.ts\`/\`glob.ts\` (picomatch \`{dot:true}\`). DSN policy lives in \`src/lib/dsn/scan-options.ts\`. Gotchas: (1) \`classifyFile\` short-circuits when \`cfg.extensions\` set. (2) \`buildWalkOptions\` MUST forward \`descentHook\`+\`alwaysSkipDirs\`+\`followSymlinks\`+\`recordMtimes\`+\`clock\`. (3) NEVER \`pLimit.clearQueue()\` (hangs); use \`onResult\` returning \`{done:true}\`. (4) \`path.join\`/\`relative\`/\`extname\` ~10× slower than manual string ops in hot walker loop — use \`abs.slice(cwdPrefixLen)\` and \`lastIndexOf('.')+slice+toLowerCase\`; on Windows use \`path.sep\`. * **Telemetry opt-out is env-var-only — no persistent preference or DO\_NOT\_TRACK**: Telemetry opt-out priority: (1) \`SENTRY\_CLI\_NO\_TELEMETRY=1\`, (2) \`DO\_NOT\_TRACK=1\`, (3) \`metadata.defaults.telemetry\`, (4) default on. DB read try/catch wrapped (runs before DB init). Schema v13 merged \`defaults\` table into \`metadata\` KV with keys \`defaults.{org,project,telemetry,url}\`; getters/setters in \`src/lib/db/defaults.ts\`. \`sentry cli defaults\` uses variadic \`\[key, value?]\`: no args → show all; 1 arg → show key; 2 args → set; \`--clear\` without args → clear all (guarded); \`--clear key\` → clear specific. Canonical keys: \`org\`/\`organization\`, \`project\`, \`telemetry\`, \`url\`. \`computeTelemetryEffective()\` returns resolved source for display. @@ -1026,22 +1026,16 @@ mock.module("./some-module", () => ({ * **Sentry-derived terminal color palette tuned for dual-background contrast**: Terminal color palette tuned for dual-background contrast: 10-color chart palette derived from Sentry's categorical hues (\`static/app/utils/theme/scraps/tokens/color.tsx\`), adjusted to mid-luminance for ≥3:1 contrast on both dark and light backgrounds. Adjustments: orange #FF9838→#C06F20, green #67C800→#3D8F09, yellow #FFD00E→#9E8B18, purple #5D3EB2→#8B6AC8, indigo #50219C→#7B50D0; blurple/pink/magenta unchanged; teal #228A83 added. Hex preferred over ANSI 16-color for guaranteed contrast. - -* **Workers rejected for scan/grep parallelization**: Investigated Bun Workers for sync-work-per-file parallelism. Bootstrap: 7.7ms trivial, ~180ms for scan module (50ms/worker for imports). Pool of 4 = 60-160ms amortizable but ONE-OFF grep calls pay full cost. structuredClone of GrepMatch\[] back to main thread likely dominates for many-match workloads (215k matches × ~100B each = 20MB clone work). Cancellation hard: only \`worker.terminate()\` (experimental); no cooperative early-exit. No existing Worker precedent in the repo (only subprocess-based tests). Polyfill story ugly: library bundle is single \`dist/index.cjs\` with no separate worker file. Theoretical ceiling on 4-core: 1.7-2.3× — much of that eaten by cloning. Walker optimization is the bigger lever regardless. \*\*Do not build worker pool.\*\* + +* **Walker uses sync readdir + statSync — readdirSync p95=24µs is fine to block on**: Walker perf: sync readdir + statSync chosen over async. readdirSync p50=11µs, p95=24µs — below setTimeout(0) (~4ms). 10k-file walk: async serial 105-167ms → sync 43-45ms. statSync slightly faster than Bun.file().size. CLI is single-threaded; async overhead unjustified. Walker remains async-generator externally; internals sync. Workers rejected: bootstrap 50ms/worker for scan imports, structuredClone dominates many-match workloads (215k × ~100B = 20MB), cancellation hard (only \`worker.terminate()\`), single-bundle polyfill story ugly. 4-core ceiling 1.7-2.3× eaten by cloning. Walker is the bottleneck, not regex — gitignore layer adds ~135ms overhead in \`ignore\` package. ### Gotcha * **AuthError constructor takes reason first, message second**: \`AuthError(reason: AuthErrorReason, message?: string)\` where \`AuthErrorReason\` is \`"not\_authenticated" | "expired" | "invalid"\`. Easy to accidentally swap args as \`new AuthError("Token expired", "expired")\` — the string \`"Token expired"\` gets assigned as \`reason\` (invalid enum value). Tests aren't type-checked (tsconfig excludes them), so TypeScript won't catch this. Correct: \`new AuthError("expired", "Token expired")\`. Default messages exist for each reason, so the second arg is often unnecessary. - -* **Dot-notation field filtering is ambiguous for keys containing dots**: Two off-by-one gotchas in formatters/scan collectors: (1) \`filterFields\` in \`src/lib/formatters/json.ts\` dot-notation ambiguous for keys containing dots — property tests must use \`\[a-zA-Z0-9\_]\` charset. Counterexample: \`{"a":{".":false}}\` with path \`"a."\` splits into \`\["a",""]\`. (2) \`collectGlob\`/\`collectGrep\` must NOT forward \`maxResults\` to the underlying iterator. The collector drains without cap, breaks when \`files.length >= maxResults\`, sets \`truncated = true\` on overshoot. Forwarding causes the iterator to stop at exactly N with \`truncated\` unset even when more matches existed. \`collectGrep\` regressed this once. - - -* **grep multiline option was silently a no-op; /m is grep-default**: \`GrepOptions.multiline\` defaults to \`true\` (grep-like, not JS-native): \`src/lib/scan/grep.ts::readAndGrep\` previously called \`ensureGlobalMultilineFlags\` unconditionally, making the option a no-op. Fix: default \`multiline: true\` (matches rg/git-grep — \`^/$\` anchor to line boundaries), thread through \`PerFileOptions\`, use \`ensureGlobalFlag\` only when caller explicitly opts out with \`multiline: false\` (buffer-boundary semantics). Note: \`compilePattern\`'s \`multiline\` default stays \`false\` (pattern-compile layer only). - -* **mapFilesConcurrent skips null but not empty arrays — callers must return null for no-op**: \`mapFilesConcurrent\` in \`src/lib/scan/concurrent.ts\` filters \`null\` results but NOT empty arrays — pushes them and fires \`onResult\` for each. Stream variant \`mapFilesConcurrentStream\` filters both \`null || length === 0\`. Asymmetry intentional (generic \`TOut\` can't assume array). Callers like \`processEntry\` in \`src/lib/dsn/code-scanner.ts\` must return \`null\` (not \`\[]\`) for no-op files — on 10k-file walks where ~99% have no DSNs, returning \`\[]\` fires ~9900 wasted \`onResult\` calls. +* **mapFilesConcurrent skips null but not empty arrays — callers must return null for no-op**: Two off-by-one gotchas in scan/formatters: (1) \`mapFilesConcurrent\` in \`src/lib/scan/concurrent.ts\` filters \`null\` but NOT empty arrays — pushes and fires \`onResult\` for each. Stream variant \`mapFilesConcurrentStream\` filters both. Callers like \`processEntry\` in \`src/lib/dsn/code-scanner.ts\` must return \`null\` (not \`\[]\`) for no-op files — on 10k-file walks where ~99% have no DSNs, \`\[]\` fires ~9900 wasted callbacks. (2) \`collectGlob\`/\`collectGrep\` must NOT forward \`maxResults\` to the underlying iterator — collector drains uncapped, breaks when \`files.length >= maxResults\`, sets \`truncated=true\` on overshoot. Forwarding makes iterator stop at exactly N without \`truncated\`. (3) \`filterFields\` in \`src/lib/formatters/json.ts\` dot-notation ambiguous for keys containing dots — property tests must use \`\[a-zA-Z0-9\_]\` charset. * **process.stdin.isTTY unreliable in Bun — use isatty(0) and backfill for clack**: \`process.stdin.isTTY\` unreliable in Bun — use \`isatty(0)\`: Bun's single-file binary can leave \`process.stdin.isTTY === undefined\` on TTY fds inherited via redirects like \`exec … \ ({ * **test:unit glob only picks up test/lib, test/commands, test/types, test/isolated**: \`bun test\` script globs in \`package.json\` are narrow: \`test:unit\` = \`test/lib test/commands test/types\`, \`test:isolated\` = \`test/isolated\`, \`test:e2e\` = \`test/e2e\`. Tests placed under \`test/fixtures/\`, \`test/scripts/\`, or \`test/script/\` are NOT picked up by any standard CI script despite being valid Bun test files. Place new tests under \`test/lib/\/\` to inherit coverage. Note: \`test/script/\` (singular) exists with script tests but is also outside the globs. - -* **Walker async-generator overhead vs sync readdir**: Walker async-generator overhead vs sync readdir: \`src/lib/scan/walker.ts\` async walker takes 480-620ms on 10k files (48µs/file). Sync \`readdirSync\` + naive walk: 50ms. Async generator overhead is 10× pure I/O cost. \`readdir\` async serial: 114ms; async concurrent (Promise.all): 21ms (fastest). Workers don't help — walker is the bottleneck. Main perf gap vs ripgrep is walker overhead, not regex speed. Culprits: \`Bun.file()\` per yield (stat+lazy metadata), classifyFile branching, onDirectoryVisit stat hook, \`path.extname().toLowerCase()\` per entry, per-directory await chain. Unopened optimization territory. - -* **Whole-buffer matchAll slower than split+test when aggregated over many files**: Whole-buffer matchAll slower than split+test when aggregated over many files: PR 791's \`readAndGrep\` rewrite replaced \`split("\n") + regex.test\` per-line with whole-buffer \`regex.exec\`. Micro-bench on single files showed 12× faster, but aggregate over 10k files says otherwise: split+test is ~1.6× FASTER on many-match workloads (215k matches across 9k files). The 160× faster-than-rg headline comes entirely from \`mapFilesConcurrent.onResult\` early-exit at maxResults=100 — NEW scans 9 files; OLD scans 55. Uncapped NEW is 63% slower than OLD fs-walk because whole-buffer match-emission costs more than per-line \`regex.test\`. Frame early-exit wins honestly in bench headlines. Fix candidates: adaptive switch based on match density, or revert per-line for many-match patterns. +* **Whole-buffer matchAll slower than split+test when aggregated over many files**: grep perf tradeoffs in \`src/lib/scan/grep.ts::readAndGrep\`: (1) Whole-buffer \`regex.exec\` is 12× faster on single files but ~1.6× SLOWER than \`split("\n") + regex.test\` aggregated over 10k files / 215k matches — match-emission cost dominates. 160× faster-than-rg headline comes entirely from \`mapFilesConcurrent.onResult\` early-exit at maxResults=100. (2) Literal prefilter is FILE-LEVEL gate only (cheap \`indexOf\` → skip file if absent), then whole-buffer \`regex.exec\` on survivors. Per-line-verify broke cross-newline patterns and Unicode length-changing \`toLowerCase\` (Turkish \`İ\`→\`i̇\`). Extractor handles multi-char escapes (\`\x41\`,\`\u0041\`,\`\cA\`,\`\k\\`,\`\p{L}\`) via \`escapeSequenceLength\`. \`hasTopLevelAlternation\`+\`skipGroup\` must call \`skipCharacterClass\` (PCRE \`\[]abc]\` ≠ JS). Extract from \`regex.source\` not user input. (3) \`GrepOptions.multiline\` defaults to \`true\` (rg/git-grep semantics) — thread through \`PerFileOptions\`; \`compilePattern\`'s \`multiline\` default stays \`false\`. ### Pattern - -* **AGENTS.md lore conflicts in rebase: take --ours (main-side) and let lore regenerate**: When rebasing onto main and hitting AGENTS.md conflicts in the \`## Long-term Knowledge\` lore-managed section, resolve via \`git checkout --ours AGENTS.md\` (takes main's newer entries) rather than trying to merge block-by-block. Branch-specific lore entries will be regenerated by the lore agent on next run. During rebase: \`--ours\` = branch being rebased onto (main), \`--theirs\` = commit being replayed. This conflict pattern recurs on every rebase when both main and the feature branch updated lore concurrently. - -* **Bench harness with concurrency sweep + deterministic fixtures**: Bench harness with concurrency sweep + deterministic fixtures: Local harness at \`script/bench.ts\` + \`script/bench-sweep.ts\` (not in CI). Scripts: \`bun run bench\`, \`bench:save\` (writes \`.bench/baseline.json\`), \`bench:compare\` (fails >20% regression), \`bench:sweep\` (concurrency tuning). Flags: \`--size\`, \`--op\`, \`--repo \\` (refuses \`--save-baseline\`), \`--runs\`, \`--warmup\`, \`--json\`. Fixtures in \`test/fixtures/bench/\` use xorshift32-seeded generator for byte-identical trees in \`os.tmpdir()/sentry-cli-bench/\`. Baselines gitignored (machine variance). Op labels match Sentry span names. \`withBenchDb(fn)\` scopes \`SENTRY\_CONFIG\_DIR\` to temp; \`clearDsnDetectionCache()\` between cold iterations. \*\*CONCURRENCY\_LIMIT gotcha\*\*: sweep showed knee=availableParallelism for walker-fed workloads, but pure I/O workloads want 16-32. Walker dominates; raising for pure-I/O regressed DSN scanner. Kept at \`min(16, max(2, availableParallelism()))\`. +* **Bench harness with concurrency sweep + deterministic fixtures**: Bench harness (not in CI): \`script/bench.ts\` + \`script/bench-sweep.ts\`. Scripts: \`bun run bench\`, \`bench:save\` (\`.bench/baseline.json\`), \`bench:compare\` (fails >20% regression), \`bench:sweep\` (concurrency tuning). Flags: \`--size\`, \`--op\`, \`--repo \\` (refuses \`--save-baseline\`), \`--runs\`, \`--warmup\`, \`--json\`. Fixtures in \`test/fixtures/bench/\` use xorshift32-seeded generator for byte-identical trees in \`os.tmpdir()/sentry-cli-bench/\`. Baselines gitignored. Op labels match Sentry span names. \`withBenchDb(fn)\` scopes \`SENTRY\_CONFIG\_DIR\`; \`clearDsnDetectionCache()\` between cold iterations. \*\*CONCURRENCY\_LIMIT\*\*: sweep showed knee=availableParallelism for walker-fed workloads; pure I/O wants 16-32. Walker dominates; kept at \`min(16, max(2, availableParallelism()))\`. * **Event view cross-org fallback chain: project → org → all orgs**: Event view cross-org fallback chain: \`fetchEventWithContext\` in \`src/commands/event/view.ts\` tries (1) \`getEvent(org, project, eventId)\`, (2) \`resolveEventInOrg(org, eventId)\`, (3) \`findEventAcrossOrgs(eventId, { excludeOrgs })\`. Extracted into \`tryEventFallbacks()\` for Biome complexity limit. \`excludeOrgs\` only set when same-org search returned null (not on transient error). Both catch blocks re-throw \`AuthError\` while swallowing transient errors. Warning distinguishes same-org/different-project from different-org via \`crossOrg.org === org\`. @@ -1078,12 +1066,9 @@ mock.module("./some-module", () => ({ * **Init-wizard grep/glob adapters must sandbox search.path via safePath before passing to scan**: Init-wizard grep/glob tools in \`src/lib/init/tools/{grep,glob}.ts\` are thin adapters over \`collectGrep\`/\`collectGlob\` from \`src/lib/scan/\`. The scan module trusts \`opts.path\` — per \`src/lib/scan/types.ts\`, sandbox enforcement is the adapter's responsibility. Adapters MUST call \`safePath(payload.cwd, search.path)\` from \`init/tools/shared.ts\` before forwarding; skipping this regresses the \`rejects grep paths outside the init sandbox\` test. \`safePath\` returns an absolute path + walks ancestors with \`realpathSync\` to block symlink escapes. Two-layer defense: \`validateToolSandbox(payload, context.directory)\` in the registry pre-checks \`payload.cwd\` is inside project root; \`safePath\` guards per-tool sub-paths. Wire contract (GrepPayload/GlobPayload in \`init/types.ts\`) is minimal: pattern/path?/include? for grep, patterns\[]/path?/maxResults? for glob — no caseInsensitive/fixedStrings/etc on the wire today. - -* **Literal prefilter for grep — extract longest non-metachar run from regex**: \`src/lib/scan/literal-extract.ts::extractInnerLiteral()\` extracts longest literal substring (≥3 chars, not whitespace/punctuation) from a regex source. Used in \`grep.ts::readAndGrep\` as a fast path: \`indexOf(literal)\` to find candidate lines, then verify with full regex. 18% win on \`import.\*from\`-style patterns. Three dispatch paths: (1) pure literal → whole-buffer regex (V8 optimizes purely-literal patterns; prefilter adds overhead), (2) regex with extractable literal → literal prefilter, (3) no literal or \`multiline: false\` → whole-buffer fallback. \*\*Char-class opaqueness gotcha\*\*: \`hasTopLevelAlternation\` must skip over \`\[...]\` as opaque — \`\[(]foo|bar\` has literal \`(\` inside the class; incorrectly incrementing paren depth hides top-level \`|\`. - -* **PR review workflow: Cursor Bugbot + Seer + human cycle**: Use \`gh pr checks \ --json state,link\` + \`gh run view --log-failed\` to detect CI failures. Fetch unresolved review threads via \`gh api graphql\` with reviewThreads query filtering \`isResolved: false\` + \`isMinimized: false\`. After addressing findings: post reply via \`gh api repos/.../pulls/comments/\/replies\` then resolve thread via \`resolveReviewThread\` GraphQL mutation. Bots auto-resolve threads when they detect the fix in a follow-up commit. Repeat until zero unresolved + all CI green. \`mergeStateStatus: UNSTABLE\` means non-blocking checks still running (bots); \`BLOCKED\` means required CI pending; \`CLEAN\` + \`MERGEABLE\` = ready to merge. Repo uses squash-merge only (\`mergeCommit: false, rebase: false, squash: true\` from gh api). +* **PR review workflow: Cursor Bugbot + Seer + human cycle**: PR review workflow (Cursor Bugbot + Seer + human): \`gh pr checks \ --json state,link\` + \`gh run view --log-failed\` for CI. Unresolved threads via \`gh api graphql\` with \`reviewThreads\` query filtering \`isResolved:false\`+\`isMinimized:false\`. After fixes: reply via \`gh api repos/.../pulls/comments/\/replies\` then resolve via \`resolveReviewThread\` mutation. Bots auto-resolve on detected fix. Statuses: \`UNSTABLE\`=non-blocking bots running, \`BLOCKED\`=required CI pending, \`CLEAN\`+\`MERGEABLE\`=ready. Repo is squash-merge only. Expect 4-6 rounds on subtle regex/Unicode PRs. Bugbot findings usually real but occasionally assume PCRE/Python semantics (e.g. \`\[]abc]\` is class-with-\`]\` in PCRE but empty class in JS) — verify with reproduction. Dispatch self-review subagent between rounds. -* **URL resolution chain and persistent default integration point**: CLI init sequence: \`startCli()\` in \`cli.ts\`: (1) fast-path \`\_\_complete\` dispatch, (2) \`preloadProjectContext(cwd)\` finds project root + \`.sentryclirc\`, calls \`applySentryCliRcEnvShim()\`, (3) persistent defaults applied, (4) \`runCli(args)\` with \`withTelemetry()\`. Heavy modules use \`await import()\` for fast completion. \`preloadProjectContext\` is try/catch wrapped. Sentry URL resolution priority: \`SENTRY\_HOST\` > \`SENTRY\_URL\` > \`.sentryclirc \[defaults] url\` > metadata \`defaults.url\` > \`sentry.io\`. Persistent URL default written to \`env.SENTRY\_URL\` only if both \`SENTRY\_HOST\` and \`SENTRY\_URL\` unset. +* **URL resolution chain and persistent default integration point**: CLI init sequence in \`startCli()\` (\`cli.ts\`): (1) fast-path \`\_\_complete\` dispatch, (2) \`preloadProjectContext(cwd)\` finds project root + \`.sentryclirc\`, calls \`applySentryCliRcEnvShim()\`, (3) persistent defaults applied, (4) \`runCli(args)\` with \`withTelemetry()\`. Heavy modules use \`await import()\` for fast completion. \`preloadProjectContext\` is try/catch wrapped. Sentry URL resolution priority: \`SENTRY\_HOST\` > \`SENTRY\_URL\` > \`.sentryclirc \[defaults] url\` > metadata \`defaults.url\` > \`sentry.io\`. Persistent URL default written to \`env.SENTRY\_URL\` only if both \`SENTRY\_HOST\` and \`SENTRY\_URL\` unset. diff --git a/src/lib/scan/walker.ts b/src/lib/scan/walker.ts index eda681713..cd03fcbb7 100644 --- a/src/lib/scan/walker.ts +++ b/src/lib/scan/walker.ts @@ -39,8 +39,8 @@ * Entries already yielded remain valid. */ -import type { Dirent } from "node:fs"; -import { readdir, stat } from "node:fs/promises"; +import { type Dirent, readdirSync, statSync } from "node:fs"; +import { stat } from "node:fs/promises"; import path from "node:path"; import { handleFileError } from "../dsn/fs-utils.js"; import { logger } from "../logger.js"; @@ -48,7 +48,6 @@ import { classifyByExtension, readHeadAndSniff } from "./binary.js"; import { DEFAULT_SKIP_DIRS, MAX_FILE_SIZE, - normalizePath, TEXT_EXTENSIONS, } from "./constants.js"; import { IgnoreStack } from "./ignore.js"; @@ -56,6 +55,14 @@ import type { IgnoreMatcher, WalkEntry, WalkOptions } from "./types.js"; const log = logger.withTag("scan-walk"); +/** + * Native path separator cached to avoid `path.sep` property lookup + * on every entry in the hot loop. `/` on POSIX, `\` on Windows. + */ +const NATIVE_SEP = path.sep; +/** True when the native separator is POSIX (`/`) — skip normalizePath. */ +const POSIX_NATIVE = NATIVE_SEP === path.posix.sep; + /** Entry on the walker's DFS stack. */ type DirFrame = { absDir: string; @@ -86,6 +93,12 @@ type WalkContext = { startedAt: number; stack: DirFrame[]; visitedInodes: Set; + /** + * Precomputed `cfg.cwd.length + 1` — used to slice the cwd prefix + * off absolute paths to produce relative paths. Cached on the + * context so the hot loop doesn't recompute. + */ + cwdPrefixLen: number; }; /** @@ -124,6 +137,7 @@ async function* walkFilesImpl(opts: WalkOptions): AsyncGenerator { startedAt, stack, visitedInodes, + cwdPrefixLen: cfg.cwd.length + 1, }; try { @@ -133,7 +147,7 @@ async function* walkFilesImpl(opts: WalkOptions): AsyncGenerator { stats.dirsVisited += 1; stats.maxDepthReached = Math.max(stats.maxDepthReached, frame.depth); - const entries = await listDirEntries(frame.absDir); + const entries = listDirEntries(frame.absDir); // Dentry-driven nested .gitignore loading: ONLY call loadFromDir // when a .gitignore file is actually present in the directory // listing we already read. This avoids a failed open + thrown @@ -194,8 +208,18 @@ async function processEntry( if (entry.isSymbolicLink() && !cfg.followSymlinks) { return null; } - const abs = path.join(frame.absDir, entry.name); - const rel = normalizePath(path.relative(cfg.cwd, abs)); + // String-concat `abs` rather than `path.join` — measured ~10× faster + // in V8 (no normalization pass on inputs we already know are clean: + // `frame.absDir` is absolute-and-unslashed, `entry.name` has no + // separators per POSIX dirent semantics). + const abs = frame.absDir + NATIVE_SEP + entry.name; + // Slice under `cwd` rather than `path.relative` — measured ~11× + // faster. Safe here because every `abs` is guaranteed to be under + // `cfg.cwd` by construction (we only descend from `cwd`). On + // Windows, convert native `\` to POSIX `/` for downstream + // consumers (the `ignore` package requires POSIX paths). + const relNative = abs.slice(ctx.cwdPrefixLen); + const rel = POSIX_NATIVE ? relNative : relNative.replaceAll(NATIVE_SEP, "/"); // For regular dirs/files, the Dirent already tells us the type. // For symlinks (when `followSymlinks: true`), we need to `stat()` @@ -236,8 +260,17 @@ async function processEntry( return null; } if (cfg.extensions !== undefined) { - const ext = path.extname(entry.name).toLowerCase(); - if (ext === "" || !cfg.extensions.has(ext)) { + // Manual extname — `path.extname + toLowerCase` costs ~9ms for + // 13k calls in V8 (lots of branches checking for . and + // separator boundaries); `lastIndexOf + slice + toLowerCase` + // costs ~7ms. Not a huge win per-call but adds up on hot walks. + const name = entry.name; + const dot = name.lastIndexOf("."); + if (dot <= 0) { + return null; + } + const ext = name.slice(dot).toLowerCase(); + if (!cfg.extensions.has(ext)) { return null; } } @@ -315,26 +348,50 @@ type FileCoords = { fileDepth: number; }; -/** Classify, stat, and build a `WalkEntry`. Returns null on fs errors. */ +/** + * Classify, stat, and build a `WalkEntry`. Returns null on fs errors. + * + * Uses `statSync` rather than `Bun.file(absPath).size` / `.lastModified`. + * Measured ~15% faster per call (30ms vs 36ms for 10k files in the + * synthetic/large fixture) because it avoids the `Bun.file()` handle + * allocation when we only want stat data. When `recordMtimes: false` + * (the grep default), the same single `statSync` call serves both + * the size check and the stat read — no extra syscall needed. + */ async function tryYieldFile( coords: FileCoords, cfg: NormalizedOptions, stats: WalkStats ): Promise { + let statResult: { size: number; mtimeMs: number }; try { - const file = Bun.file(coords.absPath); - const size = file.size; - if (size > cfg.maxFileSize) { - stats.filesSkippedBySize += 1; - return null; - } - const isBinary = await classifyFile(coords.absPath, size, cfg); + const s = statSync(coords.absPath); + statResult = { size: s.size, mtimeMs: s.mtimeMs }; + } catch (error) { + handleFileError(error, { + operation: "scan.walk.stat", + path: coords.absPath, + }); + return null; + } + + if (statResult.size > cfg.maxFileSize) { + stats.filesSkippedBySize += 1; + return null; + } + + try { + const isBinary = await classifyFile(coords.absPath, statResult.size, cfg); stats.filesYielded += 1; return { absolutePath: coords.absPath, relativePath: coords.relPath, - size, - mtime: cfg.recordMtimes ? file.lastModified : 0, + size: statResult.size, + // Floor mtimeMs so DSN cache keys (which compare on floored + // values — see `code-scanner.ts::sourceMtimes` docs) stay + // stable across invocations. Raw mtimeMs is a float that can + // differ by ~1e-6 between reads of the same inode. + mtime: cfg.recordMtimes ? Math.floor(statResult.mtimeMs) : 0, isBinary, depth: coords.fileDepth, }; @@ -405,15 +462,20 @@ function compareByName(a: Dirent, b: Dirent): number { /** * Read all entries in a directory, sorted by name for determinism. * - * Uses `readdir({withFileTypes})` rather than `opendir` — the former - * returns the full list in a single await, the latter yields via an - * async iterator with one microtask per entry. Measurable win on hot - * walks (see PR 1.5 bench data). Sort is retained so yield order is - * filesystem-independent, which tests rely on. + * Uses synchronous `readdirSync({withFileTypes})` rather than the + * async variant. Per-call cost is ~11µs p50 / 65µs max on a 10k-file + * fixture — well below the ~4ms min latency of a microtask tick, so + * blocking the event loop briefly per directory is cheap AND avoids + * the ~60µs microtask overhead each async readdir incurs. Net: the + * sync readdir is 2-3× faster on walks with many small directories, + * which is every realistic CLI/codebase workload. + * + * Sort is retained so yield order is filesystem-independent, which + * tests rely on. */ -async function listDirEntries(dir: string): Promise { +function listDirEntries(dir: string): Dirent[] { try { - const entries = await readdir(dir, { withFileTypes: true }); + const entries = readdirSync(dir, { withFileTypes: true }); entries.sort(compareByName); return entries; } catch (error) {