diff --git a/AGENTS.md b/AGENTS.md
index 60e0ce6c5..3a5fd2105 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -991,86 +991,99 @@ mock.module("./some-module", () => ({
 ### Architecture
-
-* **Dashboard widget interval computed from terminal width and layout before API calls**: `computeOptimalInterval()` in `src/lib/api/dashboards.ts` calculates chart interval before API calls. Formula: `colWidth = floor(layout.w / 6 * termWidth)`, `chartWidth = colWidth - 4 - gutterW`, `idealSeconds = periodSeconds / chartWidth`. Snaps to nearest Sentry bucket (1m–1d). `periodToSeconds()` parses `"24h"`, `"7d"` etc. `queryWidgetTimeseries` uses `params.interval ?? widget.interval`.
+
+* **Auth token env var override pattern: SENTRY_AUTH_TOKEN > SENTRY_TOKEN > SQLite**: Auth in `src/lib/db/auth.ts` follows layered precedence: `SENTRY_AUTH_TOKEN` > `SENTRY_TOKEN` > SQLite OAuth token. `getEnvToken()` trims env vars (empty/whitespace = unset). `AuthSource` tracks provenance. `ENV_SOURCE_PREFIX = "env:"` — use `.length`, not a hardcoded 4. Env tokens bypass refresh/expiry. `isEnvTokenActive()` guards auth commands. Logout must NOT clear stored auth when an env token is active. These functions stay in `db/auth.ts` despite not touching the DB because they're tightly coupled with token retrieval.
-
-* **DSN org prefix normalization in arg-parsing.ts**: DSN/numeric org prefix normalization — four code paths must all convert to slugs before API calls (many endpoints reject numeric org IDs with 404/403): (1) `extractOrgIdFromHost` strips the `o` prefix during DSN parsing. (2) `stripDsnOrgPrefix()` handles user-typed `o1081365/` in `parseOrgProjectArg()`. (3) `normalizeNumericOrg()` in `resolve-target.ts` resolves bare numeric IDs via the DB cache or an uncached API call. (4) Dashboard's `resolveOrgFromTarget()` pipes through `resolveEffectiveOrg()`, also used by `tryResolveRecoveryOrg()` in hex-id-recovery.
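A toy sketch of the interval heuristic from the dashboard-widget entry above. The formula and the 1m–1d bucket endpoints come from the entry; the exact Sentry bucket set, the `gutterW` default, and the period-parsing rules here are assumptions:

```typescript
// Hypothetical bucket set — the entry only says "snaps to nearest Sentry bucket (1m–1d)".
const BUCKETS_SECONDS = [60, 300, 900, 3600, 14400, 86400];

// Parses "24h" / "7d" style periods into seconds (a subset of the real periodToSeconds()).
function periodToSeconds(period: string): number {
  const m = /^(\d+)([hd])$/.exec(period);
  if (!m) throw new Error(`unparseable period: ${period}`);
  return Number(m[1]) * (m[2] === "h" ? 3600 : 86400);
}

// colWidth/chartWidth/idealSeconds follow the formula quoted in the entry.
function computeOptimalInterval(
  period: string,
  layoutW: number, // widget width in 6-column layout units
  termWidth: number,
  gutterW = 1, // assumed default
): number {
  const colWidth = Math.floor((layoutW / 6) * termWidth);
  const chartWidth = colWidth - 4 - gutterW;
  const idealSeconds = periodToSeconds(period) / Math.max(1, chartWidth);
  // Snap to the nearest bucket.
  return BUCKETS_SECONDS.reduce((best, b) =>
    Math.abs(b - idealSeconds) < Math.abs(best - idealSeconds) ? b : best,
  );
}
```

Under this assumed bucket set, a 24h period on a full-width widget in a 120-column terminal yields ~751s per column, which snaps to the 15-minute bucket.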
+
+* **Consola chosen as CLI logger with Sentry createConsolaReporter integration**: Two reporters: FancyReporter (stderr) + Sentry structured logs. Level via `SENTRY_LOG_LEVEL`. `buildCommand` injects hidden `--log-level`/`--verbose` flags. `withTag()` creates independent instances; `setLogLevel()` propagates via a registry. All user-facing output must use consola, not raw stderr. `HandlerContext` intentionally omits stderr.
-
-* **env-registry.ts drives --help env var section + docs**: `src/lib/env-registry.ts` (`ENV_VAR_REGISTRY`) is the single source for all env vars the CLI honors. Entries have `{name, description, example?, defaultValue?, installOnly?, topLevel?, briefDescription?}`. `topLevel: true` + `briefDescription` surfaces in the `sentry --help` Environment Variables section (via `formatEnvVarsSection()` in `help.ts`) and in `sentry help --json` as an `envVars` array on the full-tree envelope. The docs generator consumes the full registry for `configuration.md`. When adding a new env var, add it here with `installOnly: true` if install-script-only. Reserve `topLevel: true` for core-path vars only (auth, targeting, URL, key display/logging).
+
+* **DSN cache invalidation uses two-level mtime tracking (sourceMtimes + dirMtimes)**: `sourceMtimes` (DSN-bearing files, catches in-place edits) + `dirMtimes` (every walked dir, catches new files) + root mtime fast-path + 24h TTL. Dropping either map is a correctness regression. The walker emits mtimes via the `onDirectoryVisit` hook + `recordMtimes` option. **Error-path invariant**: the `scanDirectory` catch block MUST return an empty `dirMtimes: {}` — NOT the partial map from the `onDirectoryVisit` callback, which would silently bless unvisited dirs. Empty forces a full rescan. `ConfigError` re-throws instead.
+`scan.walk.dsnParity` bench op imports `dsnScanOptions()` from `dsn/scan-options.ts` to keep `scan/` policy-free.
-
-* **GHCR versioned nightly tags for delta upgrade support**: GHCR nightly with delta upgrades: Three tag types: `:nightly` (rolling), `:nightly-<…>` (immutable), `:patch-<…>` (delta). Delta patches use TRDIFF10 (zstd-compressed), ~50KB vs ~29MB full. Client decompresses via `Bun.zstdDecompressSync()`. N-1 only, full fallback, SHA-256 verify, 60% size threshold. npm/Node excluded. Test helper: `mockGhcrNightlyVersion()`.
+
+* **Input validation layer: src/lib/input-validation.ts guards CLI arg parsing**: Four validators — `rejectControlChars` (ASCII < 0x20), `rejectPreEncoded` (%XX), `validateResourceId` (rejects ?, #, %, whitespace), `validateEndpoint` (rejects `..` traversal). Applied in `parseSlashOrgProject`, `parseOrgProjectArg`, `parseIssueArg`, `normalizeEndpoint`. NOT applied in `parseSlashSeparatedArg` for plain IDs (they may contain structural separators). Env vars and DB cache values are trusted.
-
-* **Issue list auto-pagination beyond API's 100-item cap**: Sentry API silently caps `limit` at 100 per request. `listIssuesAllPages()` auto-paginates using Link headers, bounded by MAX_PAGINATION_PAGES (50). `API_MAX_PER_PAGE` constant is shared across all paginated consumers. `--limit` means total results everywhere (max 1000, default 25). Org-all mode uses `fetchOrgAllIssues()`; explicit `--cursor` does single-page fetch to preserve cursor chain.
+
+* **Magic @ selectors resolve issues dynamically via sort-based list API queries**: `@latest` and `@most_frequent` in `parseIssueArg` are detected before `validateResourceId` (@ is not in the forbidden charset). `SELECTOR_MAP` provides case-insensitive matching.
+`resolveSelector` maps to `IssueSort` values, calls `listIssuesPaginated` with `perPage: 1`, `query: 'is:unresolved'`. Supports org-prefixed: `sentry/@latest`. Unrecognized `@`-prefixed strings fall through. `ParsedIssueArg` union includes `{ type: 'selector' }`.
-
-* **resolveProjectBySlug carries full projectData to avoid redundant getProject calls**: Returns `{ org, project, projectData: SentryProject }` from `findProjectsBySlug()`. `ResolvedOrgProject`/`ResolvedTarget` have optional `projectData?` (populated only in project-search path). Downstream commands use `resolved.projectData ?? await getProject(org, project)` to save ~500-800ms.
+
+* **Sentry SDK uses @sentry/node-core/light instead of @sentry/bun to avoid OTel overhead**: Avoids `@sentry/bun`, which loads full OpenTelemetry (~150ms, 24MB). The `@sentry/core` barrel is patched via `bun patch` to remove ~32 unused exports. Key gotcha: `LightNodeClient` hardcodes `runtime: { name: 'node' }` AFTER spreading options — fix by patching `client.getOptions().runtime` post-init (mutable ref). Transport falls back to Node `http` instead of native `fetch`. Upstream: getsentry/sentry-javascript#19885, #19886.
-
-* **Sentry CLI markdown-first formatting pipeline replaces ad-hoc ANSI**: Formatters build CommonMark strings; `renderMarkdown()` renders to ANSI for TTY or raw markdown for non-TTY. Key helpers: `colorTag()`, `mdKvTable()`, `mdRow()`, `mdTableHeader()` (`:` suffix = right-aligned), `renderTextTable()`. `isPlainOutput()` checks `SENTRY_PLAIN_OUTPUT` > `NO_COLOR` > `!isTTY`. Batch path: `formatXxxTable()`. Streaming path: `StreamingTable` (TTY) or raw markdown rows (plain). Both share `buildXxxRowCells()`.
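A hedged sketch of the `isPlainOutput()` precedence named in the formatting entry above (`SENTRY_PLAIN_OUTPUT` > `NO_COLOR` > `!isTTY`); the exact truthiness rules for each variable are assumptions, not the real implementation:

```typescript
// Plain output wins in this order: explicit SENTRY_PLAIN_OUTPUT, then NO_COLOR, then non-TTY.
function isPlainOutput(
  env: Record<string, string | undefined>,
  isTTY: boolean,
): boolean {
  // Assumed rule: any value except "0" enables plain output when the var is set.
  if (env.SENTRY_PLAIN_OUTPUT !== undefined) return env.SENTRY_PLAIN_OUTPUT !== "0";
  // NO_COLOR convention: any non-empty value disables color.
  if (env.NO_COLOR !== undefined && env.NO_COLOR !== "") return true;
  return !isTTY;
}
```

Taking env state as a parameter (rather than reading `process.env` inline) keeps the precedence testable without mutating globals.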
+
+* **src/lib/scan/ module layout and IgnoreStack two-tier design**: Policy-free pure-TS ripgrep-compatible scanner. Files: `types.ts`, `constants.ts`, `walker.ts` (DFS with time-budgeted minDepth/maxDepth, onDirectoryVisit hook), `ignore.ts` (IgnoreStack: #rootIg + #nestedByRelDir Map; a fast path bypasses the ancestor walk when there are no nested gitignores — 0.27µs/query vs 1.93µs slow), `binary.ts` (8KB NUL sniff + ext fast-path), `concurrent.ts` (mapFilesConcurrent), `regex.ts` (inline flag translation), `grep.ts`/`glob.ts` (AsyncIterable + collect helpers), `path-utils.ts` (picomatch + basename + walkerRoot + joinPosix). DSN policy lives in `src/lib/dsn/scan-options.ts::dsnScanOptions()`.
-
-* **Sentry dashboard API rejects discover/transaction-like widget types — use spans**: Sentry API dataset gotchas: (1) Events/Explore API accepts `spans`, `transactions`, `logs`, `errors`, `discover`; `spansIndexed` is INVALID (500). Valid list in `EVENTS_API_DATASETS`. (2) Dashboard `widgetType`: `discover` and `transaction-like` rejected as deprecated — use `spans`. `WIDGET_TYPES` (active) vs `ALL_WIDGET_TYPES` (includes deprecated for parsing). Tests use `error-events` not `discover`. (3) `sort` param only on `spans` dataset. (4) `tracemetrics` uses comma-separated aggregates; only line/area/bar/table/big_number displays.
+
+* **Telemetry opt-out precedence: env vars first, then persisted default**: Telemetry opt-out priority: (1) `SENTRY_CLI_NO_TELEMETRY=1`, (2) `DO_NOT_TRACK=1`, (3) `metadata.defaults.telemetry`, (4) default on. The DB read is try/catch wrapped (it runs before DB init). Schema v13 merged the `defaults` table into the `metadata` KV with keys `defaults.{org,project,telemetry,url}`; getters/setters in `src/lib/db/defaults.ts`.
+`sentry cli defaults` uses variadic `[key, value?]`: no args → show all; 1 arg → show key; 2 args → set; `--clear` without args → clear all (guarded); `--clear key` → clear specific. Canonical keys: `org`/`organization`, `project`, `telemetry`, `url`. `computeTelemetryEffective()` returns the resolved source for display.
-
-* **Sentry issue stats field: time-series controlled by groupStatsPeriod**: Issue stats and list layout: `stats` depends on `groupStatsPeriod` (`""`, `"14d"`, `"24h"`, `"auto"`). Critical: `count` is period-scoped — use `lifetime.count` for true total. `--compact` is tri-state (`optional: true`): explicit overrides, `undefined` triggers `shouldAutoCompact(rowCount)` — compact if `3N + 3 > termHeight`. TREND column hidden < 100 cols. Stricli boolean flags with `optional: true` produce `boolean | undefined`, enabling this auto-detect pattern.
+
+* **Zod schema on OutputConfig enables self-documenting JSON fields in help and SKILL.md**: List commands register `schema?: ZodType` on `OutputConfig<…>`. `extractSchemaFields()` produces `SchemaFieldInfo[]` from Zod shapes. `buildFieldsFlag()` enriches the `--fields` brief; `enrichDocsWithSchema()` appends fields to `fullDescription`. The schema is exposed as `__jsonSchema` on built commands — `introspect.ts` reads it into `CommandInfo.jsonFields`; `help.ts` and `generate-skill.ts` render it. For `buildOrgListCommand`/`dispatchOrgScopedList`, pass `schema` via `OrgListConfig`.
-
-* **Sentry log IDs are UUIDv7 — enables deterministic retention checks**: Sentry log IDs are UUIDv7 (first 12 hex = ms timestamp, version char `7` at pos 13). Traces/event IDs are NOT v7. `decodeUuidV7Timestamp()` and `ageInDaysFromUuidV7()` in `src/lib/hex-id.ts` return null for non-v7 IDs, safe to call unconditionally. Enables deterministic 'past retention' messages.
-Wired in `recoverHexId` and `log/view.ts#throwNotFoundError`. `RETENTION_DAYS.log = 90` in `src/lib/retention.ts`; traces/events are `null` (plan-dependent). `LOG_RETENTION_PERIOD` is DERIVED as `` `${RETENTION_DAYS.log}d` `` — never hardcode `'90d'`. Shared hex primitives (`HEX_ID_RE`, `SPAN_ID_RE`, `UUID_DASH_RE`, `LEADING_HEX_RE`, `MIDDLE_ELLIPSIS_RE`, `HexEntityType`) live in `hex-id.ts`.
+### Decision
-
-* **Stricli route errors are uninterceptable — only post-run detection works**: Stricli error gaps: (1) Route failures uninterceptable — Stricli writes stderr and returns `ExitCode.UnknownCommand` (-5 / 251 in Bun); only a post-`run()` `process.exitCode` check works. (2) `OutputError` calls `process.exit()` immediately, bypassing telemetry. (3) `defaultCommand: 'help'` bypasses built-in fuzzy matching — fixed by `resolveCommandPath()` in `introspect.ts` using `fuzzyMatch()` (up to 3 suggestions); JSON includes `suggestions`. (4) Plural alias detection in `app.ts`.
+
+* **All view subcommands should use `<…> <…>` positional pattern**: All `* view` subcommands use the `<…> <…>` positional pattern (Intent-First Correction UX): target is an optional `org/project`. Use opportunistic arg swapping with `log.warn()` when args are in the wrong order — when intent is unambiguous, do what they meant. Normalize at the command level, keep parsers pure. Model after the `gh` CLI. Exception: `auth` uses `defaultCommand: "status"` (no viewable entity). Routes without defaults: `cli`, `sourcemap`, `repo`, `team`, `trial`, `release`, `dashboard/widget`.
-
-* **Three Sentry APIs for span custom attributes with different capabilities**: Three Sentry span APIs with different attribute capabilities: (1) `/trace/{traceId}/` — hierarchical tree; `additional_attributes` enumerates requested attrs; returns `measurements` (zero-filled on non-browser, stripped by `filterSpanMeasurements()`).
-(2) `/projects/{org}/{project}/trace-items/{itemId}/` — single span full detail; ALL attributes as `{name,type,value}[]` automatically. (3) `/events/?dataset=spans&field=X` — list/search; explicit `field` params. `--fields` flag filters JSON output AND requests extra API fields via `extractExtraApiFields()`. `FIELD_GROUP_ALIASES` supports shorthand expansion.
+
+* **Sentry-derived terminal color palette tuned for dual-background contrast**: 10-color chart palette derived from Sentry's categorical hues (`static/app/utils/theme/scraps/tokens/color.tsx`), adjusted to mid-luminance for ≥3:1 contrast on both dark and light backgrounds. Adjustments: orange #FF9838→#C06F20, green #67C800→#3D8F09, yellow #FFD00E→#9E8B18, purple #5D3EB2→#8B6AC8, indigo #50219C→#7B50D0; blurple/pink/magenta unchanged; teal #228A83 added. Hex preferred over ANSI 16-color for guaranteed contrast.
-### Decision
+
+* **Workers rejected for scan/grep parallelization**: Investigated Bun Workers for sync-work-per-file parallelism. Bootstrap: 7.7ms trivial, ~180ms for the scan module (50ms/worker for imports). A pool of 4 = 60-160ms amortizable, but ONE-OFF grep calls pay the full cost. structuredClone of GrepMatch[] back to the main thread likely dominates for many-match workloads (215k matches × ~100B each = 20MB of clone work). Cancellation is hard: only `worker.terminate()` (experimental); no cooperative early-exit. No existing Worker precedent in the repo (only subprocess-based tests). The polyfill story is ugly: the library bundle is a single `dist/index.cjs` with no separate worker file. Theoretical ceiling on 4 cores: 1.7-2.3× — much of that eaten by cloning. Walker optimization is the bigger lever regardless. **Do not build a worker pool.**
-
-* **400 Bad Request from Sentry API indicates a CLI bug, not a user error**: Telemetry 400 convention: 400 = CLI bug (capture to Sentry), 401-499 = user error (skip).
-`isUserApiError()` uses `> 400` (exclusive). `isExpectedUserError()` guard in `app.ts` skips ContextError, ResolutionError, ValidationError, SeerError, 401-499 ApiErrors. Captures 400, 5xx, unknown. Skipped errors → breadcrumbs. For `ApiError`, call `Sentry.setContext('api_error', {...})` before `captureException` — the SDK doesn't auto-capture custom properties.
+### Gotcha
-
-* **CLI UX philosophy: auto-recover when intent is clear, warn gently**: UX principle: don't fail when intent is clear — do the intent and nudge via `log.warn()` to stderr. Keep errors in Sentry telemetry for visibility (e.g., SeerError for upsell tracking). Two recovery tiers: (1) auto-correct when semantics are identical (AND→space), (2) auto-recover with a warning when semantics differ (OR→space, warn about union→intersection). Only throw when intent can't be fulfilled. Model after `gh` CLI. AI agents are primary consumers constructing natural OR/AND queries.
+
+* **AuthError constructor takes reason first, message second**: `AuthError(reason: AuthErrorReason, message?: string)` where `AuthErrorReason` is `"not_authenticated" | "expired" | "invalid"`. It's easy to accidentally swap the args as `new AuthError("Token expired", "expired")` — the string `"Token expired"` gets assigned as `reason` (an invalid enum value). Tests aren't type-checked (tsconfig excludes them), so TypeScript won't catch this. Correct: `new AuthError("expired", "Token expired")`. Default messages exist for each reason, so the second arg is often unnecessary.
-
-* **Trace-related commands must handle project consistently across CLI**: Trace/log commands and project scoping: `getDetailedTrace` accepts optional numeric `projectId` (not hardcoded `-1`). Resolve slug→ID via `getProject()`. `formatSimpleSpanTree` shows the orphan annotation only when `projectFiltered` is set. `buildProjectQuery()` in `arg-parsing.ts` prepends `project:<…>` to queries (used by `trace/logs.ts`, `log/list.ts`).
-Multi-project: `--query 'project:[cli,backend]'`. Trace-logs endpoint (`/organizations/{org}/trace-logs/`) is org-scoped — uses `resolveOrg()` not `resolveOrgAndProject()`. Endpoint is PRIVATE (no `@sentry/api` types); hand-written `TraceLogSchema` in `src/types/sentry.ts` required.
+
+* **Dot-notation field filtering is ambiguous for keys containing dots**: Two gotchas in formatters/scan collectors: (1) `filterFields` in `src/lib/formatters/json.ts` dot-notation is ambiguous for keys containing dots — property tests must use the `[a-zA-Z0-9_]` charset. Counterexample: `{"a":{".":false}}` with path `"a."` splits into `["a",""]`. (2) `collectGlob`/`collectGrep` must NOT forward `maxResults` to the underlying iterator. The collector drains without a cap, breaks when `files.length >= maxResults`, and sets `truncated = true` on overshoot. Forwarding causes the iterator to stop at exactly N with `truncated` unset even when more matches existed. `collectGrep` regressed this once.
-### Gotcha
+
+* **grep multiline option was silently a no-op; /m is grep-default**: `GrepOptions.multiline` defaults to `true` (grep-like, not JS-native): `src/lib/scan/grep.ts::readAndGrep` previously called `ensureGlobalMultilineFlags` unconditionally, making the option a no-op. Fix: default `multiline: true` (matches rg/git-grep — `^/$` anchor to line boundaries), thread it through `PerFileOptions`, and use `ensureGlobalFlag` only when the caller explicitly opts out with `multiline: false` (buffer-boundary semantics). Note: `compilePattern`'s `multiline` default stays `false` (pattern-compile layer only).
-
-* **Biome lint differs between local lint:fix and CI lint**: Biome `lint:fix` (local) differs from CI `lint` — auto-fix can hide issues CI still catches: (1) `noPrecisionLoss` on integer literals >2^53, (2) `noIncrementDecrement` on `count++`, (3) import ordering when a named import follows non-import runtime code.
-Formatter rewrites multi-line imports to single-line when they fit. Always run `bun run lint` before pushing. Use `for...of` destructuring or `i += 1` instead of `++`; use `Number(string)` or split literals instead of `1_735_689_600_000_000_001`.
+
+* **mapFilesConcurrent skips null but not empty arrays — callers must return null for no-op**: `mapFilesConcurrent` in `src/lib/scan/concurrent.ts` filters `null` results but NOT empty arrays — it pushes them and fires `onResult` for each. The stream variant `mapFilesConcurrentStream` filters both `null || length === 0`. The asymmetry is intentional (a generic `TOut` can't assume array). Callers like `processEntry` in `src/lib/dsn/code-scanner.ts` must return `null` (not `[]`) for no-op files — on 10k-file walks where ~99% have no DSNs, returning `[]` fires ~9900 wasted `onResult` calls.
-
-* **Bun mock.module for node:tty requires default export and class stubs**: Bun testing gotchas: (1) `mock.module()` for CJS built-ins (e.g. `node:tty`) needs a `default` re-export plus named exports, declared top-level BEFORE `await import()`; lives in `test/isolated/`. (2) Destructuring imports capture the binding at load; verify via call-count > 0. (3) `Bun.mmap()` always opens PROT_WRITE — use `new Uint8Array(await Bun.file(path).arrayBuffer())` for read-only. (4) Wrap `Bun.which()` with optional `pathEnv` for deterministic testing. (5) Mocking `@sentry/node-core/light`: `startSpan` must pass a mock span to the callback — `startSpan: (_, fn) => fn({ setStatus(){}, setAttribute(){}, end(){} })`.
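The null-vs-empty asymmetry in the `mapFilesConcurrent` entry above, reduced to a serial toy model (the real helper in `src/lib/scan/concurrent.ts` runs with bounded concurrency and a richer signature; this is an illustrative simplification):

```typescript
// null results are dropped; anything else — including an empty array — is
// pushed and triggers onResult. That is the asymmetry callers must respect.
async function mapFilesConcurrentToy<TOut>(
  files: string[],
  fn: (file: string) => Promise<TOut | null>,
  onResult: (result: TOut) => void,
): Promise<TOut[]> {
  const out: TOut[] = [];
  for (const file of files) {
    const result = await fn(file); // real version runs with bounded concurrency
    if (result === null) continue; // skipped: no wasted onResult call
    out.push(result); // an empty array still lands here…
    onResult(result); // …and still fires the callback
  }
  return out;
}
```

Returning `null` for no-op files is what keeps `onResult` quiet on the ~99% of files without DSNs.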
+
+* **process.stdin.isTTY unreliable in Bun — use isatty(0) and backfill for clack**: Bun's single-file binary can leave `process.stdin.isTTY === undefined` on TTY fds inherited via redirects like `exec … <…`
+
+* **Spinner stdout/stderr collision: log messages inside withProgress appear on spinner line**: `withProgress` in `src/lib/polling.ts` writes to stdout with `\r\x1b[K` (no newline); consola writes to stderr. Any `log.info()` called **inside** the callback appears on the spinner line because stderr doesn't know stdout's carriage-return position. Fix: propagate data out via the return value and call `log.info()` **after** `withProgress` completes (the finally block clears the line). Affects `downloadBinaryToTemp` in `upgrade.ts`.
-
-* **403 scope extraction via api-scope.ts helpers**: `src/lib/api-scope.ts` exports `extractRequiredScopes(detail)`, which scans Sentry 403 response detail (string or structured) for scope-like tokens (e.g. `event:read`, `project:admin`). Matches free-text and structured `required`/`required_scopes` fields. Use in 403-enrichment paths instead of hardcoded generic scope lists; fall back to the generic hint only when extraction returns empty. Wired into `issue list` `build403Detail()`, `organizations.ts` `enrich403Error()`, and anywhere a 403 is re-thrown with a required-scopes suggestion.
+
+* **Stricli rejects unknown flags — pre-parsed global flags must be consumed from argv**: Stricli flag parsing traps: (1) Unknown `--flag`s are rejected — global flags parsed in `bin.ts` MUST be spliced from argv (check both `--flag value` AND `--flag=value`). (2) `FLAG_NAME_PATTERN` requires 2+ chars after `--`; single-char flags like `--x` silently become positionals — use aliases (`-x` → longer name). This bit the dashboard widget `--x`/`--y` flags.
+(3) Generated SKILL.md filters hidden flags: `FlagDef` includes `hidden?: boolean`; `extractFlags` propagates it so `generateCommandDoc` filters hidden flags alongside `help`/`helpAll`. Hidden flags (`--log-level`, `--verbose`) appear globally in `docs/src/content/docs/commands/index.md` Global Options.
+
+* **test:unit glob only picks up test/lib, test/commands, test/types, test/isolated**: `bun test` script globs in `package.json` are narrow: `test:unit` = `test/lib test/commands test/types`, `test:isolated` = `test/isolated`, `test:e2e` = `test/e2e`. Tests placed under `test/fixtures/`, `test/scripts/`, or `test/script/` are NOT picked up by any standard CI script despite being valid Bun test files. Place new tests under `test/lib/<…>/` to inherit coverage. Note: `test/script/` (singular) exists with script tests but is also outside the globs.
+
+* **Walker async-generator overhead vs sync readdir**: The async walker in `src/lib/scan/walker.ts` takes 480-620ms on 10k files (48µs/file). Sync `readdirSync` + naive walk: 50ms. The async-generator overhead is 10× pure I/O cost. Async serial `readdir`: 114ms; async concurrent (Promise.all): 21ms (fastest). Workers don't help — the walker is the bottleneck. The main perf gap vs ripgrep is walker overhead, not regex speed. Culprits: `Bun.file()` per yield (stat + lazy metadata), classifyFile branching, the onDirectoryVisit stat hook, `path.extname().toLowerCase()` per entry, and the per-directory await chain. Unopened optimization territory.
+
+* **Whole-buffer matchAll slower than split+test when aggregated over many files**: PR 791's `readAndGrep` rewrite replaced per-line `split("\n") + regex.test` with whole-buffer `regex.exec`.
+Micro-benchmarks on single files showed 12× faster, but the aggregate over 10k files says otherwise: split+test is ~1.6× FASTER on many-match workloads (215k matches across 9k files). The 160× faster-than-rg headline comes entirely from `mapFilesConcurrent.onResult` early-exit at maxResults=100 — NEW scans 9 files; OLD scans 55. Uncapped, NEW is 63% slower than the OLD fs-walk because whole-buffer match emission costs more than per-line `regex.test`. Frame early-exit wins honestly in bench headlines. Fix candidates: an adaptive switch based on match density, or revert to per-line for many-match patterns.
+### Pattern
+
+* **AGENTS.md lore conflicts in rebase: take --ours (main-side) and let lore regenerate**: When rebasing onto main and hitting AGENTS.md conflicts in the `## Long-term Knowledge` lore-managed section, resolve via `git checkout --ours AGENTS.md` (takes main's newer entries) rather than trying to merge block-by-block. Branch-specific lore entries will be regenerated by the lore agent on the next run. During rebase, `--ours` = the branch being rebased onto (main), `--theirs` = the commit being replayed.
+This conflict pattern recurs on every rebase when both main and the feature branch updated lore concurrently.
-
-* **I/O concurrency limits belong at the call site, not in generic combinators**: Pattern: module-scoped `pLimit()` with named constant (e.g., `STAT_CONCURRENCY = 32` in `project-root.ts`, `CACHE_IO_CONCURRENCY` in `response-cache.ts`, `pLimit(50)` in `code-scanner.ts`). Keeps combinators pure, makes budget explicit at I/O boundary. stat() lighter than full reads — ~32 for stats vs ~50 for reads, well below macOS's 256 FD ceiling.
+
+* **Bench harness with concurrency sweep + deterministic fixtures**: Local harness at `script/bench.ts` + `script/bench-sweep.ts` (not in CI). Scripts: `bun run bench`, `bench:save` (writes `.bench/baseline.json`), `bench:compare` (fails on >20% regression), `bench:sweep` (concurrency tuning). Flags: `--size`, `--op`, `--repo <…>` (refuses `--save-baseline`), `--runs`, `--warmup`, `--json`. Fixtures in `test/fixtures/bench/` use an xorshift32-seeded generator for byte-identical trees in `os.tmpdir()/sentry-cli-bench/`. Baselines are gitignored (machine variance). Op labels match Sentry span names. `withBenchDb(fn)` scopes `SENTRY_CONFIG_DIR` to a temp dir; `clearDsnDetectionCache()` runs between cold iterations. **CONCURRENCY_LIMIT gotcha**: the sweep showed knee=availableParallelism for walker-fed workloads, but pure-I/O workloads want 16-32. The walker dominates; raising it for pure I/O regressed the DSN scanner. Kept at `min(16, max(2, availableParallelism()))`.
-
-* **Identity-scoped response cache via fingerprint mixin**: Identity-scoped response cache: `buildCacheKey(method, url)` reads identity via memoized `getIdentityFingerprint()`.
-MD5 of `kind|secret` truncated to 16 hex (CodeQL flagged as `js/insufficient-password-hash` — dismissed as won't-fix: this is cache namespacing, not auth; collisions are benign). `ANON_IDENTITY = "<…>"`. `CacheEntry` persists identity so `invalidateCachedResponsesMatching(prefix)` skips other identities. All invalidation helpers are no-throw. `clearAuth()` dynamically imports `clearResponseCache` to break the db/auth ↔ response-cache cycle. Batch mutations: per-ID detail invalidation + a single post-loop list sweep avoids N+1. **Gotcha:** exact-match `invalidateCachedResponse()` silently fails on URLs with query params (e.g. `getIssue(id, {collapse:[...]})` caches as `/issues/{id}/?collapse=...`) — use `invalidateCachedResponsesMatching(prefix)` with a trailing-slash URL as the prefix for any endpoint called with `params`.
+
+* **Event view cross-org fallback chain: project → org → all orgs**: `fetchEventWithContext` in `src/commands/event/view.ts` tries (1) `getEvent(org, project, eventId)`, (2) `resolveEventInOrg(org, eventId)`, (3) `findEventAcrossOrgs(eventId, { excludeOrgs })`. Extracted into `tryEventFallbacks()` for the Biome complexity limit. `excludeOrgs` is only set when the same-org search returned null (not on a transient error). Both catch blocks re-throw `AuthError` while swallowing transient errors. The warning distinguishes same-org/different-project from different-org via `crossOrg.org === org`.
-
-* **Isolated adapter coverage via fetch mocking in test/lib/**: To get CodeCov coverage on API-calling functions (e.g., hex-id-recovery adapters, api-client functions), write tests in `test/lib/*.coverage.test.ts` or `test/lib/*.adapters.test.ts` that mock `globalThis.fetch` via `mockFetch()` from `test/helpers.js`, call `setAuthToken()` + `setOrgRegion()` in `beforeEach`, and invoke the REAL function.
-Tests in `test/e2e/` or tests that stub the exports via `spyOn`/`mock.module` give ZERO coverage to the mocked function body. Use `useTestConfigDir()` for DB isolation. Pattern examples: `test/lib/api-client.coverage.test.ts` and `test/lib/hex-id-recovery.adapters.test.ts`. Mock responses must include ALL Zod-required fields — minimal stubs fail schema validation with a noisy `ApiError`.
+
+* **Extract shared pipeline setup between streaming and collecting variants of same operation**: When a module exposes both a streaming (`AsyncGenerator`) and a collecting (`Promise<…>`) variant of the same pipeline (e.g., `grepFiles` + `collectGrep` in `src/lib/scan/grep.ts`), extract a shared setup helper (e.g., `setupGrepPipeline`) that handles pattern compilation, matcher compilation, option defaulting, walker construction, and filter wiring. Otherwise divergence between the two setup paths becomes a silent correctness bug. The collecting variant typically adds a `+1 probe` over the streaming variant's `maxResults` to distinguish 'exactly N' from 'N+ available'.
-
-* **Memoize identity fingerprint with test-reset hook + setAuthToken invalidation**: `getIdentityFingerprint()` in `src/lib/db/auth.ts` is memoized via a module-level `cachedFingerprint` variable — identity is stable within a CLI run. Invalidate in `setAuthToken()` and `clearAuth()` (the only mutation points); OAuth access-token rotation preserves the refresh_token so the fingerprint stays stable. Export `resetIdentityFingerprintCache()` for tests — tests mutating `process.env.SENTRY_AUTH_TOKEN` mid-test must call it between reads, since env changes bypass the auth mutation path. `beforeEach` in auth tests should also call it. This memoize+reset pattern mirrors `resetUpdateNotificationState`, `resetCacheState`, `resetAuthHintState`.
+ +* **Init-wizard grep/glob adapters must sandbox search.path via safePath before passing to scan**: Init-wizard grep/glob tools in \`src/lib/init/tools/{grep,glob}.ts\` are thin adapters over \`collectGrep\`/\`collectGlob\` from \`src/lib/scan/\`. The scan module trusts \`opts.path\` — per \`src/lib/scan/types.ts\`, sandbox enforcement is the adapter's responsibility. Adapters MUST call \`safePath(payload.cwd, search.path)\` from \`init/tools/shared.ts\` before forwarding; skipping this regresses the \`rejects grep paths outside the init sandbox\` test. \`safePath\` returns an absolute path + walks ancestors with \`realpathSync\` to block symlink escapes. Two-layer defense: \`validateToolSandbox(payload, context.directory)\` in the registry pre-checks \`payload.cwd\` is inside project root; \`safePath\` guards per-tool sub-paths. Wire contract (GrepPayload/GlobPayload in \`init/types.ts\`) is minimal: pattern/path?/include? for grep, patterns\[]/path?/maxResults? for glob — no caseInsensitive/fixedStrings/etc on the wire today. - -* **Stricli parse functions can perform validation and sanitization at flag-parse time**: Stricli's \`parse\` fn on \`kind: "parsed"\` flags runs during argument parsing before \`func()\`. Can throw errors (including \`ValidationError\`) and log warnings. Current uses: \`parseCursorFlag\`, \`sanitizeQuery\`, \`parsePeriod\` (returns \`TimeRange\`), \`parseSort\`/\`parseSortFlag\`, \`numberParser\`/\`parseLimit\`. Optional period flags: \`flags.period\` is \`TimeRange | undefined\` — commands use \`TIME\_RANGE\_\*\` constants as defaults. \`formatTimeRangeFlag()\` converts back to display strings. \`appendPeriodHint(parts, period, flagName)\` in \`time-range.ts\` encapsulates \`if (formatted !== DEFAULT\_PERIOD) parts.push(...)\` across 4+ hint builders. 
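A hedged sketch of the parse-at-flag-time idea (the `TimeRange` shape and accepted units are assumptions for illustration, not the real `time-range.ts` contract):

```typescript
// Illustrative flag parser in the spirit of parsePeriod: it runs while
// arguments are parsed, so invalid input fails before the command body.
type TimeRange = { value: number; unit: "h" | "d" };

function parsePeriod(raw: string): TimeRange {
  const m = /^(\d+)([hd])$/.exec(raw.trim());
  if (m === null) {
    // A real Stricli parse fn could throw ValidationError here; a plain
    // Error keeps the sketch self-contained.
    throw new Error(`invalid period "${raw}" (expected e.g. "24h" or "7d")`);
  }
  return { value: Number(m[1]), unit: m[2] as "h" | "d" };
}
```

Because the parser runs during argument parsing, a bad `--period` value rejects before `func()` executes, and the command body only ever sees a well-typed `TimeRange`.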
+ +* **Literal prefilter for grep — extract longest non-metachar run from regex**: \`src/lib/scan/literal-extract.ts::extractInnerLiteral()\` extracts the longest literal substring (≥3 chars, not all-whitespace/all-punctuation) from a regex source. Used in \`grep.ts::readAndGrep\` as a file-level gate: \`indexOf(literal)\` cheaply rejects files that can't contain a match; files that pass go to the whole-buffer regex, which does all actual match localization (the literal is never used to locate individual matches). 18% win on \`import.\*from\`-style patterns. Two dispatch paths: (1) extractable literal (including pure-literal patterns) → \`indexOf\` gate, then whole-buffer regex, (2) no safe literal (top-level alternation, all-metachar, etc.) → whole-buffer regex directly. \*\*Char-class opaqueness gotcha\*\*: \`hasTopLevelAlternation\` must skip over \`\[...]\` as opaque — \`\[(]foo|bar\` has literal \`(\` inside the class; incorrectly incrementing paren depth hides top-level \`|\`. -### Preference + +* **PR review workflow: Cursor Bugbot + Seer + human cycle**: Use \`gh pr checks \ --json state,link\` + \`gh run view --log-failed\` to detect CI failures. Fetch unresolved review threads via \`gh api graphql\` with reviewThreads query filtering \`isResolved: false\` + \`isMinimized: false\`. After addressing findings: post reply via \`gh api repos/.../pulls/comments/\/replies\` then resolve thread via \`resolveReviewThread\` GraphQL mutation. Bots auto-resolve threads when they detect the fix in a follow-up commit. Repeat until zero unresolved + all CI green. \`mergeStateStatus: UNSTABLE\` means non-blocking checks still running (bots); \`BLOCKED\` means required CI pending; \`CLEAN\` + \`MERGEABLE\` = ready to merge. Repo uses squash-merge only (\`mergeCommit: false, rebase: false, squash: true\` from gh api). - -* **Code review style: BYK values brevity; trim JSDoc essays aggressively**: BYK code review style — brevity first: terse 1-3 line JSDoc over essays. Remove comments that restate function names/control flow. Don't wrap try/catch around already-no-throw helpers.
Don't use HMAC when MD5 suffices for non-auth hashing. Don't make imports lazy without a documented reason. When bot suggests \`.unshift()\` on returned API objects, use spread/slice instead (mutation corrupts callers — e.g. \`mergeIssues\` returns \`data.merge\` with children). Prefer \`\[...new Set(items)]\` over hand-rolled dedupe. Typical review trims: -200 to -450 lines. + +* **URL resolution chain and persistent default integration point**: CLI init sequence: \`startCli()\` in \`cli.ts\`: (1) fast-path \`\_\_complete\` dispatch, (2) \`preloadProjectContext(cwd)\` finds project root + \`.sentryclirc\`, calls \`applySentryCliRcEnvShim()\`, (3) persistent defaults applied, (4) \`runCli(args)\` with \`withTelemetry()\`. Heavy modules use \`await import()\` for fast completion. \`preloadProjectContext\` is try/catch wrapped. Sentry URL resolution priority: \`SENTRY\_HOST\` > \`SENTRY\_URL\` > \`.sentryclirc \[defaults] url\` > metadata \`defaults.url\` > \`sentry.io\`. Persistent URL default written to \`env.SENTRY\_URL\` only if both \`SENTRY\_HOST\` and \`SENTRY\_URL\` unset. diff --git a/src/lib/scan/grep.ts b/src/lib/scan/grep.ts index 778be453a..313cb0c9c 100644 --- a/src/lib/scan/grep.ts +++ b/src/lib/scan/grep.ts @@ -4,7 +4,8 @@ * Layers regex matching onto `walkFiles`: * 1. Walk cwd with the caller's filters (extensions, depth, gitignore). * 2. Post-filter yields by `isBinary` + include/exclude globs. - * 3. Read each surviving file's content and test line-by-line. + * 3. Read each surviving file's content, optionally gate via the + * extracted literal, run the regex across the full buffer. * 4. Emit `GrepMatch` entries as they arrive. * * ### Primary API @@ -22,15 +23,22 @@ * * Full slurp via `Bun.file(path).text()`. The walker's * `maxFileSize` (default 256 KB) caps the blast radius; minified - * bundles bigger than that never reach us. Line splitting via - * `content.split("\n")` — fine for utf-8 text, preserves `\r` on - * CRLF files verbatim. 
+ * bundles bigger than that never reach us. CRLF line endings are + * preserved verbatim in the emitted match text. * - * ### Regex strategy + * ### Matching strategy * - * `compilePattern` produces a `g`-less RegExp. Each line gets a - * fresh `regex.test(line)` — no `lastIndex` state to reset. Inline - * flags `(?i)` / `(?im)` / `(?i:...)` are translated to JS flags. + * Before matching, `setupGrepPipeline` extracts a literal prefix + * from the compiled regex (when possible). If the file doesn't + * contain that literal, it can't possibly match — skip via cheap + * `indexOf`. Files that pass the gate, or patterns with no + * extractable literal, go through `grepByWholeBuffer`, which clones + * the compiled regex per file (`/g`-augmented, plus `/m` when + * `multiline` is true), iterates via `regex.exec(content)` on the + * full buffer, and emits one `GrepMatch` per matching line + * (advancing `lastIndex` past the line's newline so the next match + * starts on a fresh line). Inline flags `(?i)` / `(?im)` / `(?i:...)` + * are translated to JS flags by `compilePattern`. */ import { handleFileError } from "../dsn/fs-utils.js"; @@ -38,6 +46,7 @@ import { type ConcurrentOptions, mapFilesConcurrentStream, } from "./concurrent.js"; +import { extractInnerLiteral } from "./literal-extract.js"; import { basenameOf, type CompiledMatcher, @@ -146,6 +155,21 @@ type PerFileOptions = { maxLineLength: number; maxMatchesPerFile: number; pathPrefix: string; + /** + * Optional literal prefilter — when set, the per-file path runs + * a cheap `indexOf(literal)` gate before invoking the regex + * engine. Files that don't contain the literal are skipped + * entirely. Computed once per `collectGrep`/`grepFiles` call via + * `extractInnerLiteral`. + * + * NOTE: the gate is file-level only. 
We intentionally do NOT use + * the literal to locate individual matches — patterns that can + * span newlines (e.g., `foo\sbar` where `\s` matches `\n`) and + * patterns whose literal differs from the compiled form (e.g., + * `\x41foo` matches `Afoo`, not `\x41foo`) would both produce + * silent misses under per-line verify. + */ + literal: string | null; }; /** @@ -171,7 +195,6 @@ type PerFileOptions = { * the next `matchAll` iteration. Without this skip, a regex like * `/foo/g` on `"foo foo"` would emit two matches on the same line. */ -// biome-ignore lint/complexity/noExcessiveCognitiveComplexity: per-match cursor + line-number + truncation is inherently branchy async function readAndGrep( entry: WalkEntry, opts: PerFileOptions @@ -187,7 +210,67 @@ async function readAndGrep( return null; } - const matches: GrepMatch[] = []; + // File-level prefilter gate. If the pattern has an extractable + // literal, skip the regex engine entirely on files that don't + // contain the literal — ripgrep's central optimization, adapted. + // + // Important: this is a FILE-level gate only. We deliberately do + // NOT attempt to use the literal to find/verify matches line-by- + // line. That approach has subtle correctness failures for patterns + // that can match across newlines (e.g., `foo\sbar` where `\s` + // matches `\n`) and for patterns whose literal isn't what the + // regex engine actually matches (e.g., `\x41foo` matches `Afoo`, + // not the literal string `\x41foo`; and `(?i:foo|bar)baz` compiles + // to `foo|barbaz` with different alternation structure). + // + // Keeping the gate at file-level sidesteps all of that: we only + // use `indexOf(literal)` to quickly reject files that can't match, + // and let V8's regex engine handle the actual matching. The perf + // win is still substantial because most files in a large tree + // contain zero instances of the literal, and `indexOf` is much + // cheaper than constructing the regex engine's NFA state. 
+ // + // Preconditions for the gate: + // - For case-insensitive search, the literal is lowercased at + // extraction time and the check runs against content.toLowerCase(); + // length-changing lowercasings are harmless because the gate is + // boolean-only (see the note below). + // - The literal must actually be something the compiled regex MUST + // match — guaranteed by \`extractInnerLiteral\` operating on the + // compiled regex source (see \`setupGrepPipeline\`). + if (opts.literal !== null) { + const isCaseInsensitive = opts.regex.flags.includes("i"); + // Note: for case-insensitive search, \`content.toLowerCase()\` may + // be LONGER than \`content\` (e.g., Turkish \`İ\` → \`i\` + U+0307). + // That's fine because we only use the result as a boolean gate: + // "does the literal exist anywhere in this file?" The returned + // position is never used as a \`content\` offset — the whole-buffer + // regex engine does the actual match localization. + const haystack = isCaseInsensitive ? content.toLowerCase() : content; + if (haystack.indexOf(opts.literal) === -1) { + return []; + } + // File contains the literal — fall through to the whole-buffer + // matcher. The regex engine handles the actual matching; the + // literal gate only ruled out files that can't possibly match. + } + + return grepByWholeBuffer(content, entry, opts); +} + +/** + * Whole-buffer grep — runs the compiled regex across the full file + * content, emitting one \`GrepMatch\` per match's enclosing line. + * + * This is the primary matching implementation. The file-level + * literal gate in \`readAndGrep\` may skip files that have no chance + * of matching, but once a file reaches this function the regex + * engine does the actual work. + */ +function grepByWholeBuffer( + content: string, + entry: WalkEntry, + opts: PerFileOptions +): GrepMatch[] { // Whole-buffer iteration requires \`/g\`.
`/m` (line-boundary // anchoring) is applied when the caller opts into grep-like // semantics via `opts.multiline` (the default — see `GrepOptions` @@ -195,68 +278,85 @@ async function readAndGrep( // anchor `^/$` to the buffer boundaries like raw JS semantics. // // We clone the regex per file so each invocation has its own - // `lastIndex`. Today the exec/loop block below runs synchronously - // with no `await`, so concurrent `readAndGrep` workers can't - // actually observe each other's `lastIndex` mutations (JS is - // single-threaded; microtasks only yield at `await`). But the - // clone is still worth paying — it's ~1µs per file and eliminates - // the foot-gun if anyone ever introduces an `await` inside the - // match loop. + // `lastIndex`. Today the exec/loop block runs synchronously with + // no `await`, so concurrent `readAndGrep` workers can't actually + // observe each other's `lastIndex` mutations (JS is single-threaded; + // microtasks only yield at `await`). But the clone is still worth + // paying — ~1µs per file — to eliminate the foot-gun if anyone + // ever introduces an `await` inside the match loop. const ensured = opts.multiline ? ensureGlobalMultilineFlags(opts.regex) : ensureGlobalFlag(opts.regex); const regex = new RegExp(ensured.source, ensured.flags); + const ctx: MatchContext = { entry, opts, content }; - // Cursor for incremental line-number computation. We walk forward - // through the buffer, counting `\n` once per match emission. The - // 10-char constant is `\n`. - const NEWLINE = 10; + const matches: GrepMatch[] = []; + // Advance `cursor` to each match index using `indexOf("\n", cursor)` + // in a loop — skipping newline-to-newline instead of walking char- + // by-char via `charCodeAt`. 2-5× faster because V8 implements + // `indexOf` in optimized C++ with no per-iteration JS interop. 
let lineNum = 1; let cursor = 0; let match = regex.exec(content); while (match !== null) { const matchIndex = match.index; - // Advance line counter to the match position. - while (cursor < matchIndex) { - if (content.charCodeAt(cursor) === NEWLINE) { - lineNum += 1; - } - cursor += 1; + let nl = content.indexOf("\n", cursor); + while (nl !== -1 && nl < matchIndex) { + lineNum += 1; + nl = content.indexOf("\n", nl + 1); } + cursor = matchIndex; const lineStart = content.lastIndexOf("\n", matchIndex) + 1; - const lineEnd = content.indexOf("\n", matchIndex); - const rawLine = content.slice( - lineStart, - lineEnd === -1 ? content.length : lineEnd - ); - const line = - rawLine.length > opts.maxLineLength - ? `${rawLine.slice(0, opts.maxLineLength - 1)}…` - : rawLine; - matches.push({ - path: opts.pathPrefix - ? joinPosix(opts.pathPrefix, entry.relativePath) - : entry.relativePath, - absolutePath: entry.absolutePath, - lineNum, - line, - }); + const lineEndRaw = content.indexOf("\n", matchIndex); + const lineEnd = lineEndRaw === -1 ? content.length : lineEndRaw; + matches.push(buildMatch(ctx, { lineNum, lineStart, lineEnd })); if (matches.length >= opts.maxMatchesPerFile) { break; } - // Per-line-emission contract: skip past this line's `\n` so the - // next matchAll iteration starts on a new line. If the match - // ended the file, we're done. - if (lineEnd === -1) { + if (lineEndRaw === -1) { break; } regex.lastIndex = lineEnd + 1; match = regex.exec(content); } - return matches; } +/** + * Per-call context for `buildMatch` — bundles args that stay + * constant for the duration of a file's scan (entry, opts, content) + * separately from args that change per match (lineNum, lineStart, + * lineEnd). Keeps Biome's useMaxParams under the 4-param ceiling. 
+ */ +type MatchContext = { + entry: WalkEntry; + opts: PerFileOptions; + content: string; +}; + +type LineBounds = { + lineNum: number; + lineStart: number; + lineEnd: number; +}; + +/** Build a `GrepMatch` record for one line. Shared by all grep paths. */ +function buildMatch(ctx: MatchContext, bounds: LineBounds): GrepMatch { + const rawLine = ctx.content.slice(bounds.lineStart, bounds.lineEnd); + const line = + rawLine.length > ctx.opts.maxLineLength + ? `${rawLine.slice(0, ctx.opts.maxLineLength - 1)}…` + : rawLine; + return { + path: ctx.opts.pathPrefix + ? joinPosix(ctx.opts.pathPrefix, ctx.entry.relativePath) + : ctx.entry.relativePath, + absolutePath: ctx.entry.absolutePath, + lineNum: bounds.lineNum, + line, + }; +} + /** * Build walker options from grep options — the set that `walkFiles` * actually consumes. Anything grep-specific stays in grep. @@ -341,6 +441,14 @@ async function* grepFilesInternal( type GrepPipelineSetup = { regex: RegExp; multiline: boolean; + /** + * Pre-extracted literal prefilter (see `extractInnerLiteral`). Null + * when the pattern has no safe extractable literal (top-level + * alternation, all-metachar, etc.). When set, `readAndGrep` uses + * it as a file-level gate via `indexOf` — files without the literal + * are skipped entirely. + */ + literal: string | null; stats: GrepStats; walkSource: AsyncIterable; maxResults: number; @@ -355,6 +463,12 @@ type GrepPipelineSetup = { /** * Resolve all defaults + compile + wire the walker. Both `grepFiles` * and `collectGrep` consume this — keeps defaults in one place. + * + * Extracts a literal prefix/substring from the compiled regex (if + * possible) so the per-file path can use `indexOf` as a file-level + * gate. The gate is conservative: we only use it to skip files that + * can't possibly contain a match; the whole-buffer matcher handles + * the actual matching. 
*/ function setupGrepPipeline(opts: GrepOptions): GrepPipelineSetup { // Default `multiline: true` — matches grep/rg's line-boundary @@ -365,6 +479,22 @@ function setupGrepPipeline(opts: GrepOptions): GrepPipelineSetup { caseSensitive: opts.caseSensitive, multiline, }); + + // Literal extraction runs on the COMPILED regex's source, not the + // raw user input. `compilePattern` rewrites scoped inline flags + // like `(?i:foo|bar)baz` — strip the group and widen the flag, + // which yields `foo|barbaz` (source) + `i` (flag). The rewritten + // source has top-level alternation the original lacked. If we + // extracted from the original, `hasTopLevelAlternation` would miss + // it and we'd extract `"baz"` as a "required" literal — but the + // compiled regex can match `foo` alone, silently dropping lines + // that match the first branch of the alternation. + // + // Always extract from `regex.source` so the extractor sees what + // the regex engine will actually run. `regex.flags` is authoritative + // for case-sensitivity either way. + const literal = extractInnerLiteral(regex.source, regex.flags); + const includes = compileMatchers(opts.include); const excludes = compileMatchers(opts.exclude); const includeBinary = opts.includeBinary ?? false; @@ -381,6 +511,7 @@ function setupGrepPipeline(opts: GrepOptions): GrepPipelineSetup { return { regex, multiline, + literal, stats, walkSource, maxResults: opts.maxResults ?? 
Number.POSITIVE_INFINITY, @@ -417,6 +548,7 @@ export async function* grepFiles(opts: GrepOptions): AsyncGenerator { maxLineLength: setup.maxLineLength, maxMatchesPerFile: setup.maxMatchesPerFile, pathPrefix: setup.pathPrefix, + literal: setup.literal, }, walkSource: setup.walkSource, stats: setup.stats, @@ -461,6 +593,7 @@ export async function collectGrep(opts: GrepOptions): Promise { maxLineLength: setup.maxLineLength, maxMatchesPerFile: setup.maxMatchesPerFile, pathPrefix: setup.pathPrefix, + literal: setup.literal, }, walkSource: setup.walkSource, stats: setup.stats, diff --git a/src/lib/scan/literal-extract.ts b/src/lib/scan/literal-extract.ts new file mode 100644 index 000000000..e33a9394d --- /dev/null +++ b/src/lib/scan/literal-extract.ts @@ -0,0 +1,426 @@ +/** + * Conservative literal-prefix extractor for grep's file-level gate. + * + * Walks a regex source looking for the longest contiguous run of + * non-metacharacter literal bytes that every match must contain. + * When found, `grep.ts::readAndGrep` checks whether the file + * contains that literal via the vastly cheaper + * `String.prototype.indexOf` and skips the regex engine entirely on + * files where it doesn't appear — ripgrep's central perf trick, + * adapted to pure JS as a file-level gate. + * + * The gate is deliberately file-level only. Per-line or per-match + * use of the literal would introduce subtle correctness failures + * for patterns whose match spans newlines (e.g., `foo\sbar`) or + * whose compiled form differs from the user's source (e.g., + * `\x41foo` decodes to `Afoo`). Keeping the gate at file-level + * sidesteps all of that: the regex engine still does the actual + * matching on files that pass. + * + * ### Extraction rules (conservative; safety over completeness) + * + * A literal is "safe" if every match of the regex must contain that + * exact byte sequence. 
We bail out on anything that could produce + * matches without the candidate literal: + * + * - Top-level alternation (`foo|bar`): any branch could match, no + * single literal is required. v1 rejects these; v2 could return a + * Set and probe with multiple `indexOf` calls. + * - Character class `[abc]`: the class matches any of its members; + * no single byte is required. + * - Capturing / non-capturing group `(foo)`, `(?:foo)`: the content + * MIGHT be useful but its enclosure complicates extraction; v1 + * extracts outside groups only. + * - Quantifier `?`, `*`, `+`, `{0,n}`: the quantified atom may not + * appear. Drop the preceding character and break the current run. + * (For `{1,n}` and `{n,}` with n≥1 the preceding atom IS required, + * but v1 doesn't distinguish — all quantifiers break the run.) + * - Escape sequences `\d`, `\w`, `\s`, etc.: not literal; break the + * run. Escaped literal chars (`\.`, `\/`, `\(`) could be included + * but v1 takes the safe path and breaks the run on any `\`. + * - Anchors `^`, `$`: not bytes in the match; break the run without + * invalidating the prefix/suffix. + * - Lookaround `(?=…)`, `(?!…)`, `(?<=…)`, `(? { + if (current.length > best.length) { + best = current; + } + current = ""; + }; + + while (i < source.length) { + const c = source[i]; + + if (c === "\\") { + // Escape sequence. Three cases: + // + // 1. Escaped metacharacter (`\.`, `\/`, `\\`, etc.) — represents + // a literal byte, extends the current run. + // + // 2. Multi-char escape sequences (`\x41`, `\u0041`, `\u{1F600}`, + // `\cA`, `\k`) — these produce SPECIFIC characters but + // the tail (e.g., `41` after `\x`) is NOT a literal run + // extender. Skip past the whole sequence and break the run. + // (A future pass could decode the escape and contribute its + // char to the run — complex, bail-safe for v1.) + // + // 3. Character-class escapes (`\d`, `\w`, `\b`, `\t`, etc.) — + // not literals. Advance 2 and break the run. 
+ const next = source[i + 1]; + if (next === undefined) { + // Trailing `\` — malformed, bail on this run. + commit(); + i += 2; + continue; + } + if (ESCAPED_LITERAL_CHARS.has(next)) { + // Case 1: `\.`, `\/`, `\\` etc. — extends the literal run. + current += next; + i += 2; + continue; + } + // Case 2 / 3: compute how far to advance past the whole escape + // sequence. Anything we don't recognize defaults to 2 (single- + // char escape like `\d`, `\b`, `\t`, `\n`, `\r`, `\s`, `\W`, etc.). + commit(); + i += escapeSequenceLength(source, i); + continue; + } + + if (c === "[") { + // Character class — content isn't a single required literal. + commit(); + i = skipCharacterClass(source, i); + continue; + } + + if (c === "(") { + // Group or lookaround — skip. v1 doesn't peek inside. + commit(); + i = skipGroup(source, i); + continue; + } + + if (QUANTIFIER_CHARS.has(c ?? "") || c === "{") { + // The preceding character was quantified — it may not appear + // in the actual match. Drop it from the current run. + if (current.length > 0) { + current = current.slice(0, -1); + } + commit(); + // Advance past the quantifier (and past `{n,m}` if present). + i += 1; + if (c === "{") { + while (i < source.length && source[i] !== "}") { + i += 1; + } + i += 1; // past the `}` + } else if (source[i] === "?") { + // Lazy quantifier `??`, `*?`, `+?` + i += 1; + } + continue; + } + + if (META_CHARS.has(c ?? "")) { + // Other metacharacter (`.`, `^`, `$`): break the run but don't + // drop anything from current. + commit(); + i += 1; + continue; + } + + // Literal byte — extend the current run. + current += c; + i += 1; + } + commit(); + + if (best.length < MIN_LITERAL_LEN) { + return null; + } + if (ALL_WHITESPACE_RE.test(best) || ALL_PUNCTUATION_RE.test(best)) { + // All whitespace or all punctuation — too high a match rate. + return null; + } + + return flags.includes("i") ? best.toLowerCase() : best; +} + +/** + * Is there a top-level `|` (alternation) in the pattern? 
+ * + * Character classes `[...]` are OPAQUE to this scan — their interior + * contains literal bytes (including literal `(`, `)`, `|`) that must + * not be interpreted as grouping or alternation metacharacters. We + * skip the whole class in one step when `[` is seen. + */ +function hasTopLevelAlternation(source: string): boolean { + let depthParen = 0; + let i = 0; + while (i < source.length) { + const c = source[i]; + if (c === "\\") { + // Skip the whole escape sequence. An escaped char is never a + // grouping/alternation metacharacter, and multi-char escapes + // like `\x41`, `\u0041`, `\k` have tails that must not + // be mistaken for `|`, `(`, `)`. + i += escapeSequenceLength(source, i); + continue; + } + if (c === "[") { + // Character class: content is opaque. Skip past the first + // unescaped `]` (classes are NOT nestable — `[[]` matches a + // single literal `[`). This prevents `[(]` from being + // mistaken for an opening paren and corrupting `depthParen`, + // which would hide a subsequent top-level `|` alternation. + i = skipCharacterClass(source, i); + continue; + } + if (c === "(") { + depthParen += 1; + } else if (c === ")") { + depthParen -= 1; + } else if (c === "|" && depthParen === 0) { + return true; + } + i += 1; + } + return false; +} + +/** + * Compute the total length (in source characters) of an escape + * sequence starting at `i` (where `source[i] === "\\"`). Handles: + * + * - `\xHH` — hex byte (4 source chars) + * - `\uHHHH` — unicode BMP code point (6 source chars) + * - `\u{H+}` — braced unicode code point (variable length) + * - `\cX` — control char (3 source chars) + * - `\k` — named backref (variable length) + * - `\p{Name}`, `\P{Name}` — Unicode property escape (variable) + * - Anything else (single-char escape like `\d`, `\w`, `\b`, `\t`, + * escaped metachars, numeric backrefs) — 2 source chars. 
+ * + * Used by the extractor to correctly skip past the ENTIRE escape + * sequence so its tail characters don't get misinterpreted as + * literal bytes. `\x41foo` must advance 4 chars past `\x41`, not + * just 2 past `\x` — otherwise the extractor sees `41foo` as + * literals and wrongly extracts `"41foo"`. + */ +function escapeSequenceLength(source: string, i: number): number { + const next = source[i + 1]; + if (next === undefined) { + return 2; + } + // Braced: `\u{H+}`, `\p{...}`, `\P{...}` — consume until closing `}`. + if ((next === "u" || next === "p" || next === "P") && source[i + 2] === "{") { + const close = source.indexOf("}", i + 3); + if (close === -1) { + return 2; + } + return close - i + 1; + } + // Fixed-width multi-char escapes. + if (next === "x") { + return 4; // \x + 2 hex digits + } + if (next === "u") { + return 6; // \u + 4 hex digits + } + if (next === "c") { + return 3; // \c + 1 control char + } + // Named backref: `\k`. + if (next === "k" && source[i + 2] === "<") { + const close = source.indexOf(">", i + 3); + if (close === -1) { + return 2; + } + return close - i + 1; + } + // Default: single-char escape (`\d`, `\w`, `\b`, `\t`, etc., or + // a single-digit numeric backref `\1`). + return 2; +} + +/** + * Advance past a character class `[...]` starting at `i` + * (where `source[i] === "["`). Unlike parens/groups, `[` inside + * `[...]` is NOT nestable — it's a literal `[` character. Only the + * first unescaped `]` closes the class. + * + * Handles escape sequences (`\]` stays inside the class). + * + * Returns the index just past the closing `]`, or `source.length` if + * the class is unterminated (malformed regex; caller should still + * advance past it). 
+ */ +function skipCharacterClass(source: string, i: number): number { + let j = i + 1; + while (j < source.length) { + const c = source[j]; + if (c === "\\") { + j += escapeSequenceLength(source, j); + continue; + } + if (c === "]") { + return j + 1; + } + j += 1; + } + return j; +} + +/** + * Advance past a balanced `(...)` group starting at `i` (where + * `source[i] === "("`). Handles nested groups and escaped chars. + * Returns the index just past the matching `)`. + * + * CRITICAL: `[...]` character classes are OPAQUE to group-depth + * tracking. A `)` or `(` inside a class is a literal byte, not a + * grouping metacharacter, and MUST NOT change depth. Without this, + * a pattern like `(foo[)]bar)?baz` would exit the group at the `)` + * inside `[)]`, making the extractor treat `bar` as top-level and + * missing the fact that the whole group is optional. + * + * NOTE: Regex character classes are NOT nestable — `[[]` is a + * class containing a literal `[`, per regex spec. + */ +function skipGroup(source: string, i: number): number { + let depth = 1; + let j = i + 1; + while (j < source.length && depth > 0) { + const c = source[j]; + if (c === "\\") { + j += escapeSequenceLength(source, j); + continue; + } + if (c === "[") { + // Skip the whole character class — `(` and `)` inside it + // are literal and must not affect group depth. 
+ j = skipCharacterClass(source, j); + continue; + } + if (c === "(") { + depth += 1; + } else if (c === ")") { + depth -= 1; + } + j += 1; + } + return j; +} diff --git a/test/lib/scan/grep.test.ts b/test/lib/scan/grep.test.ts index 357faed4b..14c3883c6 100644 --- a/test/lib/scan/grep.test.ts +++ b/test/lib/scan/grep.test.ts @@ -555,3 +555,294 @@ describe("collectGrep — pre-compiled RegExp isolation", () => { } }); }); + +describe("collectGrep — literal prefilter fast path", () => { + /** + * These tests exercise the file-level literal gate in \`readAndGrep\` + * that kicks in when the pattern has an extractable literal + * (e.g., \`import.*from\` → literal \`import\`). They verify the gated + * path produces the same matches as the ungated whole-buffer path. + */ + + test("regex with extractable literal produces same matches as whole-buffer", async () => { + const { cwd, cleanup } = makeSandbox({ + "a.ts": [ + "import { foo } from 'bar';", + "const x = 42;", + "import { baz } from 'qux';", + "// not an import statement", + "ximport_typeXYZ", // contains "import" but no "from" after + ].join("\n"), + }); + try { + const { matches } = await collectGrep({ + cwd, + pattern: "import.*from", + }); + // Two true matches: lines 1 and 3. Line 5 has "import" but no + // "from" after it: the literal gate admits the file, but the + // regex rejects the line. + expect(matches.map((m) => m.lineNum)).toEqual([1, 3]); + } finally { + cleanup(); + } + }); + + test("literal prefilter correctly handles escape sequences like Sentry\\.init", async () => { + const { cwd, cleanup } = makeSandbox({ + "a.ts": [ + "import * as Sentry from 'sentry';", + "Sentry.init({ dsn: '...' 
});", + "const x = Sentryxinit; // not a match", + ].join("\n"), + }); + try { + const { matches } = await collectGrep({ + cwd, + pattern: "Sentry\\.init", + }); + expect(matches).toHaveLength(1); + expect(matches[0].lineNum).toBe(2); + } finally { + cleanup(); + } + }); + + test("case-insensitive literal prefilter finds all casings", async () => { + const { cwd, cleanup } = makeSandbox({ + "a.ts": ["IMPORT X FROM Y", "import y from z", "Import Z From W"].join( + "\n" + ), + }); + try { + const { matches } = await collectGrep({ + cwd, + pattern: "import.*from", + caseSensitive: false, + }); + expect(matches).toHaveLength(3); + } finally { + cleanup(); + } + }); + + test("literal prefilter correctly handles file with zero literal hits", async () => { + // Pattern `import.*from` extracts literal "import". This file + // contains no "import" substring at all — prefilter short- + // circuits without running the regex. + const { cwd, cleanup } = makeSandbox({ + "a.ts": "const x = 42;\nconst y = x + 1;\n", + }); + try { + const { matches } = await collectGrep({ + cwd, + pattern: "import.*from", + }); + expect(matches).toEqual([]); + } finally { + cleanup(); + } + }); + + test("pure literal patterns still match correctly (via file-level gate + whole-buffer)", async () => { + // `SENTRY_DSN` is a pure literal. The extractor returns it, and + // the file-level gate uses it to cheaply reject files without + // the literal. Files that contain it fall through to the whole- + // buffer matcher, which finds all match positions via the regex + // engine (V8 handles pure-literal patterns efficiently). 
+ const { cwd, cleanup } = makeSandbox({
+ "a.ts": "const SENTRY_DSN = 'abc';\nconst other = 1;\nSENTRY_DSN again",
+ });
+ try {
+ const { matches } = await collectGrep({
+ cwd,
+ pattern: "SENTRY_DSN",
+ });
+ expect(matches.map((m) => m.lineNum)).toEqual([1, 3]);
+ } finally {
+ cleanup();
+ }
+ });
+
+ test("patterns with top-level alternation go through whole-buffer", async () => {
+ // `foo|bar` has no extractable literal (extractor bails on
+ // alternation). Must use whole-buffer path.
+ const { cwd, cleanup } = makeSandbox({
+ "a.ts": "has foo\nhas bar\nhas qux",
+ });
+ try {
+ const { matches } = await collectGrep({
+ cwd,
+ pattern: "foo|bar",
+ });
+ expect(matches.map((m) => m.lineNum)).toEqual([1, 2]);
+ } finally {
+ cleanup();
+ }
+ });
+
+ test("character-class with parens + alternation finds all branches (Cursor bug #1)", async () => {
+ // Regression: `[(]foo|bar` previously extracted `foo` as a
+ // required literal, causing the prefilter to silently miss
+ // lines matching only the `bar` branch. The fix makes char
+ // classes opaque to alternation detection.
+ const { cwd, cleanup } = makeSandbox({
+ "only-bar.txt": "bar line without the paren-branch\n",
+ "only-paren.txt": "(foo line with the paren-branch\n",
+ "both.txt": "bar then (foo on different lines\n(foo again\nbar again\n",
+ });
+ try {
+ const { matches } = await collectGrep({
+ cwd,
+ pattern: "[(]foo|bar",
+ });
+ // All five lines should match: three in `both.txt`, one each
+ // in the single-branch files.
+ const lineDigest = matches.map((m) => `${m.path}:${m.lineNum}`).sort(); + expect(lineDigest).toEqual([ + "both.txt:1", + "both.txt:2", + "both.txt:3", + "only-bar.txt:1", + "only-paren.txt:1", + ]); + } finally { + cleanup(); + } + }); + + test("optional group containing class with paren matches all branches (Cursor #1 followup)", async () => { + // Regression: the group-depth tracker inside `extractInnerLiteral` + // didn't treat `[...]` as opaque, so `(ABC[)]DEF)?GHI` extracted + // `DEF` as a required literal. The prefilter then silently + // missed lines matching only the `GHI` branch (group optional). + // + // After the fix, the whole group is correctly skipped and `GHI` + // is identified as the post-group required literal, so both + // match shapes are found. + const { cwd, cleanup } = makeSandbox({ + "only-ghi.txt": "GHI line without the optional prefix\n", + "full-match.txt": "ABC)DEFGHI full match line\n", + }); + try { + const { matches } = await collectGrep({ + cwd, + pattern: "(ABC[)]DEF)?GHI", + }); + const lineDigest = matches.map((m) => `${m.path}:${m.lineNum}`).sort(); + expect(lineDigest).toEqual(["full-match.txt:1", "only-ghi.txt:1"]); + } finally { + cleanup(); + } + }); + + test("case-insensitive prefilter handles Unicode length-changing lowercase (Cursor #2)", async () => { + // Regression: `toLowerCase()` on content containing Turkish `İ` + // (U+0130) produces `i` + U+0307, making the lowercased haystack + // LONGER than the original. `haystack.indexOf(literal)` returned + // positions that didn't align with `content`, causing + // line-boundary math to point at the wrong line or miss later + // matches entirely (because `lineEnd + 1` advanced the search + // cursor past real match positions). + // + // Fix: when `toLowerCase().length !== content.length`, fall + // back to the whole-buffer regex path, which has no cross-string + // position dependency. 
+ const { cwd, cleanup } = makeSandbox({ + "turkish.ts": [ + "İİİİİİİ line 1 with Sentry.init here", + "İİİİİİİ line 2 plain", + "İİİİİİİ line 3 with Sentry.init again", + "İİİİİİİ line 4 plain", + ].join("\n"), + }); + try { + const { matches } = await collectGrep({ + cwd, + pattern: "Sentry\\.init", + caseSensitive: false, + }); + // Both Sentry.init matches on lines 1 and 3 should be found. + // Before the fix, only line 1 was found — the lineEnd from + // content got passed back to haystack's indexOf and skipped + // past line 3's match. + expect(matches.map((m) => m.lineNum)).toEqual([1, 3]); + } finally { + cleanup(); + } + }); + + test("cross-newline patterns (foo\\sbar) find matches spanning line boundaries (review followup)", async () => { + // Regression: the old prefilter did per-line verify via + // `verifyRegex.test(lineText)`, which failed to detect matches + // that span newlines (e.g., `\s` matches `\n`). After + // simplifying the prefilter to a file-level gate only, the + // regex engine handles cross-newline matches correctly. + const { cwd, cleanup } = makeSandbox({ + "same-line.txt": "foo bar on one line\n", + "split.txt": "foo\nbar\n", // \s (which includes \n) joins "foo" and "bar" + }); + try { + const { matches } = await collectGrep({ + cwd, + pattern: "foo\\sbar", + }); + const paths = matches.map((m) => m.path).sort(); + expect(paths).toEqual(["same-line.txt", "split.txt"]); + } finally { + cleanup(); + } + }); + + test("multi-char escapes like \\x41 find matches for the decoded char (review followup)", async () => { + // Regression: the old extractor advanced `\x41` by 2 chars, + // leaving `41` as a "literal" that wrongly flagged the file + // gate. Pattern `\x41foo` matches `Afoo` (the compiled regex + // decodes `\x41` to `A`), so files containing `Afoo` should + // match and files containing literal `41foo` should NOT. 
+ const { cwd, cleanup } = makeSandbox({ + "actual-match.txt": "Afoo here\n", // contains decoded `A` + foo + "literal-41foo.txt": "41foo here\n", // contains literal "41foo" text + }); + try { + const { matches } = await collectGrep({ + cwd, + pattern: "\\x41foo", + }); + expect(matches.map((m) => m.path)).toEqual(["actual-match.txt"]); + } finally { + cleanup(); + } + }); + + test("scoped inline flags that introduce top-level alternation are safely handled (Cursor #3)", async () => { + // Regression: `compilePattern` rewrites `(?i:foo|bar)baz` into + // `foo|barbaz` + the global `i` flag. The rewritten source has + // top-level alternation the original lacked. If the literal + // extractor runs on the ORIGINAL source, `hasTopLevelAlternation` + // doesn't see the `|` and extracts `"baz"` as a required literal. + // But the compiled regex matches `foo` alone — lines with just + // `foo` get silently dropped by the prefilter. + // + // Fix: extract from `regex.source` (post-compile) so the + // extractor sees what the engine will run. `hasTopLevelAlternation` + // now sees the `foo|barbaz` and correctly returns null — the + // prefilter is skipped and both branches match. 
+ const { cwd, cleanup } = makeSandbox({ + "foo-only.txt": "foo line without that other word\n", + "barbaz.txt": "contains barbaz here\n", + }); + try { + const { matches } = await collectGrep({ + cwd, + pattern: "(?i:foo|bar)baz", + }); + const paths = matches.map((m) => m.path).sort(); + expect(paths).toEqual(["barbaz.txt", "foo-only.txt"]); + } finally { + cleanup(); + } + }); +}); diff --git a/test/lib/scan/literal-extract.test.ts b/test/lib/scan/literal-extract.test.ts new file mode 100644 index 000000000..d0aef08a7 --- /dev/null +++ b/test/lib/scan/literal-extract.test.ts @@ -0,0 +1,202 @@ +import { describe, expect, test } from "bun:test"; +import { extractInnerLiteral } from "../../../src/lib/scan/literal-extract.js"; + +describe("extractInnerLiteral — basic patterns", () => { + test.each([ + // [pattern, flags, expected literal] + ["import.*from", "", "import"], // quantifier breaks run + ["function\\s+\\w+", "", "function"], // \s, \w are not literals + ["http://", "", "http://"], // no metachars at all + ["SENTRY_DSN", "", "SENTRY_DSN"], // pure literal + ["hello world", "", "hello world"], // literal with space + ["^foo$", "", "foo"], // anchors break run, leaving foo + ["^foo", "", "foo"], // leading anchor only + ["foo$", "", "foo"], // trailing anchor only + ])("extracts %p with flags %p → %p", (pattern, flags, expected) => { + expect(extractInnerLiteral(pattern, flags)).toBe(expected); + }); +}); + +describe("extractInnerLiteral — escaped metacharacters", () => { + test.each([ + ["Sentry\\.init", "", "Sentry.init"], // \. 
is literal dot + ["foo\\.bar", "", "foo.bar"], + ["a\\/b", "", "a/b"], // escaped slash + ["\\\\foo", "", "\\foo"], // escaped backslash + ["a\\(b\\)", "", "a(b)"], // escaped parens + ["a\\[b\\]", "", "a[b]"], // escaped brackets + ])("recognizes escaped literal %p → %p", (pattern, flags, expected) => { + expect(extractInnerLiteral(pattern, flags)).toBe(expected); + }); +}); + +describe("extractInnerLiteral — escape sequences that break runs", () => { + test.each([ + ["\\bfoo\\b", "foo"], // \b anchor, not literal b + ["\\w+foo\\d+", "foo"], // \w, \d are classes + ["\\tfoo\\t", "foo"], // \t is tab escape + ["\\nfoo\\n", "foo"], // \n is newline escape + ["\\sfoo\\s", "foo"], // \s is whitespace class + ])("breaks run on non-literal escape %p → %p", (pattern, expected) => { + expect(extractInnerLiteral(pattern, "")).toBe(expected); + }); +}); + +describe("extractInnerLiteral — returns null", () => { + test.each([ + [".*", ""], // no literal content + [".", ""], // single metachar + ["foo|bar", ""], // top-level alternation + ["(foo|bar)", ""], // group with alternation + ["[abc]", ""], // character class only + ["a?", ""], // quantified single char (too short after drop) + ["ab", ""], // below MIN_LITERAL_LEN (3) + [" ", ""], // all whitespace + ["!!!", ""], // all punctuation + ])("returns null for %p", (pattern, flags) => { + expect(extractInnerLiteral(pattern, flags)).toBeNull(); + }); +}); + +describe("extractInnerLiteral — case-insensitive flag", () => { + test("lowercases the extracted literal when flags include i", () => { + expect(extractInnerLiteral("SENTRY_DSN", "i")).toBe("sentry_dsn"); + expect(extractInnerLiteral("Import.*From", "i")).toBe("import"); + }); + + test("leaves the literal case-sensitive without i flag", () => { + expect(extractInnerLiteral("SENTRY_DSN", "")).toBe("SENTRY_DSN"); + expect(extractInnerLiteral("SENTRY_DSN", "gm")).toBe("SENTRY_DSN"); + }); +}); + +describe("extractInnerLiteral — quantifier handling", () => { + test("drops 
the quantified character from the run", () => { + // `abc*def` — c is quantified (may be absent), so "ab" is one + // run (length 2, below threshold) and "def" (length 3) wins. + expect(extractInnerLiteral("abc*def", "")).toBe("def"); + }); + + test("handles ? quantifier", () => { + expect(extractInnerLiteral("a?bcdef", "")).toBe("bcdef"); + }); + + test("handles + quantifier", () => { + // `abc+def` — with +, c IS required (1+ of), but the extractor + // conservatively drops the preceding char. "ab" too short, + // "def" wins. + expect(extractInnerLiteral("abc+def", "")).toBe("def"); + }); + + test("handles {n,m} quantifier", () => { + expect(extractInnerLiteral("abc{2,3}def", "")).toBe("def"); + }); + + test("longest run wins when multiple exist", () => { + // `short.*longer_run.*short` — `longer_run` wins over `short`. + expect(extractInnerLiteral("short.*longer_run.*short", "")).toBe( + "longer_run" + ); + }); +}); + +describe("extractInnerLiteral — character-class opaqueness (Cursor bug #1)", () => { + /** + * Regression: `[(]foo|bar` previously extracted `"foo"` because + * the alternation detector treated `(` inside `[...]` as an + * opening paren, which corrupted `depthParen` and hid the + * top-level `|` that followed. The prefilter then silently + * skipped lines matching only the `bar` branch. + * + * The fix: character classes are opaque to the alternation + * detector. When we see `[`, skip past the first unescaped `]` + * in one step. Also `[...]` is NOT nestable — `[[]` is a class + * containing a literal `[`. 
+ */
+ test.each([
+ ["[(]foo|bar", null], // ( inside class + top-level alternation
+ ["[)]foo|bar", null],
+ ["[[]foo|bar", null], // [ inside class (nested-looking but flat)
+ ["[|]foo|bar", null], // | inside class is literal; second | is top-level
+ ["[(|)]foo|bar", null], // class containing (, |, )
+ ])("returns null for %p (class is opaque, so the top-level | is still detected)", (pattern, expected) => {
+ expect(extractInnerLiteral(pattern, "")).toBe(expected);
+ });
+
+ test.each([
+ ["foo[(]bar", "foo"], // ( inside class, no alternation; "foo" ties "bar" at 3, first wins
+ ["foo[abc]bar", "foo"], // normal class between literals
+ ["[\\]]foo", "foo"], // escaped ] inside class
+ ["[^abc]def", "def"], // negated class
+ ["[a-z]{3,5}MARKER", "MARKER"], // class + quantifier + literal
+ ])("extracts %p → %p (no alternation present)", (pattern, expected) => {
+ expect(extractInnerLiteral(pattern, "")).toBe(expected);
+ });
+});
+
+describe("extractInnerLiteral — class-in-group opaqueness (Cursor bug #1 followup)", () => {
+ /**
+ * Sibling of the earlier char-class-opaqueness bug: the group-
+ * skipper `skipGroup` (formerly `skipToMatching`) didn't treat
+ * `[...]` as opaque, so a `)` inside a char class closed the
+ * enclosing group early. For `(ABC[)]DEF)?GHI`, the extractor
+ * thought `DEF` was outside the optional group (hence "required"),
+ * and used it as the prefilter — silently missing lines that
+ * matched only `GHI`.
+ *
+ * Fix: `skipGroup` now calls `skipCharacterClass` when it sees
+ * `[`, making classes opaque to group-depth tracking.
+ */
+ test.each([
+ // Optional group containing a literal paren inside a class —
+ // the content of the group is NOT required; only what follows is.
+ ["(ABC[)]DEF)?GHI", "GHI"], + ["(foo[)]bar)?baz", "baz"], + ["(foo[(]bar)?baz", "baz"], + ["(A[[]B)?CDE", "CDE"], + ["(A[\\)]B)?CDE", "CDE"], + ])("extracts the post-group literal when group contains a class with ( or )", (pattern, expected) => { + expect(extractInnerLiteral(pattern, "")).toBe(expected); + }); + + test.each([ + ["(foo)bar", "bar"], // regular group + literal + ["(foo)?bar", "bar"], // optional group + literal + ["(abc)?xyz", "xyz"], // sanity check + ["((foo))?bar", "bar"], // nested groups (skipGroup handles nesting) + ])("control: regular groups still skip correctly", (pattern, expected) => { + expect(extractInnerLiteral(pattern, "")).toBe(expected); + }); +}); + +describe("extractInnerLiteral — multi-char escape sequences (review followup)", () => { + /** + * Regression: `\x41` is a hex-escape encoding `A` — the literal + * is 4 source chars long (`\`, `x`, `4`, `1`). The old extractor + * always advanced escape sequences by 2, leaving `41` behind as + * "literal" text. For `\x41foo` the extractor returned `"41foo"` + * but the compiled regex matches `Afoo` — silent miss on the + * gate (file containing `Afoo` but not `41foo` got skipped). + * + * Fix: `escapeSequenceLength` correctly computes the full length + * of `\x..`, `\u....`, `\u{...}`, `\cX`, `\k`, and + * `\p{...}` / `\P{...}` sequences. The extractor now skips past + * the WHOLE sequence and breaks the run (we don't try to decode + * the escape into its character — conservative). 
+ */ + test.each([ + // Multi-char escapes — tail must NOT be extracted as literal + ["\\x41foo", "foo"], // \x41 = A; skip + foo is the literal + ["\\u0041foo", "foo"], // \u0041 = A + ["\\u{1F600}foo", "foo"], // braced unicode + ["\\cAfoo", "foo"], // control-A + ["\\kfoo", "foo"], // named backref + ["\\p{L}foo", "foo"], // Unicode property + ["\\P{L}foo", "foo"], // negated Unicode property + // Pre-escape literals still work (first-tied-longest) + ["bar\\x41foo", "bar"], // both runs length 3; first wins + ["longer_bar\\x41foo", "longer_bar"], // longer_bar > foo, longer wins + ])("correctly skips multi-char escape %p → %p", (pattern, expected) => { + expect(extractInnerLiteral(pattern, "")).toBe(expected); + }); +});