Add opt-in Intent discovery eval suite by LadyBluenotes · Pull Request #173 · TanStack/intent

LadyBluenotes · 2026-06-20T22:52:18Z

Adds an opt-in eval suite for measuring whether Copilot discovers and invokes Intent during normal coding tasks.

This PR introduces:

- A dedicated `eval:intent-discovery` command surface, separate from normal test/build/CI gates.
- Saved-transcript calibration cases for strict success, reference-only behavior, and wrong-skill loading.
- Controlled Router, Start, and Table v9 fixtures.
- A live Copilot CLI adapter for running real Copilot tasks in isolated fixture workspaces.
- A live condition matrix covering:
  - no Intent setup
  - current install-style guidance
  - mapped Intent guidance
  - explicit Intent control
- Deterministic scoring for:
  - strict Intent invocation
  - correct skill loaded
  - autonomous discovery success
  - reference-only false positives
- Repeated live-run support through `INTENT_DISCOVERY_RUN_COUNT`.
- JSON/report UI integration via `vitest-evals`.
- A deterministic summary generator.
- An optional OpenAI-backed output-quality judge that does not affect hard scores.

The suite is intentionally opt-in and report-oriented. It does not run as part of package correctness, release readiness, or normal CI.

Summary by CodeRabbit

New Features
- Added an opt-in Intent Discovery evaluation suite with saved-transcript grading and a live Copilot harness.
- Included ready-to-run fixtures (TanStack Router, TanStack Start, Table v9) plus automatic JSON report generation, markdown summaries, and optional LLM-based judging.
Documentation
- Expanded the Intent Discovery README with commands, environment variables, and scoring/report semantics (including live-run condition behavior).
Chores / Tests
- Updated tooling and eval configuration (Vitest eval setup, lint targeting, workspace checks) and improved ignore rules to keep eval artifacts out of version control.

coderabbitai · 2026-06-20T22:52:33Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 298b710e-24aa-48ef-9853-4289c2e82b9b

📥 Commits

Reviewing files that changed from the base of the PR and between 7bb199b and 50724d5.

📒 Files selected for processing (7)

evals/intent-discovery/corpus/conditions.ts
evals/intent-discovery/corpus/tasks.ts
evals/intent-discovery/fixture-corpus.eval.ts
evals/intent-discovery/graders/reference-only.ts
evals/intent-discovery/harness-capture.eval.ts
evals/intent-discovery/harness/prepare-fixture.ts
knip.json

📝 Walkthrough

Adds a new evals/intent-discovery evaluation suite that measures whether Copilot autonomously discovers and invokes Intent skill surfaces. The suite includes a corpus of task/condition/fixture definitions, three fixture workspaces, saved-transcript and live-Copilot harnesses, multiple graders, Vitest eval test suites, and CLI reporting/LLM-judging scripts.

Changes

Intent Discovery Eval Suite

Layer / File(s)	Summary
Project setup: config, tooling, and scripts `.gitignore`, `eslint.config.js`, `evals/intent-discovery/tsconfig.json`, `evals/intent-discovery/vitest.evals.config.ts`, `evals/intent-discovery/README.md`, `package.json`, `knip.json`	Ignores eval output artifacts, enables type-aware ESLint, adds TypeScript and Vitest configs scoped to the new eval directory, registers `eval:intent-discovery*` npm scripts, adds `vitest-evals` as a dev dependency, and documents the full eval suite including environment variables, live runner conditions, and scoring semantics.
Corpus: task types, conditions, fixtures, and skill mappings `evals/intent-discovery/corpus/tasks.ts`, `evals/intent-discovery/corpus/conditions.ts`, `evals/intent-discovery/corpus/fixtures.ts`, `evals/intent-discovery/corpus/skill-uses.ts`, `evals/intent-discovery/corpus/live-tasks.ts`	Defines the full corpus schema: task types with id/fixture/condition/explicitness/prompt/expected fields, failure-class taxonomy, expected skill areas, condition registry with `countsTowardAutonomousScore` flags, fixture registry types, skill-use/package-name allowlist mappings, and four live router intent-discovery tasks.
Fixture workspaces `evals/intent-discovery/fixtures/router-basic/`, `evals/intent-discovery/fixtures/start-basic/`, `evals/intent-discovery/fixtures/table-v9-basic/`	Adds three concrete fixture workspace packages and representative source files: TanStack Router `users.$userId` file route with API loader, TanStack Start `/users` route with server-side `getUsers` function, and TanStack Table v9 `UserTable` component with sorting state and render logic.
Saved transcript fixtures `evals/intent-discovery/fixtures/saved-transcripts.ts`	Defines transcript-augmented task cases with transcript-specific fields (final answer, normalized messages, tool calls, invoked commands, loaded skills, agent errors) and internal helpers for task lookup and merging, used for deterministic grading runs.
Harness utilities: fixture prep, condition setup, intent command parsing `evals/intent-discovery/harness/prepare-fixture.ts`, `evals/intent-discovery/harness/setup-intent-condition.ts`, `evals/intent-discovery/harness/parse-intent-commands.ts`	Implements fixture workspace copy with temp directory management and cleanup, intent condition application (updates `package.json` allowlist, writes `AGENTS.md` guidance, scaffolds local skill packages), and full intent command parsing from tool calls and tool-message content with normalized command records and loaded-skill deduplication.
Graders `evals/intent-discovery/graders/skill-areas.ts`, `evals/intent-discovery/graders/strict-invocation.ts`, `evals/intent-discovery/graders/correct-skill-loaded.ts`, `evals/intent-discovery/graders/reference-only.ts`, `evals/intent-discovery/graders/failure-classifier.ts`, `evals/intent-discovery/graders/eval-metadata.ts`	Adds skill-area regex matchers for router/start/table-v9 keywords, strict intent invocation detector from parsed commands, correct-skill-loaded checker across expected areas, reference-only false-positive detection via message transcript matching, failure classification decision tree (harness-error / strict-success / wrong-skill / command-attempted-failed / reference-only / no-discovery-attempt), and eval metadata attachment helpers (`score`, `attachEvalMetadata`).
Harnesses: saved-transcript and live-Copilot runners `evals/intent-discovery/harness/saved-transcript-harness.ts`, `evals/intent-discovery/harness/run-copilot-task.ts`, `evals/intent-discovery/harness/live-copilot-harness.ts`, `evals/intent-discovery/bin/copilot-cli-adapter.mjs`	Implements `savedTranscriptHarness` with deterministic `runId`, artifact population, and tool-call message shaping; `runCopilotTask` that spawns configured command with task env, parses transcript into tool-call records, persists transcript, and collects workspace diff; `liveCopilotHarness` that wraps execution with workspace prep, condition setup, artifact recording, error normalization including `LiveCopilotRunnerUnavailableError`, and cleanup; and the CLI adapter that builds Copilot prompts from task metadata and invokes the command with tool/transcript flags.
Vitest eval test suites `evals/intent-discovery/intent-discovery.eval.ts`, `evals/intent-discovery/live-copilot-harness.eval.ts`, `evals/intent-discovery/harness-capture.eval.ts`, `evals/intent-discovery/condition-setup.eval.ts`, `evals/intent-discovery/fixture-corpus.eval.ts`	Adds five eval suites: saved-transcript tests grading each case with full grader assertions and autonomous scoring gates; live harness tests covering unsupported-runner/fake-runner/live-run paths with artifact and status assertions; intent command parsing and tool-call record unit tests; workspace preparation isolation and source-mutation-safety tests; condition setup file-write behavioral tests; and corpus/fixture integrity checks.
Reporting and LLM judge scripts `evals/intent-discovery/bin/summarize-results.mjs`, `evals/intent-discovery/bin/llm-judge.mjs`	`summarize-results.mjs` flattens test suites into eval cases, computes per-condition metrics (counts and pass rates), failure-class frequencies, and repeated-run pass@k / pass^k aggregates, then writes `summary.json` and `summary.md`. `llm-judge.mjs` calls the OpenAI Chat Completions API with strict JSON output format per case and writes `llm-judge.json` with judgment results and deterministic fallback scores.
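
The repeated-run aggregates mentioned for `summarize-results.mjs` can be illustrated with a minimal sketch. It assumes that pass@k means "at least one of the k repeated runs passed" and pass^k means "all k repeated runs passed"; the script's actual definitions are not shown in this review, and the function names below are hypothetical.

```typescript
// Hypothetical shape: one boolean per repeated run of the same task/condition.
type RunOutcomes = ReadonlyArray<boolean>

// pass@k (as assumed here): at least one of the k runs passed.
function passAtK(outcomes: RunOutcomes): boolean {
  return outcomes.some((passed) => passed)
}

// pass^k (as assumed here): every one of the k runs passed.
// An empty list counts as a failure rather than a vacuous success.
function passHatK(outcomes: RunOutcomes): boolean {
  return outcomes.length > 0 && outcomes.every((passed) => passed)
}

// Fraction of runs that passed; returns 0 for an empty list to avoid NaN.
function passRate(outcomes: RunOutcomes): number {
  if (outcomes.length === 0) return 0
  return outcomes.filter((passed) => passed).length / outcomes.length
}
```

Under these assumptions, a flaky task that passes once in three runs satisfies pass@k but not pass^k, which is why reporting both can help separate "can discover Intent" from "reliably discovers Intent".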

Sequence Diagram(s)

sequenceDiagram
  participant EvalTest
  participant liveCopilotHarness
  participant prepareFixtureWorkspace
  participant applyIntentCondition
  participant runCopilotTask
  participant copilot-cli-adapter

  rect rgba(100, 149, 237, 0.5)
    note over EvalTest,liveCopilotHarness: Live Copilot Eval Run
    EvalTest->>liveCopilotHarness: run(task, context)
    liveCopilotHarness->>prepareFixtureWorkspace: fixture id
    prepareFixtureWorkspace-->>liveCopilotHarness: workspacePath + cleanup()
    liveCopilotHarness->>applyIntentCondition: condition + expectedSkillAreas + workspacePath
    applyIntentCondition-->>liveCopilotHarness: AppliedIntentCondition (filesWritten)
    liveCopilotHarness->>runCopilotTask: RunCopilotTaskInput
    runCopilotTask->>copilot-cli-adapter: spawn with task env vars
    copilot-cli-adapter-->>runCopilotTask: stdout transcript + TRANSCRIPT_PATH
    runCopilotTask->>runCopilotTask: parseIntentCommand lines → tool call records
    runCopilotTask->>runCopilotTask: write transcript to runs/latest/transcripts/
    runCopilotTask->>runCopilotTask: collectFileDiff (source vs workspace)
    runCopilotTask-->>liveCopilotHarness: CopilotTaskRun (messages, toolCalls, loadedSkills, diff)
    liveCopilotHarness-->>EvalTest: HarnessRun with artifacts + traces
  end

  rect rgba(152, 251, 152, 0.5)
    note over EvalTest,liveCopilotHarness: Grading
    EvalTest->>EvalTest: strictIntentInvocation(run)
    EvalTest->>EvalTest: correctSkillLoaded(run, expectedSkillAreas)
    EvalTest->>EvalTest: classifyFailure(run, expectedSkillAreas)
    EvalTest->>EvalTest: attachEvalMetadata(task, scores, run)
  end
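
The walkthrough lists the failure classes as harness-error / strict-success / wrong-skill / command-attempted-failed / reference-only / no-discovery-attempt. The sketch below shows one way such a decision tree could be ordered. It assumes the checks run in the listed order and works on hypothetical pre-computed booleans; it is not the actual `classifyFailure` implementation, which takes a run and the expected skill areas.

```typescript
type FailureClass =
  | 'harness-error'
  | 'strict-success'
  | 'wrong-skill'
  | 'command-attempted-failed'
  | 'reference-only'
  | 'no-discovery-attempt'

// Hypothetical signals; the real graders derive their inputs from a HarnessRun.
interface RunSignals {
  harnessErrored: boolean
  intentSucceeded: boolean // an Intent command ran successfully
  intentAttempted: boolean // an Intent command was attempted, successful or not
  correctSkillLoaded: boolean
  anySkillLoaded: boolean
  mentionedSkillOnly: boolean // skill referenced in messages without invocation
}

// First matching branch wins, so earlier classes take precedence.
function classifyRun(s: RunSignals): FailureClass {
  if (s.harnessErrored) return 'harness-error'
  if (s.intentSucceeded && s.correctSkillLoaded) return 'strict-success'
  if (s.anySkillLoaded && !s.correctSkillLoaded) return 'wrong-skill'
  if (s.intentAttempted) return 'command-attempted-failed'
  if (s.mentionedSkillOnly) return 'reference-only'
  return 'no-discovery-attempt'
}
```

Checking for harness errors first keeps infrastructure failures from being counted against the agent's discovery behavior.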

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related issues

Eval harness for skill-discovery trigger reliability #167: Implements the complete "Intent discovery" evaluation harness infrastructure that measures skill-discovery trigger reliability—provides the framework, graders, and fixtures needed to evaluate competing trigger mechanisms for autonomous Intent skill discovery.

Suggested reviewers

KevinVandy

Poem

🐇 Hop, hop! A new eval has appeared,
With harnesses, graders, and fixtures so clear.
The router loads skills, the table checks twice,
Pass@k and pass^k — oh, isn't that nice!
The rabbit now scores what Copilot discovers,
No intent left hidden, no skill under covers! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Add opt-in Intent discovery eval suite' is clear, concise, and accurately summarizes the main change: introducing a new optional evaluation suite for Intent discovery. It directly reflects the primary objective of the pull request.
Description check	✅ Passed	The PR description thoroughly covers the changes, motivation, and scope. It clearly describes what is being added (eval suite, fixtures, harnesses, scoring, etc.), explains it is opt-in and report-oriented, and explicitly states it does not affect normal testing/CI. However, the description does not follow the repository template structure with the 🎯 Changes and ✅ Checklist sections.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 9

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

🟡 Minor comments (16)

evals/intent-discovery/harness/prepare-fixture.ts-47-51 (1)

47-51: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make the copy filter path-separator safe.

Line 50 uses a hardcoded `/` path fragment. On Windows this won't match `\`, so artifacts under `runs/` may be copied into workspaces.

Suggested diff

-import { basename, dirname, join } from 'node:path'
+import { basename, dirname, join, sep } from 'node:path'
@@
   cpSync(sourcePath, workspacePath, {
     recursive: true,
     verbatimSymlinks: true,
-    filter: (source) => !source.includes(`${fixturesDir}/runs/`),
+    filter: (source) =>
+      !source.includes(`${fixturesDir}${sep}runs${sep}`),
   })

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/intent-discovery/harness/prepare-fixture.ts` around lines 47 - 51, The
filter function in the cpSync call uses a hardcoded forward slash in the path
pattern `${fixturesDir}/runs/`, which is not cross-platform compatible and will
fail to match paths on Windows that use backslashes instead. Update the filter
logic in the cpSync function to use a path-separator-safe approach, such as
using path.sep from the path module or normalizing the path so that the pattern
correctly excludes the runs directory on both Windows and Unix systems
regardless of the operating system's path separator.
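
As an alternative to interpolating `sep`, the exclusion can be expressed through relative path segments. This is a minimal sketch, not the code in `prepare-fixture.ts`: `makeRunsFilter` and `PathLike` are hypothetical names, and the path implementation is injectable only so that both separator styles can be exercised on any OS.

```typescript
import * as path from 'node:path'

// Hypothetical minimal interface covering the two members used below.
interface PathLike {
  sep: string
  relative(from: string, to: string): string
}

// Returns a cpSync-style filter that skips `<fixturesDir>/runs` and everything under it.
function makeRunsFilter(
  fixturesDir: string,
  pathImpl: PathLike = path,
): (source: string) => boolean {
  return (source) => {
    // The first segment of the path relative to fixturesDir names the top-level entry.
    const relative = pathImpl.relative(fixturesDir, source)
    const [firstSegment] = relative.split(pathImpl.sep)
    return firstSegment !== 'runs'
  }
}

// Usage: the same logic handles Windows and POSIX separators.
const winFilter = makeRunsFilter('C:\\repo\\fixtures', path.win32)
const posixFilter = makeRunsFilter('/repo/fixtures', path.posix)
```

Unlike the substring check, this version also rejects the `runs` directory entry itself, so a recursive copy would skip it without visiting its children.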

evals/intent-discovery/harness/setup-intent-condition.ts-7-13 (1)

7-13: ⚠️ Potential issue | 🟡 Minor

Reorder imports to satisfy import/order.

The type-only imports (lines 7–9) should come after the regular sibling imports (lines 10–13), as the import/order rule places the type group last.

Suggested diff

 import {
   buildIntentSkillGuidanceBlock,
   buildIntentSkillsBlock,
 } from '../../../packages/intent/src/commands/install-writer.js'
+import {
+  expectedSkillUseByArea,
+  packageAllowlistByArea,
+} from '../corpus/skill-uses'
 import type { IntentDiscoveryCondition } from '../corpus/conditions'
 import type { ExpectedSkillArea } from '../corpus/tasks'
 import type { ScanResult } from '../../../packages/intent/src/types.js'
-import {
-  expectedSkillUseByArea,
-  packageAllowlistByArea,
-} from '../corpus/skill-uses'

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/intent-discovery/harness/setup-intent-condition.ts` around lines 7 -
13, The imports violate the `import/order` rule which requires type-only imports
to come after regular sibling imports. Reorder the import statements so that the
regular imports from '../corpus/skill-uses' containing expectedSkillUseByArea
and packageAllowlistByArea appear first, followed by the type-only imports (type
{ IntentDiscoveryCondition }, type { ExpectedSkillArea }, and type { ScanResult
}).

Source: Linters/SAST tools
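
For context, the "type group last" behavior described in these import-order comments matches an `eslint-plugin-import` configuration along the following lines. This is an assumed fragment for illustration only; the repository's real `eslint.config.js` is not shown in this review, and the constant names are hypothetical.

```typescript
// Hypothetical group order: value imports by origin first, type-only imports last.
const importGroupOrder: string[] = [
  'builtin',
  'external',
  'internal',
  'parent',
  'sibling',
  'index',
  'type',
]

// Hypothetical rules fragment using rule names from eslint-plugin-import.
const importOrderRules = {
  'import/order': ['error', { groups: importGroupOrder }],
  // Prefer `import type { X }` over inline `import { type X }` specifiers.
  'import/consistent-type-specifier-style': ['error', 'prefer-top-level'],
}
```

With a grouping like this, a value import from `../corpus/skill-uses` sorts ahead of any `import type` statement regardless of the module it comes from.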

evals/intent-discovery/fixtures/saved-transcripts.ts-1-4 (1)

1-4: ⚠️ Potential issue | 🟡 Minor

Fix import order to satisfy lint.

Line 3 violates the configured import/order rule. The parent value import must come before type imports.

Suggested diff

-import type { NormalizedMessage, ToolCallRecord } from 'vitest-evals'
-import type { IntentDiscoveryTask } from '../corpus/tasks'
 import { tasks } from '../corpus/tasks'
+import type { NormalizedMessage, ToolCallRecord } from 'vitest-evals'
+import type { IntentDiscoveryTask } from '../corpus/tasks'

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/intent-discovery/fixtures/saved-transcripts.ts` around lines 1 - 4, The
import statements in the saved-transcripts.ts file violate the import/order
linting rule. Currently, type imports from 'vitest-evals' and '../corpus/tasks'
appear before the value import from '../corpus/tasks'. Reorder the imports so
that the value import `import { tasks } from '../corpus/tasks'` comes before the
type imports `import type { NormalizedMessage, ToolCallRecord } from
'vitest-evals'` and `import type { IntentDiscoveryTask } from
'../corpus/tasks'`. This ensures parent value imports are placed before type
imports as required by the lint configuration.

Source: Linters/SAST tools

evals/intent-discovery/graders/correct-skill-loaded.ts-1-4 (1)

1-4: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Resolve import-order lint errors in this module.

The current import order violates import/order and can fail lint gates.

Suggested patch

-import type { HarnessRun } from 'vitest-evals'
-import type { ExpectedSkillArea } from '../corpus/tasks'
 import { loadedSkillUsesFromRun } from '../harness/parse-intent-commands'
 import { listIncludesExpectedSkillArea } from './skill-areas'
+import type { HarnessRun } from 'vitest-evals'
+import type { ExpectedSkillArea } from '../corpus/tasks'

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/intent-discovery/graders/correct-skill-loaded.ts` around lines 1 - 4,
The imports in this module violate the import/order linting rule. Reorganize the
four import statements so that the regular imports from
../harness/parse-intent-commands and ./skill-areas come first (with .. paths
before . paths), followed by the type imports from vitest-evals and
../corpus/tasks. This matches the configured rule, which places the type group
last.

Source: Linters/SAST tools

evals/intent-discovery/graders/failure-classifier.ts-1-8 (1)

1-8: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Adjust import order to match lint configuration.

Type imports on lines 1-5 should be moved after value imports to satisfy the configured import/order rule.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/intent-discovery/graders/failure-classifier.ts` around lines 1 - 8, The
import statements violate the configured import/order rule because type imports
are positioned before value imports. Reorder the imports in the
failure-classifier.ts file so that the value imports from
'./correct-skill-loaded', './reference-only', and './strict-invocation' appear
first, followed by the type imports from 'vitest-evals' and '../corpus/tasks'
(the imports of HarnessRun, ExpectedSkillArea, and IntentDiscoveryFailureClass).

Source: Linters/SAST tools

evals/intent-discovery/graders/strict-invocation.ts-1-2 (1)

1-2: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix import ordering to satisfy configured ESLint rules.

Line 2 should be ordered before the type-only vitest-evals import per the active import/order rule.

Proposed fix

-import type { HarnessRun } from 'vitest-evals'
 import { intentCommandsFromRun } from '../harness/parse-intent-commands'
+import type { HarnessRun } from 'vitest-evals'

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/intent-discovery/graders/strict-invocation.ts` around lines 1 - 2, The
import statement order does not comply with the configured ESLint import/order
rule. Reorder the imports at the top of the strict-invocation.ts file so that
the regular import from '../harness/parse-intent-commands' (the
intentCommandsFromRun import) comes before the type-only import from
'vitest-evals' (the HarnessRun type import).

Source: Linters/SAST tools

evals/intent-discovery/graders/eval-metadata.ts-36-37 (1)

36-37: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Guard against empty score arrays when computing avgScore.

If scores.length is 0, Line 37 yields NaN, which can pollute downstream report data.

Proposed fix

-  const avgScore =
-    scores.reduce((total, item) => total + (item.score ?? 0), 0) / scores.length
+  const avgScore =
+    scores.length === 0
+      ? 0
+      : scores.reduce((total, item) => total + (item.score ?? 0), 0) /
+        scores.length

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/intent-discovery/graders/eval-metadata.ts` around lines 36 - 37, The
avgScore calculation divides by scores.length without checking if the array is
empty, which results in NaN when scores.length is 0. Add a guard condition
before computing avgScore to check if the scores array is empty and return an
appropriate default value (such as 0) in that case, otherwise proceed with the
existing reduce operation and division.
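
The same guard can be written as a small standalone helper. This is a minimal sketch: `ScoreLike` and `averageScore` are hypothetical names, and returning 0 for an empty list is one possible default rather than a confirmed project convention.

```typescript
// Hypothetical minimal shape; the real score objects come from vitest-evals.
interface ScoreLike {
  score?: number | null
}

// Averages scores, treating missing scores as 0 and an empty list as 0 (never NaN).
function averageScore(scores: ReadonlyArray<ScoreLike>): number {
  if (scores.length === 0) return 0
  const total = scores.reduce((sum, item) => sum + (item.score ?? 0), 0)
  return total / scores.length
}
```

An early return keeps the division in one place and avoids a nested ternary around the `reduce` call.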

evals/intent-discovery/graders/reference-only.ts-1-4 (1)

1-4: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reorder imports to clear import/order errors.

Line 3 and Line 4 value imports should be placed before type-only imports from vitest-evals.

Proposed fix

-import type { HarnessRun } from 'vitest-evals'
-import type { ExpectedSkillArea } from '../corpus/tasks'
 import { jsonToSearchableText, textMatchesSkillArea } from './skill-areas'
 import { strictIntentInvocation } from './strict-invocation'
+import type { HarnessRun } from 'vitest-evals'
+import type { ExpectedSkillArea } from '../corpus/tasks'

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/intent-discovery/graders/reference-only.ts` around lines 1 - 4, The
import statements in this file violate the import/order rule. Reorder the
imports so that all value imports (the ones without the 'type' keyword) are
placed before type-only imports. Specifically, move the value imports from
'./skill-areas' and './strict-invocation' to appear before the type imports from
'vitest-evals' and '../corpus/tasks'.

Source: Linters/SAST tools

evals/intent-discovery/graders/eval-metadata.ts-1-2 (1)

1-2: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix import sort/order to pass configured ESLint rules.

Lines 1-2 currently violate sort-imports and import/order.

Proposed fix

-import type { HarnessRun, JudgeResult, JsonValue } from 'vitest-evals'
 import { toolCalls } from 'vitest-evals'
+import type { HarnessRun, JsonValue, JudgeResult } from 'vitest-evals'

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/intent-discovery/graders/eval-metadata.ts` around lines 1 - 2, The
imports from 'vitest-evals' on lines 1-2 are not in the correct order according
to the configured ESLint rules for sort-imports and import/order. Reorganize the
import statements so that type imports are properly separated from value imports
and the imported names are in alphabetical order. Consider combining or
reordering the imports from the same module ('vitest-evals') to ensure toolCalls
and the type imports from HarnessRun, JudgeResult, JsonValue follow the ESLint
import/order configuration.

Source: Linters/SAST tools

evals/intent-discovery/harness/saved-transcript-harness.ts-94-99 (1)

94-99: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Avoid duplicating tool calls when assistant messages already include them.

Line 98 blindly concatenates message.toolCalls with toolCalls. If saved cases include both, this double-counts the same commands and can skew downstream grading signals.

Suggested fix

   return messages.map((message, index) =>
     index === firstAssistantIndex
       ? {
           ...message,
-          toolCalls: [...(message.toolCalls ?? []), ...toolCalls],
+          toolCalls: (() => {
+            const existing = message.toolCalls ?? []
+            const seen = new Set(
+              existing.map(
+                (call) => `${call.name}:${JSON.stringify(call.arguments ?? {})}`,
+              ),
+            )
+            return [
+              ...existing,
+              ...toolCalls.filter((call) => {
+                const key = `${call.name}:${JSON.stringify(call.arguments ?? {})}`
+                if (seen.has(key)) return false
+                seen.add(key)
+                return true
+              }),
+            ]
+          })(),
         }
       : message,
   )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/intent-discovery/harness/saved-transcript-harness.ts` around lines 94 -
99, The toolCalls concatenation in the map function at index ===
firstAssistantIndex is duplicating tool calls without checking if they already
exist in message.toolCalls. Instead of blindly spreading both arrays together,
filter the toolCalls array to exclude any tool calls that already exist in
message.toolCalls before concatenating. This prevents double-counting the same
commands in the merged tool calls array.
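
The deduplication could also be factored into a named helper instead of an inline function expression. The sketch below uses hypothetical names (`ToolCallLike`, `toolCallKey`, `mergeToolCalls`) and assumes two tool calls are the same when their name and JSON-serialized arguments match.

```typescript
// Hypothetical minimal shape; the real ToolCallRecord type comes from vitest-evals.
interface ToolCallLike {
  name: string
  arguments?: unknown
}

// Identity key: tool name plus serialized arguments.
// JSON.stringify is sensitive to object key order, so this is a best-effort match.
function toolCallKey(call: ToolCallLike): string {
  return `${call.name}:${JSON.stringify(call.arguments ?? {})}`
}

// Appends only those incoming calls whose key has not been seen yet.
function mergeToolCalls<T extends ToolCallLike>(
  existing: ReadonlyArray<T>,
  incoming: ReadonlyArray<T>,
): T[] {
  const seen = new Set(existing.map((call) => toolCallKey(call)))
  const merged = [...existing]
  for (const call of incoming) {
    const key = toolCallKey(call)
    if (seen.has(key)) continue
    seen.add(key)
    merged.push(call)
  }
  return merged
}
```

The merge site would then reduce to `toolCalls: mergeToolCalls(message.toolCalls ?? [], toolCalls)`, assuming the record type is compatible with this shape.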

evals/intent-discovery/intent-discovery.eval.ts-1-14 (1)

1-14: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Resolve import lint errors in the header block.

The current import block violates import/order, sort-imports, and import/consistent-type-specifier-style, which will keep lint red.

Suggested fix

-import type { HarnessContext, HarnessRun } from 'vitest-evals'
 import { describe, expect, it } from 'vitest'
 import { failedSpans, toolCalls } from 'vitest-evals'
 import { countsTowardAutonomousScore } from './corpus/conditions'
 import { correctSkillLoaded } from './graders/correct-skill-loaded'
 import { attachEvalMetadata, score } from './graders/eval-metadata'
 import { classifyFailure } from './graders/failure-classifier'
 import { referenceOnly } from './graders/reference-only'
 import { strictIntentInvocation } from './graders/strict-invocation'
 import { savedTranscriptCases } from './fixtures/saved-transcripts'
-import {
-  savedTranscriptHarness,
-  type IntentDiscoveryOutput,
-} from './harness/saved-transcript-harness'
+import { savedTranscriptHarness } from './harness/saved-transcript-harness'
+import type { IntentDiscoveryOutput } from './harness/saved-transcript-harness'
+import type { HarnessContext, HarnessRun } from 'vitest-evals'

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/intent-discovery/intent-discovery.eval.ts` around lines 1 - 14, Reorder
and organize the imports in the header block to comply with linting rules.
Separate the type imports (those using the `type` keyword) from regular imports,
grouping all type imports together using consistent type specifier syntax.
Within each group, sort the imports alphabetically and organize them by module
source following the pattern of external packages first (like vitest and
vitest-evals), then local paths (starting with dot notation like './corpus',
'./graders', './fixtures', './harness'). Ensure the import statement starting
with `import type { HarnessContext, HarnessRun }` is grouped with other type
imports and that regular imports like those from 'vitest' are kept separate and
properly sorted.

Source: Linters/SAST tools

evals/intent-discovery/harness-capture.eval.ts-1-6 (1)

1-6: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Consolidate duplicated node:fs imports and normalize import order.

node:fs is imported twice (Lines 1–2), and the current ordering also triggers lint in this block.

Suggested fix

-import { existsSync, mkdirSync, readFileSync } from 'node:fs'
-import { mkdtempSync, rmSync } from 'node:fs'
+import {
+  existsSync,
+  mkdirSync,
+  mkdtempSync,
+  readFileSync,
+  rmSync,
+} from 'node:fs'
 import { tmpdir } from 'node:os'
 import { join } from 'node:path'
 import { describe, expect, it } from 'vitest'
-import type { ToolCallRecord } from 'vitest-evals'
 import { fixtures } from './corpus/fixtures'
 import { tasks } from './corpus/tasks'
 import {
   intentCommandsFromToolCalls,
   parseIntentCommand,
 } from './harness/parse-intent-commands'
 import { prepareFixtureWorkspace } from './harness/prepare-fixture'
+import type { ToolCallRecord } from 'vitest-evals'

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/intent-discovery/harness-capture.eval.ts` around lines 1 - 6,
Consolidate the two separate imports from 'node:fs' into a single import
statement. Combine existsSync, mkdirSync, readFileSync from the first import
with mkdtempSync and rmSync from the second import into one import statement
from 'node:fs', and ensure the combined imports follow proper alphabetical or
stylistic ordering to satisfy linting rules.

Source: Linters/SAST tools

evals/intent-discovery/live-copilot-harness.eval.ts-1-20 (1)

1-20: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Consolidate and normalize imports to clear current lint failures.

There are active import lint errors here (import/no-duplicates, import/order, sort-imports, and type-specifier style), so this file will stay lint-red until the block is normalized.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/intent-discovery/live-copilot-harness.eval.ts` around lines 1 - 20,
Consolidate the duplicate imports from node:fs by combining the existsSync and
writeFileSync imports with mkdtempSync and rmSync into a single import
statement. Then organize all imports in the file according to the project's lint
rules: group external dependencies (node: modules and npm packages) before local
imports, and place type imports separately or according to the configured import
order. Verify that the import ordering follows the patterns expected by
import/order and sort-imports lint rules to clear the lint failures in this
file.

Source: Linters/SAST tools

evals/intent-discovery/fixture-corpus.eval.ts-6-6 (1)

6-6: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Adjust import declaration style/order to satisfy ESLint.

This import currently violates configured sort-imports and import/consistent-type-specifier-style rules.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/intent-discovery/fixture-corpus.eval.ts` at line 6, The import
statement for tasks and ExpectedSkillArea violates ESLint's sort-imports and
import/consistent-type-specifier-style rules. Separate the type import from the
value import by creating two distinct import statements: one using import type
for ExpectedSkillArea and another for the tasks value import, ensuring they are
ordered correctly according to the configured ESLint rules.

Source: Linters/SAST tools

evals/intent-discovery/harness/live-copilot-harness.ts-1-10 (1)

1-10: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix import order to satisfy the configured ESLint rules.

Imports currently violate import/order.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/intent-discovery/harness/live-copilot-harness.ts` around lines 1 - 10,
The imports in the file are not ordered according to the ESLint `import/order`
rule. Reorganize the imports by grouping them in the following order: first
place external/third-party package imports (like createHarness from
vitest-evals), then type imports from relative paths, and finally regular
imports from relative paths ordered from parent directories (..) to same
directory imports (.). Make sure the type import statement for
IntentDiscoveryTask is properly separated and grouped with other type imports if
any exist.

Source: Linters/SAST tools

evals/intent-discovery/harness/run-copilot-task.ts-1-11 (1)

1-11: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Resolve ESLint import-order violations in this module.

The current import block violates the configured import/order rule and will keep lint red.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/intent-discovery/harness/run-copilot-task.ts` around lines 1 - 11, The
imports in the module are not ordered according to the configured import/order
ESLint rule. Reorganize the import statements by grouping them in the correct
order: first group all type imports together (vitest-evals and ../corpus/tasks),
then group all Node.js built-in imports (node:fs, node:path, node:url,
node:child_process), then group third-party packages, and finally group relative
local imports (./parse-intent-commands). Ensure type imports are clearly
separated from regular imports and follow the rule's grouping expectations.

Source: Linters/SAST tools

🧹 Nitpick comments (2)

evals/intent-discovery/fixtures/saved-transcripts.ts (1)

148-151: ⚡ Quick win

Narrow taskId to the task-id union for compile-time safety.

Line 149 currently accepts any string; typos are only caught at runtime. Use IntentDiscoveryTask['id'] to fail earlier.

Suggested diff

 function savedTranscript(
-  taskId: string,
+  taskId: IntentDiscoveryTask['id'],
   transcript: Omit<SavedTranscriptCase, keyof IntentDiscoveryTask>,
 ): SavedTranscriptCase {

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/intent-discovery/fixtures/saved-transcripts.ts` around lines 148 - 151,
The taskId parameter in the savedTranscript function is currently typed as a
generic string, which allows any string value and only catches type errors at
runtime. Change the taskId parameter type from string to
IntentDiscoveryTask['id'] to narrow it to the specific union of valid task IDs,
providing compile-time safety and catching typos and invalid values earlier in
the development process.

evals/intent-discovery/condition-setup.eval.ts (1)

8-90: ⚡ Quick win

Add a case for the explicit Intent-control condition.

This suite validates three setup modes, but not the explicit Intent-control path from the condition matrix. That leaves a regression gap in the scenario this PR says it supports.

Suggested test addition

 describe('Intent discovery condition setup', () => {
+  it('writes explicit Intent control guidance', () => {
+    const prepared = prepareInTemp()
+
+    try {
+      const result = applyIntentCondition({
+        condition: 'explicit-intent',
+        expectedSkillAreas: ['router'],
+        workspacePath: prepared.workspacePath,
+      })
+      const agents = readFileSync(
+        join(prepared.workspacePath, 'AGENTS.md'),
+        'utf8',
+      )
+
+      expect(result.filesWritten.length).toBeGreaterThan(0)
+      expect(agents).toContain('intent')
+      expect(agents).toContain('@tanstack/router#routing')
+    } finally {
+      prepared.cleanup()
+    }
+  })
+
   it('leaves no-intent workspaces without Intent guidance', () => {

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/intent-discovery/condition-setup.eval.ts` around lines 8 - 90, The test
suite in the Intent discovery condition setup describe block currently covers
three condition types (no-intent, current-intent, and mapped-intent) but is
missing a test case for the explicit Intent-control condition. Add a new test
case using the it() function that follows the same pattern as the existing
tests: call prepareInTemp() to set up a workspace, invoke applyIntentCondition
with the Intent-control condition and expectedSkillAreas, validate the expected
outcomes (such as files written and content assertions), and ensure cleanup is
called in the finally block. This will complete the regression test coverage for
all supported condition modes mentioned in the PR.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@evals/intent-discovery/bin/llm-judge.mjs`:
- Around line 68-95: The judgeCase function's fetch call to the OpenAI API lacks
a timeout mechanism, which could cause the entire evaluation run to hang
indefinitely if a network request stalls. Add a 30-second timeout by creating an
AbortController instance before the fetch call, passing the controller's signal
to the fetch options, and setting a timeout that aborts the controller after 30
seconds. Wrap the fetch call in a try-catch block to handle AbortError
exceptions that occur when the timeout is triggered.
- Around line 104-110: The JSON.parse(content) call lacks error handling, so a
single malformed response from the API can throw and abort the entire batch
processing loop. The fallback to '{}' only covers missing content; content that
is present but not valid JSON still makes JSON.parse throw. Wrap the
JSON.parse(content) call in a try-catch block and return an error object
consistent with the existing error-handling pattern at lines 97-101 so that
malformed responses are handled gracefully without terminating the remaining
judgments.
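
A minimal sketch of the two `llm-judge.mjs` suggestions above, written in TypeScript for illustration. The names `withTimeout` and `safeParseJson` and the `{ ok, value | error }` result shape are hypothetical and not taken from the PR; only `AbortController`, `setTimeout`, and `JSON.parse` are standard APIs.

```typescript
// Hypothetical helpers; names and result shape are assumptions, not the PR's code.

type ParseResult =
  | { ok: true; value: unknown }
  | { ok: false; error: string }

// Parse model output without letting one malformed response abort a batch.
function safeParseJson(content: string | null | undefined): ParseResult {
  try {
    return { ok: true, value: JSON.parse(content ?? '{}') }
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error)
    return { ok: false, error: `Invalid judge JSON: ${message}` }
  }
}

// Run an abortable operation and abort it once timeoutMs has elapsed.
async function withTimeout<T>(
  operation: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController()
  const timer = setTimeout(() => controller.abort(), timeoutMs)
  try {
    return await operation(controller.signal)
  } finally {
    // Always clear the timer so it cannot keep the process alive.
    clearTimeout(timer)
  }
}
```

With `fetch`, the call could look like `withTimeout((signal) => fetch(url, { ...options, signal }), 30_000)`. An aborted request rejects with an `AbortError`, which the caller can map onto the existing error object.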

In `@evals/intent-discovery/fixtures/router-basic/package.json`:
- Around line 6-8: Replace the "latest" version specifiers for the dependencies
`@tanstack/react-router`, react, and react-dom with pinned version numbers (e.g.,
specific version numbers like "18.2.0" instead of "latest"). This ensures that
the eval fixture has deterministic dependency resolution across different
installations and repeated runs, preventing non-determinism in evaluation
outcomes.

In `@evals/intent-discovery/fixtures/start-basic/package.json`:
- Around line 6-9: The package.json for the intent-discovery eval fixture uses
floating version strings "latest" for `@tanstack/react-router`, react, and
react-dom, while `@tanstack/react-start` is pinned to 1.168.26. This inconsistency
causes non-deterministic builds that can vary between runs. Replace all "latest"
version strings in the dependencies section (for `@tanstack/react-router`, react,
and react-dom) with exact pinned version numbers to ensure reproducible eval
runs across all invocations.

In `@evals/intent-discovery/fixtures/table-v9-basic/package.json`:
- Around line 7-8: The react and react-dom dependencies in the package.json file
are currently set to "latest" which causes non-deterministic behavior in the
eval fixtures as versions can change between runs. Replace the "latest" version
specifiers for both the "react" and "react-dom" dependencies with specific
pinned version numbers (e.g., a version like "18.2.0") to ensure consistent and
deterministic fixture behavior across evaluation runs.
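
The three fixture `package.json` comments above share one rule: no floating specifiers. As a rough sketch, a check along these lines could flag them. `floatingDependencies` is a hypothetical helper, not part of the PR, and it only treats the `latest` tag, `*`, and an empty string as floating; semver ranges are deliberately out of scope. The sample data mirrors the start-basic fixture as described in the comment.

```typescript
// Hypothetical check; not part of the PR. Flags specifiers that float between installs.
function floatingDependencies(
  dependencies: Readonly<Record<string, string>>,
): Array<string> {
  const floating = new Set(['latest', '*', ''])
  return Object.entries(dependencies)
    .filter(([, spec]) => floating.has(spec.trim()))
    .map(([name]) => name)
}

// Mirrors the start-basic fixture as described in the review comment.
const startBasicDependencies = {
  '@tanstack/react-router': 'latest',
  '@tanstack/react-start': '1.168.26',
  react: 'latest',
  'react-dom': 'latest',
}

const unpinned = floatingDependencies(startBasicDependencies)
```

Here `unpinned` lists the three packages still set to `latest`; a fixture test could assert that the list is empty once the versions are pinned.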

In `@evals/intent-discovery/graders/skill-areas.ts`:
- Line 7: The regex pattern `/v9/i` in the table-v9 array is too generic and
matches "v9" anywhere in text, causing false positives for unrelated content.
Remove this overly broad pattern or replace it with a more specific one that
includes context like "table" to ensure matches are specifically about TanStack
Table v9 and not other unrelated version references.
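
To illustrate the false positive, here is a small sketch comparing the broad pattern with one that requires "table" near "v9". The narrower pattern and the 40-character window are assumptions chosen for the example, not the pattern the PR uses.

```typescript
// The pattern quoted in the review comment.
const broadPattern = /v9/i

// Hypothetical replacement: "table" and "v9" within 40 characters on one line, in either order.
const narrowPattern =
  /\btable\b[^\n]{0,40}\bv9\b|\bv9\b[^\n]{0,40}\btable\b/i

// True only when the text refers to v9 in the context of a table.
function mentionsTableV9(text: string): boolean {
  return narrowPattern.test(text)
}
```

A sentence such as "Upgrade ESLint to v9" matches `broadPattern` but not `narrowPattern`, while "TanStack Table v9" matches both.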

In `@evals/intent-discovery/harness/live-copilot-harness.ts`:
- Around line 23-38: Move the applyIntentCondition and setCommonArtifacts calls,
which currently run before the existing try statement, into the try block. As
written, if either call throws, the cleanup in the finally block never executes
and the prepared workspace leaks. Leave prepareFixtureWorkspace before the try,
as in the suggested fix, so the finally block can still reference the prepared
workspace. With the setup calls inside the try block, cleanup runs whether an
error occurs during setup or during the main execution.

In `@evals/intent-discovery/harness/parse-intent-commands.ts`:
- Around line 8-15: The command string union used by parseIntentCommands (or the
related parser) is missing non-`@latest` variants for dlx and bunx commands. Add
entries without the `@latest` suffix, such as `bunx @tanstack/intent` (alongside
`bunx @tanstack/intent@latest`), `yarn dlx @tanstack/intent` (alongside
`yarn dlx @tanstack/intent@latest`), and any other dlx/bunx variant that
currently exists only in its `@latest` form. This lets the parser recognize
valid Intent invocations that rely on the default version resolution without
explicitly specifying `@latest`.
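
One way to cover both forms without listing every string by hand is a template literal type plus a prefix check. This is a sketch under assumptions: the runner list (`npx`, `pnpm dlx`, `yarn dlx`, `bunx`) and the `isIntentInvocation` helper are hypothetical, and the PR's actual union and parser may differ.

```typescript
// Assumed runner prefixes; the PR's actual union may list different ones.
const runners = ['npx', 'pnpm dlx', 'yarn dlx', 'bunx'] as const
type Runner = (typeof runners)[number]

// Accept both the bare package name and the explicit @latest tag.
const packageSpecs = ['@tanstack/intent', '@tanstack/intent@latest'] as const
type IntentInvocation = `${Runner} ${(typeof packageSpecs)[number]}`

// The type admits the bare form as well as the tagged one.
const bareExample: IntentInvocation = 'yarn dlx @tanstack/intent'

// True when the command starts with a runner followed by the Intent package.
function isIntentInvocation(command: string): boolean {
  const trimmed = command.trim()
  return runners.some((runner) =>
    packageSpecs.some((spec) => {
      const prefix = `${runner} ${spec}`
      return trimmed === prefix || trimmed.startsWith(`${prefix} `)
    }),
  )
}
```

Requiring either an exact match or a trailing space keeps similarly named packages, such as a hypothetical `@tanstack/intent-tools`, from being counted as Intent invocations.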

In `@evals/intent-discovery/harness/run-copilot-task.ts`:
- Around line 120-154: The runCommand function spawns a child process without
timeout protection, which can cause the Promise to hang indefinitely if the
process stalls. Add a configurable timeout mechanism (default 5 minutes) using
setTimeout that forcibly terminates the child process via child.kill() and
rejects the Promise if the timeout expires before the process naturally closes.
Implement a settled flag to prevent race conditions between the timeout firing
and the process exiting normally, ensuring that whichever completes first
controls the final resolution or rejection of the Promise.

---

Minor comments:
In `@evals/intent-discovery/fixture-corpus.eval.ts`:
- Line 6: The import statement for tasks and ExpectedSkillArea violates ESLint's
sort-imports and import/consistent-type-specifier-style rules. Separate the type
import from the value import by creating two distinct import statements: one
using import type for ExpectedSkillArea and another for the tasks value import,
ensuring they are ordered correctly according to the configured ESLint rules.

In `@evals/intent-discovery/fixtures/saved-transcripts.ts`:
- Around line 1-4: The import statements in the saved-transcripts.ts file
violate the import/order linting rule. Currently, type imports from
'vitest-evals' and '../corpus/tasks' appear before the value import from
'../corpus/tasks'. Reorder the imports so that the value import `import { tasks
} from '../corpus/tasks'` comes before the type imports `import type {
NormalizedMessage, ToolCallRecord } from 'vitest-evals'` and `import type {
IntentDiscoveryTask } from '../corpus/tasks'`. This ensures parent value imports
are placed before type imports as required by the lint configuration.

In `@evals/intent-discovery/graders/correct-skill-loaded.ts`:
- Around line 1-4: The imports in this module violate the import/order linting
rule. Reorganize the four import statements by grouping external/third-party
imports before relative imports, and within each group maintain alphabetical
order. The type imports from vitest-evals and ../corpus/tasks should come first,
followed by the regular imports from ../harness/parse-intent-commands and
./skill-areas. Ensure that all type imports are grouped together at the top, and
then relative imports are ordered alphabetically (with .. paths before . paths
if they exist in the same group).

In `@evals/intent-discovery/graders/eval-metadata.ts`:
- Around line 36-37: The avgScore calculation divides by scores.length without
checking if the array is empty, which results in NaN when scores.length is 0.
Add a guard condition before computing avgScore to check if the scores array is
empty and return an appropriate default value (such as 0) in that case,
otherwise proceed with the existing reduce operation and division.
- Around line 1-2: The imports from 'vitest-evals' on lines 1-2 are not in the
correct order according to the configured ESLint rules for sort-imports and
import/order. Reorganize the import statements so that type imports are properly
separated from value imports and the imported names are in alphabetical order.
Consider combining or reordering the imports from the same module
('vitest-evals') to ensure toolCalls and the type imports from HarnessRun,
JudgeResult, JsonValue follow the ESLint import/order configuration.

In `@evals/intent-discovery/graders/failure-classifier.ts`:
- Around line 1-8: The import statements violate the configured import/order
rule because type imports are positioned before value imports. Reorder the
imports in the failure-classifier.ts file so that the value imports from
'./correct-skill-loaded', './reference-only', and './strict-invocation' appear
first, followed by the type imports from 'vitest-evals' and '../corpus/tasks'
(the imports of HarnessRun, ExpectedSkillArea, and IntentDiscoveryFailureClass).

In `@evals/intent-discovery/graders/reference-only.ts`:
- Around line 1-4: The import statements in this file violate the import/order
rule. Reorder the imports so that all value imports (the ones without the 'type'
keyword) are placed before type-only imports. Specifically, move the value
imports from './skill-areas' and './strict-invocation' to appear before the type
imports from 'vitest-evals' and '../corpus/tasks'.

In `@evals/intent-discovery/graders/strict-invocation.ts`:
- Around line 1-2: The import statement order does not comply with the
configured ESLint import/order rule. Reorder the imports at the top of the
strict-invocation.ts file so that the regular import from
'../harness/parse-intent-commands' (the intentCommandsFromRun import) comes
before the type-only import from 'vitest-evals' (the HarnessRun type import).

In `@evals/intent-discovery/harness-capture.eval.ts`:
- Around line 1-6: Consolidate the two separate imports from 'node:fs' into a
single import statement. Combine existsSync, mkdirSync, readFileSync from the
first import with mkdtempSync and rmSync from the second import into one import
statement from 'node:fs', and ensure the combined imports follow proper
alphabetical or stylistic ordering to satisfy linting rules.

In `@evals/intent-discovery/harness/live-copilot-harness.ts`:
- Around line 1-10: The imports in the file are not ordered according to the
ESLint `import/order` rule. Reorganize the imports by grouping them in the
following order: first place external/third-party package imports (like
createHarness from vitest-evals), then type imports from relative paths, and
finally regular imports from relative paths ordered from parent directories (..)
to same directory imports (.). Make sure the type import statement for
IntentDiscoveryTask is properly separated and grouped with other type imports if
any exist.

In `@evals/intent-discovery/harness/prepare-fixture.ts`:
- Around line 47-51: The filter function in the cpSync call uses a hardcoded
forward slash in the path pattern `${fixturesDir}/runs/`, which is not
cross-platform compatible and will fail to match paths on Windows that use
backslashes instead. Update the filter logic in the cpSync function to use a
path-separator-safe approach, such as using path.sep from the path module or
normalizing the path so that the pattern correctly excludes the runs directory
on both Windows and Unix systems regardless of the operating system's path
separator.
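
A separator-safe version of the filter could compare path segments instead of a hardcoded string. The sketch below assumes the excluded directory is named `runs` and sits directly under `fixturesDir`, as the comment describes; `isOutsideRuns` and its injectable `pathImpl` parameter are hypothetical and exist only so both platforms can be exercised.

```typescript
import * as path from 'node:path'

// Hypothetical filter helper; returns false for the runs directory and anything inside it.
function isOutsideRuns(
  fixturesDir: string,
  source: string,
  pathImpl: Pick<path.PlatformPath, 'relative' | 'sep'> = path,
): boolean {
  // relative() yields platform-specific separators, so split on the matching sep.
  const relativePath = pathImpl.relative(fixturesDir, source)
  return relativePath.split(pathImpl.sep)[0] !== 'runs'
}
```

It could then be passed to `cpSync` as `filter: (source) => isOutsideRuns(fixturesDir, source)`. Unlike the original string match, this also skips the `runs` directory entry itself, which may or may not be wanted.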

In `@evals/intent-discovery/harness/run-copilot-task.ts`:
- Around line 1-11: The imports in the module are not ordered according to the
configured import/order ESLint rule. Reorganize the import statements by
grouping them in the correct order: first group all type imports together
(vitest-evals and ../corpus/tasks), then group all Node.js built-in imports
(node:fs, node:path, node:url, node:child_process), then group third-party
packages, and finally group relative local imports (./parse-intent-commands).
Ensure type imports are clearly separated from regular imports and follow the
rule's grouping expectations.

In `@evals/intent-discovery/harness/saved-transcript-harness.ts`:
- Around line 94-99: The toolCalls concatenation in the map function at index
=== firstAssistantIndex is duplicating tool calls without checking if they
already exist in message.toolCalls. Instead of blindly spreading both arrays
together, filter the toolCalls array to exclude any tool calls that already
exist in message.toolCalls before concatenating. This prevents double-counting
the same commands in the merged tool calls array.
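
The deduplicating merge might look like the sketch below. The `ToolCall` shape and the key function are assumptions for illustration; the real `ToolCallRecord` type from vitest-evals may identify calls differently.

```typescript
// Hypothetical record shape; vitest-evals' ToolCallRecord may differ.
type ToolCall = { name: string; arguments: string }

// Merge extra records into existing ones, skipping any whose key is already present.
function mergeUnique<T>(
  existing: ReadonlyArray<T>,
  extra: ReadonlyArray<T>,
  keyOf: (item: T) => string,
): Array<T> {
  const seen = new Set(existing.map(keyOf))
  const merged = [...existing]
  for (const item of extra) {
    const key = keyOf(item)
    if (seen.has(key)) continue
    seen.add(key)
    merged.push(item)
  }
  return merged
}

// Assumed identity: same tool name and same serialized arguments.
const toolCallKey = (call: ToolCall): string =>
  `${call.name}\u0000${call.arguments}`
```

In the harness, `mergeUnique(message.toolCalls ?? [], toolCalls, toolCallKey)` would replace the plain spread of both arrays, assuming those variable names match the surrounding code.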

In `@evals/intent-discovery/harness/setup-intent-condition.ts`:
- Around line 7-13: The imports violate the `import/order` rule which requires
type-only imports to come after regular sibling imports. Reorder the import
statements so that the regular imports from '../corpus/skill-uses' containing
expectedSkillUseByArea and packageAllowlistByArea appear first, followed by the
type-only imports (type { IntentDiscoveryCondition }, type { ExpectedSkillArea
}, and type { ScanResult }).

In `@evals/intent-discovery/intent-discovery.eval.ts`:
- Around line 1-14: Reorder and organize the imports in the header block to
comply with linting rules. Separate the type imports (those using the `type`
keyword) from regular imports, grouping all type imports together using
consistent type specifier syntax. Within each group, sort the imports
alphabetically and organize them by module source following the pattern of
external packages first (like vitest and vitest-evals), then local paths
(starting with dot notation like './corpus', './graders', './fixtures',
'./harness'). Ensure the import statement starting with `import type {
HarnessContext, HarnessRun }` is grouped with other type imports and that
regular imports like those from 'vitest' are kept separate and properly sorted.

In `@evals/intent-discovery/live-copilot-harness.eval.ts`:
- Around line 1-20: Consolidate the duplicate imports from node:fs by combining
the existsSync and writeFileSync imports with mkdtempSync and rmSync into a
single import statement. Then organize all imports in the file according to the
project's lint rules: group external dependencies (node: modules and npm
packages) before local imports, and place type imports separately or according
to the configured import order. Verify that the import ordering follows the
patterns expected by import/order and sort-imports lint rules to clear the lint
failures in this file.

---

Nitpick comments:
In `@evals/intent-discovery/condition-setup.eval.ts`:
- Around line 8-90: The test suite in the Intent discovery condition setup
describe block currently covers three condition types (no-intent,
current-intent, and mapped-intent) but is missing a test case for the explicit
Intent-control condition. Add a new test case using the it() function that
follows the same pattern as the existing tests: call prepareInTemp() to set up a
workspace, invoke applyIntentCondition with the Intent-control condition and
expectedSkillAreas, validate the expected outcomes (such as files written and
content assertions), and ensure cleanup is called in the finally block. This
will complete the regression test coverage for all supported condition modes
mentioned in the PR.

In `@evals/intent-discovery/fixtures/saved-transcripts.ts`:
- Around line 148-151: The taskId parameter in the savedTranscript function is
currently typed as a generic string, which allows any string value and only
catches type errors at runtime. Change the taskId parameter type from string to
IntentDiscoveryTask['id'] to narrow it to the specific union of valid task IDs,
providing compile-time safety and catching typos and invalid values earlier in
the development process.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 70517ebf-574c-4e5b-a40a-5b7b82a10c23

📥 Commits

Reviewing files that changed from the base of the PR and between 6826777 and 3e8f4dc.

⛔ Files ignored due to path filters (1)

pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml

📒 Files selected for processing (41)

.gitignore
V1-RFC.md
eslint.config.js
evals/intent-discovery/README.md
evals/intent-discovery/bin/copilot-cli-adapter.mjs
evals/intent-discovery/bin/llm-judge.mjs
evals/intent-discovery/bin/summarize-results.mjs
evals/intent-discovery/condition-setup.eval.ts
evals/intent-discovery/corpus/conditions.ts
evals/intent-discovery/corpus/fixtures.ts
evals/intent-discovery/corpus/live-tasks.ts
evals/intent-discovery/corpus/skill-uses.ts
evals/intent-discovery/corpus/tasks.ts
evals/intent-discovery/fixture-corpus.eval.ts
evals/intent-discovery/fixtures/router-basic/package.json
evals/intent-discovery/fixtures/router-basic/src/routes/users.$userId.tsx
evals/intent-discovery/fixtures/saved-transcripts.ts
evals/intent-discovery/fixtures/start-basic/package.json
evals/intent-discovery/fixtures/start-basic/src/routes/users.tsx
evals/intent-discovery/fixtures/table-v9-basic/package.json
evals/intent-discovery/fixtures/table-v9-basic/src/user-table.tsx
evals/intent-discovery/graders/correct-skill-loaded.ts
evals/intent-discovery/graders/eval-metadata.ts
evals/intent-discovery/graders/failure-classifier.ts
evals/intent-discovery/graders/reference-only.ts
evals/intent-discovery/graders/skill-areas.ts
evals/intent-discovery/graders/strict-invocation.ts
evals/intent-discovery/harness-capture.eval.ts
evals/intent-discovery/harness/live-copilot-harness.ts
evals/intent-discovery/harness/parse-intent-commands.ts
evals/intent-discovery/harness/prepare-fixture.ts
evals/intent-discovery/harness/run-copilot-task.ts
evals/intent-discovery/harness/saved-transcript-harness.ts
evals/intent-discovery/harness/setup-intent-condition.ts
evals/intent-discovery/intent-discovery.eval.ts
evals/intent-discovery/live-copilot-harness.eval.ts
evals/intent-discovery/runs/.gitkeep
evals/intent-discovery/runs/latest/.gitkeep
evals/intent-discovery/tsconfig.json
evals/intent-discovery/vitest.evals.config.ts
package.json

- Introduced new README.md for intent discovery eval with usage instructions.
- Added conditions and tasks for intent discovery evaluation.
- Implemented grading functions for skill loading and failure classification.
- Created harness for running saved transcripts in evaluations.
- Configured Vitest for running intent discovery tests.
- Updated package.json and pnpm-lock.yaml to include vitest-evals dependency.
- Added .gitignore entries for evaluation runs and vitest artifacts.

…mmand parsing

…y evaluation

…ure command, skill, transcript, and diff evidence

… intent discovery

…d skill package handling

… generation, update command parsing for various package managers

socket-security · 2026-06-20T23:07:33Z

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff	Package	Supply Chain Security	Vulnerability	Quality	Maintenance	License
	@tanstack/react-router@1.170.16
	@tanstack/react-start@1.168.26
	react@19.2.0
	vitest-evals@0.13.1
	react-dom@19.2.0
	@tanstack/react-table@9.0.0-beta.16


socket-security · 2026-06-20T23:07:35Z

Warning

Review the following alerts detected in dependencies.

According to your organization's Security Policy, it is recommended to resolve "Warn" alerts. Learn more about Socket for GitHub.

Action	Severity	Alert
Warn		Obfuscated code: npm `js-yaml` is 90.0% likely obfuscated. Confidence: 0.90. Location: Package overview. From: `?` → `npm/@tanstack/react-start@1.168.26` → `npm/js-yaml@4.2.0`. Next steps: Take a moment to review the security alert above. Review the linked package source code to understand the potential risk. Ensure the package is not malicious before proceeding. If you're unsure how to proceed, reach out to your security team or ask the Socket team for help at `support@socket.dev`. Suggestion: Packages should not obfuscate their code. Consider not using packages with obfuscated code. Mark the package as acceptable risk. To ignore this alert only in this pull request, reply with the comment `@SocketSecurity ignore npm/js-yaml@4.2.0`. You can also ignore all packages with `@SocketSecurity ignore-all`. To ignore an alert for all future pull requests, use Socket's Dashboard to change the triage state of this alert.
Warn		Obfuscated code: npm `seroval` is 90.0% likely obfuscated. Confidence: 0.90. Location: Package overview. From: `?` → `npm/@tanstack/react-router@1.170.16` → `npm/@tanstack/react-start@1.168.26` → `npm/seroval@1.5.4`. Next steps: Take a moment to review the security alert above. Review the linked package source code to understand the potential risk. Ensure the package is not malicious before proceeding. If you're unsure how to proceed, reach out to your security team or ask the Socket team for help at `support@socket.dev`. Suggestion: Packages should not obfuscate their code. Consider not using packages with obfuscated code. Mark the package as acceptable risk. To ignore this alert only in this pull request, reply with the comment `@SocketSecurity ignore npm/seroval@1.5.4`. You can also ignore all packages with `@SocketSecurity ignore-all`. To ignore an alert for all future pull requests, use Socket's Dashboard to change the triage state of this alert.


nx-cloud · 2026-06-20T23:07:36Z

View your CI Pipeline Execution ↗ for commit 818d211

Command	Status	Duration	Result
`nx run-many --targets=build --exclude=examples/**`	✅ Succeeded	2s	View ↗

☁️ Nx Cloud last updated this comment at 2026-06-20 23:07:35 UTC

nx-cloud · 2026-06-20T23:07:36Z

View your CI Pipeline Execution ↗ for commit 818d211

Command	Status	Duration	Result
`nx run-many --targets=build --exclude=examples/**`	✅ Succeeded	<1s	View ↗

☁️ Nx Cloud last updated this comment at 2026-06-20 23:27:23 UTC

pkg-pr-new · 2026-06-20T23:07:43Z


npm i https://pkg.pr.new/TanStack/intent/@tanstack/intent@173

commit: 50724d5

coderabbitai

Actionable comments posted: 5

♻️ Duplicate comments (2)

evals/intent-discovery/harness/run-copilot-task.ts (1)

120-154: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add timeout guards for spawned processes to prevent hung eval runs.

Both subprocess paths can block forever if the child never exits, which can wedge the suite.

Suggested fix

+const COMMAND_TIMEOUT_MS = Number(
+  process.env.INTENT_DISCOVERY_COMMAND_TIMEOUT_MS ?? '300000',
+)
+
 async function runCommand({
   command,
   input,
 }: {
   command: string
   input: RunCopilotTaskInput
 }): Promise<CommandResult> {
   return new Promise((resolve, reject) => {
+    let settled = false
     const child = spawn(command, {
@@
     })
+    const timeout = setTimeout(() => {
+      if (settled) return
+      settled = true
+      child.kill('SIGKILL')
+      reject(new Error(`Copilot command timed out after ${COMMAND_TIMEOUT_MS}ms`))
+    }, COMMAND_TIMEOUT_MS)
@@
-    child.on('error', reject)
+    child.on('error', (error) => {
+      if (settled) return
+      settled = true
+      clearTimeout(timeout)
+      reject(error)
+    })
     child.on('close', (exitCode) => {
+      if (settled) return
+      settled = true
+      clearTimeout(timeout)
       resolve({
@@
 async function runDiff(
   sourcePath: string,
   workspacePath: string,
 ): Promise<CommandResult> {
   return new Promise((resolve, reject) => {
+    let settled = false
     const child = spawn('diff', ['-ruN', sourcePath, workspacePath])
+    const timeout = setTimeout(() => {
+      if (settled) return
+      settled = true
+      child.kill('SIGKILL')
+      reject(new Error(`diff timed out after ${COMMAND_TIMEOUT_MS}ms`))
+    }, COMMAND_TIMEOUT_MS)
@@
-    child.on('error', reject)
+    child.on('error', (error) => {
+      if (settled) return
+      settled = true
+      clearTimeout(timeout)
+      reject(error)
+    })
     child.on('close', (exitCode) => {
+      if (settled) return
+      settled = true
+      clearTimeout(timeout)
       resolve({

Also applies to: 242-262

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/intent-discovery/harness/run-copilot-task.ts` around lines 120 - 154,
The runCommand function spawns a child process and waits indefinitely for it to
close, which can cause eval runs to hang if the child never exits. Implement a
timeout guard by wrapping the Promise with a timeout mechanism that rejects
after a configurable duration (e.g., the 5-minute default in the suggested fix). When the timeout is triggered,
explicitly kill the child process using child.kill() to ensure proper cleanup
and prevent resource leaks. Apply this same timeout pattern to the other
subprocess path mentioned at lines 242-262 to ensure consistent behavior across
all spawned processes.

evals/intent-discovery/harness/live-copilot-harness.ts (1)

23-38: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Move setup into the try so workspace cleanup is guaranteed.

If setup throws before Line 38, Line 143 never executes and the prepared workspace can leak.

Suggested fix

   run: async ({ input, setArtifact }) => {
     const runId = `live:${input.id}`
     const prepared = prepareFixtureWorkspace({ fixture: input.fixture })
-    const appliedCondition = applyIntentCondition({
-      condition: input.condition,
-      expectedSkillAreas: input.expectedSkillAreas,
-      workspacePath: prepared.workspacePath,
-    })
-
-    setCommonArtifacts({
-      input,
-      runId,
-      setupFilesWritten: appliedCondition.filesWritten,
-      workspacePath: prepared.workspacePath,
-      setArtifact,
-    })
 
     try {
+      const appliedCondition = applyIntentCondition({
+        condition: input.condition,
+        expectedSkillAreas: input.expectedSkillAreas,
+        workspacePath: prepared.workspacePath,
+      })
+
+      setCommonArtifacts({
+        input,
+        runId,
+        setupFilesWritten: appliedCondition.filesWritten,
+        workspacePath: prepared.workspacePath,
+        setArtifact,
+      })
+
       const run = await runCopilotTask({

Also applies to: 142-144

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/intent-discovery/harness/live-copilot-harness.ts` around lines 23 - 38,
Move the applyIntentCondition and setCommonArtifacts calls into the try block so
they are protected by the cleanup logic. Currently these calls execute before
the try block starts, so if either throws, the workspace cleanup code at line
143 never executes and the prepared workspace leaks. Keep the
prepareFixtureWorkspace call before the try, as the suggested fix does, and make
the two relocated calls the first statements within the try block.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@evals/intent-discovery/fixture-corpus.eval.ts`:
- Around line 24-31: The test asserts fixture is defined on line 26, but lines
28-31 immediately dereference fixture properties (fixture.skillAreas and
fixture.id) without guarding against the case where fixture is undefined. If the
fixture lookup fails, the code will throw before reaching the intended assertion
message. Add a guard condition or early return after the first expect statement
to prevent the subsequent assertion block from executing if fixture is not
defined, ensuring the code does not attempt to access fixture.skillAreas or
fixture.id when fixture is null or undefined.

In `@evals/intent-discovery/graders/eval-metadata.ts`:
- Around line 36-37: The avgScore calculation divides by scores.length without
checking if the array is empty, which results in NaN when scores is an empty
array. Add a guard condition before calculating avgScore to check if
scores.length is greater than 0, and if not, set avgScore to a sensible default
value (such as 0). This ensures avgScore always has a defined numeric value
instead of NaN, preventing issues in downstream consumers that depend on
deterministic summary/report data.
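
A minimal sketch of the guard described above, assuming `scores` is a plain array of numbers. The helper name `averageScore` and the default of 0 are illustrative assumptions; the actual code in `eval-metadata.ts` is not shown in this review.

```typescript
// Hypothetical helper: averages grader scores without ever producing NaN.
function averageScore(scores: readonly number[]): number {
  // Dividing by scores.length when the array is empty yields 0 / 0 = NaN,
  // so return a defined default instead.
  if (scores.length === 0) {
    return 0
  }
  const total = scores.reduce((sum, score) => sum + score, 0)
  return total / scores.length
}
```

Returning 0 keeps the summary numeric, though a report that needs to tell "no scores" apart from "all scores were 0" might prefer `null` plus an explicit count.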

In `@evals/intent-discovery/graders/reference-only.ts`:
- Around line 14-16: The `referenceOnly` grader is being biased by user prompts
because it includes all messages from the session when building the
transcriptText. Filter the run.session.messages array to only include messages
from the assistant (exclude user messages) before mapping them to searchable
text. This ensures the reference-only classification is based solely on what the
assistant said, not what the user asked, preventing false positives when users
mention skill areas in their prompts.
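
One way the filtering could look, as a sketch only. The `SessionMessage` shape (a `role` and a `content` string) and the function name `buildAssistantTranscript` are assumptions for illustration; the real message type used by `run.session.messages` is not visible here.

```typescript
// Assumed message shape; the real session type may differ.
interface SessionMessage {
  role: 'user' | 'assistant' | 'system'
  content: string
}

// Builds the searchable transcript from assistant messages only, so that
// skill areas mentioned in the user's prompt cannot trigger a match.
function buildAssistantTranscript(messages: readonly SessionMessage[]): string {
  return messages
    .filter((message) => message.role === 'assistant')
    .map((message) => message.content)
    .join('\n')
    .toLowerCase()
}
```

With this filter, a prompt such as "fix the router loader" no longer counts as a reference to the router skill area unless the assistant itself mentions it.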

In `@evals/intent-discovery/harness-capture.eval.ts`:
- Around line 1-2: The file has duplicate imports from the same module node:fs
on consecutive lines. Consolidate the two import statements into a single import
that includes all the functions being imported: existsSync, mkdirSync,
readFileSync from the first import statement and mkdtempSync, rmSync from the
second import statement. Replace both lines with one combined import statement
from node:fs containing all five functions.

In `@evals/intent-discovery/live-copilot-harness.eval.ts`:
- Around line 1-2: The import statements for the node:fs module are split across
two lines, importing existsSync and writeFileSync on the first line, and
mkdtempSync and rmSync on the second line. Consolidate these into a single
import statement by combining all four functions (existsSync, writeFileSync,
mkdtempSync, rmSync) into one import from node:fs to satisfy linting rules and
maintain consistency.

---

Duplicate comments:
In `@evals/intent-discovery/harness/live-copilot-harness.ts`:
- Around line 23-38: Move the setup operations including the
prepareFixtureWorkspace call, applyIntentCondition call, and setCommonArtifacts
call into the try block so they are protected by the cleanup logic. Currently
these operations execute before the try block starts, which means if any of them
throw an error, the workspace cleanup code at line 143 will never execute and
the prepared workspace will leak. Relocate these three setup function calls to
be the first statements within the try block.

In `@evals/intent-discovery/harness/run-copilot-task.ts`:
- Around line 120-154: The runCommand function spawns a child process and waits
indefinitely for it to close, which can cause eval runs to hang if the child
never exits. Implement a timeout guard by wrapping the Promise with a timeout
mechanism that rejects after a reasonable duration (e.g., 30 seconds). When the
timeout is triggered, explicitly kill the child process using child.kill() to
ensure proper cleanup and prevent resource leaks. Apply this same timeout
pattern to the other subprocess path mentioned at lines 242-262 to ensure
consistent behavior across all spawned processes.
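
The timeout guard for spawned processes might be sketched as follows. This is not the project's `runCommand`; the function name `runCommandWithTimeout`, the `CommandResult` shape, and the stdio choices are assumptions, and the 30 second default is taken from the example duration in the comment above.

```typescript
import { spawn } from 'node:child_process'

// Hypothetical result shape; the real return type is not shown in the review.
interface CommandResult {
  exitCode: number | null
  stdout: string
}

function runCommandWithTimeout(
  command: string,
  args: string[],
  timeoutMs = 30_000,
): Promise<CommandResult> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, { stdio: ['ignore', 'pipe', 'inherit'] })
    let stdout = ''

    // If the child has not closed in time, kill it and reject the promise.
    const timer = setTimeout(() => {
      child.kill()
      reject(new Error(`${command} timed out after ${timeoutMs}ms`))
    }, timeoutMs)

    child.stdout?.on('data', (chunk: Buffer) => {
      stdout += chunk.toString()
    })

    // Spawn failures (for example, a missing binary) surface here.
    child.on('error', (error) => {
      clearTimeout(timer)
      reject(error)
    })

    // After a timeout this resolve is a no-op because the promise is settled.
    child.on('close', (exitCode) => {
      clearTimeout(timer)
      resolve({ exitCode, stdout })
    })
  })
}
```

`child.kill()` sends `SIGTERM` by default; a child that ignores it would need a follow-up `child.kill('SIGKILL')`, which this sketch leaves out.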


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b0ceebc7-8ab2-4d40-aefe-b8724c418e8e

📥 Commits

Reviewing files that changed from the base of the PR and between 77c61ee and af482ee.

⛔ Files ignored due to path filters (1)

pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml

📒 Files selected for processing (40)

.gitignore
eslint.config.js
evals/intent-discovery/README.md
evals/intent-discovery/bin/copilot-cli-adapter.mjs
evals/intent-discovery/bin/llm-judge.mjs
evals/intent-discovery/bin/summarize-results.mjs
evals/intent-discovery/condition-setup.eval.ts
evals/intent-discovery/corpus/conditions.ts
evals/intent-discovery/corpus/fixtures.ts
evals/intent-discovery/corpus/live-tasks.ts
evals/intent-discovery/corpus/skill-uses.ts
evals/intent-discovery/corpus/tasks.ts
evals/intent-discovery/fixture-corpus.eval.ts
evals/intent-discovery/fixtures/router-basic/package.json
evals/intent-discovery/fixtures/router-basic/src/routes/users.$userId.tsx
evals/intent-discovery/fixtures/saved-transcripts.ts
evals/intent-discovery/fixtures/start-basic/package.json
evals/intent-discovery/fixtures/start-basic/src/routes/users.tsx
evals/intent-discovery/fixtures/table-v9-basic/package.json
evals/intent-discovery/fixtures/table-v9-basic/src/user-table.tsx
evals/intent-discovery/graders/correct-skill-loaded.ts
evals/intent-discovery/graders/eval-metadata.ts
evals/intent-discovery/graders/failure-classifier.ts
evals/intent-discovery/graders/reference-only.ts
evals/intent-discovery/graders/skill-areas.ts
evals/intent-discovery/graders/strict-invocation.ts
evals/intent-discovery/harness-capture.eval.ts
evals/intent-discovery/harness/live-copilot-harness.ts
evals/intent-discovery/harness/parse-intent-commands.ts
evals/intent-discovery/harness/prepare-fixture.ts
evals/intent-discovery/harness/run-copilot-task.ts
evals/intent-discovery/harness/saved-transcript-harness.ts
evals/intent-discovery/harness/setup-intent-condition.ts
evals/intent-discovery/intent-discovery.eval.ts
evals/intent-discovery/live-copilot-harness.eval.ts
evals/intent-discovery/runs/.gitkeep
evals/intent-discovery/runs/latest/.gitkeep
evals/intent-discovery/tsconfig.json
evals/intent-discovery/vitest.evals.config.ts
package.json

✅ Files skipped from review due to trivial changes (8)

evals/intent-discovery/fixtures/table-v9-basic/package.json
eslint.config.js
evals/intent-discovery/corpus/live-tasks.ts
evals/intent-discovery/fixtures/start-basic/package.json
evals/intent-discovery/fixtures/router-basic/package.json
evals/intent-discovery/tsconfig.json
evals/intent-discovery/README.md
.gitignore

🚧 Files skipped from review as they are similar to previous changes (17)

evals/intent-discovery/fixtures/router-basic/src/routes/users.$userId.tsx
evals/intent-discovery/vitest.evals.config.ts
evals/intent-discovery/fixtures/table-v9-basic/src/user-table.tsx
evals/intent-discovery/corpus/fixtures.ts
evals/intent-discovery/corpus/tasks.ts
evals/intent-discovery/corpus/conditions.ts
evals/intent-discovery/harness/saved-transcript-harness.ts
evals/intent-discovery/graders/skill-areas.ts
evals/intent-discovery/fixtures/start-basic/src/routes/users.tsx
evals/intent-discovery/bin/copilot-cli-adapter.mjs
evals/intent-discovery/corpus/skill-uses.ts
package.json
evals/intent-discovery/condition-setup.eval.ts
evals/intent-discovery/harness/prepare-fixture.ts
evals/intent-discovery/bin/llm-judge.mjs
evals/intent-discovery/bin/summarize-results.mjs
evals/intent-discovery/harness/parse-intent-commands.ts

coderabbitai Bot reviewed Jun 20, 2026

LadyBluenotes added 9 commits June 20, 2026 16:02

Add intent discovery evaluation framework and related fixtures

4b548f5

Refactor intent discovery evaluation: streamline skill loading and command parsing

5c9b822

Implement live Copilot harness and error handling for intent discovery evaluation

99e6ed5

Enhance live Copilot harness: support opt-in command backend and capture command, skill, transcript, and diff evidence

6884f1d

Add live Copilot harness and evaluation metadata for intent discovery

174b452

Add live Copilot condition setup and enhance evaluation framework for intent discovery

391c3ed

Enhance intent discovery evaluation: update file writing logic and add skill package handling

d674d2b

Enhance intent discovery evaluation: add LLM judge and summary report generation, update command parsing for various package managers

818d211

LadyBluenotes force-pushed the eval/intent-discovery branch from 77c61ee to 818d211 June 20, 2026 23:06

ci: apply automated fixes

af482ee

coderabbitai Bot reviewed Jun 20, 2026

LadyBluenotes and others added 3 commits June 20, 2026 16:14

Enhance intent discovery evaluation: add request timeout for LLM judge, improve error handling, and update fixture dependencies

c1b9ca1

ci: apply automated fixes

7bb199b

Refactor intent discovery types and functions: remove exports for internal types, update fixture handling, and enhance reference-only logic

50724d5

LadyBluenotes merged commit 559dd06 into main Jun 20, 2026
8 of 9 checks passed

LadyBluenotes deleted the eval/intent-discovery branch June 20, 2026 23:28

coderabbitai Bot mentioned this pull request Jun 21, 2026

Add Intent hook enforcement for supported agents #174

Open
