What
Intent installs guidance into AGENTS.md and relies on the agent to discover and load skills before doing work. In practice agents often skip that step unless the session is heavily prefaced with context about the library. There's currently no way to measure whether a given trigger mechanism actually fires when it should, so we're choosing mechanisms without evidence.
This issue covers a library-agnostic eval harness that scores competing discovery-trigger approaches against each other.
Scope
Eval harness (offline, deterministic, CI-able):
- A blended score per mechanism combining recall, precision, and a tunable cost weight. Recall and precision are measured by matching each mechanism's trigger surface against a labeled corpus; cost comes from a static per-mechanism model.
- Corpus schema: { prompt, needs_skill, expected_pkg }, spanning libraries with skills, libraries without skills (must stay silent), and a synthetic/unknown library (separates "the trigger fired" from "the model already knew the library").
- No library names baked into the mechanism under test.
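To make the scoring bullets concrete, here is a minimal sketch of how a corpus entry and the blended score could look. Everything beyond the three schema fields is an assumption for illustration: the `Mechanism` callable, the `score_mechanism` name, the choice of F1 minus a weighted cost as the blend, and the package names in the toy corpus are all hypothetical, not a decided design.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class CorpusEntry:
    prompt: str
    needs_skill: bool
    expected_pkg: Optional[str]  # None when needs_skill is False


# Hypothetical model of a trigger surface: given a prompt, return the package
# whose skill would be loaded, or None if the mechanism stays silent.
Mechanism = Callable[[str], Optional[str]]


def score_mechanism(mechanism: Mechanism, corpus: list[CorpusEntry],
                    static_cost: float, cost_weight: float = 0.1) -> dict:
    tp = fp = fn = 0
    for entry in corpus:
        fired = mechanism(entry.prompt)
        if entry.needs_skill and fired == entry.expected_pkg:
            tp += 1
            continue
        if entry.needs_skill:
            fn += 1  # missed, or loaded the wrong package
        if fired is not None:
            fp += 1  # fired where it should have stayed silent (or wrong package)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    # Assumed blend: F1 penalised by a tunable cost weight.
    return {"precision": precision, "recall": recall,
            "blended": f1 - cost_weight * static_cost}


def make_name_matcher(package_names: list[str]) -> Mechanism:
    # Toy mechanism: names come from data passed in, not from the mechanism itself.
    def mechanism(prompt: str) -> Optional[str]:
        return next((name for name in package_names if name in prompt), None)
    return mechanism


# One entry per corpus stratum; all package names are made up.
corpus = [
    CorpusEntry("add caching to my acme-query hooks", True, "acme-query"),  # has skills
    CorpusEntry("parse a date with plainlib", False, None),                # must stay silent
    CorpusEntry("wire up routes in zzyzx-router", True, "zzyzx-router"),   # synthetic/unknown
]

result = score_mechanism(make_name_matcher(["acme-query", "plainlib"]),
                         corpus, static_cost=1.0)
print(result)
```

In this toy run the matcher fires correctly once, fires on the no-skill library once, and misses the synthetic library, so precision and recall both come out at 0.5. Because every input is static, the same corpus always produces the same score, which is what makes the harness CI-able.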
Candidates scored: current AGENTS.md guidance (baseline), a no-trigger control, a single universal discovery skill, per-package skills, a hybrid, an injected context block, and tool-based discovery. Approaches that only manifest at runtime carry a declared static cost and are flagged for a later live-measurement pass.
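One way the candidate list could be declared is as a small registry, sketched below. The identifiers, the cost values, and which candidates are marked runtime-only are all placeholders chosen for illustration; the issue does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    static_cost: float   # declared, not measured; placeholder value and units
    runtime_only: bool   # True -> flag for the later live-measurement pass


# Placeholder registry: costs and runtime_only flags are assumptions.
CANDIDATES = [
    Candidate("agents-md-baseline", 1.0, False),
    Candidate("no-trigger-control", 0.0, False),
    Candidate("universal-discovery-skill", 1.0, False),
    Candidate("per-package-skills", 1.0, False),
    Candidate("hybrid", 1.0, False),
    Candidate("injected-context-block", 1.0, True),
    Candidate("tool-based-discovery", 1.0, True),
]


def flagged_for_live_pass(candidates: list[Candidate]) -> list[str]:
    """Names of candidates whose declared cost should be re-checked live."""
    return [c.name for c in candidates if c.runtime_only]


print(flagged_for_live_pass(CANDIDATES))
```

Keeping the declared cost next to the flag makes it clear in the harness output which scores rest on an unmeasured estimate.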
Out of scope
- Live agent-in-the-loop cost measurement (follow-up).
- Building any specific installer.
- Picking a winning mechanism — that's the harness's output, not pre-decided here.
Where this comes from
This comes from reports that Intent skills don't reliably load into context unless the agent is told about the library explicitly. We want the trigger-mechanism choice to be driven by measured reliability rather than assumption.