feat(skillopt): deficiency-detection + edit-proposer engine (stacked on #230)#237
Conversation
Read side of the attribution the foundation writes. listSkillSessions reads the
skills_active rows (filtered on description='skills_active', deduped per session)
into per-session {skills, bucket}; cohortsForSkill partitions a skill's
treatment/control; reconstructSession rebuilds the raw transcript from the
session's captured rows. All queries injected (unit-testable); nothing touches
the model channel. First piece of the deficiency detector. 10 unit tests.
Read side keyed on actual invocations: capture.ts persists each Skill tool_use
as a tool_call row (tool_name=Skill, tool_input={skill}). listSkillInvocations
pulls org-skill (<name>--<author>) invocations; windowAroundInvocation slices the
transcript before/after the invocation turn for tight, cheap judging. Supersedes
availability-based attribution. Injected queries; 8 unit tests.
Skill *invocation* (Skill tool_use, already captured) is a more accurate signal than availability and pins the turn for windowing, so the SessionStart skills_active write + helper + the availability-based cohorts module are removed. The weekly trigger/worker stay. session-start-hook tests revert to the pre-attribution query sequence; trigger tests unchanged.
Pure, free pushback detector: a USER turn right after an ASSISTANT turn matching correction patterns = a hard, observable failure signal, independent of any LLM. Tuned for recall (false positives just cost a judge call; false negatives only under-detect — never churns a good skill). Exposes windowedTurns() from skill-invocations so the anchor reads the same windowed slice the judge will. 6 anchor tests + patterns to tune on real data.
LLM judge of the windowed slice: was the task accomplished CORRECTLY? Prompted to ignore user mood (a praised-but-wrong answer is a FAILURE) — the anti-sycophancy axis the research validated. Runs on the user's claude -p (cheap default model, all tools denied), injected for tests. Conservative: unparseable/errored/empty judgments return success=1 so a flaky judge can only under-detect, never manufacture deficiency. 7 unit tests.
Orchestrates the engine's detection step: group org-skill invocations, window each, run the free anchor, and judge ONLY anchored windows. A confirmed failure needs anchor AND judge agreement (precision). A skill is deficient at >=minInvocations AND >=failureRateThreshold confirmed-failure rate. Returns per-skill stats + deficientCount (the caller's >=5 fire gate). All injected; 2 tests assert the min-n gate and that the judge runs only on anchored windows.
Port of SkillOpt's edit-application: append/insert_after/replace/delete ops, the edit budget (textual learning rate), and the protected <!-- SLOW_UPDATE_START/END --> region (the paper's slow-update) that fast edits must not touch. Pure + deterministic; 7 tests cover each op, the budget, and the protection.
…e backend The paper's backward pass: from a deficient skill's body + its confirmed failures, the optimizer diagnoses the single recurring weakness and emits bounded structured edits (slow-update region off-limits), applied locally to a candidate body — nothing published. Extracts the shared claude -p backend (claude-model.ts) used by both judge and proposer; success-judge refactored onto it. Injected model; tolerant edit parse; 5 proposer tests.
Writes an accepted edit to the live SKILL.md via the native skills dir (never the model channel — PR #223), bumps the frontmatter version (v1-vs-v2), and backs up the prior version for one-command revert. Mechanism only: the worker won't call it on an unvalidated edit (offline gate isn't trustworthy) — reserved for the real-usage A/B gate. 6 fs tests.
runSkillOptCycle ties it together: detect deficient skills, fire only at >=fireThreshold (default 5 — act on a pattern, not noise), then for each: read the body, propose a bounded edit, write a REVIEW PROPOSAL (candidate.md + proposal.json with full evidence). Never auto-overwrites a live skill — live publish is gated on the deferred A/B. All I/O injected; 4 tests cover the gate, proposals, skip-when- not-installed, and a custom threshold.
Replaces the stub: the detached weekly worker now loads config, builds the query, and runs runSkillOptCycle (detect -> >=5 gate -> propose -> write review proposals) over a 30-day invocation lookback. Verified end-to-end against a real org: 40 Skill invocations scanned, 0 anchored -> 0 judge calls -> 0 deficient -> exits at the gate (~14s, $0). No auto-publish; proposals land under <stateDir>/skillopt/proposals.
The paper's meta-skill: an append-only JSONL recording every proposed edit per skill (with an order-independent fingerprint). The cycle feeds prior-tried edits to the proposer as context and dedups — never re-writing an edit already proposed for a skill, so the optimizer doesn't churn. Status field (proposed→applied/reverted) is ready to close the loop once the A/B gate records outcomes. Wired into the worker; meta + dedup tests.
The validation organ. A randomized A/B needs the skill version at invocation time (a capture change we lack), so the feasible gate is longitudinal: after an edit publishes, compare the skill's confirmed-failure rate after vs before — a real drop keeps it, a real rise reverts (from the SKILL.v<old> backup), unclear/too-few = inconclusive. Observational (needs margin + min sample); randomized A/B is the upgrade once invocation-version capture lands. Extracts scoreInvocations (shared by detector + gate); adds untilIso to listSkillInvocations. 7 tests.
|
Need an answer fast? Review this PR in Change Stack to ask focused questions about the PR or a changed range. Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds an end-to-end SkillOpt pipeline: invocation extraction, anchor-based judging, deficiency scoring, LLM-driven edit proposals, edit application/publishing, meta deduplication, a one-shot worker run, and removal of session-start skill-attribution writes. ChangesSkillOpt Core Implementation
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@coderabbitai review |
✅ Action performedReview finished.
|
Coverage ReportScope: files changed in this PR. Enforced threshold: 90% per metric (per file via
File Coverage — 15 files changed
Generated for commit 9a1769a. |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (11)
tests/shared/session-anchor.test.ts (1)
8-41: ⚡ Quick winConsider adding a test for evidence truncation.
Line 39 in the implementation truncates evidence to 200 chars, but there's no test verifying this behavior. This is an edge case worth validating.
🧪 Suggested test addition
it("truncates evidence to 200 chars", () => { const longPushback = "no that's wrong " + "x".repeat(300); const r = detectAnchor([a("done"), u(longPushback)]); expect(r.anchored).toBe(true); expect(r.evidence.length).toBe(200); });🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/shared/session-anchor.test.ts` around lines 8 - 41, Add a unit test to verify detectAnchor truncates evidence to 200 characters: create a test that calls detectAnchor with an assistant turn followed by a very long user pushback (e.g., "no that's wrong " + "x".repeat(300)), assert anchored is true and that r.evidence.length === 200; place this alongside the other detectAnchor tests and reference the detectAnchor helper and the u/a message builders to construct the turns.src/skillify/session-anchor.ts (2)
27-27: 💤 Low valueBENIGN pattern may over-match on "thank".
The pattern ends with
thankwhich will match "thanksgiving", "thankful", etc. Consider adding word boundary or requiring "thanks" specifically.🔧 Suggested refinement
-const BENIGN = /\b(no (problem|worries|need|biggie)|no,? thanks|all good|works? (now|great|fine|perfectly)|that works|perfect|looks good|thank)/i; +const BENIGN = /\b(no (problem|worries|need|biggie)|no,? thanks|all good|works? (now|great|fine|perfectly)|that works|perfect|looks good|thanks?)\b/i;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/skillify/session-anchor.ts` at line 27, The BENIGN regex constant over-matches because the trailing "thank" will match words like "thanksgiving" or "thankful"; update the BENIGN pattern in session-anchor.ts (symbol: BENIGN) to require a word boundary or explicit pluralization (e.g., use \bthank(s)?\b or otherwise add \b around "thank") so only standalone "thank" or "thanks" are matched; keep the existing alternation structure and flags intact.
23-23: ⚡ Quick winConsider expanding PUSHBACK patterns for better recall.
The regex is tuned for recall but could catch a few more common correction patterns:
- "that fail" should likely be "that failed" (past tense)
- Missing common patterns: "fix", "error", "bug", "doesn't help", "failed"
Since false positives only cost one judge call and the goal is high recall, consider adding these patterns.
📝 Suggested regex enhancement
-const PUSHBACK = /\b(no|nope|wrong|incorrect|not what|that'?s not|does ?n'?t work|did ?n'?t work|do ?n'?t work|wo ?n'?t work|is ?n'?t|that'?s wrong|broke|broken|still (failing|broken|not working|wrong|the same)|try again|undo|revert that|that fail)/i; +const PUSHBACK = /\b(no|nope|wrong|incorrect|not what|that'?s not|does ?n'?t work|did ?n'?t work|do ?n'?t work|wo ?n'?t work|is ?n'?t|that'?s wrong|broke|broken|failed|still (failing|broken|not working|wrong|the same)|try again|undo|revert that|fix (that|this|it)|error|bug)/i;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/skillify/session-anchor.ts` at line 23, Update the PUSHBACK regex (the PUSHBACK constant) to broaden recall: change the "that fail" token to "that failed" and add common correction/error tokens such as "failed", "fix", "fix it", "error", "bug", and the negation "doesn't help" (plus apostrophe-free variants like "doesnt help" and "doesnt") so the pattern matches those user correction phrases as well; keep the existing alternatives and case-insensitive flag intact.tests/shared/skill-invocations.test.ts (1)
64-103: ⚡ Quick winConsider adding test for invocation at session boundaries.
The tests cover invocation in the middle and invocation not found, but don't test when the invocation is at the very start (before=3 but only 0 turns before) or at the very end (after=6 but only 1 turn after). The
Math.max(0, ...)on line 152 of the implementation handles this, but explicit test coverage would validate the boundary behavior.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/shared/skill-invocations.test.ts` around lines 64 - 103, Add explicit tests for invocation-at-session-boundaries in the skill-invocations.test.ts suite: create one test where the Skill tool_call occurs at the very start (invocation index 0) and call windowAroundInvocation with before > available turns (e.g., before: 3) and assert the result only contains the available post-invocation turns; create another test where the Skill tool_call is the final turn and call with after > available turns (e.g., after: 6) and assert the result only contains the available pre-invocation turns. Use the existing helpers (mockQuery, TABLE, windowAroundInvocation) and assertions matching the style of the other tests to validate the Math.max(0, ...) boundary behavior.tests/shared/skill-edit-gate.test.ts (1)
45-51: 💤 Low valueQuery mock could validate both time bounds more precisely.
The mock distinguishes before/after windows by checking
sql.includes(\< '${PUB}'`), but doesn't verify the lower bound (>=). While this works for the test, it doesn't fully validate thatgateEdit` constructs the correct SQL time bounds.Consider adding assertions that verify both
>= beforeSince(for before window) and>= publishIso(for after window) are present in the SQL.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/shared/skill-edit-gate.test.ts` around lines 45 - 51, The query mock (vi.fn assigned to query) must validate both time bounds instead of only checking for `< '${PUB}'`; update the mock to detect the before-window SQL by asserting it contains the lower bound token (>= beforeSince) and the upper bound `< '${PUB}'` and return beforeInvs, and detect the after-window SQL by asserting it contains `>= '${PUB}'` (publishIso) and the corresponding upper bound and return afterInvs; reference the variables and symbols used in the test (query, PUB, beforeSince, publishIso, beforeInvs, afterInvs, gateEdit) so the mock fails the test if gateEdit constructs incorrect time predicates.src/skillify/skill-invocations.ts (2)
155-168: 💤 Low valueElision math may produce odd results for very small maxChars.
If
maxCharsis very small (e.g., < 100), the head/tail split could produce a degenerate result where the elision message itself is longer than the actual content shown.Consider adding a minimum threshold or simpler truncation for small limits.
🛡️ Suggested guard
export async function windowAroundInvocation( query: QueryFn, sessionsTable: string, inv: SkillInvocation, opts: { before?: number; after?: number; maxChars?: number } = {}, ): Promise<string> { const maxChars = opts.maxChars ?? 4000; const slice = await windowedTurns(query, sessionsTable, inv, opts); const joined = slice.map((t) => `${t.role}: ${t.text}`).join("\n\n"); if (joined.length <= maxChars) return joined; + if (maxChars < 100) return joined.slice(0, maxChars) + "…"; // simple truncation for tiny limits const head = joined.slice(0, Math.floor(maxChars * 0.55)); const tail = joined.slice(joined.length - Math.floor(maxChars * 0.45)); return `${head}\n\n…[${joined.length - maxChars} chars elided]…\n\n${tail}`; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/skillify/skill-invocations.ts` around lines 155 - 168, windowAroundInvocation can produce a degenerate output when maxChars is very small because the elision notice can end up longer than the visible head/tail; update windowAroundInvocation to handle small limits by (a) computing the elision message first (e.g., elideMsg = `\n\n…[${joined.length - maxChars} chars elided]…\n\n`) and ensuring head+tail+elideMsg.length <= maxChars, or (b) for very small maxChars (choose a threshold like 100–200) fall back to a simpler truncation strategy such as returning joined.slice(0, maxChars - elideMsg.length) + elideMsg; adjust the head/tail length calculations accordingly so head = Math.max(0, floor((maxChars - elideMsg.length) * 0.55)) and tail uses the remainder, referencing windowAroundInvocation, maxChars, joined, head, tail and the elision message in your changes.
73-79: ⚡ Quick winValidate
sessionsTableas an SQL identifier; elision bounds are mainly OK
sessionsTableinterpolation inlistSkillInvocationsis fed fromloadConfig().sessionsTableName(envHIVEMIND_SESSIONS_TABLE/ default) and flows through server/worker context; there’s no path from request/user input into this value, so the practical SQL-injection risk from users is low. Still, because it’s an identifier inserted into SQL, add a whitelist/validation (or identifier-escaping helper) to protect against misconfiguration/poisoned env.- The elision slicing in
windowAroundInvocationwon’t overlap head/tail (it’s based on the ends ofjoined, and this branch only runs whenjoined.length > maxChars). For very smallmaxChars, head/tail can become empty and the final string can exceedmaxCharsdue to the inserted elision marker; clampmaxCharsif it can be user-controlled.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/skillify/skill-invocations.ts` around lines 73 - 79, Validate or safely-escape the sessionsTable identifier before interpolating it in listSkillInvocations: ensure loadConfig().sessionsTableName is checked against a strict whitelist or regex for valid SQL identifiers (e.g., /^[A-Za-z_][A-Za-z0-9_]*$/) and reject or fallback if it fails, so the query using sessionsTable cannot be malformed; also review windowAroundInvocation to guard against tiny maxChars by clamping maxChars to a sensible minimum or accounting for the elision marker length when computing head/tail from joined so the resulting string never exceeds maxChars and head/tail do not produce empty slices unexpectedly.src/skillify/skillopt-worker.ts (1)
37-49: 💤 Low valueConsider using consistent timestamp across cycle.
Line 45's
recordcallback creates a newDate().toISOString()timestamp for each meta entry, while line 48 passes a separatenowvalue to the cycle. This causesMetaEntry.proposedAtandProposalRecord.createdAtto have slightly different timestamps (potentially seconds apart) for edits in the same cycle run.🔧 Suggested fix for timestamp consistency
+ const now = new Date().toISOString(); const res = await runSkillOptCycle({ query, sessionsTable: config.sessionsTableName, readSkillBody: (name, author) => readSkillBodyFromDisk(skillsRoot, name, author), writeProposal: (rec) => writeProposalToDisk(proposalsRoot, rec), meta: { prior: (n, a) => priorEditSummaries(metaCache, n, a), has: (n, a, edits) => alreadyProposed(metaCache, n, a, edits), - record: (n, a, edits) => { const e = metaEntryFor(n, a, edits, new Date().toISOString()); appendMeta(metaFile, e); metaCache.push(e); }, + record: (n, a, edits) => { const e = metaEntryFor(n, a, edits, now); appendMeta(metaFile, e); metaCache.push(e); }, }, detector: { sinceIso, limit: 5000 }, - now: new Date().toISOString(), + now, });🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/skillify/skillopt-worker.ts` around lines 37 - 49, Compute a single timestamp and reuse it for the whole cycle so MetaEntry.proposedAt and ProposalRecord.createdAt match: before calling runSkillOptCycle, create a const (e.g. nowIso = new Date().toISOString()) and pass that as the now argument to runSkillOptCycle; inside the meta.record callback (which currently calls metaEntryFor(n, a, edits, new Date().toISOString())), replace the inline new Date() call with the shared nowIso so metaEntryFor and the cycle use the identical timestamp.src/skillify/skill-edits.ts (2)
32-33: ⚡ Quick winConsider target ambiguity when multiple occurrences exist.
The
indexOfmethod finds only the first occurrence of the target string. If a SKILL.md contains duplicate text patterns and an edit intends to target a later occurrence, this will modify the wrong location. While the LLM proposer should choose sufficiently specific anchors, this limitation could cause incorrect edits in edge cases.💡 Consider adding a comment documenting this limitation
function targetsProtected(skill: string, target: string): boolean { const r = protectedRange(skill); if (!r || !target) return false; + // Note: indexOf finds the FIRST occurrence only. Proposer must choose unambiguous anchors. const idx = skill.indexOf(target); return idx !== -1 && idx >= r[0] && idx < r[1]; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/skillify/skill-edits.ts` around lines 32 - 33, The code uses skill.indexOf(target) which only returns the first occurrence and can pick the wrong match when the same target appears multiple times; update the logic in the function containing variables skill, target, idx, and r so you search for an occurrence that falls inside the range r (e.g., use indexOf(target, r[0]) and if that result >= r[1] or -1, loop with successive indexOf calls until you find an idx such that idx >= r[0] && idx < r[1] or conclude no match), and add a short inline comment noting the remaining ambiguity if multiple identical anchors exist and suggesting making anchors more specific.
60-66: 💤 Low valueEnsure proper newline spacing before protected region.
When appending with a protected region present, the code inserts
content + "\n\n"before the marker but doesn't ensure proper spacing before the content itself. If the document text ends without a newline before the protected marker, the appended content will run into it.🔧 Suggested defensive spacing
case "append": { const content = (e.content ?? "").trim(); if (!content) { report.push("SKIP append: empty content"); break; } const r = protectedRange(s); - if (r) s = s.slice(0, r[0]) + content + "\n\n" + s.slice(r[0]); + if (r) { + const before = s.slice(0, r[0]).replace(/\s*$/, ""); + s = before + "\n\n" + content + "\n\n" + s.slice(r[0]); + } else s = s.replace(/\s*$/, "") + "\n\n" + content + "\n"; ok(`append (+${content.length} chars)`); break; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/skillify/skill-edits.ts` around lines 60 - 66, In the "append" case when a protectedRange(s) exists, ensure you normalize spacing by trimming trailing whitespace from the prefix before the protected marker and inserting the same blank-line padding used in the non-protected branch: compute prefix = s.slice(0, r[0]).replace(/\s*$/, "") then set s = prefix + "\n\n" + content + "\n\n" + s.slice(r[0]); update the code inside case "append" (the protectedRange(s) branch) to use this logic so appended content never runs into the protected marker.src/skillify/skill-publisher.ts (1)
29-38: ⚡ Quick winAdd defensive handling for missing frontmatter.
If
splitFrontmatterreturns empty frontmatter (file has no frontmatter block),bumpVersionreturns empty frontmatter instead of creating a valid frontmatter block with version. The published file will then lack version tracking entirely.🛡️ Suggested defensive handling
export function bumpVersion(frontmatter: string): { frontmatter: string; oldVersion: number; newVersion: number } { + // If frontmatter is empty, create a minimal valid block + if (!frontmatter.trim()) { + return { frontmatter: "---\nversion: 2\n---\n", oldVersion: 1, newVersion: 2 }; + } const m = frontmatter.match(/^version:\s*(\d+)\s*$/m); const oldVersion = m ? parseInt(m[1], 10) : 1; const newVersion = oldVersion + 1; const next = m ? frontmatter.replace(/^version:\s*\d+\s*$/m, `version: ${newVersion}`) : frontmatter.replace(/\n---\n$/, `\nversion: ${newVersion}\n---\n`); return { frontmatter: next, oldVersion, newVersion }; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/skillify/skill-publisher.ts` around lines 29 - 38, bumpVersion doesn't handle an entirely missing/empty frontmatter string: when splitFrontmatter returns '' you should create a proper frontmatter block instead of returning an empty string. Update the bumpVersion function to detect empty or whitespace-only frontmatter, treat oldVersion as 1 and newVersion as 2, and return a frontmatter string like a full frontmatter block containing the version (e.g. '---\nversion: 2\n---\n'); keep existing behavior for when a version line exists (use the existing regex replace) and when frontmatter exists but lacks a version line (insert version before the closing '---'). Ensure you reference bumpVersion and its returned oldVersion/newVersion values so callers remain consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@node_modules`:
- Line 1: Remove the committed machine-specific node_modules symlink from the
repository and stop tracking it: unstage/untrack the tracked node_modules
symlink so it is removed from the repo history going forward, add node_modules/
to .gitignore to prevent re-adding it, and commit the removal and updated
.gitignore. Ensure the repository no longer contains the symbolic link named
"node_modules" and that ".gitignore" includes "node_modules/".
---
Nitpick comments:
In `@src/skillify/session-anchor.ts`:
- Line 27: The BENIGN regex constant over-matches because the trailing "thank"
will match words like "thanksgiving" or "thankful"; update the BENIGN pattern in
session-anchor.ts (symbol: BENIGN) to require a word boundary or explicit
pluralization (e.g., use \bthank(s)?\b or otherwise add \b around "thank") so
only standalone "thank" or "thanks" are matched; keep the existing alternation
structure and flags intact.
- Line 23: Update the PUSHBACK regex (the PUSHBACK constant) to broaden recall:
change the "that fail" token to "that failed" and add common correction/error
tokens such as "failed", "fix", "fix it", "error", "bug", and the negation
"doesn't help" (plus apostrophe-free variants like "doesnt help" and "doesnt")
so the pattern matches those user correction phrases as well; keep the existing
alternatives and case-insensitive flag intact.
In `@src/skillify/skill-edits.ts`:
- Around line 32-33: The code uses skill.indexOf(target) which only returns the
first occurrence and can pick the wrong match when the same target appears
multiple times; update the logic in the function containing variables skill,
target, idx, and r so you search for an occurrence that falls inside the range r
(e.g., use indexOf(target, r[0]) and if that result >= r[1] or -1, loop with
successive indexOf calls until you find an idx such that idx >= r[0] && idx <
r[1] or conclude no match), and add a short inline comment noting the remaining
ambiguity if multiple identical anchors exist and suggesting making anchors more
specific.
- Around line 60-66: In the "append" case when a protectedRange(s) exists,
ensure you normalize spacing by trimming trailing whitespace from the prefix
before the protected marker and inserting the same blank-line padding used in
the non-protected branch: compute prefix = s.slice(0, r[0]).replace(/\s*$/, "")
then set s = prefix + "\n\n" + content + "\n\n" + s.slice(r[0]); update the code
inside case "append" (the protectedRange(s) branch) to use this logic so
appended content never runs into the protected marker.
In `@src/skillify/skill-invocations.ts`:
- Around line 155-168: windowAroundInvocation can produce a degenerate output
when maxChars is very small because the elision notice can end up longer than
the visible head/tail; update windowAroundInvocation to handle small limits by
(a) computing the elision message first (e.g., elideMsg = `\n\n…[${joined.length
- maxChars} chars elided]…\n\n`) and ensuring head+tail+elideMsg.length <=
maxChars, or (b) for very small maxChars (choose a threshold like 100–200) fall
back to a simpler truncation strategy such as returning joined.slice(0, maxChars
- elideMsg.length) + elideMsg; adjust the head/tail length calculations
accordingly so head = Math.max(0, floor((maxChars - elideMsg.length) * 0.55))
and tail uses the remainder, referencing windowAroundInvocation, maxChars,
joined, head, tail and the elision message in your changes.
- Around line 73-79: Validate or safely-escape the sessionsTable identifier
before interpolating it in listSkillInvocations: ensure
loadConfig().sessionsTableName is checked against a strict whitelist or regex
for valid SQL identifiers (e.g., /^[A-Za-z_][A-Za-z0-9_]*$/) and reject or
fallback if it fails, so the query using sessionsTable cannot be malformed; also
review windowAroundInvocation to guard against tiny maxChars by clamping
maxChars to a sensible minimum or accounting for the elision marker length when
computing head/tail from joined so the resulting string never exceeds maxChars
and head/tail do not produce empty slices unexpectedly.
In `@src/skillify/skill-publisher.ts`:
- Around line 29-38: bumpVersion doesn't handle an entirely missing/empty
frontmatter string: when splitFrontmatter returns '' you should create a proper
frontmatter block instead of returning an empty string. Update the bumpVersion
function to detect empty or whitespace-only frontmatter, treat oldVersion as 1
and newVersion as 2, and return a frontmatter string like a full frontmatter
block containing the version (e.g. '---\nversion: 2\n---\n'); keep existing
behavior for when a version line exists (use the existing regex replace) and
when frontmatter exists but lacks a version line (insert version before the
closing '---'). Ensure you reference bumpVersion and its returned
oldVersion/newVersion values so callers remain consistent.
In `@src/skillify/skillopt-worker.ts`:
- Around line 37-49: Compute a single timestamp and reuse it for the whole cycle
so MetaEntry.proposedAt and ProposalRecord.createdAt match: before calling
runSkillOptCycle, create a const (e.g. nowIso = new Date().toISOString()) and
pass that as the now argument to runSkillOptCycle; inside the meta.record
callback (which currently calls metaEntryFor(n, a, edits, new
Date().toISOString())), replace the inline new Date() call with the shared
nowIso so metaEntryFor and the cycle use the identical timestamp.
In `@tests/shared/session-anchor.test.ts`:
- Around line 8-41: Add a unit test to verify detectAnchor truncates evidence to
200 characters: create a test that calls detectAnchor with an assistant turn
followed by a very long user pushback (e.g., "no that's wrong " +
"x".repeat(300)), assert anchored is true and that r.evidence.length === 200;
place this alongside the other detectAnchor tests and reference the detectAnchor
helper and the u/a message builders to construct the turns.
In `@tests/shared/skill-edit-gate.test.ts`:
- Around line 45-51: The query mock (vi.fn assigned to query) must validate both
time bounds instead of only checking for `< '${PUB}'`; update the mock to detect
the before-window SQL by asserting it contains the lower bound token (>=
beforeSince) and the upper bound `< '${PUB}'` and return beforeInvs, and detect
the after-window SQL by asserting it contains `>= '${PUB}'` (publishIso) and the
corresponding upper bound and return afterInvs; reference the variables and
symbols used in the test (query, PUB, beforeSince, publishIso, beforeInvs,
afterInvs, gateEdit) so the mock fails the test if gateEdit constructs incorrect
time predicates.
In `@tests/shared/skill-invocations.test.ts`:
- Around line 64-103: Add explicit tests for invocation-at-session-boundaries in
the skill-invocations.test.ts suite: create one test where the Skill tool_call
occurs at the very start (invocation index 0) and call windowAroundInvocation
with before > available turns (e.g., before: 3) and assert the result only
contains the available post-invocation turns; create another test where the
Skill tool_call is the final turn and call with after > available turns (e.g.,
after: 6) and assert the result only contains the available pre-invocation
turns. Use the existing helpers (mockQuery, TABLE, windowAroundInvocation) and
assertions matching the style of the other tests to validate the Math.max(0,
...) boundary behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: ce0d8923-7d0c-4915-abfb-cfeac3a9842f
📒 Files selected for processing (27)
node_modulessrc/hooks/session-start.tssrc/skillify/claude-model.tssrc/skillify/deficiency-detector.tssrc/skillify/session-anchor.tssrc/skillify/skill-edit-gate.tssrc/skillify/skill-edits.tssrc/skillify/skill-invocations.tssrc/skillify/skill-proposer.tssrc/skillify/skill-publisher.tssrc/skillify/skillopt-engine.tssrc/skillify/skillopt-meta.tssrc/skillify/skillopt-worker.tssrc/skillify/skills-active.tssrc/skillify/success-judge.tstests/claude-code/session-start-hook.test.tstests/shared/deficiency-detector.test.tstests/shared/session-anchor.test.tstests/shared/skill-edit-gate.test.tstests/shared/skill-edits.test.tstests/shared/skill-invocations.test.tstests/shared/skill-proposer.test.tstests/shared/skill-publisher.test.tstests/shared/skillopt-engine.test.tstests/shared/skillopt-meta.test.tstests/shared/skills-active.test.tstests/shared/success-judge.test.ts
💤 Files with no reviewable changes (2)
- tests/shared/skills-active.test.ts
- src/hooks/session-start.ts
A machine-specific absolute node_modules symlink got committed in 1929c96 because .gitignore only had `node_modules/` (matches a directory, not the symlink file). Untracked it and tightened the ignore to `node_modules` so the symlink can't be re-added. The symlink stays locally (for builds); it's just no longer in git.
|
@coderabbitai review |
✅ Action performedReview finished.
|
scoreInvocations built the judge window from windowedTurns without applying the maxChars cap, so a captured turn containing a pasted log/diff could be sent whole to claude -p — slow, costly, or failing. Extracted elide() from windowAroundInvocation and apply it to the window before judging. Test asserts a 5000-char paste is capped under maxChars.
…dex P2) The judge & proposer get untrusted captured transcript text in their prompts, so --disallowed-tools is the prompt-injection guard. The list missed write-capable tools (MultiEdit, NotebookEdit, TodoWrite), leaving an escape hatch in installs that expose them. Enumerated them so a malicious failure example can't act.
The worker only read ~/.claude/skills, so a deficient skill pulled with --to project (<cwd>/.claude/skills) had its invocation detected but no body to edit -> the proposal was silently skipped. The detached worker inherits the SessionStart cwd, so it now falls back to <cwd>/.claude/skills.
|
@coderabbitai review |
✅ Action performedReview finished.
|
On a machine with the Hivemind hooks installed, each judge/proposer claude -p call was captured as a real session row — polluting the very sessions data the detector scans. Spawn the child with HIVEMIND_CAPTURE=false, as the skillify gate runner does.
"thanks, but this is still failing" matched PUSHBACK but BENIGN's bare "thank" also matched, so detectAnchor returned false and the failure was dropped — bad for a recall-oriented stage. Split into STRONG corrections (always anchor, even amid polite words) and the ambiguous bare "no" (benign-gated). Test covers polite corrections.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/shared/deficiency-detector.test.ts (1)
55-74: ⚡ Quick winTighten the assertion threshold for stronger test precision.
The assertion on line 73 uses
< 800, but withmaxChars: 300, the elided output should be approximately 330 characters (165 head + 135 tail + ~30-character elision marker). The current threshold of 800 is 2.67× the expected size, which still proves that huge pastes are capped but won't catch bugs where the cap is only partially effective (e.g., if it capped at 700 instead of ~330).Consider tightening the assertion to
< 450or< 500to more precisely validate the capping behavior while still allowing reasonable overhead.🎯 Proposed tighter assertion
- expect(judgedLen).toBeLessThan(800); // capped — not the ~5000-char paste + expect(judgedLen).toBeLessThan(450); // capped to ~330 chars (maxChars=300 + elision marker overhead)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/shared/deficiency-detector.test.ts` around lines 55 - 74, The test "caps the judged window at maxChars..." currently asserts judgedLen < 800 which is too loose; update the tighter threshold to judgedLen < 500 (or <450) to better reflect maxChars: 300 behavior. Locate the test in tests/shared/deficiency-detector.test.ts (the it block that calls detectDeficientSkills and sets judgedLen via the mocked judge) and change the expectation on judgedLen from expect(judgedLen).toBeLessThan(800) to expect(judgedLen).toBeLessThan(500) so the assertion more precisely verifies the capping.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/shared/deficiency-detector.test.ts`:
- Around line 55-74: The test "caps the judged window at maxChars..." currently
asserts judgedLen < 800 which is too loose; update the tighter threshold to
judgedLen < 500 (or <450) to better reflect maxChars: 300 behavior. Locate the
test in tests/shared/deficiency-detector.test.ts (the it block that calls
detectDeficientSkills and sets judgedLen via the mocked judge) and change the
expectation on judgedLen from expect(judgedLen).toBeLessThan(800) to
expect(judgedLen).toBeLessThan(500) so the assertion more precisely verifies the
capping.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: f702cf06-8dff-4496-abbd-21b4f5227f0e
📒 Files selected for processing (7)
src/skillify/claude-model.tssrc/skillify/deficiency-detector.tssrc/skillify/session-anchor.tssrc/skillify/skill-invocations.tssrc/skillify/skillopt-worker.tstests/shared/deficiency-detector.test.tstests/shared/session-anchor.test.ts
🚧 Files skipped from review as they are similar to previous changes (6)
- src/skillify/session-anchor.ts
- tests/shared/session-anchor.test.ts
- src/skillify/claude-model.ts
- src/skillify/skill-invocations.ts
- src/skillify/deficiency-detector.ts
- src/skillify/skillopt-worker.ts
…ex P2) targetsProtected only checked the target's start offset, so a replace/delete whose target began just before SLOW_UPDATE_START and spanned into the block slipped through and could remove protected longitudinal guidance. Now rejects any target RANGE that overlaps [r0, r1). Test covers a spanning target.
…ths (codex P2) Two findings sharing skill-invocations.ts: - detectAnchor scanned the whole window incl. pre-invocation turns, so a correction from an EARLIER failed attempt (skill used as a repair) was misattributed to this skill. windowedTurns now returns a pivot (first post-invocation turn); detectAnchor takes fromIndex and only fires when both the reaction AND the assistant it reacts to are post-invocation. Test transcripts reordered to real capture order (user → tool_call → assistant → reaction). - splitOrgSkill now rejects refs containing path separators or '..' before name/author are used to build skills/proposals filesystem paths (path-escape guard, matching the pull path's untrusted treatment).
|
@coderabbitai review |
✅ Action performedReview finished.
|
…x P1) A deny-list only blocks named built-ins — any other tool in the user's env (LS, NotebookRead, configured MCP tools) stayed callable, breaking the pure-text isolation for calls that include untrusted captured transcript text. Switched to `--tools """ (empty allow-list = zero tools available, authoritative over built-ins AND MCP). Verified the flag isolates with a real claude -p call.
…ex P2) Invocations come from ALL projects, but the worker resolved project-scoped skills from its own SessionStart cwd — so a skill pulled --to project in another cwd was skipped, or a same-named skill in the current cwd was edited instead. readSkillBodyViaManifest now resolves the authoritative installRoot from the pull manifest (global ~/.claude/skills as fallback). Test covers the project-root read.
|
@coderabbitai review |
✅ Action performedReview finished.
|
…alls (codex P2) HIVEMIND_CAPTURE=false stopped capture, but the spawned judge/proposer claude -p still ran our SessionStart hook — injecting the large Deeplake context into the prompt and doing auto-pull/graph work, once per anchored invocation. Set HIVEMIND_WIKI_WORKER=1 (the guard internal runners use) so the hook returns immediately: no prompt contamination, no repeated background work.
… P3) path LIKE '%sid%' could pull rows from an unrelated session (sid as a substring, or SQL LIKE wildcards in the id), feeding the judge a mixed transcript. Escape LIKE wildcards (+ ESCAPE clause) and drop any row whose message.session_id isn't the exact session. Test covers a cross-session collision row.
… P3)
bumpVersion('') left the frontmatter empty while reporting newVersion: 2, so a
published edit of a no-frontmatter skill carried no version and the gate couldn't
tell versions apart. Now emits a minimal '---\nversion: N\n---\n' block. Test added.
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/skillify/skillopt-worker.ts (1)
64-64:⚠️ Potential issue | 🟠 Major | ⚡ Quick winReturn a non-zero exit code on fatal worker errors.
Line 64 logs a fatal condition and still exits with
0. That makes a crashed SkillOpt run indistinguishable from a successful no-op run, so callers and manual invocations cannot detect that the worker actually failed.Suggested fix
-main().catch((e) => { log(`fatal (swallowed): ${(e as Error)?.message ?? e}`); process.exit(0); }); +main().catch((e) => { + log(`fatal: ${(e as Error)?.message ?? e}`); + process.exit(1); +});🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/skillify/skillopt-worker.ts` at line 64, The current top-level error handler for main() logs a fatal error but calls process.exit(0); change it to return a non-zero exit code so failures are detectable: update the catch handler on main() (the lambda that calls log(...) and process.exit(...)) to call process.exit(1) (or another non-zero code) and include the error details in the log message (using the existing (e as Error)?.message ?? e expression) so the handler for main() correctly signals failure to callers.src/skillify/skillopt-engine.ts (1)
70-86:⚠️ Potential issue | 🟠 Major | ⚡ Quick winIsolate per-skill failures inside the proposal loop.
On Line 74 through Line 84, any throw from
readSkillBody,proposeSkillEdit,writeProposal, ormeta.recordabortsrunSkillOptCyclefor the entire batch. Becausesrc/skillify/skillopt-worker.tsLine 64 catches that once at the top, one transient model or filesystem failure skips every remaining deficient skill in the cycle.Suggested containment
const proposals: CycleResult["proposals"] = []; for (const s of targets) { - const body = deps.readSkillBody(s.name, s.author); - if (!body) continue; // not installed locally → nothing to edit - const priorEdits = deps.meta?.prior(s.name, s.author) ?? []; - const p = await proposeSkillEdit(body, s.examples, { ...deps.proposer, priorEdits }); - // dedup against the meta memory — don't re-write an edit already tried for this skill. - const isDup = p.changed && (deps.meta?.has(s.name, s.author, p.edits) ?? false); - if (p.changed && !isDup) { - deps.writeProposal({ - name: s.name, author: s.author, - invocations: s.invocations, confirmedFailures: s.confirmedFailures, failureRate: s.failureRate, - examples: s.examples, edits: p.edits, report: p.report, - candidateBody: p.editedBody, createdAt: deps.now, - }); - deps.meta?.record(s.name, s.author, p.edits); - } - proposals.push({ name: s.name, author: s.author, changed: p.changed && !isDup, failureRate: s.failureRate }); + try { + const body = deps.readSkillBody(s.name, s.author); + if (!body) continue; // not installed locally → nothing to edit + const priorEdits = deps.meta?.prior(s.name, s.author) ?? []; + const p = await proposeSkillEdit(body, s.examples, { ...deps.proposer, priorEdits }); + const isDup = p.changed && (deps.meta?.has(s.name, s.author, p.edits) ?? false); + if (p.changed && !isDup) { + deps.writeProposal({ + name: s.name, author: s.author, + invocations: s.invocations, confirmedFailures: s.confirmedFailures, failureRate: s.failureRate, + examples: s.examples, edits: p.edits, report: p.report, + candidateBody: p.editedBody, createdAt: deps.now, + }); + deps.meta?.record(s.name, s.author, p.edits); + } + proposals.push({ name: s.name, author: s.author, changed: p.changed && !isDup, failureRate: s.failureRate }); + } catch { + proposals.push({ name: s.name, author: s.author, changed: false, failureRate: s.failureRate }); + continue; + } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/skillify/skillopt-engine.ts` around lines 70 - 86, Wrap each iteration of the targets loop in a per-skill try/catch so exceptions from readSkillBody, proposeSkillEdit, writeProposal, or deps.meta.record do not abort the whole runSkillOptCycle; on error, log the exception with skill context (name, author) and continue to the next skill, and still push a proposals entry for that skill with changed set to false (and include its failureRate/other available metadata) so the caller sees the skill was processed but failed; ensure deps.meta.record and deps.writeProposal are only invoked on the successful path where p.changed && !isDup.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/skillify/skillopt-engine.ts`:
- Around line 70-86: Wrap each iteration of the targets loop in a per-skill
try/catch so exceptions from readSkillBody, proposeSkillEdit, writeProposal, or
deps.meta.record do not abort the whole runSkillOptCycle; on error, log the
exception with skill context (name, author) and continue to the next skill, and
still push a proposals entry for that skill with changed set to false (and
include its failureRate/other available metadata) so the caller sees the skill
was processed but failed; ensure deps.meta.record and deps.writeProposal are
only invoked on the successful path where p.changed && !isDup.
In `@src/skillify/skillopt-worker.ts`:
- Line 64: The current top-level error handler for main() logs a fatal error but
calls process.exit(0); change it to return a non-zero exit code so failures are
detectable: update the catch handler on main() (the lambda that calls log(...)
and process.exit(...)) to call process.exit(1) (or another non-zero code) and
include the error details in the log message (using the existing (e as
Error)?.message ?? e expression) so the handler for main() correctly signals
failure to callers.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 61444a8f-b98a-4d96-9ef9-5f55fa29c5ef
📒 Files selected for processing (8)
src/skillify/claude-model.tssrc/skillify/skill-invocations.tssrc/skillify/skill-publisher.tssrc/skillify/skillopt-engine.tssrc/skillify/skillopt-worker.tstests/shared/skill-invocations.test.tstests/shared/skill-publisher.test.tstests/shared/skillopt-engine.test.ts
🚧 Files skipped from review as they are similar to previous changes (5)
- tests/shared/skill-publisher.test.ts
- tests/shared/skill-invocations.test.ts
- tests/shared/skillopt-engine.test.ts
- src/skillify/skill-publisher.ts
- src/skillify/skill-invocations.ts
A detached hook worker may not have claude on PATH (e.g. ~/.claude/local/claude);
spawn("claude") would ENOENT and the callers swallow it as no-change, so the weekly
worker silently produced no proposals in those layouts. Resolve the binary the same
way the rest of skillify does (findAgentBin('claude_code')).
runWeeklySkillOpt stamped lastRun before the worker checks creds, so firing at a logged-out SessionStart consumed the week — a user who logged in minutes later was skipped as throttled until next week. Added a canFire gate (default: creds present) that returns reason 'no-creds' before any stamp/spawn. Test covers it.
|
@coderabbitai review |
✅ Action performedReview finished.
|
… (codex P2) Two follow-ons from round 6: - claude-model: --tools "" denies tool USE but claude -p still LOADS the user's MCP config; a broken/oversized MCP schema fails every judge/proposer call before JSON. Added --strict-mcp-config to ignore user MCP entirely. - trigger: the canFire gate used the credentials FILE, but the worker uses loadConfig() (which also accepts HIVEMIND_TOKEN/ORG_ID env creds). Mismatch skipped env-cred users forever / stamped on a token-only malformed file. Now both use loadConfig()?.token.
|
@coderabbitai review |
✅ Action performedReview finished.
|
The weekly worker's lookback + detector thresholds are now overridable via env —
HIVEMIND_SKILLOPT_{LOOKBACK_DAYS,MIN_INVOCATIONS,FAILURE_RATE,FIRE_THRESHOLD} — with
the same defaults (30d, detector min-n, 0.4 rate, >=5 fire). A positive override
wins; non-numeric/<=0 falls back. Useful for tuning + smaller-data testing.
|
@CodeRabbit how do we label a skill? how do we decide if it's good or not? |
|
Tip For best results, initiate chat on the files or code changes.
How a skill's quality is evaluatedStage 1 — Anchor detection (cheap heuristic)Every time a
If such a pattern is found → the invocation is anchored (flagged as a candidate failure). No judge runs for un-anchored invocations, so the expensive model call is reserved only for suspicious windows. Stage 2 — Success judge (LLM-as-a-reward-model)For each anchored invocation window, Aggregation & deficiency gate
Even if a skill is deficient, no proposals fire unless ≥5 deficient events are accumulated (the Post-publish gateAfter an edit is published,
In short: user pushback patterns = cheap first filter → LLM judge = confirmation → failure rate thresholds = deficiency flag. No human labels are ever required. |
@CodeRabbit do we keep the tool calls? if so, does it make sense to remove them to remove the noise? we do that for summaries |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
fd 'skill-invocations.ts' --exec cat {}Length of output: 8220
// The Skill invocation tool_call itself is detected and skipped (only used to mark position):
const ref = invokedSkillRef(j);
if (ref) {
// ... mark invIndex, then:
continue;
}
// All other message types are handled here:
const text = typeof j.content === "string" ? j.content.trim() : "";
if (!text) continue;
if (j.type === "user_message") turns.push({ role: "USER", text });
else if (j.type === "assistant_message") turns.push({ role: "ASSISTANT", text });
// ← any other type (tool_call, tool_result, system, etc.) falls through — never pushedSo One edge case worth noting: if |
|
@kaghni can you increase the coverage? |
Stacked on #230 (foundation). Builds the actual SkillOpt engine: detect deficient skills from real usage, propose bounded edits, gate them — the loop the foundation's worker stub was a placeholder for.
What it does
```
Skill invocations (real Skill tool_use) → window → anchor (free heuristic)
→ success-judge (only if anchored) → ≥5 deficient gate
→ propose bounded structured edits (meta-aware) → write REVIEW PROPOSALS
→ [apply] → live publish (version+backup) → longitudinal gate → keep/revert
```
Key decisions
skills_active(the foundation's availability attribution). Actual invocation (Skilltool_use, already captured by capture.ts) is strictly better — it proves the skill was used and pins the turn for tight windowing.Proven E2E on real data
Fabricated synthetic sessions in a test org → the real engine ran against real Deeplake + real `claude -p`: 6 deficient skills detected → ≥5 gate fired → 6 genuinely-good candidate edits (correctly diagnosed "mock-the-client" / "no-verify" from confirmed failures and rewrote the skill). On the org's real sessions: 40 invocations → 0 anchored → $0 (free filter working).
Paper fidelity
All SkillOpt components mapped: trainable doc ✅, rollout=real sessions ✅, reward=judge+anchor ✅, reflect→structured edits ✅, edit budget ✅, slow-update region ✅, meta-skill ✅, publish ✅, gate ✅ (longitudinal; randomized A/B = follow-up, needs version-at-invocation capture).
Tests
98 unit tests across 12 files — token discipline, precision gate, dedup, slow-update protection, conservative judge, gate decisions. All injected (no live Deeplake/LLM in tests).
Deferred (honest)
🤖 Generated with Claude Code
Summary by CodeRabbit