feat(skillopt): deficiency-detection + edit-proposer engine (stacked on #230) by kaghni · Pull Request #237 · activeloopai/hivemind

kaghni · 2026-06-05T06:09:56Z

Stacked on #230 (foundation). Builds the actual SkillOpt engine: detect deficient skills from real usage, propose bounded edits, gate them — the loop the foundation's worker stub was a placeholder for.

What it does

```
Skill invocations (real Skill tool_use) → window → anchor (free heuristic)
→ success-judge (only if anchored) → ≥5 deficient gate
→ propose bounded structured edits (meta-aware) → write REVIEW PROPOSALS
→ [apply] → live publish (version+backup) → longitudinal gate → keep/revert
```

Key decisions

Drops skills_active (the foundation's availability attribution). Actual invocation (Skill tool_use, already captured by capture.ts) is strictly better — it proves the skill was used and pins the turn for tight windowing.
No auto-publish. The worker writes review proposals; live overwrite waits for the gate to validate an edit in production (the offline gate is provably untrustworthy). Nothing ever touches the model channel (PR First-run + resume session briefs, kept to the user channel #223 rule).
Reward = success-judge + free heuristic anchor (no labels). The anchor (user pushback) is a free filter so the judge runs only on a fraction → token-cheap.

Proven E2E on real data

Fabricated synthetic sessions in a test org → the real engine ran against real Deeplake + real `claude -p`: 6 deficient skills detected → ≥5 gate fired → 6 genuinely-good candidate edits (correctly diagnosed "mock-the-client" / "no-verify" from confirmed failures and rewrote the skill). On the org's real sessions: 40 invocations → 0 anchored → $0 (free filter working).

Paper fidelity

All SkillOpt components mapped: trainable doc ✅, rollout=real sessions ✅, reward=judge+anchor ✅, reflect→structured edits ✅, edit budget ✅, slow-update region ✅, meta-skill ✅, publish ✅, gate ✅ (longitudinal; randomized A/B = follow-up, needs version-at-invocation capture).

Tests

98 unit tests across 12 files — token discipline, precision gate, dedup, slow-update protection, conservative judge, gate decisions. All injected (no live Deeplake/LLM in tests).

Deferred (honest)

Randomized A/B (needs a small capture change: record skill version at invocation).
Apply-proposal + monitor command (where the gate plugs in).
Anchor pattern tuning against real pushback.

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Automated weekly skill‑optimization cycle that detects deficient skills, generates bounded review proposals, deduplicates candidates, and produces safe publish artifacts with backups and version bumps.
- New end‑user tooling: invocation & anchor detection, success judging, edit proposal/apply with protected regions, model‑driven proposer, and local proposal/meta management.
Bug Fixes
- Reduced session‑start query footprint and clearer logging when capture/DDL is skipped.
Tests
- Extensive Vitest coverage added across detection, anchoring, judging, edits, proposing, publishing, engine, meta, and worker behavior.
Chores
- .gitignore tweak to ignore node_modules.

Read side of the attribution the foundation writes. listSkillSessions reads the skills_active rows (filtered on description='skills_active', deduped per session) into per-session {skills, bucket}; cohortsForSkill partitions a skill's treatment/control; reconstructSession rebuilds the raw transcript from the session's captured rows. All queries injected (unit-testable); nothing touches the model channel. First piece of the deficiency detector. 10 unit tests.

Read side keyed on actual invocations: capture.ts persists each Skill tool_use as a tool_call row (tool_name=Skill, tool_input={skill}). listSkillInvocations pulls org-skill (<name>--<author>) invocations; windowAroundInvocation slices the transcript before/after the invocation turn for tight, cheap judging. Supersedes availability-based attribution. Injected queries; 8 unit tests.

Skill *invocation* (Skill tool_use, already captured) is a more accurate signal than availability and pins the turn for windowing, so the SessionStart skills_active write + helper + the availability-based cohorts module are removed. The weekly trigger/worker stay. session-start-hook tests revert to the pre-attribution query sequence; trigger tests unchanged.

Pure, free pushback detector: a USER turn right after an ASSISTANT turn matching correction patterns = a hard, observable failure signal, independent of any LLM. Tuned for recall (false positives just cost a judge call; false negatives only under-detect — never churns a good skill). Exposes windowedTurns() from skill-invocations so the anchor reads the same windowed slice the judge will. 6 anchor tests + patterns to tune on real data.

LLM judge of the windowed slice: was the task accomplished CORRECTLY? Prompted to ignore user mood (a praised-but-wrong answer is a FAILURE) — the anti-sycophancy axis the research validated. Runs on the user's claude -p (cheap default model, all tools denied), injected for tests. Conservative: unparseable/errored/empty judgments return success=1 so a flaky judge can only under-detect, never manufacture deficiency. 7 unit tests.

Orchestrates the engine's detection step: group org-skill invocations, window each, run the free anchor, and judge ONLY anchored windows. A confirmed failure needs anchor AND judge agreement (precision). A skill is deficient at >=minInvocations AND >=failureRateThreshold confirmed-failure rate. Returns per-skill stats + deficientCount (the caller's >=5 fire gate). All injected; 2 tests assert the min-n gate and that the judge runs only on anchored windows.

Port of SkillOpt's edit-application: append/insert_after/replace/delete ops, the edit budget (textual learning rate), and the protected  region (the paper's slow-update) that fast edits must not touch. Pure + deterministic; 7 tests cover each op, the budget, and the protection.

…e backend The paper's backward pass: from a deficient skill's body + its confirmed failures, the optimizer diagnoses the single recurring weakness and emits bounded structured edits (slow-update region off-limits), applied locally to a candidate body — nothing published. Extracts the shared claude -p backend (claude-model.ts) used by both judge and proposer; success-judge refactored onto it. Injected model; tolerant edit parse; 5 proposer tests.

Writes an accepted edit to the live SKILL.md via the native skills dir (never the model channel — PR #223), bumps the frontmatter version (v1-vs-v2), and backs up the prior version for one-command revert. Mechanism only: the worker won't call it on an unvalidated edit (offline gate isn't trustworthy) — reserved for the real-usage A/B gate. 6 fs tests.

runSkillOptCycle ties it together: detect deficient skills, fire only at >=fireThreshold (default 5 — act on a pattern, not noise), then for each: read the body, propose a bounded edit, write a REVIEW PROPOSAL (candidate.md + proposal.json with full evidence). Never auto-overwrites a live skill — live publish is gated on the deferred A/B. All I/O injected; 4 tests cover the gate, proposals, skip-when- not-installed, and a custom threshold.

Replaces the stub: the detached weekly worker now loads config, builds the query, and runs runSkillOptCycle (detect -> >=5 gate -> propose -> write review proposals) over a 30-day invocation lookback. Verified end-to-end against a real org: 40 Skill invocations scanned, 0 anchored -> 0 judge calls -> 0 deficient -> exits at the gate (~14s, $0). No auto-publish; proposals land under <stateDir>/skillopt/proposals.

The paper's meta-skill: an append-only JSONL recording every proposed edit per skill (with an order-independent fingerprint). The cycle feeds prior-tried edits to the proposer as context and dedups — never re-writing an edit already proposed for a skill, so the optimizer doesn't churn. Status field (proposed→applied/reverted) is ready to close the loop once the A/B gate records outcomes. Wired into the worker; meta + dedup tests.

The validation organ. A randomized A/B needs the skill version at invocation time (a capture change we lack), so the feasible gate is longitudinal: after an edit publishes, compare the skill's confirmed-failure rate after vs before — a real drop keeps it, a real rise reverts (from the SKILL.v<old> backup), unclear/too-few = inconclusive. Observational (needs margin + min sample); randomized A/B is the upgrade once invocation-version capture lands. Extracts scoreInvocations (shared by detector + gate); adds untilIso to listSkillInvocations. 7 tests.

coderabbitai · 2026-06-05T06:10:08Z

Need an answer fast? Review this PR in Change Stack to ask focused questions about the PR or a changed range.

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: cd440db5-bf3b-47df-b1ab-00c220937524

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds an end-to-end SkillOpt pipeline: invocation extraction, anchor-based judging, deficiency scoring, LLM-driven edit proposals, edit application/publishing, meta deduplication, a one-shot worker run, and removal of session-start skill-attribution writes.

Changes

SkillOpt Core Implementation

Layer / File(s)	Summary
Session failure detection `src/skillify/session-anchor.ts`, `src/skillify/claude-model.ts`, `src/skillify/success-judge.ts`, `tests/shared/session-anchor.test.ts`, `tests/shared/success-judge.test.ts`	Anchoring detects explicit user corrections by pattern-matching adjacent turns; `claudeModel` spawns the Claude CLI with timeouts and tool denial; `judgeSuccess` evaluates transcript windows with conservative fallback on errors.
Skill invocation extraction `src/skillify/skill-invocations.ts`, `tests/shared/skill-invocations.test.ts`	Extracts Skill tool-call invocations from session messages, parses org-scoped `name--author` refs, reconstructs user/assistant turn pairs, and returns windowed transcript context with elision.
Deficiency detection `src/skillify/deficiency-detector.ts`, `tests/shared/deficiency-detector.test.ts`	Groups invocations by skill, windows transcripts, applies anchor+judge filtering, counts confirmed failures, and marks skills deficient when failure rate exceeds threshold and minimum invocation count is met.
Skill edit primitives `src/skillify/skill-edits.ts`, `tests/shared/skill-edits.test.ts`	Defines markdown edit operations (append, insert_after, replace, delete) with protected-region support and edit budgets; applies edits with per-operation reporting and skips operations targeting protected blocks.
Skill proposal generation `src/skillify/skill-proposer.ts`, `tests/shared/skill-proposer.test.ts`	Constructs prompts for Claude with skill body, failure examples, and prior edits; parses JSON edit arrays from model output with tolerance for surrounding prose; applies edits under budget and returns change flag.
Skill file versioning & publishing `src/skillify/skill-publisher.ts`, `tests/shared/skill-publisher.test.ts`	Splits frontmatter from markdown body, bumps version numbers with defaults, writes backups named `SKILL.v<old>.bak.md`, and overwrites `SKILL.md` with bumped frontmatter plus edited body.
Edit outcome validation (before/after gating) `src/skillify/skill-edit-gate.ts`, `tests/shared/skill-edit-gate.test.ts`	Measures skill failure rates in before/after windows around publish time, computes deltas, and decides `keep`, `revert`, or `inconclusive` based on configurable margin and minimum-sample thresholds.
Cycle orchestration & meta-state management `src/skillify/skillopt-engine.ts`, `src/skillify/skillopt-meta.ts`, `tests/shared/skillopt-engine.test.ts`, `tests/shared/skillopt-meta.test.ts`	Orchestrates the full weekly cycle: detects deficient skills, reads local bodies, proposes edits, deduplicates against meta history using fingerprints, writes proposals to disk, and tracks status for cross-run deduplication.
Weekly worker activation & execution `src/skillify/skillopt-worker.ts`	Activates the SkillOpt worker to run one full cycle: loads config, constructs query adapter, initializes filesystem roots, loads meta cache, calls `runSkillOptCycle` with fire-gate threshold, and logs deficiency/proposal counts.
Session-start hook cleanup `src/hooks/session-start.ts`, `tests/claude-code/session-start-hook.test.ts`	Removes deprecated `HIVEMIND_SKILL_ATTRIBUTION` skill attribution capture logic from session startup; retains `autoPullSkills()`, placeholder setup, and context rendering.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

activeloopai/hivemind#112: Related session-start skill pull/attribution changes; touches similar SessionStart logic.

Suggested reviewers

efenocchi

Poem

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 54.35% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main feature: implementing the SkillOpt deficiency-detection and edit-proposer engine, with clear reference to the stacked dependency (`#230`).
Description check	✅ Passed	The description is comprehensive and complete: it explains the PR's purpose, key decisions, architecture, proven testing, paper fidelity, deferred items, and includes a detailed commit summary. The author demonstrates thorough understanding of the changes.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/skillopt-engine

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

kaghni · 2026-06-05T06:10:11Z

@coderabbitai review

coderabbitai · 2026-06-05T06:10:16Z

✅ Action performed

Review finished.

github-actions · 2026-06-05T06:11:40Z

Coverage Report

Scope: files changed in this PR. Enforced threshold: 90% per metric (per file via vitest.config.ts).

Status	Category	Percentage	Covered / Total
🔴	Lines	89.70% (🎯 90%)	418 / 466
🔴	Statements	84.73% (🎯 90%)	505 / 596
🔴	Functions	78.26% (🎯 90%)	72 / 92
🔴	Branches	83.82% (🎯 90%)	347 / 414

File Coverage — 15 files changed

File	Stmts	Branches	Functions	Lines
`src/hooks/session-start.ts`	🟢 100.0%	🟢 94.8%	🟢 100.0%	🟢 100.0%
`src/skillify/claude-model.ts`	🔴 0.0%	🔴 0.0%	🔴 0.0%	🔴 0.0%
`src/skillify/deficiency-detector.ts`	🟢 100.0%	🔴 87.5%	🟢 100.0%	🟢 100.0%
`src/skillify/session-anchor.ts`	🟢 92.8%	🟢 91.7%	🟢 100.0%	🟢 100.0%
`src/skillify/skill-edit-gate.ts`	🟢 100.0%	🔴 89.5%	🟢 100.0%	🟢 100.0%
`src/skillify/skill-edits.ts`	🔴 81.3%	🔴 72.2%	🟢 100.0%	🟢 98.2%
`src/skillify/skill-invocations.ts`	🟢 93.3%	🔴 86.2%	🟢 100.0%	🟢 98.4%
`src/skillify/skill-proposer.ts`	🟢 94.9%	🔴 89.7%	🟢 100.0%	🟢 100.0%
`src/skillify/skill-publisher.ts`	🟢 100.0%	🟢 100.0%	🟢 100.0%	🟢 100.0%
`src/skillify/skillopt-engine.ts`	🔴 87.2%	🟢 95.7%	🔴 85.7%	🔴 84.8%
`src/skillify/skillopt-meta.ts`	🟢 100.0%	🔴 88.2%	🟢 100.0%	🟢 100.0%
`src/skillify/skillopt-trigger.ts`	🟢 96.2%	🟢 95.0%	🟢 100.0%	🟢 97.5%
`src/skillify/skillopt-worker.ts`	🔴 0.0%	🔴 0.0%	🔴 0.0%	🔴 0.0%
`src/skillify/skills-active.ts`	—	—	—	—
`src/skillify/success-judge.ts`	🟢 96.2%	🔴 83.3%	🟢 100.0%	🟢 100.0%

_{Generated for commit 9a1769a.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (11)

tests/shared/session-anchor.test.ts (1)

8-41: ⚡ Quick win

Consider adding a test for evidence truncation.

Line 39 in the implementation truncates evidence to 200 chars, but there's no test verifying this behavior. This is an edge case worth validating.

🧪 Suggested test addition

it("truncates evidence to 200 chars", () => {
  const longPushback = "no that's wrong " + "x".repeat(300);
  const r = detectAnchor([a("done"), u(longPushback)]);
  expect(r.anchored).toBe(true);
  expect(r.evidence.length).toBe(200);
});

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/shared/session-anchor.test.ts` around lines 8 - 41, Add a unit test to
verify detectAnchor truncates evidence to 200 characters: create a test that
calls detectAnchor with an assistant turn followed by a very long user pushback
(e.g., "no that's wrong " + "x".repeat(300)), assert anchored is true and that
r.evidence.length === 200; place this alongside the other detectAnchor tests and
reference the detectAnchor helper and the u/a message builders to construct the
turns.

src/skillify/session-anchor.ts (2)

27-27: 💤 Low value

BENIGN pattern may over-match on "thank".

The pattern ends with thank which will match "thanksgiving", "thankful", etc. Consider adding word boundary or requiring "thanks" specifically.

🔧 Suggested refinement

-const BENIGN = /\b(no (problem|worries|need|biggie)|no,? thanks|all good|works? (now|great|fine|perfectly)|that works|perfect|looks good|thank)/i;
+const BENIGN = /\b(no (problem|worries|need|biggie)|no,? thanks|all good|works? (now|great|fine|perfectly)|that works|perfect|looks good|thanks?)\b/i;

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/skillify/session-anchor.ts` at line 27, The BENIGN regex constant
over-matches because the trailing "thank" will match words like "thanksgiving"
or "thankful"; update the BENIGN pattern in session-anchor.ts (symbol: BENIGN)
to require a word boundary or explicit pluralization (e.g., use \bthank(s)?\b or
otherwise add \b around "thank") so only standalone "thank" or "thanks" are
matched; keep the existing alternation structure and flags intact.

23-23: ⚡ Quick win

Consider expanding PUSHBACK patterns for better recall.

The regex is tuned for recall but could catch a few more common correction patterns:

"that fail" should likely be "that failed" (past tense)
Missing common patterns: "fix", "error", "bug", "doesn't help", "failed"

Since false positives only cost one judge call and the goal is high recall, consider adding these patterns.

📝 Suggested regex enhancement

-const PUSHBACK = /\b(no|nope|wrong|incorrect|not what|that'?s not|does ?n'?t work|did ?n'?t work|do ?n'?t work|wo ?n'?t work|is ?n'?t|that'?s wrong|broke|broken|still (failing|broken|not working|wrong|the same)|try again|undo|revert that|that fail)/i;
+const PUSHBACK = /\b(no|nope|wrong|incorrect|not what|that'?s not|does ?n'?t work|did ?n'?t work|do ?n'?t work|wo ?n'?t work|is ?n'?t|that'?s wrong|broke|broken|failed|still (failing|broken|not working|wrong|the same)|try again|undo|revert that|fix (that|this|it)|error|bug)/i;

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/skillify/session-anchor.ts` at line 23, Update the PUSHBACK regex (the
PUSHBACK constant) to broaden recall: change the "that fail" token to "that
failed" and add common correction/error tokens such as "failed", "fix", "fix
it", "error", "bug", and the negation "doesn't help" (plus apostrophe-free
variants like "doesnt help" and "doesnt") so the pattern matches those user
correction phrases as well; keep the existing alternatives and case-insensitive
flag intact.

tests/shared/skill-invocations.test.ts (1)

tests/shared/skill-edit-gate.test.ts (1)

src/skillify/skill-invocations.ts (2)

src/skillify/skillopt-worker.ts (1)

37-49: 💤 Low value

Consider using consistent timestamp across cycle.

Line 45's record callback creates a new Date().toISOString() timestamp for each meta entry, while line 48 passes a separate now value to the cycle. This causes MetaEntry.proposedAt and ProposalRecord.createdAt to have slightly different timestamps (potentially seconds apart) for edits in the same cycle run.

🔧 Suggested fix for timestamp consistency

+  const now = new Date().toISOString();
   const res = await runSkillOptCycle({
     query,
     sessionsTable: config.sessionsTableName,
     readSkillBody: (name, author) => readSkillBodyFromDisk(skillsRoot, name, author),
     writeProposal: (rec) => writeProposalToDisk(proposalsRoot, rec),
     meta: {
       prior: (n, a) => priorEditSummaries(metaCache, n, a),
       has: (n, a, edits) => alreadyProposed(metaCache, n, a, edits),
-      record: (n, a, edits) => { const e = metaEntryFor(n, a, edits, new Date().toISOString()); appendMeta(metaFile, e); metaCache.push(e); },
+      record: (n, a, edits) => { const e = metaEntryFor(n, a, edits, now); appendMeta(metaFile, e); metaCache.push(e); },
     },
     detector: { sinceIso, limit: 5000 },
-    now: new Date().toISOString(),
+    now,
   });

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/skillify/skillopt-worker.ts` around lines 37 - 49, Compute a single
timestamp and reuse it for the whole cycle so MetaEntry.proposedAt and
ProposalRecord.createdAt match: before calling runSkillOptCycle, create a const
(e.g. nowIso = new Date().toISOString()) and pass that as the now argument to
runSkillOptCycle; inside the meta.record callback (which currently calls
metaEntryFor(n, a, edits, new Date().toISOString())), replace the inline new
Date() call with the shared nowIso so metaEntryFor and the cycle use the
identical timestamp.

src/skillify/skill-edits.ts (2)

32-33: ⚡ Quick win

Consider target ambiguity when multiple occurrences exist.

The indexOf method finds only the first occurrence of the target string. If a SKILL.md contains duplicate text patterns and an edit intends to target a later occurrence, this will modify the wrong location. While the LLM proposer should choose sufficiently specific anchors, this limitation could cause incorrect edits in edge cases.

💡 Consider adding a comment documenting this limitation

 function targetsProtected(skill: string, target: string): boolean {
   const r = protectedRange(skill);
   if (!r || !target) return false;
+  // Note: indexOf finds the FIRST occurrence only. Proposer must choose unambiguous anchors.
   const idx = skill.indexOf(target);
   return idx !== -1 && idx >= r[0] && idx < r[1];
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/skillify/skill-edits.ts` around lines 32 - 33, The code uses
skill.indexOf(target) which only returns the first occurrence and can pick the
wrong match when the same target appears multiple times; update the logic in the
function containing variables skill, target, idx, and r so you search for an
occurrence that falls inside the range r (e.g., use indexOf(target, r[0]) and if
that result >= r[1] or -1, loop with successive indexOf calls until you find an
idx such that idx >= r[0] && idx < r[1] or conclude no match), and add a short
inline comment noting the remaining ambiguity if multiple identical anchors
exist and suggesting making anchors more specific.

60-66: 💤 Low value

Ensure proper newline spacing before protected region.

When appending with a protected region present, the code inserts content + "\n\n" before the marker but doesn't ensure proper spacing before the content itself. If the document text ends without a newline before the protected marker, the appended content will run into it.

🔧 Suggested defensive spacing

       case "append": {
         const content = (e.content ?? "").trim();
         if (!content) { report.push("SKIP append: empty content"); break; }
         const r = protectedRange(s);
-        if (r) s = s.slice(0, r[0]) + content + "\n\n" + s.slice(r[0]);
+        if (r) {
+          const before = s.slice(0, r[0]).replace(/\s*$/, "");
+          s = before + "\n\n" + content + "\n\n" + s.slice(r[0]);
+        }
         else s = s.replace(/\s*$/, "") + "\n\n" + content + "\n";
         ok(`append (+${content.length} chars)`);
         break;
       }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/skillify/skill-edits.ts` around lines 60 - 66, In the "append" case when
a protectedRange(s) exists, ensure you normalize spacing by trimming trailing
whitespace from the prefix before the protected marker and inserting the same
blank-line padding used in the non-protected branch: compute prefix = s.slice(0,
r[0]).replace(/\s*$/, "") then set s = prefix + "\n\n" + content + "\n\n" +
s.slice(r[0]); update the code inside case "append" (the protectedRange(s)
branch) to use this logic so appended content never runs into the protected
marker.

src/skillify/skill-publisher.ts (1)

29-38: ⚡ Quick win

Add defensive handling for missing frontmatter.

If splitFrontmatter returns empty frontmatter (file has no frontmatter block), bumpVersion returns empty frontmatter instead of creating a valid frontmatter block with version. The published file will then lack version tracking entirely.

🛡️ Suggested defensive handling

 export function bumpVersion(frontmatter: string): { frontmatter: string; oldVersion: number; newVersion: number } {
+  // If frontmatter is empty, create a minimal valid block
+  if (!frontmatter.trim()) {
+    return { frontmatter: "---\nversion: 2\n---\n", oldVersion: 1, newVersion: 2 };
+  }
   const m = frontmatter.match(/^version:\s*(\d+)\s*$/m);
   const oldVersion = m ? parseInt(m[1], 10) : 1;
   const newVersion = oldVersion + 1;
   const next = m
     ? frontmatter.replace(/^version:\s*\d+\s*$/m, `version: ${newVersion}`)
     : frontmatter.replace(/\n---\n$/, `\nversion: ${newVersion}\n---\n`);
   return { frontmatter: next, oldVersion, newVersion };
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/skillify/skill-publisher.ts` around lines 29 - 38, bumpVersion doesn't
handle an entirely missing/empty frontmatter string: when splitFrontmatter
returns '' you should create a proper frontmatter block instead of returning an
empty string. Update the bumpVersion function to detect empty or whitespace-only
frontmatter, treat oldVersion as 1 and newVersion as 2, and return a frontmatter
string like a full frontmatter block containing the version (e.g. '---\nversion:
2\n---\n'); keep existing behavior for when a version line exists (use the
existing regex replace) and when frontmatter exists but lacks a version line
(insert version before the closing '---'). Ensure you reference bumpVersion and
its returned oldVersion/newVersion values so callers remain consistent.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@node_modules`:
- Line 1: Remove the committed machine-specific node_modules symlink from the
repository and stop tracking it: unstage/untrack the tracked node_modules
symlink so it is removed from the repo history going forward, add node_modules/
to .gitignore to prevent re-adding it, and commit the removal and updated
.gitignore. Ensure the repository no longer contains the symbolic link named
"node_modules" and that ".gitignore" includes "node_modules/".

---

Nitpick comments:
In `@src/skillify/session-anchor.ts`:
- Line 27: The BENIGN regex constant over-matches because the trailing "thank"
will match words like "thanksgiving" or "thankful"; update the BENIGN pattern in
session-anchor.ts (symbol: BENIGN) to require a word boundary or explicit
pluralization (e.g., use \bthank(s)?\b or otherwise add \b around "thank") so
only standalone "thank" or "thanks" are matched; keep the existing alternation
structure and flags intact.
- Line 23: Update the PUSHBACK regex (the PUSHBACK constant) to broaden recall:
change the "that fail" token to "that failed" and add common correction/error
tokens such as "failed", "fix", "fix it", "error", "bug", and the negation
"doesn't help" (plus apostrophe-free variants like "doesnt help" and "doesnt")
so the pattern matches those user correction phrases as well; keep the existing
alternatives and case-insensitive flag intact.

In `@src/skillify/skill-edits.ts`:
- Around line 32-33: The code uses skill.indexOf(target) which only returns the
first occurrence and can pick the wrong match when the same target appears
multiple times; update the logic in the function containing variables skill,
target, idx, and r so you search for an occurrence that falls inside the range r
(e.g., use indexOf(target, r[0]) and if that result >= r[1] or -1, loop with
successive indexOf calls until you find an idx such that idx >= r[0] && idx <
r[1] or conclude no match), and add a short inline comment noting the remaining
ambiguity if multiple identical anchors exist and suggesting making anchors more
specific.
- Around line 60-66: In the "append" case when a protectedRange(s) exists,
ensure you normalize spacing by trimming trailing whitespace from the prefix
before the protected marker and inserting the same blank-line padding used in
the non-protected branch: compute prefix = s.slice(0, r[0]).replace(/\s*$/, "")
then set s = prefix + "\n\n" + content + "\n\n" + s.slice(r[0]); update the code
inside case "append" (the protectedRange(s) branch) to use this logic so
appended content never runs into the protected marker.

In `@src/skillify/skill-invocations.ts`:
- Around line 155-168: windowAroundInvocation can produce a degenerate output
when maxChars is very small because the elision notice can end up longer than
the visible head/tail; update windowAroundInvocation to handle small limits by
(a) computing the elision message first (e.g., elideMsg = `\n\n…[${joined.length
- maxChars} chars elided]…\n\n`) and ensuring head+tail+elideMsg.length <=
maxChars, or (b) for very small maxChars (choose a threshold like 100–200) fall
back to a simpler truncation strategy such as returning joined.slice(0, maxChars
- elideMsg.length) + elideMsg; adjust the head/tail length calculations
accordingly so head = Math.max(0, floor((maxChars - elideMsg.length) * 0.55))
and tail uses the remainder, referencing windowAroundInvocation, maxChars,
joined, head, tail and the elision message in your changes.
- Around line 73-79: Validate or safely-escape the sessionsTable identifier
before interpolating it in listSkillInvocations: ensure
loadConfig().sessionsTableName is checked against a strict whitelist or regex
for valid SQL identifiers (e.g., /^[A-Za-z_][A-Za-z0-9_]*$/) and reject or
fallback if it fails, so the query using sessionsTable cannot be malformed; also
review windowAroundInvocation to guard against tiny maxChars by clamping
maxChars to a sensible minimum or accounting for the elision marker length when
computing head/tail from joined so the resulting string never exceeds maxChars
and head/tail do not produce empty slices unexpectedly.

In `@src/skillify/skill-publisher.ts`:
- Around line 29-38: bumpVersion doesn't handle an entirely missing/empty
frontmatter string: when splitFrontmatter returns '' you should create a proper
frontmatter block instead of returning an empty string. Update the bumpVersion
function to detect empty or whitespace-only frontmatter, treat oldVersion as 1
and newVersion as 2, and return a frontmatter string like a full frontmatter
block containing the version (e.g. '---\nversion: 2\n---\n'); keep existing
behavior for when a version line exists (use the existing regex replace) and
when frontmatter exists but lacks a version line (insert version before the
closing '---'). Ensure you reference bumpVersion and its returned
oldVersion/newVersion values so callers remain consistent.

In `@src/skillify/skillopt-worker.ts`:
- Around line 37-49: Compute a single timestamp and reuse it for the whole cycle
so MetaEntry.proposedAt and ProposalRecord.createdAt match: before calling
runSkillOptCycle, create a const (e.g. nowIso = new Date().toISOString()) and
pass that as the now argument to runSkillOptCycle; inside the meta.record
callback (which currently calls metaEntryFor(n, a, edits, new
Date().toISOString())), replace the inline new Date() call with the shared
nowIso so metaEntryFor and the cycle use the identical timestamp.

In `@tests/shared/session-anchor.test.ts`:
- Around line 8-41: Add a unit test to verify detectAnchor truncates evidence to
200 characters: create a test that calls detectAnchor with an assistant turn
followed by a very long user pushback (e.g., "no that's wrong " +
"x".repeat(300)), assert anchored is true and that r.evidence.length === 200;
place this alongside the other detectAnchor tests and reference the detectAnchor
helper and the u/a message builders to construct the turns.

In `@tests/shared/skill-edit-gate.test.ts`:
- Around line 45-51: The query mock (vi.fn assigned to query) must validate both
time bounds instead of only checking for `< '${PUB}'`; update the mock to detect
the before-window SQL by asserting it contains the lower bound token (>=
beforeSince) and the upper bound `< '${PUB}'` and return beforeInvs, and detect
the after-window SQL by asserting it contains `>= '${PUB}'` (publishIso) and the
corresponding upper bound and return afterInvs; reference the variables and
symbols used in the test (query, PUB, beforeSince, publishIso, beforeInvs,
afterInvs, gateEdit) so the mock fails the test if gateEdit constructs incorrect
time predicates.

In `@tests/shared/skill-invocations.test.ts`:
- Around line 64-103: Add explicit tests for invocation-at-session-boundaries in
the skill-invocations.test.ts suite: create one test where the Skill tool_call
occurs at the very start (invocation index 0) and call windowAroundInvocation
with before > available turns (e.g., before: 3) and assert the result only
contains the available post-invocation turns; create another test where the
Skill tool_call is the final turn and call with after > available turns (e.g.,
after: 6) and assert the result only contains the available pre-invocation
turns. Use the existing helpers (mockQuery, TABLE, windowAroundInvocation) and
assertions matching the style of the other tests to validate the Math.max(0,
...) boundary behavior.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: ce0d8923-7d0c-4915-abfb-cfeac3a9842f

📥 Commits

Reviewing files that changed from the base of the PR and between a34a9de and b03c10a.

📒 Files selected for processing (27)

node_modules
src/hooks/session-start.ts
src/skillify/claude-model.ts
src/skillify/deficiency-detector.ts
src/skillify/session-anchor.ts
src/skillify/skill-edit-gate.ts
src/skillify/skill-edits.ts
src/skillify/skill-invocations.ts
src/skillify/skill-proposer.ts
src/skillify/skill-publisher.ts
src/skillify/skillopt-engine.ts
src/skillify/skillopt-meta.ts
src/skillify/skillopt-worker.ts
src/skillify/skills-active.ts
src/skillify/success-judge.ts
tests/claude-code/session-start-hook.test.ts
tests/shared/deficiency-detector.test.ts
tests/shared/session-anchor.test.ts
tests/shared/skill-edit-gate.test.ts
tests/shared/skill-edits.test.ts
tests/shared/skill-invocations.test.ts
tests/shared/skill-proposer.test.ts
tests/shared/skill-publisher.test.ts
tests/shared/skillopt-engine.test.ts
tests/shared/skillopt-meta.test.ts
tests/shared/skills-active.test.ts
tests/shared/success-judge.test.ts

💤 Files with no reviewable changes (2)

tests/shared/skills-active.test.ts
src/hooks/session-start.ts

A machine-specific absolute node_modules symlink got committed in 1929c96 because .gitignore only had `node_modules/` (matches a directory, not the symlink file). Untracked it and tightened the ignore to `node_modules` so the symlink can't be re-added. The symlink stays locally (for builds); it's just no longer in git.

kaghni · 2026-06-05T06:23:18Z

@coderabbitai review

coderabbitai · 2026-06-05T06:23:23Z

✅ Action performed

Review finished.

scoreInvocations built the judge window from windowedTurns without applying the maxChars cap, so a captured turn containing a pasted log/diff could be sent whole to claude -p — slow, costly, or failing. Extracted elide() from windowAroundInvocation and apply it to the window before judging. Test asserts a 5000-char paste is capped under maxChars.

…dex P2) The judge & proposer get untrusted captured transcript text in their prompts, so --disallowed-tools is the prompt-injection guard. The list missed write-capable tools (MultiEdit, NotebookEdit, TodoWrite), leaving an escape hatch in installs that expose them. Enumerated them so a malicious failure example can't act.

The worker only read ~/.claude/skills, so a deficient skill pulled with --to project (<cwd>/.claude/skills) had its invocation detected but no body to edit -> the proposal was silently skipped. The detached worker inherits the SessionStart cwd, so it now falls back to <cwd>/.claude/skills.

kaghni · 2026-06-05T06:32:57Z

@coderabbitai review

coderabbitai · 2026-06-05T06:33:02Z

✅ Action performed

Review finished.

On a machine with the Hivemind hooks installed, each judge/proposer claude -p call was captured as a real session row — polluting the very sessions data the detector scans. Spawn the child with HIVEMIND_CAPTURE=false, as the skillify gate runner does.

"thanks, but this is still failing" matched PUSHBACK but BENIGN's bare "thank" also matched, so detectAnchor returned false and the failure was dropped — bad for a recall-oriented stage. Split into STRONG corrections (always anchor, even amid polite words) and the ambiguous bare "no" (benign-gated). Test covers polite corrections.

coderabbitai

🧹 Nitpick comments (1)

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/shared/deficiency-detector.test.ts`:
- Around line 55-74: The test "caps the judged window at maxChars..." currently
asserts judgedLen < 800 which is too loose; update the tighter threshold to
judgedLen < 500 (or <450) to better reflect maxChars: 300 behavior. Locate the
test in tests/shared/deficiency-detector.test.ts (the it block that calls
detectDeficientSkills and sets judgedLen via the mocked judge) and change the
expectation on judgedLen from expect(judgedLen).toBeLessThan(800) to
expect(judgedLen).toBeLessThan(500) so the assertion more precisely verifies the
capping.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: f702cf06-8dff-4496-abbd-21b4f5227f0e

📥 Commits

Reviewing files that changed from the base of the PR and between a1905fd and a118d0a.

📒 Files selected for processing (7)

src/skillify/claude-model.ts
src/skillify/deficiency-detector.ts
src/skillify/session-anchor.ts
src/skillify/skill-invocations.ts
src/skillify/skillopt-worker.ts
tests/shared/deficiency-detector.test.ts
tests/shared/session-anchor.test.ts

🚧 Files skipped from review as they are similar to previous changes (6)

src/skillify/session-anchor.ts
tests/shared/session-anchor.test.ts
src/skillify/claude-model.ts
src/skillify/skill-invocations.ts
src/skillify/deficiency-detector.ts
src/skillify/skillopt-worker.ts

…ex P2) targetsProtected only checked the target's start offset, so a replace/delete whose target began just before SLOW_UPDATE_START and spanned into the block slipped through and could remove protected longitudinal guidance. Now rejects any target RANGE that overlaps [r0, r1). Test covers a spanning target.

…ths (codex P2) Two findings sharing skill-invocations.ts: - detectAnchor scanned the whole window incl. pre-invocation turns, so a correction from an EARLIER failed attempt (skill used as a repair) was misattributed to this skill. windowedTurns now returns a pivot (first post-invocation turn); detectAnchor takes fromIndex and only fires when both the reaction AND the assistant it reacts to are post-invocation. Test transcripts reordered to real capture order (user → tool_call → assistant → reaction). - splitOrgSkill now rejects refs containing path separators or '..' before name/author are used to build skills/proposals filesystem paths (path-escape guard, matching the pull path's untrusted treatment).

kaghni · 2026-06-05T18:23:42Z

@coderabbitai review

coderabbitai · 2026-06-05T18:23:47Z

✅ Action performed

Review finished.

…x P1) A deny-list only blocks named built-ins — any other tool in the user's env (LS, NotebookRead, configured MCP tools) stayed callable, breaking the pure-text isolation for calls that include untrusted captured transcript text. Switched to `--tools """ (empty allow-list = zero tools available, authoritative over built-ins AND MCP). Verified the flag isolates with a real claude -p call.

…ex P2) Invocations come from ALL projects, but the worker resolved project-scoped skills from its own SessionStart cwd — so a skill pulled --to project in another cwd was skipped, or a same-named skill in the current cwd was edited instead. readSkillBodyViaManifest now resolves the authoritative installRoot from the pull manifest (global ~/.claude/skills as fallback). Test covers the project-root read.

kaghni · 2026-06-05T18:38:15Z

@coderabbitai review

coderabbitai · 2026-06-05T18:38:20Z

✅ Action performed

Review finished.

…alls (codex P2) HIVEMIND_CAPTURE=false stopped capture, but the spawned judge/proposer claude -p still ran our SessionStart hook — injecting the large Deeplake context into the prompt and doing auto-pull/graph work, once per anchored invocation. Set HIVEMIND_WIKI_WORKER=1 (the guard internal runners use) so the hook returns immediately: no prompt contamination, no repeated background work.

… P3) path LIKE '%sid%' could pull rows from an unrelated session (sid as a substring, or SQL LIKE wildcards in the id), feeding the judge a mixed transcript. Escape LIKE wildcards (+ ESCAPE clause) and drop any row whose message.session_id isn't the exact session. Test covers a cross-session collision row.

… P3) bumpVersion('') left the frontmatter empty while reporting newVersion: 2, so a published edit of a no-frontmatter skill carried no version and the gate couldn't tell versions apart. Now emits a minimal '---\nversion: N\n---\n' block. Test added.

kaghni · 2026-06-05T18:54:42Z

@coderabbitai review

coderabbitai · 2026-06-05T18:54:47Z

✅ Action performed

Review finished.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

src/skillify/skillopt-worker.ts (1)

64-64: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Return a non-zero exit code on fatal worker errors.

Line 64 logs a fatal condition and still exits with 0. That makes a crashed SkillOpt run indistinguishable from a successful no-op run, so callers and manual invocations cannot detect that the worker actually failed.

Suggested fix

-main().catch((e) => { log(`fatal (swallowed): ${(e as Error)?.message ?? e}`); process.exit(0); });
+main().catch((e) => {
+  log(`fatal: ${(e as Error)?.message ?? e}`);
+  process.exit(1);
+});

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/skillify/skillopt-worker.ts` at line 64, The current top-level error
handler for main() logs a fatal error but calls process.exit(0); change it to
return a non-zero exit code so failures are detectable: update the catch handler
on main() (the lambda that calls log(...) and process.exit(...)) to call
process.exit(1) (or another non-zero code) and include the error details in the
log message (using the existing (e as Error)?.message ?? e expression) so the
handler for main() correctly signals failure to callers.

src/skillify/skillopt-engine.ts (1)

70-86: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Isolate per-skill failures inside the proposal loop.

On Line 74 through Line 84, any throw from readSkillBody, proposeSkillEdit, writeProposal, or meta.record aborts runSkillOptCycle for the entire batch. Because src/skillify/skillopt-worker.ts Line 64 catches that once at the top, one transient model or filesystem failure skips every remaining deficient skill in the cycle.

Suggested containment

   const proposals: CycleResult["proposals"] = [];
   for (const s of targets) {
-    const body = deps.readSkillBody(s.name, s.author);
-    if (!body) continue; // not installed locally → nothing to edit
-    const priorEdits = deps.meta?.prior(s.name, s.author) ?? [];
-    const p = await proposeSkillEdit(body, s.examples, { ...deps.proposer, priorEdits });
-    // dedup against the meta memory — don't re-write an edit already tried for this skill.
-    const isDup = p.changed && (deps.meta?.has(s.name, s.author, p.edits) ?? false);
-    if (p.changed && !isDup) {
-      deps.writeProposal({
-        name: s.name, author: s.author,
-        invocations: s.invocations, confirmedFailures: s.confirmedFailures, failureRate: s.failureRate,
-        examples: s.examples, edits: p.edits, report: p.report,
-        candidateBody: p.editedBody, createdAt: deps.now,
-      });
-      deps.meta?.record(s.name, s.author, p.edits);
-    }
-    proposals.push({ name: s.name, author: s.author, changed: p.changed && !isDup, failureRate: s.failureRate });
+    try {
+      const body = deps.readSkillBody(s.name, s.author);
+      if (!body) continue; // not installed locally → nothing to edit
+      const priorEdits = deps.meta?.prior(s.name, s.author) ?? [];
+      const p = await proposeSkillEdit(body, s.examples, { ...deps.proposer, priorEdits });
+      const isDup = p.changed && (deps.meta?.has(s.name, s.author, p.edits) ?? false);
+      if (p.changed && !isDup) {
+        deps.writeProposal({
+          name: s.name, author: s.author,
+          invocations: s.invocations, confirmedFailures: s.confirmedFailures, failureRate: s.failureRate,
+          examples: s.examples, edits: p.edits, report: p.report,
+          candidateBody: p.editedBody, createdAt: deps.now,
+        });
+        deps.meta?.record(s.name, s.author, p.edits);
+      }
+      proposals.push({ name: s.name, author: s.author, changed: p.changed && !isDup, failureRate: s.failureRate });
+    } catch {
+      proposals.push({ name: s.name, author: s.author, changed: false, failureRate: s.failureRate });
+      continue;
+    }
   }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/skillify/skillopt-engine.ts` around lines 70 - 86, Wrap each iteration of
the targets loop in a per-skill try/catch so exceptions from readSkillBody,
proposeSkillEdit, writeProposal, or deps.meta.record do not abort the whole
runSkillOptCycle; on error, log the exception with skill context (name, author)
and continue to the next skill, and still push a proposals entry for that skill
with changed set to false (and include its failureRate/other available metadata)
so the caller sees the skill was processed but failed; ensure deps.meta.record
and deps.writeProposal are only invoked on the successful path where p.changed
&& !isDup.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/skillify/skillopt-engine.ts`:
- Around line 70-86: Wrap each iteration of the targets loop in a per-skill
try/catch so exceptions from readSkillBody, proposeSkillEdit, writeProposal, or
deps.meta.record do not abort the whole runSkillOptCycle; on error, log the
exception with skill context (name, author) and continue to the next skill, and
still push a proposals entry for that skill with changed set to false (and
include its failureRate/other available metadata) so the caller sees the skill
was processed but failed; ensure deps.meta.record and deps.writeProposal are
only invoked on the successful path where p.changed && !isDup.

In `@src/skillify/skillopt-worker.ts`:
- Line 64: The current top-level error handler for main() logs a fatal error but
calls process.exit(0); change it to return a non-zero exit code so failures are
detectable: update the catch handler on main() (the lambda that calls log(...)
and process.exit(...)) to call process.exit(1) (or another non-zero code) and
include the error details in the log message (using the existing (e as
Error)?.message ?? e expression) so the handler for main() correctly signals
failure to callers.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 61444a8f-b98a-4d96-9ef9-5f55fa29c5ef

📥 Commits

Reviewing files that changed from the base of the PR and between b934458 and 78d68a8.

📒 Files selected for processing (8)

src/skillify/claude-model.ts
src/skillify/skill-invocations.ts
src/skillify/skill-publisher.ts
src/skillify/skillopt-engine.ts
src/skillify/skillopt-worker.ts
tests/shared/skill-invocations.test.ts
tests/shared/skill-publisher.test.ts
tests/shared/skillopt-engine.test.ts

🚧 Files skipped from review as they are similar to previous changes (5)

tests/shared/skill-publisher.test.ts
tests/shared/skill-invocations.test.ts
tests/shared/skillopt-engine.test.ts
src/skillify/skill-publisher.ts
src/skillify/skill-invocations.ts

A detached hook worker may not have claude on PATH (e.g. ~/.claude/local/claude); spawn("claude") would ENOENT and the callers swallow it as no-change, so the weekly worker silently produced no proposals in those layouts. Resolve the binary the same way the rest of skillify does (findAgentBin('claude_code')).

runWeeklySkillOpt stamped lastRun before the worker checks creds, so firing at a logged-out SessionStart consumed the week — a user who logged in minutes later was skipped as throttled until next week. Added a canFire gate (default: creds present) that returns reason 'no-creds' before any stamp/spawn. Test covers it.

kaghni · 2026-06-05T19:12:27Z

@coderabbitai review

coderabbitai · 2026-06-05T19:12:32Z

✅ Action performed

Review finished.

… (codex P2) Two follow-ons from round 6: - claude-model: --tools "" denies tool USE but claude -p still LOADS the user's MCP config; a broken/oversized MCP schema fails every judge/proposer call before JSON. Added --strict-mcp-config to ignore user MCP entirely. - trigger: the canFire gate used the credentials FILE, but the worker uses loadConfig() (which also accepts HIVEMIND_TOKEN/ORG_ID env creds). Mismatch skipped env-cred users forever / stamped on a token-only malformed file. Now both use loadConfig()?.token.

kaghni · 2026-06-05T19:25:06Z

@coderabbitai review

coderabbitai · 2026-06-05T19:25:12Z

✅ Action performed

Review finished.

The weekly worker's lookback + detector thresholds are now overridable via env — HIVEMIND_SKILLOPT_{LOOKBACK_DAYS,MIN_INVOCATIONS,FAILURE_RATE,FIRE_THRESHOLD} — with the same defaults (30d, detector min-n, 0.4 rate, >=5 fire). A positive override wins; non-numeric/<=0 falls back. Useful for tuning + smaller-data testing.

efenocchi · 2026-06-05T21:49:29Z

@CodeRabbit how do we label a skill? how do we decide if it's good or not?

coderabbitai · 2026-06-05T21:50:09Z

Tip

For best results, initiate chat on the files or code changes.

@efenocchi Great question! This PR deliberately avoids manual labels — instead it uses a two-stage proxy signal:

How a skill's quality is evaluated

Stage 1 — Anchor detection (cheap heuristic)

Every time a Skill tool is invoked, the surrounding transcript turns are windowed. detectAnchor (src/skillify/session-anchor.ts) then scans the turns after the invocation, looking for a USER turn that immediately follows an ASSISTANT turn and matches either:

A strong correction regex (e.g. "that's wrong", "no, try again", "please redo…")
An ambiguous negative ("no", "nope") — only if it does not also match a benign negation ("no problem", "works now", "thanks")

If such a pattern is found → the invocation is anchored (flagged as a candidate failure). No judge runs for un-anchored invocations, so the expensive model call is reserved only for suspicious windows.

Stage 2 — Success judge (LLM-as-a-reward-model)

For each anchored invocation window, judgeSuccess (src/skillify/success-judge.ts) calls a Claude Haiku model with a strict JSON-only prompt asking it to output { success: 0|1, confidence, reason }. Conservatively, if the output can't be parsed, it defaults to success=1 (skill worked) to avoid false positives.

Aggregation & deficiency gate

detectDeficientSkills (src/skillify/deficiency-detector.ts) groups results by (name, author) and computes:

Metric	Meaning
`invocations`	Total times this skill was called
`anchored`	How many invocations had a user pushback anchor
`confirmed`	Of those anchored, how many the judge scored `success=0`
`failureRate`	`confirmed / invocations`
`deficient`	`true` only when `invocations ≥ minInvocations` AND `failureRate ≥ failureRateThreshold`

Even if a skill is deficient, no proposals fire unless ≥5 deficient events are accumulated (the fireThreshold gate in the engine).

Post-publish gate

After an edit is published, gateEdit (src/skillify/skill-edit-gate.ts) compares the before/after failure rates:

keep — failure rate dropped by at least margin
revert — failure rate rose by at least margin
inconclusive — not enough post-publish data yet

In short: user pushback patterns = cheap first filter → LLM judge = confirmation → failure rate thresholds = deficiency flag. No human labels are ever required.

efenocchi · 2026-06-05T22:51:35Z

@CodeRabbit do we keep the tool calls? if so, does it make sense to remove them to remove the noise? we do that for summaries

coderabbitai · 2026-06-05T22:52:23Z

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
fd 'skill-invocations.ts' --exec cat {}

Length of output: 8220

@efenocchi Tool calls are already stripped from the judge windows — the current implementation only reconstructs user_message and assistant_message turns. Here's how it works in sessionTurns():

// The Skill invocation tool_call itself is detected and skipped (only used to mark position):
const ref = invokedSkillRef(j);
if (ref) {
  // ... mark invIndex, then:
  continue;
}

// All other message types are handled here:
const text = typeof j.content === "string" ? j.content.trim() : "";
if (!text) continue;
if (j.type === "user_message") turns.push({ role: "USER", text });
else if (j.type === "assistant_message") turns.push({ role: "ASSISTANT", text });
// ← any other type (tool_call, tool_result, system, etc.) falls through — never pushed

So tool_call and tool_result rows are silently dropped. The window the judge sees is pure USER/ASSISTANT conversational text only — already consistent with what you do for summaries.

One edge case worth noting: if assistant_message records store their .content as a structured JSON object (e.g. an array of content blocks) rather than a plain string, the typeof j.content === "string" guard will produce an empty string and that turn gets dropped entirely by the if (!text) continue check. That could cause a tool-heavy assistant turn to go invisible to the judge — potentially a minor blind spot, but it also means no tool noise leaks through.

efenocchi · 2026-06-05T22:53:26Z

@kaghni can you increase the coverage?

kaghni added 13 commits June 5, 2026 03:50

coderabbitai Bot requested review from efenocchi and khustup2 June 5, 2026 06:10

coderabbitai Bot reviewed Jun 5, 2026

View reviewed changes

Comment thread node_modules Outdated

kaghni added 3 commits June 5, 2026 06:32

kaghni added 2 commits June 5, 2026 06:43

coderabbitai Bot reviewed Jun 5, 2026

View reviewed changes

kaghni added 2 commits June 5, 2026 18:23

kaghni added 2 commits June 5, 2026 18:37

kaghni added 3 commits June 5, 2026 18:54

coderabbitai Bot reviewed Jun 5, 2026

View reviewed changes

kaghni added 2 commits June 5, 2026 19:12

kaghni merged commit c5e9a9d into feat/skillopt-foundation Jun 5, 2026
7 checks passed

coderabbitai Bot mentioned this pull request Jun 6, 2026

SkillOpt: Deeplake-centric direct-publish + multi-agent scorer + targeted event trigger #240

Merged

Conversation

kaghni commented Jun 5, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What it does

Key decisions

Proven E2E on real data

Paper fidelity

Tests

Deferred (honest)

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Reviews paused

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

kaghni commented Jun 5, 2026

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Coverage Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kaghni commented Jun 5, 2026

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kaghni commented Jun 5, 2026

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

kaghni commented Jun 5, 2026

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kaghni commented Jun 5, 2026

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kaghni commented Jun 5, 2026

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

kaghni commented Jun 5, 2026

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kaghni commented Jun 5, 2026

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

efenocchi commented Jun 5, 2026

Uh oh!

coderabbitai Bot commented Jun 5, 2026

kaghni commented Jun 5, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

github-actions Bot commented Jun 5, 2026 •

edited

Loading

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading