Skip to content

feat(skillopt): deficiency-detection + edit-proposer engine (stacked on #230)#237

Merged
kaghni merged 30 commits into
feat/skillopt-foundationfrom
feat/skillopt-engine
Jun 5, 2026
Merged

feat(skillopt): deficiency-detection + edit-proposer engine (stacked on #230)#237
kaghni merged 30 commits into
feat/skillopt-foundationfrom
feat/skillopt-engine

Conversation

@kaghni
Copy link
Copy Markdown
Collaborator

@kaghni kaghni commented Jun 5, 2026

Stacked on #230 (foundation). Builds the actual SkillOpt engine: detect deficient skills from real usage, propose bounded edits, gate them — the loop the foundation's worker stub was a placeholder for.

What it does

```
Skill invocations (real Skill tool_use) → window → anchor (free heuristic)
→ success-judge (only if anchored) → ≥5 deficient gate
→ propose bounded structured edits (meta-aware) → write REVIEW PROPOSALS
→ [apply] → live publish (version+backup) → longitudinal gate → keep/revert
```

Key decisions

  • Drops skills_active (the foundation's availability attribution). Actual invocation (Skill tool_use, already captured by capture.ts) is strictly better — it proves the skill was used and pins the turn for tight windowing.
  • No auto-publish. The worker writes review proposals; live overwrite waits for the gate to validate an edit in production (the offline gate is provably untrustworthy). Nothing ever touches the model channel (PR First-run + resume session briefs, kept to the user channel #223 rule).
  • Reward = success-judge + free heuristic anchor (no labels). The anchor (user pushback) is a free filter so the judge runs only on a fraction → token-cheap.

Proven E2E on real data

Fabricated synthetic sessions in a test org → the real engine ran against real Deeplake + real `claude -p`: 6 deficient skills detected → ≥5 gate fired → 6 genuinely-good candidate edits (correctly diagnosed "mock-the-client" / "no-verify" from confirmed failures and rewrote the skill). On the org's real sessions: 40 invocations → 0 anchored → $0 (free filter working).

Paper fidelity

All SkillOpt components mapped: trainable doc ✅, rollout=real sessions ✅, reward=judge+anchor ✅, reflect→structured edits ✅, edit budget ✅, slow-update region ✅, meta-skill ✅, publish ✅, gate ✅ (longitudinal; randomized A/B = follow-up, needs version-at-invocation capture).

Tests

98 unit tests across 12 files — token discipline, precision gate, dedup, slow-update protection, conservative judge, gate decisions. All injected (no live Deeplake/LLM in tests).

Deferred (honest)

  • Randomized A/B (needs a small capture change: record skill version at invocation).
  • Apply-proposal + monitor command (where the gate plugs in).
  • Anchor pattern tuning against real pushback.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Automated weekly skill‑optimization cycle that detects deficient skills, generates bounded review proposals, deduplicates candidates, and produces safe publish artifacts with backups and version bumps.
    • New end‑user tooling: invocation & anchor detection, success judging, edit proposal/apply with protected regions, model‑driven proposer, and local proposal/meta management.
  • Bug Fixes
    • Reduced session‑start query footprint and clearer logging when capture/DDL is skipped.
  • Tests
    • Extensive Vitest coverage added across detection, anchoring, judging, edits, proposing, publishing, engine, meta, and worker behavior.
  • Chores
    • .gitignore tweak to ignore node_modules.

kaghni added 13 commits June 5, 2026 03:50
Read side of the attribution the foundation writes. listSkillSessions reads the
skills_active rows (filtered on description='skills_active', deduped per session)
into per-session {skills, bucket}; cohortsForSkill partitions a skill's
treatment/control; reconstructSession rebuilds the raw transcript from the
session's captured rows. All queries injected (unit-testable); nothing touches
the model channel. First piece of the deficiency detector. 10 unit tests.
Read side keyed on actual invocations: capture.ts persists each Skill tool_use
as a tool_call row (tool_name=Skill, tool_input={skill}). listSkillInvocations
pulls org-skill (<name>--<author>) invocations; windowAroundInvocation slices the
transcript before/after the invocation turn for tight, cheap judging. Supersedes
availability-based attribution. Injected queries; 8 unit tests.
Skill *invocation* (Skill tool_use, already captured) is a more accurate signal
than availability and pins the turn for windowing, so the SessionStart
skills_active write + helper + the availability-based cohorts module are removed.
The weekly trigger/worker stay. session-start-hook tests revert to the
pre-attribution query sequence; trigger tests unchanged.
Pure, free pushback detector: a USER turn right after an ASSISTANT turn matching
correction patterns = a hard, observable failure signal, independent of any LLM.
Tuned for recall (false positives just cost a judge call; false negatives only
under-detect — never churns a good skill). Exposes windowedTurns() from
skill-invocations so the anchor reads the same windowed slice the judge will.
6 anchor tests + patterns to tune on real data.
LLM judge of the windowed slice: was the task accomplished CORRECTLY? Prompted to
ignore user mood (a praised-but-wrong answer is a FAILURE) — the anti-sycophancy
axis the research validated. Runs on the user's claude -p (cheap default model,
all tools denied), injected for tests. Conservative: unparseable/errored/empty
judgments return success=1 so a flaky judge can only under-detect, never
manufacture deficiency. 7 unit tests.
Orchestrates the engine's detection step: group org-skill invocations, window each,
run the free anchor, and judge ONLY anchored windows. A confirmed failure needs
anchor AND judge agreement (precision). A skill is deficient at >=minInvocations
AND >=failureRateThreshold confirmed-failure rate. Returns per-skill stats +
deficientCount (the caller's >=5 fire gate). All injected; 2 tests assert the
min-n gate and that the judge runs only on anchored windows.
Port of SkillOpt's edit-application: append/insert_after/replace/delete ops, the
edit budget (textual learning rate), and the protected
<!-- SLOW_UPDATE_START/END --> region (the paper's slow-update) that fast edits
must not touch. Pure + deterministic; 7 tests cover each op, the budget, and the
protection.
…e backend

The paper's backward pass: from a deficient skill's body + its confirmed failures,
the optimizer diagnoses the single recurring weakness and emits bounded structured
edits (slow-update region off-limits), applied locally to a candidate body —
nothing published. Extracts the shared claude -p backend (claude-model.ts) used by
both judge and proposer; success-judge refactored onto it. Injected model; tolerant
edit parse; 5 proposer tests.
Writes an accepted edit to the live SKILL.md via the native skills dir (never the
model channel — PR #223), bumps the frontmatter version (v1-vs-v2), and backs up
the prior version for one-command revert. Mechanism only: the worker won't call it
on an unvalidated edit (offline gate isn't trustworthy) — reserved for the
real-usage A/B gate. 6 fs tests.
runSkillOptCycle ties it together: detect deficient skills, fire only at
>=fireThreshold (default 5 — act on a pattern, not noise), then for each: read the
body, propose a bounded edit, write a REVIEW PROPOSAL (candidate.md + proposal.json
with full evidence). Never auto-overwrites a live skill — live publish is gated on
the deferred A/B. All I/O injected; 4 tests cover the gate, proposals, skip-when-
not-installed, and a custom threshold.
Replaces the stub: the detached weekly worker now loads config, builds the query,
and runs runSkillOptCycle (detect -> >=5 gate -> propose -> write review proposals)
over a 30-day invocation lookback. Verified end-to-end against a real org: 40 Skill
invocations scanned, 0 anchored -> 0 judge calls -> 0 deficient -> exits at the gate
(~14s, $0). No auto-publish; proposals land under <stateDir>/skillopt/proposals.
The paper's meta-skill: an append-only JSONL recording every proposed edit per
skill (with an order-independent fingerprint). The cycle feeds prior-tried edits to
the proposer as context and dedups — never re-writing an edit already proposed for a
skill, so the optimizer doesn't churn. Status field (proposed→applied/reverted) is
ready to close the loop once the A/B gate records outcomes. Wired into the worker;
meta + dedup tests.
The validation organ. A randomized A/B needs the skill version at invocation time
(a capture change we lack), so the feasible gate is longitudinal: after an edit
publishes, compare the skill's confirmed-failure rate after vs before — a real drop
keeps it, a real rise reverts (from the SKILL.v<old> backup), unclear/too-few =
inconclusive. Observational (needs margin + min sample); randomized A/B is the
upgrade once invocation-version capture lands. Extracts scoreInvocations (shared by
detector + gate); adds untilIso to listSkillInvocations. 7 tests.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 5, 2026

Need an answer fast? Review this PR in Change Stack to ask focused questions about the PR or a changed range.

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: cd440db5-bf3b-47df-b1ab-00c220937524

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds an end-to-end SkillOpt pipeline: invocation extraction, anchor-based judging, deficiency scoring, LLM-driven edit proposals, edit application/publishing, meta deduplication, a one-shot worker run, and removal of session-start skill-attribution writes.

Changes

SkillOpt Core Implementation

Layer / File(s) Summary
Session failure detection
src/skillify/session-anchor.ts, src/skillify/claude-model.ts, src/skillify/success-judge.ts, tests/shared/session-anchor.test.ts, tests/shared/success-judge.test.ts
Anchoring detects explicit user corrections by pattern-matching adjacent turns; claudeModel spawns the Claude CLI with timeouts and tool denial; judgeSuccess evaluates transcript windows with conservative fallback on errors.
Skill invocation extraction
src/skillify/skill-invocations.ts, tests/shared/skill-invocations.test.ts
Extracts Skill tool-call invocations from session messages, parses org-scoped name--author refs, reconstructs user/assistant turn pairs, and returns windowed transcript context with elision.
Deficiency detection
src/skillify/deficiency-detector.ts, tests/shared/deficiency-detector.test.ts
Groups invocations by skill, windows transcripts, applies anchor+judge filtering, counts confirmed failures, and marks skills deficient when failure rate exceeds threshold and minimum invocation count is met.
Skill edit primitives
src/skillify/skill-edits.ts, tests/shared/skill-edits.test.ts
Defines markdown edit operations (append, insert_after, replace, delete) with protected-region support and edit budgets; applies edits with per-operation reporting and skips operations targeting protected blocks.
Skill proposal generation
src/skillify/skill-proposer.ts, tests/shared/skill-proposer.test.ts
Constructs prompts for Claude with skill body, failure examples, and prior edits; parses JSON edit arrays from model output with tolerance for surrounding prose; applies edits under budget and returns change flag.
Skill file versioning & publishing
src/skillify/skill-publisher.ts, tests/shared/skill-publisher.test.ts
Splits frontmatter from markdown body, bumps version numbers with defaults, writes backups named SKILL.v<old>.bak.md, and overwrites SKILL.md with bumped frontmatter plus edited body.
Edit outcome validation (before/after gating)
src/skillify/skill-edit-gate.ts, tests/shared/skill-edit-gate.test.ts
Measures skill failure rates in before/after windows around publish time, computes deltas, and decides keep, revert, or inconclusive based on configurable margin and minimum-sample thresholds.
Cycle orchestration & meta-state management
src/skillify/skillopt-engine.ts, src/skillify/skillopt-meta.ts, tests/shared/skillopt-engine.test.ts, tests/shared/skillopt-meta.test.ts
Orchestrates the full weekly cycle: detects deficient skills, reads local bodies, proposes edits, deduplicates against meta history using fingerprints, writes proposals to disk, and tracks status for cross-run deduplication.
Weekly worker activation & execution
src/skillify/skillopt-worker.ts
Activates the SkillOpt worker to run one full cycle: loads config, constructs query adapter, initializes filesystem roots, loads meta cache, calls runSkillOptCycle with fire-gate threshold, and logs deficiency/proposal counts.
Session-start hook cleanup
src/hooks/session-start.ts, tests/claude-code/session-start-hook.test.ts
Removes deprecated HIVEMIND_SKILL_ATTRIBUTION skill attribution capture logic from session startup; retains autoPullSkills(), placeholder setup, and context rendering.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • efenocchi

Poem

🐰 I hop through logs and clipped transcripts bright,
I seek the turns where users say "not right",
Claude whispers edits in guarded JSON rows,
Fingerprints stop echoes from repeat woes,
I tuck the bump, back up, and watch skills grow.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 54.35% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main feature: implementing the SkillOpt deficiency-detection and edit-proposer engine, with clear reference to the stacked dependency (#230).
Description check ✅ Passed The description is comprehensive and complete: it explains the PR's purpose, key decisions, architecture, proven testing, paper fidelity, deferred items, and includes a detailed commit summary. The author demonstrates thorough understanding of the changes.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/skillopt-engine

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@kaghni
Copy link
Copy Markdown
Collaborator Author

kaghni commented Jun 5, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 5, 2026

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot requested review from efenocchi and khustup2 June 5, 2026 06:10
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 5, 2026

Coverage Report

Scope: files changed in this PR. Enforced threshold: 90% per metric (per file via vitest.config.ts).

Status Category Percentage Covered / Total
🔴 Lines 89.70% (🎯 90%) 418 / 466
🔴 Statements 84.73% (🎯 90%) 505 / 596
🔴 Functions 78.26% (🎯 90%) 72 / 92
🔴 Branches 83.82% (🎯 90%) 347 / 414
File Coverage — 15 files changed
File Stmts Branches Functions Lines
src/hooks/session-start.ts 🟢 100.0% 🟢 94.8% 🟢 100.0% 🟢 100.0%
src/skillify/claude-model.ts 🔴 0.0% 🔴 0.0% 🔴 0.0% 🔴 0.0%
src/skillify/deficiency-detector.ts 🟢 100.0% 🔴 87.5% 🟢 100.0% 🟢 100.0%
src/skillify/session-anchor.ts 🟢 92.8% 🟢 91.7% 🟢 100.0% 🟢 100.0%
src/skillify/skill-edit-gate.ts 🟢 100.0% 🔴 89.5% 🟢 100.0% 🟢 100.0%
src/skillify/skill-edits.ts 🔴 81.3% 🔴 72.2% 🟢 100.0% 🟢 98.2%
src/skillify/skill-invocations.ts 🟢 93.3% 🔴 86.2% 🟢 100.0% 🟢 98.4%
src/skillify/skill-proposer.ts 🟢 94.9% 🔴 89.7% 🟢 100.0% 🟢 100.0%
src/skillify/skill-publisher.ts 🟢 100.0% 🟢 100.0% 🟢 100.0% 🟢 100.0%
src/skillify/skillopt-engine.ts 🔴 87.2% 🟢 95.7% 🔴 85.7% 🔴 84.8%
src/skillify/skillopt-meta.ts 🟢 100.0% 🔴 88.2% 🟢 100.0% 🟢 100.0%
src/skillify/skillopt-trigger.ts 🟢 96.2% 🟢 95.0% 🟢 100.0% 🟢 97.5%
src/skillify/skillopt-worker.ts 🔴 0.0% 🔴 0.0% 🔴 0.0% 🔴 0.0%
src/skillify/skills-active.ts
src/skillify/success-judge.ts 🟢 96.2% 🔴 83.3% 🟢 100.0% 🟢 100.0%

Generated for commit 9a1769a.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (11)
tests/shared/session-anchor.test.ts (1)

8-41: ⚡ Quick win

Consider adding a test for evidence truncation.

Line 39 in the implementation truncates evidence to 200 chars, but there's no test verifying this behavior. This is an edge case worth validating.

🧪 Suggested test addition
it("truncates evidence to 200 chars", () => {
  const longPushback = "no that's wrong " + "x".repeat(300);
  const r = detectAnchor([a("done"), u(longPushback)]);
  expect(r.anchored).toBe(true);
  expect(r.evidence.length).toBe(200);
});
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/shared/session-anchor.test.ts` around lines 8 - 41, Add a unit test to
verify detectAnchor truncates evidence to 200 characters: create a test that
calls detectAnchor with an assistant turn followed by a very long user pushback
(e.g., "no that's wrong " + "x".repeat(300)), assert anchored is true and that
r.evidence.length === 200; place this alongside the other detectAnchor tests and
reference the detectAnchor helper and the u/a message builders to construct the
turns.
src/skillify/session-anchor.ts (2)

27-27: 💤 Low value

BENIGN pattern may over-match on "thank".

The pattern ends with thank which will match "thanksgiving", "thankful", etc. Consider adding word boundary or requiring "thanks" specifically.

🔧 Suggested refinement
-const BENIGN = /\b(no (problem|worries|need|biggie)|no,? thanks|all good|works? (now|great|fine|perfectly)|that works|perfect|looks good|thank)/i;
+const BENIGN = /\b(no (problem|worries|need|biggie)|no,? thanks|all good|works? (now|great|fine|perfectly)|that works|perfect|looks good|thanks?)\b/i;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/skillify/session-anchor.ts` at line 27, The BENIGN regex constant
over-matches because the trailing "thank" will match words like "thanksgiving"
or "thankful"; update the BENIGN pattern in session-anchor.ts (symbol: BENIGN)
to require a word boundary or explicit pluralization (e.g., use \bthank(s)?\b or
otherwise add \b around "thank") so only standalone "thank" or "thanks" are
matched; keep the existing alternation structure and flags intact.

23-23: ⚡ Quick win

Consider expanding PUSHBACK patterns for better recall.

The regex is tuned for recall but could catch a few more common correction patterns:

  • "that fail" should likely be "that failed" (past tense)
  • Missing common patterns: "fix", "error", "bug", "doesn't help", "failed"

Since false positives only cost one judge call and the goal is high recall, consider adding these patterns.

📝 Suggested regex enhancement
-const PUSHBACK = /\b(no|nope|wrong|incorrect|not what|that'?s not|does ?n'?t work|did ?n'?t work|do ?n'?t work|wo ?n'?t work|is ?n'?t|that'?s wrong|broke|broken|still (failing|broken|not working|wrong|the same)|try again|undo|revert that|that fail)/i;
+const PUSHBACK = /\b(no|nope|wrong|incorrect|not what|that'?s not|does ?n'?t work|did ?n'?t work|do ?n'?t work|wo ?n'?t work|is ?n'?t|that'?s wrong|broke|broken|failed|still (failing|broken|not working|wrong|the same)|try again|undo|revert that|fix (that|this|it)|error|bug)/i;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/skillify/session-anchor.ts` at line 23, Update the PUSHBACK regex (the
PUSHBACK constant) to broaden recall: change the "that fail" token to "that
failed" and add common correction/error tokens such as "failed", "fix", "fix
it", "error", "bug", and the negation "doesn't help" (plus apostrophe-free
variants like "doesnt help" and "doesnt") so the pattern matches those user
correction phrases as well; keep the existing alternatives and case-insensitive
flag intact.
tests/shared/skill-invocations.test.ts (1)

64-103: ⚡ Quick win

Consider adding test for invocation at session boundaries.

The tests cover invocation in the middle and invocation not found, but don't test when the invocation is at the very start (before=3 but only 0 turns before) or at the very end (after=6 but only 1 turn after). The Math.max(0, ...) on line 152 of the implementation handles this, but explicit test coverage would validate the boundary behavior.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/shared/skill-invocations.test.ts` around lines 64 - 103, Add explicit
tests for invocation-at-session-boundaries in the skill-invocations.test.ts
suite: create one test where the Skill tool_call occurs at the very start
(invocation index 0) and call windowAroundInvocation with before > available
turns (e.g., before: 3) and assert the result only contains the available
post-invocation turns; create another test where the Skill tool_call is the
final turn and call with after > available turns (e.g., after: 6) and assert the
result only contains the available pre-invocation turns. Use the existing
helpers (mockQuery, TABLE, windowAroundInvocation) and assertions matching the
style of the other tests to validate the Math.max(0, ...) boundary behavior.
tests/shared/skill-edit-gate.test.ts (1)

45-51: 💤 Low value

Query mock could validate both time bounds more precisely.

The mock distinguishes before/after windows by checking sql.includes(\< '${PUB}'`), but doesn't verify the lower bound (>=). While this works for the test, it doesn't fully validate that gateEdit` constructs the correct SQL time bounds.

Consider adding assertions that verify both >= beforeSince (for before window) and >= publishIso (for after window) are present in the SQL.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/shared/skill-edit-gate.test.ts` around lines 45 - 51, The query mock
(vi.fn assigned to query) must validate both time bounds instead of only
checking for `< '${PUB}'`; update the mock to detect the before-window SQL by
asserting it contains the lower bound token (>= beforeSince) and the upper bound
`< '${PUB}'` and return beforeInvs, and detect the after-window SQL by asserting
it contains `>= '${PUB}'` (publishIso) and the corresponding upper bound and
return afterInvs; reference the variables and symbols used in the test (query,
PUB, beforeSince, publishIso, beforeInvs, afterInvs, gateEdit) so the mock fails
the test if gateEdit constructs incorrect time predicates.
src/skillify/skill-invocations.ts (2)

155-168: 💤 Low value

Elision math may produce odd results for very small maxChars.

If maxChars is very small (e.g., < 100), the head/tail split could produce a degenerate result where the elision message itself is longer than the actual content shown.

Consider adding a minimum threshold or simpler truncation for small limits.

🛡️ Suggested guard
 export async function windowAroundInvocation(
   query: QueryFn,
   sessionsTable: string,
   inv: SkillInvocation,
   opts: { before?: number; after?: number; maxChars?: number } = {},
 ): Promise<string> {
   const maxChars = opts.maxChars ?? 4000;
   const slice = await windowedTurns(query, sessionsTable, inv, opts);
   const joined = slice.map((t) => `${t.role}: ${t.text}`).join("\n\n");
   if (joined.length <= maxChars) return joined;
+  if (maxChars < 100) return joined.slice(0, maxChars) + "…"; // simple truncation for tiny limits
   const head = joined.slice(0, Math.floor(maxChars * 0.55));
   const tail = joined.slice(joined.length - Math.floor(maxChars * 0.45));
   return `${head}\n\n…[${joined.length - maxChars} chars elided]…\n\n${tail}`;
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/skillify/skill-invocations.ts` around lines 155 - 168,
windowAroundInvocation can produce a degenerate output when maxChars is very
small because the elision notice can end up longer than the visible head/tail;
update windowAroundInvocation to handle small limits by (a) computing the
elision message first (e.g., elideMsg = `\n\n…[${joined.length - maxChars} chars
elided]…\n\n`) and ensuring head+tail+elideMsg.length <= maxChars, or (b) for
very small maxChars (choose a threshold like 100–200) fall back to a simpler
truncation strategy such as returning joined.slice(0, maxChars -
elideMsg.length) + elideMsg; adjust the head/tail length calculations
accordingly so head = Math.max(0, floor((maxChars - elideMsg.length) * 0.55))
and tail uses the remainder, referencing windowAroundInvocation, maxChars,
joined, head, tail and the elision message in your changes.

73-79: ⚡ Quick win

Validate sessionsTable as an SQL identifier; elision bounds are mainly OK

  • sessionsTable interpolation in listSkillInvocations is fed from loadConfig().sessionsTableName (env HIVEMIND_SESSIONS_TABLE / default) and flows through server/worker context; there’s no path from request/user input into this value, so the practical SQL-injection risk from users is low. Still, because it’s an identifier inserted into SQL, add a whitelist/validation (or identifier-escaping helper) to protect against misconfiguration/poisoned env.
  • The elision slicing in windowAroundInvocation won’t overlap head/tail (it’s based on the ends of joined, and this branch only runs when joined.length > maxChars). For very small maxChars, head/tail can become empty and the final string can exceed maxChars due to the inserted elision marker; clamp maxChars if it can be user-controlled.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/skillify/skill-invocations.ts` around lines 73 - 79, Validate or
safely-escape the sessionsTable identifier before interpolating it in
listSkillInvocations: ensure loadConfig().sessionsTableName is checked against a
strict whitelist or regex for valid SQL identifiers (e.g.,
/^[A-Za-z_][A-Za-z0-9_]*$/) and reject or fallback if it fails, so the query
using sessionsTable cannot be malformed; also review windowAroundInvocation to
guard against tiny maxChars by clamping maxChars to a sensible minimum or
accounting for the elision marker length when computing head/tail from joined so
the resulting string never exceeds maxChars and head/tail do not produce empty
slices unexpectedly.
src/skillify/skillopt-worker.ts (1)

37-49: 💤 Low value

Consider using consistent timestamp across cycle.

Line 45's record callback creates a new Date().toISOString() timestamp for each meta entry, while line 48 passes a separate now value to the cycle. This causes MetaEntry.proposedAt and ProposalRecord.createdAt to have slightly different timestamps (potentially seconds apart) for edits in the same cycle run.

🔧 Suggested fix for timestamp consistency
+  const now = new Date().toISOString();
   const res = await runSkillOptCycle({
     query,
     sessionsTable: config.sessionsTableName,
     readSkillBody: (name, author) => readSkillBodyFromDisk(skillsRoot, name, author),
     writeProposal: (rec) => writeProposalToDisk(proposalsRoot, rec),
     meta: {
       prior: (n, a) => priorEditSummaries(metaCache, n, a),
       has: (n, a, edits) => alreadyProposed(metaCache, n, a, edits),
-      record: (n, a, edits) => { const e = metaEntryFor(n, a, edits, new Date().toISOString()); appendMeta(metaFile, e); metaCache.push(e); },
+      record: (n, a, edits) => { const e = metaEntryFor(n, a, edits, now); appendMeta(metaFile, e); metaCache.push(e); },
     },
     detector: { sinceIso, limit: 5000 },
-    now: new Date().toISOString(),
+    now,
   });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/skillify/skillopt-worker.ts` around lines 37 - 49, Compute a single
timestamp and reuse it for the whole cycle so MetaEntry.proposedAt and
ProposalRecord.createdAt match: before calling runSkillOptCycle, create a const
(e.g. nowIso = new Date().toISOString()) and pass that as the now argument to
runSkillOptCycle; inside the meta.record callback (which currently calls
metaEntryFor(n, a, edits, new Date().toISOString())), replace the inline new
Date() call with the shared nowIso so metaEntryFor and the cycle use the
identical timestamp.
src/skillify/skill-edits.ts (2)

32-33: ⚡ Quick win

Consider target ambiguity when multiple occurrences exist.

The indexOf method finds only the first occurrence of the target string. If a SKILL.md contains duplicate text patterns and an edit intends to target a later occurrence, this will modify the wrong location. While the LLM proposer should choose sufficiently specific anchors, this limitation could cause incorrect edits in edge cases.

💡 Consider adding a comment documenting this limitation
 function targetsProtected(skill: string, target: string): boolean {
   const r = protectedRange(skill);
   if (!r || !target) return false;
+  // Note: indexOf finds the FIRST occurrence only. Proposer must choose unambiguous anchors.
   const idx = skill.indexOf(target);
   return idx !== -1 && idx >= r[0] && idx < r[1];
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/skillify/skill-edits.ts` around lines 32 - 33, The code uses
skill.indexOf(target) which only returns the first occurrence and can pick the
wrong match when the same target appears multiple times; update the logic in the
function containing variables skill, target, idx, and r so you search for an
occurrence that falls inside the range r (e.g., use indexOf(target, r[0]) and if
that result >= r[1] or -1, loop with successive indexOf calls until you find an
idx such that idx >= r[0] && idx < r[1] or conclude no match), and add a short
inline comment noting the remaining ambiguity if multiple identical anchors
exist and suggesting making anchors more specific.

60-66: 💤 Low value

Ensure proper newline spacing before protected region.

When appending with a protected region present, the code inserts content + "\n\n" before the marker but doesn't ensure proper spacing before the content itself. If the document text ends without a newline before the protected marker, the appended content will run into it.

🔧 Suggested defensive spacing
       case "append": {
         const content = (e.content ?? "").trim();
         if (!content) { report.push("SKIP append: empty content"); break; }
         const r = protectedRange(s);
-        if (r) s = s.slice(0, r[0]) + content + "\n\n" + s.slice(r[0]);
+        if (r) {
+          const before = s.slice(0, r[0]).replace(/\s*$/, "");
+          s = before + "\n\n" + content + "\n\n" + s.slice(r[0]);
+        }
         else s = s.replace(/\s*$/, "") + "\n\n" + content + "\n";
         ok(`append (+${content.length} chars)`);
         break;
       }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/skillify/skill-edits.ts` around lines 60 - 66, In the "append" case when
a protectedRange(s) exists, ensure you normalize spacing by trimming trailing
whitespace from the prefix before the protected marker and inserting the same
blank-line padding used in the non-protected branch: compute prefix = s.slice(0,
r[0]).replace(/\s*$/, "") then set s = prefix + "\n\n" + content + "\n\n" +
s.slice(r[0]); update the code inside case "append" (the protectedRange(s)
branch) to use this logic so appended content never runs into the protected
marker.
src/skillify/skill-publisher.ts (1)

29-38: ⚡ Quick win

Add defensive handling for missing frontmatter.

If splitFrontmatter returns empty frontmatter (file has no frontmatter block), bumpVersion returns empty frontmatter instead of creating a valid frontmatter block with version. The published file will then lack version tracking entirely.

🛡️ Suggested defensive handling
 export function bumpVersion(frontmatter: string): { frontmatter: string; oldVersion: number; newVersion: number } {
+  // If frontmatter is empty, create a minimal valid block
+  if (!frontmatter.trim()) {
+    return { frontmatter: "---\nversion: 2\n---\n", oldVersion: 1, newVersion: 2 };
+  }
   const m = frontmatter.match(/^version:\s*(\d+)\s*$/m);
   const oldVersion = m ? parseInt(m[1], 10) : 1;
   const newVersion = oldVersion + 1;
   const next = m
     ? frontmatter.replace(/^version:\s*\d+\s*$/m, `version: ${newVersion}`)
     : frontmatter.replace(/\n---\n$/, `\nversion: ${newVersion}\n---\n`);
   return { frontmatter: next, oldVersion, newVersion };
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/skillify/skill-publisher.ts` around lines 29 - 38, bumpVersion doesn't
handle an entirely missing/empty frontmatter string: when splitFrontmatter
returns '' you should create a proper frontmatter block instead of returning an
empty string. Update the bumpVersion function to detect empty or whitespace-only
frontmatter, treat oldVersion as 1 and newVersion as 2, and return a frontmatter
string like a full frontmatter block containing the version (e.g. '---\nversion:
2\n---\n'); keep existing behavior for when a version line exists (use the
existing regex replace) and when frontmatter exists but lacks a version line
(insert version before the closing '---'). Ensure you reference bumpVersion and
its returned oldVersion/newVersion values so callers remain consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@node_modules`:
- Line 1: Remove the committed machine-specific node_modules symlink from the
repository and stop tracking it: unstage/untrack the tracked node_modules
symlink so it is removed from the repo history going forward, add node_modules/
to .gitignore to prevent re-adding it, and commit the removal and updated
.gitignore. Ensure the repository no longer contains the symbolic link named
"node_modules" and that ".gitignore" includes "node_modules/".

---

Nitpick comments:
In `@src/skillify/session-anchor.ts`:
- Line 27: The BENIGN regex constant over-matches because the trailing "thank"
will match words like "thanksgiving" or "thankful"; update the BENIGN pattern in
session-anchor.ts (symbol: BENIGN) to require a word boundary or explicit
pluralization (e.g., use \bthank(s)?\b or otherwise add \b around "thank") so
only standalone "thank" or "thanks" are matched; keep the existing alternation
structure and flags intact.
- Line 23: Update the PUSHBACK regex (the PUSHBACK constant) to broaden recall:
change the "that fail" token to "that failed" and add common correction/error
tokens such as "failed", "fix", "fix it", "error", "bug", and the negation
"doesn't help" (plus apostrophe-free variants like "doesnt help" and "doesnt")
so the pattern matches those user correction phrases as well; keep the existing
alternatives and case-insensitive flag intact.

In `@src/skillify/skill-edits.ts`:
- Around line 32-33: The code uses skill.indexOf(target) which only returns the
first occurrence and can pick the wrong match when the same target appears
multiple times; update the logic in the function containing variables skill,
target, idx, and r so you search for an occurrence that falls inside the range r
(e.g., use indexOf(target, r[0]) and if that result >= r[1] or -1, loop with
successive indexOf calls until you find an idx such that idx >= r[0] && idx <
r[1] or conclude no match), and add a short inline comment noting the remaining
ambiguity if multiple identical anchors exist and suggesting making anchors more
specific.
- Around line 60-66: In the "append" case when a protectedRange(s) exists,
ensure you normalize spacing by trimming trailing whitespace from the prefix
before the protected marker and inserting the same blank-line padding used in
the non-protected branch: compute prefix = s.slice(0, r[0]).replace(/\s*$/, "")
then set s = prefix + "\n\n" + content + "\n\n" + s.slice(r[0]); update the code
inside case "append" (the protectedRange(s) branch) to use this logic so
appended content never runs into the protected marker.

In `@src/skillify/skill-invocations.ts`:
- Around line 155-168: windowAroundInvocation can produce a degenerate output
when maxChars is very small because the elision notice can end up longer than
the visible head/tail; update windowAroundInvocation to handle small limits by
(a) computing the elision message first (e.g., elideMsg = `\n\n…[${joined.length
- maxChars} chars elided]…\n\n`) and ensuring head+tail+elideMsg.length <=
maxChars, or (b) for very small maxChars (choose a threshold like 100–200) fall
back to a simpler truncation strategy such as returning joined.slice(0, maxChars
- elideMsg.length) + elideMsg; adjust the head/tail length calculations
accordingly so head = Math.max(0, floor((maxChars - elideMsg.length) * 0.55))
and tail uses the remainder, referencing windowAroundInvocation, maxChars,
joined, head, tail and the elision message in your changes.
- Around line 73-79: Validate or safely-escape the sessionsTable identifier
before interpolating it in listSkillInvocations: ensure
loadConfig().sessionsTableName is checked against a strict whitelist or regex
for valid SQL identifiers (e.g., /^[A-Za-z_][A-Za-z0-9_]*$/) and reject or
fallback if it fails, so the query using sessionsTable cannot be malformed; also
review windowAroundInvocation to guard against tiny maxChars by clamping
maxChars to a sensible minimum or accounting for the elision marker length when
computing head/tail from joined so the resulting string never exceeds maxChars
and head/tail do not produce empty slices unexpectedly.

In `@src/skillify/skill-publisher.ts`:
- Around line 29-38: bumpVersion doesn't handle an entirely missing/empty
frontmatter string: when splitFrontmatter returns '' you should create a proper
frontmatter block instead of returning an empty string. Update the bumpVersion
function to detect empty or whitespace-only frontmatter, treat oldVersion as 1
and newVersion as 2, and return a frontmatter string like a full frontmatter
block containing the version (e.g. '---\nversion: 2\n---\n'); keep existing
behavior for when a version line exists (use the existing regex replace) and
when frontmatter exists but lacks a version line (insert version before the
closing '---'). Ensure you reference bumpVersion and its returned
oldVersion/newVersion values so callers remain consistent.

In `@src/skillify/skillopt-worker.ts`:
- Around line 37-49: Compute a single timestamp and reuse it for the whole cycle
so MetaEntry.proposedAt and ProposalRecord.createdAt match: before calling
runSkillOptCycle, create a const (e.g. nowIso = new Date().toISOString()) and
pass that as the now argument to runSkillOptCycle; inside the meta.record
callback (which currently calls metaEntryFor(n, a, edits, new
Date().toISOString())), replace the inline new Date() call with the shared
nowIso so metaEntryFor and the cycle use the identical timestamp.

In `@tests/shared/session-anchor.test.ts`:
- Around line 8-41: Add a unit test to verify detectAnchor truncates evidence to
200 characters: create a test that calls detectAnchor with an assistant turn
followed by a very long user pushback (e.g., "no that's wrong " +
"x".repeat(300)), assert anchored is true and that r.evidence.length === 200;
place this alongside the other detectAnchor tests and reference the detectAnchor
helper and the u/a message builders to construct the turns.

In `@tests/shared/skill-edit-gate.test.ts`:
- Around line 45-51: The query mock (vi.fn assigned to query) must validate both
time bounds instead of only checking for `< '${PUB}'`; update the mock to detect
the before-window SQL by asserting it contains the lower bound token (>=
beforeSince) and the upper bound `< '${PUB}'` and return beforeInvs, and detect
the after-window SQL by asserting it contains `>= '${PUB}'` (publishIso) and the
corresponding upper bound and return afterInvs; reference the variables and
symbols used in the test (query, PUB, beforeSince, publishIso, beforeInvs,
afterInvs, gateEdit) so the mock fails the test if gateEdit constructs incorrect
time predicates.

In `@tests/shared/skill-invocations.test.ts`:
- Around line 64-103: Add explicit tests for invocation-at-session-boundaries in
the skill-invocations.test.ts suite: create one test where the Skill tool_call
occurs at the very start (invocation index 0) and call windowAroundInvocation
with before > available turns (e.g., before: 3) and assert the result only
contains the available post-invocation turns; create another test where the
Skill tool_call is the final turn and call with after > available turns (e.g.,
after: 6) and assert the result only contains the available pre-invocation
turns. Use the existing helpers (mockQuery, TABLE, windowAroundInvocation) and
assertions matching the style of the other tests to validate the Math.max(0,
...) boundary behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: ce0d8923-7d0c-4915-abfb-cfeac3a9842f

📥 Commits

Reviewing files that changed from the base of the PR and between a34a9de and b03c10a.

📒 Files selected for processing (27)
  • node_modules
  • src/hooks/session-start.ts
  • src/skillify/claude-model.ts
  • src/skillify/deficiency-detector.ts
  • src/skillify/session-anchor.ts
  • src/skillify/skill-edit-gate.ts
  • src/skillify/skill-edits.ts
  • src/skillify/skill-invocations.ts
  • src/skillify/skill-proposer.ts
  • src/skillify/skill-publisher.ts
  • src/skillify/skillopt-engine.ts
  • src/skillify/skillopt-meta.ts
  • src/skillify/skillopt-worker.ts
  • src/skillify/skills-active.ts
  • src/skillify/success-judge.ts
  • tests/claude-code/session-start-hook.test.ts
  • tests/shared/deficiency-detector.test.ts
  • tests/shared/session-anchor.test.ts
  • tests/shared/skill-edit-gate.test.ts
  • tests/shared/skill-edits.test.ts
  • tests/shared/skill-invocations.test.ts
  • tests/shared/skill-proposer.test.ts
  • tests/shared/skill-publisher.test.ts
  • tests/shared/skillopt-engine.test.ts
  • tests/shared/skillopt-meta.test.ts
  • tests/shared/skills-active.test.ts
  • tests/shared/success-judge.test.ts
💤 Files with no reviewable changes (2)
  • tests/shared/skills-active.test.ts
  • src/hooks/session-start.ts

Comment thread node_modules Outdated
A machine-specific absolute node_modules symlink got committed in 1929c96 because
.gitignore only had `node_modules/` (matches a directory, not the symlink file).
Untracked it and tightened the ignore to `node_modules` so the symlink can't be
re-added. The symlink stays locally (for builds); it's just no longer in git.
@kaghni
Copy link
Copy Markdown
Collaborator Author

kaghni commented Jun 5, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 5, 2026

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

kaghni added 3 commits June 5, 2026 06:32
scoreInvocations built the judge window from windowedTurns without applying the
maxChars cap, so a captured turn containing a pasted log/diff could be sent whole
to claude -p — slow, costly, or failing. Extracted elide() from
windowAroundInvocation and apply it to the window before judging. Test asserts a
5000-char paste is capped under maxChars.
…dex P2)

The judge & proposer get untrusted captured transcript text in their prompts, so
--disallowed-tools is the prompt-injection guard. The list missed write-capable
tools (MultiEdit, NotebookEdit, TodoWrite), leaving an escape hatch in installs
that expose them. Enumerated them so a malicious failure example can't act.
The worker only read ~/.claude/skills, so a deficient skill pulled with
--to project (<cwd>/.claude/skills) had its invocation detected but no body to
edit -> the proposal was silently skipped. The detached worker inherits the
SessionStart cwd, so it now falls back to <cwd>/.claude/skills.
@kaghni
Copy link
Copy Markdown
Collaborator Author

kaghni commented Jun 5, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 5, 2026

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

kaghni added 2 commits June 5, 2026 06:43
On a machine with the Hivemind hooks installed, each judge/proposer claude -p call
was captured as a real session row — polluting the very sessions data the detector
scans. Spawn the child with HIVEMIND_CAPTURE=false, as the skillify gate runner does.
"thanks, but this is still failing" matched PUSHBACK but BENIGN's bare "thank"
also matched, so detectAnchor returned false and the failure was dropped — bad for
a recall-oriented stage. Split into STRONG corrections (always anchor, even amid
polite words) and the ambiguous bare "no" (benign-gated). Test covers polite
corrections.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/shared/deficiency-detector.test.ts (1)

55-74: ⚡ Quick win

Tighten the assertion threshold for stronger test precision.

The assertion on line 73 uses < 800, but with maxChars: 300, the elided output should be approximately 330 characters (165 head + 135 tail + ~30-character elision marker). The current threshold of 800 is 2.67× the expected size, which still proves that huge pastes are capped but won't catch bugs where the cap is only partially effective (e.g., if it capped at 700 instead of ~330).

Consider tightening the assertion to < 450 or < 500 to more precisely validate the capping behavior while still allowing reasonable overhead.

🎯 Proposed tighter assertion
-    expect(judgedLen).toBeLessThan(800);   // capped — not the ~5000-char paste
+    expect(judgedLen).toBeLessThan(450);   // capped to ~330 chars (maxChars=300 + elision marker overhead)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/shared/deficiency-detector.test.ts` around lines 55 - 74, The test
"caps the judged window at maxChars..." currently asserts judgedLen < 800 which
is too loose; update the tighter threshold to judgedLen < 500 (or <450) to
better reflect maxChars: 300 behavior. Locate the test in
tests/shared/deficiency-detector.test.ts (the it block that calls
detectDeficientSkills and sets judgedLen via the mocked judge) and change the
expectation on judgedLen from expect(judgedLen).toBeLessThan(800) to
expect(judgedLen).toBeLessThan(500) so the assertion more precisely verifies the
capping.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/shared/deficiency-detector.test.ts`:
- Around line 55-74: The test "caps the judged window at maxChars..." currently
asserts judgedLen < 800 which is too loose; update the tighter threshold to
judgedLen < 500 (or <450) to better reflect maxChars: 300 behavior. Locate the
test in tests/shared/deficiency-detector.test.ts (the it block that calls
detectDeficientSkills and sets judgedLen via the mocked judge) and change the
expectation on judgedLen from expect(judgedLen).toBeLessThan(800) to
expect(judgedLen).toBeLessThan(500) so the assertion more precisely verifies the
capping.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: f702cf06-8dff-4496-abbd-21b4f5227f0e

📥 Commits

Reviewing files that changed from the base of the PR and between a1905fd and a118d0a.

📒 Files selected for processing (7)
  • src/skillify/claude-model.ts
  • src/skillify/deficiency-detector.ts
  • src/skillify/session-anchor.ts
  • src/skillify/skill-invocations.ts
  • src/skillify/skillopt-worker.ts
  • tests/shared/deficiency-detector.test.ts
  • tests/shared/session-anchor.test.ts
🚧 Files skipped from review as they are similar to previous changes (6)
  • src/skillify/session-anchor.ts
  • tests/shared/session-anchor.test.ts
  • src/skillify/claude-model.ts
  • src/skillify/skill-invocations.ts
  • src/skillify/deficiency-detector.ts
  • src/skillify/skillopt-worker.ts

kaghni added 2 commits June 5, 2026 18:23
…ex P2)

targetsProtected only checked the target's start offset, so a replace/delete whose
target began just before SLOW_UPDATE_START and spanned into the block slipped
through and could remove protected longitudinal guidance. Now rejects any target
RANGE that overlaps [r0, r1). Test covers a spanning target.
…ths (codex P2)

Two findings sharing skill-invocations.ts:
- detectAnchor scanned the whole window incl. pre-invocation turns, so a correction
  from an EARLIER failed attempt (skill used as a repair) was misattributed to this
  skill. windowedTurns now returns a pivot (first post-invocation turn); detectAnchor
  takes fromIndex and only fires when both the reaction AND the assistant it reacts to
  are post-invocation. Test transcripts reordered to real capture order
  (user → tool_call → assistant → reaction).
- splitOrgSkill now rejects refs containing path separators or '..' before name/author
  are used to build skills/proposals filesystem paths (path-escape guard, matching the
  pull path's untrusted treatment).
@kaghni
Copy link
Copy Markdown
Collaborator Author

kaghni commented Jun 5, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 5, 2026

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

kaghni added 2 commits June 5, 2026 18:37
…x P1)

A deny-list only blocks named built-ins — any other tool in the user's env (LS,
NotebookRead, configured MCP tools) stayed callable, breaking the pure-text
isolation for calls that include untrusted captured transcript text. Switched to
`--tools """ (empty allow-list = zero tools available, authoritative over built-ins
AND MCP). Verified the flag isolates with a real claude -p call.
…ex P2)

Invocations come from ALL projects, but the worker resolved project-scoped skills
from its own SessionStart cwd — so a skill pulled --to project in another cwd was
skipped, or a same-named skill in the current cwd was edited instead.
readSkillBodyViaManifest now resolves the authoritative installRoot from the pull
manifest (global ~/.claude/skills as fallback). Test covers the project-root read.
@kaghni
Copy link
Copy Markdown
Collaborator Author

kaghni commented Jun 5, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 5, 2026

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

kaghni added 3 commits June 5, 2026 18:54
…alls (codex P2)

HIVEMIND_CAPTURE=false stopped capture, but the spawned judge/proposer claude -p
still ran our SessionStart hook — injecting the large Deeplake context into the
prompt and doing auto-pull/graph work, once per anchored invocation. Set
HIVEMIND_WIKI_WORKER=1 (the guard internal runners use) so the hook returns
immediately: no prompt contamination, no repeated background work.
… P3)

path LIKE '%sid%' could pull rows from an unrelated session (sid as a substring, or
SQL LIKE wildcards in the id), feeding the judge a mixed transcript. Escape LIKE
wildcards (+ ESCAPE clause) and drop any row whose message.session_id isn't the exact
session. Test covers a cross-session collision row.
… P3)

bumpVersion('') left the frontmatter empty while reporting newVersion: 2, so a
published edit of a no-frontmatter skill carried no version and the gate couldn't
tell versions apart. Now emits a minimal '---\nversion: N\n---\n' block. Test added.
@kaghni
Copy link
Copy Markdown
Collaborator Author

kaghni commented Jun 5, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 5, 2026

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/skillify/skillopt-worker.ts (1)

64-64: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Return a non-zero exit code on fatal worker errors.

Line 64 logs a fatal condition and still exits with 0. That makes a crashed SkillOpt run indistinguishable from a successful no-op run, so callers and manual invocations cannot detect that the worker actually failed.

Suggested fix
-main().catch((e) => { log(`fatal (swallowed): ${(e as Error)?.message ?? e}`); process.exit(0); });
+main().catch((e) => {
+  log(`fatal: ${(e as Error)?.message ?? e}`);
+  process.exit(1);
+});
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/skillify/skillopt-worker.ts` at line 64, The current top-level error
handler for main() logs a fatal error but calls process.exit(0); change it to
return a non-zero exit code so failures are detectable: update the catch handler
on main() (the lambda that calls log(...) and process.exit(...)) to call
process.exit(1) (or another non-zero code) and include the error details in the
log message (using the existing (e as Error)?.message ?? e expression) so the
handler for main() correctly signals failure to callers.
src/skillify/skillopt-engine.ts (1)

70-86: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Isolate per-skill failures inside the proposal loop.

On Line 74 through Line 84, any throw from readSkillBody, proposeSkillEdit, writeProposal, or meta.record aborts runSkillOptCycle for the entire batch. Because src/skillify/skillopt-worker.ts Line 64 catches that once at the top, one transient model or filesystem failure skips every remaining deficient skill in the cycle.

Suggested containment
   const proposals: CycleResult["proposals"] = [];
   for (const s of targets) {
-    const body = deps.readSkillBody(s.name, s.author);
-    if (!body) continue; // not installed locally → nothing to edit
-    const priorEdits = deps.meta?.prior(s.name, s.author) ?? [];
-    const p = await proposeSkillEdit(body, s.examples, { ...deps.proposer, priorEdits });
-    // dedup against the meta memory — don't re-write an edit already tried for this skill.
-    const isDup = p.changed && (deps.meta?.has(s.name, s.author, p.edits) ?? false);
-    if (p.changed && !isDup) {
-      deps.writeProposal({
-        name: s.name, author: s.author,
-        invocations: s.invocations, confirmedFailures: s.confirmedFailures, failureRate: s.failureRate,
-        examples: s.examples, edits: p.edits, report: p.report,
-        candidateBody: p.editedBody, createdAt: deps.now,
-      });
-      deps.meta?.record(s.name, s.author, p.edits);
-    }
-    proposals.push({ name: s.name, author: s.author, changed: p.changed && !isDup, failureRate: s.failureRate });
+    try {
+      const body = deps.readSkillBody(s.name, s.author);
+      if (!body) continue; // not installed locally → nothing to edit
+      const priorEdits = deps.meta?.prior(s.name, s.author) ?? [];
+      const p = await proposeSkillEdit(body, s.examples, { ...deps.proposer, priorEdits });
+      const isDup = p.changed && (deps.meta?.has(s.name, s.author, p.edits) ?? false);
+      if (p.changed && !isDup) {
+        deps.writeProposal({
+          name: s.name, author: s.author,
+          invocations: s.invocations, confirmedFailures: s.confirmedFailures, failureRate: s.failureRate,
+          examples: s.examples, edits: p.edits, report: p.report,
+          candidateBody: p.editedBody, createdAt: deps.now,
+        });
+        deps.meta?.record(s.name, s.author, p.edits);
+      }
+      proposals.push({ name: s.name, author: s.author, changed: p.changed && !isDup, failureRate: s.failureRate });
+    } catch {
+      proposals.push({ name: s.name, author: s.author, changed: false, failureRate: s.failureRate });
+      continue;
+    }
   }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/skillify/skillopt-engine.ts` around lines 70 - 86, Wrap each iteration of
the targets loop in a per-skill try/catch so exceptions from readSkillBody,
proposeSkillEdit, writeProposal, or deps.meta.record do not abort the whole
runSkillOptCycle; on error, log the exception with skill context (name, author)
and continue to the next skill, and still push a proposals entry for that skill
with changed set to false (and include its failureRate/other available metadata)
so the caller sees the skill was processed but failed; ensure deps.meta.record
and deps.writeProposal are only invoked on the successful path where p.changed
&& !isDup.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/skillify/skillopt-engine.ts`:
- Around line 70-86: Wrap each iteration of the targets loop in a per-skill
try/catch so exceptions from readSkillBody, proposeSkillEdit, writeProposal, or
deps.meta.record do not abort the whole runSkillOptCycle; on error, log the
exception with skill context (name, author) and continue to the next skill, and
still push a proposals entry for that skill with changed set to false (and
include its failureRate/other available metadata) so the caller sees the skill
was processed but failed; ensure deps.meta.record and deps.writeProposal are
only invoked on the successful path where p.changed && !isDup.

In `@src/skillify/skillopt-worker.ts`:
- Line 64: The current top-level error handler for main() logs a fatal error but
calls process.exit(0); change it to return a non-zero exit code so failures are
detectable: update the catch handler on main() (the lambda that calls log(...)
and process.exit(...)) to call process.exit(1) (or another non-zero code) and
include the error details in the log message (using the existing (e as
Error)?.message ?? e expression) so the handler for main() correctly signals
failure to callers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 61444a8f-b98a-4d96-9ef9-5f55fa29c5ef

📥 Commits

Reviewing files that changed from the base of the PR and between b934458 and 78d68a8.

📒 Files selected for processing (8)
  • src/skillify/claude-model.ts
  • src/skillify/skill-invocations.ts
  • src/skillify/skill-publisher.ts
  • src/skillify/skillopt-engine.ts
  • src/skillify/skillopt-worker.ts
  • tests/shared/skill-invocations.test.ts
  • tests/shared/skill-publisher.test.ts
  • tests/shared/skillopt-engine.test.ts
🚧 Files skipped from review as they are similar to previous changes (5)
  • tests/shared/skill-publisher.test.ts
  • tests/shared/skill-invocations.test.ts
  • tests/shared/skillopt-engine.test.ts
  • src/skillify/skill-publisher.ts
  • src/skillify/skill-invocations.ts

kaghni added 2 commits June 5, 2026 19:12
A detached hook worker may not have claude on PATH (e.g. ~/.claude/local/claude);
spawn("claude") would ENOENT and the callers swallow it as no-change, so the weekly
worker silently produced no proposals in those layouts. Resolve the binary the same
way the rest of skillify does (findAgentBin('claude_code')).
runWeeklySkillOpt stamped lastRun before the worker checks creds, so firing at a
logged-out SessionStart consumed the week — a user who logged in minutes later was
skipped as throttled until next week. Added a canFire gate (default: creds present)
that returns reason 'no-creds' before any stamp/spawn. Test covers it.
@kaghni
Copy link
Copy Markdown
Collaborator Author

kaghni commented Jun 5, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 5, 2026

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

… (codex P2)

Two follow-ons from round 6:
- claude-model: --tools "" denies tool USE but claude -p still LOADS the user's MCP
  config; a broken/oversized MCP schema fails every judge/proposer call before JSON.
  Added --strict-mcp-config to ignore user MCP entirely.
- trigger: the canFire gate used the credentials FILE, but the worker uses loadConfig()
  (which also accepts HIVEMIND_TOKEN/ORG_ID env creds). Mismatch skipped env-cred users
  forever / stamped on a token-only malformed file. Now both use loadConfig()?.token.
@kaghni
Copy link
Copy Markdown
Collaborator Author

kaghni commented Jun 5, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 5, 2026

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

The weekly worker's lookback + detector thresholds are now overridable via env —
HIVEMIND_SKILLOPT_{LOOKBACK_DAYS,MIN_INVOCATIONS,FAILURE_RATE,FIRE_THRESHOLD} — with
the same defaults (30d, detector min-n, 0.4 rate, >=5 fire). A positive override
wins; non-numeric/<=0 falls back. Useful for tuning + smaller-data testing.
@efenocchi
Copy link
Copy Markdown
Collaborator

@CodeRabbit how do we label a skill? how do we decide if it's good or not?

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 5, 2026

Tip

For best results, initiate chat on the files or code changes.

@efenocchi Great question! This PR deliberately avoids manual labels — instead it uses a two-stage proxy signal:


How a skill's quality is evaluated

Stage 1 — Anchor detection (cheap heuristic)

Every time a Skill tool is invoked, the surrounding transcript turns are windowed. detectAnchor (src/skillify/session-anchor.ts) then scans the turns after the invocation, looking for a USER turn that immediately follows an ASSISTANT turn and matches either:

  • A strong correction regex (e.g. "that's wrong", "no, try again", "please redo…")
  • An ambiguous negative ("no", "nope") — only if it does not also match a benign negation ("no problem", "works now", "thanks")

If such a pattern is found → the invocation is anchored (flagged as a candidate failure). No judge runs for un-anchored invocations, so the expensive model call is reserved only for suspicious windows.

Stage 2 — Success judge (LLM-as-a-reward-model)

For each anchored invocation window, judgeSuccess (src/skillify/success-judge.ts) calls a Claude Haiku model with a strict JSON-only prompt asking it to output { success: 0|1, confidence, reason }. Conservatively, if the output can't be parsed, it defaults to success=1 (skill worked) to avoid false positives.

Aggregation & deficiency gate

detectDeficientSkills (src/skillify/deficiency-detector.ts) groups results by (name, author) and computes:

Metric Meaning
invocations Total times this skill was called
anchored How many invocations had a user pushback anchor
confirmed Of those anchored, how many the judge scored success=0
failureRate confirmed / invocations
deficient true only when invocations ≥ minInvocations AND failureRate ≥ failureRateThreshold

Even if a skill is deficient, no proposals fire unless ≥5 deficient events are accumulated (the fireThreshold gate in the engine).

Post-publish gate

After an edit is published, gateEdit (src/skillify/skill-edit-gate.ts) compares the before/after failure rates:

  • keep — failure rate dropped by at least margin
  • revert — failure rate rose by at least margin
  • inconclusive — not enough post-publish data yet

In short: user pushback patterns = cheap first filter → LLM judge = confirmation → failure rate thresholds = deficiency flag. No human labels are ever required.

@efenocchi
Copy link
Copy Markdown
Collaborator

Tip

For best results, initiate chat on the files or code changes.

@efenocchi Great question! This PR deliberately avoids manual labels — instead it uses a two-stage proxy signal:

How a skill's quality is evaluated

Stage 1 — Anchor detection (cheap heuristic)

Every time a Skill tool is invoked, the surrounding transcript turns are windowed. detectAnchor (src/skillify/session-anchor.ts) then scans the turns after the invocation, looking for a USER turn that immediately follows an ASSISTANT turn and matches either:

  • A strong correction regex (e.g. "that's wrong", "no, try again", "please redo…")
  • An ambiguous negative ("no", "nope") — only if it does not also match a benign negation ("no problem", "works now", "thanks")

If such a pattern is found → the invocation is anchored (flagged as a candidate failure). No judge runs for un-anchored invocations, so the expensive model call is reserved only for suspicious windows.

Stage 2 — Success judge (LLM-as-a-reward-model)

For each anchored invocation window, judgeSuccess (src/skillify/success-judge.ts) calls a Claude Haiku model with a strict JSON-only prompt asking it to output { success: 0|1, confidence, reason }. Conservatively, if the output can't be parsed, it defaults to success=1 (skill worked) to avoid false positives.

Aggregation & deficiency gate

detectDeficientSkills (src/skillify/deficiency-detector.ts) groups results by (name, author) and computes:

Metric Meaning
invocations Total times this skill was called
anchored How many invocations had a user pushback anchor
confirmed Of those anchored, how many the judge scored success=0
failureRate confirmed / invocations
deficient true only when invocations ≥ minInvocations AND failureRate ≥ failureRateThreshold
Even if a skill is deficient, no proposals fire unless ≥5 deficient events are accumulated (the fireThreshold gate in the engine).

Post-publish gate

After an edit is published, gateEdit (src/skillify/skill-edit-gate.ts) compares the before/after failure rates:

  • keep — failure rate dropped by at least margin
  • revert — failure rate rose by at least margin
  • inconclusive — not enough post-publish data yet

In short: user pushback patterns = cheap first filter → LLM judge = confirmation → failure rate thresholds = deficiency flag. No human labels are ever required.

@CodeRabbit do we keep the tool calls? if so, does it make sense to remove them to remove the noise? we do that for summaries

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 5, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
fd 'skill-invocations.ts' --exec cat {}

Length of output: 8220



@efenocchi Tool calls are already stripped from the judge windows — the current implementation only reconstructs user_message and assistant_message turns. Here's how it works in sessionTurns():

// The Skill invocation tool_call itself is detected and skipped (only used to mark position):
const ref = invokedSkillRef(j);
if (ref) {
  // ... mark invIndex, then:
  continue;
}

// All other message types are handled here:
const text = typeof j.content === "string" ? j.content.trim() : "";
if (!text) continue;
if (j.type === "user_message") turns.push({ role: "USER", text });
else if (j.type === "assistant_message") turns.push({ role: "ASSISTANT", text });
// ← any other type (tool_call, tool_result, system, etc.) falls through — never pushed

So tool_call and tool_result rows are silently dropped. The window the judge sees is pure USER/ASSISTANT conversational text only — already consistent with what you do for summaries.

One edge case worth noting: if assistant_message records store their .content as a structured JSON object (e.g. an array of content blocks) rather than a plain string, the typeof j.content === "string" guard will produce an empty string and that turn gets dropped entirely by the if (!text) continue check. That could cause a tool-heavy assistant turn to go invisible to the judge — potentially a minor blind spot, but it also means no tool noise leaks through.

@efenocchi
Copy link
Copy Markdown
Collaborator

@kaghni can you increase the coverage?

@kaghni kaghni merged commit c5e9a9d into feat/skillopt-foundation Jun 5, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants