refactor(skillopt): clearer proposer prompt + robust edit-array parsing by khustup2 · Pull Request #245 · activeloopai/hivemind

khustup2 · 2026-06-07T06:17:36Z

What

Hardens the SkillOpt proposer's output parsing and restructures its prompt:

Parser: extract the JSON edit array via a balanced back-scan — robust to a leading prose/bracket line — with a first-[..last-] fallback. (Genuine robustness fix.)
Prompt: diagnose the single recurring weakness first, then emit concrete, anchored edits; explicit two-part contract so SYSTEM and user prompt agree.
Tests + coverage for both, including bracketed-prefix and unbalanced-content cases.

On the benchmark — the prompt change does NOT improve quality

Tested against the original proposer on 4 org skills with real failures, judging each edit vs the current body 5× per skill (rate, not a single noisy verdict):

skill	baseline	this PR's prompt
pg-suso	40%	40%
suso	80%	no edit produced
pg-analytics	60%	60%
deeplake-pg-production-diagnostics	80%	80%
mean accept-rate	65%	60%

Within noise (Δ5%), nominally worse. An earlier single run showed 3/4 but did not reproduce — that was judge nondeterminism. So the prompt rewrite has no measured benefit; the keepable part of this PR is the parser hardening + tests. n=4 is too small to detect an optimizer gain regardless.

Tests

26 green, typecheck clean.

…nd-at-end The proposer appended generic guidance to the end of long skill docs, which diluted rather than fixed (and lost the A/B against the current body). It now diagnoses the single recurring weakness first, then requires each edit to be concrete/operational and anchored into the relevant existing section (or to replace the weak existing instruction). On a 4-skill benchmark of real failing org skills, an independent judge prefers the proposed edit over the current body 3/4, up from 2/4 before.

coderabbitai · 2026-06-07T06:17:46Z

📝 Walkthrough

Walkthrough

This PR tightens the skill-proposer prompts: the SYSTEM prompt enforces a two-step diagnose-then-edit workflow with concrete, anchored, traceable edits (JSON array only), and buildUserPrompt now requires a one-line diagnosis followed by anchored edits. Parsing of the model output now extracts the last balanced JSON array and parseEdits returns [] on failure; tests validate bracket handling and priorEdits inclusion.

Changes

Skill Edit Proposal Constraints

Layer / File(s)	Summary
LLM prompt refinements for structured skill edits `src/skillify/skill-proposer.ts`	SYSTEM prompt enforces a two-step directive: diagnose the single recurring weakness, then propose a small set of concrete, operational edits anchored to existing sections via `insert_after`/`replace`/`delete`; forbids appending uninformed text and reiterates the `SU_START`/`SU_END` protected region. `buildUserPrompt` now requires the one-line diagnosis first, then outputs only the JSON array of anchored edits.
Robust edits JSON extraction and parseEdits `src/skillify/skill-proposer.ts`	Replaces naive first/last-bracket slicing with a backward scan to locate the last balanced `[...]` region before JSON.parse; `parseEdits` uses this helper and returns `[]` on extraction or parse failure.
Prompt validation and parsing tests `tests/shared/skill-proposer.test.ts`	Adds tests covering balanced-bracket extraction when weakness lines contain brackets, fallback when edits include an unbalanced `]`, asserts the system prompt includes anchoring/concrete/operational guidance and “REPLACE the weak” phrasing, verifies resulting `p.changed === true` and anchored edit output, and checks that `priorEdits` appear as an `ALREADY TRIED` section with the exact prior edit string.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

A rabbit nibbles prompts at dawn,
Finds the weakness, marks what's wrong,
Anchors edits, small and neat,
JSON arrays keep fixes discrete,
Hops off with a dotted line and song.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title directly describes the main changes: refactoring the proposer prompt for clarity and making edit-array parsing more robust, which aligns with the core objectives of the PR.
Description check	✅ Passed	The PR description provides a comprehensive summary of changes, benchmark results, test status, and acknowledges the parser hardening as the primary keepable improvement.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/skillopt-proposer-anchored-edits

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-06-07T06:19:22Z

Coverage Report

Scope: files changed in this PR. Enforced threshold: 90% per metric (per file via vitest.config.ts).

Status	Category	Percentage	Covered / Total
🟢	Lines	94.87% (🎯 90%)	37 / 39
🟢	Statements	96.30% (🎯 90%)	52 / 54
🟢	Functions	100.00% (🎯 90%)	6 / 6
🔴	Branches	87.80% (🎯 90%)	36 / 41

File Coverage — 1 file changed

File	Stmts	Branches	Functions	Lines
`src/skillify/skill-proposer.ts`	🟢 96.3%	🔴 87.8%	🟢 100.0%	🟢 94.9%

_{Generated for commit 2c9e4cd.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/skillify/skill-proposer.ts`:
- Around line 43-46: The SYSTEM prompt fragment using `${SU_END}` demands "Reply
with ONLY a JSON array..." but another prompt later requires a "one-line
weakness before JSON", creating a contradiction; reconcile them by choosing one
contract and making both prompt templates consistent: either remove the one-line
weakness requirement from the user prompt or change the SYSTEM/string built
around `SU_END` to allow an optional single-line weakness prior to the JSON.
Update the prompt-building code paths that reference `SU_END` and the other
prompt fragment (the one that currently asks for a one-line weakness) so both
emit the same instruction text, and run any prompt-formatting tests or callers
that parse the model output to ensure they accept the chosen format.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 8fe5edfc-dd9b-4ef8-8d59-5f5f3c3f2e6e

📥 Commits

Reviewing files that changed from the base of the PR and between 0b35aec and 11021a3.

📒 Files selected for processing (1)

src/skillify/skill-proposer.ts

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/shared/skill-proposer.test.ts`:
- Around line 44-47: The assertions in tests/shared/skill-proposer.test.ts use
broad regexes on the `system` string and `model.mock.calls[0][1]`; replace them
with exact prompt-fragment checks to avoid false positives. Update the three
`expect(system).toMatch(...)` lines to assert the precise expected substrings
(e.g., the exact instruction phrases you expect to appear in `system`) and
change `expect(model.mock.calls[0][1]).toMatch(...)` to assert the exact
fragment or full message from `model.mock.calls[0][1]` (use equality or
`toContain` with the specific text). Ensure the new assertions reference the
same symbols (`system` and `model.mock.calls[0][1]`) so the test still validates
the produced prompt fragments precisely.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: b85b9fe4-0604-476a-b027-8738b5d9a683

📥 Commits

Reviewing files that changed from the base of the PR and between 11021a3 and 2840a1c.

📒 Files selected for processing (1)

tests/shared/skill-proposer.test.ts

…en tests/coverage - Resolve SYSTEM/user-prompt contradiction (CodeRabbit): output stays JSON-only; diagnosis is an internal step, not emitted (also avoids a bracketed weakness line breaking parseEdits). - Assert exact prompt fragments instead of broad regexes. - Cover the priorEdits branch.

khustup2 · 2026-06-07T15:34:20Z

Addressed CodeRabbit in 1e5f58d6:

Output-format contradiction — output stays JSON-only; the weakness diagnosis is now an internal step (not emitted), which also avoids a bracketed weakness line breaking parseEdits.
Test assertions — switched to exact-fragment toContain checks.
Coverage — added a test for the priorEdits branch; proposer now 100% line/func, 89.7% branch.

…arse The JSON-only reconcile removed the 'name the weakness first' step, which was carrying the benchmark lift. Restore it as an explicit TWO-part contract (weakness line, then JSON array) — SYSTEM and user prompt now agree — and harden parseEdits to extract the array via a balanced back-scan (robust to a bracketed weakness line), with a first-[..last-] fallback. Tests cover both paths.

coderabbitai

🧹 Nitpick comments (1)

tests/shared/skill-proposer.test.ts (1)

44-44: 💤 Low value

Consider using exact fragment assertion for consistency.

Line 44 uses a case-insensitive regex while the rest of the file uses exact toContain checks. For consistency with the coding guidelines:

-    expect(model.mock.calls[0][0]).toMatch(/recurring weakness/i);
+    expect(model.mock.calls[0][0]).toContain("SINGLE recurring weakness");

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/shared/skill-proposer.test.ts` at line 44, The assertion at
expect(model.mock.calls[0][0]).toMatch(/recurring weakness/i) uses a regex while
the rest of the file uses exact fragment checks; replace this regex assertion
with an exact fragment assertion (use toContain with the string "recurring
weakness") on the same value (model.mock.calls[0][0]) so it matches the file's
consistency pattern.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/shared/skill-proposer.test.ts`:
- Line 44: The assertion at expect(model.mock.calls[0][0]).toMatch(/recurring
weakness/i) uses a regex while the rest of the file uses exact fragment checks;
replace this regex assertion with an exact fragment assertion (use toContain
with the string "recurring weakness") on the same value (model.mock.calls[0][0])
so it matches the file's consistency pattern.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 56b4207b-7fbc-4cd4-a5b1-0b5cd451d57b

📥 Commits

Reviewing files that changed from the base of the PR and between 1e5f58d and a6055ca.

📒 Files selected for processing (2)

src/skillify/skill-proposer.ts
tests/shared/skill-proposer.test.ts

khustup2 · 2026-06-07T17:32:54Z

Closing — the premise didn't hold. A 5×-per-skill A/B showed the prompt change is within noise (baseline 65% vs 60%), so there's no optimizer improvement to ship. The earlier 3/4 was nondeterminism. Apologies for the churn.

coderabbitai Bot reviewed Jun 7, 2026

View reviewed changes

Comment thread src/skillify/skill-proposer.ts Outdated

test(skillopt): cover proposer steering toward concrete, anchored edits

2840a1c

coderabbitai Bot reviewed Jun 7, 2026

View reviewed changes

Comment thread tests/shared/skill-proposer.test.ts Outdated

coderabbitai Bot reviewed Jun 7, 2026

View reviewed changes

khustup2 changed the title ~~feat(skillopt): proposer targets root-cause + anchors edits (2/4 → 3/4 on real-skill benchmark)~~ refactor(skillopt): clearer proposer prompt + robust edit-array parsing Jun 7, 2026

khustup2 closed this Jun 7, 2026

khustup2 deleted the feat/skillopt-proposer-anchored-edits branch June 7, 2026 17:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(skillopt): clearer proposer prompt + robust edit-array parsing#245

refactor(skillopt): clearer proposer prompt + robust edit-array parsing#245
khustup2 wants to merge 4 commits into
mainfrom
feat/skillopt-proposer-anchored-edits

khustup2 commented Jun 7, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 7, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented Jun 7, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

khustup2 commented Jun 7, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

khustup2 commented Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

khustup2 commented Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

On the benchmark — the prompt change does NOT improve quality

Tests

Uh oh!

coderabbitai Bot commented Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Coverage Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

khustup2 commented Jun 7, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

khustup2 commented Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

khustup2 commented Jun 7, 2026 •

edited

Loading

coderabbitai Bot commented Jun 7, 2026 •

edited

Loading

github-actions Bot commented Jun 7, 2026 •

edited

Loading