refactor(skillopt): clearer proposer prompt + robust edit-array parsing#245
refactor(skillopt): clearer proposer prompt + robust edit-array parsing#245khustup2 wants to merge 4 commits into
Conversation
…nd-at-end The proposer appended generic guidance to the end of long skill docs, which diluted rather than fixed (and lost the A/B against the current body). It now diagnoses the single recurring weakness first, then requires each edit to be concrete/operational and anchored into the relevant existing section (or to replace the weak existing instruction). On a 4-skill benchmark of real failing org skills, an independent judge prefers the proposed edit over the current body 3/4, up from 2/4 before.
📝 WalkthroughWalkthroughThis PR tightens the skill-proposer prompts: the SYSTEM prompt enforces a two-step diagnose-then-edit workflow with concrete, anchored, traceable edits (JSON array only), and buildUserPrompt now requires a one-line diagnosis followed by anchored edits. Parsing of the model output now extracts the last balanced JSON array and parseEdits returns [] on failure; tests validate bracket handling and priorEdits inclusion. ChangesSkill Edit Proposal Constraints
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Coverage ReportScope: files changed in this PR. Enforced threshold: 90% per metric (per file via
File Coverage — 1 file changed
Generated for commit 2c9e4cd. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/skillify/skill-proposer.ts`:
- Around line 43-46: The SYSTEM prompt fragment using `${SU_END}` demands "Reply
with ONLY a JSON array..." but another prompt later requires a "one-line
weakness before JSON", creating a contradiction; reconcile them by choosing one
contract and making both prompt templates consistent: either remove the one-line
weakness requirement from the user prompt or change the SYSTEM/string built
around `SU_END` to allow an optional single-line weakness prior to the JSON.
Update the prompt-building code paths that reference `SU_END` and the other
prompt fragment (the one that currently asks for a one-line weakness) so both
emit the same instruction text, and run any prompt-formatting tests or callers
that parse the model output to ensure they accept the chosen format.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 8fe5edfc-dd9b-4ef8-8d59-5f5f3c3f2e6e
📒 Files selected for processing (1)
src/skillify/skill-proposer.ts
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/shared/skill-proposer.test.ts`:
- Around line 44-47: The assertions in tests/shared/skill-proposer.test.ts use
broad regexes on the `system` string and `model.mock.calls[0][1]`; replace them
with exact prompt-fragment checks to avoid false positives. Update the three
`expect(system).toMatch(...)` lines to assert the precise expected substrings
(e.g., the exact instruction phrases you expect to appear in `system`) and
change `expect(model.mock.calls[0][1]).toMatch(...)` to assert the exact
fragment or full message from `model.mock.calls[0][1]` (use equality or
`toContain` with the specific text). Ensure the new assertions reference the
same symbols (`system` and `model.mock.calls[0][1]`) so the test still validates
the produced prompt fragments precisely.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: b85b9fe4-0604-476a-b027-8738b5d9a683
📒 Files selected for processing (1)
tests/shared/skill-proposer.test.ts
…en tests/coverage - Resolve SYSTEM/user-prompt contradiction (CodeRabbit): output stays JSON-only; diagnosis is an internal step, not emitted (also avoids a bracketed weakness line breaking parseEdits). - Assert exact prompt fragments instead of broad regexes. - Cover the priorEdits branch.
|
Addressed CodeRabbit in
|
…arse The JSON-only reconcile removed the 'name the weakness first' step, which was carrying the benchmark lift. Restore it as an explicit TWO-part contract (weakness line, then JSON array) — SYSTEM and user prompt now agree — and harden parseEdits to extract the array via a balanced back-scan (robust to a bracketed weakness line), with a first-[..last-] fallback. Tests cover both paths.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/shared/skill-proposer.test.ts (1)
44-44: 💤 Low valueConsider using exact fragment assertion for consistency.
Line 44 uses a case-insensitive regex while the rest of the file uses exact
toContainchecks. For consistency with the coding guidelines:- expect(model.mock.calls[0][0]).toMatch(/recurring weakness/i); + expect(model.mock.calls[0][0]).toContain("SINGLE recurring weakness");🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/shared/skill-proposer.test.ts` at line 44, The assertion at expect(model.mock.calls[0][0]).toMatch(/recurring weakness/i) uses a regex while the rest of the file uses exact fragment checks; replace this regex assertion with an exact fragment assertion (use toContain with the string "recurring weakness") on the same value (model.mock.calls[0][0]) so it matches the file's consistency pattern.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/shared/skill-proposer.test.ts`:
- Line 44: The assertion at expect(model.mock.calls[0][0]).toMatch(/recurring
weakness/i) uses a regex while the rest of the file uses exact fragment checks;
replace this regex assertion with an exact fragment assertion (use toContain
with the string "recurring weakness") on the same value (model.mock.calls[0][0])
so it matches the file's consistency pattern.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 56b4207b-7fbc-4cd4-a5b1-0b5cd451d57b
📒 Files selected for processing (2)
src/skillify/skill-proposer.tstests/shared/skill-proposer.test.ts
|
Closing — the premise didn't hold. A 5×-per-skill A/B showed the prompt change is within noise (baseline 65% vs 60%), so there's no optimizer improvement to ship. The earlier 3/4 was nondeterminism. Apologies for the churn. |
What
Hardens the SkillOpt proposer's output parsing and restructures its prompt:
[..last-]fallback. (Genuine robustness fix.)On the benchmark — the prompt change does NOT improve quality
Tested against the original proposer on 4 org skills with real failures, judging each edit vs the current body 5× per skill (rate, not a single noisy verdict):
Within noise (Δ5%), nominally worse. An earlier single run showed 3/4 but did not reproduce — that was judge nondeterminism. So the prompt rewrite has no measured benefit; the keepable part of this PR is the parser hardening + tests. n=4 is too small to detect an optimizer gain regardless.
Tests
26 green, typecheck clean.