feat(skill): distill an OpenKB wiki into a redistributable Anthropic Skill#57
feat(skill): distill an OpenKB wiki into a redistributable Anthropic Skill#57KylinMountain wants to merge 33 commits into
Conversation
The previous c.islower() check accepted Unicode lowercase letters like 'é' and 'ü', contradicting the [a-z0-9-] docstring promise. Names are used as filesystem directories and YAML frontmatter, so the slug must stay ASCII. Add explicit a-z / 0-9 range check + 2 regression tests.
The plan-verbatim implementation omitted the 'owner' field at the top level. Some marketplace consumers (Claude Code's /plugin marketplace add) expect 'owner' for listing/attribution. Derive from git config; fall back to 'openkb-user' if git isn't configured. Mirror as plugin-level 'author' to match the existing repo convention.
The plan called for verifying that wiki/concepts/ or wiki/summaries/ has actual files before allowing skill compilation. Earlier impl loosened this to 'any file in wiki/', which silently accepted freshly-init'd KBs (openkb init pre-creates empty concepts/ + summaries/ dirs). Restore the strict check + populate test fixture + add regression test for the freshly-init'd case.
- /skill new: catch shlex.split ValueError on unclosed quotes so a typo doesn't crash the chat REPL - write_kb_file: reject bare directory paths (e.g. 'output') that would otherwise raise IsADirectoryError on write_text - chat.py: drop stale build_query_agent import (chat now uses build_chat_agent) - test_chat_slash_commands.py: update patch target from build_query_agent -> build_chat_agent so the test exercises the right symbol - Add tests/test_write_kb_file.py covering allow-list, traversal, bare-directory rejection
C1: /skill new in chat had only name validation — no wiki-exists check, no wiki-content check, no overwrite guard. The CLI had all three. Extract the gates into _preflight_skill_new and call from both. Add explicit 'remove existing skill first' message in chat (no -y equivalent there). C2: System prompt advertised tool names (list_wiki_dir, read_wiki_file, write_skill_file) that didn't match what was registered with @function_tool (list_wiki, read_wiki, write_skill). LLM saw the registered names; prompt references would confuse it. Rename the wrappers to match. I1: query_wiki was a sync @function_tool calling asyncio.run() on run_query — works only because openai-agents SDK runs sync tools on worker threads. Convert to async @function_tool so the runner awaits it in the same loop, eliminating the nested-asyncio fragility. Add 2 regression tests for the chat safety gates.
The previous impl used the KB directory name as both the marketplace 'name' and the plugin 'name', and stitched together a metadata description by truncating the first skill's SKILL.md description at 200 chars (often mid-word). Lock the convention to match skills/openkb/.claude-plugin/marketplace.json from the official skill: - marketplace name: 'vectify' (always) - plugin name: 'openkb' (always) - description: fixed string, no SKILL.md content injection, no truncation Different KBs are distinguished by <owner>/<repo> URL, not manifest name. Users get one canonical install command (/plugin install openkb@vectify) regardless of which KB they're consuming. Also fix _git_owner to pass cwd=kb_dir so 'openkb --kb-dir ... skill new' run from anywhere reads the KB's git config, not the process CWD.
Code reviewFound 1 issue:
OpenKB/openkb/agent/skill_compiler.py Lines 124 to 131 in 7467200 Call sites that catch only Lines 1490 to 1495 in 7467200 Lines 540 to 546 in 7467200 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
…ator - Rename openkb/agent/skill_compiler.py → skill_creator.py (and associated symbols + prompt file). The existing openkb/agent/compiler.py owns 'compile' for raw → wiki; this module generates a skill from compiled wiki content, which is closer to 'create' / 'distill'. Disambiguates the verb in one package. - Translate MaxTurnsExceeded into a friendly RuntimeError inside run_skill_create. Both the CLI and chat call sites only catch RuntimeError; the SDK exception leaked a Python traceback before. - Defer the rmtree-on-overwrite until after _setup_llm_key and load_config succeed. Previously an unset API key would wipe the existing skill output with nothing to replace it. - Fix marketplace.py module docstring: don't claim chat-side SKILL.md edits regenerate the manifest (they don't). - Drop unused yes_flag from _preflight_skill_new + rewrite its docstring to match what the function actually does. - Clean up github-code-quality bot findings: unused pytest imports in 2 test files, unused 'manifest' local in test_marketplace.py (replaced with the assertion the test intended), redundant in-function stdlib imports in openkb/cli.py. Add 2 regression tests: - test_run_skill_create_translates_max_turns_to_runtime_error - test_skill_new_keeps_existing_skill_when_key_setup_fails
Reframe the project mental model: the wiki is the substrate, and several primitives (query, chat, skill new) generate output from it. Add 'Drop in a book. Out comes a digital expert.' slogan to the Skill Factory subsection. - Features list split into 'Wiki foundation' (compile + maintain) and 'Generators' (query / chat / Skill Factory) - Quick Start adds step 6 — distill a skill - Architecture diagram extended to show the wiki branching into query/chat + Skill Factory + future generators (ppt/podcast/…) - Usage section regrouped under 🧱 Wiki Foundation and ✨ Generators - Skill Factory subsection promoted with the slogan as its heading and a concrete example folder tree - Chat slash command list updated to include /skill new
Borrows from Anthropic's skill-creator: each time 'openkb skill new -y' would overwrite an existing skill, the old version is copied to <kb>/output/skills/<name>-workspace/iteration-N/ instead of being destroyed. Iteration numbers monotonically increase. Each iteration carries a diff.md showing description + reference-file delta vs the previous version. New commands: openkb skill history <name> list past iterations openkb skill rollback <name> restore latest iteration openkb skill rollback --to N restore specific iteration Tests cover: iteration numbering, restore-from-N, restore-from-latest, diff content (description / added refs / removed refs).
Borrows from Codex skill-creator's quick_validate.py: catches the common failure modes that would make a skill unloadable before distribution — missing/malformed frontmatter, name/dir mismatch, oversized files, broken references/* wikilinks, non-stdlib imports in scripts/* (strict mode). New CLI: openkb skill validate validate all compiled skills openkb skill validate <name> validate one openkb skill validate --strict treat warnings as failures openkb skill new auto-runs validation after compile and surfaces errors + warnings so the user knows whether the freshly-compiled skill is well-formed. Doesn't block marketplace.json regeneration — the files are on disk and the user can fix or rollback.
Borrows from Anthropic skill-creator's evaluation loop, simplified for
v0.3: measure whether a skill's description: field actually fires when
it should and stays quiet when it shouldn't. The description is the
activation signal other agents read; a vague description silently fails
to load when it ought to.
Flow:
1. LLM generates N should-trigger + N should-not prompts from the
description only
2. Grader LLM scores each: 'should this description activate this
skill for this question?'
3. Compare to ground truth, print pass rate + misses
New CLI:
openkb skill eval <name> run eval (10+10 default)
openkb skill eval <name> --save persist prompts to disk
openkb skill eval <name> --eval-set X reuse saved prompts
openkb skill eval <name> --count N override prompt count
Tests mock Runner.run for both generator and grader — no real LLM
calls in CI. Saved eval sets live at .openkb/eval-sets/<name>.json
for reproducibility.
Skill Factory section now lists the 4 quality-gate commands borrowed from Codex (validate) and Anthropic skill-creator (eval + iteration): openkb skill validate structural lint (frontmatter, sizes, refs) openkb skill eval trigger-accuracy test of the description openkb skill history list past iterations openkb skill rollback restore a previous iteration The slogan promised a 'digital expert' — these commands are what makes the output worth that label.
Code review round-2 flagged the eval pipeline reintroducing the MaxTurnsExceeded/JSONDecodeError traceback leak that round-1 caught for skill new. Apply the same shim inside skill_evaluator + 4 other carryover items: - Translate MaxTurnsExceeded and json.JSONDecodeError to RuntimeError inside generate_eval_set and grade_one. CLI catch (RuntimeError) now covers both. - Wrap _setup_llm_key in skill_eval with the same try/except/exit pattern as skill_new / query / chat. - Move openkb/skill_evaluator.py -> openkb/agent/skill_evaluator.py. Modules that construct Agent live under openkb/agent/ per repo convention; top-level openkb/ keeps marketplace + generator (no agents SDK). - Validator: reject '<' / '>' in description (Anthropic parser requirement); warn on unknown frontmatter keys (Anthropic spec allows a fixed set). - Drop redundant in-function 'import asyncio' from skill_eval (already at module top). - Drop unused EvalMiss import from tests. - Validator module docstring updated to enumerate all checks. Also delete community contribution scaffolding (CONTRIBUTING.md + .github/PULL_REQUEST_TEMPLATE/skill_submission.md) - premature for the project's current stage; will revisit when real contributors arrive.
All skill-related code now lives in a single openkb/skill/ subpackage (7 modules + __init__). Drop the redundant 'skill_' prefix on filenames since the package qualifies them already. Moves: openkb/generator.py -> openkb/skill/generator.py openkb/marketplace.py -> openkb/skill/marketplace.py openkb/skill_validator.py -> openkb/skill/validator.py openkb/skill_workspace.py -> openkb/skill/workspace.py openkb/agent/skill_creator.py -> openkb/skill/creator.py openkb/agent/skill_tools.py -> openkb/skill/tools.py openkb/agent/skill_evaluator.py -> openkb/skill/evaluator.py generator.py + marketplace.py go under skill/ for now (v0.x only has skills); they're nominally generic primitives but YAGNI -- when a second artifact type (ppt / podcast / report) actually lands, those two will move back out to openkb/<shared>/. No behavioral changes. All imports + test patch targets updated. Test suite stays at 494 passed.
The Features list at the top duplicated everything already covered by Quick Start (step-by-step feature tour) + Usage (Wiki Foundation + Generators tables). It also duplicated the Skill Factory slogan that now lives canonically in the Skill Factory subsection. Replace 17 lines with a one-sentence pointer to Usage. 405 → 389 lines. Slogan now appears exactly once (in the Skill Factory subsection) instead of 5 places. Other duplicates (PageIndex mentions, cross-CLI install instructions, Karpathy comparison table) left for now — they target different audiences (scanners vs deep readers) and aren't worth touching in this pass.
Code review (design consistency)Reviewed through the lens requested: 设计的一致性. Found 4 issues.
OpenKB/openkb/skill/workspace.py Lines 111 to 120 in 1679c8c
Lines 552 to 560 in 1679c8c Lines 1520 to 1538 in 1679c8c
OpenKB/openkb/skill/workspace.py Lines 29 to 37 in 1679c8c
OpenKB/openkb/skill/validator.py Lines 195 to 210 in 1679c8c OpenKB/openkb/skill/marketplace.py Lines 29 to 40 in 1679c8c OpenKB/openkb/skill/workspace.py Lines 125 to 148 in 1679c8c Lower-confidence consistency notes (not blocking): 🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
Code review (Skills domain quality)Reviewed through the lens requested: 能否真的把一本书蒸馏成专家. The short answer is: today it produces a well-formed knowledge briefing, not an expert. Four structural reasons: 1. OpenKB/openkb/skill/evaluator.py Lines 87 to 188 in 1679c8c 2. The agent reads compressed wiki, never the raw source. The compile prompt's working method directs the agent to read OpenKB/openkb/prompts/skill_create.md Lines 77 to 96 in 1679c8c 3. Granularity mismatch with the Anthropic Skills model. Reference Anthropic skills are narrow operational tools — 4. Source-grounded citation is a suggestion, not a gate. The prompt instructs the agent to cite non-trivial claims with OpenKB/openkb/skill/validator.py Lines 157 to 195 in 1679c8c Verdict. Drop a book in today and you reliably get: a structurally valid Markdown skill with a passable description, navigable wikilinks to whichever concept pages the agent picked, and The minimum changes that would close the largest part of the gap:
🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
Two rounds of review fixes on top of the skill factory. Design consistency (from PR-level review): - workspace: restore_iteration now saves the current skill before overwriting, matching skill new's reversibility invariant — a user who edits files in chat then rolls back keeps those edits as a new iteration - generator: validate_skill moved inside Generator.run() so /skill new in chat gets the same quality gate as openkb skill new (was CLI-only) - skill/__init__: single-source path helpers (skills_root / skill_dir / skill_workspace_dir) and frontmatter parser (extract_frontmatter / extract_description). Removes duplicated path construction across 6+ sites and four divergent frontmatter implementations - cli/chat: route every "output/skills/<name>" construction through the helper; add missing "Run openkb init first." hint to skill validate / skill eval; capitalize /skill autocomplete; fix stale comment Skills domain quality (the compile pipeline can actually distill now): - creator/tools: skill agent gets the query agent's deep-retrieval toolset — get_page_content for page-range source reads on PageIndex docs, get_image for figures. query_wiki demoted to narrow follow-ups only; primary traversal is direct file reads - skill_create.md: rewritten end-to-end. Working method now mirrors query's 6-step strategy (survey -> summaries -> sources via full_text pointer -> concepts -> draft -> review-and-revise). Required output adds a "Core decision rules" section (>=5 if-X-then-Y rules) and a "Known gaps" section. Description-writing rules optimize for trigger predicate (situations, keywords, exclusions), not topic accuracy - evaluator: was a tautology (generator and grader both saw only the description -> a LOREM IPSUM body could pass 100%). Now the generator reads body + reference excerpts so prompts reflect actual claims, and a second grader (grade_coverage) checks whether the body has substance to support what the description promises. CLI surfaces both Trigger accuracy and Body coverage - validator: foreign-wikilink gate. SKILL.md and references/*.md cannot contain [[concepts/]] / [[summaries/]] / [[sources/]] — those point at the producer's wiki, which doesn't ship; on the consumer's machine they are dead links plus wasted context tokens. Only [[references/<slug>]] (which ships with the skill) is valid - prompt's source-use rules: paraphrase, short quotes <=40 words, no bulk copying; provenance audit is the producer's responsibility at compile time, not something to ship in the artifact Tests: +9 new (rollback preservation, body-coverage grader, foreign wikilink detection in body and references). 504 passing.
Code reviewReviewed commit ce918cf. Found 1 issue:
OpenKB/openkb/skill/marketplace.py Lines 26 to 30 in ce918cf Lower-confidence observations worth a glance (not blocking):
🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
Address self-review findings on ce918cf. - marketplace: drop unused `extract_description` import left over from the frontmatter-parser consolidation. The manifest builder emits only fixed strings; no per-skill description is interpolated here. - validator: scope the foreign-wikilink scan. Was running on the full SKILL.md text including frontmatter, which produced a body-pointing error message even when the offending wikilink was inside the description. Now scans the description and body separately, with location-specific error wording. - skill/__init__: add `extract_body(text)` — a line-anchored body extractor that mirrors `extract_frontmatter`'s logic. The validator and evaluator both route through it, replacing the brittle `text.split("---", 2)[-1]` shortcut that mis-handled bodies starting with a Markdown horizontal rule. - evaluator: `grade_coverage` no longer fail-closes ambiguous LLM outputs to "unsupported". A third "ambiguous" verdict surfaces grader malfunction as a distinct state on `EvalResult.coverage_ambiguous`, which is excluded from both numerator and denominator of `coverage_rate` so a garbled grader doesn't masquerade as a hollow skill. CLI prints a separate WARN block when ambiguous outputs occur. - evaluator: lift `from agents.exceptions import MaxTurnsExceeded` to the module top; the three intra-function imports it had were not circular-import workarounds. Tests: +1 covering the ambiguous-vs-unsupported segregation; the previous fail-closed test is rewritten to assert the new "ambiguous" state. 505 passing.
The eval-set generator, trigger grader, and coverage grader have no tools — they only produce text. OpenAI's API rejects `parallel_tool_calls` when `tools` is unset: Invalid value for 'parallel_tool_calls': 'parallel_tool_calls' is only allowed when 'tools' are specified. `skill eval` would die at the very first LLM call, before any prompt was graded. Tests masked it because they patch `Runner.run` and never hit the LiteLLM layer. Fix: stop passing `model_settings=ModelSettings(parallel_tool_calls=False)` on those three agents — the flag is both invalid and meaningless when the agent has zero tools to parallelise. `ModelSettings` is no longer referenced anywhere in this module, so the import is dropped too. Pre-existing bug carried over from the original evaluator design; this is the first time anyone has run `skill eval` end-to-end against a real LLM (everything else mocks Runner.run).
uv's recommended practice since 0.5 is to commit the lockfile for both applications and libraries — pins the dev environment so contributors and CI resolve identical versions, and dep drift shows up as a reviewable diff. This repo has never committed it (not gitignored either — just plain untracked from before uv-first workflow). Establishing it now as source of truth. Downstream PyPI consumers continue to resolve via `pyproject.toml` ranges, so this is purely a dev-side change.
Code review (performance)Reviewed b6607ec through 5 opus-model agents. Three issues ≥ 80 confidence:
Lower-confidence observations worth a glance:
🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
Three findings from the performance review on b6607ec: 1. `run_eval` was awaiting ~30 LLM calls sequentially (20 trigger graders + 10 coverage graders). Each prompt is independent — same `desc`/`content` inputs, results accumulated in eval_set order — so `asyncio.gather` is correctness-preserving. Wrap in a `Semaphore(EVAL_CONCURRENCY=8)` to bound simultaneous requests under provider rate limits. Expected wall-clock cut for `openkb skill eval` is roughly 4-15x depending on provider latency, with the floor set by the semaphore. 2. The compile agent had `parallel_tool_calls=False`, forcing every read tool (`list_wiki_dir`, `read_wiki_file`, `get_page_content`) into its own turn. The early phase of compile is naturally a read-fan-out (survey directories, read N summaries, follow `full_text` pointers to source page-ranges). Allowing parallel tool calls lets the model batch independent reads, saving roughly 5-10 outer turns per compile (~20-40s at Opus-class latencies). Writes serialise naturally because each `write_skill_file` depends on accumulated reads. 3. `query_wiki` is a nested `Runner.run(max_turns=50)` inside the outer compile (`max_turns=80`). The docstring's "narrow follow-ups only" was the only enforcement; a pathological run could spawn many nested calls. Added a per-compile counter via closure: after `QUERY_WIKI_MAX_CALLS=3` invocations, the tool returns an error string steering the agent back to direct file reads. Bounds tail latency without breaking the common case (legitimate cross-document sub-questions still get answered). Prompt-caching for `grade_coverage` (the fourth finding in the review) was deferred: the openai-agents SDK takes `instructions` as a plain string with no hook for emitting Anthropic `cache_control` markers. OpenAI's automatic prefix caching already applies because the system prompt is byte-stable across the 10 coverage calls; closing the Anthropic gap is SDK-level work, not application-level. Tests pass (463; unrelated trafilatura env failure in test_url_ingest not touched).
Code reviewReviewed commit 2b94014 (perf fixes). Found 1 issue:
Lower-confidence observations (non-blocking):
🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
The QUERY_WIKI_MAX_CALLS=3 limit added in 2b94014 was over-engineered defence: - It was based on a theoretical worst-case bound (80 outer turns × 50 inner turns) from a perf review, not on observed misuse. - There's no telemetry showing real compiles ever approach pathological query_wiki usage. - The prompt's "narrow follow-ups only" guidance plus the docstring ("nested LLM call, prefer direct reads") are already the right primary control. Adding a structural hard-cap on top of that says "we trust the agent to make 80 reasoning steps but not to count to 3", which is inconsistent. - The cap had a real cost: a legitimately complex cross-document intent that genuinely needed a 4th query_wiki got cut off and forced back to direct reads, silently degrading skill quality. Also fixes the docstring bug a reviewer flagged on 2b94014: the docstring referenced the literal Python name `QUERY_WIKI_MAX_CALLS` rather than the value, so the model saw a meaningless symbol. Removing the cap removes the docstring claim too. If runaway compiles ever show up in telemetry, prefer a logged warning (observe-and-tune) over a silent block.
Both attributions were wrong: - Structural validation was credited to "Codex skill-creator" at github.com/openai/skills — but that repo has no skill-creator, and validator.py is explicitly modeled on Anthropic's quick_validate.py (see the module docstring). - Trigger-accuracy evals were credited to Anthropic skill-creator — but evaluator.py has no such inspiration in its docstrings; the design (description-only trigger grader + body-aware coverage grader) is OpenKB's own. Iteration pattern in workspace.py is genuinely borrowed from Anthropic skill-creator, and that credit lives in the module docstring where it belongs. Reframed the README line to describe what the gates do, without parading non-existent or backwards lineage.
Same problem as the README "Quality gates" line just removed: the attribution is louder than the actual lineage. The workspace module implements directory backup with monotonically numbered iteration slots — a generic backup pattern, not anything specific to Anthropic's skill-creator. The inspiration was looking at their SKILL.md while designing the rollback story, not copying any algorithm. Calling it "Borrowed from" overstates that. Keep the rest of the docstring (what the module does, when it runs) and drop the false-pedigree footer.
Summary
openkb skill new <name> "<prompt>"compiles the wiki into anAnthropic-Skills directory at
<kb>/output/skills/<name>/./skill newslash command insideopenkb chat— same primitive,conversational front end, lets users iterate via natural-language
follow-ups (chat agent now has Write access to
<kb>/output/**+<kb>/wiki/explorations/**).<kb>/.claude-plugin/marketplace.jsonlists allcompiled skills; pushing the KB to GitHub makes
npx skills@latest add <owner>/<repo>work end-to-end.Generatorprimitive (openkb/generator.py) is architected so futureppt/podcast/reporttargets slot in without restructuring.skill_submission PR template) — no new CLI surface needed for
submission.
Architecture
Reuses the existing
openai-agentsSDK + LiteLLM stack fromopenkb.agent.query.The skill-compile agent is constructed via
build_skill_compile_agentwitha system prompt from
openkb/prompts/skill_compile.mdinterpolated withthe user's intent + skill name + wiki schema. Tools are scoped: read wiki,
query wiki, write only within
<kb>/output/skills/<name>/. Marketplace.jsonregenerates synchronously after the agent finishes (no LLM call).
Test plan
uv run pytest— full suite green (430+ tests)overwrite logic, marketplace.json regen all covered by unit tests
/skill newcovered by 3 dedicated tests + write_kb_fileallow-list covered by 9 dedicated tests
openkb skill new <name> "<intent>" -yagainst a real KB with compiled wiki content, verify the resulting
output/skills/<name>/SKILL.mdactivates appropriately in ClaudeCode after
cp -r output/skills/<name> ~/.claude/skills/. Speccriteria from
docs/superpowers/specs/2026-05-18-skill-factory-design.md§11 success criteria.
openkb chat, run/skill new, then ask theagent to tweak the description or a references/ file; verify the
file actually changes on disk.
Spec:
docs/superpowers/specs/2026-05-18-skill-factory-design.md(gitignored locally; design notes in PR review)