Skip to content

feat(skill): distill an OpenKB wiki into a redistributable Anthropic Skill#57

Open
KylinMountain wants to merge 33 commits into
mainfrom
feat/skill-factory
Open

feat(skill): distill an OpenKB wiki into a redistributable Anthropic Skill#57
KylinMountain wants to merge 33 commits into
mainfrom
feat/skill-factory

Conversation

@KylinMountain
Copy link
Copy Markdown
Collaborator

Summary

  • New openkb skill new <name> "<prompt>" compiles the wiki into an
    Anthropic-Skills directory at <kb>/output/skills/<name>/.
  • New /skill new slash command inside openkb chat — same primitive,
    conversational front end, lets users iterate via natural-language
    follow-ups (chat agent now has Write access to <kb>/output/** +
    <kb>/wiki/explorations/**).
  • Auto-regenerated <kb>/.claude-plugin/marketplace.json lists all
    compiled skills; pushing the KB to GitHub makes npx skills@latest add <owner>/<repo> work end-to-end.
  • Generator primitive (openkb/generator.py) is architected so future
    ppt / podcast / report targets slot in without restructuring.
  • Community contribution path documented (CONTRIBUTING.md +
    skill_submission PR template) — no new CLI surface needed for
    submission.

Architecture

Reuses the existing openai-agents SDK + LiteLLM stack from openkb.agent.query.
The skill-compile agent is constructed via build_skill_compile_agent with
a system prompt from openkb/prompts/skill_compile.md interpolated with
the user's intent + skill name + wiki schema. Tools are scoped: read wiki,
query wiki, write only within <kb>/output/skills/<name>/. Marketplace.json
regenerates synchronously after the agent finishes (no LLM call).

Test plan

  • uv run pytest — full suite green (430+ tests)
  • CLI command name validation, KB detection, wiki-content gate,
    overwrite logic, marketplace.json regen all covered by unit tests
  • Chat /skill new covered by 3 dedicated tests + write_kb_file
    allow-list covered by 9 dedicated tests
  • Manual acceptance: run openkb skill new <name> "<intent>" -y
    against a real KB with compiled wiki content, verify the resulting
    output/skills/<name>/SKILL.md activates appropriately in Claude
    Code after cp -r output/skills/<name> ~/.claude/skills/. Spec
    criteria from docs/superpowers/specs/2026-05-18-skill-factory-design.md
    §11 success criteria.
  • Iteration: in openkb chat, run /skill new, then ask the
    agent to tweak the description or a references/ file; verify the
    file actually changes on disk.

Spec: docs/superpowers/specs/2026-05-18-skill-factory-design.md (gitignored locally; design notes in PR review)

The previous c.islower() check accepted Unicode lowercase letters like
'é' and 'ü', contradicting the [a-z0-9-] docstring promise. Names are
used as filesystem directories and YAML frontmatter, so the slug must
stay ASCII. Add explicit a-z / 0-9 range check + 2 regression tests.
The plan-verbatim implementation omitted the 'owner' field at the top
level. Some marketplace consumers (Claude Code's /plugin marketplace add)
expect 'owner' for listing/attribution. Derive from git config; fall back
to 'openkb-user' if git isn't configured. Mirror as plugin-level 'author'
to match the existing repo convention.
The plan called for verifying that wiki/concepts/ or wiki/summaries/
has actual files before allowing skill compilation. Earlier impl loosened
this to 'any file in wiki/', which silently accepted freshly-init'd KBs
(openkb init pre-creates empty concepts/ + summaries/ dirs). Restore
the strict check + populate test fixture + add regression test for
the freshly-init'd case.
- /skill new: catch shlex.split ValueError on unclosed quotes so a typo
  doesn't crash the chat REPL
- write_kb_file: reject bare directory paths (e.g. 'output') that would
  otherwise raise IsADirectoryError on write_text
- chat.py: drop stale build_query_agent import (chat now uses build_chat_agent)
- test_chat_slash_commands.py: update patch target from build_query_agent
  -> build_chat_agent so the test exercises the right symbol
- Add tests/test_write_kb_file.py covering allow-list, traversal,
  bare-directory rejection
Comment thread tests/test_marketplace.py Fixed
Comment thread openkb/cli.py Fixed
Comment thread tests/test_skill_cli.py Fixed
Comment thread tests/test_skill_tools.py Fixed
C1: /skill new in chat had only name validation — no wiki-exists check,
no wiki-content check, no overwrite guard. The CLI had all three. Extract
the gates into _preflight_skill_new and call from both. Add explicit
'remove existing skill first' message in chat (no -y equivalent there).

C2: System prompt advertised tool names (list_wiki_dir, read_wiki_file,
write_skill_file) that didn't match what was registered with
@function_tool (list_wiki, read_wiki, write_skill). LLM saw the registered
names; prompt references would confuse it. Rename the wrappers to match.

I1: query_wiki was a sync @function_tool calling asyncio.run() on
run_query — works only because openai-agents SDK runs sync tools on
worker threads. Convert to async @function_tool so the runner awaits it
in the same loop, eliminating the nested-asyncio fragility.

Add 2 regression tests for the chat safety gates.
Comment thread openkb/cli.py Fixed
The previous impl used the KB directory name as both the marketplace
'name' and the plugin 'name', and stitched together a metadata
description by truncating the first skill's SKILL.md description at
200 chars (often mid-word). Lock the convention to match
skills/openkb/.claude-plugin/marketplace.json from the official skill:

- marketplace name: 'vectify' (always)
- plugin name: 'openkb' (always)
- description: fixed string, no SKILL.md content injection, no truncation

Different KBs are distinguished by <owner>/<repo> URL, not manifest
name. Users get one canonical install command (/plugin install
openkb@vectify) regardless of which KB they're consuming.

Also fix _git_owner to pass cwd=kb_dir so 'openkb --kb-dir ... skill
new' run from anywhere reads the KB's git config, not the process CWD.
Comment thread tests/test_marketplace.py Fixed
@KylinMountain
Copy link
Copy Markdown
Collaborator Author

Code review

Found 1 issue:

  1. MaxTurnsExceeded is not caught — agent hitting MAX_TURNS = 80 produces an unhandled traceback instead of a friendly error. MaxTurnsExceeded inherits from AgentsException, not RuntimeError, but both call sites (CLI and chat) only catch RuntimeError. Either widen the catch to Exception (or AgentsException + RuntimeError), or have run_skill_compile translate the SDK exception into a RuntimeError with a clear message.

# Single user message kicks off the compile. The system prompt already
# contains the intent — this just nudges the agent to start.
seed = (
f"Compile the skill '{skill_name}'. Follow the system prompt's "
f"working method. Read the wiki, then write the skill files."
)
await Runner.run(agent, seed, max_turns=MAX_TURNS)

Call sites that catch only RuntimeError:

OpenKB/openkb/cli.py

Lines 1490 to 1495 in 7467200

kb_dir=kb_dir,
model=model,
)
asyncio.run(gen.run())
except RuntimeError as exc:
click.echo(f"[ERROR] {exc}", err=True)

OpenKB/openkb/agent/chat.py

Lines 540 to 546 in 7467200

# Load model from KB config
from openkb.config import load_config, DEFAULT_CONFIG
config = load_config(kb_dir / ".openkb" / "config.yaml")
model = config.get("model", DEFAULT_CONFIG["model"])
from openkb.generator import Generator
_fmt(style, ("class:slash.help", f"Compiling skill '{name}'...\n"))

Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

…ator

- Rename openkb/agent/skill_compiler.py → skill_creator.py (and
  associated symbols + prompt file). The existing
  openkb/agent/compiler.py owns 'compile' for raw → wiki; this module
  generates a skill from compiled wiki content, which is closer to
  'create' / 'distill'. Disambiguates the verb in one package.

- Translate MaxTurnsExceeded into a friendly RuntimeError inside
  run_skill_create. Both the CLI and chat call sites only catch
  RuntimeError; the SDK exception leaked a Python traceback before.

- Defer the rmtree-on-overwrite until after _setup_llm_key and
  load_config succeed. Previously an unset API key would wipe the
  existing skill output with nothing to replace it.

- Fix marketplace.py module docstring: don't claim chat-side SKILL.md
  edits regenerate the manifest (they don't).

- Drop unused yes_flag from _preflight_skill_new + rewrite its
  docstring to match what the function actually does.

- Clean up github-code-quality bot findings: unused pytest imports
  in 2 test files, unused 'manifest' local in test_marketplace.py
  (replaced with the assertion the test intended), redundant
  in-function stdlib imports in openkb/cli.py.

Add 2 regression tests:
- test_run_skill_create_translates_max_turns_to_runtime_error
- test_skill_new_keeps_existing_skill_when_key_setup_fails
Reframe the project mental model: the wiki is the substrate, and
several primitives (query, chat, skill new) generate output from it.
Add 'Drop in a book. Out comes a digital expert.' slogan to the
Skill Factory subsection.

- Features list split into 'Wiki foundation' (compile + maintain) and
  'Generators' (query / chat / Skill Factory)
- Quick Start adds step 6 — distill a skill
- Architecture diagram extended to show the wiki branching into
  query/chat + Skill Factory + future generators (ppt/podcast/…)
- Usage section regrouped under 🧱 Wiki Foundation and ✨ Generators
- Skill Factory subsection promoted with the slogan as its heading
  and a concrete example folder tree
- Chat slash command list updated to include /skill new
Borrows from Anthropic's skill-creator: each time 'openkb skill new -y'
would overwrite an existing skill, the old version is copied to
<kb>/output/skills/<name>-workspace/iteration-N/ instead of being
destroyed. Iteration numbers monotonically increase. Each iteration
carries a diff.md showing description + reference-file delta vs the
previous version.

New commands:
  openkb skill history <name>         list past iterations
  openkb skill rollback <name>        restore latest iteration
  openkb skill rollback --to N        restore specific iteration

Tests cover: iteration numbering, restore-from-N, restore-from-latest,
diff content (description / added refs / removed refs).
Borrows from Codex skill-creator's quick_validate.py: catches the
common failure modes that would make a skill unloadable before
distribution — missing/malformed frontmatter, name/dir mismatch,
oversized files, broken references/* wikilinks, non-stdlib imports
in scripts/* (strict mode).

New CLI:
  openkb skill validate              validate all compiled skills
  openkb skill validate <name>       validate one
  openkb skill validate --strict     treat warnings as failures

openkb skill new auto-runs validation after compile and surfaces
errors + warnings so the user knows whether the freshly-compiled
skill is well-formed. Doesn't block marketplace.json regeneration —
the files are on disk and the user can fix or rollback.
Borrows from Anthropic skill-creator's evaluation loop, simplified for
v0.3: measure whether a skill's description: field actually fires when
it should and stays quiet when it shouldn't. The description is the
activation signal other agents read; a vague description silently fails
to load when it ought to.

Flow:
  1. LLM generates N should-trigger + N should-not prompts from the
     description only
  2. Grader LLM scores each: 'should this description activate this
     skill for this question?'
  3. Compare to ground truth, print pass rate + misses

New CLI:
  openkb skill eval <name>                run eval (10+10 default)
  openkb skill eval <name> --save         persist prompts to disk
  openkb skill eval <name> --eval-set X   reuse saved prompts
  openkb skill eval <name> --count N      override prompt count

Tests mock Runner.run for both generator and grader — no real LLM
calls in CI. Saved eval sets live at .openkb/eval-sets/<name>.json
for reproducibility.
Skill Factory section now lists the 4 quality-gate commands borrowed
from Codex (validate) and Anthropic skill-creator (eval + iteration):

  openkb skill validate    structural lint (frontmatter, sizes, refs)
  openkb skill eval        trigger-accuracy test of the description
  openkb skill history     list past iterations
  openkb skill rollback    restore a previous iteration

The slogan promised a 'digital expert' — these commands are what makes
the output worth that label.
Comment thread openkb/cli.py Fixed
Comment thread tests/test_skill_evaluator.py Fixed
Code review round-2 flagged the eval pipeline reintroducing the
MaxTurnsExceeded/JSONDecodeError traceback leak that round-1 caught
for skill new. Apply the same shim inside skill_evaluator + 4 other
carryover items:

- Translate MaxTurnsExceeded and json.JSONDecodeError to RuntimeError
  inside generate_eval_set and grade_one. CLI catch (RuntimeError) now
  covers both.
- Wrap _setup_llm_key in skill_eval with the same try/except/exit
  pattern as skill_new / query / chat.
- Move openkb/skill_evaluator.py -> openkb/agent/skill_evaluator.py.
  Modules that construct Agent live under openkb/agent/ per repo
  convention; top-level openkb/ keeps marketplace + generator (no
  agents SDK).
- Validator: reject '<' / '>' in description (Anthropic parser
  requirement); warn on unknown frontmatter keys (Anthropic spec
  allows a fixed set).
- Drop redundant in-function 'import asyncio' from skill_eval (already
  at module top).
- Drop unused EvalMiss import from tests.
- Validator module docstring updated to enumerate all checks.

Also delete community contribution scaffolding (CONTRIBUTING.md +
.github/PULL_REQUEST_TEMPLATE/skill_submission.md) - premature for the
project's current stage; will revisit when real contributors arrive.
All skill-related code now lives in a single openkb/skill/ subpackage
(7 modules + __init__). Drop the redundant 'skill_' prefix on filenames
since the package qualifies them already.

Moves:
  openkb/generator.py            -> openkb/skill/generator.py
  openkb/marketplace.py          -> openkb/skill/marketplace.py
  openkb/skill_validator.py      -> openkb/skill/validator.py
  openkb/skill_workspace.py      -> openkb/skill/workspace.py
  openkb/agent/skill_creator.py  -> openkb/skill/creator.py
  openkb/agent/skill_tools.py    -> openkb/skill/tools.py
  openkb/agent/skill_evaluator.py -> openkb/skill/evaluator.py

generator.py + marketplace.py go under skill/ for now (v0.x only has
skills); they're nominally generic primitives but YAGNI -- when a
second artifact type (ppt / podcast / report) actually lands, those
two will move back out to openkb/<shared>/.

No behavioral changes. All imports + test patch targets updated.
Test suite stays at 494 passed.
The Features list at the top duplicated everything already covered by
Quick Start (step-by-step feature tour) + Usage (Wiki Foundation +
Generators tables). It also duplicated the Skill Factory slogan that
now lives canonically in the Skill Factory subsection.

Replace 17 lines with a one-sentence pointer to Usage. 405 → 389 lines.
Slogan now appears exactly once (in the Skill Factory subsection)
instead of 5 places.

Other duplicates (PageIndex mentions, cross-CLI install instructions,
Karpathy comparison table) left for now — they target different
audiences (scanners vs deep readers) and aren't worth touching in this
pass.
@VectifyAI VectifyAI deleted a comment from quanqigu May 20, 2026
@KylinMountain
Copy link
Copy Markdown
Collaborator Author

Code review (design consistency)

Reviewed through the lens requested: 设计的一致性. Found 4 issues.

  1. skill rollback breaks the reversible-mutation invariant established by skill new. skill new saves the current skill to iteration-N/ before overwriting (workspace's whole reason for existing), but restore_iteration calls shutil.rmtree(dest) directly without first preserving the current state. A user who edits skill files in chat (write_kb_file allows this) and then runs openkb skill rollback loses those edits with no recovery path. Forward path is reversible; reverse path is not.

f"Iteration {n} not found for skill {skill_name!r}."
)
src = match
dest = _skill_dir(kb_dir, skill_name)
if dest.exists():
shutil.rmtree(dest)
shutil.copytree(src, dest)
return dest

  1. Quality gate diverges between CLI and chat entry points to the same generator. openkb skill new auto-runs validate_skill after gen.run() and surfaces errors/warnings; /skill new in chat takes the same code path through Generator.run() but skips validation entirely and jumps to the success message. Same operation, different post-compile policy depending on where it was invoked.

OpenKB/openkb/agent/chat.py

Lines 552 to 560 in 1679c8c

kb_dir=kb_dir,
model=model,
)
await gen.run()
except RuntimeError as exc:
_fmt(style, ("class:error", f"[ERROR] {exc}\n"))
return
_fmt(style, ("class:slash.ok", f"Saved: output/skills/{name}/\n"))

OpenKB/openkb/cli.py

Lines 1520 to 1538 in 1679c8c

# Auto-validate the freshly compiled skill. Surface issues but don't
# block — files are on disk and the user can fix or rollback.
from openkb.skill.validator import validate_skill
skill_dir = kb_dir / "output" / "skills" / name
result = validate_skill(skill_dir)
if result.errors or result.warnings:
click.echo("\n[WARN] Validation found issues:")
for err in result.errors:
click.echo(f" ERROR: {err}")
for warn in result.warnings:
click.echo(f" WARN: {warn}")
click.echo(
f"\nRun `openkb skill validate {name}` to re-check, or "
f"`openkb skill rollback {name}` to revert."
)
click.echo(f"\nSaved: output/skills/{name}/")
if saved_iteration is not None:
rel = saved_iteration.relative_to(kb_dir)

  1. output/skills/<name> is constructed independently in 6+ sites, with a 7th formula that's never consumed. creator.py, workspace.py (_skill_dir), marketplace.py (_list_skill_dirs), cli.py (multiple subcommands), and chat.py:533 each compute the path inline. Generator.output_dir introduces a different formula — kb_dir / 'output' / f'{target_type}s' / name — but both call sites discard gen.run()'s return value and rebuild the path themselves. The intended single source of truth is bypassed by every caller, so the new abstraction adds an unused fourth path-formula rather than collapsing the existing duplication.

https://github.com/VectifyAI/OpenKB/blob/1679c8cedac512583003e0d18b0ef19b07cd84db/openkb/skill/generator.py#L52-L67

def _skill_dir(kb_dir: Path, skill_name: str) -> Path:
return kb_dir / "output" / "skills" / skill_name
def _workspace_dir(kb_dir: Path, skill_name: str) -> Path:
return kb_dir / "output" / "skills" / f"{skill_name}-workspace"

  1. SKILL.md frontmatter is parsed by four different implementations across the new package. validator.py uses splitlines()+.index('---'); evaluator.py re-implements the same; marketplace.py uses a regex (_FRONTMATTER_RE/_DESCRIPTION_RE); workspace.py uses a regex plus a 2000-char prefix fallback. Four siblings parsing the same artifact, four edge-case behaviours. A description-parsing change in one will silently diverge from the other three.

def _extract_frontmatter(text: str) -> str | None:
"""Return the YAML body between the first two `---` lines, or None."""
lines = text.splitlines()
if not lines or lines[0].strip() != "---":
return None
try:
end = lines.index("---", 1)
except ValueError:
return None
return "\n".join(lines[1:end])
def _non_stdlib_imports(script: Path) -> set[str]:
"""Return imported module names that aren't in the Python stdlib."""

_FRONTMATTER_RE = re.compile(r"^---\s*\n(.*?)\n---", re.DOTALL)
_DESCRIPTION_RE = re.compile(r"^description:\s*(.+?)\s*$", re.MULTILINE)
def _git_owner(kb_dir: Path) -> dict[str, str]:
"""Read user.name and user.email from git config (run in kb_dir context).
Falls back to placeholders if git isn't configured. ``cwd=kb_dir`` so
that ``git config`` resolves the KB's local-or-walked-up settings,
not the process's working directory at the time of CLI invocation.

_DESC_RE = re.compile(
r"^description:\s*(.*?)\s*$",
re.MULTILINE,
)
def _extract_description(skill_md: Path) -> str:
"""Return the ``description:`` line from a SKILL.md frontmatter,
or empty string if absent."""
if not skill_md.is_file():
return ""
text = skill_md.read_text(encoding="utf-8", errors="replace")
# Only scan the frontmatter block to avoid matching body text.
if text.startswith("---"):
end = text.find("\n---", 3)
head = text[: end if end != -1 else len(text)]
else:
head = text[:2000]
m = _DESC_RE.search(head)
return m.group(1) if m else ""

Lower-confidence consistency notes (not blocking): skill validate/skill eval drop the "Run `openkb init` first." hint that every other command emits; /skill slash-command description is the only lowercase entry in _SLASH_COMMANDS; openkb/skill/tools.py duplicates openkb/agent/tools.py with renamed-for-no-reason functions (list_wiki_dir vs list_wiki_files); stale comment at cli.py:1468 still references openkb/skill_workspace.py.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@KylinMountain
Copy link
Copy Markdown
Collaborator Author

Code review (Skills domain quality)

Reviewed through the lens requested: 能否真的把一本书蒸馏成专家. The short answer is: today it produces a well-formed knowledge briefing, not an expert. Four structural reasons:

1. skill eval measures description self-consistency, not skill quality. Both generate_eval_set and grade_one receive only the skill's description: string — nothing reads the SKILL.md body, the references, or the wiki sources. A skill with a coherent description and a LOREM IPSUM body would score 100%. This is a closed loop, not a quality gate. Because the eval is marketed as how you verify the skill is good, this gap is load-bearing.

async def generate_eval_set(
skill_dir: Path,
*,
model: str,
count: int = EVAL_DEFAULT_COUNT,
) -> list[EvalPrompt]:
"""Use an LLM to generate ``count`` should-trigger + ``count`` should-not
eval prompts based on the skill's description.
"""
desc = _read_description(skill_dir)
instructions = (
"You are designing an evaluation set for a knowledge-base skill. "
f"The skill's activation description is:\n\n"
f" {desc}\n\n"
f"Produce exactly {count} 'should-trigger' user questions (questions where "
f"an agent SHOULD load this skill to answer well) and exactly {count} "
f"'should-not' user questions (plausible-sounding questions about other "
f"topics where this skill is NOT the right tool).\n\n"
f"Output ONLY a JSON object with this exact shape:\n"
f' {{"should_trigger": [...{count} strings...], '
f'"should_not": [...{count} strings...]}}\n\n'
f"No prose. No markdown. Just the JSON object."
)
agent = Agent(
name="eval-set-generator",
instructions=instructions,
model=f"litellm/{model}",
model_settings=ModelSettings(parallel_tool_calls=False),
)
from agents.exceptions import MaxTurnsExceeded
try:
result = await Runner.run(agent, "Generate the eval set now.", max_turns=3)
except MaxTurnsExceeded as exc:
raise RuntimeError(
"Eval set generation hit the max-turn cap. The model may be "
"looping; try a different model or a smaller --count."
) from exc
raw = (result.final_output or "").strip()
# Strip optional code fence
if raw.startswith("```"):
raw = raw.split("\n", 1)[1].rsplit("```", 1)[0].strip()
if raw.startswith("json"):
raw = raw[4:].lstrip()
try:
data = json.loads(raw)
except json.JSONDecodeError as exc:
raise RuntimeError(
f"Eval set generator returned non-JSON output: {exc.msg}. "
f"Try a more capable model — small models often ignore "
f"'output only JSON' instructions. First 200 chars: {raw[:200]!r}"
) from exc
prompts: list[EvalPrompt] = []
for q in data.get("should_trigger", []):
prompts.append(EvalPrompt(question=q, expected="trigger"))
for q in data.get("should_not", []):
prompts.append(EvalPrompt(question=q, expected="no-trigger"))
return prompts
async def grade_one(
description: str,
question: str,
*,
model: str,
) -> Literal["trigger", "no-trigger"]:
"""Ask the grader LLM whether the description suggests this skill
should be loaded for the given question."""
instructions = (
"You are deciding whether an agent should load a specific skill to "
"answer a user question. You will be given the skill's activation "
"description and a single user question. Answer with one word: "
"TRIGGER (load the skill) or NO-TRIGGER (don't load).\n\n"
f"Skill description:\n {description}\n\n"
"Reply with exactly one of: TRIGGER, NO-TRIGGER."
)
agent = Agent(
name="trigger-grader",
instructions=instructions,
model=f"litellm/{model}",
model_settings=ModelSettings(parallel_tool_calls=False),
)
from agents.exceptions import MaxTurnsExceeded
try:
result = await Runner.run(agent, f"Question: {question}", max_turns=2)
except MaxTurnsExceeded as exc:
raise RuntimeError(
f"Trigger grader hit the max-turn cap on question: {question!r}. "
f"Try a more capable model."
) from exc
raw = (result.final_output or "").strip().upper()
if "NO-TRIGGER" in raw or "NO TRIGGER" in raw:
return "no-trigger"
if "TRIGGER" in raw:
return "trigger"
# Default: assume no-trigger on ambiguous output
return "no-trigger"

2. The agent reads compressed wiki, never the raw source. The compile prompt's working method directs the agent to read wiki/concepts/*.md and wiki/summaries/*.md — both of which are themselves LLM-synthesized compressions of the original document. There's no instruction to open wiki/sources/, and the anti-hallucination block even discourages it ("Do not copy large verbatim passages from wiki/sources/"). For a book input, the agent is distilling a summary of a summary; the specific examples, named techniques, edge cases, and counter-intuitive findings — the things that make a book worth distilling — get lost two compression hops above where the agent reads.

1. Read `wiki/index.md` to see what's in the KB.
2. Form a brief mental plan: which concepts and summaries best serve the
user's intent? Which references will you need?
3. Read the relevant `wiki/concepts/*.md` and `wiki/summaries/*.md` files.
4. Write `SKILL.md` first. Then references, then scripts (if any).
5. Self-check: does every `[[references/...]]` link in `SKILL.md` resolve
to a file you actually wrote? Is the description specific? Is the
`name:` frontmatter field exactly `{skill_name}`?
6. Call `done(summary)` with a one-paragraph summary of what you wrote.
## Anti-hallucination rules
* Cite sources for non-trivial claims using `[[concepts/<slug>]]` or
`[[summaries/<doc>]]` wikilinks back to the wiki.
* If the wiki doesn't cover something the user's intent implies, write
what you can and note the gap in `SKILL.md`'s body — do not fabricate.
* Do not copy large verbatim passages from `wiki/sources/`. Summarise and
cite. The skill must be redistributable; bulk-copying source content
could carry copyright risk that the user takes on at submission time.

3. Granularity mismatch with the Anthropic Skills model. Reference Anthropic skills are narrow operational tools — pdf, skill-creator, each with a tight trigger predicate and a procedural body. The PR's pipeline produces one skill per openkb skill new invocation, with no fan-out. The dogfooded ai-native-startup-advisor (in output/skills/) has a description spanning "idea, MVP, launch, and scale" — basically every question about running an AI startup. That description will over-trigger; that's the structural consequence of compiling a whole book into one skill rather than N narrow skills per book. The right unit of compilation is roughly one concept page per skill, which the pipeline doesn't support today.

https://github.com/VectifyAI/OpenKB/blob/1679c8cedac512583003e0d18b0ef19b07cd84db/openkb/skill/creator.py#L101-L154

4. Source-grounded citation is a suggestion, not a gate. The prompt instructs the agent to cite non-trivial claims with [[wikilinks]], but the validator only checks that wikilinks which appear in the body resolve to real files — it does not require any wikilinks to be present. The team's own committed skills/openkb/SKILL.md contains zero body wikilinks. So a skill with hallucinated claims and no audit trail passes validation cleanly. For "expert distillation" framing, this is the difference between a citable specialist and a confident-sounding generalist.

# references/ wikilink resolution
wikilinks = WIKILINK_RE.findall(text)
refs_dir = skill_dir / "references"
for link in wikilinks:
# link may or may not include .md suffix
target = refs_dir / link
if not target.suffix:
target = target.with_suffix(".md")
if not target.exists():
result.errors.append(
f"SKILL.md references [[references/{link}]] but "
f"{target.relative_to(skill_dir)} doesn't exist."
)
# references/*.md file sizes
if refs_dir.is_dir():
for ref in refs_dir.rglob("*.md"):
size = ref.stat().st_size
if size > REFERENCE_MAX_BYTES:
result.errors.append(
f"{ref.relative_to(skill_dir)} is {size} bytes; "
f"max is {REFERENCE_MAX_BYTES} bytes."
)
# scripts/*.py imports — strict only
if strict:
scripts_dir = skill_dir / "scripts"
if scripts_dir.is_dir():
for script in scripts_dir.rglob("*.py"):
bad = _non_stdlib_imports(script)
if bad:
result.warnings.append(
f"{script.relative_to(skill_dir)} imports non-stdlib "
f"modules: {', '.join(sorted(bad))}. Skill scripts run "
f"in unknown environments — prefer stdlib only."
)
return result

Verdict. Drop a book in today and you reliably get: a structurally valid Markdown skill with a passable description, navigable wikilinks to whichever concept pages the agent picked, and skill validate green. You do not reliably get: the book's actual arguments encoded as decision rules, citation-grounded claims, or a trigger predicate narrow enough to avoid over-firing. The marketing claim ("Drop in a book. Out comes a digital expert.") is closer to the ceiling than the floor; the floor is "a well-organized research briefing."

The minimum changes that would close the largest part of the gap:

  • Eval generator/grader read the SKILL.md body and references, not just the description, so a hollow body actually fails.
  • Working method requires touching wiki/sources/ (with the summary's full_text pointer) at least once per skill, not just summaries-of-summaries.
  • A skill improve path that feeds the previous SKILL.md back as context so iteration accumulates rather than restarts.
  • Require ≥1 body wikilink per N words; validator fails skills with zero source-grounded claims.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Two rounds of review fixes on top of the skill factory.

Design consistency (from PR-level review):
- workspace: restore_iteration now saves the current skill before
  overwriting, matching skill new's reversibility invariant — a user who
  edits files in chat then rolls back keeps those edits as a new iteration
- generator: validate_skill moved inside Generator.run() so /skill new in
  chat gets the same quality gate as openkb skill new (was CLI-only)
- skill/__init__: single-source path helpers (skills_root / skill_dir /
  skill_workspace_dir) and frontmatter parser (extract_frontmatter /
  extract_description). Removes duplicated path construction across 6+
  sites and four divergent frontmatter implementations
- cli/chat: route every "output/skills/<name>" construction through the
  helper; add missing "Run openkb init first." hint to skill validate /
  skill eval; capitalize /skill autocomplete; fix stale comment

Skills domain quality (the compile pipeline can actually distill now):
- creator/tools: skill agent gets the query agent's deep-retrieval
  toolset — get_page_content for page-range source reads on PageIndex
  docs, get_image for figures. query_wiki demoted to narrow follow-ups
  only; primary traversal is direct file reads
- skill_create.md: rewritten end-to-end. Working method now mirrors
  query's 6-step strategy (survey -> summaries -> sources via full_text
  pointer -> concepts -> draft -> review-and-revise). Required output
  adds a "Core decision rules" section (>=5 if-X-then-Y rules) and a
  "Known gaps" section. Description-writing rules optimize for trigger
  predicate (situations, keywords, exclusions), not topic accuracy
- evaluator: was a tautology (generator and grader both saw only the
  description -> a LOREM IPSUM body could pass 100%). Now the generator
  reads body + reference excerpts so prompts reflect actual claims, and
  a second grader (grade_coverage) checks whether the body has substance
  to support what the description promises. CLI surfaces both Trigger
  accuracy and Body coverage
- validator: foreign-wikilink gate. SKILL.md and references/*.md cannot
  contain [[concepts/]] / [[summaries/]] / [[sources/]] — those point at
  the producer's wiki, which doesn't ship; on the consumer's machine
  they are dead links plus wasted context tokens. Only
  [[references/<slug>]] (which ships with the skill) is valid
- prompt's source-use rules: paraphrase, short quotes <=40 words, no
  bulk copying; provenance audit is the producer's responsibility at
  compile time, not something to ship in the artifact

Tests: +9 new (rollback preservation, body-coverage grader, foreign
wikilink detection in body and references). 504 passing.
Comment thread openkb/skill/marketplace.py Fixed
@KylinMountain
Copy link
Copy Markdown
Collaborator Author

Code review

Reviewed commit ce918cf. Found 1 issue:

  1. extract_description is imported but never used in openkb/skill/marketplace.py. The function was added during the frontmatter-parser consolidation, but _build_manifest only emits fixed strings — no per-skill description interpolation happens here. Dead import.

from openkb.config import load_config
from openkb.skill import extract_description, skills_root

Lower-confidence observations worth a glance (not blocking):

  • validator.py runs FOREIGN_WIKILINK_RE.findall(text) on the full SKILL.md including frontmatter. If a description: value happens to contain a [[concepts/...]]-shaped literal, the gate fires with a "SKILL.md contains foreign wikilinks" message that points at the body, not the frontmatter. Behavior is arguably correct (a foreign wikilink in description is also a problem) but the error message is misleading. Either narrow the scan to the body or update the wording.

    # Foreign wikilinks. The skill ships *without* the producer's wiki, so
    # any [[concepts/...]] / [[summaries/...]] / [[sources/...]] left in
    # the body or references is a dead link on the consumer's machine plus
    # wasted context tokens. The compile prompt's "Linking rules" section
    # makes this explicit; this is the structural enforcement.
    foreign = FOREIGN_WIKILINK_RE.findall(text)
    if foreign:
    kinds = sorted({k.lower() for k in foreign})
    result.errors.append(
    f"SKILL.md contains foreign wikilinks ({', '.join(kinds)}) back "
    f"to the producer's wiki. Those don't ship with the skill and "
    f"are dead on the consumer's machine — paraphrase the content "
    f"inline or move it into `references/<slug>.md`."
    )
    refs_dir = skill_dir / "references"
    if refs_dir.is_dir():
    for ref in refs_dir.rglob("*.md"):
    ref_text = ref.read_text(encoding="utf-8", errors="replace")
    if FOREIGN_WIKILINK_RE.search(ref_text):
    result.errors.append(
    f"{ref.relative_to(skill_dir)} contains foreign "
    f"wikilinks back to the producer's wiki. References "
    f"ship with the skill and must be self-contained."
    )

  • evaluator.py grade_coverage fail-closes to unsupported on ambiguous LLM output. The intent is honest (don't falsely claim coverage) but a garbled grader response now silently inflates coverage_misses with no trace, polluting coverage_rate. Consider a separate ambiguity counter or surfacing the raw output once for debugging.

    ) from exc
    raw = (result.final_output or "").strip()
    upper = raw.upper()
    verdict: Literal["supported", "unsupported"]
    if "UNSUPPORTED" in upper:
    verdict = "unsupported"
    elif "SUPPORTED" in upper:
    verdict = "supported"
    else:
    # Ambiguous output — treat as unsupported (fail closed).
    verdict = "unsupported"
    reason = ""

  • MaxTurnsExceeded is imported inside grade_coverage only; the other two functions in the same module import it the same way. Not a circular-import workaround — could be lifted to module top for consistency and a small per-call savings.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Address self-review findings on ce918cf.

- marketplace: drop unused `extract_description` import left over from
  the frontmatter-parser consolidation. The manifest builder emits only
  fixed strings; no per-skill description is interpolated here.
- validator: scope the foreign-wikilink scan. Was running on the full
  SKILL.md text including frontmatter, which produced a body-pointing
  error message even when the offending wikilink was inside the
  description. Now scans the description and body separately, with
  location-specific error wording.
- skill/__init__: add `extract_body(text)` — a line-anchored body
  extractor that mirrors `extract_frontmatter`'s logic. The validator
  and evaluator both route through it, replacing the brittle
  `text.split("---", 2)[-1]` shortcut that mis-handled bodies starting
  with a Markdown horizontal rule.
- evaluator: `grade_coverage` no longer fail-closes ambiguous LLM
  outputs to "unsupported". A third "ambiguous" verdict surfaces grader
  malfunction as a distinct state on `EvalResult.coverage_ambiguous`,
  which is excluded from both numerator and denominator of
  `coverage_rate` so a garbled grader doesn't masquerade as a hollow
  skill. CLI prints a separate WARN block when ambiguous outputs occur.
- evaluator: lift `from agents.exceptions import MaxTurnsExceeded` to
  the module top; the three intra-function imports it had were not
  circular-import workarounds.

Tests: +1 covering the ambiguous-vs-unsupported segregation; the
previous fail-closed test is rewritten to assert the new "ambiguous"
state. 505 passing.
The eval-set generator, trigger grader, and coverage grader have no
tools — they only produce text. OpenAI's API rejects
`parallel_tool_calls` when `tools` is unset:

  Invalid value for 'parallel_tool_calls': 'parallel_tool_calls' is
  only allowed when 'tools' are specified.

`skill eval` would die at the very first LLM call, before any prompt
was graded. Tests masked it because they patch `Runner.run` and never
hit the LiteLLM layer.

Fix: stop passing `model_settings=ModelSettings(parallel_tool_calls=False)`
on those three agents — the flag is both invalid and meaningless when
the agent has zero tools to parallelise. `ModelSettings` is no longer
referenced anywhere in this module, so the import is dropped too.

Pre-existing bug carried over from the original evaluator design; this
is the first time anyone has run `skill eval` end-to-end against a
real LLM (everything else mocks Runner.run).
uv's recommended practice since 0.5 is to commit the lockfile for both
applications and libraries — pins the dev environment so contributors
and CI resolve identical versions, and dep drift shows up as a
reviewable diff.

This repo has never committed it (not gitignored either — just plain
untracked from before uv-first workflow). Establishing it now as
source of truth. Downstream PyPI consumers continue to resolve via
`pyproject.toml` ranges, so this is purely a dev-side change.
@KylinMountain
Copy link
Copy Markdown
Collaborator Author

Code review (performance)

Reviewed b6607ec through 5 opus-model agents. Three issues ≥ 80 confidence:

  1. run_eval awaits 30 LLM calls sequentially — the single largest wall-clock win in the PR. Trigger graders and coverage graders have zero state coupling between prompts, but the loop awaits each one before starting the next. For count=10 that's 20 trigger + 10 coverage = ~30 round-trips at ~1-2 s each = ~30-60 s. await asyncio.gather(*[grade_one(...) for p in eval_set]) plus a separate gather for coverage collapses this to ~2 s — a 15-30× speedup. Wrap in asyncio.Semaphore(8) to bound provider concurrency. ~10 LOC fix.

    content = _skill_content_block(skill_dir)
    result = EvalResult(prompts=eval_set)
    for prompt in eval_set:
    graded = await grade_one(desc, prompt.question, model=model)
    if graded != prompt.expected:
    result.misses.append(EvalMiss(prompt=prompt, graded=graded))
    # Body alignment only meaningful on questions the skill claims to
    # handle — for should-not questions the body is correctly empty
    # of relevant material.
    if prompt.expected == "trigger":
    verdict, reason = await grade_coverage(content, prompt.question, model=model)
    if verdict == "ambiguous":
    result.coverage_ambiguous.append(
    CoverageMiss(prompt=prompt, reason=reason)
    )
    elif verdict == "unsupported":
    result.coverage_misses.append(
    CoverageMiss(prompt=prompt, reason=reason)
    )
    return result

  2. grade_coverage re-sends the same ~16 K body+references block 10 times with no prompt-caching marker. For a 50 KB SKILL.md + 3×4 KB references, the coverage block is ~15.7 K input tokens, multiplied by 10 grader calls = ~157 K input tokens per eval. With provider prompt caching (Anthropic cache_control for 90% discount, or OpenAI's automatic ≥1024-token prefix caching), 9 of those 10 calls become cache hits. As coded, skill eval runs ~8× the input tokens of the pre-PR design. At gpt-4o pricing that's roughly $0.48/eval vs $0.06, enough that users will skip eval on cost grounds and lose the coverage signal the PR was designed to add.

    one-line reason. Callers should NOT collapse ``"ambiguous"`` into
    ``"unsupported"`` — see :class:`EvalResult.coverage_ambiguous`.
    """
    instructions = (
    "You are auditing a skill for content quality. You will be given "
    "the skill's body (SKILL.md without frontmatter) and any "
    "reference excerpts, plus a user question that the skill's "
    "description claims to handle. Decide whether the body has "
    "substantive material to answer the question.\n\n"
    "Answer with EXACTLY this two-line shape:\n"
    "VERDICT: SUPPORTED (or UNSUPPORTED)\n"
    "REASON: <one short sentence>\n\n"
    f"{skill_content}"
    )
    agent = Agent(
    name="coverage-grader",
    instructions=instructions,
    model=f"litellm/{model}",
    )
    try:
    result = await Runner.run(agent, f"Question: {question}", max_turns=2)
    except MaxTurnsExceeded as exc:
    raise RuntimeError(
    f"Coverage grader hit the max-turn cap on question: {question!r}. "

  3. Compile agent forces parallel_tool_calls=False. The skill-create agent has a natural read-fan-out phase early in the compile — survey directories, read N summaries, follow full_text pointers to source pages. With parallel_tool_calls=False each read_wiki_file / get_page_content costs one full LLM round-trip. Allowing parallel tool calls lets the model batch independent reads, saving roughly 5-10 outer turns (~20-40 s per compile). Write phase serializes naturally because each write_skill_file depends on accumulated reads, so correctness risk is low.

    done,
    ],
    model=f"litellm/{model}",
    model_settings=ModelSettings(parallel_tool_calls=False),
    )

Lower-confidence observations worth a glance:

  • query_wiki (creator.py) is a nested Runner.run(max_turns=50) inside the outer max_turns=80. Worst case is 80×50=4000 LLM calls before MaxTurnsExceeded. The "narrow follow-ups only" docstring is the only enforcement. Consider a per-compile call counter that returns an error after N invocations (e.g. 3).
  • _skill_content_block is called twice per run_eval (once in generate_eval_set, once at the top of run_eval); each invocation re-reads SKILL.md and references. Trivial CPU/IO win but ~5 LOC to memoize.
  • restore_iteration does copytree → rmtree → copytree. A Path.rename(workspace_slot) instead of the first copytree removes one full tree copy, ~10-50 ms on large skills.
  • REFERENCES_PREVIEW_BYTES = 4000 is per-file. A skill with 8 references already hits 32 KB before the body. A per-skill total budget would protect against ref-bloat.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Three findings from the performance review on b6607ec:

1. `run_eval` was awaiting ~30 LLM calls sequentially (20 trigger
   graders + 10 coverage graders). Each prompt is independent — same
   `desc`/`content` inputs, results accumulated in eval_set order — so
   `asyncio.gather` is correctness-preserving. Wrap in a
   `Semaphore(EVAL_CONCURRENCY=8)` to bound simultaneous requests under
   provider rate limits. Expected wall-clock cut for `openkb skill eval`
   is roughly 4-15x depending on provider latency, with the floor set
   by the semaphore.

2. The compile agent had `parallel_tool_calls=False`, forcing every
   read tool (`list_wiki_dir`, `read_wiki_file`, `get_page_content`)
   into its own turn. The early phase of compile is naturally a
   read-fan-out (survey directories, read N summaries, follow
   `full_text` pointers to source page-ranges). Allowing parallel tool
   calls lets the model batch independent reads, saving roughly 5-10
   outer turns per compile (~20-40s at Opus-class latencies). Writes
   serialise naturally because each `write_skill_file` depends on
   accumulated reads.

3. `query_wiki` is a nested `Runner.run(max_turns=50)` inside the outer
   compile (`max_turns=80`). The docstring's "narrow follow-ups only"
   was the only enforcement; a pathological run could spawn many nested
   calls. Added a per-compile counter via closure: after
   `QUERY_WIKI_MAX_CALLS=3` invocations, the tool returns an error
   string steering the agent back to direct file reads. Bounds tail
   latency without breaking the common case (legitimate cross-document
   sub-questions still get answered).

Prompt-caching for `grade_coverage` (the fourth finding in the review)
was deferred: the openai-agents SDK takes `instructions` as a plain
string with no hook for emitting Anthropic `cache_control` markers.
OpenAI's automatic prefix caching already applies because the system
prompt is byte-stable across the 10 coverage calls; closing the
Anthropic gap is SDK-level work, not application-level.

Tests pass (463; unrelated trafilatura env failure in test_url_ingest
not touched).
@KylinMountain
Copy link
Copy Markdown
Collaborator Author

Code review

Reviewed commit 2b94014 (perf fixes). Found 1 issue:

  1. query_wiki tool docstring exposes the literal Python name QUERY_WIKI_MAX_CALLS instead of the value 3 to the model. @function_tool captures the docstring at decoration time as the tool description visible to the LLM. The string isn't f-string-interpolated, so the model sees "Capped at QUERY_WIKI_MAX_CALLS invocations per compile" — meaningless to it. The agent can't self-regulate before hitting the cap; it just gets a surprise error string back. Either hardcode "3" in the docstring with a comment pointing at the constant, or rebuild the docstring after decoration so the model sees the actual number.

    async def query_wiki(question: str) -> str:
    """Semantic search over the wiki — narrow follow-ups only.
    This is a nested LLM call. Use ONLY when you have a specific
    sub-question that direct file reads can't easily answer (e.g. "what
    does the book say about X across multiple chapters?"). For primary
    traversal, use list/read/get_page_content instead — they are
    cheaper and give you the raw text, not another LLM's summary.
    Capped at ``QUERY_WIKI_MAX_CALLS`` invocations per compile.
    """
    nonlocal query_wiki_calls
    query_wiki_calls += 1
    if query_wiki_calls > QUERY_WIKI_MAX_CALLS:
    return (
    f"query_wiki call cap reached "
    f"({QUERY_WIKI_MAX_CALLS} per compile). Use direct file "

Lower-confidence observations (non-blocking):

  • asyncio.gather does not pass return_exceptions=True. A single grader raising RuntimeError (e.g. MaxTurnsExceeded on one prompt) aborts the whole eval and discards results from the other ~29 prompts that have already completed. Fail-fast is a defensible default for CLI tools, but on a 30-prompt eval that wastes a lot of completed work. Consider return_exceptions=True on the inner gathers with post-processing to keep partial results.

    trigger_results, coverage_results = await asyncio.gather(
    asyncio.gather(*trigger_tasks),
    asyncio.gather(*coverage_tasks),
    )

  • The evaluator module docstring's "Flow" section reads as a linear 1→5 sequence; it doesn't mention that steps 3-4 (the trigger and coverage graders) now run concurrently with a semaphore cap. Stale relative to the new code.

    """Quality evaluation for compiled skills.
    Two metrics, two LLM passes:
    **Trigger accuracy** — given *only* the description, does an external
    agent decide correctly whether to load the skill for a given question?
    This catches under-specific descriptions (false negatives) and
    over-broad descriptions (false positives).
    **Body alignment** — given the *full* SKILL.md (body + references), can
    the skill actually answer the should-trigger questions it claims to
    handle? This catches the failure mode where a well-written description
    promises capability that the body doesn't deliver — a hollow skill that
    would trigger but fail in practice.
    Flow:
    1. Read the SKILL.md frontmatter (description) + body + references/*.
    2. Generator LLM produces N should-trigger + N should-not prompts,
    using description AND body so prompts reflect what the skill
    actually claims to cover, not just description vibes.
    3. Trigger grader sees ONLY the description and answers
    trigger / no-trigger for each prompt — same as before.
    4. Alignment grader sees the SKILL.md body+references and each
    should-trigger prompt; answers "supported / unsupported" — does the
    skill have the substance to handle this question?
    5. Report both pass rates and the specific misses.
    Uses the same LiteLLM model the rest of the KB uses (config.yaml). No
    real LLM calls in tests — both generator and graders are patched.
    """

  • The EVAL_CONCURRENCY = 8 comment claims "~4x the sequential baseline" — with default count=10 there are 30 grader calls, and both gather groups feed the same semaphore, so peak concurrency is 8 across the combined pool. The realistic speedup is closer to 30/8 ≈ 3.75× to 8× depending on provider latency variance, not a fixed ~4×.

    # Bound on concurrent grader LLM calls in run_eval. Without this the
    # default count=10 would fire ~30 simultaneous requests, which most
    # providers rate-limit. 8 is a conservative starting point — runs ~4x
    # the sequential baseline while staying well under typical RPM caps.
    EVAL_CONCURRENCY = 8

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

The QUERY_WIKI_MAX_CALLS=3 limit added in 2b94014 was over-engineered
defence:

- It was based on a theoretical worst-case bound (80 outer turns × 50
  inner turns) from a perf review, not on observed misuse.
- There's no telemetry showing real compiles ever approach pathological
  query_wiki usage.
- The prompt's "narrow follow-ups only" guidance plus the docstring
  ("nested LLM call, prefer direct reads") are already the right
  primary control. Adding a structural hard-cap on top of that says
  "we trust the agent to make 80 reasoning steps but not to count to
  3", which is inconsistent.
- The cap had a real cost: a legitimately complex cross-document
  intent that genuinely needed a 4th query_wiki got cut off and forced
  back to direct reads, silently degrading skill quality.

Also fixes the docstring bug a reviewer flagged on 2b94014: the
docstring referenced the literal Python name `QUERY_WIKI_MAX_CALLS`
rather than the value, so the model saw a meaningless symbol. Removing
the cap removes the docstring claim too.

If runaway compiles ever show up in telemetry, prefer a logged warning
(observe-and-tune) over a silent block.
@VectifyAI VectifyAI deleted a comment from quanqigu May 20, 2026
@KylinMountain KylinMountain changed the title feat(cli): skill factory — openkb skill new + /skill new + auto marketplace.json feat(skill): distill an OpenKB wiki into a redistributable Anthropic Skill May 20, 2026
Both attributions were wrong:

- Structural validation was credited to "Codex skill-creator" at
  github.com/openai/skills — but that repo has no skill-creator, and
  validator.py is explicitly modeled on Anthropic's quick_validate.py
  (see the module docstring).
- Trigger-accuracy evals were credited to Anthropic skill-creator —
  but evaluator.py has no such inspiration in its docstrings; the
  design (description-only trigger grader + body-aware coverage
  grader) is OpenKB's own.

Iteration pattern in workspace.py is genuinely borrowed from Anthropic
skill-creator, and that credit lives in the module docstring where it
belongs.

Reframed the README line to describe what the gates do, without
parading non-existent or backwards lineage.
Same problem as the README "Quality gates" line just removed: the
attribution is louder than the actual lineage.

The workspace module implements directory backup with monotonically
numbered iteration slots — a generic backup pattern, not anything
specific to Anthropic's skill-creator. The inspiration was looking at
their SKILL.md while designing the rollback story, not copying any
algorithm. Calling it "Borrowed from" overstates that.

Keep the rest of the docstring (what the module does, when it runs)
and drop the false-pedigree footer.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant