fix(cli): unbreak openkb remove — registry pruning + scoped lint pass (closes #58)#62
Open
SeungwookHan wants to merge 2 commits into
Open
Conversation
…ps legacy entries
`HashRegistry.remove_by_doc_name` matched on
`meta.get("doc_name") == doc_name`. Registry rows ingested before
commit c504e26 carry only `{name, type}` — no `doc_name` key — so the
comparison was always `None == "<slug>"` and the method silently
returned False. The CLI never checked the return value, leaving an
orphan hash in `hashes.json` that then re-bound the next `openkb add`
of the same file via PageIndex / SHA dedup.
Add `HashRegistry.remove_by_hash(file_hash) -> bool` and call it from
`remove` with the hash already resolved by `_resolve_doc_identifier`.
The hash key is always present, so pruning no longer depends on the
metadata shape.
Tests: `test_hash_registry_remove_by_hash` (modern + legacy entry
shapes, idempotency, unknown-hash returns False) and the end-to-end
`test_cli_remove_prunes_legacy_registry_entry_without_doc_name`, which
seeds a registry entry in the exact pre-PR#51 shape from the issue.
Refs VectifyAI#58 (Bug 1).
`remove` ran `fix_broken_links(wiki_dir)` after every deletion. That helper was built for `openkb lint --fix` — an explicit, user-invoked wiki-wide cleanup — and strips *every* dangling `[[wikilink]]` in the KB. PR VectifyAI#51 reused it for the post-remove pass, whose declared intent (per its own PR description) was only to strip links pointing at the just-removed summary and deleted concept pages. That mismatch meant a single `openkb remove` reformatted unrelated pages KB-wide — 39-file / 1254-line diff in the repro — and silently demoted pre-existing ghost links the user meant to keep, e.g. a hand-written link to a concept they plan to add later. Add a keyword-only `restrict_to` parameter to `fix_broken_links`: when given, only those files are rewritten, while the valid-target whitelist is still computed wiki-wide so legitimate cross-links in the scoped files still resolve. `remove` now passes the union of the concept pages it modified (`concept_result["modified"]`) and `index.md`. `openkb lint --fix` is untouched — it omits the parameter and keeps the wiki-wide behavior, which is correct for that command. `test_cli_remove_lint_cleans_dangling_links` previously planted a stray link in an unrelated page and asserted it was stripped, which enshrined the over-zealous behavior. Renamed to `..._in_modified_page` and re-targeted at a page the removal actually modifies; the new `test_cli_remove_preserves_ghosts_in_unrelated_pages` locks in the opposite contract. `TestFixBrokenLinksRestrictTo` adds 5 unit cases covering default behavior, scoping, empty list, paths outside the wiki, and the wiki-wide target set. Closes VectifyAI#58.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two regressions from #51 surface together on a single
openkb removerun but have independent root causes — reported and fixed together because they share one repro, each landed as its own atomic commit.Closes #58.
Bug 1 — orphan hash entry on legacy documents
After
openkb removereports success,hashes.jsonstill contains the removed doc's entry. Re-runningopenkb add <same-file>is then wrongly treated as a duplicate via SHA dedup.Root cause. Commit c504e26 fixed
add_single_fileto persistdoc_nameinto the registry, but rows ingested before that commit were never backfilled — they still carry only{name, type}.HashRegistry.remove_by_doc_namematched onmeta.get("doc_name") == doc_name, which evaluated toNone == "<slug>"and silently returnedFalse. The CLI never checked the return value, so the failure was invisible at the surface.Fix (commit 1). Prune by the already-resolved
file_hash(returned by_resolve_doc_identifier) via a newHashRegistry.remove_by_hash(file_hash) -> bool. The hash key is always present, so pruning no longer depends on the metadata shape — and it drops the slug round-trip the CLI never needed.Bug 2 —
fix_broken_linksreformatting unrelated wiki pagesRemoving a single doc produced a sprawling diff. In the repro, removing one
ollama.mdproduced a 39-file / 1254-line diff; 27 of those were concept pages that never listedollamaas a source.Root cause.
fix_broken_linkswas built foropenkb lint --fix— an explicit, user-invoked wiki-wide cleanup — and strips every dangling[[wikilink]]in the KB. #51 reused it forremove's post-pass, whose declared intent (per the #51 PR description itself) was only to strip links pointing at the just-removed summary and deleted concept pages. The post-remove cleanup intent was scoped; the helper it called was not. That mismatch swept the rest of the KB and silently demoted pre-existing ghost links the user meant to keep (e.g. a hand-written link to a concept they plan to add later).To be precise: the wiki-wide behavior is correct for
openkb lint --fix. This is a #51-side scoping fix, not a change to #49's behavior.Fix (commit 2, the preferred path from the issue). Add a keyword-only
restrict_toparameter tofix_broken_links.removepasses only the union of the concept pages it modified (concept_result["modified"]) andindex.md, so the post-remove pass cleans only the dangling links this remove just created. The valid-target whitelist is still computed wiki-wide so legitimate[[concepts/sibling]]links in the scoped pages still resolve.openkb lint --fixis untouched — it omits the parameter and keeps the wiki-wide behavior.Test contract change
test_cli_remove_lint_cleans_dangling_linkspreviously planted a stray link inconcepts/llm.md(an unrelated page) and asserted it was stripped — that test enshrined the over-zealous behavior and couldn't tell scoped cleanup from global. Renamed to..._in_modified_pageand re-targeted atconcepts/attention.md(a page this removal actually modifies); the newtest_cli_remove_preserves_ghosts_in_unrelated_pageslocks in the opposite contract.Test plan
tests/test_remove.py— 46/46 pass, incl.test_hash_registry_remove_by_hash,test_cli_remove_prunes_legacy_registry_entry_without_doc_name(E2E Bug 1),test_cli_remove_preserves_ghosts_in_unrelated_pages(E2E Bug 2),test_cli_remove_lint_cleans_dangling_links_in_modified_page.tests/test_lint.py::TestFixBrokenLinksRestrictTo— 5 new cases: default behavior, scoped, empty list = no-op, paths outsidewiki/skipped, wiki-wide target set still consulted.openkb lint --fix.AttributeErroronremove_by_hash,TypeErroron therestrict_tokwarg, the legacyh_legacyentry surviving, and[[concepts/agent-loops]]getting demoted to plain text — and the 2 sentinel tests pass on both sides of the patch.Commits
fix(cli): prune the hash registry by file_hash so openkb remove drops legacy entries— Bug 1fix(cli): scope openkb remove's post-pass lint to the files it touched— Bug 2