Skip to content

fix(cli): unbreak openkb remove — registry pruning + scoped lint pass (closes #58)#62

Open
SeungwookHan wants to merge 2 commits into
VectifyAI:mainfrom
SeungwookHan:fix/58-orphan-hash-and-scoped-lint
Open

fix(cli): unbreak openkb remove — registry pruning + scoped lint pass (closes #58)#62
SeungwookHan wants to merge 2 commits into
VectifyAI:mainfrom
SeungwookHan:fix/58-orphan-hash-and-scoped-lint

Conversation

@SeungwookHan
Copy link
Copy Markdown
Contributor

Summary

Two regressions from #51 surface together on a single openkb remove run but have independent root causes — reported and fixed together because they share one repro, each landed as its own atomic commit.

Closes #58.

Bug 1 — orphan hash entry on legacy documents

After openkb remove reports success, hashes.json still contains the removed doc's entry. Re-running openkb add <same-file> is then wrongly treated as a duplicate via SHA dedup.

Root cause. Commit c504e26 fixed add_single_file to persist doc_name into the registry, but rows ingested before that commit were never backfilled — they still carry only {name, type}. HashRegistry.remove_by_doc_name matched on meta.get("doc_name") == doc_name, which evaluated to None == "<slug>" and silently returned False. The CLI never checked the return value, so the failure was invisible at the surface.

Fix (commit 1). Prune by the already-resolved file_hash (returned by _resolve_doc_identifier) via a new HashRegistry.remove_by_hash(file_hash) -> bool. The hash key is always present, so pruning no longer depends on the metadata shape — and it drops the slug round-trip the CLI never needed.

Bug 2 — fix_broken_links reformatting unrelated wiki pages

Removing a single doc produced a sprawling diff. In the repro, removing one ollama.md produced a 39-file / 1254-line diff; 27 of those were concept pages that never listed ollama as a source.

Root cause. fix_broken_links was built for openkb lint --fix — an explicit, user-invoked wiki-wide cleanup — and strips every dangling [[wikilink]] in the KB. #51 reused it for remove's post-pass, whose declared intent (per the #51 PR description itself) was only to strip links pointing at the just-removed summary and deleted concept pages. The post-remove cleanup intent was scoped; the helper it called was not. That mismatch swept the rest of the KB and silently demoted pre-existing ghost links the user meant to keep (e.g. a hand-written link to a concept they plan to add later).

To be precise: the wiki-wide behavior is correct for openkb lint --fix. This is a #51-side scoping fix, not a change to #49's behavior.

Fix (commit 2, the preferred path from the issue). Add a keyword-only restrict_to parameter to fix_broken_links. remove passes only the union of the concept pages it modified (concept_result["modified"]) and index.md, so the post-remove pass cleans only the dangling links this remove just created. The valid-target whitelist is still computed wiki-wide so legitimate [[concepts/sibling]] links in the scoped pages still resolve. openkb lint --fix is untouched — it omits the parameter and keeps the wiki-wide behavior.

Test contract change

test_cli_remove_lint_cleans_dangling_links previously planted a stray link in concepts/llm.md (an unrelated page) and asserted it was stripped — that test enshrined the over-zealous behavior and couldn't tell scoped cleanup from global. Renamed to ..._in_modified_page and re-targeted at concepts/attention.md (a page this removal actually modifies); the new test_cli_remove_preserves_ghosts_in_unrelated_pages locks in the opposite contract.

Test plan

  • tests/test_remove.py — 46/46 pass, incl. test_hash_registry_remove_by_hash, test_cli_remove_prunes_legacy_registry_entry_without_doc_name (E2E Bug 1), test_cli_remove_preserves_ghosts_in_unrelated_pages (E2E Bug 2), test_cli_remove_lint_cleans_dangling_links_in_modified_page.
  • tests/test_lint.py::TestFixBrokenLinksRestrictTo — 5 new cases: default behavior, scoped, empty list = no-op, paths outside wiki/ skipped, wiki-wide target set still consulted.
  • Full suite: 380 passed, no regressions, no behavior change for openkb lint --fix.
  • TDD verification. With only the code patches reverted (tests applied), the 7 new regression assertions fail for the right reasons — AttributeError on remove_by_hash, TypeError on the restrict_to kwarg, the legacy h_legacy entry surviving, and [[concepts/agent-loops]] getting demoted to plain text — and the 2 sentinel tests pass on both sides of the patch.
  • Each commit verified in isolation (Bug 1 commit: 96/96 pass with Bug 2 changes stashed).

Commits

  1. fix(cli): prune the hash registry by file_hash so openkb remove drops legacy entries — Bug 1
  2. fix(cli): scope openkb remove's post-pass lint to the files it touched — Bug 2

…ps legacy entries

`HashRegistry.remove_by_doc_name` matched on
`meta.get("doc_name") == doc_name`. Registry rows ingested before
commit c504e26 carry only `{name, type}` — no `doc_name` key — so the
comparison was always `None == "<slug>"` and the method silently
returned False. The CLI never checked the return value, leaving an
orphan hash in `hashes.json` that then re-bound the next `openkb add`
of the same file via PageIndex / SHA dedup.

Add `HashRegistry.remove_by_hash(file_hash) -> bool` and call it from
`remove` with the hash already resolved by `_resolve_doc_identifier`.
The hash key is always present, so pruning no longer depends on the
metadata shape.

Tests: `test_hash_registry_remove_by_hash` (modern + legacy entry
shapes, idempotency, unknown-hash returns False) and the end-to-end
`test_cli_remove_prunes_legacy_registry_entry_without_doc_name`, which
seeds a registry entry in the exact pre-PR#51 shape from the issue.

Refs VectifyAI#58 (Bug 1).
`remove` ran `fix_broken_links(wiki_dir)` after every deletion. That
helper was built for `openkb lint --fix` — an explicit, user-invoked
wiki-wide cleanup — and strips *every* dangling `[[wikilink]]` in the
KB. PR VectifyAI#51 reused it for the post-remove pass, whose declared intent
(per its own PR description) was only to strip links pointing at the
just-removed summary and deleted concept pages.

That mismatch meant a single `openkb remove` reformatted unrelated
pages KB-wide — 39-file / 1254-line diff in the repro — and silently
demoted pre-existing ghost links the user meant to keep, e.g. a
hand-written link to a concept they plan to add later.

Add a keyword-only `restrict_to` parameter to `fix_broken_links`:
when given, only those files are rewritten, while the valid-target
whitelist is still computed wiki-wide so legitimate cross-links in the
scoped files still resolve. `remove` now passes the union of the
concept pages it modified (`concept_result["modified"]`) and
`index.md`. `openkb lint --fix` is untouched — it omits the parameter
and keeps the wiki-wide behavior, which is correct for that command.

`test_cli_remove_lint_cleans_dangling_links` previously planted a
stray link in an unrelated page and asserted it was stripped, which
enshrined the over-zealous behavior. Renamed to `..._in_modified_page`
and re-targeted at a page the removal actually modifies; the new
`test_cli_remove_preserves_ghosts_in_unrelated_pages` locks in the
opposite contract. `TestFixBrokenLinksRestrictTo` adds 5 unit cases
covering default behavior, scoping, empty list, paths outside the
wiki, and the wiki-wide target set.

Closes VectifyAI#58.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

openkb remove leaves orphan hash and reformats unrelated wiki

1 participant