fix(scan): exclude scan_timestamp from result_hash — closes #82#84
Conversation
In PersistentRegistry.store_scan_result(), the previous implementation
included scan_timestamp = time.time() in the result_hash computation.
Because time.time() changes on every call, result_hash was always
different — the idempotency check (SELECT 1 ... WHERE result_hash = ?)
never matched an existing row, so every re-scan inserted a fresh row
into analysis_cache. After 1000 CI scans of the same codebase the table
had 1000 rows for the same file set (cache bloat).
The existing test_store_scan_result_is_idempotent only passed because
it patched time.time() to a fixed value — meaningless in production.
Fix:
- Split the subset into stable_subset (no scan_timestamp) used for
result_hash, and subset = {**stable_subset, scan_timestamp: now}
used for result_json (storage).
- result_hash = sha256(json.dumps(stable_subset, ...))
- result_json = json.dumps(subset, ...) — still includes scan_timestamp
for trend tracking.
- On cache hit, UPDATE the existing row in place (result_json +
timestamp) so trend tracking reflects the latest scan without
inserting a duplicate row. Previously the cache-hit path just
returned, leaving the timestamp frozen at the first scan.
Tests:
- test_store_scan_result_is_idempotent: removed the time.time() patch.
Now calls store_scan_result twice with REAL time and asserts 1 row.
- test_store_scan_result_updates_timestamp_on_rescan (new): calls
store_scan_result twice, asserts 1 row + timestamp column updated
to the second call value.
- All 5 TestStoreScanResult tests pass.
- Full test suite (excluding integration): 801 passed (one more than
main), 37 pre-existing failures unchanged (date-based vuln_db
staleness + graph_model/hybrid_type_resolver — all fail identically
on main).
Files touched:
- scripts/persistent_registry.py (store_scan_result method only)
- tests/test_persistent_registry.py (TestStoreScanResult class only)
Refs: #82
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
Review - APPROVEReviewed full diff (2 files, +131/-30). Excellent implementation that exceeds spec. Strengths
Verification
Minor observation (non-blocking)The VerdictMerging. This is the correct fix - exceeds spec by adding upsert behavior that makes trend tracking actually useful. Reviewed by BOS (orchestrate-workers skill) - diff read in full before approval. |
|



What changed
In
PersistentRegistry.store_scan_result()(scripts/persistent_registry.py), split the previously-singlesubsetdict into two parts:stable_subset={files_scanned, frontend_counts, backend_counts, frameworks, total_symbols}— noscan_timestamp— used to computeresult_hash.subset={**stable_subset, "scan_timestamp": now}— used forresult_json(storage).scan_timestampis still persisted for trend tracking, just excluded from the hash.Additionally, the cache-hit path was changed from a silent
returnto an in-placeUPDATEof the existing row'sresult_json+timestampcolumns. This keeps trend tracking current (latest scan wins) without inserting a duplicate row.Why
Issue #82 (P1 follow-up to PR #73 / #31): the previous implementation included
scan_timestamp = time.time()in theresult_hashcomputation. Becausetime.time()changes on every call,result_hashwas always different — the idempotency checknever matched an existing row, so every re-scan inserted a fresh row. After 1000 CI scans of the same codebase,
analysis_cachehad 1000 rows for the same file set — cache bloat.The existing
test_store_scan_result_is_idempotentonly passed because it patchedtime.time()to return a fixed value (1700000000.0) — meaningless in production, where every scan has a different timestamp and idempotency silently broke.With the fix, two
store_scan_resultcalls with real time still produce the sameresult_hash(because the stable subset is identical), the second call hits the existing row, updates itstimestampin place, and returns without inserting a duplicate.Files touched
scripts/persistent_registry.py—store_scan_resultmethod only (docstring + body). No signature change, no return-value change._compute_scan_file_set_hashuntouched (already correct per issue constraint).tests/test_persistent_registry.py—TestStoreScanResultclass only:test_store_scan_result_is_idempotent— removed thepr_module.time.timepatch; now callsstore_scan_resulttwice with real time and asserts 1 row.test_store_scan_result_updates_timestamp_on_rescan— calls twice, asserts 1 row +timestampcolumn updated to second-call value.Verification
pytest tests/test_persistent_registry.py::TestStoreScanResult -vAll 5
TestStoreScanResulttests pass — the 3 existing ones (no regression) plus the 2 added/modified by this PR.pytest tests/ -q --ignore=tests/test_integration.py801 passed, 37 failed, 14 skippedmainbaseline:800 passed, 37 failed, 14 skippedThe 37 failures are identical on both branches (verified with
diffof the failure lists — zero difference). They are pre-existing, date-based, and unrelated to this PR:test_vuln_staleness.py— VULN_DB hardcoded to2025-02-28, now 485+ days old, crosses a fixed staleness threshold.test_graph_model.py— graph schema introspection on SQLite versions.test_hybrid_type_resolver.py— clean-app end-to-end imports.My branch adds exactly one new passing test (
test_store_scan_result_updates_timestamp_on_rescan) and converts zero previously-passing tests to failures.Demo: idempotency + timestamp update with real time
Standalone script that calls
store_scan_resulttwice with realtime.time()on the same file set, then queriesanalysis_cache:— 1 row after 2 calls (idempotent), and the
timestampcolumn is updated to the second call's value (trend tracking stays current).Architectural decision
On cache hit,
UPDATEin place instead of returning silently.The issue text framed idempotency as "no duplicate insert", but acceptance criterion #4 explicitly requires the
timestampcolumn to be updated to the latest scan. Silent-return would leavetimestampfrozen at the first scan forever, breaking trend tracking. The upsert pattern (INSERT if absent, UPDATE if present) satisfies both: no duplicate rows, and the row reflects the latest scan.As a side benefit,
result_json'sscan_timestampfield is also kept in sync with thetimestampcolumn (both updated atomically in the sameUPDATE). Previously the two could drift because they came from two separatetime.time()calls.Issue link
Closes #82.
Found but not fixed (out of scope)
tests/test_vuln_staleness.py::TestScanVulnerabilitiesCacheInfo::test_vulnerable_app_has_stale_cache_info— fails identically onmain. Root cause: hardcodedVULN_DBdate2025-02-28(485+ days old) crosses the test's fixed staleness threshold. Needs the VULN_DB date bumped or the threshold parameterized — separate fix.tests/test_graph_model.py(10 failures) andtests/test_hybrid_type_resolver.py::TestCleanAppEndToEnd(3 failures) — also fail identically onmain. Not investigated; out of scope for this PR.