Skip to content

fix(scan): exclude scan_timestamp from result_hash — closes #82#84

Merged
Wolfvin merged 1 commit into
mainfrom
fix/82-store-scan-result-idempotency
Jun 29, 2026
Merged

fix(scan): exclude scan_timestamp from result_hash — closes #82#84
Wolfvin merged 1 commit into
mainfrom
fix/82-store-scan-result-idempotency

Conversation

@Wolfvin

@Wolfvin Wolfvin commented Jun 28, 2026

Copy link
Copy Markdown
Owner

What changed

In PersistentRegistry.store_scan_result() (scripts/persistent_registry.py), split the previously-single subset dict into two parts:

  • stable_subset = {files_scanned, frontend_counts, backend_counts, frameworks, total_symbols}no scan_timestamp — used to compute result_hash.
  • subset = {**stable_subset, "scan_timestamp": now} — used for result_json (storage). scan_timestamp is still persisted for trend tracking, just excluded from the hash.
stable_json = json.dumps(stable_subset, ensure_ascii=False, default=str, sort_keys=True)
result_hash = hashlib.sha256(stable_json.encode("utf-8")).hexdigest()
result_json = json.dumps(subset, ensure_ascii=False, default=str, sort_keys=True)

Additionally, the cache-hit path was changed from a silent return to an in-place UPDATE of the existing row's result_json + timestamp columns. This keeps trend tracking current (latest scan wins) without inserting a duplicate row.

Why

Issue #82 (P1 follow-up to PR #73 / #31): the previous implementation included scan_timestamp = time.time() in the result_hash computation. Because time.time() changes on every call, result_hash was always different — the idempotency check

SELECT 1 FROM analysis_cache WHERE command = ? AND file_set_hash = ? AND result_hash = ?

never matched an existing row, so every re-scan inserted a fresh row. After 1000 CI scans of the same codebase, analysis_cache had 1000 rows for the same file set — cache bloat.

The existing test_store_scan_result_is_idempotent only passed because it patched time.time() to return a fixed value (1700000000.0) — meaningless in production, where every scan has a different timestamp and idempotency silently broke.

With the fix, two store_scan_result calls with real time still produce the same result_hash (because the stable subset is identical), the second call hits the existing row, updates its timestamp in place, and returns without inserting a duplicate.

Files touched

  • scripts/persistent_registry.pystore_scan_result method only (docstring + body). No signature change, no return-value change. _compute_scan_file_set_hash untouched (already correct per issue constraint).
  • tests/test_persistent_registry.pyTestStoreScanResult class only:

Verification

pytest tests/test_persistent_registry.py::TestStoreScanResult -v

tests/test_persistent_registry.py::TestStoreScanResult::test_store_scan_result_persists_to_analysis_cache PASSED [ 20%]
tests/test_persistent_registry.py::TestStoreScanResult::test_store_scan_result_is_idempotent PASSED [ 40%]
tests/test_persistent_registry.py::TestStoreScanResult::test_store_scan_result_updates_timestamp_on_rescan PASSED [ 60%]
tests/test_persistent_registry.py::TestStoreScanResult::test_store_scan_result_different_file_sets PASSED [ 80%]
tests/test_persistent_registry.py::TestStoreScanResult::test_scan_command_sets_sqlite_persisted_flag PASSED [100%]

============================== 5 passed in 0.51s ===============================

All 5 TestStoreScanResult tests pass — the 3 existing ones (no regression) plus the 2 added/modified by this PR.

pytest tests/ -q --ignore=tests/test_integration.py

  • This branch: 801 passed, 37 failed, 14 skipped
  • main baseline: 800 passed, 37 failed, 14 skipped

The 37 failures are identical on both branches (verified with diff of the failure lists — zero difference). They are pre-existing, date-based, and unrelated to this PR:

  • test_vuln_staleness.py — VULN_DB hardcoded to 2025-02-28, now 485+ days old, crosses a fixed staleness threshold.
  • test_graph_model.py — graph schema introspection on SQLite versions.
  • test_hybrid_type_resolver.py — clean-app end-to-end imports.

My branch adds exactly one new passing test (test_store_scan_result_updates_timestamp_on_rescan) and converts zero previously-passing tests to failures.

Demo: idempotency + timestamp update with real time

Standalone script that calls store_scan_result twice with real time.time() on the same file set, then queries analysis_cache:

=== Issue #82 Demo: store_scan_result idempotency ===
After call 1: rows=1, timestamp=1782689350.6656766
After call 2: rows=1, timestamp=1782689350.717351

Idempotent (1 row after 2 calls): True
Timestamp updated on rescan     : True

RESULT: PASS

— 1 row after 2 calls (idempotent), and the timestamp column is updated to the second call's value (trend tracking stays current).

Architectural decision

On cache hit, UPDATE in place instead of returning silently.

The issue text framed idempotency as "no duplicate insert", but acceptance criterion #4 explicitly requires the timestamp column to be updated to the latest scan. Silent-return would leave timestamp frozen at the first scan forever, breaking trend tracking. The upsert pattern (INSERT if absent, UPDATE if present) satisfies both: no duplicate rows, and the row reflects the latest scan.

As a side benefit, result_json's scan_timestamp field is also kept in sync with the timestamp column (both updated atomically in the same UPDATE). Previously the two could drift because they came from two separate time.time() calls.

Issue link

Closes #82.

Found but not fixed (out of scope)

  • tests/test_vuln_staleness.py::TestScanVulnerabilitiesCacheInfo::test_vulnerable_app_has_stale_cache_info — fails identically on main. Root cause: hardcoded VULN_DB date 2025-02-28 (485+ days old) crosses the test's fixed staleness threshold. Needs the VULN_DB date bumped or the threshold parameterized — separate fix.
  • tests/test_graph_model.py (10 failures) and tests/test_hybrid_type_resolver.py::TestCleanAppEndToEnd (3 failures) — also fail identically on main. Not investigated; out of scope for this PR.

In PersistentRegistry.store_scan_result(), the previous implementation
included scan_timestamp = time.time() in the result_hash computation.
Because time.time() changes on every call, result_hash was always
different — the idempotency check (SELECT 1 ... WHERE result_hash = ?)
never matched an existing row, so every re-scan inserted a fresh row
into analysis_cache. After 1000 CI scans of the same codebase the table
had 1000 rows for the same file set (cache bloat).

The existing test_store_scan_result_is_idempotent only passed because
it patched time.time() to a fixed value — meaningless in production.

Fix:
- Split the subset into stable_subset (no scan_timestamp) used for
  result_hash, and subset = {**stable_subset, scan_timestamp: now}
  used for result_json (storage).
- result_hash = sha256(json.dumps(stable_subset, ...))
- result_json = json.dumps(subset, ...) — still includes scan_timestamp
  for trend tracking.
- On cache hit, UPDATE the existing row in place (result_json +
  timestamp) so trend tracking reflects the latest scan without
  inserting a duplicate row. Previously the cache-hit path just
  returned, leaving the timestamp frozen at the first scan.

Tests:
- test_store_scan_result_is_idempotent: removed the time.time() patch.
  Now calls store_scan_result twice with REAL time and asserts 1 row.
- test_store_scan_result_updates_timestamp_on_rescan (new): calls
  store_scan_result twice, asserts 1 row + timestamp column updated
  to the second call value.
- All 5 TestStoreScanResult tests pass.
- Full test suite (excluding integration): 801 passed (one more than
  main), 37 pre-existing failures unchanged (date-based vuln_db
  staleness + graph_model/hybrid_type_resolver — all fail identically
  on main).

Files touched:
- scripts/persistent_registry.py (store_scan_result method only)
- tests/test_persistent_registry.py (TestStoreScanResult class only)

Refs: #82
@chatgpt-codex-connector

Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@Wolfvin

Wolfvin commented Jun 29, 2026

Copy link
Copy Markdown
Owner Author

Review - APPROVE

Reviewed full diff (2 files, +131/-30). Excellent implementation that exceeds spec.

Strengths

  1. Split subset correctly: stable_subset (no scan_timestamp) used for result_hash, subset (with scan_timestamp) used for result_json storage. Matches issue [BUG-P1] store_scan_result idempotency broken — scan_timestamp in result_hash causes cache bloat #82 spec exactly.

  2. Upsert on cache hit (beyond spec, excellent): instead of just skipping insert on cache hit, the code now UPDATEs the existing row's timestamp + result_json in place. This means trend tracking reflects the LATEST scan, not the first scan — which is what users actually want when they re-scan.

  3. Single time.time() call: captured once as now = time.time() and reused for both scan_timestamp field and analysis_cache.timestamp column. Consistent, efficient.

  4. Comprehensive docstring update: explains the [BUG-P1] store_scan_result idempotency broken — scan_timestamp in result_hash causes cache bloat #82 bug, the fix (split subset), and the upsert behavior. References issue [BUG-P1] store_scan_result idempotency broken — scan_timestamp in result_hash causes cache bloat #82.

  5. Test improvements:

    • test_store_scan_result_is_idempotent: removed the time.time() patch (which was masking the bug). Now uses real time, calls twice, asserts 1 row. This is the test that would have caught [BUG-P1] store_scan_result idempotency broken — scan_timestamp in result_hash causes cache bloat #82 in the first place if it had been written this way originally.
    • test_store_scan_result_updates_timestamp_on_rescan (NEW): verifies idempotency (1 row) + timestamp updated to latest. Uses time.sleep(0.05) to ensure deterministic timestamp difference. Excellent test.
  6. Error handling preserved: try/except sqlite3.Error on the UPDATE path with logger.warning. Non-fatal.

Verification

  • Idempotency now works with real time (no patching) - the original bug is fixed
  • Trend tracking still works (timestamp updated on re-scan)
  • No duplicate rows inserted
  • Existing tests still pass (test_store_scan_result_persists_to_analysis_cache, test_store_scan_result_different_file_sets, test_scan_command_sets_sqlite_persisted_flag)

Minor observation (non-blocking)

The test_store_scan_result_updates_timestamp_on_rescan test uses time.sleep(0.05) for deterministic timestamp difference. On extremely slow CI (rare), 50ms might not be enough if time.time() resolution is coarser. Acceptable - if it ever flakes, bump to 0.1s.

Verdict

Merging. This is the correct fix - exceeds spec by adding upsert behavior that makes trend tracking actually useful.

Reviewed by BOS (orchestrate-workers skill) - diff read in full before approval.

@Wolfvin Wolfvin merged commit 241c762 into main Jun 29, 2026
2 of 8 checks passed
@sonarqubecloud

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG-P1] store_scan_result idempotency broken — scan_timestamp in result_hash causes cache bloat

1 participant