Skip to content

Tamper-evident append-only fault audit log#487

Merged
mfaferek93 merged 9 commits into
mainfrom
feat/fault-audit-log
Jul 2, 2026
Merged

Tamper-evident append-only fault audit log#487
mfaferek93 merged 9 commits into
mainfrom
feat/fault-audit-log

Conversation

@mfaferek93

Copy link
Copy Markdown
Collaborator

Closes #483

Optional (off by default) append-only, hash-chained audit log of fault state transitions in the fault manager: one immutable row per transition, sha256(prev_hash + canonical(event)) with a persisted chain head, a verify routine, retention that seals a segment anchor before pruning, and append-only triggers.

Threat model is documented honestly: the chain is unkeyed and lives in a single writable file, so it detects non-recomputing tampering and accidental loss; a key/signature or external anchoring is a follow-up.

Tested: 121 unit tests incl. truncation and tamper cases.

Copilot AI review requested due to automatic review settings June 30, 2026 17:24
@mfaferek93 mfaferek93 marked this pull request as draft June 30, 2026 17:30

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an optional, tamper-evident (hash-chained) append-only audit log to ros2_medkit_fault_manager to make fault state-transition history independently verifiable (detecting non-recomputing edits/deletions), with configurable retention/rotation and test coverage for tamper/truncation scenarios.

Changes:

  • Introduces FaultAuditLog (SQLite-backed) with hash chaining, persisted head, verify routine, and sealed-anchor retention pruning.
  • Hooks the audit log into FaultManagerNode for write-path transitions (including timer-driven PREFAILED→CONFIRMED auto-confirmations) and adds/updates unit tests accordingly.
  • Updates dependencies/build/docs/config/changelog to include OpenSSL (EVP SHA-256) and document the threat model and parameters.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/ros2_medkit_fault_manager/src/fault_audit_log.cpp Implements the SQLite-backed append-only, hash-chained audit log with verify + rotation/anchors.
src/ros2_medkit_fault_manager/include/ros2_medkit_fault_manager/fault_audit_log.hpp Public API/types for audit events, records, head, and verification results.
src/ros2_medkit_fault_manager/src/fault_manager_node.cpp Creates/configures the audit log and appends records on fault transitions, including timer auto-confirmation.
src/ros2_medkit_fault_manager/include/ros2_medkit_fault_manager/fault_manager_node.hpp Declares audit-log plumbing and exposes test-only accessors used by new tests.
src/ros2_medkit_fault_manager/src/sqlite_fault_storage.cpp Changes time-based confirmation to return confirmed fault codes for auditing.
src/ros2_medkit_fault_manager/src/fault_storage.cpp Updates in-memory storage to return confirmed fault codes for auditing.
src/ros2_medkit_fault_manager/include/ros2_medkit_fault_manager/fault_storage.hpp Updates the check_time_based_confirmation() contract to return confirmed fault codes.
src/ros2_medkit_fault_manager/include/ros2_medkit_fault_manager/sqlite_fault_storage.hpp Updates override signature for the new confirmation return type.
src/ros2_medkit_fault_manager/test/test_fault_audit_log.cpp Adds focused unit tests for append/verify/tamper/truncation/rotation and trigger enforcement.
src/ros2_medkit_fault_manager/test/test_fault_manager.cpp Adds end-to-end node tests verifying audit logging (enabled/disabled/timer-confirm paths).
src/ros2_medkit_fault_manager/test/test_sqlite_storage.cpp Updates tests to match new confirmation return type (vector of fault codes).
src/ros2_medkit_fault_manager/CMakeLists.txt Links OpenSSL crypto, builds the new audit-log implementation and test target.
src/ros2_medkit_fault_manager/package.xml Adds OpenSSL development dependency (libssl-dev).
src/ros2_medkit_fault_manager/config/fault_manager.yaml Documents audit-log parameters and threat model in the default config.
src/ros2_medkit_fault_manager/README.md Documents audit-log behavior, parameters, rotation, and threat model.
src/ros2_medkit_fault_manager/CHANGELOG.rst Adds a forthcoming entry describing the new audit-log feature and behavior.

Comment thread src/ros2_medkit_fault_manager/test/test_fault_audit_log.cpp
Comment thread src/ros2_medkit_fault_manager/src/fault_manager_node.cpp Outdated
@mfaferek93 mfaferek93 self-assigned this Jun 30, 2026
@mfaferek93 mfaferek93 force-pushed the feat/fault-audit-log branch from dc7b64e to 835b862 Compare June 30, 2026 18:01
@mfaferek93 mfaferek93 requested a review from bburda July 1, 2026 16:49
@mfaferek93 mfaferek93 marked this pull request as ready for review July 1, 2026 16:49

@bburda bburda left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional findings outside the diff:

  • The audit log has no runtime read/verify/health surface. verify(), read(), head(), record_count(), and the dropped-writes / audit-healthy counters are reachable only as C++ getters (the test-only accessor); no runtime path calls verify(), and nothing exposes the log or its health outside the process. The README (line 119, and the fail_closed row of the parameter table) presents the dropped-writes counter as the safety net for the completeness gap, but an operator has no way to observe audit health or run verify() without stopping the node and using an offline tool that reimplements canonicalize(). The history and its verify/health result should be reachable over the gateway API (a read plus a verify/health resource), with an optional fault-manager ROS service for verify/health, so the log is observable at runtime.

Comment thread src/ros2_medkit_fault_manager/README.md Outdated
Comment thread src/ros2_medkit_fault_manager/README.md Outdated
Comment thread src/ros2_medkit_fault_manager/src/fault_audit_log.cpp Outdated
Comment thread src/ros2_medkit_fault_manager/src/fault_audit_log.cpp Outdated
Comment thread src/ros2_medkit_fault_manager/src/fault_audit_log.cpp Outdated
Comment thread src/ros2_medkit_fault_manager/src/fault_audit_log.cpp Outdated
Comment thread src/ros2_medkit_fault_manager/src/fault_manager_node.cpp
@mfaferek93

Copy link
Copy Markdown
Collaborator Author

Agreed, and this is intentional for this PR. #487 lands only the write-time primitive: the append-only hash chain plus the in-process verify() / read() / head APIs. Surfacing the log and its verify/health at runtime (a read plus a verify/health resource over the gateway API, and optionally a fault-manager service) is deliberately out of scope here and tracked as follow-up work.

I'll also tighten the README so the dropped-writes / audit-healthy signals aren't presented as operator-observable at this stage (they are in-process only until that runtime surface lands), and I'll address the concrete correctness findings in the diff in a follow-up commit.

Append-only, hash-chained audit log of fault state transitions in the
fault manager. Each transition appends one immutable row with
record_hash = sha256(prev_hash + canonical(event)) (OpenSSL EVP SHA-256),
a persisted chain head that resumes across restarts, a verify routine,
a read API, and retention that seals a segment anchor before pruning so
the surviving tail stays verifiable. Off by default.

Refs #483
verify() now reads audit_chain_head directly and fails when the head row is
missing on a non-empty log, so deleting the newest row plus the head is caught.
Timer-driven PREFAILED->CONFIRMED confirmations now call audit_transition.
Adds append-only triggers (hardening, not a security boundary) and corrects the
unkeyed/single-file threat model in the docs.

Refs #483
raw_exec_rc no longer runs sqlite3_exec on a failed sqlite3_open; it
records the failure, closes the handle and returns early. The audit-log
enabled RCLCPP_INFO now formats int64_t retention/seq with PRId64
(<cinttypes>) to satisfy -Werror=format=2.

Refs #483
Record auto-recovery as a distinct "healed" row (source auto_heal) so a
fault's end is in the chain. Append failures now bump a dropped-writes
health counter and clear an audit-healthy flag; audit_log.fail_closed
makes them a hard error. Drop the never-written "ack" kind (clear == ack),
protect audit_prune_guard with a trigger, add logging activate/deactivate
markers, and tighten the threat-model README.

Refs #483
canonicalize uses a non-throwing dump so non-UTF-8 content never breaks append;
the sqlite handle is RAII so a throwing ctor cannot leak; rotation is best-effort
after commit and never fails a durable append; anchors are pruned to the current
boundary and guard-gated. Docs: fail_closed is not a rollback, health signals are
in-process only, tail/prefix truncations are cheap; dropped the CIR citation.

Refs #483
sqlite3_column_text is NUL-terminated, so std::string(text) truncated
audit content at the first embedded NUL while bind_text stored the full
bytes, making verify() falsely report a record_hash mismatch on untampered
records. Read with sqlite3_column_bytes so the write/read round-trip stays
byte-exact and the chain still verifies.

Refs #483
Assert exactly one healed row under the HEALED latch, no duplicate
confirmed rows on latched re-confirmation, and a healing_threshold == 0
single-PASSED heal audited exactly once.

Refs #483
@mfaferek93 mfaferek93 force-pushed the feat/fault-audit-log branch from 6579aaf to 3aeb2ff Compare July 1, 2026 21:14
reclassify_healed_as_cleared() now returns the flipped fault codes on
both backends so the node can audit each transition (source
startup_reclassify) instead of silently mutating storage.

Refs #483
@mfaferek93 mfaferek93 requested a review from bburda July 2, 2026 06:08
@mfaferek93 mfaferek93 merged commit fdeb89b into main Jul 2, 2026
17 of 18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tamper-evident append-only fault audit log

3 participants