Tamper-evident append-only fault audit log#487
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds an optional, tamper-evident (hash-chained) append-only audit log to ros2_medkit_fault_manager to make fault state-transition history independently verifiable (detecting non-recomputing edits/deletions), with configurable retention/rotation and test coverage for tamper/truncation scenarios.
Changes:
- Introduces
FaultAuditLog(SQLite-backed) with hash chaining, persisted head, verify routine, and sealed-anchor retention pruning. - Hooks the audit log into
FaultManagerNodefor write-path transitions (including timer-driven PREFAILED→CONFIRMED auto-confirmations) and adds/updates unit tests accordingly. - Updates dependencies/build/docs/config/changelog to include OpenSSL (EVP SHA-256) and document the threat model and parameters.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/ros2_medkit_fault_manager/src/fault_audit_log.cpp | Implements the SQLite-backed append-only, hash-chained audit log with verify + rotation/anchors. |
| src/ros2_medkit_fault_manager/include/ros2_medkit_fault_manager/fault_audit_log.hpp | Public API/types for audit events, records, head, and verification results. |
| src/ros2_medkit_fault_manager/src/fault_manager_node.cpp | Creates/configures the audit log and appends records on fault transitions, including timer auto-confirmation. |
| src/ros2_medkit_fault_manager/include/ros2_medkit_fault_manager/fault_manager_node.hpp | Declares audit-log plumbing and exposes test-only accessors used by new tests. |
| src/ros2_medkit_fault_manager/src/sqlite_fault_storage.cpp | Changes time-based confirmation to return confirmed fault codes for auditing. |
| src/ros2_medkit_fault_manager/src/fault_storage.cpp | Updates in-memory storage to return confirmed fault codes for auditing. |
| src/ros2_medkit_fault_manager/include/ros2_medkit_fault_manager/fault_storage.hpp | Updates the check_time_based_confirmation() contract to return confirmed fault codes. |
| src/ros2_medkit_fault_manager/include/ros2_medkit_fault_manager/sqlite_fault_storage.hpp | Updates override signature for the new confirmation return type. |
| src/ros2_medkit_fault_manager/test/test_fault_audit_log.cpp | Adds focused unit tests for append/verify/tamper/truncation/rotation and trigger enforcement. |
| src/ros2_medkit_fault_manager/test/test_fault_manager.cpp | Adds end-to-end node tests verifying audit logging (enabled/disabled/timer-confirm paths). |
| src/ros2_medkit_fault_manager/test/test_sqlite_storage.cpp | Updates tests to match new confirmation return type (vector of fault codes). |
| src/ros2_medkit_fault_manager/CMakeLists.txt | Links OpenSSL crypto, builds the new audit-log implementation and test target. |
| src/ros2_medkit_fault_manager/package.xml | Adds OpenSSL development dependency (libssl-dev). |
| src/ros2_medkit_fault_manager/config/fault_manager.yaml | Documents audit-log parameters and threat model in the default config. |
| src/ros2_medkit_fault_manager/README.md | Documents audit-log behavior, parameters, rotation, and threat model. |
| src/ros2_medkit_fault_manager/CHANGELOG.rst | Adds a forthcoming entry describing the new audit-log feature and behavior. |
dc7b64e to
835b862
Compare
bburda
left a comment
There was a problem hiding this comment.
Additional findings outside the diff:
- The audit log has no runtime read/verify/health surface.
verify(),read(),head(),record_count(), and the dropped-writes / audit-healthy counters are reachable only as C++ getters (the test-only accessor); no runtime path callsverify(), and nothing exposes the log or its health outside the process. The README (line 119, and thefail_closedrow of the parameter table) presents the dropped-writes counter as the safety net for the completeness gap, but an operator has no way to observe audit health or runverify()without stopping the node and using an offline tool that reimplementscanonicalize(). The history and its verify/health result should be reachable over the gateway API (a read plus a verify/health resource), with an optional fault-manager ROS service for verify/health, so the log is observable at runtime.
|
Agreed, and this is intentional for this PR. #487 lands only the write-time primitive: the append-only hash chain plus the in-process I'll also tighten the README so the dropped-writes / audit-healthy signals aren't presented as operator-observable at this stage (they are in-process only until that runtime surface lands), and I'll address the concrete correctness findings in the diff in a follow-up commit. |
Append-only, hash-chained audit log of fault state transitions in the fault manager. Each transition appends one immutable row with record_hash = sha256(prev_hash + canonical(event)) (OpenSSL EVP SHA-256), a persisted chain head that resumes across restarts, a verify routine, a read API, and retention that seals a segment anchor before pruning so the surviving tail stays verifiable. Off by default. Refs #483
verify() now reads audit_chain_head directly and fails when the head row is missing on a non-empty log, so deleting the newest row plus the head is caught. Timer-driven PREFAILED->CONFIRMED confirmations now call audit_transition. Adds append-only triggers (hardening, not a security boundary) and corrects the unkeyed/single-file threat model in the docs. Refs #483
raw_exec_rc no longer runs sqlite3_exec on a failed sqlite3_open; it records the failure, closes the handle and returns early. The audit-log enabled RCLCPP_INFO now formats int64_t retention/seq with PRId64 (<cinttypes>) to satisfy -Werror=format=2. Refs #483
Record auto-recovery as a distinct "healed" row (source auto_heal) so a fault's end is in the chain. Append failures now bump a dropped-writes health counter and clear an audit-healthy flag; audit_log.fail_closed makes them a hard error. Drop the never-written "ack" kind (clear == ack), protect audit_prune_guard with a trigger, add logging activate/deactivate markers, and tighten the threat-model README. Refs #483
canonicalize uses a non-throwing dump so non-UTF-8 content never breaks append; the sqlite handle is RAII so a throwing ctor cannot leak; rotation is best-effort after commit and never fails a durable append; anchors are pruned to the current boundary and guard-gated. Docs: fail_closed is not a rollback, health signals are in-process only, tail/prefix truncations are cheap; dropped the CIR citation. Refs #483
sqlite3_column_text is NUL-terminated, so std::string(text) truncated audit content at the first embedded NUL while bind_text stored the full bytes, making verify() falsely report a record_hash mismatch on untampered records. Read with sqlite3_column_bytes so the write/read round-trip stays byte-exact and the chain still verifies. Refs #483
Assert exactly one healed row under the HEALED latch, no duplicate confirmed rows on latched re-confirmation, and a healing_threshold == 0 single-PASSED heal audited exactly once. Refs #483
6579aaf to
3aeb2ff
Compare
reclassify_healed_as_cleared() now returns the flipped fault codes on both backends so the node can audit each transition (source startup_reclassify) instead of silently mutating storage. Refs #483
Closes #483
Optional (off by default) append-only, hash-chained audit log of fault state transitions in the fault manager: one immutable row per transition,
sha256(prev_hash + canonical(event))with a persisted chain head, a verify routine, retention that seals a segment anchor before pruning, and append-only triggers.Threat model is documented honestly: the chain is unkeyed and lives in a single writable file, so it detects non-recomputing tampering and accidental loss; a key/signature or external anchoring is a follow-up.
Tested: 121 unit tests incl. truncation and tamper cases.