inline-checksum: optional per-cluster CRC validation (TD.100226.1)#1016

Open
schmidt-scaled wants to merge 14 commits into main from inline-checksum-validation

Conversation

@schmidt-scaled
Contributor

Summary

  • Control-plane support for the inline silent-data-error protection feature whose data-plane work landed on ultra branch checksum-validation. Optional, frozen at cluster-create time, no upgrade path for existing clusters.
  • Adds Cluster.inline_checksum, NVMeDevice.md_size / md_supported, bdev_alceml_create checksum-method kwargs, and per-cluster CLI flags on cluster create, cluster add, and sn configure.
  • Auto-detects per-device NVMe metadata support at add-node and on restart from SPDK's bdev_get_bdevs md_size field (spdk_bdev_get_md_size), so a cluster can mix md-on-device and fallback drives. Capacity overhead from the fallback layout (6 data blocks lost per 2 MiB extent, 510 -> 504, ≈ 1.17%) is charged as initial provisioned utilization rather than reducing reported raw capacity.

Why this branch and not main

  • A few inline-checksum lines previously leaked into main via the unrelated commit aa81fafe (iteration-77 hang fix) on 2026-04-30. Those four hunks were removed in 5ff3e214 so main is clean again. This branch carries the full feature in one self-contained commit on top of that clean main.

Design ref

  • TD.100226.1 (Inline Protection from Silent Data Errors).

Compatibility

  • Cluster flag defaults to False; existing clusters and code paths unchanged.
  • Data plane reads md_size directly from the bound bdev, so the control plane only flips checksum_validation_method 1/2/0 — matching the checksum-validation branch RPC schema (checksum_validation_method, cache_size, cache_eviction_threshold).

Test plan

  • 27 new unit tests in tests/test_inline_checksum.py (model defaults, find_md_lbaf_id, alceml_checksum_params, fallback-overhead math, RPC param wire-up for each method, addNvmeDevices md detection from a mocked bdev payload). All pass.
  • Full unit-test suite (870 tests) stayed green throughout development on a separate worktree.
  • Lab dual-fault test on an FT=2 cluster with --enable-inline-checksum, mixing one drive formatted with an md-capable LBAF (cv_md_method) and one without (cv_fallback_method); verify capacity reporting reflects the ~1.17% pre-charge only on the fallback drive.
  • Negative test: simulate single-bit corruption on a passtest bdev under an inline-checksum cluster and confirm read returns IO error rather than silent corruption.
  • Confirm existing clusters (no inline_checksum field in DB) read back as False and behave exactly as today.

Open follow-ups (not blocking this PR)

  • The add_cluster / cluster add controller has a pre-existing positional-arg shift around fabric (unrelated to this work). The new kwarg is added keyword-only so it isn't affected; the bug should be fixed separately.
  • Operator-facing capacity reporting (sb cluster get-capacity) is fed from Prometheus and reflects what alceml itself reports; the per-device pre-charge added here is in the lvol-create provisioning gate. If we want it visible in cluster get-capacity too, that's a small extension to the stats collector.

🤖 Generated with Claude Code

Comment thread simplyblock_cli/cli.py
argument = subcommand.add_argument('--size-range', help='NVMe SSD device size range separated by -, can be X(m,g,t) or bytes as integer, example: --size-range 50G-1T or --size-range 1232345-67823987, --device-model and --size-range must be set together.', type=str, default='', dest='size_range', required=False)
argument = subcommand.add_argument('--nvme-names', help='Comma separated list of nvme namespace names like nvme0n1,nvme1n1.', type=str, default='', dest='nvme_names', required=False)
argument = subcommand.add_argument('--force', help='Force format detected or passed nvme pci address to 4K and clean partitions.', dest='force', action='store_true')
argument = subcommand.add_argument('--enable-inline-checksum', help='When formatting (with --force), prefer an LBAF that supports >=8 bytes of NVMe metadata per block, so alceml can run inline checksum validation in md-on-device mode. Drives with no md-capable LBAF still format to plain 4K and will use the fallback layout.', dest='inline_checksum', action='store_true')
Comment thread simplyblock_cli/cli.py
if self.developer_mode:
argument = subcommand.add_argument('--disable-monitoring', help='Disable monitoring stack, false by default. Default: `false`.', dest='disable_monitoring', action='store_true')
argument = subcommand.add_argument('--strict-node-anti-affinity', help='Enable strict node anti affinity for storage nodes. Never more than one chunk is placed on a node. This requires a minimum of _data-chunks-in-stripe + parity-chunks-in-stripe + 1_ nodes in the cluster.', dest='strict_node_anti_affinity', action='store_true')
argument = subcommand.add_argument('--enable-inline-checksum', help='Enable inline CRC checksum validation on every IO for silent-data-error protection. Cannot be enabled or disabled after cluster creation. Per-device alceml mode (md-on-device vs fallback) is auto-detected at add-node.', dest='inline_checksum', action='store_true')
Comment thread simplyblock_cli/cli.py
if self.developer_mode:
argument = subcommand.add_argument('--inflight-io-threshold', help='The number of inflight IOs allowed before the IO queuing starts. Default: `4`.', type=int, default=4, dest='inflight_io_threshold')
argument = subcommand.add_argument('--strict-node-anti-affinity', help='Enable strict node anti affinity for storage nodes. Never more than one chunk is placed on a node. This requires a minimum of _data-chunks-in-stripe + parity-chunks-in-stripe + 1_ nodes in the cluster.', dest='strict_node_anti_affinity', action='store_true')
argument = subcommand.add_argument('--enable-inline-checksum', help='Enable inline CRC checksum validation on every IO for silent-data-error protection. Cannot be enabled or disabled after cluster creation.', dest='inline_checksum', action='store_true')
Comment thread simplyblock_core/utils/__init__.py Fixed
schmidt-scaled force-pushed the inline-checksum-validation branch from f7823d3 to 9273148 (May 5, 2026 09:16)
Comment thread scripts/setup_lab_perf_test1.py Fixed
Comment thread simplyblock_core/utils/__init__.py Fixed
Comment thread scripts/setup_lab_perf_test1.py Fixed
Comment thread scripts/setup_lab_perf_test1.py Fixed
for attempt in range(1, 16):
    try:
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
Comment on lines +452 to +453
"containers=$(docker ps -aq); "
"if [ -n \"$containers\" ]; then docker rm -f $containers; fi",
Comment thread scripts/lab_dual_node_outage_soak_mixed_churn.py Fixed
Comment thread scripts/lab_dual_node_outage_soak_mixed_churn.py Fixed
schmidt-scaled force-pushed the inline-checksum-validation branch from 4e3d8a4 to ca9fa2f (May 5, 2026 12:06)
Comment thread scripts/lab_dual_node_outage_soak_mixed_churn.py Fixed
Comment thread scripts/lab_dual_node_outage_soak_mixed_churn.py Fixed
Comment thread simplyblock_core/storage_node_ops.py Fixed
Comment thread simplyblock_core/storage_node_ops.py Fixed
for host in self.node_hosts.values():
    try:
        host.close()
    except Exception:
        pass
if cached is not None:
    try:
        cached.close()
    except Exception:
        pass
schmidt-scaled and others added 12 commits May 10, 2026 13:19
Adds the sbcli control-plane support for the inline silent-data-error
protection feature whose data-plane work landed on ultra branch
checksum-validation. Frozen at cluster-create time; no upgrade path for
existing clusters. Per-device alceml mode (md-on-device vs fallback) is
auto-detected from the bound SPDK bdev's md_size at add-node and on
restart, so a cluster can have a heterogeneous mix of drives.

Cluster + per-device flags
- Cluster.inline_checksum (bool, default False): persisted only via
  create_cluster / add_cluster; no mutator.
- NVMeDevice.md_size (int) and NVMeDevice.md_supported (bool): set in
  addNvmeDevices from bdev_get_bdevs' top-level md_size field
  (spdk_bdev_get_md_size; emitted by lib/bdev/bdev_rpc.c). Refreshed on
  every restart-device path so a between-restart `nvme format` is
  reflected.

alceml RPC plumbing (matches checksum-validation branch)
- bdev_alceml_create gains optional checksum_method (1=md-on-device,
  2=fallback, default 0=off), cache_size, cache_eviction_threshold;
  zero-defaults are not sent so the data plane keeps its built-ins
  (cv_default_cache_size=2000, threshold 90%).
- utils.alceml_checksum_params(cluster, dev) picks 0/1/2 from the
  cluster flag plus md_supported. Data plane refuses method=1 on
  md_size==0, so the per-device decision is mandatory.
- Both alceml call sites (storage_node_ops._create_storage_device_stack
  and device_controller restart-device) thread the params through and
  warn when a device falls back.
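The 0/1/2 pick described above can be sketched as follows (simplified to plain booleans for illustration; the real `utils.alceml_checksum_params` takes the cluster and device models):

```python
def alceml_checksum_params(inline_checksum: bool, md_supported: bool) -> dict:
    """Sketch of the method pick: 0 = off (omit the kwargs entirely so the
    data plane keeps its built-in defaults), 1 = md-on-device, 2 = fallback."""
    if not inline_checksum:
        return {}  # zero-default: nothing sent, checksum validation stays off
    return {"checksum_validation_method": 1 if md_supported else 2}
```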

CLI
- cluster create / cluster add / sn configure each gain
  --enable-inline-checksum.
- sn configure runs `find_md_lbaf_id` against `nvme id-ns` JSON to pick
  the smallest qualifying LBAF (ds=12, ms>=8) and forces a reformat
  past the existing 4K-already-formatted early-out (SectorSize stays
  4096 across an md/no-md transition).
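A minimal sketch of the LBAF selection, assuming nvme-cli's `id-ns -o json` shape (a top-level `lbaf` array with `ds`/`ms` fields) and interpreting "smallest qualifying" as smallest metadata size; the real helper may differ:

```python
def find_md_lbaf_id(id_ns: dict):
    """Return the index of the smallest md-capable 4K LBAF (ds=12, ms>=8),
    or None when the drive exposes none (such drives use the fallback layout)."""
    best = None
    for i, lbaf in enumerate(id_ns.get("lbaf", [])):
        if lbaf.get("ds") == 12 and lbaf.get("ms", 0) >= 8:
            if best is None or lbaf["ms"] < id_ns["lbaf"][best]["ms"]:
                best = i
    return best
```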

Capacity accounting
- alceml_fallback_overhead_bytes(cluster, size): 6 lost data blocks
  per 2 MiB extent (510 -> 504, ~1.171875%) when the device runs in
  cv_fallback_method.
- lvol_controller charges that as initial provisioned utilization
  rather than reducing reported raw cluster_size_total, so the
  overhead surfaces through the existing prov_cap_warn / prov_cap_crit
  thresholds and md-on-device drives contribute zero.
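The overhead arithmetic works out as below (a sketch; the real helper takes the cluster and a size, and the constants are inferred from the numbers above):

```python
BLOCK_SIZE = 4096
EXTENT_BLOCKS = 512        # one 2 MiB extent = 512 x 4 KiB blocks
LOST_PER_EXTENT = 6        # usable data blocks drop 510 -> 504 in fallback mode

def alceml_fallback_overhead_bytes(size_bytes: int) -> int:
    """Bytes pre-charged as provisioned utilization for a fallback-mode device."""
    extents = size_bytes // (EXTENT_BLOCKS * BLOCK_SIZE)
    return extents * LOST_PER_EXTENT * BLOCK_SIZE
```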

Tests
- 27 new unit tests in tests/test_inline_checksum.py: model defaults,
  find_md_lbaf_id corner cases, alceml_checksum_params combos,
  fallback-overhead math (including 6/512 ratio sanity), RPC param
  wire-up for each method, and addNvmeDevices md detection from a
  mocked bdev_get_bdevs payload.

Behaviour notes
- Default-off cluster flag means no behaviour change for existing
  clusters; full unit-test suite (870 passing) stayed green during
  development on a separate worktree.
- Data plane reads md_size itself via spdk_bdev_get_md_size, so the
  control plane never has to pass it. cv_md_method does not fall back
  automatically: the control plane must pick the method correctly per
  device, hence the add-node-time detection plumbing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The inline-checksum-validation control-plane work requires the matching
CRC support in the data plane. Retag SIMPLY_BLOCK_SPDK_ULTRA_IMAGE from
:main-latest to :checksum-validation-latest so deployed storage nodes
pull the corresponding ultra build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… summary

Three changes that turn the lab setup script into a proper checksum-aware
deploy on the inline-checksum-validation branch.

1. Pre-deploy cleanup (Phase 1.5) on every node:
   - sbctl sn deploy-cleaner (mgmt + storage; check=False so the no-op on
     mgmt doesn't fail the run)
   - docker rm -f stragglers + docker system prune -af --volumes
     (safe BEFORE cluster create; never run after activate)
   - fresh `docker pull` of SIMPLY_BLOCK_DOCKER_IMAGE and
     SIMPLY_BLOCK_SPDK_ULTRA_IMAGE, image names read from the just-
     installed simplyblock_core/env_var so the pulls track whatever
     branch is being deployed.

2. Inline checksum enabled by default. cluster create gets
   --enable-inline-checksum (frozen at create time) and sn configure
   gets --enable-inline-checksum --force, so the configure step actually
   takes the reformat path that picks an md-capable LBAF (ds=12, ms>=8)
   instead of short-circuiting on already-4K. --no-inline-checksum
   opts back out for non-checksum runs.

3. End-of-setup ALCEML mode summary. Reads each NVMeDevice's
   md_size / md_supported via DBController and prints, per node and
   device, whether alceml is running method=1 (md-on-device) or
   method=2 (fallback / emulation), with totals. Mirrors
   simplyblock_core.utils.alceml_checksum_params so the report matches
   the actual data-plane configuration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lab counterpart of aws_dual_node_outage_soak_mixed_churn.py, designed
to run from the simplyblock jump host against the bare-metal cluster
deployed by setup_lab_perf_test1.py.

Changes vs. the AWS variant (everything else - scenario enumeration,
churn loop, fio fault detection, NIC chaos - is unchanged):
- SSH auth uses a single shared root password (--password,
  SBCLI_ROOT_PASSWORD env var, or interactive prompt) instead of an
  EC2 keypair. paramiko gets it via password=; the subprocess fallback
  uses sshpass -e + SSHPASS env so the password never hits a command
  line. paramiko is preferred (persistent connections); sshpass works
  but reconnects per command.
- Default --metadata is cluster_metadata_base.json (what the lab
  setup writes), default --expected-node-count is 4.
- mount_root uses /root for the root user (the AWS form
  /home/{user}/... breaks because root's home is /root).
- _get_data_nics also honors metadata["data_iface"] so the single-NIC
  lab metadata format works without forcing the hardcoded eth1
  fallback.
- Removed Windows/EC2 key-path resolution helpers (candidate_key_paths,
  resolve_key_path, --ssh-key flag).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two adjustments for the lab cluster:

- MAX_LVOL default drops from 100 to 25. The lab nodes have less
  capacity headroom than the AWS i3en class; 25 lvols/node is the
  intended ceiling for this hardware. Override via --max-lvol if needed.

- New Phase 1.5d on storage nodes only: explicit per-NVMe wipe before
  sn configure runs. findmnt identifies the root device and skips it;
  every other NVMe disk gets `wipefs -af` (partition + signature
  cleanup) and `nvme format -f -s 0` (namespace reset). check=False so
  a drive that refuses format -- still held by something, RO, etc. --
  logs a warning instead of aborting the deploy; sn configure
  --enable-inline-checksum --force will then surface the real problem
  and apply the md-capable LBAF on top of the clean state.
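The skip-the-root-device logic reduces to a small predicate (hypothetical helper name; the actual phase shells out to `findmnt`, `wipefs -af`, and `nvme format -f -s 0`):

```python
def should_wipe(dev_name: str, root_source: str) -> bool:
    """True for NVMe disks that do not back the root filesystem.
    dev_name is a bare device name (e.g. "nvme1n1"); root_source is
    findmnt's SOURCE for / (e.g. "/dev/nvme0n1p2")."""
    return dev_name.startswith("nvme") and not root_source.startswith(f"/dev/{dev_name}")
```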

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1.5c hit "dial tcp [v6]:443: cannot assign requested address"
on a single storage node mid-pull, aborting the whole deploy.
public.ecr.aws fronted by S3 occasionally returns those - IPv6 source
race, signed-URL hiccup, etc. - and one blip on one node should not
take the run down with it.

Retry each docker pull up to 6 times with 15s backoff (90s of
headroom). Persistent failures still error out so a real registry
outage or auth problem is not silently masked.
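The retry shape is roughly the following (a sketch with the pull injected as a callable so the backoff is testable; the script presumably wraps the `docker pull` subprocess call):

```python
import time

def pull_with_retry(pull_once, attempts=6, backoff_s=15, sleep=time.sleep):
    """Retry past transient registry blips; raise so a real outage still fails
    loudly. pull_once() returns True on success; returns the winning attempt."""
    for attempt in range(1, attempts + 1):
        if pull_once():
            return attempt
        if attempt < attempts:
            sleep(backoff_s)   # 6 x 15 s = 90 s of headroom, as in the commit
    raise RuntimeError(f"pull failed after {attempts} attempts")
```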

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
sbctl sn configure --force always prompts "Type YES/Y to continue"
before reformatting NVMes (interactive safety in
simplyblock_core/utils/__init__.py around line 1789). In an automated
SSH session there's no TTY answering it, so the prompt sat for the
full 600s subprocess timeout and the deploy aborted on every storage
node:

  WARNING: Formating Nvme devices /dev/nvme1n1 /dev/nvme0n1
  Type YES/Y to continue: ...
  TimeoutExpired after 600 seconds

Wrap the configure invocation with `echo YES | ...` only when inline
checksum mode is on (the path that actually adds --force). Localized
to the one command so we don't have to plumb stdin through ssh_exec
for everything.
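The localized stdin wrap amounts to (hypothetical helper; the real script builds the command string inline):

```python
def wrap_configure_cmd(configure_cmd: str, inline_checksum: bool) -> str:
    """Pre-answer the TTY-less YES prompt only on the inline-checksum path,
    since that is the path that adds --force and triggers the reformat prompt."""
    return f"echo YES | {configure_cmd}" if inline_checksum else configure_cmd
```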

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the two NVMes on a node format to different LBAFs (e.g. one drive
exposes an md-capable LBAF and the other doesn't, so find_md_lbaf_id
returns md=8 for one partition and md=0 for the other), SPDK's RAID1
layer rejects the JM mirror with -EINVAL:

  bdev_raid.c:3567 *ERROR*: Raid bdev 'raid_jm_<uuid>' has different
  metadata format than base bdev 'nvme_<...>n1p1'

That prevents sn add-node from completing on heterogeneous hardware,
which is intentional in the lab where we deliberately mix one
md-capable drive and one md-less drive to exercise both alceml modes
(method=1 on-device, method=2 fallback) in the same cluster.

Fallback: when bdev_raid_create returns false, log a warning and
proceed with a single-bdev JM on jm_nvme_bdevs[0]. The persisted
JMDevice carries only that bdev in jm_nvme_bdev_list, so the restart
path naturally takes its existing len==1 branch (recreate from single
bdev, no RAID). Same behavior in both call paths -- the
"jm_device.raid_bdev already set" branch (post-restart recreate) and
the fresh-create branch.
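The fallback branch can be sketched like this (rpc is a stand-in for the SPDK RPC wrapper; the function name is illustrative):

```python
def build_jm_bdev_list(rpc, raid_name, jm_nvme_bdevs):
    """Try the RAID1 JM mirror; if SPDK rejects it (e.g. mixed metadata
    formats), persist only the first bdev so the restart path naturally
    takes its existing len==1 single-bdev branch."""
    if len(jm_nvme_bdevs) > 1 and rpc.bdev_raid_create(raid_name, jm_nvme_bdevs):
        return jm_nvme_bdevs           # mirrored JM across both partitions
    # real code logs a warning here, then degrades to a single-bdev JM
    return jm_nvme_bdevs[:1]
```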

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the AWS variant's "fio runs continuously across outages plus a
3-20 minute background churn timer" pattern with a deterministic
per-iteration cycle:

  1. start fio on every volume
  2. apply the dual-node outage pair (fio takes the hit)
  3. wait for both nodes back online + post-outage check_fio (fault gate)
  4. stop fio
  5. optionally rebuild one randomly-selected volume
     (every --churn-every-n-iters iterations; default 3)
  6. wait_for_cluster_stable + wait_for_data_migration_complete --
     UNLOADED, so migration drains fast
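The six steps above can be sketched as one loop body (every method on `cluster` is a stand-in for the script's real helpers, not actual API):

```python
import random

def run_iteration(cluster, volumes, iteration, churn_every=3):
    """One deterministic soak iteration: fio spans only the outage, and the
    settle phase at the end runs unloaded so migration drains fast."""
    jobs = [cluster.start_fio(v) for v in volumes]      # 1. fio on every volume
    cluster.apply_dual_node_outage()                    # 2. outage pair under IO
    cluster.wait_nodes_online()                         # 3. both nodes back...
    cluster.check_fio(jobs)                             #    ...then the fault gate
    cluster.stop_fio(jobs)                              # 4. stop fio
    if iteration % churn_every == 0:                    # 5. periodic volume rebuild
        cluster.rebuild_volume(random.choice(volumes))
    cluster.wait_for_cluster_stable()                   # 6. settle, unloaded
    cluster.wait_for_data_migration_complete()
```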

Why: the previous variant kept fio running across the rebalance/data-
migration drain, which dragged out iteration time massively under IO
pressure. New pattern keeps the "fio survives outage" coverage but
runs settling unloaded, so iteration time tracks the bare-cluster
rebalance time. Trades "outage + concurrent volume churn" coverage
(churn timer used to overlap outages occasionally) for predictable
per-iteration timing.

What's gone:
  - background churn thread + serial_lock + churn_stop_event +
    churn_error + reraise_churn_error + start/stop_churn_thread
  - --churn-min-seconds / --churn-max-seconds (replaced by
    --churn-every-n-iters; --no-churn still works)
  - inter-iteration NIC chaos and its --nic-chaos-duration /
    --no-nic-chaos / _ensure_all_data_nics_up / _disable_nic_on_all_nodes
    / _enable_nic_on_all_nodes helpers
  - per-job fio teardown helpers (_check_one_fio, _stop_fio_for_job):
    churn no longer touches fio because fio is already stopped when
    churn fires

What's kept:
  - all outage methods (graceful/forced/container_kill/host_reboot/
    network_outage_*)
  - method-pair x role-category enumeration, shuffling, --start-at,
    --cycles
  - check_fio fault contract (rc_file / wrapper-gone / stderr error
    are all faults)
  - lvol-disconnect/unmount/delete/recreate/connect/mkfs/mount cycle
    (just driven serially from the main loop now)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Control plane was still pulling simplyblock:main, which lacks the
inline-checksum NVMeDevice model fields. Tag the branch build to match
the ultra:checksum-validation-latest data-plane image so deploys from
this branch get a self-consistent control plane.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI tags by branch name (workflow at .github/workflows/docker-image.yml
runs sed 's|/|-|g' on the ref). The image is pushed as
simplyblock:inline-checksum-validation, not :checksum-validation-latest
(that suffix only applies to the ultra data-plane image).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cc3b481 added an FDB re-fetch at the top of _check_peer_disconnected
that bypasses the test's input mock. Without DBController patched, the
re-fetch hits an uninitialised FDB client, KeyError fires, and the
function short-circuits to True for every input — masking the four
fall-through-to-quorum branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
schmidt-scaled force-pushed the inline-checksum-validation branch from 9ba3cbc to 9bdcc9f (May 10, 2026 10:19)


def format_nvme_device(nvme_device: str, lbaf_id: int):
if is_namespace_4k_from_nvme_list(nvme_device):
def format_nvme_device(nvme_device: str, lbaf_id: int, force_reformat: bool = False):
schmidt-scaled and others added 2 commits May 10, 2026 18:27
…otent

Two compounding bugs hit during the 2026-05-10 AWS-dual-soak run on the
inline-checksum-validation branch (cluster d79edc69, outage pair df70710b
+ b278fd62). Restart of b278fd62 looped 4 attempts; the 4th "succeeded"
into ONLINE without ever recreating LVS_9964's upper stack, leaving
LVS_1315/hublvol and LVS_9704/hublvol absent on vm201.

Abort contract — kill SPDK reliably from any abort path.

* storage_node_ops._kill_spdk_until_dead: new module-level helper, retries
  spdk_process_kill 3 x 5 s and returns only after spdk_process_is_up
  reports down (or all attempts exhaust). Replaces the old single 5 s soft
  window that logged a warning on overrun and proceeded with a still-alive
  zombie SPDK behind it.
* recreate_lvstore._kill_app: refactored onto _kill_spdk_until_dead so the
  hardened kill is shared.
* restart_storage_node._abort_restart: same — now goes through the
  hardened kill. Previously a fire-and-forget kill that did not verify
  SPDK was actually gone before setting OFFLINE.
* restart_storage_node: when recreate_all_lvstores returns False (without
  raising), also route through _abort_restart. The 10:58:11 attempt hit
  exactly this gap — _create_bdev_stack returned err on a duplicate
  raid0_9964 (auto-examine had already surfaced it), recreate_lvstore
  returned False, the node was set OFFLINE without killing SPDK, the next
  attempt found the same stale state, and the loop continued until the
  cluster went SUSPENDED and the shortcut path papered over the failure.
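The hardened kill contract reduces to this shape (a sketch with the RPC probes injected as callables; the real helper calls spdk_process_kill / spdk_process_is_up):

```python
import time

def kill_spdk_until_dead(kill, is_up, attempts=3, wait_s=5, sleep=time.sleep):
    """Return True only once SPDK is verifiably down; False when all attempts
    exhaust, so the caller aborts loudly instead of proceeding over a zombie."""
    for _ in range(attempts):
        kill()
        sleep(wait_s)
        if not is_up():
            return True
    return False
```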

Re-activation idempotency.

* recreate_lvstore + recreate_lvstore_on_non_leader: when the raid bdev
  exists but the lvstore did not surface, drop the raid via
  bdev_raid_delete and re-create the bdev stack via _create_bdev_stack
  before bdev_examine. SPDK rejects re-examine of an already-known bdev
  with "Duplicate bdev name for manual examine: raid0_<vuid>", so a plain
  examine call against a stale-but-present raid was a silent no-op that
  guaranteed the lvstore stayed missing on every cluster_activate retry.
* recreate_lvstore: before subsystem_create for each lvol, probe SPDK via
  _rpc_subsystem_exists (mirroring the existing pattern in
  recreate_lvstore_on_non_leader). Eliminates the
  "Subsystem NQN ...:lvol:<uuid> already exists" / "Unable to create
  subsystem" spam during re-activation flapping.
* _create_crypto_lvol (lvol_controller): probe bdev presence before
  recreating the key + crypto bdev; treat lvol_crypto_key_create failure
  as benign (likely existing key from a prior pass) and proceed to the
  crypto-bdev create. Without this, every re-activation pass on a node
  hosting crypto lvols would fail.

Effect: re-activation can now converge on a node whose upper stack was
torn down by a prior aborted restart, and abort paths no longer leave
zombie SPDK + stale on-disk bdev names that trap subsequent retries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-port blocks

The client-port-first ordering from incident 2026-05-06 is insufficient
because ANA multipath refills the secondary's pipeline during the 1 s
"drain" window. When the hublvol then closes, the first failed redirect
fires lvol.c:3606 spdk_lvs_trigger_leadership_switch "Leadership changed
due to receive new IO" and the secondary auto-promotes, racing the
mgmt-driven stepdown and producing a writer conflict (incident
2026-05-10 iter 3, jm_vuid=4245).

Per-LVS order is now: block client port -> set_leader=False +
force_to_non_leader -> sleep 1s -> block hublvol port. With our
leadership dropped before the hub closes, JC consensus elects the new
leader on a peer with a still-healthy hub path, the peer's lvstore
transitions cleanly via the mgmt-driven path, and the hub close is
benign.
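The per-LVS sequence can be sketched as follows (lvs is a stand-in object; the method names are illustrative, not the real API):

```python
import time

def isolate_lvstore(lvs, sleep=time.sleep):
    """Drop our leadership before the hub closes, so JC consensus elects a
    peer with a healthy hub path instead of the IO-triggered auto-promote."""
    lvs.block_client_port()      # stop ANA multipath refilling the pipeline
    lvs.set_leader(False)        # drop leadership first...
    lvs.force_to_non_leader()    # ...so the auto-promote race cannot fire
    sleep(1)                     # let in-flight redirects drain
    lvs.block_hublvol_port()     # hub close is now benign
```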

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>