inline-checksum: optional per-cluster CRC validation (TD.100226.1) #1016
Open
schmidt-scaled wants to merge 14 commits into
Conversation
argument = subcommand.add_argument('--size-range', help='NVMe SSD device size range separated by -, can be X(m,g,t) or bytes as integer, example: --size-range 50G-1T or --size-range 1232345-67823987, --device-model and --size-range must be set together.', type=str, default='', dest='size_range', required=False)
argument = subcommand.add_argument('--nvme-names', help='Comma separated list of nvme namespace names like nvme0n1,nvme1n1.', type=str, default='', dest='nvme_names', required=False)
argument = subcommand.add_argument('--force', help='Force format detected or passed nvme pci address to 4K and clean partitions.', dest='force', action='store_true')
argument = subcommand.add_argument('--enable-inline-checksum', help='When formatting (with --force), prefer an LBAF that supports >=8 bytes of NVMe metadata per block, so alceml can run inline checksum validation in md-on-device mode. Drives with no md-capable LBAF still format to plain 4K and will use the fallback layout.', dest='inline_checksum', action='store_true')
if self.developer_mode:
    argument = subcommand.add_argument('--disable-monitoring', help='Disable monitoring stack, false by default. Default: `false`.', dest='disable_monitoring', action='store_true')
argument = subcommand.add_argument('--strict-node-anti-affinity', help='Enable strict node anti affinity for storage nodes. Never more than one chunk is placed on a node. This requires a minimum of _data-chunks-in-stripe + parity-chunks-in-stripe + 1_ nodes in the cluster.', dest='strict_node_anti_affinity', action='store_true')
argument = subcommand.add_argument('--enable-inline-checksum', help='Enable inline CRC checksum validation on every IO for silent-data-error protection. Cannot be enabled or disabled after cluster creation. Per-device alceml mode (md-on-device vs fallback) is auto-detected at add-node.', dest='inline_checksum', action='store_true')
if self.developer_mode:
    argument = subcommand.add_argument('--inflight-io-threshold', help='The number of inflight IOs allowed before the IO queuing starts. Default: `4`.', type=int, default=4, dest='inflight_io_threshold')
argument = subcommand.add_argument('--strict-node-anti-affinity', help='Enable strict node anti affinity for storage nodes. Never more than one chunk is placed on a node. This requires a minimum of _data-chunks-in-stripe + parity-chunks-in-stripe + 1_ nodes in the cluster.', dest='strict_node_anti_affinity', action='store_true')
argument = subcommand.add_argument('--enable-inline-checksum', help='Enable inline CRC checksum validation on every IO for silent-data-error protection. Cannot be enabled or disabled after cluster creation.', dest='inline_checksum', action='store_true')
for attempt in range(1, 16):
    try:
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())

Comment on lines +452 to +453

"containers=$(docker ps -aq); "
"if [ -n \"$containers\" ]; then docker rm -f $containers; fi",
for host in self.node_hosts.values():
    try:
        host.close()
    except Exception:

if cached is not None:
    try:
        cached.close()
    except Exception:
Adds the sbcli control-plane support for the inline silent-data-error protection feature whose data-plane work landed on ultra branch checksum-validation. Frozen at cluster-create time; no upgrade path for existing clusters. Per-device alceml mode (md-on-device vs fallback) is auto-detected from the bound SPDK bdev's md_size at add-node and on restart, so a cluster can have a heterogeneous mix of drives.

Cluster + per-device flags
- Cluster.inline_checksum (bool, default False): persisted only via create_cluster / add_cluster; no mutator.
- NVMeDevice.md_size (int) and NVMeDevice.md_supported (bool): set in addNvmeDevices from bdev_get_bdevs' top-level md_size field (spdk_bdev_get_md_size; emitted by lib/bdev/bdev_rpc.c). Refreshed on every restart-device path so a between-restart `nvme format` is reflected.

alceml RPC plumbing (matches checksum-validation branch)
- bdev_alceml_create gains optional checksum_method (1=md-on-device, 2=fallback, default 0=off), cache_size, cache_eviction_threshold; zero-defaults are not sent so the data plane keeps its built-ins (cv_default_cache_size=2000, threshold 90%).
- utils.alceml_checksum_params(cluster, dev) picks 0/1/2 from the cluster flag plus md_supported. The data plane refuses method=1 on md_size==0, so the per-device decision is mandatory.
- Both alceml call sites (storage_node_ops._create_storage_device_stack and device_controller restart-device) thread the params through and warn when a device falls back.

CLI
- cluster create / cluster add / sn configure each gain --enable-inline-checksum.
- sn configure runs `find_md_lbaf_id` against `nvme id-ns` JSON to pick the smallest qualifying LBAF (ds=12, ms>=8) and forces a reformat past the existing 4K-already-formatted early-out (SectorSize stays 4096 across an md/no-md transition).

Capacity accounting
- alceml_fallback_overhead_bytes(cluster, size): 6 lost data blocks per 2 MiB extent (510 -> 504, ~1.171875%) when the device runs in cv_fallback_method.
- lvol_controller charges that as initial provisioned utilization rather than reducing reported raw cluster_size_total, so the overhead surfaces through the existing prov_cap_warn / prov_cap_crit thresholds and md-on-device drives contribute zero.

Tests
- 27 new unit tests in tests/test_inline_checksum.py: model defaults, find_md_lbaf_id corner cases, alceml_checksum_params combos, fallback-overhead math (including 6/512 ratio sanity), RPC param wire-up for each method, and addNvmeDevices md detection from a mocked bdev_get_bdevs payload.

Behaviour notes
- Default-off cluster flag means no behaviour change for existing clusters; the full unit-test suite (870 passing) stayed green during development on a separate worktree.
- The data plane reads md_size itself via spdk_bdev_get_md_size, so the control plane never has to pass it. cv_md_method does not auto fall back: the control plane must pick correctly per-device, hence the add-node-time detection plumbing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
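The 0/1/2 selection and the fallback pre-charge are simple enough to sketch. A minimal illustration, assuming only the fields named above (`inline_checksum`, `md_supported`); the real helpers live in `simplyblock_core.utils` and may differ in constant names and rounding:

```python
# Hedged sketch: only the 0/1/2 selection rule and the 6-blocks-per-2MiB-extent
# charge come from the commit text; everything else here is an assumption.

CHECKSUM_OFF = 0           # cluster flag disabled
CHECKSUM_MD_ON_DEVICE = 1  # CRC kept in per-block NVMe metadata (needs md_size >= 8)
CHECKSUM_FALLBACK = 2      # CRC kept in data blocks reserved inside each extent

EXTENT_BYTES = 2 * 1024 * 1024          # 2 MiB extent
BLOCK_BYTES = 4096
FALLBACK_LOST_BLOCKS_PER_EXTENT = 6     # 510 data blocks -> 504, ~1.17% of raw size


def alceml_checksum_params(cluster, dev) -> int:
    """Pick the per-device checksum_method passed to bdev_alceml_create."""
    if not cluster.inline_checksum:
        return CHECKSUM_OFF
    # The data plane refuses method=1 when the bound bdev reports md_size == 0,
    # so the per-device choice has to be made here, at add-node / restart time.
    return CHECKSUM_MD_ON_DEVICE if dev.md_supported else CHECKSUM_FALLBACK


def alceml_fallback_overhead_bytes(cluster, size_bytes: int) -> int:
    """Capacity pre-charged as provisioned utilization for a fallback-mode device."""
    if not cluster.inline_checksum:
        return 0
    extents = size_bytes // EXTENT_BYTES
    return extents * FALLBACK_LOST_BLOCKS_PER_EXTENT * BLOCK_BYTES
```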
The inline-checksum-validation control-plane work requires the matching CRC support in the data plane. Retag SIMPLY_BLOCK_SPDK_ULTRA_IMAGE from :main-latest to :checksum-validation-latest so deployed storage nodes pull the corresponding ultra build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… summary
Three changes that turn the lab setup script into a proper checksum-aware
deploy on the inline-checksum-validation branch.
1. Pre-deploy cleanup (Phase 1.5) on every node:
- sbctl sn deploy-cleaner (mgmt + storage; check=False so the no-op on
mgmt doesn't fail the run)
- docker rm -f stragglers + docker system prune -af --volumes
(safe BEFORE cluster create; never run after activate)
- fresh `docker pull` of SIMPLY_BLOCK_DOCKER_IMAGE and
SIMPLY_BLOCK_SPDK_ULTRA_IMAGE, image names read from the just-
installed simplyblock_core/env_var so the pulls track whatever
branch is being deployed.
2. Inline checksum enabled by default. cluster create gets
--enable-inline-checksum (frozen at create time) and sn configure
gets --enable-inline-checksum --force, so the configure step actually
takes the reformat path that picks an md-capable LBAF (ds=12, ms>=8)
instead of short-circuiting on already-4K. --no-inline-checksum
opts back out for non-checksum runs.
3. End-of-setup ALCEML mode summary. Reads each NVMeDevice's
md_size / md_supported via DBController and prints, per node and
device, whether alceml is running method=1 (md-on-device) or
method=2 (fallback / emulation), with totals. Mirrors
simplyblock_core.utils.alceml_checksum_params so the report matches
the actual data-plane configuration.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lab counterpart of aws_dual_node_outage_soak_mixed_churn.py, designed
to run from the simplyblock jump host against the bare-metal cluster
deployed by setup_lab_perf_test1.py.
Changes vs. the AWS variant (everything else - scenario enumeration,
churn loop, fio fault detection, NIC chaos - is unchanged):
- SSH auth uses a single shared root password (--password,
SBCLI_ROOT_PASSWORD env var, or interactive prompt) instead of an
EC2 keypair. paramiko gets it via password=; the subprocess fallback
uses sshpass -e + SSHPASS env so the password never hits a command
line. paramiko is preferred (persistent connections); sshpass works
but reconnects per command.
- Default --metadata is cluster_metadata_base.json (what the lab
setup writes), default --expected-node-count is 4.
- mount_root uses /root for the root user (the AWS form
/home/{user}/... breaks because root's home is /root).
- _get_data_nics also honors metadata["data_iface"] so the single-NIC
lab metadata format works without forcing the hardcoded eth1
fallback.
- Removed Windows/EC2 key-path resolution helpers (candidate_key_paths,
resolve_key_path, --ssh-key flag).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
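A minimal sketch of the two SSH paths described above, assuming only paramiko and sshpass on the jump host; the function names and the hardcoded root username are illustrative, not the script's actual API:

```python
import os
import subprocess

import paramiko


def ssh_connect(host: str, password: str) -> paramiko.SSHClient:
    """Preferred path: a persistent paramiko connection authenticated by password."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username="root", password=password, timeout=30)
    return client


def ssh_exec_fallback(host: str, password: str, command: str) -> str:
    """Fallback path: sshpass -e reads the password from SSHPASS, so it never
    appears on a command line; reconnects for every command."""
    env = dict(os.environ, SSHPASS=password)
    res = subprocess.run(
        ["sshpass", "-e", "ssh", "-o", "StrictHostKeyChecking=no", f"root@{host}", command],
        env=env, capture_output=True, text=True, check=True,
    )
    return res.stdout
```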
Two adjustments for the lab cluster:
- MAX_LVOL default drops from 100 to 25. The lab nodes have less capacity headroom than the AWS i3en class; 25 lvols/node is the intended ceiling for this hardware. Override via --max-lvol if needed.
- New Phase 1.5d on storage nodes only: explicit per-NVMe wipe before sn configure runs. findmnt identifies the root device and skips it; every other NVMe disk gets `wipefs -af` (partition + signature cleanup) and `nvme format -f -s 0` (namespace reset). check=False so a drive that refuses format -- still held by something, RO, etc. -- logs a warning instead of aborting the deploy; sn configure --enable-inline-checksum --force will then surface the real problem and apply the md-capable LBAF on top of the clean state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
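A sketch of what the Phase 1.5d wipe loop could look like when run locally on a storage node; the `run` helper and the root-device check are assumptions, while the wipefs / nvme format invocations and the warn-only behaviour follow the commit text:

```python
import subprocess


def run(cmd: str, check: bool = True) -> str:
    """Stand-in for the setup script's per-node ssh_exec helper (runs locally here)."""
    res = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    if check and res.returncode != 0:
        raise RuntimeError(f"{cmd!r} failed: {res.stderr.strip()}")
    return res.stdout


def wipe_non_root_nvmes() -> None:
    root_src = run("findmnt -n -o SOURCE /").strip()    # e.g. /dev/nvme0n1p2
    for name in run("lsblk -dn -o NAME").split():
        if not name.startswith("nvme"):
            continue
        dev = f"/dev/{name}"
        if root_src.startswith(dev):
            continue  # never touch the disk backing the root filesystem
        run(f"wipefs -af {dev}", check=False)            # partition + signature cleanup
        run(f"nvme format -f -s 0 {dev}", check=False)   # namespace reset; warn-only on refusal
```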
Phase 1.5c hit "dial tcp [v6]:443: cannot assign requested address" on a single storage node mid-pull, aborting the whole deploy. public.ecr.aws fronted by S3 occasionally returns those - IPv6 source race, signed-URL hiccup, etc. - and one blip on one node should not take the run down with it. Retry each docker pull up to 6 times with 15s backoff (90s of headroom). Persistent failures still error out so a real registry outage or auth problem is not silently masked. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
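A minimal retry wrapper in the shape the commit describes (6 attempts, 15 s backoff); the function name is illustrative:

```python
import subprocess
import time


def pull_with_retry(image: str, attempts: int = 6, backoff_s: int = 15) -> None:
    """Absorb transient registry blips; persistent failures still raise so a real
    outage or auth problem is not silently masked."""
    for attempt in range(1, attempts + 1):
        res = subprocess.run(["docker", "pull", image], capture_output=True, text=True)
        if res.returncode == 0:
            return
        if attempt == attempts:
            raise RuntimeError(f"docker pull {image} failed after {attempts} attempts: "
                               f"{res.stderr.strip()}")
        time.sleep(backoff_s)
```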
sbctl sn configure --force always prompts "Type YES/Y to continue" before reformatting NVMes (interactive safety in simplyblock_core/utils/__init__.py around line 1789). In an automated SSH session there's no TTY answering it, so the prompt sat for the full 600s subprocess timeout and the deploy aborted on every storage node:

    WARNING: Formating Nvme devices /dev/nvme1n1 /dev/nvme0n1
    Type YES/Y to continue: ...
    TimeoutExpired after 600 seconds

Wrap the configure invocation with `echo YES | ...` only when inline checksum mode is on (the path that actually adds --force). Localized to the one command so we don't have to plumb stdin through ssh_exec for everything.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
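A small sketch of the conditional wrapping; `configure_cmd` and the `ssh_exec` call in the usage note are illustrative names, not the script's real helpers:

```python
def configure_cmd(base_cmd: str, inline_checksum: bool) -> str:
    """Answer the interactive prompt via stdin only on the path that adds --force."""
    if inline_checksum:
        return f"echo YES | {base_cmd} --enable-inline-checksum --force"
    return base_cmd

# e.g. ssh_exec(node, configure_cmd("sbctl sn configure --max-lvol 25", inline_checksum=True))
```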
When the two NVMes on a node format to different LBAFs (e.g. one drive exposes an md-capable LBAF and the other doesn't, so find_md_lbaf_id returns md=8 for one partition and md=0 for the other), SPDK's RAID1 layer rejects the JM mirror with -EINVAL:

    bdev_raid.c:3567 *ERROR*: Raid bdev 'raid_jm_<uuid>' has different metadata format than base bdev 'nvme_<...>n1p1'

That prevents sn add-node from completing on heterogeneous hardware -- and the lab deliberately mixes one md-capable drive and one md-less drive to exercise both alceml modes (method=1 on-device, method=2 fallback) in the same cluster.

Fallback: when bdev_raid_create returns false, log a warning and proceed with a single-bdev JM on jm_nvme_bdevs[0]. The persisted JMDevice carries only that bdev in jm_nvme_bdev_list, so the restart path naturally takes its existing len==1 branch (recreate from single bdev, no RAID). Same behavior in both call paths -- the "jm_device.raid_bdev already set" branch (post-restart recreate) and the fresh-create branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
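A rough sketch of that fallback; the `rpc_client.bdev_raid_create` call shape and its truthy return convention are assumptions taken from the commit text, not the actual simplyblock RPC wrapper:

```python
def create_jm_bdev(rpc_client, jm_vuid, jm_nvme_bdevs, logger):
    """Try the RAID1 JM mirror first; on failure fall back to a single-bdev JM."""
    if len(jm_nvme_bdevs) > 1:
        raid_name = f"raid_jm_{jm_vuid}"
        if rpc_client.bdev_raid_create(raid_name, "raid1", jm_nvme_bdevs):
            return raid_name, jm_nvme_bdevs
        # Mixed md / no-md partitions make SPDK's RAID1 layer reject the mirror with
        # -EINVAL; proceed with a single-bdev JM instead of failing the whole add-node.
        logger.warning("JM RAID1 create failed, falling back to single-bdev JM on %s",
                       jm_nvme_bdevs[0])
    # Persisting only this bdev in jm_nvme_bdev_list means the restart path later
    # takes its existing len==1 branch (recreate from single bdev, no RAID).
    return jm_nvme_bdevs[0], [jm_nvme_bdevs[0]]
```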
Replace the AWS variant's "fio runs continuously across outages plus a
3-20 minute background churn timer" pattern with a deterministic
per-iteration cycle:
1. start fio on every volume
2. apply the dual-node outage pair (fio takes the hit)
3. wait for both nodes back online + post-outage check_fio (fault gate)
4. stop fio
5. optionally rebuild one randomly-selected volume
(every --churn-every-n-iters iterations; default 3)
6. wait_for_cluster_stable + wait_for_data_migration_complete --
UNLOADED, so migration drains fast
Why: the previous variant kept fio running across the rebalance/data-
migration drain, which dragged out iteration time massively under IO
pressure. New pattern keeps the "fio survives outage" coverage but
runs settling unloaded, so iteration time tracks the bare-cluster
rebalance time. Trades "outage + concurrent volume churn" coverage
(churn timer used to overlap outages occasionally) for predictable
per-iteration timing.
What's gone:
- background churn thread + serial_lock + churn_stop_event +
churn_error + reraise_churn_error + start/stop_churn_thread
- --churn-min-seconds / --churn-max-seconds (replaced by
--churn-every-n-iters; --no-churn still works)
- inter-iteration NIC chaos and its --nic-chaos-duration /
--no-nic-chaos / _ensure_all_data_nics_up / _disable_nic_on_all_nodes
/ _enable_nic_on_all_nodes helpers
- per-job fio teardown helpers (_check_one_fio, _stop_fio_for_job):
churn no longer touches fio because fio is already stopped when
churn fires
What's kept:
- all outage methods (graceful/forced/container_kill/host_reboot/
network_outage_*)
- method-pair x role-category enumeration, shuffling, --start-at,
--cycles
- check_fio fault contract (rc_file / wrapper-gone / stderr error
are all faults)
- lvol-disconnect/unmount/delete/recreate/connect/mkfs/mount cycle
(just driven serially from the main loop now)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
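A schematic of the per-iteration cycle above; every `test.*` method is a stand-in for the soak script's real helpers:

```python
import random


def run_iteration(test, i: int, churn_every_n_iters: int = 3) -> None:
    test.start_fio_on_all_volumes()     # 1. start fio on every volume
    test.apply_outage_pair()            # 2. dual-node outage -- fio takes the hit
    test.wait_for_nodes_online()        # 3. both nodes back online...
    test.check_fio()                    #    ...then the fault gate (rc_file / wrapper-gone / stderr)
    test.stop_fio_on_all_volumes()      # 4. stop fio before any settling work
    if churn_every_n_iters and i % churn_every_n_iters == 0:
        # 5. rebuild one randomly selected volume (disconnect/unmount/delete/
        #    recreate/connect/mkfs/mount), driven serially from this loop
        test.rebuild_volume(random.choice(test.volumes))
    test.wait_for_cluster_stable()      # 6. settle unloaded so migration drains fast
    test.wait_for_data_migration_complete()
```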
Control plane was still pulling simplyblock:main, which lacks the inline-checksum NVMeDevice model fields. Tag the branch build to match the ultra:checksum-validation-latest data-plane image so deploys from this branch get a self-consistent control plane. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI tags by branch name (workflow at .github/workflows/docker-image.yml runs sed 's|/|-|g' on the ref). The image is pushed as simplyblock:inline-checksum-validation, not :checksum-validation-latest (that suffix only applies to the ultra data-plane image). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cc3b481 added an FDB re-fetch at the top of _check_peer_disconnected that bypasses the test's input mock. Without DBController patched, the re-fetch hits an uninitialised FDB client, KeyError fires, and the function short-circuits to True for every input — masking the four fall-through-to-quorum branches. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
def format_nvme_device(nvme_device: str, lbaf_id: int):
    if is_namespace_4k_from_nvme_list(nvme_device):

def format_nvme_device(nvme_device: str, lbaf_id: int, force_reformat: bool = False):
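A sketch of how `find_md_lbaf_id` plus the widened `format_nvme_device` signature could fit together. The `lbafs`/`ds`/`ms` JSON field names are what recent nvme-cli emits, and "smallest qualifying LBAF" is read here as the qualifying format with the smallest metadata size; both are assumptions about the real implementation:

```python
import json
import subprocess
from typing import Optional


def find_md_lbaf_id(nvme_device: str) -> Optional[int]:
    """Index of the smallest qualifying LBAF: 4K data (ds=12) and >=8 B metadata (ms>=8)."""
    out = subprocess.run(["nvme", "id-ns", nvme_device, "-o", "json"],
                         capture_output=True, text=True, check=True).stdout
    lbafs = json.loads(out).get("lbafs", [])
    candidates = [(lbaf.get("ms", 0), idx) for idx, lbaf in enumerate(lbafs)
                  if lbaf.get("ds") == 12 and lbaf.get("ms", 0) >= 8]
    return min(candidates)[1] if candidates else None


def format_nvme_device(nvme_device: str, lbaf_id: int, force_reformat: bool = False) -> None:
    """Apply the chosen LBAF; force_reformat is what lets --force skip the already-4K early-out."""
    # The real helper short-circuits on is_namespace_4k_from_nvme_list() unless
    # force_reformat is set; that check is omitted in this sketch.
    subprocess.run(["nvme", "format", nvme_device, f"--lbaf={lbaf_id}", "--force"], check=True)
```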
…otent

Two compounding bugs hit during the 2026-05-10 AWS-dual-soak run on the inline-checksum-validation branch (cluster d79edc69, outage pair df70710b + b278fd62). Restart of b278fd62 looped 4 attempts; the 4th "succeeded" into ONLINE without ever recreating LVS_9964's upper stack, leaving LVS_1315/hublvol and LVS_9704/hublvol absent on vm201.

Abort contract — kill SPDK reliably from any abort path.
* storage_node_ops._kill_spdk_until_dead: new module-level helper, retries spdk_process_kill 3 x 5 s and returns only after spdk_process_is_up reports down (or all attempts exhaust). Replaces the old single 5 s soft window that logged a warning on overrun and proceeded with a still-alive zombie SPDK behind it.
* recreate_lvstore._kill_app: refactored onto _kill_spdk_until_dead so the hardened kill is shared.
* restart_storage_node._abort_restart: same — now goes through the hardened kill. Previously a fire-and-forget kill that did not verify SPDK was actually gone before setting OFFLINE.
* restart_storage_node: when recreate_all_lvstores returns False (without raising), also route through _abort_restart. The 10:58:11 attempt hit exactly this gap — _create_bdev_stack returned err on a duplicate raid0_9964 (auto-examine had already surfaced it), recreate_lvstore returned False, the node was set OFFLINE without killing SPDK, the next attempt found the same stale state, and the loop continued until the cluster went SUSPENDED and the shortcut path papered over the failure.

Re-activation idempotency.
* recreate_lvstore + recreate_lvstore_on_non_leader: when the raid bdev exists but the lvstore did not surface, drop the raid via bdev_raid_delete and re-create the bdev stack via _create_bdev_stack before bdev_examine. SPDK rejects re-examine of an already-known bdev with "Duplicate bdev name for manual examine: raid0_<vuid>", so a plain examine call against a stale-but-present raid was a silent no-op that guaranteed the lvstore stayed missing on every cluster_activate retry.
* recreate_lvstore: before subsystem_create for each lvol, probe SPDK via _rpc_subsystem_exists (mirroring the existing pattern in recreate_lvstore_on_non_leader). Eliminates the "Subsystem NQN ...:lvol:<uuid> already exists" / "Unable to create subsystem" spam during re-activation flapping.
* _create_crypto_lvol (lvol_controller): probe bdev presence before recreating the key + crypto bdev; treat lvol_crypto_key_create failure as benign (likely an existing key from a prior pass) and proceed to the crypto-bdev create. Without this, every re-activation pass on a node hosting crypto lvols would fail.

Effect: re-activation can now converge on a node whose upper stack was torn down by a prior aborted restart, and abort paths no longer leave zombie SPDK + stale on-disk bdev names that trap subsequent retries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
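The hardened kill loop is small enough to sketch. Reaching `spdk_process_kill` / `spdk_process_is_up` through an `snode_api` object is an assumption; the 3 x 5 s retry contract is from the commit text:

```python
import time


def _kill_spdk_until_dead(snode_api, attempts: int = 3, wait_s: int = 5) -> bool:
    """Keep killing until SPDK is confirmed down, or the attempts are exhausted."""
    for _ in range(attempts):
        snode_api.spdk_process_kill()
        time.sleep(wait_s)
        if not snode_api.spdk_process_is_up():
            return True   # confirmed dead: safe to proceed with the abort path
    return False          # still alive: caller logs loudly instead of pretending it's gone
```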
…-port blocks

The client-port-first ordering from incident 2026-05-06 is insufficient because ANA multipath refills the secondary's pipeline during the 1 s "drain" window. When the hublvol then closes, the first failed redirect fires lvol.c:3606 spdk_lvs_trigger_leadership_switch "Leadership changed due to receive new IO" and the secondary auto-promotes, racing the mgmt-driven stepdown and producing a writer conflict (incident 2026-05-10 iter 3, jm_vuid=4245).

Per-LVS order is now: block client port -> set_leader=False + force_to_non_leader -> sleep 1s -> block hublvol port. With our leadership dropped before the hub closes, JC consensus elects the new leader on a peer with a still-healthy hub path, the peer's lvstore transitions cleanly via the mgmt-driven path, and the hub close is benign.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
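A schematic of the new per-LVS ordering; `block_port`, `set_leader`, and `force_to_non_leader` are stand-ins for the harness and mgmt helpers named above:

```python
import time


def block_lvs_with_leadership_drop(lvs, block_port, set_leader, force_to_non_leader):
    """Per-LVS shutdown ordering from this commit."""
    block_port(lvs.client_port)      # 1. stop ANA multipath refilling our pipeline
    set_leader(lvs, False)           # 2. drop our own leadership...
    force_to_non_leader(lvs)         #    ...before the hub path disappears
    time.sleep(1)                    # 3. let JC consensus elect a peer with a healthy hub
    block_port(lvs.hublvol_port)     # 4. the hub close is now benign
```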
Summary
- Control-plane support for the inline silent-data-error protection feature whose data-plane work landed on ultra branch `checksum-validation`. Optional, frozen at cluster-create time, no upgrade path for existing clusters.
- Adds `Cluster.inline_checksum`, `NVMeDevice.md_size`/`md_supported`, `bdev_alceml_create` checksum-method kwargs, and per-cluster CLI flags on `cluster create`, `cluster add`, and `sn configure`.
- Per-device alceml mode is auto-detected from `bdev_get_bdevs`' `md_size` field (`spdk_bdev_get_md_size`), so a cluster can mix md-on-device and fallback drives. Capacity overhead from the fallback layout (6 lost blocks per 2 MiB extent ≈ 1.17%) is charged as initial provisioned utilization rather than reducing reported raw capacity.

Why this branch and not main
Four hunks of this feature previously reached main via `aa81fafe` (iteration-77 hang fix) on 2026-04-30. Those hunks were removed in `5ff3e214` so main is clean again. This branch carries the full feature in one self-contained commit on top of that clean main.

Design ref
TD.100226.1

Compatibility
- `Cluster.inline_checksum` defaults to `False`; existing clusters and code paths unchanged.
- The data plane reads `md_size` directly from the bound bdev, so the control plane only flips `checksum_validation_method` 1/2/0 — matching the `checksum-validation` branch RPC schema (`checksum_validation_method`, `cache_size`, `cache_eviction_threshold`).

Test plan
- 27 new unit tests in `tests/test_inline_checksum.py` (model defaults, `find_md_lbaf_id`, `alceml_checksum_params`, fallback-overhead math, RPC param wire-up for each method, `addNvmeDevices` md detection from a mocked bdev payload). All pass.
- Lab cluster deployed with `--enable-inline-checksum`, mixing one drive formatted with an md-capable LBAF (cv_md_method) and one without (cv_fallback_method); verify capacity reporting reflects the ~1.17% pre-charge only on the fallback drive.
- Existing clusters (no `inline_checksum` field in DB) read back as `False` and behave exactly as today.

Open follow-ups (not blocking this PR)
- The `add_cluster`/`cluster add` controller has a pre-existing positional-arg shift around `fabric` (unrelated to this work). The new kwarg is added keyword-only so it isn't affected; the bug should be fixed separately.
- Reported capacity (`sb cluster get-capacity`) is fed from Prometheus and reflects what alceml itself reports; the per-device pre-charge added here is in the lvol-create provisioning gate. If we want it visible in `cluster get-capacity` too, that's a small extension to the stats collector.

🤖 Generated with Claude Code