ENH: Enable HDF5 dataset compression from the WriteDREAM3DFilter #1606
* Encapsulate the chunked+deflate DCPL builder as a file-local helper in DatasetIO.cpp returning Result<hid_t>; H5Pcreate / H5Pset_chunk / H5Pset_deflate failures now propagate through writeSpan instead of silently degrading to contiguous storage. Intentional fall-throughs (level 0, empty dims, sub-16 KiB arrays) still resolve to H5P_DEFAULT.
* Drop the public CreateDatasetCreationPropertyList declaration from DatasetIO.hpp; it had no callers outside DatasetIO.cpp.
* setCompressionLevel now ignores out-of-range input so the documented contract matches the implementation.
* Trim the misleading "honor the same compression level DataArrayIO uses" comment in NeighborListIO; there is one global compressionLevel.
* Move the duplicated HDF5 dataset probe helper from H5Test.cpp and DREAM3DFileTest.cpp into a shared simplnx::UnitTestCommon header (HDF5DatasetProbe.hpp/.cpp) exposing a DatasetLayout enum class so consumers no longer need to include <hdf5.h> for layout assertions.

Signed-off-by: Michael Jackson <mike.jackson@bluequartz.net>
Branch Review
Design Validation — HDF5 Compression Plumbing

A separate pass focused only on the design choices used to wire compression through the stack: where state lives, why the boundaries are drawn the way they are, and whether the trade-offs hold up under the constraints simplnx actually has. The four design choices that shape this PR are validated in turn below.

1) IDataIO opt-in scope
| IDataIO subclass | Why exempt |
|---|---|
| Geometry IO (image/rect/triangle/edge/vertex/quad/hex/tetra) | Geometry datasets are mostly small (vertices < 1M typical), and metadata datasets are tiny. |
| AttributeMatrixIO | Stores tuple shape only — a few usize values. |
| ScalarDataIO | Single scalar. |
| StringArrayIO | Variable-length strings; compression filters apply per-chunk and the win on string data is marginal. |
| NeighborListIO numNeighbors array | Same shape as a small DataArray; gets the writer-level compression too via the DataArray path. |
The 16 KiB small-array bypass inside BuildChunkedDeflateDcpl is the second line of defense: even if a small DataArray slips through (which it does), the helper falls through to contiguous storage. So the policy is "compress big array-shaped data; everything else stays contiguous" — implemented at two layers.
Why not compress everything? Three reasons:
- Chunk-index overhead dominates for small datasets (motivation for the 16 KiB bypass).
- File-open time grows with chunk count; metadata datasets are read on every open.
- gzip on variable-length string datasets doesn't help meaningfully: HDF5 stores the string payloads in a global heap, so per-chunk compression filters only touch the heap references, not the strings themselves.
Verdict: correct, and the choice is conservative — strict opt-in over opt-out. Easy to expand to additional IDataIO subclasses if benchmarks justify it later; harder to retract.
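To make the two-layer policy concrete, here is a minimal sketch of the inner guard, under simplifying assumptions: the real BuildChunkedDeflateDcpl in DatasetIO.cpp returns Result<hid_t> and reports HDF5 API failures as errors; the signature and constant names below are illustrative only.

```cpp
// Illustrative sketch of the fall-through logic only; the real file-local
// helper returns Result<hid_t> and propagates HDF5 API failures as errors.
#include <cstdint>
#include <vector>
#include <hdf5.h>

constexpr uint64_t k_SmallArrayBypassBytes = 16ULL * 1024ULL;

hid_t BuildChunkedDeflateDcplSketch(int32_t level, const std::vector<hsize_t>& dims, size_t elementByteSize)
{
  if(level == 0 || dims.empty())
  {
    return H5P_DEFAULT; // intentional fall-through: contiguous storage
  }
  uint64_t totalBytes = elementByteSize;
  for(hsize_t d : dims)
  {
    totalBytes *= d; // the real helper guards zero dims and overflow here
  }
  if(totalBytes < k_SmallArrayBypassBytes)
  {
    return H5P_DEFAULT; // second line of defense: sub-16 KiB arrays stay contiguous
  }
  hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
  // ... H5Pset_chunk / H5Pset_deflate follow; failures propagate as errors in the real code
  return dcpl;
}
```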
Cross-cutting design decisions
Compression-on-by-default at level 5
Defaults are policy. Three options were available:
- Off by default — safest, but new files would need explicit opt-in to benefit.
- On at level 1 — fast, smaller win.
- On at level 5 — chosen.
Level 5 is the gzip middle of the road: typically 60-80% of level-9's compression ratio at 30-40% of level-9's CPU cost. For voxel/feature data this lands at ~3× smaller files with negligible read-side decompression cost (deflate decode is very fast). The write-side CPU cost is bounded by the user's I/O subsystem; on many disks gzip-5 is faster end-to-end than uncompressed, because the bytes saved on the bus exceed the compression cost. Justified.
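A quick back-of-the-envelope check of that end-to-end claim, under an assumed serial compress-then-write model (real writers overlap the two stages, which only helps): for data size $S$, compression throughput $C$, disk throughput $D$, and compression ratio $r$,

$$\frac{S}{C} + \frac{S}{rD} < \frac{S}{D} \iff C > \frac{r}{r-1}\,D,$$

so with $r \approx 3$ the compressor only needs about $1.5\times$ the disk's throughput for compression to win outright. That holds easily on spinning disks and network filesystems; on fast NVMe it may not, which is why the claim above says "many" rather than "all".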
FromSIMPLJson sets UseCompression = false
This is the design decision that prevents a silent file-format compatibility surprise. SIMPL v6 wrote uncompressed files; if a SIMPL-imported pipeline silently started writing gzip-5 files, every downstream consumer that parses byte offsets, every script that hashes dream3d files for caching, and every reproducibility regression suite would see a one-time bit-for-bit difference. By forcing the default to false for SIMPL conversions, the imported pipeline produces the byte-identical output the user expected when they saved it. Verdict: this is exactly right. A params.linkParameters call connects the bool to the level field in the UI, so the user can manually enable compression if desired.
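For readers unfamiliar with the linked-parameter pattern, a hedged sketch of the wiring in WriteDREAM3DFilter::parameters() (key names, help strings, and header paths are assumptions, not copied from the PR; the linkParameters call is the point):

```cpp
// Hypothetical key names; sketch of the simplnx linked-parameter pattern.
#include <memory>

#include "simplnx/Filter/Parameters.hpp"
#include "simplnx/Parameters/BoolParameter.hpp"
#include "simplnx/Parameters/NumberParameter.hpp"

Parameters WriteDREAM3DFilter::parameters() const
{
  Parameters params;
  params.insertLinkableParameter(std::make_unique<BoolParameter>(
      k_UseCompression_Key, "Use Compression", "Write DataArray datasets chunked + gzip (deflate)", true));
  params.insert(std::make_unique<Int32Parameter>(
      k_CompressionLevel_Key, "Compression Level", "gzip level 1-9; ignored when compression is off", 5));
  // The level field is only editable while the bool is true.
  params.linkParameters(k_UseCompression_Key, k_CompressionLevel_Key, true);
  return params;
}
```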
Preflight rejection of out-of-range levels (when compression is on) but acceptance when off
Looking at WriteDREAM3DFilter::preflightImpl:
```cpp
if(useCompression && (compressionLevel < 1 || compressionLevel > 9))
{
  return MakePreflightErrorResult(...);
}
```

When useCompression == false, the level is ignored. This is consistent with the linked-parameter pattern — the disabled level field shouldn't block preflight. Tested explicitly in Compression_Preflight_RejectsOutOfRangeLevel. Verdict: correct.
Chunk shape "1 MiB along outermost dim"
The choice has three independent rationales that all point the same way:
- HDF5 stores in C-order, so the outermost dim is the slowest-varying — chunking along it creates contiguous-in-memory chunks.
- DataArray iteration is tuple-major; chunks aligned to tuple boundaries match the access pattern.
- 1 MiB is a sweet spot for gzip's LZ77 dictionary (32 KiB) — large enough to not be dictionary-bound, small enough that chunk-cache misses are cheap.
A user-configurable chunk shape was rejected (correctly, IMO) because the parameter surface for "compression knobs that almost no one will tune" is itself a maintenance cost. If a real-world dataset emerges where 1 MiB is wrong, this becomes a WriteOptions field. Verdict: defensible default with a clear escape hatch (extend WriteOptions later).
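The rule is small enough to state in code. A sketch under the rationale above; the function name, clamping details, and guard placement are mine, not the PR's exact implementation:

```cpp
// Sketch of the "~1 MiB along the outermost dim" rule; illustrative names.
#include <algorithm>
#include <cstddef>
#include <vector>
#include <hdf5.h>

std::vector<hsize_t> ComputeChunkShapeSketch(const std::vector<hsize_t>& dims, size_t elementByteSize)
{
  constexpr hsize_t k_TargetChunkBytes = 1024ULL * 1024ULL; // ~1 MiB target
  if(dims.empty())
  {
    return {};
  }

  // Bytes in one slab along the outermost (slowest-varying, C-order) dim.
  hsize_t bytesPerRow = elementByteSize;
  for(size_t i = 1; i < dims.size(); ++i)
  {
    bytesPerRow *= dims[i]; // the real code pre-clamps and checks overflow
  }

  // As many rows as fit in the target, clamped to [1, dims[0]].
  hsize_t rowsPerChunk = std::clamp<hsize_t>(k_TargetChunkBytes / std::max<hsize_t>(bytesPerRow, 1), 1, dims[0]);

  std::vector<hsize_t> chunk = dims; // inner dims kept whole
  chunk[0] = rowsPerChunk;
  return chunk;
}
```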
Error propagation strategy in BuildChunkedDeflateDcpl
The helper distinguishes three result classes:
- Errors (level out of range, HDF5 API failures) → Result<hid_t>::invalid(), propagated through writeSpan.
- Intentional fall-through (level 0, empty dims, sub-16 KiB) → Result<hid_t>::valid(H5P_DEFAULT).
- Success → Result<hid_t>::valid(dcpl), with the DCPL owned by the caller.
This is the right shape because the caller has to do different things in each case: (1) abort the write, (2) write contiguously without the helper failing the operation, (3) write chunked and close the DCPL. Encoding that as a sentinel-only return (H5P_DEFAULT for any non-error) would have lost the level-out-of-range diagnostic. Verdict: well-designed. The original review feedback pushed this from sentinel to Result<> and the change is materially better.
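A hedged sketch of the three-way handling on the writeSpan side, inferred from the description above (helper and parameter names are assumptions; Result<> and ConvertResult are simplnx's result types):

```cpp
// Sketch of writeSpan's handling of the DCPL builder result; inferred from
// the review text, not copied from DatasetIO.cpp.
#include <string>
#include <utility>
#include <vector>
#include <hdf5.h>
#include "simplnx/Common/Result.hpp" // assumed header for Result<>/ConvertResult

using namespace nx::core;

// Shaped like the helper sketched earlier in this review.
Result<hid_t> BuildChunkedDeflateDcpl(int32_t level, const std::vector<hsize_t>& dims, size_t elementByteSize);

Result<> WriteSpanSketch(hid_t parentId, const std::string& name, hid_t typeId, hid_t spaceId,
                         int32_t level, const std::vector<hsize_t>& dims, size_t elementByteSize)
{
  Result<hid_t> dcplResult = BuildChunkedDeflateDcpl(level, dims, elementByteSize);
  if(dcplResult.invalid())
  {
    return ConvertResult(std::move(dcplResult)); // (1) abort the write, keep the diagnostic
  }
  hid_t dcpl = dcplResult.value();
  hid_t dataset = H5Dcreate2(parentId, name.c_str(), typeId, spaceId, H5P_DEFAULT, dcpl, H5P_DEFAULT);
  if(dcpl != H5P_DEFAULT)
  {
    H5Pclose(dcpl); // (3) caller-owned DCPL released once the dataset exists
  }
  // (2) dcpl == H5P_DEFAULT wrote contiguously without failing the operation.
  // ... H5Dwrite / H5Dclose and error handling follow in the real code.
  return {};
}
```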
Summary
The design holds up to scrutiny on every axis I checked. The decisions are layered, the trade-offs are explicit, and each non-obvious choice (write-time state location, AbstractDataStore signature stability, IDataIO opt-in scope, FromSIMPLJson default) has a concrete rationale tied to the constraints of the simplnx ecosystem and its external plugins. The defaults are conservative (safe legacy behavior preserved) where the cost of breaking is high, and aggressive (compression on at 5) where the cost is low and the benefit is real.
The change to surface Result<hid_t> from the DCPL builder, in response to JDuffeyBQ's review, is a strict improvement and should be the model for migrating the older CreateH5DatasetChunkProperties path in a follow-up.
* BuildChunkedDeflateDcpl now returns errors for empty dims, zero elementByteSize, zero-valued dims, and total-byte overflow rather than silently falling through to H5P_DEFAULT. The intentional small-array bypass (< 16 KiB) still returns H5P_DEFAULT.
* ProbeHdf5Dataset now returns std::optional<DatasetProbeInfo>; removed the DatasetLayout::Unknown sentinel. All call sites in H5Test.cpp and DREAM3DFileTest.cpp updated to check has_value().
* Trimmed implementation-detail prose from the ProbeHdf5Dataset doc comment.

Signed-off-by: Michael Jackson <mike.jackson@bluequartz.net>
Adds gzip (deflate) compression for DataArray datasets in .dream3d files,
exposed through two new WriteDREAM3DFilter parameters:
- use_compression (bool, default true)
- compression_level (int, 1-9, default 5)
Architecture (sketched in code after this list):
- DataStructureWriter carries a WriteOptions struct (compression level
only, for now); WriteOptions flows through Dream3dIO::WriteFile ->
DataStructureWriter::WriteFile.
- DatasetIO carries m_CompressionLevel state; writeSpan consults it and
builds a chunked+deflate DCPL via CreateDatasetCreationPropertyList
when level is 1-9 and the total byte count clears a 16 KiB small-array
bypass.
- AbstractDataStore::writeHdf5 signature is intentionally unchanged so
the external SimplnxOoc plugin's override keeps compiling.
- DataArrayIO and NeighborListIO stamp the compression level on their
DatasetIO before writing. Other IDataIO subclasses (geometry IOs,
attribute matrix IO, scalar IO, string array IO, etc.) stay
contiguous, so metadata datasets do not participate in compression.
- Chunk shape targets ~1 MiB per chunk (pre-clamped, overflow-safe).
- FromSIMPLJson sets use_compression=false so converted legacy pipelines
preserve their on-disk encoding.
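A minimal sketch of the WriteOptions plumbing described in the list above; the struct shape and call chain follow the description, but the exact declarations are assumptions:

```cpp
// Sketch only -- matches the architecture notes above; the real declarations
// in DataStructureWriter.hpp may differ.
#include <cstdint>

struct WriteOptions
{
  int32_t compressionLevel = 0; // 0 = contiguous (legacy default); 1-9 = gzip
};

// Call chain, per the notes above:
//   WriteDREAM3DFilter -> DREAM3D::WriteFile(path, ds, WriteOptions{5})
//     -> DataStructureWriter::WriteFile (carries the WriteOptions)
//       -> DataArrayIO::writeData stamps options.compressionLevel on its DatasetIO
//         -> DatasetIO::writeSpan builds the chunked+deflate DCPL when level is 1-9
```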
File format:
- No version bump; stock HDF5 readers handle deflate transparently.
Tests (DREAM3DFileTest.cpp):
- Compression_Off_IsContiguous
- Compression_On_IsChunkedAndDeflated (layout + round-trip bit-identity)
- Compression_SmallArray_Bypasses
- Compression_LevelsRoundTrip (levels {1, 5, 9} yield monotonically
  non-increasing file sizes)
- Preflight range check on compression_level.
Docs:
- WriteDREAM3DFilter.md expanded to describe compression and XDMF.
- Doxygen on every new public / protected symbol introduced
(DatasetIO::CreateDatasetCreationPropertyList, the two WriteFile
overloads, WriteDREAM3DInputValues fields, and accessors).
Squashed from the following 15 commits on topic/enable_hdf5_compression:
>> DOC: Design spec for HDF5 gzip compression in .dream3d writer
Adds a brainstorm-validated design for compressing DataArray datasets
with gzip (deflate) on write. Two new parameters on WriteDREAM3DFilter
(use_compression, compression_level 1-9, default ON at level 5) plumbed
via a WriteOptions struct on DataStructureWriter. Chunking targets
~1 MiB per chunk; arrays under 16 KiB bypass compression. No file
version bump required -- stock HDF5 readers handle deflate transparently.
>> DOC: Update HDF5 compression spec -- keep AbstractDataStore::writeHdf5 ABI stable
Reading the actual code revealed AbstractDataStore::writeHdf5 is a pure
virtual overridden by the SimplnxOoc plugin's DataStore subclass (in a
separate repo). Changing its signature would silently break that build.
Switched the opt-in mechanism: DatasetIO carries m_CompressionLevel
state; DataArrayIO sets it before calling WriteDataStore; writeSpan
consults its own state. AbstractDataStore::writeHdf5 stays unchanged.
Also corrected the test-file reference from a nonexistent
WriteDREAM3DFilterTest.cpp to the existing DREAM3DFileTest.cpp.
>> DOC: Implementation plan for HDF5 gzip compression in .dream3d writer
Seven-task plan that lands the compression feature incrementally with
frequent green builds and commits:
1. WriteOptions on DataStructureWriter (plumbing only)
2. Forward WriteOptions through Dream3dIO::WriteFile
3. DatasetIO gets m_CompressionLevel + DCPL helper + writeSpan wiring
4. DataArrayIO opts in on behalf of DataArray datasets
5. WriteDREAM3DFilter exposes the two user-facing parameters
6. Spec Section 6 test suite (off / on / small-bypass / levels)
7. Full build + ctest verification
TDD ordering throughout; every task commits green.
>> ENH: Introduce WriteOptions on DataStructureWriter
Plumbing-only change. DataStructureWriter grows a WriteOptions struct
(currently just compressionLevel) plus getter/setter and WriteOptions
overloads on WriteFile/AppendFile. Existing overloads forward to the new
ones with a default (0, no compression), preserving current behavior.
>> ENH: Dream3dIO::WriteFile accepts optional WriteOptions
Adds new WriteFile overloads on both FileIO-based and path-based entry
points that forward WriteOptions through to DataStructureWriter. The
existing no-options overloads are preserved and delegate to the new ones
with a default-constructed WriteOptions, so every existing call site
continues to work unchanged.
Threads options through the file-scope WriteDataStructure helper
(overloaded) so the layered WriteFile(path) -> WriteFile(FileIO) ->
WriteDataStructure -> DataStructureWriter::WriteFile pipeline is
preserved without duplicating the version/pipeline/data-structure
sequence.
>> CLEANUP: Remove dead no-options WriteDataStructure helper in Dream3dIO
After threading WriteOptions through the WriteFile layering, the old
no-options WriteDataStructure(FileIO&, DS) file-scope helper has no
callers. Dropped. The remaining with-options helper is the only call
site for DataStructureWriter::WriteFile from this translation unit.
>> ENH: DatasetIO writes chunked+deflate when m_CompressionLevel > 0
DatasetIO learns a compression-level setting (default 0 = contiguous).
writeSpan now builds a DCPL via CreateDatasetCreationPropertyList, which
applies H5Pset_chunk + H5Pset_deflate targeting ~1 MiB chunks when the
level is 1-9 and the total byte count clears a 16 KiB small-array bypass.
Callers that don't opt in get today's contiguous behavior unchanged.
>> DOC: Align compression spec with the defensive-fallback implementation
Section 4.3 previously promised "no silent fallback to uncompressed".
After Task 3 landed, the implementation (intentionally, per the plan)
returns H5P_DEFAULT from the DCPL helper on any HDF5 setup error,
causing an uncompressed write rather than failing the whole file. In
practice the failure paths are unreachable on stock HDF5 -- gzip is
always available, H5Pcreate never fails, and our chunk math is
pre-clamped. Update the spec to describe the actual (defensive-backstop)
behavior and note that preflight validation is the real user-visible
guardrail.
>> ENH: DataArrayIO applies WriteOptions.compressionLevel before writing
DataArrayIO::writeData now stamps the compression level from the
DataStructureWriter's WriteOptions onto the DatasetIO before calling
WriteDataStore. Other IDataIO subclasses (geometry IOs, attribute matrix
IO, scalar IO, string array IO, etc.) are unchanged, so metadata datasets
stay contiguous while DataArray datasets participate in compression when
requested.
AbstractDataStore::writeHdf5's signature remains stable so the external
SimplnxOoc plugin continues to build without changes.
>> ENH: WriteDREAM3DFilter exposes HDF5 gzip compression controls
Two new parameters: use_compression (bool, default true) and
compression_level (int32, 1-9, default 5), linked so the level is only
editable when compression is enabled. Level is range-validated at
preflight. executeImpl forwards to WriteDREAM3D, which builds
DataStructureWriter::WriteOptions from the values and passes them
through DREAM3D::WriteFile. parametersVersion bumps to 2.
>> TEST: Exercise HDF5 gzip compression in WriteDREAM3DFilter
Four end-to-end tests covering the spec's required behaviors:
- Compression_Off_IsContiguous: use_compression=false writes contiguous.
- Compression_On_IsChunkedAndDeflated: use_compression=true writes a
chunked dataset with deflate at the requested level, and the file
round-trips byte-identically.
- Compression_SmallArray_Bypasses: small DataArrays stay contiguous
even with compression on.
- Compression_LevelsRoundTrip: levels {1, 5, 9} all round-trip, and
file sizes are monotonically non-increasing as level rises.
All four tests pass. Adds an InspectDatasetLayout helper to the
anonymous namespace that opens a .dream3d via raw HDF5 and reports
layout + deflate filter presence/level, keeping the per-test assertions
concise.
>> FIX: Address critical review findings on HDF5 compression branch
Three critical issues surfaced by review:
1. Compression_LevelsRoundTrip size-monotonicity check was dead code --
   DYNAMIC_SECTION restarts the TEST_CASE body per section, so
   sizesByLevel was always size 1 when the size==3 guard ran. Rewrite
   without DYNAMIC_SECTION so sizes accumulate across all three levels
   and the monotonicity REQUIREs actually fire (sketched after this
   list). Also adds the missing REQUIRE(fr.isValid()) before ReadFile.
2. FromSIMPLJson was silently upgrading SIMPL v6 pipelines to write
compressed output. Explicitly set use_compression=false in
FromSIMPLJson so converted pipelines preserve their original
uncompressed on-disk encoding. New NX pipelines still default to
compression on.
3. DatasetIO move constructor and move assignment did not propagate
m_CompressionLevel -- a latent footgun if a DatasetIO is ever moved
after setCompressionLevel was called. Fixed.
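Per item 1, a sketch of the rewritten monotonicity check's shape (Catch2; WriteCompressedTestFile is a hypothetical stand-in for the test's real setup code):

```cpp
// Accumulate-then-assert shape from item 1 above; the helper is hypothetical.
#include <catch2/catch_test_macros.hpp>
#include <cstdint>
#include <filesystem>
#include <vector>

namespace fs = std::filesystem;

fs::path WriteCompressedTestFile(int32_t level); // hypothetical helper

TEST_CASE("Compression_LevelsRoundTrip_Sketch")
{
  std::vector<std::uintmax_t> sizesByLevel;
  for(int32_t level : {1, 5, 9})
  {
    fs::path file = WriteCompressedTestFile(level);
    sizesByLevel.push_back(fs::file_size(file));
  }
  // All three sizes live in one scope, so these REQUIREs actually execute.
  REQUIRE(sizesByLevel.size() == 3);
  REQUIRE(sizesByLevel[1] <= sizesByLevel[0]);
  REQUIRE(sizesByLevel[2] <= sizesByLevel[1]);
}
```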
>> FIX: Address important + nitpick review findings on HDF5 compression
Important:
- I1: Test HDF5 handle leaks on REQUIRE failure. H5Test's three DatasetIO
tests now go through a local ProbeDataset() helper that handles its own
cleanup, so REQUIRE on the returned struct can't leak file/dataset/dcpl
handles. DREAM3DFileTest's Task 4 integration test routes through the
existing InspectDatasetLayout helper for the same reason.
- I2: Add REQUIRE(fileReader.isValid()) before DREAM3D::ReadFile in the
round-trip check.
- I4: Doxygen @param/@return on every new accessor (get/setWriteOptions
on DataStructureWriter, get/setCompressionLevel on DatasetIO).
- I5: H5Pget_filter2's cd_nelmts in/out parameter now matches the 8-slot
  cdValues buffer capacity instead of hardcoding 1. Harmless with gzip
  (1 cd_value) but correct API usage; see the sketch after this list.
- I6: CreateDatasetCreationPropertyList now has explicit overflow and
empty-dims guards; comment no longer claims overflow is handled when
it wasn't. Saturating multiply check falls through to contiguous on
any wrap or zero dim.
- I7: NeighborListIO's flattened data array now inherits the same
compression level as DataArray datasets -- neighbor lists can be very
large and benefit just as much.
- I8: New preflight test exercises the compression_level range check
(rejects 0 and 10 when use_compression=true; accepts 0 when
use_compression=false).
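For I5, a sketch of correct H5Pget_filter2 usage (the probe shape is illustrative, not the PR's helper; the API calls are stock HDF5):

```cpp
// cd_nelmts carries the buffer capacity in and the actual value count out.
#include <hdf5.h>
#include <cstdio>

void ProbeDeflateSketch(hid_t dcpl)
{
  unsigned int flags = 0;
  unsigned int cdValues[8] = {};
  size_t cdNelmts = 8; // in: capacity of cdValues; out: values written
  unsigned int filterConfig = 0;
  H5Z_filter_t filter = H5Pget_filter2(dcpl, 0, &flags, &cdNelmts, cdValues, 0, nullptr, &filterConfig);
  if(filter == H5Z_FILTER_DEFLATE && cdNelmts >= 1)
  {
    std::printf("deflate level %u\n", cdValues[0]); // gzip's single cd_value is the level
  }
}
```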
Nitpicks:
- Dead range check `rowsPerChunk < 1` on unsigned type -> `== 0`.
- Comment rot ("today's behavior") replaced with time-independent wording.
- Noexcept on the four new trivial accessors.
- Magic preflight error code -2 replaced with named
k_InvalidCompressionLevelError.
- Loop variable types int -> int32 in dataset probing helpers.
- Local constant kTuples renamed to k_Tuples (constexpr) to match
project k_-prefix convention for named constants.
- Removed redundant `nx::core::` qualifications where `using namespace`
was already in scope.
- 1ull * 1024ull * 1024ull -> 1024ull * 1024ull.
- Removed narrator-style WHAT comments in tests; the code explains
itself.
>> DOC: Expand WriteDREAM3DFilter documentation for compression and XDMF
Replace the one-sentence description with a full explanation of:
- What the .dream3d file contains (DataStructure + pipeline).
- The new HDF5 gzip compression parameters: what they do, why gzip was
chosen, what level 5 means relative to 1/9, and caveats around the
16 KiB small-array bypass, non-DataArray metadata staying contiguous,
the automatic 1 MiB chunk-shape target, round-trip losslessness, and
the unchanged file format.
- The FromSIMPLJson behavior (compression off by default for legacy
pipeline conversions).
- What the .xdmf sidecar is for: ParaView/VisIt interoperability,
metadata-only (points into the .dream3d), near-zero cost, recommended
when visualization outside DREAM3D-NX is a possibility.
>> DOC: Flesh out Doxygen on new HDF5 compression APIs
Expand documentation on every new public / protected symbol introduced
by the compression branch so each has @brief plus described @param and
@return tags where applicable:
- DatasetIO::CreateDatasetCreationPropertyList: full @param/@return with
the ownership contract and fallthrough conditions spelled out.
- Dream3dIO::WriteFile two new overloads: parameter descriptions filled
in (were previously bare tag names).
- WriteDREAM3DInputValues: added @struct docstring, plus inline ///
comments on the two new UseCompression / CompressionLevel fields to
call out the interaction (level is ignored when the switch is off).
Private members and file-local anonymous-namespace helpers remain
undocumented per existing house style.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds optional gzip (deflate) compression to HDF5 dataset writes in WriteDREAM3DFilter. Compression is on by default at level 5 for new pipelines, with a full 1–9 range. SIMPL-imported pipelines keep their original uncompressed encoding so their output is byte-for-byte what users expect.
On representative voxel/feature data, expect ~2–5× smaller .dream3d files with negligible read cost — gzip/deflate is built into every stock HDF5 install, so DREAM3D 6.x, h5py, HDFView, ParaView, MATLAB, etc. read the compressed files transparently.