ENH: Enable HDF5 dataset compression from the WriteDREAM3DFilter #1606
* Encapsulate the chunked+deflate DCPL builder as a file-local helper in DatasetIO.cpp returning Result<hid_t>; H5Pcreate / H5Pset_chunk / H5Pset_deflate failures now propagate through writeSpan instead of silently degrading to contiguous storage. Intentional fall-throughs (level 0, empty dims, sub-16 KiB arrays) still resolve to H5P_DEFAULT.
* Drop the public CreateDatasetCreationPropertyList declaration from DatasetIO.hpp; it had no callers outside DatasetIO.cpp.
* setCompressionLevel now ignores out-of-range input so the documented contract matches the implementation.
* Trim the misleading "honor the same compression level DataArrayIO uses" comment in NeighborListIO; there is one global compressionLevel.
* Move the duplicated HDF5 dataset probe helper from H5Test.cpp and DREAM3DFileTest.cpp into a shared simplnx::UnitTestCommon header (HDF5DatasetProbe.hpp/.cpp) exposing a DatasetLayout enum class so consumers no longer need to include <hdf5.h> for layout assertions.

Signed-off-by: Michael Jackson <mike.jackson@bluequartz.net>
Branch Review
Design Validation — HDF5 Compression Plumbing

A separate pass focused only on the design choices used to wire compression through the stack: where state lives, why the boundaries are drawn the way they are, and whether the trade-offs hold up under the constraints simplnx actually has. The four design choices that shape this PR are validated in turn below.

1) IDataIO opt-in scope
| IDataIO subclass | Why exempt |
|---|---|
| Geometry IO (image/rect/triangle/edge/vertex/quad/hex/tetra) | Geometry datasets are mostly small (vertices < 1M typical), and metadata datasets are tiny. |
| AttributeMatrixIO | Stores tuple shape only — a few usize values. |
| ScalarDataIO | Single scalar. |
| StringArrayIO | Variable-length strings; compression filters apply per-chunk and the win on string data is marginal. |
| NeighborListIO numNeighbors array | Same shape as a small DataArray; gets the writer-level compression too via the DataArray path. |
The 16 KiB small-array bypass inside BuildChunkedDeflateDcpl is the second line of defense: even if a small DataArray slips through (which it does), the helper falls through to contiguous storage. So the policy is "compress big array-shaped data; everything else stays contiguous" — implemented at two layers.
Why not compress everything? Three reasons:
- Chunk-index overhead dominates for small datasets (motivation for the 16 KiB bypass).
- File-open time grows with chunk count; metadata datasets are read on every open.
- gzip on variable-length string datasets doesn't help meaningfully: HDF5 stores the string payloads in a global heap, so per-chunk compression filters only touch the heap references, not the strings themselves.
Verdict: correct, and the choice is conservative — strict opt-in over opt-out. Easy to expand to additional IDataIO subclasses if benchmarks justify it later; harder to retract.
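To make the two-layer policy concrete, here is a minimal sketch of the inner guard, under simplifying assumptions: the real BuildChunkedDeflateDcpl in DatasetIO.cpp returns Result<hid_t> and reports HDF5 API failures as errors; the signature and constant names below are illustrative only.

```cpp
// Illustrative sketch of the fall-through logic only; the real file-local
// helper returns Result<hid_t> and propagates HDF5 API failures as errors.
#include <cstdint>
#include <vector>
#include <hdf5.h>

constexpr uint64_t k_SmallArrayBypassBytes = 16ULL * 1024ULL;

hid_t BuildChunkedDeflateDcplSketch(int32_t level, const std::vector<hsize_t>& dims, size_t elementByteSize)
{
  if(level == 0 || dims.empty())
  {
    return H5P_DEFAULT; // intentional fall-through: contiguous storage
  }
  uint64_t totalBytes = elementByteSize;
  for(hsize_t d : dims)
  {
    totalBytes *= d; // the real helper guards zero dims and overflow here
  }
  if(totalBytes < k_SmallArrayBypassBytes)
  {
    return H5P_DEFAULT; // second line of defense: sub-16 KiB arrays stay contiguous
  }
  hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
  // ... H5Pset_chunk / H5Pset_deflate follow; failures propagate as errors in the real code
  return dcpl;
}
```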
Cross-cutting design decisions
Compression-on-by-default at level 5
Defaults are policy. Three options were available:
- Off by default — safest, but new files would need explicit opt-in to benefit.
- On at level 1 — fast, smaller win.
- On at level 5 — chosen.
Level 5 is the gzip middle of the road: typically 60-80% of level-9's compression ratio at 30-40% of level-9's CPU cost. For voxel/feature data this lands at ~3× smaller files with negligible read-side decompression cost (deflate decode is very fast). The write-side CPU cost is bounded by the user's I/O subsystem; on many disks gzip-5 is faster end-to-end than uncompressed, because the bytes saved on the bus exceed the compression cost. Justified.
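A quick back-of-the-envelope check of that end-to-end claim, under an assumed serial compress-then-write model (real writers overlap the two stages, which only helps): for data size $S$, compression throughput $C$, disk throughput $D$, and compression ratio $r$,

$$\frac{S}{C} + \frac{S}{rD} < \frac{S}{D} \iff C > \frac{r}{r-1}\,D,$$

so with $r \approx 3$ the compressor only needs about $1.5\times$ the disk's throughput for compression to win outright. That holds easily on spinning disks and network filesystems; on fast NVMe it may not, which is why the claim above says "many" rather than "all".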
FromSIMPLJson sets UseCompression = false
This is the design decision that prevents a silent file-format compatibility surprise. SIMPL v6 wrote uncompressed files; if a SIMPL-imported pipeline silently started writing gzip-5 files, every downstream consumer that parses byte offsets, every script that hashes dream3d files for caching, and every reproducibility regression suite would see a one-time bit-for-bit difference. By forcing the default to false for SIMPL conversions, the imported pipeline produces the byte-identical output the user expected when they saved it. Verdict: this is exactly right. A params.linkParameters call connects the bool to the level field in the UI, so the user can manually enable compression if desired.
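For readers unfamiliar with the linked-parameter pattern, a hedged sketch of the wiring in WriteDREAM3DFilter::parameters() (key names, help strings, and header paths are assumptions, not copied from the PR; the linkParameters call is the point):

```cpp
// Hypothetical key names; sketch of the simplnx linked-parameter pattern.
#include <memory>

#include "simplnx/Filter/Parameters.hpp"
#include "simplnx/Parameters/BoolParameter.hpp"
#include "simplnx/Parameters/NumberParameter.hpp"

Parameters WriteDREAM3DFilter::parameters() const
{
  Parameters params;
  params.insertLinkableParameter(std::make_unique<BoolParameter>(
      k_UseCompression_Key, "Use Compression", "Write DataArray datasets chunked + gzip (deflate)", true));
  params.insert(std::make_unique<Int32Parameter>(
      k_CompressionLevel_Key, "Compression Level", "gzip level 1-9; ignored when compression is off", 5));
  // The level field is only editable while the bool is true.
  params.linkParameters(k_UseCompression_Key, k_CompressionLevel_Key, true);
  return params;
}
```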
Preflight rejection of out-of-range levels (when compression is on) but acceptance when off
Looking at WriteDREAM3DFilter::preflightImpl:
```cpp
if(useCompression && (compressionLevel < 1 || compressionLevel > 9))
{
  return MakePreflightErrorResult(...);
}
```

When useCompression == false, the level is ignored. This is consistent with the linked-parameter pattern — the disabled level field shouldn't block preflight. Tested explicitly in Compression_Preflight_RejectsOutOfRangeLevel. Verdict: correct.
Chunk shape "1 MiB along outermost dim"
The choice has three independent rationales that all point the same way:
- HDF5 stores in C-order, so the outermost dim is the slowest-varying — chunking along it creates contiguous-in-memory chunks.
- DataArray iteration is tuple-major; chunks aligned to tuple boundaries match the access pattern.
- 1 MiB is a sweet spot for gzip's LZ77 dictionary (32 KiB) — large enough to not be dictionary-bound, small enough that chunk-cache misses are cheap.
A user-configurable chunk shape was rejected (correctly, IMO) because the parameter surface for "compression knobs that almost no one will tune" is itself a maintenance cost. If a real-world dataset emerges where 1 MiB is wrong, this becomes a WriteOptions field. Verdict: defensible default with a clear escape hatch (extend WriteOptions later).
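The rule is small enough to state in code. A sketch under the rationale above; the function name, clamping details, and guard placement are mine, not the PR's exact implementation:

```cpp
// Sketch of the "~1 MiB along the outermost dim" rule; illustrative names.
#include <algorithm>
#include <cstddef>
#include <vector>
#include <hdf5.h>

std::vector<hsize_t> ComputeChunkShapeSketch(const std::vector<hsize_t>& dims, size_t elementByteSize)
{
  constexpr hsize_t k_TargetChunkBytes = 1024ULL * 1024ULL; // ~1 MiB target
  if(dims.empty())
  {
    return {};
  }

  // Bytes in one slab along the outermost (slowest-varying, C-order) dim.
  hsize_t bytesPerRow = elementByteSize;
  for(size_t i = 1; i < dims.size(); ++i)
  {
    bytesPerRow *= dims[i]; // the real code pre-clamps and checks overflow
  }

  // As many rows as fit in the target, clamped to [1, dims[0]].
  hsize_t rowsPerChunk = std::clamp<hsize_t>(k_TargetChunkBytes / std::max<hsize_t>(bytesPerRow, 1), 1, dims[0]);

  std::vector<hsize_t> chunk = dims; // inner dims kept whole
  chunk[0] = rowsPerChunk;
  return chunk;
}
```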
Error propagation strategy in BuildChunkedDeflateDcpl
The helper distinguishes three result classes:
- Errors (level out of range, HDF5 API failures) → Result<hid_t>::invalid(), propagated through writeSpan.
- Intentional fall-through (level 0, empty dims, sub-16 KiB) → Result<hid_t>::valid(H5P_DEFAULT).
- Success → Result<hid_t>::valid(dcpl), with the DCPL owned by the caller.
This is the right shape because the caller has to do different things in each case: (1) abort the write, (2) write contiguously without the helper failing the operation, (3) write chunked and close the DCPL. Encoding that as a sentinel-only return (H5P_DEFAULT for any non-error) would have lost the level-out-of-range diagnostic. Verdict: well-designed. The original review feedback pushed this from sentinel to Result<> and the change is materially better.
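A hedged sketch of the three-way handling on the writeSpan side, inferred from the description above (helper and parameter names are assumptions; Result<> and ConvertResult are simplnx's result types):

```cpp
// Sketch of writeSpan's handling of the DCPL builder result; inferred from
// the review text, not copied from DatasetIO.cpp.
#include <string>
#include <utility>
#include <vector>
#include <hdf5.h>
#include "simplnx/Common/Result.hpp" // assumed header for Result<>/ConvertResult

using namespace nx::core;

// Shaped like the helper sketched earlier in this review.
Result<hid_t> BuildChunkedDeflateDcpl(int32_t level, const std::vector<hsize_t>& dims, size_t elementByteSize);

Result<> WriteSpanSketch(hid_t parentId, const std::string& name, hid_t typeId, hid_t spaceId,
                         int32_t level, const std::vector<hsize_t>& dims, size_t elementByteSize)
{
  Result<hid_t> dcplResult = BuildChunkedDeflateDcpl(level, dims, elementByteSize);
  if(dcplResult.invalid())
  {
    return ConvertResult(std::move(dcplResult)); // (1) abort the write, keep the diagnostic
  }
  hid_t dcpl = dcplResult.value();
  hid_t dataset = H5Dcreate2(parentId, name.c_str(), typeId, spaceId, H5P_DEFAULT, dcpl, H5P_DEFAULT);
  if(dcpl != H5P_DEFAULT)
  {
    H5Pclose(dcpl); // (3) caller-owned DCPL released once the dataset exists
  }
  // (2) dcpl == H5P_DEFAULT wrote contiguously without failing the operation.
  // ... H5Dwrite / H5Dclose and error handling follow in the real code.
  return {};
}
```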
Summary
The design holds up to scrutiny on every axis I checked. The decisions are layered, the trade-offs are explicit, and each non-obvious choice (write-time state location, AbstractDataStore signature stability, IDataIO opt-in scope, FromSIMPLJson default) has a concrete rationale tied to the constraints of the simplnx ecosystem and its external plugins. The defaults are conservative (safe legacy behavior preserved) where the cost of breaking is high, and aggressive (compression on at 5) where the cost is low and the benefit is real.
The change to surface Result<hid_t> from the DCPL builder, in response to JDuffeyBQ's review, is a strict improvement and should be the model for migrating the older CreateH5DatasetChunkProperties path in a follow-up.
* BuildChunkedDeflateDcpl now returns errors for empty dims, zero elementByteSize, zero-valued dims, and total-byte overflow rather than silently falling through to H5P_DEFAULT. The intentional small-array bypass (< 16 KiB) still returns H5P_DEFAULT.
* ProbeHdf5Dataset now returns std::optional<DatasetProbeInfo>; removed the DatasetLayout::Unknown sentinel. All call sites in H5Test.cpp and DREAM3DFileTest.cpp updated to check has_value().
* Trimmed implementation-detail prose from the ProbeHdf5Dataset doc comment.

Signed-off-by: Michael Jackson <mike.jackson@bluequartz.net>
Adds gzip (deflate) compression for DataArray datasets in .dream3d files,
exposed through two new WriteDREAM3DFilter parameters:
- use_compression (bool, default true)
- compression_level (int, 1-9, default 5)
Architecture (sketched in code after this list):
- DataStructureWriter carries a WriteOptions struct (compression level
only, for now); WriteOptions flows through Dream3dIO::WriteFile ->
DataStructureWriter::WriteFile.
- DatasetIO carries m_CompressionLevel state; writeSpan consults it and
builds a chunked+deflate DCPL via CreateDatasetCreationPropertyList
when level is 1-9 and the total byte count clears a 16 KiB small-array
bypass.
- AbstractDataStore::writeHdf5 signature is intentionally unchanged so
the external SimplnxOoc plugin's override keeps compiling.
- DataArrayIO and NeighborListIO stamp the compression level on their
DatasetIO before writing. Other IDataIO subclasses (geometry IOs,
attribute matrix IO, scalar IO, string array IO, etc.) stay
contiguous, so metadata datasets do not participate in compression.
- Chunk shape targets ~1 MiB per chunk (pre-clamped, overflow-safe).
- FromSIMPLJson sets use_compression=false so converted legacy pipelines
preserve their on-disk encoding.
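A minimal sketch of the WriteOptions plumbing described in the list above; the struct shape and call chain follow the description, but the exact declarations are assumptions:

```cpp
// Sketch only -- matches the architecture notes above; the real declarations
// in DataStructureWriter.hpp may differ.
#include <cstdint>

struct WriteOptions
{
  int32_t compressionLevel = 0; // 0 = contiguous (legacy default); 1-9 = gzip
};

// Call chain, per the notes above:
//   WriteDREAM3DFilter -> DREAM3D::WriteFile(path, ds, WriteOptions{5})
//     -> DataStructureWriter::WriteFile (carries the WriteOptions)
//       -> DataArrayIO::writeData stamps options.compressionLevel on its DatasetIO
//         -> DatasetIO::writeSpan builds the chunked+deflate DCPL when level is 1-9
```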
File format:
- No version bump; stock HDF5 readers handle deflate transparently.
Tests (DREAM3DFileTest.cpp):
- Compression_Off_IsContiguous
- Compression_On_IsChunkedAndDeflated (layout + round-trip bit-identity)
- Compression_SmallArray_Bypasses
- Compression_LevelsRoundTrip (levels {1, 5, 9} yield monotonically
  non-increasing file sizes)
- Preflight range check on compression_level.
Docs:
- WriteDREAM3DFilter.md expanded to describe compression and XDMF.
- Doxygen on every new public / protected symbol introduced
(DatasetIO::CreateDatasetCreationPropertyList, the two WriteFile
overloads, WriteDREAM3DInputValues fields, and accessors).
Squashed from the following 15 commits on topic/enable_hdf5_compression:
>> DOC: Design spec for HDF5 gzip compression in .dream3d writer
Adds a brainstorm-validated design for compressing DataArray datasets
with gzip (deflate) on write. Two new parameters on WriteDREAM3DFilter
(use_compression, compression_level 1-9, default ON at level 5) plumbed
via a WriteOptions struct on DataStructureWriter. Chunking targets
~1 MiB per chunk; arrays under 16 KiB bypass compression. No file
version bump required -- stock HDF5 readers handle deflate transparently.
>> DOC: Update HDF5 compression spec -- keep AbstractDataStore::writeHdf5 ABI stable
Reading the actual code revealed AbstractDataStore::writeHdf5 is a pure
virtual overridden by the SimplnxOoc plugin's DataStore subclass (in a
separate repo). Changing its signature would silently break that build.
Switched the opt-in mechanism: DatasetIO carries m_CompressionLevel
state; DataArrayIO sets it before calling WriteDataStore; writeSpan
consults its own state. AbstractDataStore::writeHdf5 stays unchanged.
Also corrected the test-file reference from a nonexistent
WriteDREAM3DFilterTest.cpp to the existing DREAM3DFileTest.cpp.
>> DOC: Implementation plan for HDF5 gzip compression in .dream3d writer
Seven-task plan that lands the compression feature incrementally with
frequent green builds and commits:
1. WriteOptions on DataStructureWriter (plumbing only)
2. Forward WriteOptions through Dream3dIO::WriteFile
3. DatasetIO gets m_CompressionLevel + DCPL helper + writeSpan wiring
4. DataArrayIO opts in on behalf of DataArray datasets
5. WriteDREAM3DFilter exposes the two user-facing parameters
6. Spec Section 6 test suite (off / on / small-bypass / levels)
7. Full build + ctest verification
TDD ordering throughout; every task commits green.
>> ENH: Introduce WriteOptions on DataStructureWriter
Plumbing-only change. DataStructureWriter grows a WriteOptions struct
(currently just compressionLevel) plus getter/setter and WriteOptions
overloads on WriteFile/AppendFile. Existing overloads forward to the new
ones with a default (0, no compression), preserving current behavior.
>> ENH: Dream3dIO::WriteFile accepts optional WriteOptions
Adds new WriteFile overloads on both FileIO-based and path-based entry
points that forward WriteOptions through to DataStructureWriter. The
existing no-options overloads are preserved and delegate to the new ones
with a default-constructed WriteOptions, so every existing call site
continues to work unchanged.
Threads options through the file-scope WriteDataStructure helper
(overloaded) so the layered WriteFile(path) -> WriteFile(FileIO) ->
WriteDataStructure -> DataStructureWriter::WriteFile pipeline is
preserved without duplicating the version/pipeline/data-structure
sequence.
>> CLEANUP: Remove dead no-options WriteDataStructure helper in Dream3dIO
After threading WriteOptions through the WriteFile layering, the old
no-options WriteDataStructure(FileIO&, DS) file-scope helper has no
callers. Dropped. The remaining with-options helper is the only call
site for DataStructureWriter::WriteFile from this translation unit.
>> ENH: DatasetIO writes chunked+deflate when m_CompressionLevel > 0
DatasetIO learns a compression-level setting (default 0 = contiguous).
writeSpan now builds a DCPL via CreateDatasetCreationPropertyList, which
applies H5Pset_chunk + H5Pset_deflate targeting ~1 MiB chunks when the
level is 1-9 and the total byte count clears a 16 KiB small-array bypass.
Callers that don't opt in get today's contiguous behavior unchanged.
>> DOC: Align compression spec with the defensive-fallback implementation
Section 4.3 previously promised "no silent fallback to uncompressed".
After Task 3 landed, the implementation (intentionally, per the plan)
returns H5P_DEFAULT from the DCPL helper on any HDF5 setup error,
causing an uncompressed write rather than failing the whole file. In
practice the failure paths are unreachable on stock HDF5 -- gzip is
always available, H5Pcreate never fails, and our chunk math is
pre-clamped. Update the spec to describe the actual (defensive-backstop)
behavior and note that preflight validation is the real user-visible
guardrail.
>> ENH: DataArrayIO applies WriteOptions.compressionLevel before writing
DataArrayIO::writeData now stamps the compression level from the
DataStructureWriter's WriteOptions onto the DatasetIO before calling
WriteDataStore. Other IDataIO subclasses (geometry IOs, attribute matrix
IO, scalar IO, string array IO, etc.) are unchanged, so metadata datasets
stay contiguous while DataArray datasets participate in compression when
requested.
AbstractDataStore::writeHdf5's signature remains stable so the external
SimplnxOoc plugin continues to build without changes.
>> ENH: WriteDREAM3DFilter exposes HDF5 gzip compression controls
Two new parameters: use_compression (bool, default true) and
compression_level (int32, 1-9, default 5), linked so the level is only
editable when compression is enabled. Level is range-validated at
preflight. executeImpl forwards to WriteDREAM3D, which builds
DataStructureWriter::WriteOptions from the values and passes them
through DREAM3D::WriteFile. parametersVersion bumps to 2.
>> TEST: Exercise HDF5 gzip compression in WriteDREAM3DFilter
Four end-to-end tests covering the spec's required behaviors:
- Compression_Off_IsContiguous: use_compression=false writes contiguous.
- Compression_On_IsChunkedAndDeflated: use_compression=true writes a
chunked dataset with deflate at the requested level, and the file
round-trips byte-identically.
- Compression_SmallArray_Bypasses: small DataArrays stay contiguous
even with compression on.
- Compression_LevelsRoundTrip: levels {1, 5, 9} all round-trip, and
file sizes are monotonically non-increasing as level rises.
All four tests pass. Adds an InspectDatasetLayout helper to the
anonymous namespace that opens a .dream3d via raw HDF5 and reports
layout + deflate filter presence/level, keeping the per-test assertions
concise.
>> FIX: Address critical review findings on HDF5 compression branch
Three critical issues surfaced by review:
1. Compression_LevelsRoundTrip size-monotonicity check was dead code --
   DYNAMIC_SECTION restarts the TEST_CASE body per section, so
   sizesByLevel was always size 1 when the size==3 guard ran. Rewrite
   without DYNAMIC_SECTION so sizes accumulate across all three levels
   and the monotonicity REQUIREs actually fire (sketched after this
   list). Also adds the missing REQUIRE(fr.isValid()) before ReadFile.
2. FromSIMPLJson was silently upgrading SIMPL v6 pipelines to write
compressed output. Explicitly set use_compression=false in
FromSIMPLJson so converted pipelines preserve their original
uncompressed on-disk encoding. New NX pipelines still default to
compression on.
3. DatasetIO move constructor and move assignment did not propagate
m_CompressionLevel -- a latent footgun if a DatasetIO is ever moved
after setCompressionLevel was called. Fixed.
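Per item 1, a sketch of the rewritten monotonicity check's shape (Catch2; WriteCompressedTestFile is a hypothetical stand-in for the test's real setup code):

```cpp
// Accumulate-then-assert shape from item 1 above; the helper is hypothetical.
#include <catch2/catch_test_macros.hpp>
#include <cstdint>
#include <filesystem>
#include <vector>

namespace fs = std::filesystem;

fs::path WriteCompressedTestFile(int32_t level); // hypothetical helper

TEST_CASE("Compression_LevelsRoundTrip_Sketch")
{
  std::vector<std::uintmax_t> sizesByLevel;
  for(int32_t level : {1, 5, 9})
  {
    fs::path file = WriteCompressedTestFile(level);
    sizesByLevel.push_back(fs::file_size(file));
  }
  // All three sizes live in one scope, so these REQUIREs actually execute.
  REQUIRE(sizesByLevel.size() == 3);
  REQUIRE(sizesByLevel[1] <= sizesByLevel[0]);
  REQUIRE(sizesByLevel[2] <= sizesByLevel[1]);
}
```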
>> FIX: Address important + nitpick review findings on HDF5 compression
Important:
- I1: Test HDF5 handle leaks on REQUIRE failure. H5Test's three DatasetIO
tests now go through a local ProbeDataset() helper that handles its own
cleanup, so REQUIRE on the returned struct can't leak file/dataset/dcpl
handles. DREAM3DFileTest's Task 4 integration test routes through the
existing InspectDatasetLayout helper for the same reason.
- I2: Add REQUIRE(fileReader.isValid()) before DREAM3D::ReadFile in the
round-trip check.
- I4: Doxygen @param/@return on every new accessor (get/setWriteOptions
on DataStructureWriter, get/setCompressionLevel on DatasetIO).
- I5: H5Pget_filter2's cd_nelmts in/out parameter now matches the 8-slot
  cdValues buffer capacity instead of hardcoding 1. Harmless with gzip
  (1 cd_value) but correct API usage; see the sketch after this list.
- I6: CreateDatasetCreationPropertyList now has explicit overflow and
empty-dims guards; comment no longer claims overflow is handled when
it wasn't. Saturating multiply check falls through to contiguous on
any wrap or zero dim.
- I7: NeighborListIO's flattened data array now inherits the same
compression level as DataArray datasets -- neighbor lists can be very
large and benefit just as much.
- I8: New preflight test exercises the compression_level range check
(rejects 0 and 10 when use_compression=true; accepts 0 when
use_compression=false).
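For I5, a sketch of correct H5Pget_filter2 usage (the probe shape is illustrative, not the PR's helper; the API calls are stock HDF5):

```cpp
// cd_nelmts carries the buffer capacity in and the actual value count out.
#include <hdf5.h>
#include <cstdio>

void ProbeDeflateSketch(hid_t dcpl)
{
  unsigned int flags = 0;
  unsigned int cdValues[8] = {};
  size_t cdNelmts = 8; // in: capacity of cdValues; out: values written
  unsigned int filterConfig = 0;
  H5Z_filter_t filter = H5Pget_filter2(dcpl, 0, &flags, &cdNelmts, cdValues, 0, nullptr, &filterConfig);
  if(filter == H5Z_FILTER_DEFLATE && cdNelmts >= 1)
  {
    std::printf("deflate level %u\n", cdValues[0]); // gzip's single cd_value is the level
  }
}
```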
Nitpicks:
- Dead range check `rowsPerChunk < 1` on unsigned type -> `== 0`.
- Comment rot ("today's behavior") replaced with time-independent wording.
- Noexcept on the four new trivial accessors.
- Magic preflight error code -2 replaced with named
k_InvalidCompressionLevelError.
- Loop variable types int -> int32 in dataset probing helpers.
- Local constant kTuples renamed to k_Tuples (constexpr) to match
project k_-prefix convention for named constants.
- Removed redundant `nx::core::` qualifications where `using namespace`
was already in scope.
- 1ull * 1024ull * 1024ull -> 1024ull * 1024ull.
- Removed narrator-style WHAT comments in tests; the code explains
itself.
>> DOC: Expand WriteDREAM3DFilter documentation for compression and XDMF
Replace the one-sentence description with a full explanation of:
- What the .dream3d file contains (DataStructure + pipeline).
- The new HDF5 gzip compression parameters: what they do, why gzip was
chosen, what level 5 means relative to 1/9, and caveats around the
16 KiB small-array bypass, non-DataArray metadata staying contiguous,
the automatic 1 MiB chunk-shape target, round-trip losslessness, and
the unchanged file format.
- The FromSIMPLJson behavior (compression off by default for legacy
pipeline conversions).
- What the .xdmf sidecar is for: ParaView/VisIt interoperability,
metadata-only (points into the .dream3d), near-zero cost, recommended
when visualization outside DREAM3D-NX is a possibility.
>> DOC: Flesh out Doxygen on new HDF5 compression APIs
Expand documentation on every new public / protected symbol introduced
by the compression branch so each has @brief plus described @param and
@return tags where applicable:
- DatasetIO::CreateDatasetCreationPropertyList: full @param/@return with
the ownership contract and fallthrough conditions spelled out.
- Dream3dIO::WriteFile two new overloads: parameter descriptions filled
in (were previously bare tag names).
- WriteDREAM3DInputValues: added @struct docstring, plus inline ///
comments on the two new UseCompression / CompressionLevel fields to
call out the interaction (level is ignored when the switch is off).
Private members and file-local anonymous-namespace helpers remain
undocumented per existing house style.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds optional gzip (deflate) compression to HDF5 dataset writes in WriteDREAM3DFilter. Compression is on by default at level 5 for new pipelines, with a full 1–9 range. SIMPL-imported pipelines keep their original uncompressed encoding so their output is byte-for-byte what users expect.
On representative voxel/feature data, expect ~2–5× smaller .dream3d files with negligible read cost — gzip/deflate is built into every stock HDF5 install, so DREAM3D 6.x, h5py, HDFView, ParaView, MATLAB, etc. read the compressed files transparently.