
ENH: Enable HDF5 dataset compression from the WriteDREAM3DFilter #1606

Merged

imikejackson merged 3 commits into BlueQuartzSoftware:develop from imikejackson:topic/enable_hdf5_compression on May 1, 2026

Conversation

@imikejackson
Contributor

Adds optional gzip (deflate) compression to HDF5 dataset writes in WriteDREAM3DFilter. Compression is on by default at level 5 for new pipelines, with a full 1–9 range. SIMPL-imported pipelines keep their original uncompressed encoding so their output is byte-for-byte what users expect.

On representative voxel/feature data, expect ~2–5× smaller .dream3d files with negligible read cost — gzip/deflate is built into every stock HDF5 install, so DREAM3D 6.x, h5py, HDFView, ParaView, MATLAB, etc. read the compressed files transparently.

@imikejackson imikejackson force-pushed the topic/enable_hdf5_compression branch from e158835 to ae850e8 on April 24, 2026 12:41
@imikejackson imikejackson requested review from JDuffeyBQ and removed request for joeykleingers April 24, 2026 15:48
@imikejackson imikejackson force-pushed the topic/enable_hdf5_compression branch 2 times, most recently from 0bec7dc to 7a6732e on April 24, 2026 15:56
Comment thread src/simplnx/DataStructure/IO/HDF5/NeighborListIO.hpp Outdated
Comment thread src/simplnx/Utilities/Parsing/HDF5/IO/DatasetIO.hpp Outdated
Comment thread src/simplnx/Utilities/Parsing/HDF5/IO/DatasetIO.hpp Outdated
Comment thread src/simplnx/Utilities/Parsing/HDF5/IO/DatasetIO.cpp Outdated
Comment thread src/Plugins/SimplnxCore/test/DREAM3DFileTest.cpp Outdated
imikejackson added a commit to imikejackson/simplnx that referenced this pull request Apr 28, 2026
* Encapsulate the chunked+deflate DCPL builder as a file-local helper in
  DatasetIO.cpp returning Result<hid_t>; H5Pcreate / H5Pset_chunk /
  H5Pset_deflate failures now propagate through writeSpan instead of
  silently degrading to contiguous storage. Intentional fall-throughs
  (level 0, empty dims, sub-16 KiB arrays) still resolve to H5P_DEFAULT.
* Drop the public CreateDatasetCreationPropertyList declaration from
  DatasetIO.hpp; it had no callers outside DatasetIO.cpp.
* setCompressionLevel now ignores out-of-range input so the documented
  contract matches the implementation.
* Trim the misleading "honor the same compression level DataArrayIO uses"
  comment in NeighborListIO; there is one global compressionLevel.
* Move the duplicated HDF5 dataset probe helper from H5Test.cpp and
  DREAM3DFileTest.cpp into a shared simplnx::UnitTestCommon header
  (HDF5DatasetProbe.hpp/.cpp) exposing a DatasetLayout enum class so
  consumers no longer need to include <hdf5.h> for layout assertions.

Signed-off-by: Michael Jackson <mike.jackson@bluequartz.net>
@imikejackson imikejackson requested a review from JDuffeyBQ April 28, 2026 13:55
@imikejackson
Contributor Author

Branch Review — topic/enable_hdf5_compression

A self-review pass across five axes of the changes in this PR (commits 7a6732e2 and 3bc4b891). Findings are scoped to code introduced or materially altered by this branch.

1) C++ Best Practices

Solid:

  • Result<>-based error propagation throughout; no exceptions thrown across HDF5 boundaries.
  • setCompressionLevel and getCompressionLevel correctly marked noexcept; BuildChunkedDeflateDcpl uses Result<hid_t> instead of sentinel-encoded errors.
  • int32/usize aliases used consistently; no raw int/size_t mixing in new code.
  • DCPL ownership contract is documented inline at BuildChunkedDeflateDcpl and tested by the helper closing on every error branch (DatasetIO.cpp:140, 146).
  • New WriteDREAM3DInputValues follows the existing input-bag idiom exactly.
  • Includes are minimal — HDF5DatasetProbe.hpp deliberately keeps <hdf5.h> out of the public header by introducing the DatasetLayout enum class.

Issues / nits:

  • Unscoped int in probe code. HDF5DatasetProbe.cpp:51-60 mixes int nFilters, unsigned int flags, unsigned int filterConfig. These mirror HDF5's C ABI so they're justified; the loop counter is declared as int in the for head and then static_cast<unsigned int>(i) for the API call — fine, just noisy.
  • int32 m_CompressionLevel = 0 is a magic-default. A named constant (k_DefaultCompressionLevel = 0) inside the anonymous namespace would make it self-documenting at the point of declaration. Minor.
  • writeSpan<bool> doesn't propagate compression. It delegates to writeSpan<H5_BOOL_TYPE>, which inherits m_CompressionLevel from *this — that works because the call is on the same DatasetIO instance, but it's worth a one-line comment so the reader doesn't need to trace it.
  • Two distinct compression-level domains. WriteOptions::compressionLevel is int32 with the convention 0 = off and the filter parameter is int32 in [1, 9] plus a separate bool UseCompression. The mapping between them lives only in WriteDREAM3D.cpp:60. This mapping is correct but easy to misuse from another caller; consider documenting on WriteOptions::compressionLevel itself.
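
A minimal sketch of the mapping that last point asks to document, using the names from this review (UseCompression, CompressionLevel, WriteOptions::compressionLevel); the helper itself is illustrative, not the actual WriteDREAM3D.cpp code:

// Hedged sketch: the filter-level (UseCompression, CompressionLevel in [1, 9]) pair
// collapses into the single WriteOptions::compressionLevel domain, where 0 means "off".
#include <cstdint>

struct WriteOptions
{
  int32_t compressionLevel = 0; // 0 = uncompressed, 1-9 = gzip/deflate level
};

inline WriteOptions ToWriteOptions(bool useCompression, int32_t compressionLevel)
{
  WriteOptions options;
  options.compressionLevel = useCompression ? compressionLevel : 0;
  return options;
}

Documenting that convention on WriteOptions::compressionLevel itself would let another caller reproduce the same collapse without reading WriteDREAM3D.cpp.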

2) Object-Oriented Design

Solid:

  • The compression knob is layered cleanly: filter → algorithm input bag → WriteOptionsDataStructureWriter → per-DatasetIO m_CompressionLevel. Each layer owns one concern.
  • IDataIO polymorphism is preserved — only DataArrayIO::writeData and NeighborListIO opt in by stamping the writer's level on their DatasetIO (DataArrayIO.hpp:236, NeighborListIO.hpp:239). Geometry / AttributeMatrix / String IOs deliberately don't.
  • AbstractDataStore::writeHdf5 signature is unchanged, which the commit message correctly highlights as protecting the external SimplnxOoc plugin's override.
  • New WriteOptions is a value type, default-constructible, default arguments used. Adding fields later is non-breaking.

Issues:

  • WriteOptions is nested inside DataStructureWriter, but it's now serialization-policy state shared up to DREAM3D::WriteFile and down into DatasetIO. Its conceptual home is closer to a top-level nx::core::HDF5::WriteOptions. Today every consumer must spell nx::core::HDF5::DataStructureWriter::WriteOptions. Promoting it would be a one-line cleanup but is technically API-breaking.
  • DatasetIO::m_CompressionLevel is per-instance state that must be set before each writeSpan. This works but couples temporal ordering: callers must remember to call setCompressionLevel before writeSpan, and forgetting silently produces uncompressed output. Passing the level as a parameter to writeSpan (or a WriteSpanOptions overload) would be harder to misuse, but at the cost of churning every existing call site. The current design is a reasonable tradeoff but worth flagging.
  • getWriteOptions() returns const WriteOptions& but the writer mutates state (m_IdMap, etc.) during write. The returned reference is stable for the writer's lifetime, which is what callers actually need; noexcept on the getter is appropriate.

3) Defensive Programming

Solid:

  • Out-of-range levels rejected at three layers: filter preflight (WriteDREAM3DFilter.cpp:93-96), setCompressionLevel ignores invalid input (DatasetIO.cpp:222-228), and BuildChunkedDeflateDcpl returns a Result<hid_t> error (DatasetIO.cpp:74-77). Belt-and-suspenders, intentional.
  • BuildChunkedDeflateDcpl defends against:
    • Empty dims,
    • Zero-byte element / zero dims,
    • Saturating-multiplication overflow when computing total bytes (DatasetIO.cpp:96; see the sketch after this list),
    • HDF5 API failures at each step, with H5Pclose on the partial DCPL before returning the error.
  • H5P_DEFAULT ownership contract is enforced at the only caller (DatasetIO.cpp:1063: if(dcpl != H5P_DEFAULT) H5Pclose(dcpl)).
  • Tests cover the contract: small-array bypass, level-0 contiguous path, round-trip identity, monotonic compression sizes, preflight rejection of out-of-range values.
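
To make the overflow guard concrete, here is a minimal sketch of an overflow-safe total-byte computation of the kind described above; the names (elementByteSize, dims) follow this review, not the actual DatasetIO.cpp code:

#include <cstdint>
#include <limits>
#include <vector>

// Returns 0 on any zero dimension or multiplication wrap; the caller treats 0 as
// "do not attempt chunked+deflate storage" per the defensive behavior described above.
inline uint64_t SaturatingTotalBytes(uint64_t elementByteSize, const std::vector<uint64_t>& dims)
{
  if(elementByteSize == 0 || dims.empty())
  {
    return 0;
  }
  uint64_t total = elementByteSize;
  for(uint64_t d : dims)
  {
    if(d == 0 || total > std::numeric_limits<uint64_t>::max() / d)
    {
      return 0; // zero-valued dim or the multiply would overflow
    }
    total *= d;
  }
  return total;
}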

Issues:

  • DatasetIO::CreateH5DatasetChunkProperties (the legacy non-compression helper) silently swallows H5Pset_chunk failure (DatasetIO.cpp:1138-1144) by returning H5P_DEFAULT. This predates the PR, but the PR's review effort highlights the contrast with the new helper, which now propagates. Worth a follow-up to migrate this one too — currently a chunked-dataset write would silently degrade to contiguous on partial HDF5 failure.
  • AppendFile accepts WriteOptions; at first glance the per-dataset writer state doesn't appear to reach the append path's DatasetIO::setCompressionLevel calls. Looking at DataStructureWriter::AppendFile, setWriteOptions(options) is called on the dataStructureWriter that's about to do the writeDataObject — so this works because DataArrayIO::writeData reads dataStructureWriter.getWriteOptions().compressionLevel. ✓ verified.
  • writeSpan<bool> recurses into the typed overload — the try/catch block catches std::exception from value construction, but the recursive writeSpan itself returns Result<> rather than throwing. The try is wider than the actual throwing surface. Pre-existing.

4) Memory Allocation

Solid:

  • BuildChunkedDeflateDcpl allocates one std::vector<hsize_t> for chunk dims; capacity = rank (≤ 4 in practice). Negligible.
  • HDF5DatasetProbe.cpp uses std::array for the filter-name and CD-values buffers — stack-allocated, no heap traffic (see the probe sketch after this list).
  • WriteOptions is a 4-byte struct (today) passed by const reference downstream.
  • No new heap allocations on the per-array hot path; the m_CompressionLevel field is a single int32 per DatasetIO.
  • flattenedData in NeighborListIO is unchanged from prior code — compression doesn't add a copy.
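
For reference, the probe helper boils down to standard HDF5 property-list queries over stack buffers; a hedged sketch, where the struct and function names are illustrative rather than the actual simplnx::UnitTest declarations:

#include <hdf5.h>
#include <array>
#include <optional>

struct ProbeResult
{
  H5D_layout_t layout = H5D_CONTIGUOUS;
  bool hasDeflate = false;
  unsigned int deflateLevel = 0;
};

inline std::optional<ProbeResult> ProbeDatasetLayout(hid_t fileId, const char* path)
{
  hid_t dataset = H5Dopen2(fileId, path, H5P_DEFAULT);
  if(dataset < 0)
  {
    return std::nullopt;
  }
  hid_t dcpl = H5Dget_create_plist(dataset);
  ProbeResult result;
  result.layout = H5Pget_layout(dcpl);

  std::array<char, 64> filterName = {};       // stack-allocated name buffer
  std::array<unsigned int, 8> cdValues = {};  // stack-allocated cd_values buffer
  int nFilters = H5Pget_nfilters(dcpl);
  for(int i = 0; i < nFilters; i++)
  {
    unsigned int flags = 0;
    unsigned int filterConfig = 0;
    size_t cdCount = cdValues.size();
    H5Z_filter_t filter = H5Pget_filter2(dcpl, static_cast<unsigned int>(i), &flags, &cdCount,
                                         cdValues.data(), filterName.size(), filterName.data(), &filterConfig);
    if(filter == H5Z_FILTER_DEFLATE)
    {
      result.hasDeflate = true;
      result.deflateLevel = cdValues[0]; // deflate stores its level in cd_values[0]
    }
  }
  H5Pclose(dcpl);
  H5Dclose(dataset);
  return result;
}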

Issues:

  • HDF5 deflate filter does its own per-chunk buffering (the filter pipeline allocates a chunk-sized scratch buffer per write). At 1 MiB per chunk this is bounded and not a concern, but worth noting for very large arrays where total in-flight HDF5 buffer = ~1 MiB × concurrent writes. With single-threaded writeSpan, the worst case is 1 MiB extra resident.
  • Round-trip read after compression decompresses into a full AbstractDataStore — pre-existing, but compressed files are now slightly slower to import than the equivalent uncompressed file. This is the deflate trade-off, mentioned in the docs.

5) CPU Performance

Solid:

  • Compression off is the zero-overhead path: m_CompressionLevel == 0 short-circuits in BuildChunkedDeflateDcpl and returns H5P_DEFAULT before any HDF5 calls (DatasetIO.cpp:78-81). No regression for users who turn it off.
  • 16 KiB small-array bypass keeps tiny datasets out of the chunk pipeline. Unit-tested at WriteDREAM3DFilter: Compression_SmallArray_Bypasses and the DatasetIO-level small-array test in H5Test.cpp:1154.
  • Chunk shape targets ~1 MiB along the outermost dim — gives long contiguous spans to deflate and matches DataArray's tuple-major iteration order.
  • Default level 5 is a justified middle ground in the docs; level 9's CPU cost is real, level 1 is fast but compresses ~30% less on representative voxel data (per the round-trip test).
  • HDF5 writes one chunk at a time — no global serialization, no thread contention introduced.

Issues:

  • Chunk-size heuristic ignores rank > 1. Inner dimensions are kept at full extent; only dims[0] is shrunk to fit ~1 MiB. For arrays with very wide component dims (e.g. an EBSD int32[N, 9] orientation array), rowBytes = 36 so rowsPerChunk = 1 MiB / 36 ≈ 29k rows — fine. But for pathological cases (a 1-tuple, billion-component array, which doesn't exist today), the heuristic would clamp to 1 row and produce a giant chunk. Not a practical concern; document the heuristic's domain or add a MIN(chunkBytes, k_MaxChunkBytes) ceiling if you want belt-and-suspenders.
  • No H5Pset_chunk_cache tuning. Default HDF5 chunk cache is ~1 MiB, which is fine for a single 1-MiB chunk but pathological for partial-chunk reads of very large arrays. Out of scope here; flagged for future when a read-side performance pass happens.
  • H5Pset_shuffle is not used. The shuffle filter often improves deflate ratio by 10–30% on numeric arrays at near-zero CPU cost. Worth experimenting in a follow-up (see the sketch after this list) — the file format remains compatible because HDF5 records the filter chain per dataset.
  • No SZIP / zstd / LZ4 escape hatch. gzip is the right default for portability (the docs make the case well), but a user with a known reader stack could get 2–3× the speed at similar ratio with zstd. Future work.
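
The shuffle experiment mentioned above would be a small addition to the DCPL setup; a hedged sketch, not the current simplnx code path:

#include <hdf5.h>
#include <vector>

// Filter order matters: shuffle reorders bytes so deflate sees longer runs of
// similar bytes, then deflate compresses the shuffled stream.
inline hid_t MakeShuffleDeflateDcpl(const std::vector<hsize_t>& chunkDims, unsigned int level)
{
  hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
  if(dcpl < 0)
  {
    return H5I_INVALID_HID;
  }
  if(H5Pset_chunk(dcpl, static_cast<int>(chunkDims.size()), chunkDims.data()) < 0 ||
     H5Pset_shuffle(dcpl) < 0 ||
     H5Pset_deflate(dcpl, level) < 0)
  {
    H5Pclose(dcpl);
    return H5I_INVALID_HID;
  }
  return dcpl;
}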

6) Out-of-scope: docs/superpowers/ deletions

7a6732e2 deletes two docs/superpowers/plans/ and docs/superpowers/specs/ files that exist on develop. Confirmed legitimate cleanup — develop accidentally received these via #1567, and Joey has a parallel cleanup commit (6fbfc8d8 on joey/ooc-filter-optimizations) that does the same removal. Either:

  • Land Joey's cleanup commit on develop first and then drop these deletions from this PR, or
  • Leave them as-is and call out the side effect in the merge commit / PR body.

Recommend the second: the deletions are correct and the parallel commit on Joey's branch is unmerged. Just make the side effect explicit so reviewers don't trip over it.

Summary

The PR is well-scoped and the design is layered cleanly with sensible defaults. Multiple defensive guards prevent silent corruption (overflow, partial HDF5 failure, misconfigured levels). Tests cover the contract end-to-end including round-trip equivalence and monotonic compression-by-level. No correctness blockers. The improvements suggested above are all polish or follow-up work, not gating issues.

@imikejackson
Contributor Author

Design Validation — HDF5 Compression Plumbing

A separate pass focused only on the design choices used to wire compression through the stack: where state lives, why the boundaries are drawn the way they are, and whether the trade-offs hold up under the constraints simplnx actually has.

The four design choices that shape this PR

  1. WriteOptions struct on DataStructureWriter rather than a plain int compressionLevel parameter.
  2. DatasetIO::m_CompressionLevel as per-instance state rather than a writeSpan(level, dims, span) parameter.
  3. AbstractDataStore::writeHdf5 signature deliberately unchanged.
  4. IDataIO-level opt-in — only DataArrayIO and NeighborListIO participate; geometry/AM/scalar/string IOs stay contiguous.

I'll validate each.


1) WriteOptions struct

Why this works. The struct gives a single forward-compatible vehicle for every write-time policy knob. Today it carries one field. Tomorrow it can carry shuffle filter on/off, alternate codec choice, chunk-size override, or a per-array filter callback — none of those additions break the eight call sites currently wired through it. The two pre-existing zero-arg WriteFile / AppendFile overloads delegate to the options-aware ones with WriteOptions{}, so no caller in the codebase has to change unless it wants to.
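
A hedged sketch of that shape, reduced to the two overloads and the single field; these are not the exact simplnx declarations:

#include <cstdint>
#include <filesystem>

class DataStructureWriter
{
public:
  struct WriteOptions
  {
    int32_t compressionLevel = 0; // room to grow: shuffle on/off, codec choice, chunk override, ...
  };

  // Pre-existing zero-options entry point; every old call site keeps working.
  static bool WriteFile(const std::filesystem::path& path)
  {
    return WriteFile(path, WriteOptions{});
  }

  // Options-aware overload: the single place new write-time policy enters.
  static bool WriteFile(const std::filesystem::path& path, const WriteOptions& options)
  {
    // Open the file, construct the writer, stamp options onto each DatasetIO it
    // creates, then walk the DataStructure (elided here).
    (void)path;
    (void)options;
    return true;
  }
};

Adding a shuffle flag or a chunk override later only touches WriteOptions and the code that consumes it, not the overload set.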

Why a struct beats an int level parameter. A plain int would have meant adding a parameter to every WriteFile / AppendFile / WriteDataStructure overload, then doing it again the next time we add shuffle, then again for chunk overrides, etc. Every addition would either ripple through every signature or live as a side channel. The struct centralizes that growth.

Why a struct beats a free-function setGlobalCompressionLevel. Tested. Compression policy travels with the writer instance, not the process. Two pipelines can write two files at two levels without a TLS dance.

Verdict: correct. The minor nit is the namespace placement (DataStructureWriter::WriteOptions rather than nx::core::HDF5::WriteOptions) — this is the only place where the concept name has a longer-than-needed FQN. Cosmetic, not load-bearing.


2) DatasetIO::m_CompressionLevel as per-instance state

Why this works. DatasetIO is short-lived (created at write-time, destroyed at scope exit) and is already stateful — it owns an hid_t, tracks open/closed state, and has move semantics. Adding one int32 of write-policy state matches the rest of the class.

Why it's not a writeSpan(dims, span, options) parameter. This is the more interesting trade. Every writeSpan<T> instantiation is currently in a hot path called via DataStoreIO::WriteDataStore. Adding a parameter would have meant:

  • Updating all 11 explicit-template-instantiation declarations in DatasetIO.hpp plus their definitions.
  • Updating every call site (DataArrayIO, NeighborListIO, plus any external plugins that call writeSpan directly — and the SimplnxOoc plugin does).

Setting a per-instance level once and letting writeSpan consult it costs one int32 per writer and zero call-site changes. The downside — a caller could forget to set the level — is bounded: forgetting falls through to the existing safe default (level 0, contiguous), so the failure mode is silent uncompressed output, not corruption.
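
The resulting call pattern, as a hedged sketch (DatasetIO's real API differs; only the temporal coupling is shown):

#include <cstdint>
#include <vector>

// Stand-in for nx::core::HDF5::DatasetIO, reduced to the two calls involved.
struct DatasetIOSketch
{
  void setCompressionLevel(int32_t level) noexcept
  {
    if(level >= 0 && level <= 9) // out-of-range input is ignored
    {
      m_CompressionLevel = level;
    }
  }

  bool writeSpan(const std::vector<uint64_t>& dims, const float* data) const
  {
    // Consults m_CompressionLevel; if a caller forgot to set it, the write falls
    // back to the safe default (0 = contiguous), never to corrupt output.
    (void)dims;
    (void)data;
    return m_CompressionLevel >= 0;
  }

  int32_t m_CompressionLevel = 0;
};

inline bool writeWithCompression(DatasetIOSketch& datasetIO, const std::vector<float>& values)
{
  datasetIO.setCompressionLevel(5);                           // stamp the policy first...
  return datasetIO.writeSpan({values.size()}, values.data()); // ...then write
}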

The chosen design also dovetails with the layering: DataStructureWriter knows the policy, stamps each DatasetIO it creates, and the IDataIO subclasses don't have to thread the level through. Verdict: correct for this codebase's call-site economics.

A "harder to misuse" alternative — passing options to writeSpan directly — would be safer in isolation but would compound the API churn for very little real-world risk.


3) AbstractDataStore::writeHdf5 signature deliberately unchanged

This is the non-obvious design call and is the right one. AbstractDataStore::writeHdf5 is overridden by the external SimplnxOoc plugin (out-of-tree). Adding a parameter would have:

  • Forced a SimplnxOoc release on the same beat as this PR, OR
  • Required a backwards-compat shim inside the AbstractDataStore base class.

By keeping the signature stable and routing compression through DatasetIO::m_CompressionLevel, the SimplnxOoc override compiles unchanged. SimplnxOoc's chunk-aware writes that route through their own writeChunk path simply don't pick up compression — which is acceptable because OOC stores already manage their own on-disk layout (the chunked-IO architecture is itself an alternative to gzip).

Verdict: correct, and the constraint is documented in the commit message. This is a load-bearing decision; future contributors should not "clean it up" without verifying the OOC plugin still builds.


4) IDataIO-level opt-in (DataArrayIO + NeighborListIO only)

Why this works. Compression is only worthwhile when the dataset is large and homogeneous. The compressed candidates in this PR — DataArray and NeighborList flattened payload — are exactly that. The non-compressed candidates are:

  • Geometry IO (image/rect/triangle/edge/vertex/quad/hex/tetra): geometry datasets are mostly small (vertices < 1M typical), and the metadata datasets are tiny.
  • AttributeMatrixIO: stores tuple shape only, a few usize values.
  • ScalarDataIO: a single scalar.
  • StringArrayIO: variable-length strings; compression filters apply per-chunk and the win on string data is marginal.
  • NeighborListIO numNeighbors array: same shape as a small DataArray; it gets the writer-level compression too via the DataArray path.

The 16 KiB small-array bypass inside BuildChunkedDeflateDcpl is the second line of defense: even if a small DataArray slips through (which it does), the helper falls through to contiguous storage. So the policy is "compress big array-shaped data; everything else stays contiguous" — implemented at two layers.
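
In code terms, that second line of defense is just a byte-count threshold; a hedged sketch, with the constant and function names assumed rather than taken from DatasetIO.cpp:

#include <cstdint>

constexpr uint64_t k_SmallArrayBypassBytes = 16ull * 1024ull; // 16 KiB

// Datasets below the threshold stay contiguous even when compression is requested.
inline bool ShouldBypassCompression(uint64_t totalBytes)
{
  return totalBytes < k_SmallArrayBypassBytes;
}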

Why not compress everything? Three reasons:

  • Chunk-index overhead dominates for small datasets (motivation for the 16 KiB bypass).
  • File-open time grows with chunk count; metadata datasets are read on every open.
  • gzip on string arrays doesn't help meaningfully because LZ77 already does string-like compression and HDF5 string handling adds overhead per chunk.

Verdict: correct, and the choice is conservative — strict opt-in over opt-out. Easy to expand to additional IDataIO subclasses if benchmarks justify it later; harder to retract.


Cross-cutting design decisions

Compression-on-by-default at level 5

Defaults are policy. Three options were available:

  • Off by default — safest, but new files would need explicit opt-in to benefit.
  • On at level 1 — fast, smaller win.
  • On at level 5 — chosen.

Level 5 is the gzip middle of the road: typically 60-80% of level-9 ratio at 30-40% of level-9 CPU cost. For voxel/feature data this lands at ~3× smaller files with negligible read-side decompression cost (deflate decode is very fast). The CPU cost on write is bounded by the user's IO subsystem; on most modern disks gzip-5 is faster end-to-end than uncompressed because the reduction in bytes written to disk outweighs the compression cost. Justified.

FromSIMPLJson sets UseCompression = false

This is the design decision that prevents a silent file-format compatibility surprise. SIMPL v6 wrote uncompressed files; if a SIMPL-imported pipeline silently started writing gzip-5 files, every downstream consumer that parses byte offsets, every script that hashes dream3d files for caching, and every reproducibility regression suite would see a one-time bit-for-bit difference. By forcing the default to false for SIMPL conversions, the imported pipeline produces the byte-identical output the user expected when they saved it. Verdict: this is exactly right. params.linkParameters connects the bool to the level field in the UI, so the user can still enable compression manually if desired.

Preflight rejection of out-of-range levels (when compression is on) but acceptance when off

Looking at WriteDREAM3DFilter::preflightImpl:

if(useCompression && (compressionLevel < 1 || compressionLevel > 9))
{
  return MakePreflightErrorResult(...);
}

When useCompression == false, the level is ignored. This is consistent with the linked-parameter pattern — the disabled level field shouldn't block preflight. Tested explicitly in Compression_Preflight_RejectsOutOfRangeLevel. Verdict: correct.

Chunk shape "1 MiB along outermost dim"

The choice has three independent rationales that all point the same way:

  1. HDF5 stores in C-order, so the outermost dim is the slowest-varying — chunking along it creates contiguous-in-memory chunks.
  2. DataArray iteration is tuple-major; chunks aligned to tuple boundaries match the access pattern.
  3. 1 MiB is a sweet spot for gzip's LZ77 dictionary (32 KiB) — large enough to not be dictionary-bound, small enough that chunk-cache misses are cheap.

A user-configurable chunk shape was rejected (correctly, IMO) because the parameter surface for "compression knobs that almost no one will tune" is itself a maintenance cost. If a real-world dataset emerges where 1 MiB is wrong, this becomes a WriteOptions field. Verdict: defensible default with a clear escape hatch (extend WriteOptions later).
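
As a hedged sketch of the arithmetic (names and the exact clamping are assumptions, not the DatasetIO.cpp implementation):

#include <algorithm>
#include <cstdint>
#include <vector>

inline std::vector<uint64_t> ChunkShapeForOneMiB(const std::vector<uint64_t>& dims, uint64_t elementByteSize)
{
  constexpr uint64_t k_TargetChunkBytes = 1024ull * 1024ull; // ~1 MiB per chunk
  if(dims.empty() || elementByteSize == 0)
  {
    return {};
  }
  std::vector<uint64_t> chunkDims = dims; // inner dims stay at full extent
  uint64_t rowBytes = elementByteSize;
  for(size_t i = 1; i < dims.size(); i++)
  {
    rowBytes *= dims[i]; // bytes per "row" along the outermost dimension
  }
  uint64_t rowsPerChunk = k_TargetChunkBytes / rowBytes;
  if(rowsPerChunk == 0)
  {
    rowsPerChunk = 1; // a single row already exceeds the target; one row per chunk
  }
  chunkDims[0] = std::min(dims[0], rowsPerChunk); // never exceed the dataset extent
  return chunkDims;
}

For a float32 array of 10 million tuples with 3 components, rowBytes is 12, rowsPerChunk is roughly 87k, and the chunk shape works out to {87381, 3}.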

Error propagation strategy in BuildChunkedDeflateDcpl

The helper distinguishes three result classes:

  1. Errors (level out of range, HDF5 API failures) → Result<hid_t>::invalid(), propagated through writeSpan.
  2. Intentional fall-through (level 0, empty dims, sub-16 KiB) → Result<hid_t>::valid(H5P_DEFAULT).
  3. Success → Result<hid_t>::valid(dcpl), with caller-owned ownership.

This is the right shape because the caller has to do different things in each case: (1) abort the write, (2) write contiguously without the helper failing the operation, (3) write chunked and close the DCPL. Encoding that as a sentinel-only return (H5P_DEFAULT for any non-error) would have lost the level-out-of-range diagnostic. Verdict: well-designed. The original review feedback pushed this from sentinel to Result<> and the change is materially better.
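
Under those assumptions, the three result classes map onto roughly this shape; the Result type below is a simplified stand-in for simplnx's Result<>, and the helper is illustrative only:

#include <hdf5.h>
#include <cstdint>
#include <string>
#include <vector>

struct DcplResult
{
  hid_t value = H5P_DEFAULT; // H5P_DEFAULT marks the intentional contiguous fall-through
  bool valid = true;
  std::string error;
};

inline DcplResult BuildDcplSketch(int32_t level, const std::vector<hsize_t>& chunkDims)
{
  if(level == 0)
  {
    return {}; // class 2: intentional fall-through, write contiguously
  }
  if(level < 1 || level > 9)
  {
    return {H5P_DEFAULT, false, "compression level out of range"}; // class 1: error
  }
  hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
  if(dcpl < 0)
  {
    return {H5P_DEFAULT, false, "H5Pcreate failed"}; // class 1: error
  }
  if(H5Pset_chunk(dcpl, static_cast<int>(chunkDims.size()), chunkDims.data()) < 0 ||
     H5Pset_deflate(dcpl, static_cast<unsigned int>(level)) < 0)
  {
    H5Pclose(dcpl); // close the partial DCPL before propagating the error
    return {H5P_DEFAULT, false, "H5Pset_chunk/H5Pset_deflate failed"}; // class 1: error
  }
  return {dcpl, true, {}}; // class 3: success, caller owns dcpl and must H5Pclose it
}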


Summary

The design holds up to scrutiny on every axis I checked. The decisions are layered, the trade-offs are explicit, and each non-obvious choice (write-time state location, AbstractDataStore signature stability, IDataIO opt-in scope, FromSIMPLJson default) has a concrete rationale tied to the constraints of the simplnx ecosystem and its external plugins. The defaults are conservative (safe legacy behavior preserved) where the cost of breaking is high, and aggressive (compression on at 5) where the cost is low and the benefit is real.

The change to surface Result<hid_t> from the DCPL builder, in response to JDuffeyBQ's review, is a strict improvement and should be the model for migrating the older CreateH5DatasetChunkProperties path in a follow-up.

Comment thread src/simplnx/Utilities/Parsing/HDF5/IO/DatasetIO.cpp
Comment thread test/UnitTestCommon/include/simplnx/UnitTest/HDF5DatasetProbe.hpp Outdated
Comment thread test/UnitTestCommon/include/simplnx/UnitTest/HDF5DatasetProbe.hpp Outdated
imikejackson added a commit to imikejackson/simplnx that referenced this pull request Apr 30, 2026
* BuildChunkedDeflateDcpl now returns errors for empty dims, zero
  elementByteSize, zero-valued dims, and total-byte overflow rather
  than silently falling through to H5P_DEFAULT. The intentional
  small-array bypass (< 16 KiB) still returns H5P_DEFAULT.
* ProbeHdf5Dataset now returns std::optional<DatasetProbeInfo>;
  removed the DatasetLayout::Unknown sentinel. All call sites in
  H5Test.cpp and DREAM3DFileTest.cpp updated to check has_value().
* Trimmed implementation-detail prose from the ProbeHdf5Dataset
  doc comment.

Signed-off-by: Michael Jackson <mike.jackson@bluequartz.net>
imikejackson and others added 3 commits April 30, 2026 18:48
Adds gzip (deflate) compression for DataArray datasets in .dream3d files,
exposed through two new WriteDREAM3DFilter parameters:

  use_compression  (bool,  default true)
  compression_level (int,   1-9, default 5)

Architecture:
- DataStructureWriter carries a WriteOptions struct (compression level
  only, for now); WriteOptions flows through Dream3dIO::WriteFile ->
  DataStructureWriter::WriteFile.
- DatasetIO carries m_CompressionLevel state; writeSpan consults it and
  builds a chunked+deflate DCPL via CreateDatasetCreationPropertyList
  when level is 1-9 and the total byte count clears a 16 KiB small-array
  bypass.
- AbstractDataStore::writeHdf5 signature is intentionally unchanged so
  the external SimplnxOoc plugin's override keeps compiling.
- DataArrayIO and NeighborListIO stamp the compression level on their
  DatasetIO before writing. Other IDataIO subclasses (geometry IOs,
  attribute matrix IO, scalar IO, string array IO, etc.) stay
  contiguous, so metadata datasets do not participate in compression.
- Chunk shape targets ~1 MiB per chunk (pre-clamped, overflow-safe).
- FromSIMPLJson sets use_compression=false so converted legacy pipelines
  preserve their on-disk encoding.

File format:
- No version bump. Stock HDF5 readers handle deflate transparently.

Tests (DREAM3DFileTest.cpp):
- Compression_Off_IsContiguous
- Compression_On_IsChunkedAndDeflated  (layout + round-trip bit-identity)
- Compression_SmallArray_Bypasses
- Compression_LevelsRoundTrip          (levels {1, 5, 9} monotonically
                                         non-increasing file sizes)
- Preflight range check on compression_level.

Docs:
- WriteDREAM3DFilter.md expanded to describe compression and XDMF.
- Doxygen on every new public / protected symbol introduced
  (DatasetIO::CreateDatasetCreationPropertyList, the two WriteFile
  overloads, WriteDREAM3DInputValues fields, and accessors).

Squashed from the following 15 commits on topic/enable_hdf5_compression:

>> DOC: Design spec for HDF5 gzip compression in .dream3d writer

Adds a brainstorm-validated design for compressing DataArray datasets
with gzip (deflate) on write. Two new parameters on WriteDREAM3DFilter
(use_compression, compression_level 1-9, default ON at level 5) plumbed
via a WriteOptions struct on DataStructureWriter. Chunking targets
~1 MiB per chunk; arrays under 16 KiB bypass compression. No file
version bump required -- stock HDF5 readers handle deflate transparently.

>> DOC: Update HDF5 compression spec -- keep AbstractDataStore::writeHdf5 ABI stable

Reading the actual code revealed AbstractDataStore::writeHdf5 is a pure
virtual overridden by the SimplnxOoc plugin's DataStore subclass (in a
separate repo). Changing its signature would silently break that build.

Switched the opt-in mechanism: DatasetIO carries m_CompressionLevel
state; DataArrayIO sets it before calling WriteDataStore; writeSpan
consults its own state. AbstractDataStore::writeHdf5 stays unchanged.
Also corrected the test-file reference from a nonexistent
WriteDREAM3DFilterTest.cpp to the existing DREAM3DFileTest.cpp.

>> DOC: Implementation plan for HDF5 gzip compression in .dream3d writer

Seven-task plan that lands the compression feature incrementally with
frequent green builds and commits:
  1. WriteOptions on DataStructureWriter (plumbing only)
  2. Forward WriteOptions through Dream3dIO::WriteFile
  3. DatasetIO gets m_CompressionLevel + DCPL helper + writeSpan wiring
  4. DataArrayIO opts in on behalf of DataArray datasets
  5. WriteDREAM3DFilter exposes the two user-facing parameters
  6. Spec Section 6 test suite (off / on / small-bypass / levels)
  7. Full build + ctest verification

TDD ordering throughout; every task commits green.

>> ENH: Introduce WriteOptions on DataStructureWriter

Plumbing-only change. DataStructureWriter grows a WriteOptions struct
(currently just compressionLevel) plus getter/setter and WriteOptions
overloads on WriteFile/AppendFile. Existing overloads forward to the new
ones with a default (0, no compression), preserving current behavior.

>> ENH: Dream3dIO::WriteFile accepts optional WriteOptions

Adds new WriteFile overloads on both FileIO-based and path-based entry
points that forward WriteOptions through to DataStructureWriter. The
existing no-options overloads are preserved and delegate to the new ones
with a default-constructed WriteOptions, so every existing call site
continues to work unchanged.

Threads options through the file-scope WriteDataStructure helper
(overloaded) so the layered WriteFile(path) -> WriteFile(FileIO) ->
WriteDataStructure -> DataStructureWriter::WriteFile pipeline is
preserved without duplicating the version/pipeline/data-structure
sequence.

>> CLEANUP: Remove dead no-options WriteDataStructure helper in Dream3dIO

After threading WriteOptions through the WriteFile layering, the old
no-options WriteDataStructure(FileIO&, DS) file-scope helper has no
callers. Dropped. The remaining with-options helper is the only call
site for DataStructureWriter::WriteFile from this translation unit.

>> ENH: DatasetIO writes chunked+deflate when m_CompressionLevel > 0

DatasetIO learns a compression-level setting (default 0 = contiguous).
writeSpan now builds a DCPL via CreateDatasetCreationPropertyList, which
applies H5Pset_chunk + H5Pset_deflate targeting ~1 MiB chunks when the
level is 1-9 and the total byte count clears a 16 KiB small-array bypass.
Callers that don't opt in get today's contiguous behavior unchanged.

>> DOC: Align compression spec with the defensive-fallback implementation

Section 4.3 previously promised "no silent fallback to uncompressed".
After Task 3 landed, the implementation (intentionally, per the plan)
returns H5P_DEFAULT from the DCPL helper on any HDF5 setup error,
causing an uncompressed write rather than failing the whole file. In
practice the failure paths are unreachable on stock HDF5 -- gzip is
always available, H5Pcreate never fails, and our chunk math is
pre-clamped. Update the spec to describe the actual (defensive-backstop)
behavior and note that preflight validation is the real user-visible
guardrail.

>> ENH: DataArrayIO applies WriteOptions.compressionLevel before writing

DataArrayIO::writeData now stamps the compression level from the
DataStructureWriter's WriteOptions onto the DatasetIO before calling
WriteDataStore. Other IDataIO subclasses (geometry IOs, attribute matrix
IO, scalar IO, string array IO, etc.) are unchanged, so metadata datasets
stay contiguous while DataArray datasets participate in compression when
requested.

AbstractDataStore::writeHdf5's signature remains stable so the external
SimplnxOoc plugin continues to build without changes.

>> ENH: WriteDREAM3DFilter exposes HDF5 gzip compression controls

Two new parameters: use_compression (bool, default true) and
compression_level (int32, 1-9, default 5), linked so the level is only
editable when compression is enabled. Level is range-validated at
preflight. executeImpl forwards to WriteDREAM3D, which builds
DataStructureWriter::WriteOptions from the values and passes them
through DREAM3D::WriteFile. parametersVersion bumps to 2.

>> TEST: Exercise HDF5 gzip compression in WriteDREAM3DFilter

Four end-to-end tests covering the spec's required behaviors:
- Compression_Off_IsContiguous: use_compression=false writes contiguous.
- Compression_On_IsChunkedAndDeflated: use_compression=true writes a
  chunked dataset with deflate at the requested level, and the file
  round-trips byte-identically.
- Compression_SmallArray_Bypasses: small DataArrays stay contiguous
  even with compression on.
- Compression_LevelsRoundTrip: levels {1, 5, 9} all round-trip, and
  file sizes are monotonically non-increasing as level rises.

All four tests pass. Adds an InspectDatasetLayout helper to the
anonymous namespace that opens a .dream3d via raw HDF5 and reports
layout + deflate filter presence/level, keeping the per-test assertions
concise.

>> FIX: Address critical review findings on HDF5 compression branch

Three critical issues surfaced by review:

1. Compression_LevelsRoundTrip size-monotonicity check was dead code --
   DYNAMIC_SECTION restarts the TEST_CASE body per section, so
   sizesByLevel was always size 1 when the size==3 guard ran. Rewrite
   without DYNAMIC_SECTION so sizes accumulate across all three levels
   and the monotonicity REQUIREs actually fire. Also adds the missing
   REQUIRE(fr.isValid()) before ReadFile.

2. FromSIMPLJson was silently upgrading SIMPL v6 pipelines to write
   compressed output. Explicitly set use_compression=false in
   FromSIMPLJson so converted pipelines preserve their original
   uncompressed on-disk encoding. New NX pipelines still default to
   compression on.

3. DatasetIO move constructor and move assignment did not propagate
   m_CompressionLevel -- a latent footgun if a DatasetIO is ever moved
   after setCompressionLevel was called. Fixed.

>> FIX: Address important + nitpick review findings on HDF5 compression

Important:
- I1: Test HDF5 handle leaks on REQUIRE failure. H5Test's three DatasetIO
  tests now go through a local ProbeDataset() helper that handles its own
  cleanup, so REQUIRE on the returned struct can't leak file/dataset/dcpl
  handles. DREAM3DFileTest's Task 4 integration test routes through the
  existing InspectDatasetLayout helper for the same reason.
- I2: Add REQUIRE(fileReader.isValid()) before DREAM3D::ReadFile in the
  round-trip check.
- I4: Doxygen @param/@return on every new accessor (get/setWriteOptions
  on DataStructureWriter, get/setCompressionLevel on DatasetIO).
- I5: H5Pget_filter2's cd_nelmts in/out parameter now matches the 8-slot
  cdValues buffer capacity instead of hardcoding 1. Harmless with gzip
  (1 cd_value) but correct API usage.
- I6: CreateDatasetCreationPropertyList now has explicit overflow and
  empty-dims guards; comment no longer claims overflow is handled when
  it wasn't. Saturating multiply check falls through to contiguous on
  any wrap or zero dim.
- I7: NeighborListIO's flattened data array now inherits the same
  compression level as DataArray datasets -- neighbor lists can be very
  large and benefit just as much.
- I8: New preflight test exercises the compression_level range check
  (rejects 0 and 10 when use_compression=true; accepts 0 when
  use_compression=false).

Nitpicks:
- Dead range check `rowsPerChunk < 1` on unsigned type -> `== 0`.
- Comment rot ("today's behavior") replaced with time-independent wording.
- Noexcept on the four new trivial accessors.
- Magic preflight error code -2 replaced with named
  k_InvalidCompressionLevelError.
- Loop variable types int -> int32 in dataset probing helpers.
- Local constant kTuples renamed to k_Tuples (constexpr) to match
  project k_-prefix convention for named constants.
- Removed redundant `nx::core::` qualifications where `using namespace`
  was already in scope.
- 1ull * 1024ull * 1024ull -> 1024ull * 1024ull.
- Removed narrator-style WHAT comments in tests; the code explains
  itself.

>> DOC: Expand WriteDREAM3DFilter documentation for compression and XDMF

Replace the one-sentence description with a full explanation of:

- What the .dream3d file contains (DataStructure + pipeline).
- The new HDF5 gzip compression parameters: what they do, why gzip was
  chosen, what level 5 means relative to 1/9, and caveats around the
  16 KiB small-array bypass, non-DataArray metadata staying contiguous,
  the automatic 1 MiB chunk-shape target, round-trip losslessness, and
  the unchanged file format.
- The FromSIMPLJson behavior (compression off by default for legacy
  pipeline conversions).
- What the .xdmf sidecar is for: ParaView/VisIt interoperability,
  metadata-only (points into the .dream3d), near-zero cost, recommended
  when visualization outside DREAM3D-NX is a possibility.

>> DOC: Flesh out Doxygen on new HDF5 compression APIs

Expand documentation on every new public / protected symbol introduced
by the compression branch so each has @brief plus described @param and
@return tags where applicable:

- DatasetIO::CreateDatasetCreationPropertyList: full @param/@return with
  the ownership contract and fallthrough conditions spelled out.
- Dream3dIO::WriteFile two new overloads: parameter descriptions filled
  in (were previously bare tag names).
- WriteDREAM3DInputValues: added @struct docstring, plus inline ///
  comments on the two new UseCompression / CompressionLevel fields to
  call out the interaction (level is ignored when the switch is off).

Private members and file-local anonymous-namespace helpers remain
undocumented per existing house style.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Encapsulate the chunked+deflate DCPL builder as a file-local helper in
  DatasetIO.cpp returning Result<hid_t>; H5Pcreate / H5Pset_chunk /
  H5Pset_deflate failures now propagate through writeSpan instead of
  silently degrading to contiguous storage. Intentional fall-throughs
  (level 0, empty dims, sub-16 KiB arrays) still resolve to H5P_DEFAULT.
* Drop the public CreateDatasetCreationPropertyList declaration from
  DatasetIO.hpp; it had no callers outside DatasetIO.cpp.
* setCompressionLevel now ignores out-of-range input so the documented
  contract matches the implementation.
* Trim the misleading "honor the same compression level DataArrayIO uses"
  comment in NeighborListIO; there is one global compressionLevel.
* Move the duplicated HDF5 dataset probe helper from H5Test.cpp and
  DREAM3DFileTest.cpp into a shared simplnx::UnitTestCommon header
  (HDF5DatasetProbe.hpp/.cpp) exposing a DatasetLayout enum class so
  consumers no longer need to include <hdf5.h> for layout assertions.

Signed-off-by: Michael Jackson <mike.jackson@bluequartz.net>
* BuildChunkedDeflateDcpl now returns errors for empty dims, zero
  elementByteSize, zero-valued dims, and total-byte overflow rather
  than silently falling through to H5P_DEFAULT. The intentional
  small-array bypass (< 16 KiB) still returns H5P_DEFAULT.
* ProbeHdf5Dataset now returns std::optional<DatasetProbeInfo>;
  removed the DatasetLayout::Unknown sentinel. All call sites in
  H5Test.cpp and DREAM3DFileTest.cpp updated to check has_value().
* Trimmed implementation-detail prose from the ProbeHdf5Dataset
  doc comment.

Signed-off-by: Michael Jackson <mike.jackson@bluequartz.net>
@imikejackson imikejackson force-pushed the topic/enable_hdf5_compression branch from e04de5f to e320086 on April 30, 2026 22:49
@imikejackson imikejackson requested a review from JDuffeyBQ May 1, 2026 15:18
@imikejackson imikejackson merged commit 560529d into BlueQuartzSoftware:develop May 1, 2026
6 checks passed
@imikejackson imikejackson deleted the topic/enable_hdf5_compression branch May 1, 2026 15:56
