Fix GPU predictor kernel stride for multi-sample TIFFs#1222

Merged
brendancol merged 1 commit into master from issue-1220 on Apr 19, 2026

Conversation

@brendancol
Contributor

Fixes #1220.

Both call sites of _predictor_decode_kernel passed width = tile_width * samples and bytes_per_sample = itemsize * samples. The kernel body does row_bytes = width * bytes_per_sample, so row_bytes came out to tile_width * samples**2 * itemsize instead of tile_width * samples * itemsize.
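The mismatch is plain arithmetic. A minimal sketch (the numbers are illustrative and the variable names mirror this description, not the actual kernel source):

```python
# Illustrative numbers: a 256-pixel-wide RGB uint8 tile.
tile_width, samples, itemsize = 256, 3, 1

# Buggy call convention: samples folded into BOTH width and bytes_per_sample.
width = tile_width * samples
bytes_per_sample = itemsize * samples
row_bytes_buggy = width * bytes_per_sample  # = tile_width * samples**2 * itemsize

# What an interleaved multi-sample row actually occupies.
row_bytes_correct = tile_width * samples * itemsize

print(row_bytes_buggy, row_bytes_correct)  # 2304 768: a 3x overshoot per row
```

For RGB the overshoot factor is `samples`, i.e. 3x; for RGBA it would be 4x.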

For tiled multi-sample TIFFs with predictor=2, that meant the cumulative-sum loop walked samples times further per row than the row actually contained. On the last tile in the buffer it wrote past the end of d_decomp itself, an out-of-bounds GPU write that surfaces no error.

To reproduce: read any tiled TIFF with SamplesPerPixel > 1 (RGB, RGBA) and predictor=2 via open_geotiff(path, gpu=True) or read_geotiff_gpu; the decoded pixels don't match the CPU decode.
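The failure mode can be modeled on the CPU with plain NumPy. This is a sketch of the kernel's cumulative-sum loop shape, not the actual CUDA source: decoding with the correct row stride round-trips exactly, while the inflated stride first smears into neighboring rows and then runs off the end of the buffer.

```python
import numpy as np

def hdiff_encode(tile):
    # Per-channel horizontal differencing (TIFF predictor=2) on a
    # (rows, width, samples) uint8 tile; returns a flat byte buffer.
    enc = tile.astype(np.int16)
    enc[:, 1:, :] -= tile[:, :-1, :].astype(np.int16)
    return (enc & 0xFF).astype(np.uint8).ravel()

def hdiff_decode(buf, rows, row_bytes, stride):
    # Cumulative sum per row: each byte adds the byte one
    # pixel-stride (samples * itemsize) earlier in the row.
    out = np.array(buf, dtype=np.uint8)
    for r in range(rows):
        base = r * row_bytes
        for i in range(base + stride, base + row_bytes):
            out[i] = (int(out[i]) + int(out[i - stride])) & 0xFF
    return out

rng = np.random.default_rng(0)
tile = rng.integers(0, 256, size=(4, 8, 3), dtype=np.uint8)  # 4 rows, 8 px, RGB
buf = hdiff_encode(tile)
tile_width, samples, itemsize = 8, 3, 1

# Correct stride round-trips exactly.
good = hdiff_decode(buf, 4, tile_width * samples * itemsize, samples * itemsize)
assert np.array_equal(good, tile.ravel())

# The inflated row_bytes (tile_width * samples**2 * itemsize) walks into
# the following rows and then past the end of the buffer:
try:
    hdiff_decode(buf, 4, tile_width * samples**2 * itemsize, samples * itemsize)
except IndexError:
    print("ran off the end of the buffer, as the GPU kernel did")
```

In the real kernel there is no IndexError to catch: the equivalent access is a silent out-of-bounds device write.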

The fix is to pass width = tile_width at both call sites. bytes_per_sample = itemsize * samples stays, and the kernel itself is unchanged. This matches the convention the CPU path already uses in _reader._apply_predictor(..., bytes_per_sample * samples).
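Sketched as a diff (the argument order shown here is assumed for illustration; only the width argument changes):

```diff
- _predictor_decode_kernel(..., tile_width * samples, itemsize * samples, ...)
+ _predictor_decode_kernel(..., tile_width, itemsize * samples, ...)
```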

The new test file xrspatial/geotiff/tests/test_predictor_multisample.py builds tiled multi-sample TIFFs (RGB/RGBA, uint8/uint16, even and uneven tile grids) with predictor=2, decodes on the GPU, and checks byte-for-byte equality with the CPU decode. I confirmed the tests fail on the unpatched code and pass after the fix by stashing the change and re-running.

Test plan:

  • pytest xrspatial/geotiff/tests/test_predictor_multisample.py on a box with CuPy + CUDA.
  • Tests skip cleanly where CUDA isn't available.
  • pytest xrspatial/geotiff/tests/ clean apart from the pre-existing matplotlib palette recursion failures.

One thing I noticed but didn't fix here: the predictor=3 path has its own argument pattern (bytes_per_sample=itemsize, no * samples) that also looks wrong for multi-sample data. Out of scope for this PR.

Both call sites of _predictor_decode_kernel passed
width=tile_width*samples and bytes_per_sample=itemsize*samples. The
kernel computes row_bytes = width * bytes_per_sample, so row_bytes
ended up tile_width * samples**2 * itemsize instead of
tile_width * samples * itemsize. The inner loop walked past the end
of each tile row and, on the last tile, past the end of the d_decomp
buffer (OOB GPU write).

Fix: pass width=tile_width at both call sites. bytes_per_sample stays
itemsize*samples. Matches the CPU call convention in
_reader._apply_predictor.

Regression test builds tiled multi-sample TIFFs (RGB/RGBA, uint8/uint16,
even and uneven tile grids) with predictor=2, decodes via the GPU path,
and asserts byte-for-byte equality with the CPU decode.
@github-actions github-actions bot added the performance PR touches performance-sensitive code label Apr 19, 2026
@brendancol brendancol merged commit 0fb7f4c into master Apr 19, 2026
11 checks passed


Closes: GPU predictor decode kernel over-indexes d_decomp for multi-sample tiled TIFFs (#1220)