Fix GPU predictor kernel stride for multi-sample TIFFs#1222

Merged
brendancol merged 1 commit into master from issue-1220 on Apr 19, 2026

Conversation

@brendancol
Contributor

Fixes #1220.

Both call sites of _predictor_decode_kernel passed width = tile_width * samples and bytes_per_sample = itemsize * samples. The kernel body does row_bytes = width * bytes_per_sample, so row_bytes came out to tile_width * samples**2 * itemsize instead of tile_width * samples * itemsize.
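The mismatch is plain arithmetic. A minimal sketch (the numbers are illustrative and the variable names mirror this description, not the actual kernel source):

```python
# Illustrative numbers: a 256-pixel-wide RGB uint8 tile.
tile_width, samples, itemsize = 256, 3, 1

# Buggy call convention: samples folded into BOTH width and bytes_per_sample.
width = tile_width * samples
bytes_per_sample = itemsize * samples
row_bytes_buggy = width * bytes_per_sample  # = tile_width * samples**2 * itemsize

# What an interleaved multi-sample row actually occupies.
row_bytes_correct = tile_width * samples * itemsize

print(row_bytes_buggy, row_bytes_correct)  # 2304 768: a 3x overshoot per row
```

For RGB the overshoot factor is `samples`, i.e. 3x; for RGBA it would be 4x.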

For tiled multi-sample TIFFs with predictor=2, that meant the cumulative-sum loop walked samples times further per row than the row actually contained. On the last tile in the buffer it wrote past the end of d_decomp itself, an out-of-bounds GPU write that surfaces no error.

To reproduce: read any tiled TIFF with SamplesPerPixel > 1 (RGB, RGBA) and predictor=2 via open_geotiff(path, gpu=True) or read_geotiff_gpu; the decoded pixels don't match the CPU decode.
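The failure mode can be modeled on the CPU with plain NumPy. This is a sketch of the kernel's cumulative-sum loop shape, not the actual CUDA source: decoding with the correct row stride round-trips exactly, while the inflated stride first smears into neighboring rows and then runs off the end of the buffer.

```python
import numpy as np

def hdiff_encode(tile):
    # Per-channel horizontal differencing (TIFF predictor=2) on a
    # (rows, width, samples) uint8 tile; returns a flat byte buffer.
    enc = tile.astype(np.int16)
    enc[:, 1:, :] -= tile[:, :-1, :].astype(np.int16)
    return (enc & 0xFF).astype(np.uint8).ravel()

def hdiff_decode(buf, rows, row_bytes, stride):
    # Cumulative sum per row: each byte adds the byte one
    # pixel-stride (samples * itemsize) earlier in the row.
    out = np.array(buf, dtype=np.uint8)
    for r in range(rows):
        base = r * row_bytes
        for i in range(base + stride, base + row_bytes):
            out[i] = (int(out[i]) + int(out[i - stride])) & 0xFF
    return out

rng = np.random.default_rng(0)
tile = rng.integers(0, 256, size=(4, 8, 3), dtype=np.uint8)  # 4 rows, 8 px, RGB
buf = hdiff_encode(tile)
tile_width, samples, itemsize = 8, 3, 1

# Correct stride round-trips exactly.
good = hdiff_decode(buf, 4, tile_width * samples * itemsize, samples * itemsize)
assert np.array_equal(good, tile.ravel())

# The inflated row_bytes (tile_width * samples**2 * itemsize) walks into
# the following rows and then past the end of the buffer:
try:
    hdiff_decode(buf, 4, tile_width * samples**2 * itemsize, samples * itemsize)
except IndexError:
    print("ran off the end of the buffer, as the GPU kernel did")
```

In the real kernel there is no IndexError to catch: the equivalent access is a silent out-of-bounds device write.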

The fix is to pass width = tile_width at both call sites. bytes_per_sample = itemsize * samples stays, and the kernel itself is unchanged. This matches the convention the CPU path already uses in _reader._apply_predictor(..., bytes_per_sample * samples).
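Sketched as a diff (the argument order shown here is assumed for illustration; only the width argument changes):

```diff
- _predictor_decode_kernel(..., tile_width * samples, itemsize * samples, ...)
+ _predictor_decode_kernel(..., tile_width, itemsize * samples, ...)
```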

The new test file xrspatial/geotiff/tests/test_predictor_multisample.py builds tiled multi-sample TIFFs (RGB/RGBA, uint8/uint16, even and uneven tile grids) with predictor=2, decodes on the GPU, and checks byte-for-byte equality with the CPU decode. I confirmed the tests fail on the unpatched code and pass after the fix by stashing the change and re-running.

Test plan:

  • pytest xrspatial/geotiff/tests/test_predictor_multisample.py on a box with CuPy + CUDA.
  • Tests skip cleanly where CUDA isn't available.
  • pytest xrspatial/geotiff/tests/ clean apart from the pre-existing matplotlib palette recursion failures.

One thing I noticed but didn't fix here: the predictor=3 path has its own argument pattern (bytes_per_sample=itemsize, no * samples) that also looks wrong for multi-sample data. Out of scope for this PR.

Both call sites of _predictor_decode_kernel passed
width=tile_width*samples and bytes_per_sample=itemsize*samples. The
kernel computes row_bytes = width * bytes_per_sample, so row_bytes
ended up tile_width * samples**2 * itemsize instead of
tile_width * samples * itemsize. The inner loop walked past the end
of each tile row and, on the last tile, past the end of the d_decomp
buffer (OOB GPU write).

Fix: pass width=tile_width at both call sites. bytes_per_sample stays
itemsize*samples. Matches the CPU call convention in
_reader._apply_predictor.

Regression test builds tiled multi-sample TIFFs (RGB/RGBA, uint8/uint16,
even and uneven tile grids) with predictor=2, decodes via the GPU path,
and asserts byte-for-byte equality with the CPU decode.
@github-actions github-actions bot added the performance PR touches performance-sensitive code label Apr 19, 2026
@brendancol brendancol merged commit 0fb7f4c into master Apr 19, 2026
11 checks passed


Closes: GPU predictor decode kernel over-indexes d_decomp for multi-sample tiled TIFFs (#1220)