From b24db5987337dcbcf81637ccf563c08fad07fb44 Mon Sep 17 00:00:00 2001
From: Michel Schanen
Date: Thu, 18 Jun 2026 15:44:39 +0000
Subject: [PATCH] Give workgroup barriers their memory-fence flags

`barrier(0)` lowers to an `OpControlBarrier` with `SequentiallyConsistent`
semantics but no storage-class bit, which the SPIR-V spec treats as ordering
no memory. So shared-local (and global) writes are not guaranteed visible to
other work-items after the barrier, which can silently drop updates (e.g. a
workgroup local-atomic accumulation losing counts).

Pass the appropriate fence flags so the barrier actually orders memory:
`LOCAL_MEM_FENCE | GLOBAL_MEM_FENCE` for KA `@synchronize` (matching CUDA
`__syncthreads`), and `LOCAL_MEM_FENCE` for the mapreduce reduce_group
shared-memory tree.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 src/mapreduce.jl     | 6 +++++-
 src/oneAPIKernels.jl | 8 +++++++-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/src/mapreduce.jl b/src/mapreduce.jl
index 822b9b16..645db2cd 100644
--- a/src/mapreduce.jl
+++ b/src/mapreduce.jl
@@ -33,7 +33,11 @@
     # perform a reduction
     d = 1
     while d < items
-        barrier(0)
+        # Fence local memory: `barrier(0)` lowers to an OpControlBarrier without the
+        # WorkgroupMemory storage-class bit, which does not order the shared-local tree
+        # accesses across the barrier. Fence local memory so each tree step sees the
+        # previous step's `shared[]` writes.
+        barrier(SPIRVIntrinsics.LOCAL_MEM_FENCE)
         index = 2 * d * (item-1) + 1
         @inbounds if index <= items
             other_val = if index + d <= items
diff --git a/src/oneAPIKernels.jl b/src/oneAPIKernels.jl
index 6e092397..bc6f3218 100644
--- a/src/oneAPIKernels.jl
+++ b/src/oneAPIKernels.jl
@@ -214,7 +214,13 @@ end
 ## Synchronization and Printing
 
 @device_override @inline function KA.__synchronize()
-    barrier(0)
+    # Fence both local and global memory across the workgroup barrier, matching CUDA
+    # `__syncthreads` semantics. `barrier(0)` lowers to `OpControlBarrier` with
+    # `SequentiallyConsistent` but WITHOUT any storage-class bit, which the SPIR-V spec
+    # treats as ordering *no* memory — so shared-local or global writes are not guaranteed
+    # visible to other work-items after the barrier. `LOCAL_MEM_FENCE | GLOBAL_MEM_FENCE`
+    # ORs in the WorkgroupMemory/CrossWorkgroupMemory fence bits.
+    barrier(SPIRVIntrinsics.LOCAL_MEM_FENCE | SPIRVIntrinsics.GLOBAL_MEM_FENCE)
 end
 
 @device_override @inline function KA.__print(args...)