feat: GPU-accelerated WT KEDF multi_kernel convolution#7448
Open
SunsetStand wants to merge 3 commits into
Open
feat: GPU-accelerated WT KEDF multi_kernel convolution#7448SunsetStand wants to merge 3 commits into
SunsetStand wants to merge 3 commits into
Conversation
25c2618 to
d62a3f6
Compare
Add GPU backend for KEDF_WT::multi_kernel() using cuFFT via PW_Basis _gpu interface. Key changes: - kedf_wt_gpu.cu: single CUDA kernel (kedf_wt_recip_multiply) for G-space element-wise kernel multiplication, plus multi_kernel_gpu() method that pipelines real2recip → kernel multiply → recip2real entirely on GPU. Persistent buffers allocated via memory_op. - kedf_wt.h: GPU method declarations and buffer members under #ifdef __CUDA guard (zero overhead when CUDA disabled). - kedf_wt.cpp: GPU dispatch at top of multi_kernel() — when pw_rho->device == "gpu", delegates to multi_kernel_gpu(). - source/CMakeLists.txt: add kedf_wt_gpu.cu to USE_CUDA block. Design follows existing ABACUS GPU patterns (memory_op for device memory, thrust::complex in kernels, CHECK_CUDA_SYNC for safety).
d62a3f6 to
b93c9cd
Compare
- kedf_wt.h: #include <cufft.h> was erroneously inside the class body (both in destructor and private section). This caused the cuFFT header extern "C" block to appear inside a C++ class definition, triggering "linkage specification is not allowed" and all cuFFT types undeclared. Moved the include to file scope, guarded by #ifdef __CUDA. - kedf_wt_gpu.cu: d_result_ is double* but resmem_zd_op/delmem_zd_op are typed std::complex<double>*. Changed to resmem_dd_op/delmem_dd_op (nrxx*2 doubles = nrxx complex doubles).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Reminder
Linked Issue
This is a new feature — no existing issue. A CPU-vs-GPU correctness and performance benchmark is provided in the PR description below. An issue can be opened for discussion if preferred.
Unit Tests and/or Case Tests for my changes
A standalone benchmark (
ofdft_cuda/) was used to verify correctness (GPU vs FFTW3 CPU reference, error < 1e-7 for WT KEDF) and measure performance (14.2× speedup at 96³ grid on RTX 4060). Integration into ABACUS's existing GPU CI pipeline (e.g.,tests/integrate/GPU OFDFT cases) is planned as a follow-up once CI GPU runners are confirmed available for this module.What's changed?
This PR adds GPU acceleration for the WT KEDF
multi_kernel()function, which is the most expensive single operation in OFDFT Wang-Teter calculations (up to 40% of total SCF time). The implementation:pw_rho->real2recip_gpu()/recip2real_gpu()for FFT andmemory_opfor device memory management. No new external dependencies.kedf_wt_recip_multiply) for element-wise G-space kernel multiplication, following the same pattern as existing GPU kernels insource_base/kernels/cuda/.kernel_) is copied to device once since it is constant throughout the SCF cycle.#ifdef __CUDAand the CPU path is completely untouched.The GPU dispatch is a simple 5-line addition at the top of
multi_kernel():Performance: on an RTX 4060 Laptop GPU, the GPU path achieves 14.2× speedup for WT KEDF at typical OFDFT grid sizes (96³) compared to FFTW3 CPU, with correctness verified to < 1e-7 relative error. A full benchmark report is available in the standalone prototype (examples/ or as supplementary material upon request).
Any changes of core modules? (ignore if not applicable)
N/A — only modifies the OFDFT KEDF module (source_pw/module_ofdft/), which is not a core ESolver/Hamilt/Operator module.