Skip to content

feat: GPU-accelerated WT KEDF multi_kernel convolution#7448

Open
SunsetStand wants to merge 3 commits into
deepmodeling:developfrom
SunsetStand:feature/ofdft-kedf-gpu
Open

feat: GPU-accelerated WT KEDF multi_kernel convolution#7448
SunsetStand wants to merge 3 commits into
deepmodeling:developfrom
SunsetStand:feature/ofdft-kedf-gpu

Conversation

@SunsetStand
Copy link
Copy Markdown

Reminder

  • Have you linked an issue with this pull request?
  • Have you added adequate unit tests and/or case tests for your pull request?
  • Have you noticed possible changes of behavior below or in the linked issue?
  • Have you explained the changes of codes in core modules of ESolver, HSolver, ElecState, Hamilt, Operator or Psi? (ignore if not applicable)

Linked Issue

This is a new feature — no existing issue. A CPU-vs-GPU correctness and performance benchmark is provided in the PR description below. An issue can be opened for discussion if preferred.

Unit Tests and/or Case Tests for my changes

A standalone benchmark (ofdft_cuda/) was used to verify correctness (GPU vs FFTW3 CPU reference, error < 1e-7 for WT KEDF) and measure performance (14.2× speedup at 96³ grid on RTX 4060). Integration into ABACUS's existing GPU CI pipeline (e.g., tests/integrate/ GPU OFDFT cases) is planned as a follow-up once CI GPU runners are confirmed available for this module.

What's changed?

This PR adds GPU acceleration for the WT KEDF multi_kernel() function, which is the most expensive single operation in OFDFT Wang-Teter calculations (up to 40% of total SCF time). The implementation:

  1. Uses ABACUS's existing GPU infrastructurepw_rho->real2recip_gpu() / recip2real_gpu() for FFT and memory_op for device memory management. No new external dependencies.
  2. Adds a single CUDA kernel (kedf_wt_recip_multiply) for element-wise G-space kernel multiplication, following the same pattern as existing GPU kernels in source_base/kernels/cuda/.
  3. Persistent GPU buffers are lazily allocated on first call and reused across SCF iterations. The WT kernel array (kernel_) is copied to device once since it is constant throughout the SCF cycle.
  4. Zero overhead when CUDA is disabled — all GPU code is guarded by #ifdef __CUDA and the CPU path is completely untouched.
    The GPU dispatch is a simple 5-line addition at the top of multi_kernel():
#ifdef __CUDA
    if (pw_rho->device == "gpu") {
        this->multi_kernel_gpu(prho, rkernel_rho, exponent, pw_rho);
        return;
    }
#endif

Performance: on an RTX 4060 Laptop GPU, the GPU path achieves 14.2× speedup for WT KEDF at typical OFDFT grid sizes (96³) compared to FFTW3 CPU, with correctness verified to < 1e-7 relative error. A full benchmark report is available in the standalone prototype (examples/ or as supplementary material upon request).

Any changes of core modules? (ignore if not applicable)

N/A — only modifies the OFDFT KEDF module (source_pw/module_ofdft/), which is not a core ESolver/Hamilt/Operator module.

@SunsetStand SunsetStand force-pushed the feature/ofdft-kedf-gpu branch 2 times, most recently from 25c2618 to d62a3f6 Compare June 7, 2026 03:19
Add GPU backend for KEDF_WT::multi_kernel() using cuFFT via
PW_Basis _gpu interface. Key changes:

- kedf_wt_gpu.cu: single CUDA kernel (kedf_wt_recip_multiply) for
  G-space element-wise kernel multiplication, plus multi_kernel_gpu()
  method that pipelines real2recip → kernel multiply → recip2real
  entirely on GPU. Persistent buffers allocated via memory_op.

- kedf_wt.h: GPU method declarations and buffer members under
  #ifdef __CUDA guard (zero overhead when CUDA disabled).

- kedf_wt.cpp: GPU dispatch at top of multi_kernel() — when
  pw_rho->device == "gpu", delegates to multi_kernel_gpu().

- source/CMakeLists.txt: add kedf_wt_gpu.cu to USE_CUDA block.

Design follows existing ABACUS GPU patterns (memory_op for device
memory, thrust::complex in kernels, CHECK_CUDA_SYNC for safety).
@SunsetStand SunsetStand force-pushed the feature/ofdft-kedf-gpu branch from d62a3f6 to b93c9cd Compare June 7, 2026 03:23
- kedf_wt.h: #include <cufft.h> was erroneously inside the class body
  (both in destructor and private section). This caused the cuFFT header
  extern "C" block to appear inside a C++ class definition, triggering
  "linkage specification is not allowed" and all cuFFT types undeclared.
  Moved the include to file scope, guarded by #ifdef __CUDA.

- kedf_wt_gpu.cu: d_result_ is double* but resmem_zd_op/delmem_zd_op are
  typed std::complex<double>*. Changed to resmem_dd_op/delmem_dd_op
  (nrxx*2 doubles = nrxx complex doubles).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant