Perf: OpenMP cache blocking and SIMD for PW_Basis FFT transform copy routines #7439

Open
MiniYuanBot wants to merge 3 commits into deepmodeling:develop from mystic-qaq:feat/fft-copy-block-simd

@MiniYuanBot

What's changed

This PR optimizes the memory-bound copy loops in PW_Basis::real2recip and PW_Basis::recip2real (source/module_pw/pw_transform.cpp) using cache blocking and SIMD vectorization, while maintaining full numerical compatibility with the original implementation.

Key Changes

  1. Cache blocking (tiling)
    Introduced a unified block size pw_transform_cache_block = 1024 and helper block_end(). All long copy loops are rewritten in a two-level structure:

    #pragma omp parallel for schedule(static)
    for (int ib = 0; ib < nrxx_; ib += pw_transform_cache_block) {
        const int iend = block_end(ib, nrxx_);
        #pragma omp simd
        for (int ir = ib; ir < iend; ++ir) {
            auxr[ir] = in_[ir];
        }
    }

    This keeps the working set in L1/L2 cache and mitigates false sharing across OpenMP threads.

  2. SIMD vectorization
    Added #pragma omp simd to the inner stride-1 loops (contiguous copy, zeroing, and accumulation). This helps the compiler emit unit-stride SIMD loads and stores (AVX2/AVX-512) for std::complex<FPTYPE> and real-valued buffers.

  3. Alias analysis & pointer caching
    Cached frequently accessed member variables (nrxx, npw, nxyz, ig2isz) and FFT buffer pointers (auxr, auxg, rspace) as local const variables. This avoids repeated this-> indirection inside the loops and gives the compiler more information for alias analysis.

  4. Finer-grained timers
    Added sub-timers (real2recip_copy_r, real2recip_copy_g, recip2real_copy_r, recip2real_copy_g) to isolate memory-copy overhead from FFT library time, aiding future profiling.
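
For readers who want to try the two-level loop from items 1 and 2 in isolation, here is a minimal sketch. The constant pw_transform_cache_block and the helper block_end() are reconstructed from the description above, so their actual definitions in pw_transform.cpp may differ; blocked_copy() is a hypothetical name used only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <complex>

// Assumed stand-ins for the PR's constant and helper; the real
// definitions in pw_transform.cpp may differ.
constexpr int pw_transform_cache_block = 1024;

inline int block_end(const int begin, const int total)
{
    // Clamp so the last (possibly partial) block stops at `total`.
    return std::min(begin + pw_transform_cache_block, total);
}

// Hypothetical helper: copy n elements from `in` to `out` block by block.
// The outer loop distributes whole blocks across OpenMP threads; the
// inner stride-1 loop is marked for SIMD vectorization.
template <typename T>
void blocked_copy(const T* in, T* out, const int n)
{
    assert(n >= 0);
#pragma omp parallel for schedule(static)
    for (int ib = 0; ib < n; ib += pw_transform_cache_block)
    {
        const int iend = block_end(ib, n);
#pragma omp simd
        for (int ir = ib; ir < iend; ++ir)
        {
            out[ir] = in[ir];
        }
    }
}
```

With GCC the pragmas take effect when compiling with -fopenmp; without it they are ignored and the function degrades to a plain serial copy, which makes it easy to check that results are identical either way.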

Performance (256^3 grid, ecut=50, 20 repeats, WSL2 GCC 13.3.0)

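| Threads | Time (s) | Speedup | Efficiency |
|---------|----------|---------|------------|
| 1       | 29.34    | 1.00    | 100.0%     |
| 2       | 13.84    | 2.12    | 105.9%     |
| 4       | 10.17    | 2.88    | 72.1%      |
| 8       | 4.53     | 6.48    | 81.0%      |
| 12      | 4.93     | 5.96    | 49.6%      |
| 16      | 3.71     | 7.91    | 49.4%      |
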
  • 8 physical cores achieve 6.48× speedup at 81% parallel efficiency.
  • Efficiency drops beyond 8 threads due to Hyper-Threading and memory-bandwidth saturation, which is expected for memory-intensive FFT kernels.
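
The Speedup and Efficiency columns follow the usual definitions (speedup = T(1) / T(p), efficiency = speedup / p). A small sketch of that arithmetic, with hypothetical function names chosen for this example:

```cpp
#include <cassert>

// Speedup of a p-thread run relative to the single-thread time t1.
inline double speedup(const double t1, const double tp)
{
    assert(tp > 0.0);
    return t1 / tp;
}

// Parallel efficiency as a fraction: speedup divided by thread count.
inline double efficiency(const double t1, const double tp, const int threads)
{
    assert(threads > 0);
    return speedup(t1, tp) / threads;
}
```

For the 8-thread row, speedup(29.34, 4.53) is about 6.48 and efficiency(29.34, 4.53, 8) is about 0.81, matching the table.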

Files Changed

  • source/module_pw/pw_transform.cpp — optimized copy loops and timers

@MiniYuanBot
Author

MiniYuanBot commented Jun 5, 2026

\label project_learning
This is Problem 3 of the assignment01 on the plane wave module.
Thanks for the review :)

Collaborator

@Qianruipku Qianruipku left a comment

LGTM. Could you include a performance comparison with the original implementation?

@MiniYuanBot
Author

Here is the benchmark on a 256³ grid, 20 repeats, GCC 13.3.0, OMP_PROC_BIND=close, with pw_transform_cache_block = 128:

| Threads | Before (s) | After (s) | Speedup |
|---------|------------|-----------|---------|
| 1       | 18.25      | 16.22     | 1.13×   |
| 4       | 5.22       | 4.69      | 1.11×   |
| 8       | 3.39       | 3.09      | 1.10×   |
| 16      | 3.20       | 2.79      | 1.15×   |

Block size rationale
I tested pw_transform_cache_block = 64, 128, 256, 512, and 1024 and found that 1024 was slower. This may be because its working set (2 arrays × 1024 × 16 B ≈ 32 KB) approaches my PC's per-core L1d capacity (48 KB), causing capacity misses. I therefore changed the parameter to 128.
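
The working-set estimate can be reproduced with a few lines. This is only a sketch of the arithmetic: it assumes double-precision complex elements (16 bytes each) and one source plus one destination array per block, and block_working_set_bytes() is a hypothetical name, not something from the PR.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

// Bytes touched by one block of a copy loop: one source and one
// destination array, each holding `block` elements of type T.
template <typename T>
constexpr std::size_t block_working_set_bytes(const std::size_t block)
{
    return 2 * block * sizeof(T);
}

// Assumption: std::complex<double> occupies 16 bytes, as on common platforms.
static_assert(sizeof(std::complex<double>) == 16, "expected 16-byte complex");
static_assert(block_working_set_bytes<std::complex<double>>(1024) == 32 * 1024,
              "1024-element blocks touch 32 KiB");
```

Under the same assumptions, a block size of 128 touches 2 × 128 × 16 B = 4 KiB per block, well below the 48 KB L1d figure quoted above.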

Note on absolute numbers
I ran this on WSL2, and I don't know why the absolute timings in this run differ so much from the earlier one. However, I think the before/after comparison within this run is still meaningful.

Thanks for the suggestion!
