Perf: OpenMP cache blocking and SIMD for PW_Basis FFT transform copy routines by MiniYuanBot · Pull Request #7439 · deepmodeling/abacus-develop

MiniYuanBot · 2026-06-05T14:51:14Z

What's changed

This PR optimizes the memory-bound copy loops in PW_Basis::real2recip and PW_Basis::recip2real (source/module_pw/pw_transform.cpp) using cache blocking and SIMD vectorization, while maintaining full numerical compatibility with the original implementation.

Key Changes

Cache blocking (tiling)
Introduced a unified block size pw_transform_cache_block = 1024 and helper block_end(). All long copy loops are rewritten in a two-level structure:

#pragma omp parallel for schedule(static)
for (int ib = 0; ib < nrxx_; ib += pw_transform_cache_block) {
    const int iend = block_end(ib, nrxx_);
    #pragma omp simd
    for (int ir = ib; ir < iend; ++ir) {
        auxr[ir] = in_[ir];
    }
}

Each block's working set stays in L1/L2 cache, and because every thread writes whole blocks, threads rarely touch the same cache line, which mitigates false sharing across OpenMP threads.
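For readers who want to try the two-level loop outside ABACUS, here is a minimal self-contained sketch. The definitions of pw_transform_cache_block and block_end() are not shown in this PR description, so the versions below are assumptions that only mirror the names; blocked_copy is a hypothetical wrapper introduced for illustration.

```cpp
#include <algorithm>
#include <complex>
#include <vector>

// Assumed stand-ins for the PR's constant and helper; the real
// definitions in pw_transform.cpp may differ.
constexpr int pw_transform_cache_block = 1024;

// Clamp the end of a block so the last, partial block stops at n.
inline int block_end(const int begin, const int n)
{
    return std::min(begin + pw_transform_cache_block, n);
}

// Hypothetical helper: copy n elements from in to out block by block.
// Each OpenMP thread is handed whole blocks; the inner stride-1 loop
// is left to the vectorizer.
template <typename T>
void blocked_copy(const T* in, T* out, const int n)
{
#pragma omp parallel for schedule(static)
    for (int ib = 0; ib < n; ib += pw_transform_cache_block)
    {
        const int iend = block_end(ib, n);
#pragma omp simd
        for (int ir = ib; ir < iend; ++ir)
        {
            out[ir] = in[ir];
        }
    }
}
```

Compiled without OpenMP support the pragmas are ignored and the function degrades to a plain serial copy, so the result is the same either way.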

SIMD vectorization
Added #pragma omp simd to the inner stride-1 loops (contiguous copy, zeroing, and accumulation). This helps the compiler emit packed SIMD instructions (AVX2/AVX-512) for std::complex<FPTYPE> and real-valued buffers.

Alias analysis & pointer caching
Cached frequently accessed member variables (nrxx, npw, nxyz, ig2isz) and FFT buffer pointers (auxr, auxg, rspace) as local const variables. This avoids repeated this-> indirection and helps the compiler rule out aliasing between the loop's stores and the class members.

Finer-grained timers
Added sub-timers (real2recip_copy_r, real2recip_copy_g, recip2real_copy_r, recip2real_copy_g) to separate memory-copy overhead from FFT library time, which should help future profiling.
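The pointer-caching idea can be illustrated with a toy class. This is only a sketch under stated assumptions: ToyBasis and gather() are hypothetical, the member names merely echo those listed above, and the 1/nxyz scaling is assumed for illustration rather than taken from the PR's diff.

```cpp
#include <complex>
#include <vector>

// Hypothetical, simplified stand-in for PW_Basis; only the fields
// needed for the sketch are modelled.
struct ToyBasis
{
    int npw = 0;                             // number of plane waves
    int nxyz = 0;                            // total number of FFT grid points
    std::vector<int> ig2isz;                 // plane-wave index -> FFT buffer index
    std::vector<std::complex<double>> auxg;  // reciprocal-space FFT buffer

    // Gather plane-wave coefficients out of the FFT buffer, scaled by 1/nxyz.
    void gather(std::complex<double>* out) const
    {
        // Read the members once into local constants so the loop body
        // works on plain locals instead of going through this-> on
        // every iteration.
        const int npw_ = npw;
        const int* const ig2isz_ = ig2isz.data();
        const std::complex<double>* const auxg_ = auxg.data();
        const double inv_nxyz = 1.0 / static_cast<double>(nxyz);

#pragma omp parallel for schedule(static)
        for (int ig = 0; ig < npw_; ++ig)
        {
            out[ig] = auxg_[ig2isz_[ig]] * inv_nxyz;
        }
    }
};
```

The gather loop is indexed through ig2isz, so it is not a stride-1 loop and is deliberately left without #pragma omp simd here.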

Performance (256^3 grid, ecut=50, 20 repeats, WSL2 GCC 13.3.0)

Threads	Time (s)	Speedup	Efficiency
1	29.34	1.00	100.0%
2	13.84	2.12	105.9%
4	10.17	2.88	72.1%
8	4.53	6.48	81.0%
12	4.93	5.95	49.6%
16	3.71	7.91	49.4%

8 physical cores achieve 6.48× speedup at 81% parallel efficiency.
Efficiency drops beyond 8 threads due to Hyper-Threading and memory-bandwidth saturation, which is expected for memory-intensive FFT kernels.

Files Changed

source/module_pw/pw_transform.cpp — optimized copy loops and timers

MiniYuanBot · 2026-06-05T14:52:16Z

\label project_learning
This is Problem 3 of assignment01 on the plane-wave module.
Thanks for the review :)

Qianruipku

LGTM. Could you include a performance comparison with the original implementation?

MiniYuanBot · 2026-06-06T12:45:42Z

Here is the benchmark on a 256³ grid, 20 repeats, GCC 13.3.0, OMP_PROC_BIND=close, with pw_transform_cache_block = 128:

Threads	Before (s)	After (s)	Speedup
1	18.25	16.22	1.13×
4	5.22	4.69	1.11×
8	3.39	3.09	1.10×
16	3.20	2.79	1.15×

Block size rationale
I tested pw_transform_cache_block = 64, 128, 256, 512, and 1024 and found that 1024 was slower. This may be because its working set (2 arrays × 1024 × 16 B ≈ 32 KB) approaches my PC's per-core L1d capacity (48 KB), causing capacity misses. I therefore set the parameter to 128.
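The working-set estimate above is simple enough to check in code. The sketch below assumes 16-byte elements (std::complex<double>), two arrays per copy, and a 48 KiB L1d; the function names are hypothetical.

```cpp
#include <cstddef>

// Bytes touched by one block of a copy: `arrays` buffers (source and
// destination by default) of block_size elements each. 16 B per element
// matches std::complex<double>.
constexpr std::size_t block_working_set_bytes(std::size_t block_size,
                                              std::size_t element_bytes = 16,
                                              std::size_t arrays = 2)
{
    return arrays * block_size * element_bytes;
}

// Fraction of an L1d cache occupied by one block's working set.
// The 48 KiB default is the per-core figure quoted in the comment above.
constexpr double l1d_fraction(std::size_t block_size,
                              std::size_t l1d_bytes = 48 * 1024)
{
    return static_cast<double>(block_working_set_bytes(block_size))
           / static_cast<double>(l1d_bytes);
}
```

Under these assumptions a block of 1024 occupies 32768 bytes, about two thirds of the L1d, while a block of 128 occupies 4096 bytes, which leaves most of the cache free for other data.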

Note on absolute numbers
These runs were done under WSL2, and I don't know why the absolute timings differ so much from the earlier run. I still think the before/after comparison within this run is meaningful.

Thanks for the suggestion!

add simd to fft

4af6586

mohanchen added the project_learning label Jun 5, 2026

mohanchen assigned Qianruipku Jun 5, 2026

mohanchen requested a review from Qianruipku June 5, 2026 22:04

Qianruipku reviewed Jun 6, 2026

View reviewed changes

MiniYuanBot added 2 commits June 6, 2026 20:48

Set pw_transform_cache_block=128

dad1744

Merge branch 'develop' into feat/fft-copy-block-simd

0b26389

MiniYuanBot wants to merge 3 commits into deepmodeling:develop from mystic-qaq:feat/fft-copy-block-simd