Grid integration optimization #371

Open
Auxiliarycirclefzy wants to merge 472 commits into abacusmodeling:develop from Auxiliarycirclefzy:Grid_integral_optimization

@Auxiliarycirclefzy

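We are group bxcx-044b.
Group members: 涂敞 2300011032, 冯梓烨 2300011022, 闫奕成 2300011070
In this PR we optimize the grid integration module in two ways: parallelization, and data reordering with blocking.
Thank you for reviewing!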

Cstandardlib and others added 30 commits January 21, 2026 10:53
* Increase md_nstep from 3 to 4

* Increase md_nstep from 3 to 4 in INPUT file
…#6884)

* Feature: Support ML EXX for training script.

* Update the interface to libnpy

* Refactor: Update the interface of libnpy in ml_tools

* Refactor: Implement the class ML_Base, which is the base class of KEDF_ML

* Feature: Add support to ML_EXX for KSDFT and OFDFT

* Fix: Update hamilt_pw.cpp

* Update ml_base.h and ml_base.cpp

* Fix: Modify pot_ml_exx.cpp to avoid negative value of rho

* Divide ml_base.cpp to ml_base.cpp and ml_base_pot.cpp

* Fix: Update pot_ml_exx.cpp
…odeling#6881)

* Refactor: save memory for kinetic and overlap force and stress

* Test: add UT for ekinetic_new and overlap_new

* Fix: error of force and stress after refactor

* Fix: UT for ekinetic and overlap

* Fix: gamma_only error of force_stress of edm

* Refactor: unify force/stress calculation for overlap and ekinetic operators

* Fix: overlap force stress error for nspin=2

* split test to serial part and parallel part

---------

Co-authored-by: dyzheng <zhengdy@bjaisi.com>
…les (deepmodeling#6878)

* update the examples of 02_NAO_Gamma

* update

* update

* update

* update tests in 02_NAO_Gamma

* small updates of write_HS.hpp

* update the format of H(k) and S(k)

* update write_HS.hpp

* update

* update the number of MD steps to match the input parameter; MD steps now start from 1 instead of 0

* update 02_NAO_Gamma examples

* add examples 002 and 003 in 02_NAO_Gamma

* update examples 41 and 42

* updates of 43 and 57 examples

* update example 17 in 03_NAO_multik

* update 44 example of 03_NAO_multik

* update 092 in 01_PW

* update 01_PW examples

* update 04_FF examples

* update 05_rtTDDFT examples

* update 06_SDFT examples

* update 07_OFDFT examples

* update 15_rtTDDFT_GPU examples

* update 16 and 17 examples in 15_rtTDDFT_GPU

* update 02

* fix bug

* fix bug

* update

* update 16_SDFT_GPU

* update

* update 02 data

* update 005 example in 02_NAO_Gamma

* add 006 in 02

* update CASES_CPU.txt

* fix a bug in 08_EXX 06

* fix bugs

* update alllog test

* fix a bug: when reading the orbital files goes wrong, the code should not quit immediately

* update of some formats

* fix a small bug

* update examples in 03_NAO_multik

* update

* update 35 example for pchg

* update dipole output in rt examples

* update 01 example in rt-TDDFT

* update rt-TDDFT input files

* update some INPUT files in rt-TDDFT

* Fix: Add missing return true in read_orb_file function to prevent double free error

* fix unittests

* update CASES_CPU.txt in 03_NAO_multik

* Modify output filename from INPUT to INPUT.info in driver.cpp

* update catch_properties

---------

Co-authored-by: abacus_fixer <mohanchen@pku.eud.cn>
Replace token-based authentication with OIDC (OpenID Connect) for codecov-action.
This is more secure and eliminates the need to manage upload tokens.

Changes:
- Add use_oidc: true to codecov-action configuration
- Add id-token: write permission at workflow level
- Remove token parameter from codecov-action (ignored when using OIDC)

This improves security and follows codecov-action best practices.

Generated by the task: njzjz-bot/njzjz-bot#25.
…esolver (deepmodeling#6892)

* Refactor: Encapsulate timer functionality in timer_wrapper.h

* Refactor timer code and clean_esolver function

1. Remove #ifdef __MPI from timer code, encapsulate in timer_wrapper.h
2. Move ESolver clean logic to after_all_runners method
3. Replace clean_esolver calls with direct delete p_esolver
4. Remove #ifdef __MPI from delete p_esolver
5. Add Cblacs_exit(1) in after_all_runners for LCAO calculations
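
The first item above, hiding the `#ifdef __MPI` switch behind one wrapper, might look roughly like the sketch below. This is not the content of `timer_wrapper.h`; the namespace `sketch` and the function `wall_time` are hypothetical names, and the choice of `std::chrono::steady_clock` for the serial build is an assumption.

```cpp
#include <cassert>
#include <chrono>
#ifdef __MPI
#include <mpi.h>
#endif

namespace sketch
{
// Hypothetical helper: returns wall-clock seconds from some fixed origin.
// Callers never see the #ifdef; only this one function does.
inline double wall_time()
{
#ifdef __MPI
    // MPI build: use the MPI wall clock (assumes MPI is already initialized).
    return MPI_Wtime();
#else
    // Serial build: fall back to a monotonic standard-library clock.
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(clock::now().time_since_epoch()).count();
#endif
}
} // namespace sketch
```

Timer code can then compute `sketch::wall_time() - start` without any preprocessor branches at the call sites.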

---------

Co-authored-by: abacus_fixer <mohanchen@pku.eud.cn>
* Refactor: Encapsulate timer functionality in timer_wrapper.h

* Refactor timer code and clean_esolver function

1. Remove #ifdef __MPI from timer code, encapsulate in timer_wrapper.h
2. Move ESolver clean logic to after_all_runners method
3. Replace clean_esolver calls with direct delete p_esolver
4. Remove #ifdef __MPI from delete p_esolver
5. Add Cblacs_exit(1) in after_all_runners for LCAO calculations

* Refactor: Move heterogeneous parallel code to source_base/module_device

---------

Co-authored-by: abacus_fixer <mohanchen@pku.eud.cn>
…ing#6888)

* Feature: add Hessian operator <\phi|\nabla_x\nabla_y|\phi>

* fix: UT of twocenterintegral

---------

Co-authored-by: dyzheng <zhengdy@bjaisi.com>
* Refactor: Encapsulate timer functionality in timer_wrapper.h

* Refactor timer code and clean_esolver function

1. Remove #ifdef __MPI from timer code, encapsulate in timer_wrapper.h
2. Move ESolver clean logic to after_all_runners method
3. Replace clean_esolver calls with direct delete p_esolver
4. Remove #ifdef __MPI from delete p_esolver
5. Add Cblacs_exit(1) in after_all_runners for LCAO calculations

* Refactor: Move heterogeneous parallel code to source_base/module_device

* Refactor heterogeneous parallel code and migrate exx_info to module_xc

1. Refactor global.h:
   - Removed heterogeneous parallel code (CUDA/ROCm error checking macros)
   - Added include for source_base/module_device/device_check.h
   - Removed GlobalC::exx_info declaration

2. Migrate exx_info:
   - Added GlobalC::exx_info declaration to exx_info.h
   - Created exx_info.cpp with GlobalC::exx_info definition
   - Removed exx_info definition from global.cpp
   - Removed duplicate exx_info definition from exx_helper.cpp

3. Update build system:
   - Added exx_info.cpp to xc_ library in CMakeLists.txt
   - Added exx_info.o to OBJS_XC in Makefile.Objects
   - Fixed formatting in Makefile.Objects

4. Ensure compatibility:
   - Verify pure PW compilation works with exx_info.cpp
   - Verify GPU compilation works with refactored code

This refactoring improves code modularity by separating heterogeneous parallel functionality from global variables and moving EXX-related global variables to their own module.

---------

Co-authored-by: abacus_fixer <mohanchen@pku.eud.cn>
…kage (deepmodeling#6898)

* Refactor: Encapsulate timer functionality in timer_wrapper.h

* Refactor timer code and clean_esolver function

1. Remove #ifdef __MPI from timer code, encapsulate in timer_wrapper.h
2. Move ESolver clean logic to after_all_runners method
3. Replace clean_esolver calls with direct delete p_esolver
4. Remove #ifdef __MPI from delete p_esolver
5. Add Cblacs_exit(1) in after_all_runners for LCAO calculations

* Refactor: Move heterogeneous parallel code to source_base/module_device

* Refactor heterogeneous parallel code and migrate exx_info to module_xc

1. Refactor global.h:
   - Removed heterogeneous parallel code (CUDA/ROCm error checking macros)
   - Added include for source_base/module_device/device_check.h
   - Removed GlobalC::exx_info declaration

2. Migrate exx_info:
   - Added GlobalC::exx_info declaration to exx_info.h
   - Created exx_info.cpp with GlobalC::exx_info definition
   - Removed exx_info definition from global.cpp
   - Removed duplicate exx_info definition from exx_helper.cpp

3. Update build system:
   - Added exx_info.cpp to xc_ library in CMakeLists.txt
   - Added exx_info.o to OBJS_XC in Makefile.Objects
   - Fixed formatting in Makefile.Objects

4. Ensure compatibility:
   - Verify pure PW compilation works with exx_info.cpp
   - Verify GPU compilation works with refactored code

This refactoring improves code modularity by separating heterogeneous parallel functionality from global variables and moving EXX-related global variables to their own module.

* Move GlobalC::restart to source_io/restart

1. Move GlobalC::restart declaration from global.h to restart.h
2. Move GlobalC::restart definition from global.cpp to restart.cpp
3. Keep the same functionality and usage
4. Improve code modularity by centralizing restart-related code in source_io module
5. Ensure compatibility with both pure PW and GPU compilation modes

* Remove unnecessary global.h includes and fix line_search.cpp compilation error

* update global.h

* update global.h

* update global.h

* update global.h

* update stress_pw.cpp

* update global.h

* update global.h in module_pwdft

* update global.h

* update module_stodft

* delete global.h in source_io

* fix source_io

* delete inclusion of global.h in source_io

* Refactor: Remove unnecessary includes and clean up global.h references

* delete global.h in source_lcao

* update

* update

* fix

* update

* fix

* update source_cell

* update source_esolver

* update esolver

* update

* update module_charge

* update module_pot

* continue

* update fix

* fix

* update dftu

* update deepks

* fix

* delete globalc.h in module_ri

* fix

* fix

* fix dftu_io

* fix diago_lapack.cpp

* updates

* update rdmft

* update module_rt

* update td operator

* update module_pwdft/operator etc

* solve fft

* update xc

* update

* delete global.h and global.cpp, finally after nearly 20 years

* fix op_exx_lcao

* fix

* fix

---------

Co-authored-by: abacus_fixer <mohanchen@pku.eud.cn>
The gint_gpu_vars.h file already exists in the kernel directory.
This temp_gint directory was left over from a previous refactoring.

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…ling#6886)

* Add files via upload

* Add files via upload

* Add files via upload

* Add files via upload

* Delete source/ctrl_output_td.h

* Add files via upload

* Add files via upload

* Add files via upload

* Add files via upload

* Update td_info.cpp

* Update td_current_io_comm.cpp

---------

Co-authored-by: Mohan Chen <mohanchen@pku.edu.cn>
* Fix: Add override to Pot_ML_EXX::cal_v_eff to avoid compilation warning.

* Fix: Provide a clearer, friendlier error when ML KEDF is used without ENABLE_MLALGO.

* Fix: Add validation for out_elf and spin=4 combo.
* Refactor: Encapsulate timer functionality in timer_wrapper.h

* Refactor timer code and clean_esolver function

1. Remove #ifdef __MPI from timer code, encapsulate in timer_wrapper.h
2. Move ESolver clean logic to after_all_runners method
3. Replace clean_esolver calls with direct delete p_esolver
4. Remove #ifdef __MPI from delete p_esolver
5. Add Cblacs_exit(1) in after_all_runners for LCAO calculations

* Refactor spar_exx.h: add English comments and improve dependency structure

- Added detailed English comments to cal_HR_exx function
- Moved implementation to cpp file and added explicit instantiations
- Improved header file organization with sections
- Removed unnecessary LCAO_hamilt.hpp include
- Enhanced endif comments for better code readability

* Remove empty LCAO_hamilt.hpp file

The LCAO_hamilt.hpp file was empty after moving its implementation to spar_exx.cpp.
This commit removes the unused header file and updates all references to it.

* Fix circular dependency between exx_info.h and xc_functional.h

- Removed #include xc_functional.h from exx_info.h
- Removed #include exx_info.h from xc_functional.h
This breaks the circular dependency between these two header files, allowing them to compile independently.

* Fix dependencies in LCAO sparse format headers

- Removed unnecessary #include source_lcao/hamilt_lcao.h from spar_dh.h, spar_hsr.h, and spar_u.h
- Added direct dependencies to spar_dh.h: matrix.h, parallel_orbitals.h, two_center_bundle.h, ORB_read.h
- Adjusted include order in spar_hsr.h and spar_u.h
- Added necessary include to spar_hsr.cpp for HamiltLCAO access

* Add necessary includes to cpp files for compilation

- Added xc_functional.h include to esolver_ks_pw.cpp for XC_Functional class access
- Added xc_functional.h include to input_conv.cpp for XC_Functional class access
- Added parallel_comm.h include to op_exx_pw.cpp for KP_WORLD communication
- Added global_variable.h and exx_info.h includes to stress_pw.cpp for GlobalC namespace access
These changes fix compilation errors caused by the dependency refactoring.

* add #include <RI/global/Tensor.h> in spar_hsr.h

---------

Co-authored-by: abacus_fixer <mohanchen@pku.eud.cn>
Co-authored-by: Xiaoyang Zhang <tsfxwbbzxy@163.com>
…ptimized doc (deepmodeling#6858)

* Feature&Doc: support fix_axes and fix_ibrav for relax_new=false

* Fix: UT error of fixed_axes
…g#6912)

* Feature: support ELF with non-collinear spin (nspin = 4)

* fix: UT for write_elf
* Refactor: Encapsulate timer functionality in timer_wrapper.h

* Refactor timer code and clean_esolver function

1. Remove #ifdef __MPI from timer code, encapsulate in timer_wrapper.h
2. Move ESolver clean logic to after_all_runners method
3. Replace clean_esolver calls with direct delete p_esolver
4. Remove #ifdef __MPI from delete p_esolver
5. Add Cblacs_exit(1) in after_all_runners for LCAO calculations

* move exx_helper to module_pwdft

* rename pw files

* Refactor: Move and rename nonlocal_pw files to module_pwdft directory

* Refactor: Move and rename velocity_pw, veff_pw, and meta_pw files to module_pwdft directory

* Refactor: Move and rename all operator_pw files to module_pwdft directory and clean up

* Refactor: Rename stress_func_xxx files to stress_xxx by removing _func suffix

* Rename V*_in_pw files to more concise names and update references

This commit includes:
1. Renamed files in module_pwdft directory:
   - VL_in_pw.cpp/h → vl_pw.cpp/h
   - VNL_in_pw.cpp/h → vnl_pw.cpp/h
   - VNL_grad_pw.cpp → vnl_pw_grad.cpp
   - VSep_in_pw.cpp/h → vsep_pw.cpp/h

2. Updated CMakeLists.txt and Makefile.Objects to use new filenames

3. Updated include paths in 41 files across the codebase:
   - source_cell/test/klist_test.cpp and klist_test_para.cpp
   - source_esolver/esolver_fp.h, esolver_ks_pw.cpp, esolver_ks_pw.h
   - source_estate/module_pot/pot_sep.h, potential_new.h, setup_estate_pw.h
   - source_estate/test/elecstate_pw_test.cpp
   - source_io/test/for_testing_input_conv.h, for_testing_klist.h
   - source_lcao/LCAO_set.h
   - source_psi/psi_initializer.h and related files
   - source_pw/module_ofdft/of_stress_pw.h
   - source_pw/module_pwdft/* (multiple files)
   - source_pw/module_stodft/sto_stress_pw.h

4. Verified compilation success with make -j30

The renaming follows consistent naming conventions and makes filenames more concise.

* fix Makefile.Objects

* Fix CI/CD error: Update operator_pw paths to new op_pw locations

This commit fixes the CI/CD build error by updating references to the old operator_pw directory structure:

1. Updated source/source_hsolver/test/CMakeLists.txt:
   - Changed all 7 references from '../../source_pw/module_pwdft/operator_pw/operator_pw.cpp' to '../../source_pw/module_pwdft/op_pw.cpp'

2. Updated source/source_hsolver/test/diago_mock.h:
   - Changed '#include "source_pw/module_pwdft/operator_pw/operator_pw.h"' to '#include "source_pw/module_pwdft/op_pw.h"'

The operator_pw directory has been renamed and its files moved to the module_pwdft root directory with op_pw_ prefixes, so these path updates are necessary to ensure CI/CD builds succeed.

---------

Co-authored-by: abacus_fixer <mohanchen@pku.eud.cn>
…deepmodeling#6910)

* Fix: Hefei-NAMD interface of file syns_nao.csr and parameter cal_syns

* refactor: esolver to ctrl_scf_lcao
…add unit tests (deepmodeling#6917)

* add change

* add numerical fast

* add check for cal_bandgap Boundary

* Revert "add change"

This reverts commit 78444e9.

* add back mlalgo

* add back mlalgo format
* Refactor: Encapsulate timer functionality in timer_wrapper.h

* Refactor timer code and clean_esolver function

1. Remove #ifdef __MPI from timer code, encapsulate in timer_wrapper.h
2. Move ESolver clean logic to after_all_runners method
3. Replace clean_esolver calls with direct delete p_esolver
4. Remove #ifdef __MPI from delete p_esolver
5. Add Cblacs_exit(1) in after_all_runners for LCAO calculations

* Fix uninitialized variables in source_pw directory

This commit fixes 37 uninitialized variables (19 int, 18 double) in 19 files within the source_pw directory. All variables are initialized to 0 or 0.0 to prevent undefined behavior and improve code safety.

Affected files:
- source_stodft/sto_wf.cpp
- source_stodft/sto_tool.h
- source_stodft/sto_iter.h
- source_stodft/sto_iter.cpp
- source_stodft/sto_forces.cpp
- source_pwdft/stress_func_loc.cpp
- source_pwdft/soc.cpp
- source_pwdft/parallel_grid.cpp
- source_pwdft/onsite_projector.cpp
- source_pwdft/operator_pw/exx_pw_ace.cpp
- source_pwdft/onsite_proj_tools.h
- source_pwdft/nonlocal_maths.hpp
- source_pwdft/fs_nonlocal_tools.h
- source_pwdft/elecond.cpp
- source_pwdft/forces_cc.cpp
- source_pwdft/VNL_in_pw.cpp
- source_pwdft/forces_scc.cpp
- source_pwdft/forces.cpp
- source_pwdft/VNL_grad_pw.cpp

* Fix uninitialized variables in source_lcao directory

This commit fixes uninitialized variables (int and double) in 18 files within the source_lcao directory. All variables are initialized to 0 or 0.0 to prevent undefined behavior and improve code safety.

Affected files:
- spar_hsr.cpp, spar_dh.cpp
- wavefunc_in_pw.cpp
- module_rt/velocity_op.cpp, module_rt/norm_psi.cpp, module_rt/propagator.cpp, module_rt/td_folding.cpp, module_rt/boundary_fix.cpp
- module_ri/exx_abfs-io.cpp, module_ri/module_exx_symmetry/irreducible_sector_bvk.cpp, module_ri/module_exx_symmetry/symmetry_rotation.cpp
- module_lr/dm_trans/dmr_complex.cpp, module_lr/esolver_lrtd_lcao.cpp
- module_operator_lcao/dspin_force_stress.hpp, module_operator_lcao/dftu_force_stress.hpp
- module_hcontainer/func_folding.cpp, module_hcontainer/test/test_hcontainer.cpp
- module_gint/set_ddphi.cpp

* fix: initialize all uninitialized variables in source_relax and source_md

* fix: initialize all uninitialized variables in source_base

* fix: initialize all uninitialized variables in source_cell and source_estate

* fix: initialize all uninitialized variables in source_hsolver and source_io

* fix: initialize all uninitialized variables in source_lcao and source_psi

* fix some uninitialized variables

* move exx_helper to module_pwdft

* rename pw files

* Refactor: Move and rename nonlocal_pw files to module_pwdft directory

* Refactor: Move and rename velocity_pw, veff_pw, and meta_pw files to module_pwdft directory

* Refactor: Move and rename all operator_pw files to module_pwdft directory and clean up

* Refactor: Rename stress_func_xxx files to stress_xxx by removing _func suffix

* Rename V*_in_pw files to more concise names and update references

This commit includes:
1. Renamed files in module_pwdft directory:
   - VL_in_pw.cpp/h → vl_pw.cpp/h
   - VNL_in_pw.cpp/h → vnl_pw.cpp/h
   - VNL_grad_pw.cpp → vnl_pw_grad.cpp
   - VSep_in_pw.cpp/h → vsep_pw.cpp/h

2. Updated CMakeLists.txt and Makefile.Objects to use new filenames

3. Updated include paths in 41 files across the codebase:
   - source_cell/test/klist_test.cpp and klist_test_para.cpp
   - source_esolver/esolver_fp.h, esolver_ks_pw.cpp, esolver_ks_pw.h
   - source_estate/module_pot/pot_sep.h, potential_new.h, setup_estate_pw.h
   - source_estate/test/elecstate_pw_test.cpp
   - source_io/test/for_testing_input_conv.h, for_testing_klist.h
   - source_lcao/LCAO_set.h
   - source_psi/psi_initializer.h and related files
   - source_pw/module_ofdft/of_stress_pw.h
   - source_pw/module_pwdft/* (multiple files)
   - source_pw/module_stodft/sto_stress_pw.h

4. Verified compilation success with make -j30

The renaming follows consistent naming conventions and makes filenames more concise.

* fix Makefile.Objects

* Fix CI/CD error: Update operator_pw paths to new op_pw locations

This commit fixes the CI/CD build error by updating references to the old operator_pw directory structure:

1. Updated source/source_hsolver/test/CMakeLists.txt:
   - Changed all 7 references from '../../source_pw/module_pwdft/operator_pw/operator_pw.cpp' to '../../source_pw/module_pwdft/op_pw.cpp'

2. Updated source/source_hsolver/test/diago_mock.h:
   - Changed '#include "source_pw/module_pwdft/operator_pw/operator_pw.h"' to '#include "source_pw/module_pwdft/op_pw.h"'

The operator_pw directory has been renamed and its files moved to the module_pwdft root directory with op_pw_ prefixes, so these path updates are necessary to ensure CI/CD builds succeed.

* fix

* fix relax_sync.cpp

---------

Co-authored-by: abacus_fixer <mohanchen@pku.eud.cn>
…deepmodeling#6920)

* Refactor: rename to delete new from names of operators in source_lcao

* Fix: stress of nonlocal

* Fix: compiling error

---------

Co-authored-by: dyzheng <zhengdy@bjaisi.com>
* Fix docs for init_chg

* Fix warning for init_chg

* Align format for pseudo warning
Cstandardlib and others added 30 commits May 18, 2026 14:56
…ount) (deepmodeling#7357)

dspInitHandle uses MY_RANK % dsp_count but dspDestoryHandle used raw MY_RANK, causing heap corruption when MY_RANK >= dsp_count. Fixes issue deepmodeling#7269.
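
The mismatch described above can be illustrated with a small sketch in which both the open and the close path derive the slot from the same `rank % dsp_count` expression. `HandleTable` and `dsp_slot` are hypothetical names for illustration only, not the ABACUS DSP API.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: maps an MPI rank to a device slot in [0, dsp_count).
inline std::size_t dsp_slot(int my_rank, int dsp_count)
{
    if (my_rank < 0 || dsp_count <= 0)
    {
        throw std::invalid_argument("rank must be >= 0 and dsp_count must be > 0");
    }
    return static_cast<std::size_t>(my_rank % dsp_count);
}

// Hypothetical table with one entry per device.
class HandleTable
{
  public:
    explicit HandleTable(int dsp_count)
        : dsp_count_(dsp_count), open_(dsp_count > 0 ? static_cast<std::size_t>(dsp_count) : 0, false)
    {
    }
    // Init and destroy use the same mapping, so a rank >= dsp_count
    // can never index past the end of the table.
    void init(int my_rank) { open_.at(dsp_slot(my_rank, dsp_count_)) = true; }
    void destroy(int my_rank) { open_.at(dsp_slot(my_rank, dsp_count_)) = false; }
    bool is_open(std::size_t slot) const { return open_.at(slot); }

  private:
    int dsp_count_;
    std::vector<bool> open_;
};
```

With four devices, rank 5 opens and closes slot 1; indexing by the raw rank would have addressed a fifth, non-existent entry.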
…md > 1` evolution strategy (deepmodeling#7360)

* Remove unnecessary cout in TDDFT current file

* Fix RT-TDDFT EXX bug when using estep_per_md

* Modify cout format

* Fix a compiling issue with respect to std::vector

* Update test 08_EXX/14_NO_TDDFT_PBE0
…guard (deepmodeling#7361)

* refactor(device): remove dead code from DeviceContext, add dsp_count guard

Remove unused device_type subsystem from DeviceContext:

- Delete set_device_type(), get_device_type(), is_cpu(), is_gpu(), is_dsp() methods (all zero callers verified via exhaustive search)
- Delete is_initialized(), is_gpu_enabled() (zero callers)
- Delete device_type_ private field (only consumed by removed methods)
- Delete standalone get_device_type(const DeviceContext*) function (zero callers; all 48 call sites use the template version get_device_type(const Device*))
- Delete forward declaration in device_helpers.h

Add assert(PARAM.inp.dsp_count > 0) guard in driver.cpp to prevent
modulo-by-zero undefined behavior.

All other DeviceContext members retained (init(), get_device_id(),
get_device_count(), get_local_rank() — all have active callers).
Build verified with cmake --build (MPI+LCAO).

* fix(dsp): replace assert with runtime WARNING_QUIT for dsp_count

assert() is removed in release builds (NDEBUG), leaving modulo-by-zero
unprotected. Replace with WARNING_QUIT that works in all builds.

Also remove now-unused #include <cassert> from the #ifdef __DSP block.

Addresses PR review feedback on deepmodeling#7361.
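
A minimal sketch of why the guard has to be a runtime check rather than `assert()`: the check below survives `-DNDEBUG`, whereas an `assert` would compile to nothing. `require_positive` is a hypothetical stand-in that throws; the real `WARNING_QUIT` routine is not reproduced here and its behavior is not assumed.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a fatal-error routine: active in every build type.
inline void require_positive(int value, const std::string& name)
{
    if (value <= 0)
    {
        throw std::runtime_error(name + " must be > 0");
    }
}

// Hypothetical caller: the modulo is only reached with a validated divisor.
inline int device_slot(int rank, int dsp_count)
{
    // assert(dsp_count > 0) would vanish under NDEBUG and leave the
    // division by zero below unprotected; this call does not vanish.
    require_positive(dsp_count, "dsp_count");
    return rank % dsp_count;
}
```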
…odeling#7364)

* Remove useless headers

* add type.h include

* Remove headers in test files
- set_phi_dphi_kernel: add WantPhi non-type template parameter and
  dispatch from the launch site. The dphi-only callers (gint_tau)
  pass phi=nullptr; with WantPhi==false the compiler drops the
  phi[] stores and the per-iw `phi != nullptr` branch entirely.
- phi_dot_dphi_kernel / phi_dot_dphi_r_kernel: replace the shared-
  memory tree reduce with a single-warp warpReduceSum and drop the
  dynamic shared-memory allocation at the launch sites. Launch
  configuration is pinned at blockDim.x == 32; a comment guards the
  invariant.
- Plain `if` (not `if constexpr`) on WantPhi keeps the code
  C++11-compliant — ABACUS targets C++11 and nvcc otherwise emits
  warning deepmodeling#2912-D. WantPhi is still a non-type template parameter,
  so the compiler folds the constant and eliminates the dead branch.
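
The `WantPhi` dispatch described above can be sketched on the CPU in plain C++11. This is an analogue for illustration, not the CUDA kernel: `fill_phi_dphi` and its toy formulas (`x*x` and `2*x`) are hypothetical, and the warp-level reduction is not covered.

```cpp
#include <cassert>
#include <cstddef>

// WantPhi is a compile-time constant, so a plain `if` is enough:
// for WantPhi == false the optimizer can drop the phi store entirely.
template <bool WantPhi>
void fill_phi_dphi(const double* x, std::size_t n, double* phi, double* dphi)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        if (WantPhi)
        {
            phi[i] = x[i] * x[i]; // never executed when WantPhi is false
        }
        dphi[i] = 2.0 * x[i];
    }
}

// The runtime null check happens once, at the call site, instead of per element.
inline void fill_phi_dphi_dispatch(const double* x, std::size_t n, double* phi, double* dphi)
{
    assert(dphi != nullptr);
    if (phi != nullptr)
    {
        fill_phi_dphi<true>(x, n, phi, dphi);
    }
    else
    {
        fill_phi_dphi<false>(x, n, nullptr, dphi);
    }
}
```

Callers that only need derivatives pass `phi = nullptr` and get the instantiation without the phi stores.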

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ecision loss (deepmodeling#7368)

Across CPU and GPU gint paths, accumulator buffers (hr_gint, phi_dm, rho,
and the vbatched GEMM C output) are now always allocated as double, even
when the input phi/dm/vr_eff are fp32. Multiplies stay in fp32 (cheap),
but per-block and global reductions are widened to fp64 so that summing
many atom-pair contributions into the same element does not drift.
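
The idea of keeping multiplies in fp32 while widening the accumulator can be sketched as follows. The function names are hypothetical and the code is a plain dot product, not the gint buffers themselves.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// All-fp32 path: products and the running sum are both single precision.
inline float dot_fp32(const std::vector<float>& a, const std::vector<float>& b)
{
    assert(a.size() == b.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        acc += a[i] * b[i];
    }
    return acc;
}

// Mixed path: the multiply stays in fp32, the reduction is done in fp64.
inline double dot_fp32_mul_fp64_acc(const std::vector<float>& a, const std::vector<float>& b)
{
    assert(a.size() == b.size());
    double acc = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        const float prod = a[i] * b[i];  // cheap fp32 multiply
        acc += static_cast<double>(prod); // widened accumulation
    }
    return acc;
}
```

A case that shows the difference: one term equal to 2^24 followed by many terms equal to 1. On typical IEEE-754 hardware the fp32 accumulator cannot absorb the unit increments, while the fp64 accumulator keeps all of them.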

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ling#7365)

* Remove parameter.h

* Continue remove parameter.h

* Remove parameter.h dependency in pw_basis

* Remove dependency in pw_basis_k
…epmodeling#7369)

1. Add check_value callback to yukawa_potential to error out when
   both yukawa_potential and uramping are enabled simultaneously
2. Skip uramping_update() when Yukawa is enabled (U calculated directly
   from charge density every iteration)
3. Return true from u_converged() when Yukawa is enabled (U is
   self-consistently calculated, no ramping convergence needed)
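
The three rules above can be sketched as two small functions. `DftuSketch` is a hypothetical struct, and modelling `uramping` as a simple on/off flag is an assumption made to keep the example short.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical options struct; field names mirror the input keywords.
struct DftuSketch
{
    bool yukawa_potential;
    bool uramping_enabled; // assumption: ramping reduced to an on/off flag
};

// Rule 1: reject the combination up front.
inline void check_value(const DftuSketch& in)
{
    if (in.yukawa_potential && in.uramping_enabled)
    {
        throw std::invalid_argument("yukawa_potential and uramping cannot both be enabled");
    }
}

// Rule 3: with Yukawa, U is recomputed every iteration, so there is
// no ramping convergence to wait for.
inline bool u_converged(const DftuSketch& in, bool ramping_finished)
{
    return in.yukawa_potential || ramping_finished;
}
```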
…ling#7373)

* Remove parameter.h

* Continue remove parameter.h

* Remove parameter.h dependency in pw_basis

* Remove dependency in pw_basis_k

* refactor(source_basis): remove last parameter.h dependencies

Decouple module_ao and module_nao from source_io/parameter.h:

- ORB_atomic_lm / ORB_nonlocal_lm: replace PARAM.globalv.global_out_dir
  with ModuleBase::get_quit_out_dir() (new getter mirroring the existing
  set_quit_out_dir injection point).
- two_center_bundle: thread orbital_dir as a build_orb parameter; replace
  the two deepks_setorb guards with ndesc>0 / alpha_ non-null checks that
  are equivalent under the build_alpha invariant.

source_basis is now free of parameter.h.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(tool_quit): rename get_quit_out_dir to get_global_out_dir

Align the getter name with the original PARAM.globalv.global_out_dir it replaces.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ng#7382)

* feat(tests): add tests/17_DS_DFTU test suite for DFT+U with deep potential spin constraints

* feat(tests): add tests/17_DS_DFTU with CI-disabled configs and READMEs

Add the 17_DS_DFTU test suite for DeltaSpin and DFT+U functionality:
- 47 test cases covering LCAO/PW basis, collinear/noncollinear spin,
  DFT+U, DeltaSpin, and their combinations
- Comment out tests in tests/CMakeLists.txt and tests/17_DS_DFTU/CMakeLists.txt
  to prevent CI failure until DeltaSpin code is merged into develop
- Add single-line README to each test directory (printed during Autotest.sh)
- Rewrite CASES_CPU.txt with clear English comments explaining disabled tests
…7389)

* Remove debug output in renormalize_psi

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Fix: CD potential now applied to correct spin channel instead of always spin 0

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Refactor: Replace raw new/delete with std::vector in cal_vw_potential_phi for automatic memory management

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Refactor: Replace raw new/delete with std::vector in cal_CD_potential for automatic memory management

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Refactor: Replace abs(x)*abs(x) with std::norm for clarity and consistency with RK2

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Refactor: Remove unused <iostream> include

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Refactor: Remove dead nspin <= 0 checks that can never trigger

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
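
Two of the refactors in this commit, replacing raw `new`/`delete` with `std::vector` and replacing `abs(x)*abs(x)` with `std::norm`, can be shown together in a short sketch. `density_sum` is a hypothetical function, not `cal_vw_potential_phi` or `cal_CD_potential`.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the sum over i of |psi_i|^2.
inline double density_sum(const std::vector<std::complex<double>>& psi)
{
    // std::vector replaces `new double[n]` plus a matching `delete[]`;
    // the buffer is released automatically on every exit path.
    std::vector<double> rho(psi.size(), 0.0);
    for (std::size_t i = 0; i < psi.size(); ++i)
    {
        // std::norm returns the squared magnitude directly.
        rho[i] = std::norm(psi[i]);
    }
    double total = 0.0;
    for (std::size_t i = 0; i < rho.size(); ++i)
    {
        total += rho[i];
    }
    return total;
}
```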
… diagonalization (deepmodeling#7388)

* fix(sdft): add CT (Chebyshev Trace) iter_header for pure SDFT without diagonalization

Pure SDFT (nbands=0) does not perform KS diagonalization, yet the SCF
iteration table borrowed the ks_solver label (CG/DA/etc.). Add a "CT"
entry to iter_header_dict and use it when esolver_type=sdft with nbands=0.
Mixed SDFT (nbands>0) keeps the actual ks_solver label since it still
diagonalizes KS orbitals.
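
The label selection described above reduces to one condition. The sketch below uses a hypothetical free function, `iter_label`, rather than the actual `iter_header_dict` lookup.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: chooses the label shown in the ITER column.
inline std::string iter_label(const std::string& esolver_type, int nbands, const std::string& ks_solver_label)
{
    // Pure SDFT performs no KS diagonalization, so it gets its own label.
    if (esolver_type == "sdft" && nbands == 0)
    {
        return "CT";
    }
    // Mixed SDFT and everything else keep the solver's own label.
    return ks_solver_label;
}
```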

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(sdft): add unit tests for SDFT iter_header CT label

Verify pure SDFT (nbands=0) outputs "CT" in ITER column, and mixed
SDFT (nbands>0) outputs the actual ks_solver label (e.g. "DA").

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* update 18_md examples

* update out_chg 2

* update out_pot function

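* feat(module_io): optimize initial charge density/potential output; support out_freq_ion control and dynamic file names

- Add the gen_ini_filename() helper to generate initial charge density/potential file names in one place
- When out_freq_ion=0, write a single file with a fixed name (without g#)
- When out_freq_ion>0, write a separate file for each geometry step (with g#)
- Update the documentation to explain the difference between the two modes

Modified files:
- docs/advanced/output_files/output-specification.md
- source/source_io/module_chgpot/write_init.cpp
- source/source_io/module_chgpot/write_init.h
- source/source_io/module_parameter/read_input_item_output.cpp

* fix(module_io): correct the initial charge density/potential output logic when out_freq_ion=0

- When out_freq_ion=0, write at every geometry step (overwriting the same file)
- When out_freq_ion>0, write only when istep is a multiple of out_freq_ion
- Update all related documentation and comments

Modified files:
- docs/advanced/output_files/output-specification.md
- source/source_io/module_chgpot/write_init.cpp
- source/source_io/module_chgpot/write_init.h
- source/source_io/module_parameter/read_input_item_output.cpp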
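
The corrected `out_freq_ion` rules in the second commit above can be sketched as a small decision function. This is not `gen_ini_filename()`: the struct `InitOutputPlan`, the function `plan_init_output`, the `_g<N>` suffix format, and the use of `istep` itself as the number in the file name are all assumptions for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical result type: whether to write, and under which name.
struct InitOutputPlan
{
    bool write;
    std::string filename;
};

inline InitOutputPlan plan_init_output(const std::string& stem, int istep, int out_freq_ion)
{
    if (out_freq_ion == 0)
    {
        // Every geometry step writes, always to the same fixed name.
        return {true, stem + ".cube"};
    }
    if (istep % out_freq_ion != 0)
    {
        // Not a multiple of out_freq_ion: skip this step.
        return {false, ""};
    }
    // Assumed naming scheme: one file per written geometry step.
    return {true, stem + "_g" + std::to_string(istep) + ".cube"};
}
```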

* fix a bug about out_pot

* fix bugs

* update

* update ELF and add openmp parallel

* update elf

* update elf

* update example reference data

* enable elf for rt-tddft, but results are wrong

* fix elf test

* fix elf test in 03_NAO_multik

* fix output of write_elf

* fix bug

* update potential file, fix bug

* fix elf test in ofdft

* fix: Move write_pot_init to ElecState::init_scf for correct timing

The write_pot_init was being called in ESolver_FP::before_scf before the
effective potential was computed. This caused pot_ini.cube to contain:
- All zeros for calculation=scf / first ionic step (istep=0)
- Converged potential from previous ionic step for relax/md with istep>0

The fix moves write_pot_init to ElecState::init_scf, which is called after
pot->init_pot(charge) computes the effective potential from the initial
charge density. This ensures pot_ini.cube correctly contains the effective
potential corresponding to the initial charge density.

Changes:
- Modified ElecState::init_scf signature to accept istep, out_dir, inp parameters
- Added write_pot_init call after pot->init_pot() in init_scf
- Updated pw::setup_pot to pass through the new parameters
- Updated all callers (LCAO and PW) to provide the new parameters
- Removed the premature write_pot_init call from ESolver_FP::before_scf

* Remove unused parameters from ElecState::init_scf

- Removed unused 'symm' and 'wfcpw' parameters from init_scf function
- Updated all call sites to match the new signature
- Simplified function interface by removing parameters not used in implementation

* Fix missing io_basic library link in elecstate tests

- Added io_basic library dependency to MODULE_ESTATE_elecstate_base test
- Added io_basic library dependency to MODULE_ESTATE_elecstate_pw test
- Fixes undefined reference to ModuleIO::write_pot_init

* update init_scf

* fix

* fix bug

* remove dependence of parameter for write_cube.cpp

* fix bugs

* fix bug

* add a new file init_scf

* update estate tests

* delete useless inclusion

---------

Co-authored-by: abacus_fixer <mohanchen@pku.eud.cn>
- gint_info.cpp: replace serial init_atoms_ with #pragma omp parallel for
  using iat2it/iat2ia for atom type lookup, thread-private staging for
  conflict-free BigGrid add_atom merging

- gint_rho.cpp: pre-allocate phi/phi_dm thread buffers to max_phi_len
  to eliminate per-BigGrid resize() malloc overhead

- gint_vl.cpp: pre-allocate phi/phi_vldr3 thread buffers to max_phi_len

- gint_fvl.cpp: pre-allocate 6 thread buffers (phi, phi_vldr3,
  phi_vldr3_dm, dphi_x/y/z) to max_phi_len, eliminate per-BigGrid
  6x resize() overhead
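
The two techniques in this commit, thread-private staging followed by a merge, and per-thread buffers sized once to the maximum length, can be sketched as below. This is a simplified model, not the `gint_info.cpp` or `gint_rho.cpp` code: `BigGridAtoms`, `assign_atoms`, and `grid_norms` are hypothetical names, and the merge uses an OpenMP critical section, which is one of several possible conflict-free strategies. The code also compiles and gives the same result without OpenMP, since the pragmas are then ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in: for each big grid, the indices of the atoms assigned to it.
using BigGridAtoms = std::vector<std::vector<int>>;

inline BigGridAtoms assign_atoms(const std::vector<int>& atom_to_grid, std::size_t n_grids)
{
    BigGridAtoms grids(n_grids);
#pragma omp parallel
    {
        // Thread-private staging: no shared container is touched in the loop.
        std::vector<std::pair<int, int>> staged;
#pragma omp for nowait
        for (int iat = 0; iat < static_cast<int>(atom_to_grid.size()); ++iat)
        {
            staged.push_back(std::make_pair(atom_to_grid[iat], iat));
        }
        // Merge step: one thread at a time appends its staged pairs.
#pragma omp critical
        {
            for (const auto& ga : staged)
            {
                assert(ga.first >= 0 && static_cast<std::size_t>(ga.first) < n_grids);
                grids[ga.first].push_back(ga.second);
            }
        }
    }
    // Thread arrival order is arbitrary, so sort for a deterministic result.
    for (auto& atoms : grids)
    {
        std::sort(atoms.begin(), atoms.end());
    }
    return grids;
}

inline std::vector<double> grid_norms(const std::vector<std::vector<double>>& values)
{
    std::size_t max_len = 0;
    for (const auto& v : values)
    {
        max_len = std::max(max_len, v.size());
    }
    std::vector<double> out(values.size(), 0.0);
#pragma omp parallel
    {
        // Allocated once per thread at the maximum length; each grid then
        // uses only its first `len` entries instead of calling resize().
        std::vector<double> phi(max_len);
#pragma omp for
        for (int ig = 0; ig < static_cast<int>(values.size()); ++ig)
        {
            const std::size_t len = values[ig].size();
            double sum = 0.0;
            for (std::size_t k = 0; k < len; ++k)
            {
                phi[k] = values[ig][k] * values[ig][k];
                sum += phi[k];
            }
            out[ig] = sum;
        }
    }
    return out;
}
```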
* Fix: use correct PSI for GPU in cal_tau for ELF

* Refactor: split long cast into readable lines in after_scf
…deepmodeling#7418)

* Remove obsolete cross-device copy constructor in HamiltPW

* Delete corresponding .h code
…7421)

* Fix Molden GTO normalization and coordinate conversion

Title: Improve ABACUS Molden output for wavefunction analysis

Summary:
This PR fixes several Molden conversion issues in tools/molden/molden.py while keeping the default workflow unchanged as much as possible.

Changes:
- Correct the primitive Gaussian coefficient convention when writing Molden GTO data. The NAO-to-GTO fit uses unnormalized radial primitives, while Molden readers usually normalize primitive Gaussian functions internally.
- Fix Cartesian_angstrom coordinate conversion. Coordinates in Angstrom are now converted to Bohr for the Molden [Atoms] AU section by dividing by 0.529177210903.
- Add optional multi-start NAO-to-GTO fitting. A single -r value keeps the old single-start behavior; comma-separated -r values enable multi-start fitting and keep the fit with the lowest nonlinear error.
- Add optional Molden [Nval] output via --write-nval. The values are read from UPF z_valence. This option is disabled by default.

Notes:
- The changes are limited to the Molden converter.

Validation:
- Ran the existing molden.py unit tests successfully.
- Checked that default output does not contain [Nval].
- Checked that --write-nval writes C/O/H valence charges for the PhenolDimer test case.
- Checked that Cartesian_angstrom coordinates are written at the correct Bohr scale.

* Show default values in molden.py CLI help
…g#7383)

* update ML KEDF output

* Refactor OFDFT ML KEDF logging output to use ofs_running stream

Summary of changes:
1. Modified ML_Base::set_device() to accept std::ostream& ofs_running parameter instead of using std::cout directly
2. Updated KEDF_ML::set_para() to pass ofs_running through the call chain
3. Modified KEDF_ML::init_data() to accept ofs_running parameter
4. Updated NN_OFImpl constructor to accept ofs_running parameter for logging nnode/nlayer
5. Modified Cal_MLKEDF_Descriptors::set_para() to accept ofs_running parameter for logging nkernel
6. Updated ML_EXX class methods (set_para, init_data, localTest) to use ofs_running
7. Updated all call sites to pass GlobalV::ofs_running
8. Changed 'NN' to 'Neural Network' in device initialization messages
9. Fixed 'WARNING: ML >= TF' message in KEDF_Manager::get_energy() to use ofs_running
10. Reformatted KEDF_ML::set_para() and cal_tool->set_para() calls with one parameter per line

All ML KEDF related output messages now write to the running log file instead of stdout.
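
The pattern described above, passing the log stream in rather than writing to std::cout, might look like the sketch below. The function name and the message text are hypothetical; only the idea of a `std::ostream& ofs_running` parameter comes from the commit message.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical logging helper: the caller decides where the message goes
// (a log file stream in production, a string stream in a test).
void report_device_sketch(const std::string& device, std::ostream& ofs_running)
{
    ofs_running << "Neural Network running on " << device << '\n';
}
```

Because the stream is a parameter, a unit test can capture the output in a `std::ostringstream` instead of scraping stdout.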

* fix

* fix

* update the output formats

* update KEDF

* output format update

* update

* fix a potential bug when the net.pt model cannot be found

* update kedf and exx

* update

---------

Co-authored-by: abacus_fixer <mohanchen@pku.eud.cn>
… for Native Windows system) (deepmodeling#7423)

* Native Windows port (Phase 1 scaffolding): serial PW build on MinGW-w64

Lay the groundwork for a native Windows serial plane-wave build
(no MPI, no LCAO, no ELPA/PEXSI/hybrid). Targets MinGW-w64 GCC, which
ships the POSIX headers ABACUS uses and accepts its GCC attributes, so
the source needs only minimal, Linux-safe portability shims.

- source_base/fs_compat.h (new): portable ModuleBase::make_directory()
  wrapping _mkdir (Windows) / mkdir(path,0755) (POSIX). The Windows CRT
  mkdir takes no permission-mode argument.
- global_file.cpp, global_function.cpp: route the 7 mkdir(path,0755)
  call sites through the helper; drop unistd.h/sys/stat.h includes.
- CMakeLists.txt:
  * gate find_package(ScaLAPACK REQUIRED) on ENABLE_MPI so the serial
    build does not require a distributed-memory library;
  * define _USE_MATH_DEFINES/NOMINMAX/_CRT_SECURE_NO_WARNINGS on WIN32;
  * skip -O3 -g default flags and the -lm link for MSVC;
  * skip the post-install abacus symlink on Windows.
- tools/windows/build-native-serial.ps1 (new): MinGW configure/build helper.
- docs/advanced/install_windows_native.md (new): native-build documentation.

All changes are guarded or platform-neutral; the Linux build is unaffected.
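
A helper of the kind described for fs_compat.h could be sketched as below. The signature and return convention are assumptions (the real ModuleBase::make_directory() may differ); the sketch only shows the `_mkdir` / `mkdir(path, 0755)` split named in the list above.

```cpp
#include <cassert>
#include <cerrno>
#include <string>
#if defined(_WIN32)
#include <direct.h> // _mkdir
#else
#include <sys/stat.h> // mkdir
#include <sys/types.h>
#endif

// Hypothetical portable wrapper. Returns 0 on success and -1 on failure
// (with errno set), mirroring the underlying calls.
int make_directory_sketch(const std::string& path)
{
#if defined(_WIN32)
    // The Windows CRT variant takes no permission-mode argument.
    return _mkdir(path.c_str());
#else
    return mkdir(path.c_str(), 0755);
#endif
}
```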

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Native Windows port (Phase 1): serial PW build compiles, links, runs

With these fixes the native Windows serial plane-wave build
(abacus_pw_ser.exe, MinGW-w64 GCC + OpenBLAS + FFTW) compiles, links,
and runs examples/02_scf/01_pw_Si2 to SCF convergence with a
deterministic total energy (-215.5057 eV, bit-identical across runs).

Build-system fixes:
- cmake/FindBlas.cmake, cmake/FindLapack.cmake: the wrappers delegate to
  CMake's builtin FindBLAS/FindLAPACK, but on the case-insensitive Windows
  filesystem the wrapper matched itself and recursed forever. Drop our
  module dir from CMAKE_MODULE_PATH around the builtin call (no-op on Linux).

Source portability fixes (all guarded or platform-neutral; Linux unaffected):
- module_fft/fft_base.h, fft_cpu.h: remove __attribute__((weak)) from the FFT
  virtuals. The weak-without-definition pattern relied on the ELF linker
  resolving unbound weak symbols to null; on Windows/PE (MinGW) it produced
  null vtable slots, so the first FFT dispatch (FFT_Bundle::setupFFT) called
  address 0 and segfaulted. Base virtuals get trivial default bodies; the
  float overrides become concrete via ENABLE_FLOAT_FFTW=ON.
- module_parameter/input_conv.h: port the POSIX <regex.h> expression parser to
  C++ <regex> (MinGW has no <regex.h>).
- module_container/base/core/cpu_allocator.cpp: replace posix_memalign with
  _aligned_malloc/_aligned_free on Windows, applied consistently to both
  allocate overloads and free.
- module_restart/restart.cpp: map POSIX S_IRUSR/S_IWUSR to _S_IREAD/_S_IWRITE
  and include <io.h> for low-level open/read/write/close on Windows.
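
As an illustration of the cpu_allocator.cpp item, an aligned allocate/free pair that switches between posix_memalign and _aligned_malloc might look like this. The function names are hypothetical and this is not the allocator code itself; the point is that memory from `_aligned_malloc` must be released with `_aligned_free`, which is why allocate and free have to be changed together.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h> // posix_memalign, free
#if defined(_WIN32)
#include <malloc.h> // _aligned_malloc, _aligned_free
#endif

// Hypothetical helper: returns nullptr on failure.
// alignment must be a power of two and a multiple of sizeof(void*).
void* aligned_alloc_sketch(std::size_t alignment, std::size_t bytes)
{
#if defined(_WIN32)
    return _aligned_malloc(bytes, alignment);
#else
    void* ptr = nullptr;
    if (posix_memalign(&ptr, alignment, bytes) != 0)
    {
        return nullptr;
    }
    return ptr;
#endif
}

// Must match the allocation function used above.
void aligned_free_sketch(void* ptr)
{
#if defined(_WIN32)
    _aligned_free(ptr);
#else
    free(ptr);
#endif
}

// Small check used to confirm the alignment of a returned pointer.
bool is_aligned_sketch(const void* ptr, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}
```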

Tooling/docs:
- tools/windows/build-native-serial.ps1: use the verified flags
  (BLA_VENDOR=OpenBLAS, ENABLE_FLOAT_FFTW=ON, COMMIT_INFO=OFF, the GCC-16
  force-include workaround).
- docs/advanced/install_windows_native.md: document the gcc-fortran package,
  the verified build/run, and every source change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Fix all-zero seeded random wavefunctions in serial (non-MPI) PW build

psi_initializer::random_t, in the pw_seed>0 branch, generates per-stick
random amplitude/phase into stickrr/stickarg and then distributes them
into the gathered tmprr/tmparg arrays via stick_to_pool() -- but that call
is guarded by #ifdef __MPI. In a serial build tmprr/tmparg therefore stay
zero-initialized, so every seeded random wavefunction is all-zero. This
later trips Gram-Schmidt orthonormalization ("psi_norm <= 0.0") and aborts
the run. The path is never hit in CI because the integration tests run
under MPI.

Add the serial counterpart: copy each stick directly into tmprr/tmparg
using the same mapping as stick_to_pool()'s rank-0 branch
(out[ixy2is_[ir]*nz + iz] = stick[iz]). ixy2is_ is populated for both
serial and MPI builds via pw_wfc_->getfftixy2is().
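
The mapping quoted above can be sketched as a small free function. The names and the convention that `ixy2is[ir]` is negative when no stick passes through xy point `ir` are assumptions made for this illustration, not a copy of psi_initializer.

```cpp
#include <cassert>
#include <vector>

// Hypothetical serial counterpart of the stick distribution:
// copy one z-stick into the gathered array at out[is * nz + iz].
void copy_stick_sketch(const std::vector<double>& stick,
                       int ir,
                       const std::vector<int>& ixy2is,
                       int nz,
                       std::vector<double>& out)
{
    const int is = ixy2is[ir];
    if (is < 0)
    {
        return; // assumed convention: no stick at this xy point
    }
    for (int iz = 0; iz < nz; ++iz)
    {
        out[is * nz + iz] = stick[iz];
    }
}
```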

Verified on a representative set of 15 tests/01_PW cases run with the
native Windows serial PW build (abacus_pw_ser.exe): all converged total
energies now match the official result.ref references to <= ~7e-7 eV.
Before this fix the 6 cases using pw_seed with random wavefunctions
aborted; the other 9 already matched to ~1e-9 eV.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Windows: use the existing toolchain + serial test harness, drop bespoke scripts

Per review feedback, the native-Windows support should plug into ABACUS's
existing build/test infrastructure (like any other backend/variant) rather than
carry its own scripts.

Build: add a Windows toolchain variant, mirroring toolchain_gnu.sh /
build_abacus_gnu.sh:
- toolchain/toolchain_windows.sh   -- installs the MinGW-w64 prerequisites via
  pacman on MSYS2 (gcc, gfortran, openblas, fftw, cmake, ninja) plus bc for the
  test harness; records the prefix in install/setup like the Linux variants.
- toolchain/build_abacus_windows.sh -- configures + builds the serial PW binary
  (ENABLE_MPI/LCAO=OFF, OpenBLAS+FFTW) and writes abacus_env.sh.
Removed the one-off tools/windows/build-native-serial.ps1.

Test: reuse tests/integrate/Autotest.sh instead of a separate script. Added a
serial mode: with -n 0 the harness runs the binary directly (no mpirun), so a
serial build (any OS) reuses the standard catch_properties.sh / result.ref
comparison. Added tests/integrate/CASES_SERIAL_PW.txt listing serial-PW cases.

Validation (build_abacus_windows.sh, then Autotest.sh -n 0 -f CASES_SERIAL_PW.txt):
all 15 01_PW cases run; total energies/forces/stresses match the Linux
result.ref to ~1e-7 relative. The few WARNINGs (016/017 etot ~1e-7 eV;
003/009/019 stress/force) are absolute-threshold exceedances from cross-platform
/ cross-BLAS floating point, classified WARNING (not ERROR) by the harness.

docs/advanced/install_windows_native.md updated to describe the toolchain +
serial-Autotest flow.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Windows test: run the whole 01_PW suite, drop the curated case list

Per review: the serial PW build should be checked against the existing PW test
suite (tests/01_PW) via the standard harness, not a hand-picked subset.

- Remove tests/integrate/CASES_SERIAL_PW.txt. The canonical list already exists
  at tests/01_PW/CASES_CPU.txt and is used by the standard ctest registration
  (tests/01_PW/CMakeLists.txt runs Autotest.sh from that directory). Serial runs
  just add -n 0:
      cd tests/01_PW
      bash ../integrate/Autotest.sh -a <abacus_pw_ser.exe> -n 0
- .gitattributes: force LF for *.sh and CASES_*.txt so the toolchain scripts,
  Autotest.sh and the bash-parsed case lists work on a fresh Windows checkout
  (core.autocrlf would otherwise rewrite them to CRLF).
- docs/advanced/install_windows_native.md: document the whole-01_PW serial run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Windows toolchain: provide a generic `abacus` command after build

Mirror the Linux toolchain UX: `source abacus_env.sh` then run `abacus`.

build_abacus_windows.sh now copies the configured binary (abacus_pw_ser.exe)
to abacus.exe in the build dir. Native Windows symlinks need elevation (so the
CMake `abacus` symlink step is skipped on WIN32); the .exe copy lets a bare
`abacus` resolve in the MSYS2 shell and in cmd/PowerShell. abacus_env.sh already
puts that directory (and the MinGW runtime DLLs via the toolchain setup) on PATH.

Verified: source abacus_env.sh; abacus --version  -> runs from any directory.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Fix Binstream binary file I/O on Windows (force binary fopen mode)

Binstream::Binstream/open pass the caller's fopen mode ("r"/"w"/"a")
straight through. On Windows that opens in *text* mode, which translates
CRLF and treats 0x1A as EOF, corrupting the binary wavefunction/charge
files Binstream is built to read -> "Error in Binstream: Some data didn't
be read". On POSIX "r" == "rb", so the bug is Windows-only.

Binstream is always a binary stream, so append "b" to the mode when the
caller omitted it. Harmless no-op on Linux.
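
The mode fix described above amounts to something like the following sketch. The function name is hypothetical and this is not the Binstream source; it only shows "append b unless the caller already asked for binary".

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: make sure an fopen mode string requests binary I/O.
std::string force_binary_mode_sketch(const std::string& mode)
{
    if (mode.find('b') != std::string::npos)
    {
        return mode; // already binary, leave untouched
    }
    return mode + "b";
}
```

On POSIX the extra "b" has no effect; on Windows it disables CRLF translation and the 0x1A end-of-file handling.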

Fixes these serial 01_PW cases on the native Windows build (verified):
- 056_PW_IW          (init_wfc=file: read wfc from binary file)
- 057_PW_SO_IW       (SOC + init_wfc=file)
- 075_PW_CHG_BINARY  (binary charge I/O)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Fix uninitialized structure factor in serial bspline_sf (wrong energy)

Structure_Factor::bspline_sf (nbspline>0, B-spline structure factor)
scatters each real-space plane into tmpr via Parallel_Grid::zpiece_to_all,
which is guarded by #ifdef __MPI. In a serial build tmpr is never filled
(it is new double[nrxx], uninitialized), so real2recip(tmpr, strucFac)
produces a garbage structure factor -> grossly wrong total energy, force
and stress. CI never hits this path (integration tests run under MPI).

Add the serial branch: fill tmpr directly using the SAME real-space layout
as zpiece_to_all's serial path, rho[ir*nczp + znow] (xy outer, z innermost;
nczp==nz, znow==iz when serial).
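
The layout `rho[ir*nczp + znow]` with `nczp==nz` and `znow==iz` can be illustrated with a minimal scatter routine. The function and argument names are hypothetical; the sketch only demonstrates the "xy outer, z innermost" indexing stated above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical serial scatter: write one xy plane (fixed iz) into a
// real-space array laid out with z as the fastest index.
void scatter_plane_sketch(const std::vector<double>& plane,
                          int iz,
                          int nz,
                          std::vector<double>& rho)
{
    for (std::size_t ir = 0; ir < plane.size(); ++ir)
    {
        rho[ir * nz + iz] = plane[ir];
    }
}
```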

Verified on tests/01_PW/032_PW_15_CF_CS_bspline (native Windows serial):
energy and stress now match the reference to ~1e-8 (was ~1480 eV / 30000
kbar off); residual force ~5e-3 is B-spline interpolation + cross-platform
float noise.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(windows): note pw_seed cross-platform non-reproducibility (078 is not a bug)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* toolchain(windows): clarify to run abacus_env.sh inside a mingw bash

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(lcao): guard null deref of DeePKS overlap_orb_alpha when DeePKS is off

before_scf() unconditionally dereferenced *(two_center_bundle_.overlap_orb_alpha)
to pass it to deepks.build_overlap(). overlap_orb_alpha is only built when DeePKS
is enabled (descriptor orbitals); with DeePKS off it is a null unique_ptr, so
forming the reference is undefined behaviour (caught as an abort in a debug
libstdc++ build; benign in release as the DeePKS stub ignores it). Guard the call
on the integrator being present.
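
A guard of this kind, in isolation, might look like the sketch below. All type and function names are hypothetical stand-ins; the only point is that a possibly null `std::unique_ptr` is tested before `*ptr` is formed.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the optional two-center table.
struct OverlapTableSketch
{
    int size = 0;
};

// Hypothetical consumer that needs a valid reference.
int build_overlap_sketch(const OverlapTableSketch& table)
{
    return table.size;
}

// Returns true if the overlap was built, false if the table is absent.
bool build_if_present_sketch(const std::unique_ptr<OverlapTableSketch>& table, int& result)
{
    if (!table)
    {
        return false; // never form *table from a null pointer
    }
    result = build_overlap_sketch(*table);
    return true;
}
```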

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Windows toolchain: add LCAO + MPI (MS-MPI + ScaLAPACK) build

Extend the native-Windows toolchain to the full supported configuration,
mirroring build_abacus_gnu.sh:

- toolchain_windows.sh: also pacman-install cereal (LCAO), msmpi (MPI), and
  scalapack (distributed LCAO eigensolver). Documents that the MS-MPI runtime
  is a separate system-wide Microsoft redistributable.
- build_abacus_windows.sh: build MPI + LCAO by default (abacus_basic_para.exe);
  ENABLE_MPI / ENABLE_LCAO env toggles select serial / PW-only. Point FindMPI at
  the MinGW MS-MPI import lib; ScaLAPACK is found automatically when ENABLE_MPI.
  abacus_env.sh now also exports OPENBLAS_NUM_THREADS=1 (required so OpenBLAS's
  multithread buffer allocator does not fail under multiple MPI ranks).
- docs/advanced/install_windows_native.md: document the LCAO+MPI build, parallel
  testing (mpiexec / mpirun shim), and the known serial gamma-only LCAO bug
  (use the MPI build, which is correct to ~1e-11 even on a single rank).

Validated against 01_PW / 02_NAO_Gamma / 03_NAO_multik via the standard harness:
under MPI all three pass within the cross-platform error range; residual
differences are float noise at strict absolute thresholds, gauge-dependent
outputs, or excluded features (SCAN/meta-GGA needs LibXC, DFT+U needs MPI).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* toolchain(windows): make the unmodified test harness drive MS-MPI

Running tests/integrate/Autotest.sh directly failed with "no mpirun found":
MS-MPI ships only mpiexec, and the harness invokes `mpirun -np N`. Three
Windows-specific gaps, all fixed in build_abacus_windows.sh so the standard
harness works unchanged:

* mpirun shim. The build now drops an `mpirun`->`mpiexec` shim next to the
  binary (on PATH via abacus_env.sh). MS-MPI's `-n`/`-np <N> <prog>` syntax
  matches what the harness passes, so forwarding args is enough.

* OpenBLAS thread pinning. MSYS2's OpenBLAS is OpenMP-threaded (links libgomp),
  so OMP_NUM_THREADS -- not OPENBLAS_NUM_THREADS -- caps its threads. Autotest
  sets OMP_NUM_THREADS=nproc/np, so each rank spawned a multithreaded BLAS, the
  ranks oversubscribed the cores, and OpenBLAS's buffer allocator died
  ("Memory allocation still failed after 10 retries"). The shim and abacus_env.sh
  now pin OMP_NUM_THREADS=1 (ABACUS is built USE_OPENMP=OFF, so parallelism is
  MPI; the BLAS pin costs nothing).

* DLL bundling. mpiexec does not propagate PATH to child ranks when stdout is
  redirected to a file (as the harness does), so the child abacus.exe failed to
  load libopenblas/libfftw3/libscalapack ("error while loading shared
  libraries"). The build now copies the dependent MinGW/OpenBLAS/FFTW/ScaLAPACK
  DLLs next to abacus.exe; Windows searches the application directory before
  PATH, making the binary self-contained.

Verified end to end with the default invocation `bash Autotest.sh -a abacus`
(np=4, via the shim): 01_PW/001, 02_NAO_Gamma/scf_afm (gamma-only LCAO), and
03_NAO_multik/scf_pp_upf201 all pass. Corrects the earlier docs/notes that
cited OPENBLAS_NUM_THREADS and a hand-made shim.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* toolchain(windows): add MS-MPI Bin (MSMPI_BIN) to PATH in abacus_env.sh

The mpirun shim died with `exec: mpiexec: not found`: MSYS2's MinGW shell does
not inherit the Windows PATH, and MS-MPI's mpiexec.exe lives in its own Bin dir
(only msmpi.dll is in System32). The MSMPI_BIN env var (set by the MS-MPI
installer) *is* inherited, so abacus_env.sh now prepends `cygpath -u "$MSMPI_BIN"`
to PATH, making both `mpiexec` and the shim resolve. Verified from a minimal
PATH: which mpiexec/mpirun both resolve and 01_PW/001 passes via the default
harness invocation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix: restore Linux link of FFT_CPU<float> and harden parse_expression

Two issues from code review of the Windows-port commits:

1. FFT_CPU<float> undefined references on Linux (regression). The port removed
   __attribute__((weak)) from the FFT virtuals (it left null vtable slots on
   PE/MinGW and crashed). But the real FFT_CPU<float> methods live in
   fft_cpu_float.cpp, which is compiled only when ENABLE_FLOAT_FFTW=ON. With
   weak gone and float off (the Linux default), the FFT_CPU<float> vtable --
   still emitted wherever the class is constructed (FFT_Bundle) -- referenced
   undefined symbols:
     undefined reference to `ModuleBase::FFT_CPU<float>::setupFFT()' ...
   Provide trivial FFT_CPU<float> method definitions in the always-compiled
   fft_cpu.cpp, guarded by `#if !defined(__ENABLE_FLOAT_FFTW)`, so every vtable
   slot is valid on any ABI without weak and without pulling in libfftw3f. The
   float CPU path stays unreachable at runtime (FFT_Bundle::setupFFT
   WARNING_QUITs for single/mixing CPU FFT unless the macro is set). When the
   macro is on, the stubs are excluded and fft_cpu_float.cpp supplies the real
   definitions -- no duplicate symbols. Verified by linking the float vtable TU
   against fft_cpu.o in both macro states (off: links via stubs; on: links via
   fft_cpu_float.o), and that dropping both reproduces the reported errors.

2. parse_expression (input_conv.h) could push indeterminate values into vec.
   If std::regex_search found no match, sub_str stayed empty and was parsed
   anyway; in the non-multiplication branch `T occ` was uninitialized and the
   `convert >> occ` extraction was unchecked. Now: a no-match token is an input
   error (WARNING_QUIT), occ is value-initialized, and a failed extraction
   fails fast. Consistent with the other expression parsers.
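
The three hardening steps in item 2 (reject a no-match token, value-initialize, check the extraction) can be sketched with C++ <regex> as follows. This is not the parse_expression code: the token grammar ("N*value" or a plain value), the function name, and the use of exceptions in place of WARNING_QUIT are all assumptions for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical parser: "3*1.5" expands to three copies of 1.5,
// a plain "2.0" yields a single value.
std::vector<double> parse_token_sketch(const std::string& token)
{
    static const std::regex pattern(
        R"re(^\s*(?:(\d+)\s*\*\s*)?([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)\s*$)re");
    std::smatch match;
    if (!std::regex_search(token, match, pattern))
    {
        // A token that does not match is an input error, not something to parse anyway.
        throw std::invalid_argument("unparsable token: " + token);
    }
    double value = 0.0; // value-initialized, never indeterminate
    std::istringstream convert(match[2].str());
    if (!(convert >> value))
    {
        throw std::invalid_argument("bad number in token: " + token);
    }
    std::size_t count = 1;
    if (match[1].matched)
    {
        count = std::stoul(match[1].str());
    }
    return std::vector<double>(count, value);
}
```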

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fft: make the weak-vtable trick Windows-safe without touching Linux code

Rework the FFT_CPU<float> vtable handling so Linux builds byte-for-byte as
upstream and only Windows gets a delta. My earlier port had (a) removed
__attribute__((weak)) outright and (b) added trivial float stubs in
fft_cpu.cpp -- both changed working Linux core code, and (b) didn't even reach
targets that compile fft_bundle.cpp without linking fft_cpu.cpp (e.g.
MODULE_HAMILT_XCTest_VXC), so Linux still failed to link:
    undefined reference to `ModuleBase::FFT_CPU<float>::setupFFT()' ...

Root cause: the upstream virtuals are __attribute__((weak)) so the ELF linker
nulls the unused FFT_CPU<float> vtable slots when ENABLE_FLOAT_FFTW is off.
MinGW/PE has no equivalent -- weak template members there collide
("multiple definition") or leave null slots that crash on dispatch (verified
both empirically with g++).

Fix, keeping Linux untouched:
* Introduce ABACUS_FFT_WEAK = __attribute__((weak)) on non-Windows, empty on
  _WIN32, and use it in place of the raw attribute in fft_base.h / fft_cpu.h.
  Preprocessing with -U_WIN32 reproduces the upstream headers exactly (14 weak
  attrs, no extra defs); fft_cpu.cpp is reverted to pristine.
* On Windows the empty macro makes the slots ordinary symbols; the build
  already sets ENABLE_FLOAT_FFTW=ON, so fft_cpu_float.cpp supplies the real
  FFT_CPU<float> methods. The non-pure FFT_BASE<T> virtuals (which had no body,
  relying on weak) get trivial bodies in a `#if defined(_WIN32)` block -- never
  executed (abstract base; backends override what they use). This block is
  compiled only on Windows.
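
The macro selection described in the first bullet can be written as the short sketch below. Only the macro name and its two expansions come from the commit message; `fft_backend_id_sketch` is a hypothetical placeholder that is given a body so the example links on every platform, and it does not reproduce the weak-without-definition pattern used by the real headers.

```cpp
#include <cassert>

// Weak linkage where the toolchain supports the trick, nothing on Windows.
#if defined(_WIN32)
#define ABACUS_FFT_WEAK
#else
#define ABACUS_FFT_WEAK __attribute__((weak))
#endif

// Hypothetical placeholder function carrying the attribute.
ABACUS_FFT_WEAK int fft_backend_id_sketch()
{
    return 0;
}
```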

Verified with MinGW g++: constructing FFT_CPU<float> and dispatching through
its vtable links (no multiple-definition, no undefined base/derived refs) and
runs (no null-vtable crash); and the Linux-simulated preprocess output matches
upstream.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* toolchain(windows): cap default build parallelism by available RAM

The Windows build defaulted to -j nproc. On a 20-core box, 20 concurrent -O3
compilations of heavy template TUs (source_cell/module_symmetry/symmetry.cpp,
read_pp_upf201.cpp, ...) exhausted memory and ninja died with
"cc1plus.exe: out of memory allocating N bytes" -- even with 31 GB RAM.

Default -j is now min(nproc, MemTotalGB / 3) (~3 GB budget per job), read from
/proc/meminfo; an explicit -j still overrides, and the chosen value is printed
with a hint to lower it if cc1plus runs out of memory. Falls back to nproc if
/proc/meminfo is unreadable. Not a code issue -- the sources compiled fine up
to the OOM.
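
The default described above, min(nproc, MemTotalGB / 3), is implemented in a shell script; the arithmetic can be sketched in C++ as follows. Integer division and the floor of one job are assumptions of this sketch, as is the function name.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: cap build parallelism at roughly 3 GB of RAM per job.
int default_jobs_sketch(int nproc, int mem_total_gb)
{
    const int by_memory = mem_total_gb / 3; // assumed integer division
    return std::max(1, std::min(nproc, by_memory));
}
```

For the 20-core, 31 GB machine mentioned above this gives 10 jobs instead of 20.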

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(windows): remove install_windows_native.md

This was a working note for the native-Windows build trial, not reference
documentation for the repository. Drop it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Add optional DFT-D4 support

* Docs and tests

* Install dftd4 from toolchain in GitHub test

* Fix stress calculation

* Add regtest

* Add D4S model

* Add citations
…EMM (deepmodeling#7395)

* perf(gint): shape-exact bucketing + tile ladder + wide-LDS vbatched GEMM

Optimize the GPU gint batched-GEMM path (gemm_{nn,tn}_vbatch, driven from
phi_mul_phi / phi_mul_dm) for FP64 on V100/A100-class GPUs.

- phi_operator_gpu: replace the single max-shape vbatch launch with
  shape-exact bucketing. Atom pairs are grouped by (nw1, nw2) via a dense
  NW_MAX*NW_MAX counting-sort table, pre-enumerated once per batch in
  set_bgrid_batch, so each bucket hands the kernel a scalar (m, n, k) and the
  tile ladder picks the tightest tile per shape -- no cross-species tile
  waste, no over-launched blocks. A guard aborts if any atom nw >= NW_MAX.

- dgemm_vbatch: scalar (m, n, k) dispatch (drops the per-batchid M/N/K device
  arrays) feeding a 4x2 (NN) / 4x4 (TN) BLK_{M,N} ladder over {8,16,32,48}.

- gemm_{nn,tn}_vbatch: K-inner shared-memory layout + wide (double2/float4)
  LDS inner loop -- one 16-byte LDS feeds VK FMAs per (m,n); PAD keeps the
  shmem stride 16-byte aligned and warp access bank-conflict-free.
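
The shape-exact bucketing in the phi_operator_gpu item is a counting sort keyed on (nw1, nw2). A host-side sketch under simplifying assumptions is given below: `PairShapeSketch`, `bucket_pairs_by_shape_sketch` and the `nw_stride` argument are hypothetical names, and the real code also records per-bucket launch parameters that are omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one atom pair: orbital counts of both atoms.
struct PairShapeSketch
{
    int nw1;
    int nw2;
};

// Returns pair indices reordered so that pairs with the same (nw1, nw2)
// are contiguous. Every nw must be smaller than nw_stride.
std::vector<int> bucket_pairs_by_shape_sketch(const std::vector<PairShapeSketch>& pairs, int nw_stride)
{
    const std::size_t ntable = static_cast<std::size_t>(nw_stride) * nw_stride;
    // Pass 1: count how many pairs fall into each dense table slot.
    std::vector<int> counts(ntable, 0);
    for (const auto& p : pairs)
    {
        ++counts[p.nw1 * nw_stride + p.nw2];
    }
    // Exclusive prefix sum gives the first output position of each bucket.
    std::vector<int> cursor(ntable, 0);
    int running = 0;
    for (std::size_t key = 0; key < ntable; ++key)
    {
        cursor[key] = running;
        running += counts[key];
    }
    // Pass 2: place each pair index into its bucket (stable within a bucket).
    std::vector<int> order(pairs.size());
    for (std::size_t i = 0; i < pairs.size(); ++i)
    {
        const int key = pairs[i].nw1 * nw_stride + pairs[i].nw2;
        order[cursor[key]++] = static_cast<int>(i);
    }
    return order;
}
```

Each contiguous run in the result shares one (m, n, k), which is what lets a kernel launch take scalar dimensions per bucket.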

C accumulators stay double regardless of input type T, preserving the
mixed-precision fp64-accumulator fix (deepmodeling#7368); the phi_operator kernel
optimizations from deepmodeling#7366 (WantPhi dispatch, single-warp reduce) are retained.

FP64 15-case GPU benchmark: end-to-end ~1.05x (A800) / ~1.04x (V100), with
cal_gint_vl up to ~1.5x and cal_gint_rho up to ~1.65x; energies and pressures
match develop to ~1e-10 on every case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(gint): derive shape-bucket stride from ucell.nwmax, drop hardcoded NW_MAX

The (nw1, nw2) shape-bucketing in phi_mul_phi / phi_mul_dm flattened pairs
into a dense table key via `nw1 * NW_MAX + nw2`, with NW_MAX a hardcoded 64.
That was both a magic number and an artificial ceiling: a basis with nw > 64
would abort(), and 64 was only a guess at the real max.

The true upper bound is already known to the code as ucell.nwmax (max orbital
count over all atom types), exposed via gint_gpu_vars_->nwmax. Use it: set
nw_stride_ = nwmax + 1 once in the ctor so the bucket table is sized exactly to
the basis -- no cap to maintain.

A runtime stride can't index std::array<int, NW_MAX*NW_MAX>, so the three
counting-sort tables (counts / base / cursor) move to mutable std::vector
members allocated once and re-zeroed per call. For typical nwmax~25 that's ~676
ints vs the old fixed 4096, so the hot path zeroes less and never reallocates.
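
A table sized from a runtime stride and re-zeroed on each call, as described above, might be sketched like this. The class and its methods are hypothetical; only `nw_stride_ = nwmax + 1` and the "allocate once, re-zero per call" idea are taken from the commit message.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical holder for one counting-sort table.
class ShapeTableSketch
{
  public:
    explicit ShapeTableSketch(int nwmax)
        : nw_stride_(nwmax + 1),
          counts_(static_cast<std::size_t>(nw_stride_) * nw_stride_, 0)
    {
    }

    // Re-zero in place: the allocation from the constructor is reused.
    void reset() { std::fill(counts_.begin(), counts_.end(), 0); }

    void add(int nw1, int nw2) { ++counts_[nw1 * nw_stride_ + nw2]; }

    int count(int nw1, int nw2) const { return counts_[nw1 * nw_stride_ + nw2]; }

    std::size_t size() const { return counts_.size(); }

  private:
    int nw_stride_;           // declared first so it is initialized before counts_
    std::vector<int> counts_; // (nwmax + 1)^2 entries
};
```

With nwmax = 25 the table has 26 * 26 = 676 entries, matching the figure quoted above.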

The set_bgrid_batch() abort guard becomes a structurally-unreachable assert,
since nwmax is by definition the largest nw. Drop now-unused includes
(<array>, <cstdio>, <cstdlib>); add <cassert>.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(gint): clarify GEMM kernel comments, hoist shape-bucket struct

Follow-up cleanup on the shape-exact vbatched GEMM path. No behavior change.

- gemm_{nn,tn}_vbatch, dgemm_vbatch, gint_helper: rewrite the kernel comments
  to describe the actual mechanism (K-inner shared-memory layout, wide vector
  loads feeding VK FMAs per load, the tile ladder, fp64 cross-item
  accumulation) and drop the internal "V1/V3/Phase" development shorthand that
  carried no meaning outside the original work log.

- phi_operator_gpu: the local `Bucket` struct was declared identically inside
  both phi_mul_phi and phi_mul_dm. Hoist it to a named GemmShapeBucket type and
  reuse a single buckets_ member vector (cleared, not reallocated) across both,
  reserved once in the ctor -- one less per-call heap allocation on the hot
  path.

- phi_operator_gpu: pair_scratch_offset_ is fully overwritten in Pass 1 before
  Pass 2 reads it, so resize() it instead of assign(..., -1); the -1 sentinel
  was never observed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Refactor: move exx files

* fix TO_STRING

---------

Co-authored-by: linpz <linpz@mail.ustc.edu.cn>
Co-authored-by: PeizeLin <78645006+PeizeLin@users.noreply.github.com>
…al_optimization

Merge Rearrange_data into Grid_integral_optimization
…ptimization

Merge OPENMP into Grid_integral_optimization