feat: introduce core infrastructure, OpenMPI backend, and AllReduce example #4

Merged
Ziminli merged 23 commits into master from feat/dev-infra on May 8, 2026
@Ziminli Ziminli commented May 7, 2026

Summary

This PR introduces the initial core infrastructure of InfiniCCL, the first set of changes since the initial commit.

It establishes the foundational architecture for device and backend abstraction, compile-time dispatching, runtime support, and a few communication primitives. In addition, it provides an OpenMPI-based implementation of AllReduce along with a complete example program, including simple profiling and validation utilities.

Changes

Core Infrastructure

  • Introduce core abstractions:
    • Device, BackendType, and runtime specializations (CPU, NVIDIA, MetaX)
    • Operation base class and dispatching mechanism
    • Compile-time traits and dispatcher utilities
  • Add communication-related structures:
    • Communicator and backend-specific instances (e.g., OpenMPI)
  • Implement data type system:
    • DataType, TypeMap, and device-aware mappings (fp16, bf16)
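
The dispatching idea listed above can be pictured as turning a runtime `DataType` tag into a compile-time C++ type and then calling a generic functor with it. The following is a minimal C++17 sketch under stated assumptions: the enum values, the `TypeTag` wrapper, and the `DispatchByType` and `ElementSize` helpers are hypothetical names chosen for illustration and are not the actual contents of `src/dispatcher.h` or `src/data_type_impl.h`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>

// Hypothetical runtime tag; the real DataType in InfiniCCL may differ.
enum class DataType { kInt32, kFloat32, kFloat64 };

// Compile-time mapping from a tag to a C++ type.
template <DataType dtype>
struct TypeMap;
template <>
struct TypeMap<DataType::kInt32> { using type = std::int32_t; };
template <>
struct TypeMap<DataType::kFloat32> { using type = float; };
template <>
struct TypeMap<DataType::kFloat64> { using type = double; };

// Carries a type as a value so it can be handed to a generic lambda.
template <typename T>
struct TypeTag { using type = T; };

// Converts the runtime tag into a compile-time type and calls fn(TypeTag<T>{}).
template <typename Fn>
decltype(auto) DispatchByType(DataType dtype, Fn&& fn) {
  switch (dtype) {
    case DataType::kInt32:
      return std::forward<Fn>(fn)(TypeTag<TypeMap<DataType::kInt32>::type>{});
    case DataType::kFloat32:
      return std::forward<Fn>(fn)(TypeTag<TypeMap<DataType::kFloat32>::type>{});
    case DataType::kFloat64:
      return std::forward<Fn>(fn)(TypeTag<TypeMap<DataType::kFloat64>::type>{});
  }
  throw std::invalid_argument("unsupported DataType");
}

// Example operation: element size in bytes for a runtime tag.
inline std::size_t ElementSize(DataType dtype) {
  return DispatchByType(dtype, [](auto tag) -> std::size_t {
    using T = typename decltype(tag)::type;
    return sizeof(T);
  });
}
```

The switch is the only place where the runtime value is inspected; everything inside the functor is resolved at compile time, which is what lets backend- or device-specific code be selected without virtual calls.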

OpenMPI Backend Support

  • Add OpenMPI-based implementations for:
    • infiniInit()
    • infiniGetRank()
    • infiniGetSize()
    • infiniCommInitAll()
    • infiniCommDestroy()
    • infiniAllReduce()
    • infiniFinalize()
  • Introduce OpenMPI-specific mappings and runtime integration

AllReduce Example and Utilities

  • Add examples/all_reduce.cc demonstrating distributed AllReduce
  • Introduce shared utilities under examples/utils.h:
    • Timer, Metrics, and Validator for profiling and correctness checking
  • Enable runtime-based memory management and validation
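
As a rough picture of the profiling and validation utilities listed above, the sketch below shows a wall-clock timer and an element-wise check. It is an assumption-laden illustration, not the code in `examples/utils.h`: the method names, the `AllClose` and `ExpectedSum` helpers, and the tolerance are hypothetical. The rule "rank r contributes r + 1 to every element" is also an assumption; it is merely one initialization consistent with the `Expect: 10.00` line printed for four ranks.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <vector>

// Hypothetical wall-clock timer; the real Timer may expose a different API.
class Timer {
 public:
  using Clock = std::chrono::steady_clock;

  // Records the starting point of a measurement.
  void Start() { start_ = Clock::now(); }

  // Returns the milliseconds elapsed since the last Start().
  double ElapsedMs() const {
    return std::chrono::duration<double, std::milli>(Clock::now() - start_)
        .count();
  }

 private:
  Clock::time_point start_ = Clock::now();
};

// Returns true if every element is within `tol` of `expected`.
inline bool AllClose(const std::vector<float>& data, float expected,
                     float tol = 1e-5f) {
  for (float value : data) {
    if (std::fabs(value - expected) > tol) return false;
  }
  return true;
}

// Expected Sum result under the ASSUMED initialization where rank r
// contributes (r + 1) to every element: 1 + 2 + ... + world_size.
inline float ExpectedSum(int world_size) {
  return world_size * (world_size + 1) / 2.0f;
}
```

A caller would start the timer before the profiled loop, divide the elapsed time by the number of profile iterations, and run the check on a host copy of the result buffer.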

Build System

  • Introduce CMake-based build system
  • Implement code generation for include/infiniccl.h and bridge files
  • Refine header visibility:
    • expose only include/ as PUBLIC interface
    • keep internal sources PRIVATE
  • Link example programs with internal library

Logging

  • Add a lightweight logging system:
    • Logger and LOG macro for unified message handling
  • Integrate logging into key components
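
A lightweight logger of this kind can be as small as a static print method plus a macro that captures the call site. The sketch below is hypothetical: `Logger`, `PrintMsg()`, and `LOG` are named in this PR, but the signatures, the `LogLevel` enum, and the output format shown here are assumptions made for illustration.

```cpp
#include <cassert>
#include <iostream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical severity levels.
enum class LogLevel { kInfo, kWarning, kError };

class Logger {
 public:
  // Maps a level to the label printed in front of each message.
  static const char* LevelName(LogLevel level) {
    switch (level) {
      case LogLevel::kInfo: return "INFO";
      case LogLevel::kWarning: return "WARNING";
      case LogLevel::kError: return "ERROR";
    }
    return "UNKNOWN";
  }

  // Writes "[LEVEL] file:line: message" followed by a newline to `os`.
  static void PrintMsg(LogLevel level, const char* file, int line,
                       const std::string& msg, std::ostream& os = std::cerr) {
    os << "[" << LevelName(level) << "] " << file << ":" << line << ": " << msg
       << "\n";
  }
};

// Convenience macro that fills in the call site automatically.
#define LOG(level, msg) \
  Logger::PrintMsg(LogLevel::level, __FILE__, __LINE__, (msg))
```

With these definitions, `LOG(kWarning, "rank mismatch");` prints one line to standard error. Routing everything through a single macro keeps a later switch to another logging library limited to one header.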

Refactor & Style

  • Improve code structure and modularity across components
  • Keep consistent formatting, comments, and general style

Known Issues & Future Work

  • Additional devices and backends will be supported in the future.
  • A unified casting utility is needed and will be added in a future update.
  • Init configures the MPI_THREAD_FUNNELED threading model to avoid runtime issues, because using MPI_THREAD_MULTIPLE hangs.
  • The current examples/all_reduce.cc defaults to the Sum reduction operation, although Prod and Avg have also been tested and verified. Support for configurable reduction types may be added in future updates.

The following have been addressed in #1

  • The current AllReduce example has only been tested on a single node (NVIDIA environment). Multi-node and heterogeneous setups still require complete validation.
  • The logging module is temporary and will be replaced with glog in the future.
  • In the current test environment, OpenMPI consistently reports the warning "mpirun has exited due to process rank <RANK#> with PID <PID#> on node <IP> exiting improperly". The issue has been investigated but not fully resolved. It does not appear to affect functionality at this time, but should be addressed in future work.
  • The current OpenMPI implementations of the functions listed above hardcode MPI_COMM_WORLD. This should be made configurable in the future.
  • The current OpenMPI implementation of CommInitAll relies directly on OpenMPI-specific environment variables (e.g., OMPI_COMM_WORLD_LOCAL_RANK). The plan is to abstract these and map them into InfiniCCL’s own environment variable interface to provide a cleaner and more complete backend abstraction.
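
One possible shape for the environment-variable abstraction mentioned in the last item is a small lookup that prefers a library-level variable and falls back to the launcher-specific one. This is only a sketch: `INFINICCL_LOCAL_RANK` is a hypothetical variable name invented for illustration, as are the helper names. Only `OMPI_COMM_WORLD_LOCAL_RANK` comes from the PR description.

```cpp
#include <cassert>
#include <climits>
#include <cstdlib>
#include <optional>

// Parses a non-negative integer from an environment variable.
// Returns std::nullopt if the variable is unset, empty, or malformed.
inline std::optional<int> ReadIntEnv(const char* name) {
  const char* value = std::getenv(name);
  if (value == nullptr || *value == '\0') return std::nullopt;
  char* end = nullptr;
  const long parsed = std::strtol(value, &end, 10);
  if (*end != '\0' || parsed < 0 || parsed > INT_MAX) return std::nullopt;
  return static_cast<int>(parsed);
}

// Resolves the local rank by trying a HYPOTHETICAL library-level variable
// first and the OpenMPI-specific variable second.
inline std::optional<int> ResolveLocalRank() {
  const char* const names[] = {"INFINICCL_LOCAL_RANK",
                               "OMPI_COMM_WORLD_LOCAL_RANK"};
  for (const char* name : names) {
    if (std::optional<int> rank = ReadIntEnv(name)) return rank;
  }
  return std::nullopt;
}
```

Keeping the launcher-specific names in one table means that supporting another launcher would only require adding an entry, without touching the code that selects a device from the local rank.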

Logs & Screenshots

cmake .. && make -j$(nproc) && mpirun -n 4 --mca mtl ^ofi --mca btl tcp,self -x UCX_NET_DEVICES=all --allow-run-as-root ./examples/all_reduce
-- No backend specified. Defaulting to WITH_OMPI=ON
-- Auto-detecting available devices...
-- Auto-detected NVIDIA environment.
-- No MetaX GPU detected
-- InfiniCCL Config: Devices [cpu, nvidia] | Backends [ompi]
-- Configuring done (1.8s)
-- Generating done (0.0s)
-- Build files have been written to: /nfs/lizimin/InfiniCCL/build
[ 20%] Generating InfiniCCL bridge and manifest files for Devices: [cpu;nvidia] Backends: [ompi]...
[ 40%] Building CXX object src/CMakeFiles/infiniccl.dir/comm_bridge.cc.o
[ 60%] Linking CXX shared library libinfiniccl.so
[ 60%] Built target infiniccl
[ 80%] Building CXX object examples/CMakeFiles/all_reduce.dir/all_reduce.cc.o
[100%] Linking CXX executable all_reduce
[100%] Built target all_reduce
[1778167342.070887] [server:418351:0]          parser.c:2305 UCX  WARN  unused environment variable: UCX_ROOT (maybe: UCX_PROTOS?)
[1778167342.070887] [server:418351:0]          parser.c:2305 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1778167342.071891] [server:418349:0]          parser.c:2305 UCX  WARN  unused environment variable: UCX_ROOT (maybe: UCX_PROTOS?)
[1778167342.071891] [server:418349:0]          parser.c:2305 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1778167342.083620] [server:418350:0]          parser.c:2305 UCX  WARN  unused environment variable: UCX_ROOT (maybe: UCX_PROTOS?)
[1778167342.083620] [server:418350:0]          parser.c:2305 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1778167342.083668] [server:418352:0]          parser.c:2305 UCX  WARN  unused environment variable: UCX_ROOT (maybe: UCX_PROTOS?)
[1778167342.083668] [server:418352:0]          parser.c:2305 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[Rank 3] Host: server | GPU: nvidia  | Device 3
[Rank 2] Host: server | GPU: nvidia  | Device 2
[Rank 1] Host: server | GPU: nvidia  | Device 1
[Rank 0] Host: server | GPU: nvidia  | Device 0

=== Performing AllReduce on GPU Memory ===
Data size: 1048576 floats (4 MB)
Operation: Sum
Warm-up iterations: 2
Profile iterations: 20

=== AllReduce Results ===
Correct: YES
Expect:  10.00
Actual:  10.00
Time:           12.140 ms
Throughput:     0.48 GB/s (Bus BW)
Alg Bandwidth:  0.32 GB/s
InfiniCCL finalized.
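
The two bandwidth figures in the log above are consistent with the convention used by nccl-tests for AllReduce: algorithm bandwidth is the buffer size divided by the time, and bus bandwidth scales it by 2(n-1)/n for n ranks. The sketch below reproduces the reported numbers under that assumption and under the further assumption that "GB" in the log means 2^30 bytes. The `Metrics` layout and function name are hypothetical and not taken from `examples/utils.h`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical result structure; both values are in GiB/s (2^30 bytes).
struct Metrics {
  double alg_bw;
  double bus_bw;
};

// Computes AllReduce bandwidths from the buffer size, the time per
// iteration in seconds, and the number of ranks.
inline Metrics ComputeAllReduceMetrics(std::size_t bytes, double seconds,
                                       int world_size) {
  const double gib = 1024.0 * 1024.0 * 1024.0;
  // Algorithm bandwidth: payload size divided by elapsed time.
  const double alg_bw = static_cast<double>(bytes) / seconds / gib;
  // Bus bandwidth: AllReduce correction factor 2 * (n - 1) / n.
  const double bus_bw = alg_bw * 2.0 * (world_size - 1) / world_size;
  return {alg_bw, bus_bw};
}
```

For 1048576 floats, 12.140 ms, and 4 ranks this gives roughly 0.32 and 0.48, matching the `Alg Bandwidth` and `Throughput` lines.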

Ziminli added 23 commits April 3, 2026 12:17
…penMPI's type map.

- Add `constexpr_map.h` which provides a compile-time map structure.
- Add `include/data_type.h` which includes `infiniDataType_t` that is exposed to the public.
- Add `src/data_type_impl.h` which includes `DataType` and related constructs that are used internally.
- Add `src/ompi/type_map.h` which contains mappings that OpenMPI needs, specifically data type mapping at this moment.
- Add `include/return_status.h` which contains the public interface for return codes/status codes.
- Add `src/return_status_impl.h` which contains the private/internal interface for return codes/status codes.
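
For readers unfamiliar with the idea behind `constexpr_map.h`, a compile-time map can be built from a fixed array of key/value pairs with a `constexpr` linear lookup. The sketch below is an assumption about the general technique, not the contents of that header: the class layout, the `At()` method, and the element-size table are hypothetical. The OpenMPI type map presumably stores MPI data types as values; sizes are used here only to keep the example self-contained.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>

// Fixed-size map whose lookups can be evaluated at compile time (C++17).
template <typename Key, typename Value, std::size_t N>
class ConstexprMap {
 public:
  constexpr explicit ConstexprMap(std::array<std::pair<Key, Value>, N> entries)
      : entries_(entries) {}

  // Linear search; throws at run time (or fails to compile in a constant
  // expression) when the key is missing.
  constexpr Value At(const Key& key) const {
    for (const auto& entry : entries_) {
      if (entry.first == key) return entry.second;
    }
    throw std::out_of_range("key not found");
  }

 private:
  std::array<std::pair<Key, Value>, N> entries_;
};

// Hypothetical data type tags and an element-size table.
enum class DataType { kInt32, kFloat32, kFloat64 };
using SizeEntry = std::pair<DataType, std::size_t>;

inline constexpr ConstexprMap<DataType, std::size_t, 3> kElementSize(
    std::array<SizeEntry, 3>{SizeEntry{DataType::kInt32, 4},
                             SizeEntry{DataType::kFloat32, 4},
                             SizeEntry{DataType::kFloat64, 8}});

// The lookup is usable in constant expressions.
static_assert(kElementSize.At(DataType::kFloat64) == 8);
```

Because the table is `constexpr`, a lookup with a key known at compile time costs nothing at run time, and a missing key in a constant expression becomes a compile error.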
…its.

- Add `src/device.h` which contains the definitions and utils about devices.
- Add `src/traits.h` which contains the compile-time traits.
- Add `src/dispatcher.h` which contains the implementation of C++17-compatible dispatcher.
- These are mainly migrated from `InfiniTensor/InfiniOps`.
…NVIDIA and MetaX

 - add runtime in `src/runtime.h` and its specializations under `cuda/`, `nvidia/` and `metax/`
 - add `device_.h` under `nvidia/` and `metax/` which contain their platform specializations for `DeviceEnabled`
 - add `Communicator` and `BackendCommInstance` in `src/communicator.h`
 - add the backend-specific derived classes of `BackendCommInstance`, specifically `OmpiInstance` in `src/ompi/comm_instance.h` and `NcclInstance` in `src/nvidia/nccl/comm_instance.h`
…peration` class for operation dispatching

 - add the generic traits for getting the "best" element in a `List` in `traits.h`
 - add traits for indicating enabled backends and `AllBackendTypes` alias
 - add Priority traits for `BackendType` and `Device::Type` in `backend.h` and `device.h`, respectively
 - add `src/operation.h` which contains `Operation` base class for all the operations and is responsible for dispatching different operations
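
The "best element in a `List`" trait described in this commit can be sketched as a recursive template that keeps whichever candidate has the higher priority. Everything below is a hypothetical illustration: the tag types, the priority values, and the rule that a higher number wins are assumptions, and the real traits operate on `BackendType` and `Device::Type` rather than on these stand-in structs.

```cpp
#include <cassert>
#include <type_traits>

// Compile-time list of types.
template <typename... Ts>
struct List {};

// Hypothetical backend tags with ASSUMED priorities (higher wins).
struct OmpiBackend {};
struct NcclBackend {};

template <typename T>
struct Priority;
template <>
struct Priority<OmpiBackend> { static constexpr int value = 1; };
template <>
struct Priority<NcclBackend> { static constexpr int value = 2; };

// Selects the highest-priority element of a non-empty List.
template <typename L>
struct Best;

// Base case: a single element is trivially the best.
template <typename T>
struct Best<List<T>> { using type = T; };

// Recursive case: compare the first two and continue with the winner.
template <typename T, typename U, typename... Rest>
struct Best<List<T, U, Rest...>> {
  using Winner =
      std::conditional_t<(Priority<T>::value >= Priority<U>::value), T, U>;
  using type = typename Best<List<Winner, Rest...>>::type;
};

template <typename L>
using BestT = typename Best<L>::type;

static_assert(std::is_same_v<BestT<List<OmpiBackend, NcclBackend>>, NcclBackend>);
```

An operation can then pick its default implementation from the list of enabled backends at compile time, so disabling a backend in the build changes the default without any runtime check.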
…t `Init`/`infiniInit()`

 - add the definition of some communication functions in `include/comm.h` and `include/comm_ops.h`
 - add `CheckMpiImpl()` in `ompi/checks.h`
 - add class `Init` and its OpenMPI implementation in `base/init.h` and `ompi/impl/init.h`, respectively
 - add cpu's device file `cpu/device_.h`
…, and device-aware dispatching

 - add `TypeMap` and `DataTypeMap` in `src/data_type_impl.h`
 - add CPU implementations of fp16 and bf16 (`Float16` and `BFloat16`, respectively)
 - add device-dependent bf16 and fp16, currently covering CPU, NVIDIA, and MetaX
 - update `DispatchFunc` to reflect the device-aware `DataType` mapping
 - add CPU runtime
 - style fix in `src/device.h`
…Init()` as an example

 - add `Operation` class in `src/operation.h` as a generic base class for all operations
 - combine `include/comm.h` and `include/comm_ops.h` into a single `comm.h`, which for now contains only `infiniInit()`
 - add the OpenMPI implementation of `Init`
 - add CMake build system
 - provide code generation for `include/infiniccl.h` and bridge files
 - update `.gitignore` to ignore `build/` and `include/infiniccl.h`
 - add `examples/` directory for example test programs and add `all_reduce.cc` for allreduce example
Restrict internal source and generated directories to PRIVATE visibility
to ensure a production-grade public API. Only the include directory is
exposed to downstream consumers via the PUBLIC interface.

- Move src/ and binary/ directories to PRIVATE build interface.
- Keep include/ as the primary PUBLIC/INSTALL interface.
- Prevents internal template headers from leaking into user space.
- link the example programs with `src/` library in CMake
- use internal device/runtime/traits for validation
- add malloc/memcpy/free runtime calls in `examples/all_reduce` example
 - support `infiniFinalize()` and add its ompi implementation, used in `examples/all_reduce.cc`
 - create `examples/utils.h` to hold the utilities shared by the example programs and move the `CHECK_INFINI` macro into it
…t related required features, and fix errors

 - support `infiniCommInitAll()` and `infiniCommDestroy()` with ompi backend
 - change `Init()` to use `MPI_THREAD_FUNNELED` in the ompi implementation (it hangs otherwise)
 - add some mutators for `Communicator` class
 - update `OmpiInstance` with default handle value and `Destroy()` method
 - add `SetDevice()` alias for NVIDIA's runtime
 - add error code info printing for the `CHECK_INFINI` macro in `examples/utils.h`
…the message printing

 - add a simple `Logger` and its `PrintMsg()` method in `src/logging.h`
 - update places where this is used: `src/base/comm_init_all.h` and `src/ompi/impl/comm_init_all.h`
…O` comments.

 - add a `LOG` macro for convenient logging; this will later be replaced with `glog`
 - update the `TODO` comments that track the remaining logging work
… and result validation in the allreduce example

 - support `infiniAllReduce()` and its ompi backend
 - add `Timer`, `Metrics`, and `Validator` in `examples/utils.h` for simple profiling and result checking
 - add `infiniRedOp_t` and its internal mapping
 - add two synchronization aliases to NVIDIA's runtime backend
…/all_reduce.cc`

 - add `warmup_iters` and `profile_iters` to control the number of iterations in the warm-up and profiling loops
 - move the body of the original main function in `examples/all_reduce.cc` into `RunAllReduceExample()`; the main function now only sets the control parameters and calls `RunAllReduceExample()`
@Ziminli Ziminli self-assigned this May 7, 2026
@Ziminli Ziminli merged commit de8ea7f into master May 8, 2026
@Ziminli Ziminli deleted the feat/dev-infra branch May 8, 2026 02:43