feat: introduce core infrastructure, OpenMPI backend, and AllReduce example by Ziminli · Pull Request #4 · InfiniTensor/InfiniCCL

Ziminli · 2026-05-07T15:52:30Z

Summary

This PR introduces the initial core infrastructure of InfiniCCL, building on the initial commit.

It establishes the foundational architecture for device and backend abstraction, compile-time dispatching, runtime support, and a few communication primitives. In addition, it provides an OpenMPI-based implementation of AllReduce along with a complete example program, including simple profiling and validation utilities.

Changes

Core Infrastructure

Introduce core abstractions:
- Device, BackendType, and runtime specializations (CPU, NVIDIA, MetaX)
- Operation base class and dispatching mechanism
- Compile-time traits and dispatcher utilities
Add communication-related structures:
- Communicator and backend-specific instances (e.g., OpenMPI)
Implement data type system:
- DataType, TypeMap, and device-aware mappings (fp16, bf16)
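A minimal sketch of how a compile-time type map and a dispatcher can work together, assuming a three-value `DataType` enum. The names `TypeTag`, `DispatchDataType`, and `ElementSize` are hypothetical and only illustrate the idea; the actual definitions live in `src/data_type_impl.h` and `src/dispatcher.h` and may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the internal DataType enum.
enum class DataType { kInt32, kFloat32, kFloat64 };

// TypeMap: compile-time mapping from an enum value to a C++ type.
template <DataType dtype> struct TypeMap;
template <> struct TypeMap<DataType::kInt32> { using type = std::int32_t; };
template <> struct TypeMap<DataType::kFloat32> { using type = float; };
template <> struct TypeMap<DataType::kFloat64> { using type = double; };

// Carries a type as a value so that a generic lambda can recover it.
template <typename T> struct TypeTag { using type = T; };

// Turns a runtime DataType into a compile-time type and invokes `f` with it.
template <typename F>
void DispatchDataType(DataType dtype, F&& f) {
  switch (dtype) {
    case DataType::kInt32:
      f(TypeTag<TypeMap<DataType::kInt32>::type>{});
      return;
    case DataType::kFloat32:
      f(TypeTag<TypeMap<DataType::kFloat32>::type>{});
      return;
    case DataType::kFloat64:
      f(TypeTag<TypeMap<DataType::kFloat64>::type>{});
      return;
  }
  assert(false && "unsupported DataType");
}

// Example use: element size in bytes for a runtime DataType.
inline std::size_t ElementSize(DataType dtype) {
  std::size_t size = 0;
  DispatchDataType(dtype, [&size](auto tag) {
    using T = typename decltype(tag)::type;
    size = sizeof(T);
  });
  return size;
}
```

The switch is the only place where the runtime value meets the compile-time world; everything inside the lambda is instantiated once per supported type.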

OpenMPI Backend Support

Add OpenMPI-based implementations for:
- infiniInit()
- infiniGetRank()
- infiniGetSize()
- infiniCommInitAll()
- infiniCommDestroy()
- infiniAllReduce()
- infiniFinalize()
Introduce OpenMPI-specific mappings and runtime integration

AllReduce Example and Utilities

Add examples/all_reduce.cc demonstrating distributed AllReduce
Introduce shared utilities under examples/utils.h:
- Timer, Metrics, and Validator for profiling and correctness checking
Enable runtime-based memory management and validation
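A rough sketch of the warm-up and profiling pattern that such utilities support, using `std::chrono`. The `Timer` interface and the `ProfileMs` helper shown here are assumptions made for illustration, not the contents of `examples/utils.h`.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical wall-clock timer; the real Timer is defined in examples/utils.h.
class Timer {
 public:
  using Clock = std::chrono::steady_clock;

  void Start() { start_ = Clock::now(); }

  // Milliseconds elapsed since the last Start().
  double ElapsedMs() const {
    return std::chrono::duration<double, std::milli>(Clock::now() - start_)
        .count();
  }

 private:
  Clock::time_point start_ = Clock::now();
};

// Runs `op` warmup_iters times untimed, then profile_iters times timed, and
// returns the mean time per timed iteration in milliseconds.
template <typename Op>
double ProfileMs(Op&& op, int warmup_iters, int profile_iters) {
  assert(warmup_iters >= 0 && profile_iters > 0);
  for (int i = 0; i < warmup_iters; ++i) op();
  Timer timer;
  timer.Start();
  for (int i = 0; i < profile_iters; ++i) op();
  return timer.ElapsedMs() / profile_iters;
}
```

When the operation runs asynchronously on a GPU, the device would need to be synchronized before the elapsed time is read; otherwise only the launch cost is measured.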

Build System

Introduce CMake-based build system
Implement code generation for include/infiniccl.h and bridge files
Refine header visibility:
- expose only include/ as PUBLIC interface
- keep internal sources PRIVATE
Link example programs with internal library

Logging

Add a lightweight logging system:
- Logger and LOG macro for unified message handling
Integrate logging into key components
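For illustration, a lightweight logger of this kind could look roughly like the sketch below. The `LogLevel` enum, the `Format()` helper, the message layout, and the argument list of `PrintMsg()` are all assumptions; only the names `Logger`, `PrintMsg()`, and `LOG` come from this PR.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical severity levels.
enum class LogLevel { kInfo, kWarning, kError };

class Logger {
 public:
  // Formats one message as "[LEVEL] file:line: text".
  static std::string Format(LogLevel level, const char* file, int line,
                            const std::string& text) {
    std::ostringstream out;
    out << '[' << Name(level) << "] " << file << ':' << line << ": " << text;
    return out.str();
  }

  // Writes the formatted message to stderr.
  static void PrintMsg(LogLevel level, const char* file, int line,
                       const std::string& text) {
    std::cerr << Format(level, file, line, text) << '\n';
  }

 private:
  static const char* Name(LogLevel level) {
    switch (level) {
      case LogLevel::kInfo: return "INFO";
      case LogLevel::kWarning: return "WARNING";
      case LogLevel::kError: return "ERROR";
    }
    return "UNKNOWN";
  }
};

// Captures the call site so that callers only pass a level and a message.
#define LOG(level, text) \
  Logger::PrintMsg(LogLevel::level, __FILE__, __LINE__, (text))
```

A call such as `LOG(kWarning, "falling back to device 0")` then prints the level, the call site, and the message on one line.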

Refactor & Style

Improve code structure and modularity across components
Keep consistent formatting, comments, and general style

Known Issues & Future Work

Additional devices and backends will be supported in the future;
A unified casting utility will be required and supported in future updates;
The threading model in Init is set to MPI_THREAD_FUNNELED, because requesting MPI_THREAD_MULTIPLE hangs at runtime;
The current examples/all_reduce.cc defaults to the Sum reduction operation. However, Prod and Avg have also been tested and verified. Support for configurable reduction types may be added in future updates.
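To make the three reduction types concrete, the sketch below computes the value every rank should hold for one element after AllReduce, given each rank's contribution. The `RedOp` enum and both helpers are hypothetical; the public type in this PR is `infiniRedOp_t`. With four ranks contributing 1, 2, 3, and 4 (an assumed input), Sum gives 10, Prod gives 24, and Avg gives 2.5.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical enum used only in this sketch.
enum class RedOp { kSum, kProd, kAvg };

// Reference result for one element, given the value each rank contributes.
inline double ReduceReference(const std::vector<double>& per_rank, RedOp op) {
  assert(!per_rank.empty());
  double acc = (op == RedOp::kProd) ? 1.0 : 0.0;
  for (double v : per_rank) {
    acc = (op == RedOp::kProd) ? acc * v : acc + v;
  }
  if (op == RedOp::kAvg) acc /= static_cast<double>(per_rank.size());
  return acc;
}

// Relative comparison, suitable for checking floating-point results.
inline bool NearlyEqual(double actual, double expect, double tol = 1e-5) {
  return std::fabs(actual - expect) <=
         tol * std::fmax(1.0, std::fabs(expect));
}
```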

The following have been addressed in #1

The current AllReduce example has only been tested on a single node (NVIDIA environment). Multi-node and heterogeneous setups still require a complete validation.
The logging module is temporary and will be replaced with glog in the future.
In the current test environment, OpenMPI consistently reports the message "mpirun has exited due to process rank <RANK#> with PID <PID#> on node <IP> exiting improperly". This issue has been investigated but not fully resolved. It does not appear to affect functionality at this time, but should be addressed in future work;
The current OpenMPI implementations of the functions above hardcode MPI_COMM_WORLD. This should be made configurable in the future;
The current OpenMPI implementation of CommInitAll directly relies on OpenMPI-specific environment variables (e.g., OMPI_COMM_WORLD_LOCAL_RANK). This is planned to be abstracted and mapped into InfiniCCL’s own environment variable interface to provide a cleaner and more complete backend abstraction;
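One possible shape for the environment-variable abstraction described in the last item is sketched below: a library-level variable is consulted first, with the OpenMPI variable as a fallback. `INFINI_LOCAL_RANK` is a hypothetical name that this PR does not define, and the fallback to rank 0 is likewise an assumption.

```cpp
#include <climits>
#include <cstdlib>
#include <initializer_list>
#include <optional>

// Parses a non-negative integer from an environment variable, if it is set
// and well-formed.
inline std::optional<int> EnvInt(const char* name) {
  const char* raw = std::getenv(name);
  if (raw == nullptr || *raw == '\0') return std::nullopt;
  char* end = nullptr;
  const long value = std::strtol(raw, &end, 10);
  if (*end != '\0' || value < 0 || value > INT_MAX) return std::nullopt;
  return static_cast<int>(value);
}

// Looks up the local rank, preferring the (hypothetical) library-level
// variable over the backend-specific one.
inline int LocalRank() {
  for (const char* name : {"INFINI_LOCAL_RANK", "OMPI_COMM_WORLD_LOCAL_RANK"}) {
    if (auto rank = EnvInt(name)) return *rank;
  }
  return 0;  // Assumed fallback for single-process runs.
}
```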

Logs & Screenshots

cmake .. && make -j$(nproc) && mpirun -n 4 --mca mtl ^ofi --mca btl tcp,self -x UCX_NET_DEVICES=all --allow-run-as-root ./examples/all_reduce
-- No backend specified. Defaulting to WITH_OMPI=ON
-- Auto-detecting available devices...
-- Auto-detected NVIDIA environment.
-- No MetaX GPU detected
-- InfiniCCL Config: Devices [cpu, nvidia] | Backends [ompi]
-- Configuring done (1.8s)
-- Generating done (0.0s)
-- Build files have been written to: /nfs/lizimin/InfiniCCL/build
[ 20%] Generating InfiniCCL bridge and manifest files for Devices: [cpu;nvidia] Backends: [ompi]...
[ 40%] Building CXX object src/CMakeFiles/infiniccl.dir/comm_bridge.cc.o
[ 60%] Linking CXX shared library libinfiniccl.so
[ 60%] Built target infiniccl
[ 80%] Building CXX object examples/CMakeFiles/all_reduce.dir/all_reduce.cc.o
[100%] Linking CXX executable all_reduce
[100%] Built target all_reduce
[1778167342.070887] [server:418351:0]          parser.c:2305 UCX  WARN  unused environment variable: UCX_ROOT (maybe: UCX_PROTOS?)
[1778167342.070887] [server:418351:0]          parser.c:2305 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1778167342.071891] [server:418349:0]          parser.c:2305 UCX  WARN  unused environment variable: UCX_ROOT (maybe: UCX_PROTOS?)
[1778167342.071891] [server:418349:0]          parser.c:2305 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1778167342.083620] [server:418350:0]          parser.c:2305 UCX  WARN  unused environment variable: UCX_ROOT (maybe: UCX_PROTOS?)
[1778167342.083620] [server:418350:0]          parser.c:2305 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1778167342.083668] [server:418352:0]          parser.c:2305 UCX  WARN  unused environment variable: UCX_ROOT (maybe: UCX_PROTOS?)
[1778167342.083668] [server:418352:0]          parser.c:2305 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[Rank 3] Host: server | GPU: nvidia  | Device 3
[Rank 2] Host: server | GPU: nvidia  | Device 2
[Rank 1] Host: server | GPU: nvidia  | Device 1
[Rank 0] Host: server | GPU: nvidia  | Device 0

=== Performing AllReduce on GPU Memory ===
Data size: 1048576 floats (4 MB)
Operation: Sum
Warm-up iterations: 2
Profile iterations: 20

=== AllReduce Results ===
Correct: YES
Expect:  10.00
Actual:  10.00
Time:           12.140 ms
Throughput:     0.48 GB/s (Bus BW)
Alg Bandwidth:  0.32 GB/s
InfiniCCL finalized.
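The two bandwidth figures in the log can be reproduced from the reported size and time under two assumptions that the log itself does not state: 1 GB is taken as 2^30 bytes, and bus bandwidth applies the 2(n-1)/n AllReduce factor used by nccl-tests. The helper names below are hypothetical. With 4 MB, 12.140 ms, and 4 ranks this gives about 0.32 and 0.48.

```cpp
#include <cassert>
#include <cstddef>

// Algorithm bandwidth: payload size divided by time per iteration,
// expressed in units of 2^30 bytes per second.
inline double AlgBandwidth(std::size_t bytes, double time_ms) {
  assert(time_ms > 0.0);
  const double gib = static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0);
  return gib / (time_ms / 1000.0);
}

// Bus bandwidth for AllReduce: algorithm bandwidth scaled by 2(n-1)/n.
inline double BusBandwidth(double alg_bw, int ranks) {
  assert(ranks > 0);
  return alg_bw * 2.0 * (ranks - 1) / ranks;
}
```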

…penMPI's type map.
- Add `constexpr_map.h` which provides a compile-time map structure.
- Add `include/data_type.h` which includes `infiniDataType_t` that is exposed to the public.
- Add `src/data_type_impl.h` which includes `DataType` and related constructs that are used internally.
- Add `src/ompi/type_map.h` which contains mappings that OpenMPI needs, specifically data type mapping at this moment.
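As an illustration of what a compile-time map structure can look like in C++17, here is a minimal sketch built on `std::array` with a linear lookup. The `ConstexprMap` name, its interface, and the example table are assumptions and are not taken from `constexpr_map.h`.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Hypothetical compile-time map: a fixed array of key/value pairs.
template <typename Key, typename Value, std::size_t N>
struct ConstexprMap {
  std::array<std::pair<Key, Value>, N> items;

  // Linear lookup usable in constant expressions; returns `fallback` on a miss.
  constexpr Value At(const Key& key, const Value& fallback) const {
    for (const auto& item : items) {
      if (item.first == key) return item.second;
    }
    return fallback;
  }
};

// Hypothetical stand-in for the internal DataType enum.
enum class DataType { kInt32, kFloat32, kFloat64 };

// Example table: element size in bytes for each data type.
using Entry = std::pair<DataType, int>;
constexpr ConstexprMap<DataType, int, 3> kTypeSizes{{{
    Entry{DataType::kInt32, 4},
    Entry{DataType::kFloat32, 4},
    Entry{DataType::kFloat64, 8}}}};

// The lookup is resolved by the compiler, not at runtime.
static_assert(kTypeSizes.At(DataType::kFloat64, -1) == 8, "compile-time lookup");
```

A backend type map could use the same structure with backend handles as values, so that an unsupported key is detected before the program runs.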

…its.
- Add `src/device.h` which contains the definitions and utils about devices.
- Add `src/traits.h` which contains the compile-time traits.
- Add `src/dispatcher.h` which contains the implementation of C++17-compatible dispatcher.
- These are mainly migrated from `InfiniTensor/InfiniOps`.

…NVIDIA and MetaX
- add runtime in `src/runtime.h` and its specializations under `cuda/`, `nvidia/` and `metax/`
- add `device_.h` under `nvidia/` and `metax/` which contain their platform specializations for `DeviceEnabled`

- add `Communicator` and `BackendCommInstance` in `src/communicator.h`
- add the backend-specific derived classes of `BackendCommInstance`, specifically `OmpiInstance` in `src/ompi/comm_instance.h` and `NcclInstance` in `src/nvidia/nccl/comm_instance.h`

…peration` class for operation dispatching
- add the generic traits for getting the "best" element in a `List` in `traits.h`
- add traits for indicating enabled backends and `AllBackendTypes` alias
- add Priority traits for `BackendType` and `Device::Type` in `backend.h` and `device.h`, respectively
- add `src/operation.h` which contains `Operation` base class for all the operations and is responsible for dispatching different operations
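A sketch of how a "best element" trait can be built on top of priority traits, assuming that a higher priority wins. The enum values, the priority numbers, and the `Best` trait are hypothetical; the actual traits are in `traits.h`, `backend.h`, and `device.h`.

```cpp
#include <type_traits>

// Hypothetical backend enum for this sketch.
enum class BackendType { kOmpi, kNccl };

// Hypothetical priorities; the higher value is preferred.
template <BackendType backend> struct Priority;
template <>
struct Priority<BackendType::kOmpi> : std::integral_constant<int, 1> {};
template <>
struct Priority<BackendType::kNccl> : std::integral_constant<int, 2> {};

// A compile-time list of backend values.
template <BackendType... backends> struct List {};

// Best<List<...>>::value is the entry with the highest Priority.
template <typename L> struct Best;

// A single-element list is its own best element.
template <BackendType only>
struct Best<List<only>> {
  static constexpr BackendType value = only;
};

// Otherwise compare the head against the best element of the tail.
template <BackendType head, BackendType next, BackendType... rest>
struct Best<List<head, next, rest...>> {
 private:
  static constexpr BackendType tail_best = Best<List<next, rest...>>::value;

 public:
  static constexpr BackendType value =
      (Priority<head>::value >= Priority<tail_best>::value) ? head : tail_best;
};

static_assert(Best<List<BackendType::kOmpi, BackendType::kNccl>>::value ==
                  BackendType::kNccl,
              "the higher-priority backend is selected");
```

Because the selection happens entirely at compile time, a build that enables only one backend pays no runtime cost for the choice.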

…t `Init`/`infiniInit()`
- add the definition of some communication functions in `include/comm.h` and `include/comm_ops.h`
- add `CheckMpiImpl()` in `ompi/checks.h`
- add class `Init` and its OpenMPI's implementation in `base/init.h` and `ompi/impl/init.h`, respectively
- add cpu's device file `cpu/device_.h`

…, and device-aware dispatching
- add `TypeMap` and `DataTypeMap` in `src/data_type_impl.h`
- add CPU implementation of fp16 and bf16 (`Float16` and `BFloat16`, respectively)
- add device-dependent bf16 and fp16, currently covering CPU, NVIDIA, and MetaX
- update `DispatchFunc` to reflect the device-aware `DataType` mapping
- add CPU runtime
- style fix in `src/device.h`
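For readers unfamiliar with bf16, a CPU-side type can be as small as the sketch below: bf16 keeps the upper 16 bits of an IEEE-754 binary32 value. The struct layout, the method names, and the round-to-nearest-even choice are assumptions and may not match the `BFloat16` added in this PR.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical CPU bf16: the upper 16 bits of an IEEE-754 binary32 value.
struct BFloat16 {
  std::uint16_t bits = 0;

  static BFloat16 FromFloat(float value) {
    std::uint32_t raw;
    std::memcpy(&raw, &value, sizeof(raw));
    BFloat16 out;
    if ((raw & 0x7FFFFFFFu) > 0x7F800000u) {
      // NaN: set a mantissa bit so truncation cannot turn it into infinity.
      out.bits = static_cast<std::uint16_t>((raw >> 16) | 0x0040u);
    } else {
      // Round to nearest, ties to even, on the 16 bits being dropped.
      const std::uint32_t rounding = 0x7FFFu + ((raw >> 16) & 1u);
      out.bits = static_cast<std::uint16_t>((raw + rounding) >> 16);
    }
    return out;
  }

  float ToFloat() const {
    // Widening is exact: the dropped mantissa bits are restored as zeros.
    const std::uint32_t raw = static_cast<std::uint32_t>(bits) << 16;
    float value;
    std::memcpy(&value, &raw, sizeof(value));
    return value;
  }
};
```

Values that fit in 8 significant bits, such as 1.0 or 10.0, survive the round trip unchanged; everything else is rounded to the nearest representable bf16 value.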

…Init()` as an example
- add `Operation` class in `src/operation.h` as a generic base class for all operations
- combine `include/comm.h` and `include/comm_ops.h` into a single `comm.h` and now only leave `infiniInit()`
- add the OpenMPI's implementation of `Init`

- add CMake build system
- provide code generation for `include/infiniccl.h` and bridge files
- update `.gitignore` to ignore `build/` and `include/infiniccl.h`
- add `examples/` directory for example test programs and add `all_reduce.cc` for allreduce example

Restrict internal source and generated directories to PRIVATE visibility to ensure a production-grade public API. Only the include directory is exposed to downstream consumers via the PUBLIC interface.
- Move src/ and binary/ directories to PRIVATE build interface.
- Keep include/ as the primary PUBLIC/INSTALL interface.
- Prevents internal template headers from leaking into user space.

…t related required features, and fix errors
- support `infiniCommInitAll()` and `infiniCommDestroy()` with ompi backend
- change `Init()` to use `MPI_THREAD_FUNNELED` for ompi's implementation (otherwise will hang)
- add some mutators for `Communicator` class
- update `OmpiInstance` with default handle value and `Destroy()` method
- add `SetDevice()` alias for NVIDIA's runtime
- add error code info printing for the `CHECK_INFINI` macro in `examples/utils.h`
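A status-checking macro with error code printing typically follows the pattern sketched here. The `Status` enum, the `DescribeFailure()` helper, and the message format are hypothetical; the real status codes are declared in `include/return_status.h`, and the real macro in `examples/utils.h` may be written differently.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical status codes used only in this sketch.
enum class Status { kSuccess = 0, kInvalidArgument = 1, kInternalError = 2 };

// Builds the diagnostic text, including the numeric error code.
inline std::string DescribeFailure(const char* expr, Status status,
                                   const char* file, int line) {
  return std::string(file) + ":" + std::to_string(line) + ": `" + expr +
         "` failed with code " + std::to_string(static_cast<int>(status));
}

// Evaluates `call` once; on failure prints the diagnostic and exits.
#define CHECK_INFINI(call)                                             \
  do {                                                                 \
    const Status status_ = (call);                                     \
    if (status_ != Status::kSuccess) {                                 \
      std::fprintf(                                                    \
          stderr, "%s\n",                                              \
          DescribeFailure(#call, status_, __FILE__, __LINE__).c_str()); \
      std::exit(EXIT_FAILURE);                                         \
    }                                                                  \
  } while (0)
```

Wrapping the body in `do { ... } while (0)` lets the macro be used as a single statement, including inside an unbraced `if`.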

… and result validation in the allreduce example
- support `infiniAllReduce()` and its ompi backend
- add `Timer`, `Metrics`, and `Validator` in `examples/utils.h` for simple profiling and result checking
- add `infiniRedOp_t` and its internal mapping
- add two synchronization runtime aliases for NVIDIA's runtime backend

…/all_reduce.cc`
- add `warmup_iters` and `profile_iters` to control the number of rounds in the warmup and profiling loops
- move the body of the original main function in `examples/all_reduce.cc` into `RunAllReduceExample()`; the main function now only sets control parameters and then calls `RunAllReduceExample()`

Ziminli added 23 commits April 3, 2026 12:17

feat: add the external and internal interfaces for return code/status.

83c0334

- Add `include/return_status.h` which contains the public interface for return codes/status codes.
- Add `src/return_status_impl.h` which contains the private/internal interface for return codes/status codes.

feat: add src/backend.h which contains the definition of BackendType

175837a

refactor: change DataType and ReturnStatus from aliasing to scope…

109d36e

…d enum class

style: add comments for the #endif in various files

7a03f50

feat: enable the allreduce example with internal runtime

58257e4

- link the example programs with `src/` library in CMake
- use internal device/runtime/traits for validation
- add malloc/memcpy/free runtime calls in `examples/all_reduce` example

feat: support infiniGetRank() and add rank-related info printing in…

f1527cb

… `examples/all_reduce.cc`

feat: support infiniGetSize() and add its usage in `examples/all_re…

d9607dd

…duce.cc`

feat: create examples/utils.h and support infiniFinalize()

41e3bca

- support `infiniFinalize()` and add its ompi's implementation, used in `examples/all_reduce.cc`
- create `examples/utils.h` for having all the utilities used by the example programs and move the `CHECK_INFINI` macro into it

feat: add a simple Logger and its PrintMsg() method for unifying …

7102b31

…the message printing
- add a simple `Logger` and its `PrintMsg()` method in `src/logging.h`
- update places where this is used: `src/base/comm_init_all.h` and `src/ompi/impl/comm_init_all.h`

refactor: add LOG macro for convenient logging and change some `TOD…

d32d9c2

…O` comments.
- add `LOG` macro for convenient logging, but this will later be replaced with `glog`
- update the `TODO` comments that remind logging task

Ziminli self-assigned this May 7, 2026

Ziminli merged commit de8ea7f into master May 8, 2026

Ziminli deleted the feat/dev-infra branch May 8, 2026 02:43

Ziminli merged 23 commits into master from feat/dev-infra
