Add RDMA and NVIDIA GPU Direct Storage support #214

Merged
harshavardhana merged 25 commits into minio:main from miniohq:rdma-support
May 26, 2026


@harshavardhana harshavardhana commented May 26, 2026

Summary

  • Optional RDMA / NVIDIA GPU Direct Storage transport for PutObject / GetObject, gated behind a new CMake flag MINIO_CPP_ENABLE_RDMA (default OFF).
  • PutObjectArgs / GetObjectArgs carry optional RDMA fields, keeping the public API uniform for RDMA and non-RDMA callers.
  • C ABI in include/miniocpp/c_api.h for cross-language bindings (consumed by minio-go / minio-py wrappers).
  • HTTP-over-RDMA control path via curlpp, NIC-failover-aware retry, HTTP fallback when RDMA buffer registration fails, and CRC64NVME checksum support for RDMA multipart uploads.
  • vendor/cuobj/ carries NVIDIA cuFile / cuObjClient redistributables per Attachment A of the CUDA Toolkit EULA (EULA + NOTICE bundled), consumed only when MINIO_CPP_ENABLE_RDMA=ON.
  • Examples: GetPutRDMA (GPU-buffer demo, loads libcudart via dlopen so the SDK build has no cudart link-time dependency) and GPUHostDisk.

Test plan

  • Default build (MINIO_CPP_ENABLE_RDMA=OFF) builds and existing tests pass on Linux/macOS
  • RDMA-enabled build (-DMINIO_CPP_ENABLE_RDMA=ON) builds against vendored cuFile / cuObjClient
  • GetPutRDMA example runs end-to-end against a MinIO server with RDMA enabled
  • HTTP fallback path exercised when RDMA buffer registration fails
  • CRC64NVME checksum verified on RDMA multipart uploads

This commit adds support for RDMA (Remote Direct Memory Access) and
NVIDIA GPU Direct Storage for high-performance data transfers.

Features:
- RDMA transport layer with AWS S3 SignV4 signing
- GetObject/PutObject operations via RDMA
- Multipart upload support with RDMA
- NVIDIA cuFile integration for GPU Direct Storage
- cuObjClient wrapper for RDMA operations (v0.7)
- New examples: GetPutRDMA, GPUHostDisk

Build changes:
- Tests and examples built by default (MINIO_CPP_TEST=ON)
- CUDA toolkit integration at /usr/local/cuda
- cuObjClient library linkage

The RDMA implementation uses objectPut/objectGet callbacks invoked
by the cuFile RDMA layer for direct GPU-to-storage transfers.
- Use std::optional<size_t> for RDMA args size field instead of -1 sentinel
- Add region field to s3_rdma_client_ctx and pass resolved region
- Add null pointer checks for rdmaclient before dereferencing
- Add exception safety with try-catch for std::stoi/stoll in RDMA callbacks
- Add [[maybe_unused]] attribute for unused offset parameter in callbacks
- Fix pugixml set_value() to use c_str() for compatibility
- Add cudart linking for CUDA examples
- Use cuda_runtime.h instead of cuda.h in GetPutRDMA example
- Add checksum_crc64nvme field to UploadPartArgs, PutObjectArgs
- Add checksum field to s3_rdma_client_ctx_t for RDMA context
- Add x-amz-checksum-crc64nvme header in RDMA objectPut callback
- Return checksum in PutObjectResponse and UploadPartResponse
- Support checksum pass-through from Go to C++ RDMA layer
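
Two of the defensive changes listed above (an optional size instead of a -1 sentinel, and exception safety around std::stoll) can be sketched together. `ParseSize` is a hypothetical helper written for illustration; it is not a function from minio-cpp.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of minio-cpp): parse a non-negative decimal
// size from a header value. Returns std::nullopt rather than a -1 sentinel
// when the value is malformed, negative, or out of range.
std::optional<std::size_t> ParseSize(const std::string& text) {
  try {
    std::size_t consumed = 0;
    long long value = std::stoll(text, &consumed);
    // Reject trailing garbage ("12x") and negative values.
    if (consumed != text.size() || value < 0) return std::nullopt;
    return static_cast<std::size_t>(value);
  } catch (const std::invalid_argument&) {
    return std::nullopt;  // not a number at all
  } catch (const std::out_of_range&) {
    return std::nullopt;  // does not fit in long long
  }
}
```

A caller can then test `if (auto n = ParseSize(header))` instead of comparing against a magic value.
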
When cuMemObjGetDescriptor fails (e.g., buffer is regular heap memory
instead of GPU/pinned memory), fall back to HTTP transfer instead of
returning an error. This allows RDMA operations to gracefully degrade
when the caller provides non-RDMA-capable buffers.

Affected functions:
- GetObject(GetObjectRDMAArgs): Falls back to HTTP on registration failure
- PutObject(PutObjectRDMAArgs): Falls back to HTTP on registration failure
…iency_v2

Replace callback-based RDMA flow (objectPut/objectGet) with direct token
acquisition via cuMemObjGetRDMAToken/cuMemObjPutRDMAToken. This removes
the callback indirection and gives the application full control over
the S3 control-path HTTP request.

- nvidia-cuobjclient.h: Add cuMemObjGetRDMAToken, cuMemObjPutRDMAToken,
  shutdownTelemetry, and telemetry classes (cuObjTelem, cuObjSpan,
  cuObjTelem_ostream)
- nvidia-cufile.h: Update to upstream v1.17.0 — new error codes (37-50),
  P2P flags, async stream APIs, scatter/gather IO, version/topology APIs
- rdma.h: Replace objectPut/objectGet callbacks with rdmaPut/rdmaGet
  functions that accept a token string directly
- client.cc/baseclient.cc: Use token-based RDMA path with HTTP fallback
  when token acquisition fails or server declines RDMA
Vendor libcuobjclient.so.1.0.0, libcufile.so.1.17.0, and
libcufile_rdma.so.1.17.0 into vendor/cuobj/lib/ so the build
no longer depends on an external eos path. CMakeLists.txt and
configure.sh now reference ${CMAKE_SOURCE_DIR}/vendor/cuobj/lib.
The server expects the x-amz-rdma-token to be in the format:
  <81-char descriptor>:<hex_buf_addr>:<hex_size>;

cuMemObjGetRDMAToken returns only the 81-char descriptor. rdmaPut
and rdmaGet now accept the buffer pointer and append addr:size to
form the complete token the server can parse.
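
The token assembly described above can be sketched as follows. `BuildRdmaToken` is a hypothetical name, and the exact hex style (lowercase, no `0x` prefix) is an assumption; the commit only states the `<descriptor>:<hex_buf_addr>:<hex_size>;` layout.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical sketch: append ":<hex_buf_addr>:<hex_size>;" to the descriptor
// returned by the token-acquisition call. Lowercase hex without a "0x" prefix
// is an assumption made for this example.
std::string BuildRdmaToken(const std::string& descriptor, const void* buf,
                           std::size_t size) {
  std::ostringstream out;
  out << descriptor << ':' << std::hex
      << reinterpret_cast<std::uintptr_t>(buf) << ':' << size << ';';
  return out.str();
}
```
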
The httplib client used for RDMA S3 requests had no timeouts,
causing hangs when the server response was delayed. Add 5s
connection timeout and 10s read timeout to both rdmaPut and
rdmaGet httplib clients.
Update vendor/cuobj/lib/ from NVIDIA cuObject 1.0.0 to 1.2.0 client
and server libraries. libcufile / libcufile_rdma bumped from 1.17.0
to 1.18.0. libcuobjclient now NEEDs libcufile.so.0 (vs libcufile.so.1
in 1.17) so a new .so.0 symlink is added.

The 1.2 client library ships the cross-NIC failover machinery
(rdma_multipath_enabled, rdma_async_event_monitoring, health-check
thread). Header API surface is unchanged for the symbols
baseclient.cc and client.cc already use (cuMemObjGetDescriptor,
cuMemObjPutDescriptor, cuMemObjGetRDMAToken, cuMemObjPutRDMAToken,
getMemoryType); no source changes needed.
Add rdmaPutWithRetry / rdmaGetWithRetry helpers in
include/miniocpp/rdma.h that wrap cuMemObjGetRDMAToken +
rdmaPut/rdmaGet with a 2-attempt loop. The first failure surfaces
the bad NIC to libcuobjclient's async-event / health-check
threads; the second cuMemObjGetRDMAToken call then mints on the
backup NIC via the library's multipath state, turning a mid-flight
NIC failure into a successful RDMA op instead of a hard fail.

Every retry iteration releases the old token via
cuMemObjPutRDMAToken and mints a fresh one, so the same stale
token is never re-sent to the server. This avoids the callback-API
pitfall where the library re-fires the application callback with
the same cufileRDMAInfo_t* pointer (we don't use that API — ops
are always CUObjIOOps{}, driving the direct-token path).
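
The retry shape described above (mint a fresh token on every attempt, release it afterwards, never re-send a stale one) can be sketched with stand-in hooks. `RdmaHooks` and `TransferWithRetry` are hypothetical names; the real helpers call cuMemObjGetRDMAToken, cuMemObjPutRDMAToken, and rdmaPut/rdmaGet, whose signatures are not reproduced here.

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical hooks standing in for token acquisition, token release, and
// the control-path request. The real library signatures differ.
struct RdmaHooks {
  std::function<std::optional<std::string>()> acquire_token;
  std::function<void(const std::string&)> release_token;
  std::function<bool(const std::string&)> transfer;
};

// Returns true if a transfer succeeded within max_attempts. A false result
// tells the caller to fall back to the regular HTTP path.
bool TransferWithRetry(const RdmaHooks& hooks, int max_attempts = 2) {
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    std::optional<std::string> token = hooks.acquire_token();
    if (!token) continue;  // minting failed; the next attempt may succeed
    const bool ok = hooks.transfer(*token);
    hooks.release_token(*token);  // always release; a token is never reused
    if (ok) return true;
  }
  return false;
}
```
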

Wire the helpers into BaseClient::PutObject, BaseClient::UploadPart,
Client::GetObject(GetObjectRDMAArgs) and
Client::PutObject(PutObjectRDMAArgs). On exhausted retries or a
501 reply from the server, all four sites fall through to the
regular HTTP path instead of returning an error — matching the
existing Client::* wrapper behavior. BaseClient::PutObject and
UploadPart used to return an error immediately on any RDMA
failure; that is now a fallback.

vendor/cuobj/cuobj.json: flip rdma_multipath_enabled to true.
Safe on single-NIC hosts because libcufile validates prerequisites
at init time and falls back to single-path with a log line when
rdma_dev_addr_list has only one entry or backups fail to register.
Vendor cuda.h (CUDA driver header) from cuObject v3 1.2.0 into
vendor/cuobj/include/ and point MINIO_CPP_INCLUDES at it, replacing
the /usr/local/cuda/include path. The SDK's use of cuda.h is type-only
(CUdeviceptr etc. appearing in cuFile / cuObj API signatures), so no
CUDA runtime symbols are called from within minio-cpp itself.

Drop -L/usr/local/cuda/lib64 and -lcudart from MINIO_CPP_LIBS; libcufile
is already vendored under vendor/cuobj/lib/ and cudart is not needed by
the SDK.

Rewrite examples/GetPutRDMA.cc and examples/GPUHostDisk.cc to load
libcuda.so (the CUDA driver library shipped with every NVIDIA driver)
via dlopen/dlsym instead of linking against libcudart. Add prominent
comments in both examples clarifying that this dlopen shim is a
convenience for the examples only — production applications wanting
GPU Direct Storage should link against real CUDA APIs via the CUDA
Toolkit.

Add a 'CUDA dependency model' header comment to include/miniocpp/rdma.h
explaining that CUDA is strictly an application concern, not an SDK
concern: pinned-host workloads don't need CUDA at all, and GPU workloads
bring their own CUDA linkage.

Net: the SDK and all examples now build cleanly on hosts without the
CUDA Toolkit installed. Only the NVIDIA GPU driver (libcuda.so) is
needed at runtime, and only when the 'gpu' mode is actually exercised.
Three changes in the RDMA client path, all needed to make cuObject 1.2
multipath work reliably end-to-end against an S3 server:

1. rdma.h: extract the source-NIC GID from the RDMA token (field 7 of
   the cuObj descriptor per cuObjRDMADESCRprotocolformat.pdf) and pin
   the outgoing HTTP socket to that interface via httplib
   set_interface (CURLOPT_INTERFACE). Under RoundRobin multipath
   libcuobjclient may mint a token referencing the backup NIC while
   the kernel's default route sends HTTP out the primary NIC; without
   this binding the server's RDMA_READ targets a peer whose flow was
   never primed on this session, causing IBV_WC_RETRY_EXC_ERR and a
   3.7s stall.

2. client.h + client.cc: promote cuObjClient to a process-wide Meyers
   singleton exposed as Client::SharedRDMAClient(). libcuobjclient
   drives libcufile which maintains process-global state (device /
   peer cache, health monitor, multipath registration); running the
   constructor from multiple threads simultaneously on process startup
   tripped a glibc heap-corruption abort ("malloc(): invalid size
   (unsorted)"). Singleton init via [stmt.dcl]/4 is thread-safe and
   happens exactly once.

3. client.cc: Client::PutObject(PutObjectRDMAArgs) HTTP fallback was
   constructing PutObjectArgs without copying args.bucket / .object /
   .region, so fall-through requests from the RDMA path failed with
   "bucket name cannot be empty". Set those fields explicitly so the
   fallback delivers the same request over HTTP.
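
The singleton pattern named in item 2 can be illustrated in isolation. `FakeRdmaClient` is a hypothetical stand-in, not the SDK's cuObjClient wrapper; the sketch only shows that a function-local static is constructed once, however many callers reach it.

```cpp
#include <atomic>

// Hypothetical stand-in for a client whose constructor touches
// process-global state and therefore must run exactly once.
class FakeRdmaClient {
 public:
  // Meyers singleton: the function-local static is initialized on first use,
  // and concurrent first calls block until initialization completes (C++11).
  static FakeRdmaClient& Shared() {
    static FakeRdmaClient instance;
    return instance;
  }

  static int ConstructionCount() { return constructions_.load(); }

  FakeRdmaClient(const FakeRdmaClient&) = delete;
  FakeRdmaClient& operator=(const FakeRdmaClient&) = delete;

 private:
  FakeRdmaClient() { ++constructions_; }
  static std::atomic<int> constructions_;
};

std::atomic<int> FakeRdmaClient::constructions_{0};
```
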

Validated on a 2-NIC ConnectX-7 client (primary + backup) against a
single-NIC server, 3×3min warp putRDMA chaos runs (baseline,
primary-down, backup-down) and a 60s PUT+GET round-trip over 130k
objects: zero application-visible errors, ETag stable between RDMA
PUT, RDMA GET, and plain HTTP GET of the same objects.
Three related fixes on the RDMA path:

1. include/miniocpp/baseclient.h: add a public GetBaseUrl() accessor.
   FFI shims (minio-go's api-put-object.cpp PutObjectRDMA / GetObjectRDMA
   C glue) need to read the client-configured region at call time and
   stamp it onto PutObjectRDMAArgs / GetObjectRDMAArgs. base_url_ was
   protected with no getter, so the shim had no way to forward it short
   of widening the C API. A const accessor is the minimal change.

2. src/client.cc: Client::GetObject(GetObjectRDMAArgs) HTTP fallback was
   setting targs.bucket and targs.object but dropping the region we
   just resolved via GetRegion(). For buckets in a non-default region
   this meant the fallback signed against the wrong region; for the
   default region it cost a redundant GetRegion() roundtrip inside
   BaseClient::GetObject. Matches the PutObject(PutObjectRDMAArgs)
   fallback added in fd468d9.

3. src/args.cc: PutObjectRDMAArgs::Validate() and GetObjectRDMAArgs::
   Validate() now chain through ObjectArgs::Validate() to catch an
   empty bucket/object at the call site, rather than letting it slip
   into the RDMA path and surface as a confusing fallback failure.
   We chain to ObjectArgs rather than the more specific parent
   (GetObjectArgs/PutObjectBaseArgs) because the latter require fields
   the RDMA path doesn't use (datafunc / part_size).
Designated-initializer construction of s3_rdma_client_ctx at the four
single-shot Put/Get callsites omits std::string fields that are either
populated by the callee (etag, checksum) or only used in multipart
contexts (uploadId, partNumber). Without in-class defaults on those
strings, GCC raises -Wmissing-field-initializers on every callsite.

Give every member an explicit in-class default so the omissions are
intentional and silent. Functionally a no-op (std::string already
default-constructs to empty), but removes noise that was masking real
warnings during build.
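
A reduced illustration of the pattern described above, assuming C++20 designated initializers. `RdmaCtxSketch` is a hypothetical struct; only the etag, checksum, uploadId, and partNumber field names come from the commit message, and the real s3_rdma_client_ctx has more members.

```cpp
#include <string>

// Hypothetical, reduced context struct. Every member has an explicit
// in-class default, so omitting it from a designated initializer is
// visibly intentional.
struct RdmaCtxSketch {
  std::string bucket{};
  std::string object{};
  std::string etag{};        // populated by the callee
  std::string checksum{};    // populated by the callee
  std::string uploadId{};    // multipart only
  std::string partNumber{};  // multipart only
};

// Single-shot call site: names only the fields it actually supplies.
RdmaCtxSketch MakeSingleShotCtx(const std::string& bucket,
                                const std::string& object) {
  return RdmaCtxSketch{.bucket = bucket, .object = object};
}
```
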
Removes the ~10,700-line vendored copy of cpp-httplib (rdma-httplib.h)
and rewrites rdmaPut / rdmaGet against minio::http::Request, which the
rest of the SDK already uses (libcurl via curlpp).

The protocol shape (Content-Length: 0 on the response body, with the
actual transferred byte count delivered via x-amz-rdma-bytes-transferred)
is fully compatible with libcurl; the original httplib dependency was a
workaround for an earlier protocol revision in which Content-Length
itself was abused to carry the transferred count without a body.

While here, fix a latent correctness bug: rdmaGet was returning the
caller-supplied size unconditionally and never reading
x-amz-rdma-bytes-transferred, so ranged / partial-content GETs silently
misreported actual bytes transferred. We now parse and return the
server-reported count, falling back to the caller-supplied size only if
the header is absent (older server).
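
The byte-count resolution described above might look like the following. `ResolveBytesTransferred` is a hypothetical name, headers are assumed to be stored under lowercase keys, and treating a malformed header like an absent one is an assumption of this sketch; the commit only specifies the absent-header case.

```cpp
#include <cstddef>
#include <exception>
#include <map>
#include <string>

// Hypothetical sketch: prefer the server-reported count and fall back to the
// caller-supplied size when the header is absent (older server). Malformed
// values are also treated as absent, which is an assumption of this example.
std::size_t ResolveBytesTransferred(
    const std::map<std::string, std::string>& headers, std::size_t requested) {
  auto it = headers.find("x-amz-rdma-bytes-transferred");
  if (it == headers.end()) return requested;
  const std::string& value = it->second;
  // std::stoull accepts a leading '-' or whitespace, so require a digit first.
  if (value.empty() || value[0] < '0' || value[0] > '9') return requested;
  try {
    std::size_t consumed = 0;
    unsigned long long n = std::stoull(value, &consumed);
    if (consumed != value.size()) return requested;
    return static_cast<std::size_t>(n);
  } catch (const std::exception&) {
    return requested;  // out of range
  }
}
```
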

NIC pinning (CURLOPT_INTERFACE) and aggressive connect/read timeouts
required by the control plane are exposed as new fields on
http::Request so other call sites can reuse them.
The default build no longer requires libcufile, libcuobjclient,
libibverbs, or librdmacm on the host. Consumers who want the RDMA /
GPU Direct Storage API surface opt in with:

    cmake -DMINIO_CPP_ENABLE_RDMA=ON

When OFF (default):
  * The vendored NVIDIA include path and link line are skipped.
  * GetObjectRDMAArgs / PutObjectRDMAArgs and the
    Client::GetObject(GetObjectRDMAArgs) /
    Client::PutObject(PutObjectRDMAArgs) overloads are not declared.
  * The rdmaclient field on PutObjectArgs / PutObjectApiArgs /
    UploadPartArgs is omitted from the struct layout.
  * RDMA-only headers (nvidia-cufile.h, nvidia-cuobjclient.h, rdma.h)
    are not installed.
  * The GetPutRDMA and GPUHostDisk examples are not built.

When ON, target_compile_definitions(miniocpp PUBLIC MINIO_CPP_RDMA)
makes the macro available to downstream consumers so they pick up the
matching API surface from the installed headers.
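
A downstream consumer could branch on the exported macro roughly like this. `kRdmaCompiledIn` and `TransportDescription` are hypothetical names for illustration; only the MINIO_CPP_RDMA macro comes from the commit.

```cpp
#include <string>

// MINIO_CPP_RDMA is defined for consumers when the library was built with
// -DMINIO_CPP_ENABLE_RDMA=ON (propagated as a PUBLIC compile definition).
#ifdef MINIO_CPP_RDMA
constexpr bool kRdmaCompiledIn = true;
#else
constexpr bool kRdmaCompiledIn = false;
#endif

// Hypothetical helper: report which transports this build can use.
std::string TransportDescription() {
  return kRdmaCompiledIn ? "RDMA with HTTP fallback" : "HTTP only";
}
```
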

Verified clean separation on Linux: with RDMA off the resulting
libminiocpp.so has no NDR/cuFile/cuObj/ibverbs deps; with RDMA on it
links libcufile, libcuobjclient, libibverbs, librdmacm.
Adds vendor/cuobj/NOTICE explaining that the vendored cuda.h header,
the cuFile and cuObj shared libraries, and the public-facing API
headers (nvidia-cufile.h, nvidia-cuobjclient.h) all originate from
NVIDIA Corporation and remain subject to NVIDIA's software license
agreements — they are not covered by the minio-cpp Apache 2.0 grant.

README.md now points to that NOTICE and clarifies that the default
build omits the entire RDMA stack, so consumers who do not opt into
GPUDirect Storage / RDMA can safely ignore the NVIDIA terms.
…bles

Ships the verbatim NVIDIA CUDA Toolkit End User License Agreement as
vendor/cuobj/EULA.txt and rewrites the NOTICE to reference it directly,
including the §2.6 Attachment A entry that explicitly enumerates the
cuFile component (cufile.h, libcufile.so, libcufile_rdma.so) as
redistributable.

The same EULA covers the broader set of NVIDIA-derived headers and
shared libraries vendored under vendor/cuobj/ and reproduced under
include/miniocpp/ (cuda.h, nvidia-cufile.h, nvidia-cuobjclient.h).
Replaces the literal 192.168.1.1 sample under rdma_dev_addr_list with
<client-nic-ip> placeholder. Operators must substitute their local
NIC IPv4 before use.
…e 2.0 compatibility

Re-grounds vendor/cuobj/NOTICE on the actual CUDA Toolkit EULA §2.6
Attachment A entries that authorize redistribution of the vendored
files:

  * "NVIDIA CUDA File IO Libraries and Header" covers cufile.h,
    libcufile.so, libcufile_rdma.so (plus static variants).
  * "Accelerated CUDA Libraries for Object Storage" covers
    libcuobjclient.so, libcuobjserver.so, and the cuObj headers
    (cuobjclient.h, cuobjrdma.h, cuobjrdmaparam.h, cuobjserver.h,
    cuobjtelem.h).
  * "CUDA Headers for Runtime Compilation" covers cuda.h.

Also adds a section explaining why redistribution inside minio-cpp's
Apache 2.0 SDK does not violate EULA §1.2(5): Apache 2.0 is permissive,
not copyleft, and §4 of Apache 2.0 expressly permits shipping the
licensed work alongside components carrying different license terms.
The NVIDIA artifacts remain governed exclusively by EULA.txt; the
Apache 2.0 grant covers only minio-cpp's own source.
… + add C ABI

The RDMA path is now selected transparently by populating the new buf/size
fields on PutObjectArgs / GetObjectArgs. The dedicated PutObjectRDMAArgs /
GetObjectRDMAArgs structs and their corresponding Client::PutObject /
Client::GetObject overloads are deleted.

API change summary:

  PutObjectArgs
    + char*               buf  = nullptr     (RDMA / direct-buffer path)
    + std::optional<size_t> size              (required when buf is set)
    ~ std::istream*       stream             (was: std::istream&)

  GetObjectArgs
    + char*               buf  = nullptr     (RDMA / direct-buffer path)
    + std::optional<size_t> size              (required when buf is set)

  Validate() on both now requires exactly one of (stream|datafunc, buf).

  Deleted: struct PutObjectRDMAArgs, struct GetObjectRDMAArgs,
           Client::PutObject(PutObjectRDMAArgs),
           Client::GetObject(GetObjectRDMAArgs).

Behaviour preserved: the previous RDMA-specific method bodies are folded
inline into the unified Client::PutObject / Client::GetObject as a buf-mode
branch. When buf is set, the SDK still tries RDMA via SharedRDMAClient()
and falls back to a single-shot HTTP upload / streamed HTTP-into-buf
download on RDMA decline.

Source-compat note: the PutObjectArgs constructor still takes std::istream&
(now stored as a pointer internally). Callers that assigned to the .stream
field directly need to pass a pointer (&my_stream).
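
The "exactly one source" rule from the API summary can be sketched on a reduced struct. `PutArgsSketch` and `ValidateSource` are hypothetical; the real PutObjectArgs carries more fields, and the datafunc alternative is omitted here.

```cpp
#include <cstddef>
#include <optional>
#include <sstream>  // std::istream, std::istringstream
#include <string>

// Hypothetical, reduced stand-in for the unified PutObjectArgs.
struct PutArgsSketch {
  std::istream* stream = nullptr;   // regular HTTP path
  char* buf = nullptr;              // RDMA / direct-buffer path
  std::optional<std::size_t> size;  // required when buf is set
};

// Returns an empty string when the arguments are valid, otherwise a message.
std::string ValidateSource(const PutArgsSketch& args) {
  const bool has_stream = args.stream != nullptr;
  const bool has_buf = args.buf != nullptr;
  if (has_stream == has_buf) {
    return "exactly one of stream or buf must be set";
  }
  if (has_buf && !args.size.has_value()) {
    return "size is required when buf is set";
  }
  return "";
}
```

A caller that previously assigned a stream reference to `.stream` would now assign its address, for example `args.stream = &my_stream;`.
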

Also adds include/miniocpp/c_api.h + src/c_api.cc — a stable extern "C"
ABI exporting miniocpp_client_new/free, miniocpp_put_object,
miniocpp_get_object (both unified — buf!=NULL ⇒ RDMA, else callback
streaming), miniocpp_alloc_aligned/free_aligned, miniocpp_rdma_available,
miniocpp_last_error. Symbols carry visibility("default") and are gated under
MINIO_CPP_ENABLE_RDMA. This is the shared base that minio-go and minio-py
language bindings will dlopen against instead of vendoring per-language
C++ glue.
Switches the GPU-buffer demo from libcuda's driver API (cuCtxCreate +
cuMemAlloc) to libcudart's runtime API (cudaSetDevice + cudaMalloc), via
dlopen so the SDK build itself still has no cudart link-time dependency.

cudaMalloc runs cudart's static initialization, retrieves the device's
primary context, and registers the allocation with cudart's internal
P2P bookkeeping — none of which cuMemAlloc on a fresh cuCtxCreate'd
context does. This is the idiomatic pattern for GPUDirect Storage / RDMA
workloads regardless of whether it ultimately unblocks any specific
end-to-end RDMA flow.
@harshavardhana harshavardhana force-pushed the rdma-support branch 5 times, most recently from ede2dd8 to 8e75a37 on May 26, 2026 03:44
- Adds Linux/arm64 (ubuntu-24.04-arm) to the existing CI matrix
  alongside Linux/amd64, parametrized via matrix.config.arch.
- Splits vendor/cuobj/lib/ into x86_64/ and aarch64/ subdirs and
  adds the aarch64 cuFile / cuObjClient / cuObjServer libraries
  (cuObject resiliency_v3, version 1.2.0 / cuFile 1.18.0).
- CMake selects the right per-arch subdir from CMAKE_SYSTEM_PROCESSOR
  and verifies the cuObj client .so is present before linking.
  configure.sh drops the hardcoded -L flag.
- New workflow .github/workflows/ci-rdma.yml builds miniocpp with
  MINIO_CPP_ENABLE_RDMA=ON on Linux amd64 and arm64. Build-only,
  no server / no tests.
- Linux CI downloads the AIStor binary from
  dl.min.io/aistor/minio/release/linux-${arch}/ and passes the
  MINIO_LICENSE secret as an env var so the free-tier license
  activates at runtime. macOS and Windows continue to use
  community minio.
@harshavardhana harshavardhana merged commit 9bae8cd into minio:main May 26, 2026
7 checks passed
@harshavardhana harshavardhana deleted the rdma-support branch May 26, 2026 23:03