Add RDMA and NVIDIA GPU Direct Storage support by harshavardhana · Pull Request #214 · minio/minio-cpp

harshavardhana · 2026-05-26T02:02:49Z

Summary

Optional RDMA / NVIDIA GPU Direct Storage transport for PutObject / GetObject, gated behind a new CMake flag MINIO_CPP_ENABLE_RDMA (default OFF).
PutObjectArgs / GetObjectArgs carry optional RDMA fields, keeping the public API uniform for RDMA and non-RDMA callers.
C ABI in include/miniocpp/c_api.h for cross-language bindings (consumed by minio-go / minio-py wrappers).
HTTP-over-RDMA control path via curlpp, NIC-failover-aware retry, HTTP fallback when RDMA buffer registration fails, and CRC64NVME checksum support for RDMA multipart uploads.
vendor/cuobj/ carries NVIDIA cuFile / cuObjClient redistributables per Attachment A of the CUDA Toolkit EULA (EULA + NOTICE bundled), consumed only when MINIO_CPP_ENABLE_RDMA=ON.
Examples: GetPutRDMA (GPU-buffer demo, loads libcudart via dlopen so the SDK build has no cudart link-time dependency) and GPUHostDisk.

Test plan

Default build (MINIO_CPP_ENABLE_RDMA=OFF) builds and existing tests pass on Linux/macOS
RDMA-enabled build (-DMINIO_CPP_ENABLE_RDMA=ON) builds against vendored cuFile / cuObjClient
GetPutRDMA example runs end-to-end against a MinIO server with RDMA enabled
HTTP fallback path exercised when RDMA buffer registration fails
CRC64NVME checksum verified on RDMA multipart uploads

This commit adds support for RDMA (Remote Direct Memory Access) and NVIDIA GPU Direct Storage for high-performance data transfers.

Features:
- RDMA transport layer with AWS S3 SignV4 signing
- GetObject/PutObject operations via RDMA
- Multipart upload support with RDMA
- NVIDIA cuFile integration for GPU Direct Storage
- cuObjClient wrapper for RDMA operations (v0.7)
- New examples: GetPutRDMA, GPUHostDisk

Build changes:
- Tests and examples built by default (MINIO_CPP_TEST=ON)
- CUDA toolkit integration at /usr/local/cuda
- cuObjClient library linkage

The RDMA implementation uses objectPut/objectGet callbacks invoked by the cuFile RDMA layer for direct GPU-to-storage transfers.

- Use std::optional<size_t> for RDMA args size field instead of -1 sentinel
- Add region field to s3_rdma_client_ctx and pass resolved region
- Add null pointer checks for rdmaclient before dereferencing
- Add exception safety with try-catch for std::stoi/stoll in RDMA callbacks
- Add [[maybe_unused]] attribute for unused offset parameter in callbacks
- Fix pugixml set_value() to use c_str() for compatibility
- Add cudart linking for CUDA examples
- Use cuda_runtime.h instead of cuda.h in GetPutRDMA example

- Add checksum_crc64nvme field to UploadPartArgs, PutObjectArgs
- Add checksum field to s3_rdma_client_ctx_t for RDMA context
- Add x-amz-checksum-crc64nvme header in RDMA objectPut callback
- Return checksum in PutObjectResponse and UploadPartResponse
- Support checksum pass-through from Go to C++ RDMA layer
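For readers unfamiliar with the checksum named above, the following is a minimal bitwise sketch of CRC-64/NVME (reflected polynomial 0x9A6C9329AC4BC9B5, initial value and final XOR of all ones). The function name `Crc64Nvme` is hypothetical and is not taken from the SDK; the sketch only computes the raw 64-bit value and does not cover how the PR encodes it into the x-amz-checksum-crc64nvme header.

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical helper, not part of minio-cpp.
// Bitwise CRC-64/NVME: reflected polynomial, init = all ones, final XOR = all ones.
inline std::uint64_t Crc64Nvme(std::string_view data) {
  constexpr std::uint64_t kPolyReflected = 0x9A6C9329AC4BC9B5ULL;
  std::uint64_t crc = ~std::uint64_t{0};
  for (unsigned char byte : data) {
    crc ^= byte;
    // Process one input byte, least-significant bit first.
    for (int bit = 0; bit < 8; ++bit) {
      crc = (crc & 1) ? (crc >> 1) ^ kPolyReflected : (crc >> 1);
    }
  }
  return ~crc;
}
```

A table-driven or hardware-accelerated variant would be the usual choice for large parts; the bitwise loop is shown only because it is the shortest form that is easy to check by eye.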

When cuMemObjGetDescriptor fails (e.g., buffer is regular heap memory instead of GPU/pinned memory), fall back to HTTP transfer instead of returning an error. This allows RDMA operations to gracefully degrade when the caller provides non-RDMA-capable buffers.

Affected functions:
- GetObject(GetObjectRDMAArgs): Falls back to HTTP on registration failure
- PutObject(PutObjectRDMAArgs): Falls back to HTTP on registration failure

…iency_v2

Replace callback-based RDMA flow (objectPut/objectGet) with direct token acquisition via cuMemObjGetRDMAToken/cuMemObjPutRDMAToken. This removes the callback indirection and gives the application full control over the S3 control-path HTTP request.

- nvidia-cuobjclient.h: Add cuMemObjGetRDMAToken, cuMemObjPutRDMAToken, shutdownTelemetry, and telemetry classes (cuObjTelem, cuObjSpan, cuObjTelem_ostream)
- nvidia-cufile.h: Update to upstream v1.17.0: new error codes (37-50), P2P flags, async stream APIs, scatter/gather IO, version/topology APIs
- rdma.h: Replace objectPut/objectGet callbacks with rdmaPut/rdmaGet functions that accept a token string directly
- client.cc/baseclient.cc: Use token-based RDMA path with HTTP fallback when token acquisition fails or server declines RDMA

Vendor libcuobjclient.so.1.0.0, libcufile.so.1.17.0, and libcufile_rdma.so.1.17.0 into vendor/cuobj/lib/ so the build no longer depends on an external eos path. CMakeLists.txt and configure.sh now reference ${CMAKE_SOURCE_DIR}/vendor/cuobj/lib.

The server expects the x-amz-rdma-token to be in the format: <81-char descriptor>:<hex_buf_addr>:<hex_size>; cuMemObjGetRDMAToken returns only the 81-char descriptor. rdmaPut and rdmaGet now accept the buffer pointer and append addr:size to form the complete token the server can parse.
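The token assembly described above can be sketched as follows. This is an illustration under assumptions, not the code in rdma.h: the helper name `BuildRdmaToken` is hypothetical, and the choice of lowercase hex without a `0x` prefix is an assumption, since the text only says the address and size are hex.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper, not part of minio-cpp.
// Builds "<descriptor>:<hex_buf_addr>:<hex_size>" from the descriptor string
// (as returned by the token call) plus the caller's buffer pointer and length.
// Assumption: lowercase hex digits, no "0x" prefix, no trailing separator.
inline std::string BuildRdmaToken(const std::string& descriptor,
                                  const void* buf, std::size_t size) {
  std::ostringstream out;
  out << descriptor << ':' << std::hex
      << reinterpret_cast<std::uintptr_t>(buf) << ':' << size;
  return out.str();
}
```

With a descriptor of `desc`, a buffer at address 0x1000, and a size of 255, this sketch yields `desc:1000:ff`.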

The httplib client used for RDMA S3 requests had no timeouts, causing hangs when the server response was delayed. Add 5s connection timeout and 10s read timeout to both rdmaPut and rdmaGet httplib clients.

Update vendor/cuobj/lib/ from NVIDIA cuObject 1.0.0 to 1.2.0 client and server libraries. libcufile / libcufile_rdma bumped from 1.17.0 to 1.18.0. libcuobjclient now NEEDs libcufile.so.0 (vs libcufile.so.1 in 1.17) so a new .so.0 symlink is added. The 1.2 client library ships the cross-NIC failover machinery (rdma_multipath_enabled, rdma_async_event_monitoring, health-check thread). Header API surface is unchanged for the symbols baseclient.cc and client.cc already use (cuMemObjGetDescriptor, cuMemObjPutDescriptor, cuMemObjGetRDMAToken, cuMemObjPutRDMAToken, getMemoryType); no source changes needed.

Add rdmaPutWithRetry / rdmaGetWithRetry helpers in include/miniocpp/rdma.h that wrap cuMemObjGetRDMAToken + rdmaPut/rdmaGet with a 2-attempt loop. The first failure surfaces the bad NIC to libcuobjclient's async-event / health-check threads; the second cuMemObjGetRDMAToken call then mints on the backup NIC via the library's multipath state, turning a mid-flight NIC failure into a successful RDMA op instead of a hard fail.

Every retry iteration releases the old token via cuMemObjPutRDMAToken and mints a fresh one, so the same stale token is never re-sent to the server. This avoids the callback-API pitfall where the library re-fires the application callback with the same cufileRDMAInfo_t* pointer (we don't use that API; ops are always CUObjIOOps{}, driving the direct-token path).

Wire the helpers into BaseClient::PutObject, BaseClient::UploadPart, Client::GetObject(GetObjectRDMAArgs) and Client::PutObject(PutObjectRDMAArgs). On exhausted retries or a 501 reply from the server, all four sites fall through to the regular HTTP path instead of returning an error, matching the existing Client::* wrapper behavior. BaseClient::PutObject and UploadPart used to return an error immediately on any RDMA failure; that is now a fallback.

vendor/cuobj/cuobj.json: flip rdma_multipath_enabled to true. Safe on single-NIC hosts because libcufile validates prerequisites at init time and falls back to single-path with a log line when rdma_dev_addr_list has only one entry or backups fail to register.
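The retry shape described above (mint a fresh token per attempt, release it after use, report failure so the caller can fall back to HTTP) can be sketched like this. The `RdmaOps` struct and `TransferWithRetry` function are hypothetical stand-ins; the real helpers call cuMemObjGetRDMAToken, cuMemObjPutRDMAToken and rdmaPut/rdmaGet, whose signatures are not reproduced here.

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical stand-ins for the token and transfer calls named in the text.
struct RdmaOps {
  std::function<std::optional<std::string>()> mint_token;  // acquire a token
  std::function<void(const std::string&)> release_token;   // give it back
  std::function<bool(const std::string&)> transfer;        // one RDMA attempt
};

// Returns true when an RDMA attempt succeeded. A false return means
// "retries exhausted"; the caller is expected to fall through to plain HTTP.
inline bool TransferWithRetry(const RdmaOps& ops, int max_attempts = 2) {
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    std::optional<std::string> token = ops.mint_token();
    if (!token) continue;  // minting failed; the next attempt may succeed
    const bool ok = ops.transfer(*token);
    ops.release_token(*token);  // a token is never reused across attempts
    if (ok) return true;
  }
  return false;
}
```

Keeping the mint and release inside the loop body is the point of the design: each attempt sees whatever NIC the library currently considers healthy, and no stale token is ever re-sent.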

Vendor cuda.h (CUDA driver header) from cuObject v3 1.2.0 into vendor/cuobj/include/ and point MINIO_CPP_INCLUDES at it, replacing the /usr/local/cuda/include path. The SDK's use of cuda.h is type-only (CUdeviceptr etc. appearing in cuFile / cuObj API signatures), so no CUDA runtime symbols are called from within minio-cpp itself.

Drop -L/usr/local/cuda/lib64 and -lcudart from MINIO_CPP_LIBS; libcufile is already vendored under vendor/cuobj/lib/ and cudart is not needed by the SDK.

Rewrite examples/GetPutRDMA.cc and examples/GPUHostDisk.cc to load libcuda.so (the CUDA driver library shipped with every NVIDIA driver) via dlopen/dlsym instead of linking against libcudart. Add prominent comments in both examples clarifying that this dlopen shim is a convenience for the examples only; production applications wanting GPU Direct Storage should link against real CUDA APIs via the CUDA Toolkit.

Add a 'CUDA dependency model' header comment to include/miniocpp/rdma.h explaining that CUDA is strictly an application concern, not an SDK concern: pinned-host workloads don't need CUDA at all, and GPU workloads bring their own CUDA linkage.

Net: the SDK and all examples now build cleanly on hosts without the CUDA Toolkit installed. Only the NVIDIA GPU driver (libcuda.so) is needed at runtime, and only when the 'gpu' mode is actually exercised.

Three changes in the RDMA client path, all needed to make cuObject 1.2 multipath work reliably end-to-end against an S3 server:

1. rdma.h: extract the source-NIC GID from the RDMA token (field 7 of the cuObj descriptor per cuObjRDMADESCRprotocolformat.pdf) and pin the outgoing HTTP socket to that interface via httplib set_interface (CURLOPT_INTERFACE). Under RoundRobin multipath libcuobjclient may mint a token referencing the backup NIC while the kernel's default route sends HTTP out the primary NIC; without this binding the server's RDMA_READ targets a peer whose flow was never primed on this session, causing IBV_WC_RETRY_EXC_ERR and a 3.7s stall.
2. client.h + client.cc: promote cuObjClient to a process-wide Meyers singleton exposed as Client::SharedRDMAClient(). libcuobjclient drives libcufile which maintains process-global state (device / peer cache, health monitor, multipath registration); running the constructor from multiple threads simultaneously on process startup tripped a glibc heap-corruption abort ("malloc(): invalid size (unsorted)"). Singleton init via [stmt.dcl]/4 is thread-safe and happens exactly once.
3. client.cc: Client::PutObject(PutObjectRDMAArgs) HTTP fallback was constructing PutObjectArgs without copying args.bucket / .object / .region, so fall-through requests from the RDMA path failed with "bucket name cannot be empty". Set those fields explicitly so the fallback delivers the same request over HTTP.

Validated on a 2-NIC ConnectX-7 client (primary + backup) against a single-NIC server, 3×3min warp putRDMA chaos runs (baseline, primary-down, backup-down) and a 60s PUT+GET round-trip over 130k objects: zero application-visible errors, ETag stable between RDMA PUT, RDMA GET, and plain HTTP GET of the same objects.
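The Meyers singleton mentioned in item 2 relies on the C++11 guarantee that a function-local static is initialized exactly once, even when several threads reach it concurrently. A minimal sketch follows; `SharedClient` is a hypothetical stand-in for the real client type, and the construction counter exists only so the once-only property can be observed.

```cpp
#include <atomic>

// Hypothetical stand-in for a client that owns process-global state.
class SharedClient {
 public:
  // Function-local static: initialized on first call, exactly once,
  // with concurrent callers blocking until initialization completes.
  static SharedClient& Instance() {
    static SharedClient instance;
    return instance;
  }

  static int ConstructionCount() { return constructions_.load(); }

  SharedClient(const SharedClient&) = delete;
  SharedClient& operator=(const SharedClient&) = delete;

 private:
  SharedClient() { constructions_.fetch_add(1); }
  static inline std::atomic<int> constructions_{0};
};
```

Every call to `SharedClient::Instance()` returns a reference to the same object, so the constructor that touches global state cannot run twice or race with itself.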

Three related fixes on the RDMA path:

1. include/miniocpp/baseclient.h: add a public GetBaseUrl() accessor. FFI shims (minio-go's api-put-object.cpp PutObjectRDMA / GetObjectRDMA C glue) need to read the client-configured region at call time and stamp it onto PutObjectRDMAArgs / GetObjectRDMAArgs. base_url_ was protected with no getter, so the shim had no way to forward it short of widening the C API. A const accessor is the minimal change.
2. src/client.cc: Client::GetObject(GetObjectRDMAArgs) HTTP fallback was setting targs.bucket and targs.object but dropping the region we just resolved via GetRegion(). For buckets in a non-default region this meant the fallback signed against the wrong region; for the default region it cost a redundant GetRegion() roundtrip inside BaseClient::GetObject. Matches the PutObject(PutObjectRDMAArgs) fallback added in fd468d9.
3. src/args.cc: PutObjectRDMAArgs::Validate() and GetObjectRDMAArgs::Validate() now chain through ObjectArgs::Validate() to catch an empty bucket/object at the call site, rather than letting it slip into the RDMA path and surface as a confusing fallback failure. We chain to ObjectArgs rather than the more specific parent (GetObjectArgs/PutObjectBaseArgs) because the latter require fields the RDMA path doesn't use (datafunc / part_size).

Designated-initializer construction of s3_rdma_client_ctx at the four single-shot Put/Get callsites omits std::string fields that are either populated by the callee (etag, checksum) or only used in multipart contexts (uploadId, partNumber). Without in-class defaults on those strings, GCC raises -Wmissing-field-initializers on every callsite. Give every member an explicit in-class default so the omissions are intentional and silent. Functionally a no-op (std::string already default-constructs to empty), but removes noise that was masking real warnings during build.

Removes the ~10,700-line vendored copy of cpp-httplib (rdma-httplib.h) and rewrites rdmaPut / rdmaGet against minio::http::Request, which the rest of the SDK already uses (libcurl via curlpp). The protocol shape (Content-Length: 0 on the response body, with the actual transferred byte count delivered via x-amz-rdma-bytes-transferred) is fully compatible with libcurl; the original httplib dependency was a workaround for an earlier protocol revision in which Content-Length itself was abused to carry the transferred count without a body.

While here, fix a latent correctness bug: rdmaGet was returning the caller-supplied size unconditionally and never reading x-amz-rdma-bytes-transferred, so ranged / partial-content GETs silently misreported actual bytes transferred. We now parse and return the server-reported count, falling back to the caller-supplied size only if the header is absent (older server).

NIC pinning (CURLOPT_INTERFACE) and aggressive connect/read timeouts required by the control plane are exposed as new fields on http::Request so other call sites can reuse them.
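The byte-count fix can be sketched as a small header lookup. This is an illustration, not the code in rdma.h: the function name `BytesTransferred` and the lowercase-keyed `std::map` of headers are hypothetical, and treating a malformed value like an absent header is an assumption (the text only specifies the absent case).

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of minio-cpp.
// Returns the server-reported byte count, or requested_size when the
// x-amz-rdma-bytes-transferred header is absent (older server).
// Assumption: an unparsable value is handled the same way as a missing one.
inline std::size_t BytesTransferred(
    const std::map<std::string, std::string>& headers,
    std::size_t requested_size) {
  const auto it = headers.find("x-amz-rdma-bytes-transferred");
  if (it == headers.end()) return requested_size;
  const std::string& value = it->second;
  // Reject empty values, signs and leading whitespace up front.
  if (value.empty() || value[0] < '0' || value[0] > '9') return requested_size;
  try {
    std::size_t consumed = 0;
    const unsigned long long count = std::stoull(value, &consumed);
    if (consumed != value.size()) return requested_size;  // trailing junk
    return static_cast<std::size_t>(count);
  } catch (const std::exception&) {
    return requested_size;  // out of range
  }
}
```

For a ranged GET where the caller asked for 1024 bytes and the server reports 512, the sketch returns 512 instead of echoing the request size.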

The default build no longer requires libcufile, libcuobjclient, libibverbs, or librdmacm on the host. Consumers who want the RDMA / GPU Direct Storage API surface opt in with:

    cmake -DMINIO_CPP_ENABLE_RDMA=ON

When OFF (default):
* The vendored NVIDIA include path and link line are skipped.
* GetObjectRDMAArgs / PutObjectRDMAArgs and the Client::GetObject(GetObjectRDMAArgs) / Client::PutObject(PutObjectRDMAArgs) overloads are not declared.
* The rdmaclient field on PutObjectArgs / PutObjectApiArgs / UploadPartArgs is omitted from the struct layout.
* RDMA-only headers (nvidia-cufile.h, nvidia-cuobjclient.h, rdma.h) are not installed.
* The GetPutRDMA and GPUHostDisk examples are not built.

When ON, target_compile_definitions(miniocpp PUBLIC MINIO_CPP_RDMA) makes the macro available to downstream consumers so they pick up the matching API surface from the installed headers.

Verified clean separation on Linux: with RDMA off the resulting libminiocpp.so has no NDR/cuFile/cuObj/ibverbs deps; with RDMA on it links libcufile, libcuobjclient, libibverbs, librdmacm.
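Because the macro changes struct layout, a consumer has to compile with the same setting as the library, which is why the definition is exported as PUBLIC. The following sketch shows the gating pattern only; `UploadArgsSketch` and `RdmaCompiledIn` are hypothetical names, and only the macro MINIO_CPP_RDMA comes from the text.

```cpp
// Hypothetical struct illustrating a field that exists only in RDMA builds.
struct UploadArgsSketch {
  const char* bucket = nullptr;
  const char* object = nullptr;
#ifdef MINIO_CPP_RDMA
  void* rdmaclient = nullptr;  // present only when the RDMA surface is enabled
#endif
};

// Lets callers branch on the build flavor without their own #ifdef.
constexpr bool RdmaCompiledIn() {
#ifdef MINIO_CPP_RDMA
  return true;
#else
  return false;
#endif
}
```

If a consumer compiled this header without the macro while linking a library built with it, the two sides would disagree on `sizeof(UploadArgsSketch)`, which is the mismatch the exported definition is meant to prevent.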

Adds vendor/cuobj/NOTICE explaining that the vendored cuda.h header, the cuFile and cuObj shared libraries, and the public-facing API headers (nvidia-cufile.h, nvidia-cuobjclient.h) all originate from NVIDIA Corporation and remain subject to NVIDIA's software license agreements — they are not covered by the minio-cpp Apache 2.0 grant. README.md now points to that NOTICE and clarifies that the default build omits the entire RDMA stack, so consumers who do not opt into GPUDirect Storage / RDMA can safely ignore the NVIDIA terms.

…bles Ships the verbatim NVIDIA CUDA Toolkit End User License Agreement as vendor/cuobj/EULA.txt and rewrites the NOTICE to reference it directly, including the §2.6 Attachment A entry that explicitly enumerates the cuFile component (cufile.h, libcufile.so, libcufile_rdma.so) as redistributable. The same EULA covers the broader set of NVIDIA-derived headers and shared libraries vendored under vendor/cuobj/ and reproduced under include/miniocpp/ (cuda.h, nvidia-cufile.h, nvidia-cuobjclient.h).

Replaces the literal 192.168.1.1 sample under rdma_dev_addr_list with <client-nic-ip> placeholder. Operators must substitute their local NIC IPv4 before use.

…e 2.0 compatibility

Re-grounds vendor/cuobj/NOTICE on the actual CUDA Toolkit EULA §2.6 Attachment A entries that authorize redistribution of the vendored files:
* "NVIDIA CUDA File IO Libraries and Header" covers cufile.h, libcufile.so, libcufile_rdma.so (plus static variants).
* "Accelerated CUDA Libraries for Object Storage" covers libcuobjclient.so, libcuobjserver.so, and the cuObj headers (cuobjclient.h, cuobjrdma.h, cuobjrdmaparam.h, cuobjserver.h, cuobjtelem.h).
* "CUDA Headers for Runtime Compilation" covers cuda.h.

Also adds a section explaining why redistribution inside minio-cpp's Apache 2.0 SDK does not violate EULA §1.2(5): Apache 2.0 is permissive, not copyleft, and §4 of Apache 2.0 expressly permits shipping the licensed work alongside components carrying different license terms. The NVIDIA artifacts remain governed exclusively by EULA.txt; the Apache 2.0 grant covers only minio-cpp's own source.

… + add C ABI

The RDMA path is now selected transparently by populating the new buf/size fields on PutObjectArgs / GetObjectArgs. The dedicated PutObjectRDMAArgs / GetObjectRDMAArgs structs and their corresponding Client::PutObject / Client::GetObject overloads are deleted.

API change summary:

PutObjectArgs
  + char* buf = nullptr (RDMA / direct-buffer path)
  + std::optional<size_t> size (required when buf is set)
  ~ std::istream* stream (was: std::istream&)
GetObjectArgs
  + char* buf = nullptr (RDMA / direct-buffer path)
  + std::optional<size_t> size (required when buf is set)

Validate() on both now requires exactly one of (stream|datafunc, buf).

Deleted: struct PutObjectRDMAArgs, struct GetObjectRDMAArgs, Client::PutObject(PutObjectRDMAArgs), Client::GetObject(GetObjectRDMAArgs).

Behaviour preserved: the previous RDMA-specific method bodies are folded inline into the unified Client::PutObject / Client::GetObject as a buf-mode branch. When buf is set, the SDK still tries RDMA via SharedRDMAClient() and falls back to a single-shot HTTP upload / streamed HTTP-into-buf download on RDMA decline.

Source-compat note: the PutObjectArgs constructor still takes std::istream& (now stored as a pointer internally). Callers that assigned to the .stream field directly need to pass a pointer (&my_stream).

Also adds include/miniocpp/c_api.h + src/c_api.cc, a stable extern "C" ABI exporting miniocpp_client_new/free, miniocpp_put_object, miniocpp_get_object (both unified: buf != NULL ⇒ RDMA, else callback streaming), miniocpp_alloc_aligned/free_aligned, miniocpp_rdma_available, miniocpp_last_error. Symbols carry visibility("default") and are gated under MINIO_CPP_ENABLE_RDMA. This is the shared base that minio-go and minio-py language bindings will dlopen against instead of vendoring per-language C++ glue.
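The "exactly one of (stream|datafunc, buf)" rule, together with "size is required when buf is set", can be sketched as a standalone check. `PutArgsSketch` and `ValidateSource` are hypothetical; the field names mirror the summary above, but the datafunc signature and the error strings are assumptions, and the real Validate() in src/args.cc performs further checks not shown here.

```cpp
#include <cstddef>
#include <functional>
#include <istream>
#include <optional>
#include <string>

// Hypothetical mirror of the data-source fields listed in the summary.
struct PutArgsSketch {
  std::istream* stream = nullptr;
  std::function<bool(const char*, std::size_t)> datafunc;  // assumed signature
  char* buf = nullptr;
  std::optional<std::size_t> size;
};

// Returns an empty string when the source selection is valid,
// otherwise a description of what is wrong.
inline std::string ValidateSource(const PutArgsSketch& args) {
  const bool has_streaming =
      args.stream != nullptr || static_cast<bool>(args.datafunc);
  const bool has_buf = args.buf != nullptr;
  if (has_streaming == has_buf) {
    return "exactly one of (stream|datafunc, buf) must be set";
  }
  if (has_buf && !args.size.has_value()) {
    return "size is required when buf is set";
  }
  return {};
}
```

Using `std::optional<std::size_t>` for the size keeps "not provided" distinct from a legitimate zero-length buffer, which is the same reason an earlier commit in this PR replaced the -1 sentinel.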

Switches the GPU-buffer demo from libcuda's driver API (cuCtxCreate + cuMemAlloc) to libcudart's runtime API (cudaSetDevice + cudaMalloc), via dlopen so the SDK build itself still has no cudart link-time dependency. cudaMalloc runs cudart's static initialization, retrieves the device's primary context, and registers the allocation with cudart's internal P2P bookkeeping — none of which cuMemAlloc on a fresh cuCtxCreate'd context does. This is the idiomatic pattern for GPUDirect Storage / RDMA workloads regardless of whether it ultimately unblocks any specific end-to-end RDMA flow.

- Adds Linux/arm64 (ubuntu-24.04-arm) to the existing CI matrix alongside Linux/amd64, parametrized via matrix.config.arch.
- Splits vendor/cuobj/lib/ into x86_64/ and aarch64/ subdirs and adds the aarch64 cuFile / cuObjClient / cuObjServer libraries (cuObject resiliency_v3, version 1.2.0 / cuFile 1.18.0).
- CMake selects the right per-arch subdir from CMAKE_SYSTEM_PROCESSOR and verifies the cuObj client .so is present before linking. configure.sh drops the hardcoded -L flag.
- New workflow .github/workflows/ci-rdma.yml builds miniocpp with MINIO_CPP_ENABLE_RDMA=ON on Linux amd64 and arm64. Build-only, no server / no tests.
- Linux CI downloads the AIStor binary from dl.min.io/aistor/minio/release/linux-${arch}/ and passes the MINIO_LICENSE secret as an env var so the free-tier license activates at runtime. macOS and Windows continue to use community minio.

harshavardhana added 23 commits May 21, 2026 11:48

fix: style issues

22d4122

fix: Remove trailing semicolon from RDMA token format

2129843

fix: Add httplib connection/read timeouts for RDMA GET/PUT

b506d43


vendor: scrub sample IP from cuobj.json to placeholder

c275017


harshavardhana force-pushed the rdma-support branch from 4f73d5d to e333282 Compare May 26, 2026 02:07

harshavardhana force-pushed the rdma-support branch 5 times, most recently from ede2dd8 to 8e75a37 Compare May 26, 2026 03:44

harshavardhana force-pushed the rdma-support branch from 8e75a37 to a5881cf Compare May 26, 2026 03:57

harshavardhana force-pushed the rdma-support branch from a5881cf to 2074462 Compare May 26, 2026 04:12

harshavardhana merged commit 9bae8cd into minio:main May 26, 2026
7 checks passed

harshavardhana deleted the rdma-support branch May 26, 2026 23:03

harshavardhana mentioned this pull request May 27, 2026

fix multipart RDMA: propagate rdmaclient, per-part CRC64NVME, complete XML #226

Open

3 tasks

harshavardhana merged 25 commits into minio:main from miniohq:rdma-support
