Add RDMA and NVIDIA GPU Direct Storage support #214
Merged
This commit adds support for RDMA (Remote Direct Memory Access) and NVIDIA GPU Direct Storage for high-performance data transfers.
Features:
- RDMA transport layer with AWS S3 SignV4 signing
- GetObject/PutObject operations via RDMA
- Multipart upload support with RDMA
- NVIDIA cuFile integration for GPU Direct Storage
- cuObjClient wrapper for RDMA operations (v0.7)
- New examples: GetPutRDMA, GPUHostDisk
Build changes:
- Tests and examples built by default (MINIO_CPP_TEST=ON)
- CUDA toolkit integration at /usr/local/cuda
- cuObjClient library linkage
The RDMA implementation uses objectPut/objectGet callbacks invoked by the cuFile RDMA layer for direct GPU-to-storage transfers.
- Use std::optional<size_t> for RDMA args size field instead of -1 sentinel
- Add region field to s3_rdma_client_ctx and pass resolved region
- Add null pointer checks for rdmaclient before dereferencing
- Add exception safety with try-catch for std::stoi/stoll in RDMA callbacks
- Add [[maybe_unused]] attribute for unused offset parameter in callbacks
- Fix pugixml set_value() to use c_str() for compatibility
- Add cudart linking for CUDA examples
- Use cuda_runtime.h instead of cuda.h in GetPutRDMA example
- Add checksum_crc64nvme field to UploadPartArgs, PutObjectArgs
- Add checksum field to s3_rdma_client_ctx_t for RDMA context
- Add x-amz-checksum-crc64nvme header in RDMA objectPut callback
- Return checksum in PutObjectResponse and UploadPartResponse
- Support checksum pass-through from Go to C++ RDMA layer
When cuMemObjGetDescriptor fails (e.g., buffer is regular heap memory instead of GPU/pinned memory), fall back to HTTP transfer instead of returning an error. This allows RDMA operations to gracefully degrade when the caller provides non-RDMA-capable buffers.
Affected functions:
- GetObject(GetObjectRDMAArgs): Falls back to HTTP on registration failure
- PutObject(PutObjectRDMAArgs): Falls back to HTTP on registration failure
…iency_v2
Replace callback-based RDMA flow (objectPut/objectGet) with direct token acquisition via cuMemObjGetRDMAToken/cuMemObjPutRDMAToken. This removes the callback indirection and gives the application full control over the S3 control-path HTTP request.
- nvidia-cuobjclient.h: Add cuMemObjGetRDMAToken, cuMemObjPutRDMAToken, shutdownTelemetry, and telemetry classes (cuObjTelem, cuObjSpan, cuObjTelem_ostream)
- nvidia-cufile.h: Update to upstream v1.17.0: new error codes (37-50), P2P flags, async stream APIs, scatter/gather IO, version/topology APIs
- rdma.h: Replace objectPut/objectGet callbacks with rdmaPut/rdmaGet functions that accept a token string directly
- client.cc/baseclient.cc: Use token-based RDMA path with HTTP fallback when token acquisition fails or server declines RDMA
Vendor libcuobjclient.so.1.0.0, libcufile.so.1.17.0, and
libcufile_rdma.so.1.17.0 into vendor/cuobj/lib/ so the build
no longer depends on an external eos path. CMakeLists.txt and
configure.sh now reference ${CMAKE_SOURCE_DIR}/vendor/cuobj/lib.
The server expects the x-amz-rdma-token to be in the format: <81-char descriptor>:<hex_buf_addr>:<hex_size>; cuMemObjGetRDMAToken returns only the 81-char descriptor. rdmaPut and rdmaGet now accept the buffer pointer and append addr:size to form the complete token the server can parse.
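The token assembly described above can be sketched as follows. This is a minimal illustration, not the code in rdma.h: the helper name BuildRdmaToken is hypothetical, and the exact hex formatting (lowercase, no "0x" prefix) is an assumption that the commit message does not spell out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: append ":<hex_buf_addr>:<hex_size>" to the descriptor
// returned by cuMemObjGetRDMAToken. Lowercase hex without a "0x" prefix is
// an assumption; the server's parser defines the real format.
std::string BuildRdmaToken(const std::string& descriptor, const void* buf,
                           std::size_t size) {
  std::ostringstream out;
  out << descriptor << ':' << std::hex
      << reinterpret_cast<std::uintptr_t>(buf)  // buffer address as hex
      << ':' << size;                           // std::hex is sticky, so size is hex too
  return out.str();
}
```

For example, an 81-character descriptor with a buffer at address 0x7f00 and a size of 4096 bytes would yield the descriptor followed by ":7f00:1000" under these assumptions.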
The httplib client used for RDMA S3 requests had no timeouts, causing hangs when the server response was delayed. Add 5s connection timeout and 10s read timeout to both rdmaPut and rdmaGet httplib clients.
Update vendor/cuobj/lib/ from NVIDIA cuObject 1.0.0 to 1.2.0 client and server libraries. libcufile / libcufile_rdma bumped from 1.17.0 to 1.18.0. libcuobjclient now NEEDs libcufile.so.0 (vs libcufile.so.1 in 1.17) so a new .so.0 symlink is added. The 1.2 client library ships the cross-NIC failover machinery (rdma_multipath_enabled, rdma_async_event_monitoring, health-check thread). Header API surface is unchanged for the symbols baseclient.cc and client.cc already use (cuMemObjGetDescriptor, cuMemObjPutDescriptor, cuMemObjGetRDMAToken, cuMemObjPutRDMAToken, getMemoryType); no source changes needed.
Add rdmaPutWithRetry / rdmaGetWithRetry helpers in
include/miniocpp/rdma.h that wrap cuMemObjGetRDMAToken +
rdmaPut/rdmaGet with a 2-attempt loop. The first failure surfaces
the bad NIC to libcuobjclient's async-event / health-check
threads; the second cuMemObjGetRDMAToken call then mints on the
backup NIC via the library's multipath state, turning a mid-flight
NIC failure into a successful RDMA op instead of a hard fail.
Every retry iteration releases the old token via
cuMemObjPutRDMAToken and mints a fresh one, so the same stale
token is never re-sent to the server. This avoids the callback-API
pitfall where the library re-fires the application callback with
the same cufileRDMAInfo_t* pointer (we don't use that API — ops
are always CUObjIOOps{}, driving the direct-token path).
Wire the helpers into BaseClient::PutObject, BaseClient::UploadPart,
Client::GetObject(GetObjectRDMAArgs) and
Client::PutObject(PutObjectRDMAArgs). On exhausted retries or a
501 reply from the server, all four sites fall through to the
regular HTTP path instead of returning an error — matching the
existing Client::* wrapper behavior. BaseClient::PutObject and
UploadPart used to return an error immediately on any RDMA
failure; that is now a fallback.
vendor/cuobj/cuobj.json: flip rdma_multipath_enabled to true.
Safe on single-NIC hosts because libcufile validates prerequisites
at init time and falls back to single-path with a log line when
rdma_dev_addr_list has only one entry or backups fail to register.
Vendor cuda.h (CUDA driver header) from cuObject v3 1.2.0 into vendor/cuobj/include/ and point MINIO_CPP_INCLUDES at it, replacing the /usr/local/cuda/include path. The SDK's use of cuda.h is type-only (CUdeviceptr etc. appearing in cuFile / cuObj API signatures), so no CUDA runtime symbols are called from within minio-cpp itself.
Drop -L/usr/local/cuda/lib64 and -lcudart from MINIO_CPP_LIBS; libcufile is already vendored under vendor/cuobj/lib/ and cudart is not needed by the SDK.
Rewrite examples/GetPutRDMA.cc and examples/GPUHostDisk.cc to load libcuda.so (the CUDA driver library shipped with every NVIDIA driver) via dlopen/dlsym instead of linking against libcudart. Add prominent comments in both examples clarifying that this dlopen shim is a convenience for the examples only; production applications wanting GPU Direct Storage should link against real CUDA APIs via the CUDA Toolkit.
Add a 'CUDA dependency model' header comment to include/miniocpp/rdma.h explaining that CUDA is strictly an application concern, not an SDK concern: pinned-host workloads don't need CUDA at all, and GPU workloads bring their own CUDA linkage.
Net: the SDK and all examples now build cleanly on hosts without the CUDA Toolkit installed. Only the NVIDIA GPU driver (libcuda.so) is needed at runtime, and only when the 'gpu' mode is actually exercised.
Three changes in the RDMA client path, all needed to make cuObject 1.2
multipath work reliably end-to-end against an S3 server:
1. rdma.h: extract the source-NIC GID from the RDMA token (field 7 of
the cuObj descriptor per cuObjRDMADESCRprotocolformat.pdf) and pin
the outgoing HTTP socket to that interface via httplib
set_interface (CURLOPT_INTERFACE). Under RoundRobin multipath
libcuobjclient may mint a token referencing the backup NIC while
the kernel's default route sends HTTP out the primary NIC; without
this binding the server's RDMA_READ targets a peer whose flow was
never primed on this session, causing IBV_WC_RETRY_EXC_ERR and a
3.7s stall.
2. client.h + client.cc: promote cuObjClient to a process-wide Meyers
singleton exposed as Client::SharedRDMAClient(). libcuobjclient
drives libcufile which maintains process-global state (device /
peer cache, health monitor, multipath registration); running the
constructor from multiple threads simultaneously on process startup
tripped a glibc heap-corruption abort ("malloc(): invalid size
(unsorted)"). Singleton init via [stmt.dcl]/4 is thread-safe and
happens exactly once.
3. client.cc: Client::PutObject(PutObjectRDMAArgs) HTTP fallback was
constructing PutObjectArgs without copying args.bucket / .object /
.region, so fall-through requests from the RDMA path failed with
"bucket name cannot be empty". Set those fields explicitly so the
fallback delivers the same request over HTTP.
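The singleton in item 2 follows the usual function-local-static pattern. A minimal sketch, with FakeRdmaClient as a hypothetical stand-in for cuObjClient (whose constructor arguments are not shown here):

```cpp
#include <atomic>
#include <cassert>

// Counts constructions so the "exactly once" property is observable.
std::atomic<int> g_constructed{0};

// Hypothetical stand-in for cuObjClient; the real class wraps
// process-global libcufile state and must not be built concurrently.
class FakeRdmaClient {
 public:
  FakeRdmaClient() { ++g_constructed; }
  FakeRdmaClient(const FakeRdmaClient&) = delete;
  FakeRdmaClient& operator=(const FakeRdmaClient&) = delete;
};

// Meyers singleton: since C++11 the first call initializes `instance`
// exactly once, and concurrent callers block until that initialization
// has finished.
FakeRdmaClient& SharedClient() {
  static FakeRdmaClient instance;
  return instance;
}
```

Every caller gets a reference to the same object, so the constructor that tripped the heap-corruption abort can no longer run on two threads at once.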
Validated on a 2-NIC ConnectX-7 client (primary + backup) against a
single-NIC server, 3×3min warp putRDMA chaos runs (baseline,
primary-down, backup-down) and a 60s PUT+GET round-trip over 130k
objects: zero application-visible errors, ETag stable between RDMA
PUT, RDMA GET, and plain HTTP GET of the same objects.
Three related fixes on the RDMA path:
1. include/miniocpp/baseclient.h: add a public GetBaseUrl() accessor. FFI shims (minio-go's api-put-object.cpp PutObjectRDMA / GetObjectRDMA C glue) need to read the client-configured region at call time and stamp it onto PutObjectRDMAArgs / GetObjectRDMAArgs. base_url_ was protected with no getter, so the shim had no way to forward it short of widening the C API. A const accessor is the minimal change.
2. src/client.cc: Client::GetObject(GetObjectRDMAArgs) HTTP fallback was setting targs.bucket and targs.object but dropping the region we just resolved via GetRegion(). For buckets in a non-default region this meant the fallback signed against the wrong region; for the default region it cost a redundant GetRegion() roundtrip inside BaseClient::GetObject. Matches the PutObject(PutObjectRDMAArgs) fallback added in fd468d9.
3. src/args.cc: PutObjectRDMAArgs::Validate() and GetObjectRDMAArgs::Validate() now chain through ObjectArgs::Validate() to catch an empty bucket/object at the call site, rather than letting it slip into the RDMA path and surface as a confusing fallback failure. We chain to ObjectArgs rather than the more specific parent (GetObjectArgs/PutObjectBaseArgs) because the latter require fields the RDMA path doesn't use (datafunc / part_size).
Designated-initializer construction of s3_rdma_client_ctx at the four single-shot Put/Get callsites omits std::string fields that are either populated by the callee (etag, checksum) or only used in multipart contexts (uploadId, partNumber). Without in-class defaults on those strings, GCC raises -Wmissing-field-initializers on every callsite. Give every member an explicit in-class default so the omissions are intentional and silent. Functionally a no-op (std::string already default-constructs to empty), but removes noise that was masking real warnings during build.
Removes the ~10,700-line vendored copy of cpp-httplib (rdma-httplib.h) and rewrites rdmaPut / rdmaGet against minio::http::Request, which the rest of the SDK already uses (libcurl via curlpp). The protocol shape (Content-Length: 0 on the response body, with the actual transferred byte count delivered via x-amz-rdma-bytes-transferred) is fully compatible with libcurl; the original httplib dependency was a workaround for an earlier protocol revision in which Content-Length itself was abused to carry the transferred count without a body.
While here, fix a latent correctness bug: rdmaGet was returning the caller-supplied size unconditionally and never reading x-amz-rdma-bytes-transferred, so ranged / partial-content GETs silently misreported actual bytes transferred. We now parse and return the server-reported count, falling back to the caller-supplied size only if the header is absent (older server).
NIC pinning (CURLOPT_INTERFACE) and aggressive connect/read timeouts required by the control plane are exposed as new fields on http::Request so other call sites can reuse them.
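The byte-count fix can be illustrated with a small sketch. The Headers alias and the TransferredBytes name are hypothetical (the SDK has its own header container), and treating a malformed header value the same as an absent one is an assumption that this message does not state.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <map>
#include <string>

// Assumption: header names are stored lowercase in a plain map.
using Headers = std::map<std::string, std::string>;

// Returns the server-reported byte count, or `requested` when the header
// is absent (older server). Malformed values also fall back to
// `requested`; that choice is an assumption of this sketch.
std::size_t TransferredBytes(const Headers& headers, std::size_t requested) {
  const auto it = headers.find("x-amz-rdma-bytes-transferred");
  if (it == headers.end()) return requested;
  const std::string& value = it->second;
  // std::stoull accepts a leading '-' and wraps, so require a digit first.
  if (value.empty() || value[0] < '0' || value[0] > '9') return requested;
  try {
    std::size_t consumed = 0;
    const unsigned long long n = std::stoull(value, &consumed);
    if (consumed == value.size()) return static_cast<std::size_t>(n);
  } catch (const std::exception&) {
    // Out-of-range value: fall through to the caller-supplied size.
  }
  return requested;
}
```

With this shape a ranged GET that asked for 100 bytes but received 42 reports 42, while a response without the header still reports the requested 100.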
The default build no longer requires libcufile, libcuobjclient,
libibverbs, or librdmacm on the host. Consumers who want the RDMA /
GPU Direct Storage API surface opt in with:
cmake -DMINIO_CPP_ENABLE_RDMA=ON
When OFF (default):
* The vendored NVIDIA include path and link line are skipped.
* GetObjectRDMAArgs / PutObjectRDMAArgs and the
Client::GetObject(GetObjectRDMAArgs) /
Client::PutObject(PutObjectRDMAArgs) overloads are not declared.
* The rdmaclient field on PutObjectArgs / PutObjectApiArgs /
UploadPartArgs is omitted from the struct layout.
* RDMA-only headers (nvidia-cufile.h, nvidia-cuobjclient.h, rdma.h)
are not installed.
* The GetPutRDMA and GPUHostDisk examples are not built.
When ON, target_compile_definitions(miniocpp PUBLIC MINIO_CPP_RDMA)
makes the macro available to downstream consumers so they pick up the
matching API surface from the installed headers.
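Why the macro has to reach downstream consumers can be seen in a small sketch. Only MINIO_CPP_RDMA comes from this change; UploadOptions and kRdmaCompiledIn are hypothetical names used for illustration.

```cpp
#include <cassert>

// True only when the translation unit is compiled with -DMINIO_CPP_RDMA,
// which the PUBLIC compile definition propagates to consumers.
constexpr bool kRdmaCompiledIn =
#ifdef MINIO_CPP_RDMA
    true;
#else
    false;
#endif

// Hypothetical args struct: the RDMA field exists only in RDMA builds,
// so the struct layout differs between the two configurations.
struct UploadOptions {
  const char* bucket = "";
#ifdef MINIO_CPP_RDMA
  void* rdmaclient = nullptr;  // omitted from the layout when RDMA is off
#endif
};
```

If a consumer compiled this header without the macro against a library built with it, the two sides would disagree on sizeof(UploadOptions); exporting the definition through the CMake target avoids that mismatch.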
Verified clean separation on Linux: with RDMA off the resulting
libminiocpp.so has no NDR/cuFile/cuObj/ibverbs deps; with RDMA on it
links libcufile, libcuobjclient, libibverbs, librdmacm.
Adds vendor/cuobj/NOTICE explaining that the vendored cuda.h header, the cuFile and cuObj shared libraries, and the public-facing API headers (nvidia-cufile.h, nvidia-cuobjclient.h) all originate from NVIDIA Corporation and remain subject to NVIDIA's software license agreements — they are not covered by the minio-cpp Apache 2.0 grant. README.md now points to that NOTICE and clarifies that the default build omits the entire RDMA stack, so consumers who do not opt into GPUDirect Storage / RDMA can safely ignore the NVIDIA terms.
…bles
Ships the verbatim NVIDIA CUDA Toolkit End User License Agreement as vendor/cuobj/EULA.txt and rewrites the NOTICE to reference it directly, including the §2.6 Attachment A entry that explicitly enumerates the cuFile component (cufile.h, libcufile.so, libcufile_rdma.so) as redistributable. The same EULA covers the broader set of NVIDIA-derived headers and shared libraries vendored under vendor/cuobj/ and reproduced under include/miniocpp/ (cuda.h, nvidia-cufile.h, nvidia-cuobjclient.h).
Replaces the literal 192.168.1.1 sample under rdma_dev_addr_list with <client-nic-ip> placeholder. Operators must substitute their local NIC IPv4 before use.
…e 2.0 compatibility
Re-grounds vendor/cuobj/NOTICE on the actual CUDA Toolkit EULA §2.6
Attachment A entries that authorize redistribution of the vendored
files:
* "NVIDIA CUDA File IO Libraries and Header" covers cufile.h,
libcufile.so, libcufile_rdma.so (plus static variants).
* "Accelerated CUDA Libraries for Object Storage" covers
libcuobjclient.so, libcuobjserver.so, and the cuObj headers
(cuobjclient.h, cuobjrdma.h, cuobjrdmaparam.h, cuobjserver.h,
cuobjtelem.h).
* "CUDA Headers for Runtime Compilation" covers cuda.h.
Also adds a section explaining why redistribution inside minio-cpp's
Apache 2.0 SDK does not violate EULA §1.2(5): Apache 2.0 is permissive,
not copyleft, and §4 of Apache 2.0 expressly permits shipping the
licensed work alongside components carrying different license terms.
The NVIDIA artifacts remain governed exclusively by EULA.txt; the
Apache 2.0 grant covers only minio-cpp's own source.
… + add C ABI
The RDMA path is now selected transparently by populating the new buf/size
fields on PutObjectArgs / GetObjectArgs. The dedicated PutObjectRDMAArgs /
GetObjectRDMAArgs structs and their corresponding Client::PutObject /
Client::GetObject overloads are deleted.
API change summary:
  PutObjectArgs
    + char* buf = nullptr            (RDMA / direct-buffer path)
    + std::optional<size_t> size     (required when buf is set)
    ~ std::istream* stream           (was: std::istream&)
  GetObjectArgs
    + char* buf = nullptr            (RDMA / direct-buffer path)
    + std::optional<size_t> size     (required when buf is set)
  Validate() on both now requires exactly one of (stream|datafunc, buf).
  Deleted: struct PutObjectRDMAArgs, struct GetObjectRDMAArgs,
           Client::PutObject(PutObjectRDMAArgs),
           Client::GetObject(GetObjectRDMAArgs).
Behaviour preserved: the previous RDMA-specific method bodies are folded
inline into the unified Client::PutObject / Client::GetObject as a buf-mode
branch. When buf is set, the SDK still tries RDMA via SharedRDMAClient()
and falls back to a single-shot HTTP upload / streamed HTTP-into-buf
download on RDMA decline.
Source-compat note: the PutObjectArgs constructor still takes std::istream&
(now stored as a pointer internally). Callers that assigned to the .stream
field directly need to pass a pointer (&my_stream).
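The "exactly one of" rule and the pointer-valued stream field can be sketched as below. PutArgsSketch is a simplified, hypothetical mirror of the unified args (it omits datafunc, bucket, object and so on), and returning an error string from Validate() is an assumption rather than the SDK's actual error type.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <sstream>
#include <string>

// Simplified, hypothetical mirror of the unified PutObjectArgs.
struct PutArgsSketch {
  std::istream* stream = nullptr;   // streaming path (now a pointer)
  char* buf = nullptr;              // RDMA / direct-buffer path
  std::optional<std::size_t> size;  // required when buf is set

  // Returns an empty string when the args are valid, else a description.
  std::string Validate() const {
    const bool has_stream = stream != nullptr;
    const bool has_buf = buf != nullptr;
    if (has_stream == has_buf) {
      return "exactly one of stream or buf must be set";
    }
    if (has_buf && !size) return "size is required when buf is set";
    return "";
  }
};
```

A caller that previously assigned a stream reference to the field would now write args.stream = &my_stream; setting both stream and buf, or neither, is rejected before any request is made.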
Also adds include/miniocpp/c_api.h + src/c_api.cc — a stable extern "C"
ABI exporting miniocpp_client_new/free, miniocpp_put_object,
miniocpp_get_object (both unified — buf!=NULL ⇒ RDMA, else callback
streaming), miniocpp_alloc_aligned/free_aligned, miniocpp_rdma_available,
miniocpp_last_error. Symbols carry visibility("default") and are gated under
MINIO_CPP_ENABLE_RDMA. This is the shared base that minio-go and minio-py
language bindings will dlopen against instead of vendoring per-language
C++ glue.
Switches the GPU-buffer demo from libcuda's driver API (cuCtxCreate + cuMemAlloc) to libcudart's runtime API (cudaSetDevice + cudaMalloc), via dlopen so the SDK build itself still has no cudart link-time dependency. cudaMalloc runs cudart's static initialization, retrieves the device's primary context, and registers the allocation with cudart's internal P2P bookkeeping — none of which cuMemAlloc on a fresh cuCtxCreate'd context does. This is the idiomatic pattern for GPUDirect Storage / RDMA workloads regardless of whether it ultimately unblocks any specific end-to-end RDMA flow.
- Adds Linux/arm64 (ubuntu-24.04-arm) to the existing CI matrix
alongside Linux/amd64, parametrized via matrix.config.arch.
- Splits vendor/cuobj/lib/ into x86_64/ and aarch64/ subdirs and
adds the aarch64 cuFile / cuObjClient / cuObjServer libraries
(cuObject resiliency_v3, version 1.2.0 / cuFile 1.18.0).
- CMake selects the right per-arch subdir from CMAKE_SYSTEM_PROCESSOR
and verifies the cuObj client .so is present before linking.
configure.sh drops the hardcoded -L flag.
- New workflow .github/workflows/ci-rdma.yml builds miniocpp with
MINIO_CPP_ENABLE_RDMA=ON on Linux amd64 and arm64. Build-only,
no server / no tests.
- Linux CI downloads the AIStor binary from
dl.min.io/aistor/minio/release/linux-${arch}/ and passes the
MINIO_LICENSE secret as an env var so the free-tier license
activates at runtime. macOS and Windows continue to use
community minio.
Summary
- PutObject/GetObject, gated behind a new CMake flag MINIO_CPP_ENABLE_RDMA (default OFF).
- PutObjectArgs/GetObjectArgs carry optional RDMA fields, keeping the public API uniform for RDMA and non-RDMA callers.
- include/miniocpp/c_api.h for cross-language bindings (consumed by minio-go / minio-py wrappers).
- vendor/cuobj/ carries NVIDIA cuFile / cuObjClient redistributables per Attachment A of the CUDA Toolkit EULA (EULA + NOTICE bundled), consumed only when MINIO_CPP_ENABLE_RDMA=ON.
- GetPutRDMA (GPU-buffer demo, loads libcudart via dlopen so the SDK build has no cudart link-time dependency) and GPUHostDisk.
Test plan
- MINIO_CPP_ENABLE_RDMA=OFF) builds and existing tests pass on Linux/macOS
- -DMINIO_CPP_ENABLE_RDMA=ON) builds against vendored cuFile / cuObjClient
- GetPutRDMA example runs end-to-end against a MinIO server with RDMA enabled