test: xfail Windows MCDM mempool OOM setup failures #2000
Conversation
Work around nvbugs5815123 by treating OOM returns from mempool setup in affected tests as expected failures on Windows. Unsupported configurations still skip normally, while other platforms continue to fail on unexpected OOMs. Made-with: Cursor
Use NVML to confirm the CUDA device is running on Windows MCDM before treating mempool OOM setup failures as expected. If the MCDM check cannot be completed, leave the original test failure visible. Made-with: Cursor
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

/ok to test
@Andy-Jost @rluo8 for visibility. Thanks @rluo8 for sending me the logs!
Move the Windows MCDM detection and mempool OOM xfail handling into a shared test helper so cuda.bindings and cuda.core tests use the same workaround logic. Made-with: Cursor
Let the MCDM detector report only the detected state and keep the broad fallback in the mempool OOM xfail path, where detection failures should leave the original test failure visible. Made-with: Cursor
Andy-Jost left a comment:
Comments only, no blockers.
```python
return False
import cuda.bindings.nvml as nvml
...
device_id = int(device.device_id if hasattr(device, "device_id") else device)
```
nit: getattr(device, "device_id", device)?
Done: commit 20c8a7a
Thanks for catching this!
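For reference, the `getattr` one-liner suggested above accepts either a device object or a raw ordinal without an explicit branch. A minimal standalone sketch (the `Device` class here is a hypothetical stand-in, not the real cuda.core class):

```python
class Device:
    """Hypothetical stand-in for a device object carrying a device_id."""

    def __init__(self, device_id):
        self.device_id = device_id


def normalize_device_id(device):
    # getattr falls back to the object itself when device_id is absent,
    # so device objects and raw ordinals take the same code path.
    return int(getattr(device, "device_id", device))


print(normalize_device_id(Device(3)))  # device object -> 3
print(normalize_device_id(1))          # raw ordinal   -> 1
```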
```python
current, _ = nvml.device_get_driver_model_v2(handle)
return current == nvml.DriverModel.DRIVER_MCDM
finally:
    nvml.shutdown()
```
Doesn't this assume that nvml was uninitialized on entry to this function? Would it break callers that initialized nvml?
I checked the NVML API contract directly instead of relying on memory. The short answer is that nvmlInit_v2() and nvmlShutdown() are reference-counted, so the balanced nvml.init_v2() / nvml.shutdown() pair in our helper should not break callers that had already initialized NVML.
The most relevant NVIDIA doc is the current NVML "Initialization and Cleanup" page: https://docs.nvidia.com/deploy/nvml-api/group__nvmlInitializationAndCleanup.html.
Cursor generated supporting details:
- The current NVML docs for `nvmlInit_v2()` say: "A reference count of the number of initializations is maintained. Shutdown only occurs when the reference count reaches zero."
- The current NVML docs for `nvmlShutdown()` say: "This method should be called ... once for each call to `nvmlInit_v2()`. A reference count of the number of initializations is maintained. Shutdown only occurs when the reference count reaches zero."
- The same current docs also say this applies "For all products." Separately, the NVML API reference lists Windows as a supported OS platform, so there is no indication that the ref-count behavior is Linux-only.
- The archived R525 docs use the same ref-count language, which suggests this is not a recent or unstable contract.
- Our `cuda.bindings.nvml` layer is a thin pass-through here: `init_v2()` calls `nvmlInit_v2()` directly and `shutdown()` calls `nvmlShutdown()` directly, so there is no extra Python-side lifecycle logic changing the semantics.
- The generated binding text in `cuda_bindings/cuda/bindings/nvml.pyx` also reflects the same contract: `ERROR_ALREADY_INITIALIZED` is described as deprecated because "Multiple initializations are now allowed through ref counting."
- The repo already encodes this assumption in `cuda_bindings/tests/nvml/test_init.py`, whose `test_init_ref_count()` explicitly exercises repeated `init_v2()`/`shutdown()` calls and checks that NVML remains initialized until the matching final shutdown. That test is skipped on Windows, so it is not direct Windows coverage, but it does show the intended interpretation inside this repo.
- One unrelated wrinkle: the current docs say extra `nvmlShutdown()` calls beyond the init count are tolerated for backwards compatibility, while our local test expects `UninitializedError` on a naked `shutdown()`. That mismatch is worth keeping in mind, but it does not affect this helper because the helper uses a balanced init/shutdown pair.
```diff
  attr_list = [None] * 8
  err, pool = cuda.cuMemPoolCreate(poolProps)
+ xfail_if_mempool_oom(err, "cuMemPoolCreate", poolProps.location.id)
```
I was (perhaps naively) expecting the xfail logic to appear as a decorator on the test itself, or, at worst, a context manager. I guess that's not practical?
Helper-based local skip/xfail logic is a widely-used pattern in this repo, especially under cuda_core/tests. Also, I believe for this specific issue, a local helper is the right tool for the job because the condition is only knowable after a specific CUDA API call returns a specific failure on a specific runtime configuration.
Cursor generated supporting details:
- Under `cuda_core/tests`, runtime gating is frequently factored into helpers and fixtures rather than only using decorators. Examples include `skip_if_pinned_memory_unsupported()` and `skip_if_managed_memory_unsupported()` in `cuda_core/tests/conftest.py`, plus local helpers like `_skip_if_no_mempool()` / `_skip_if_no_managed_mempool()` in `cuda_core/tests/graph/test_graph_definition.py`, similar `_skip_if_no_mempool()` helpers in several other graph/object-protocol modules, and fixture-style runtime gates like `skip_if_no_tma` in `cuda_core/tests/test_tensor_map.py`.
- So I think it is fair to describe helper-based runtime skip/xfail logic as a commonly used pattern under `cuda-python`, with the strongest examples living in `cuda_core/tests`.
- Decorators are most natural when the condition is static up front: platform, version, missing import, permanently absent feature, etc. Here the interesting condition is narrower: a particular mempool setup call fails with the known Windows MCDM OOM-like failure. A decorator would tend to mark the whole test based on environment rather than on the actually observed failure.
- A context manager is more plausible than a decorator, but still not a great fit here because `cuda_bindings/tests` is largely return-code driven. The test gets an `err` back from the CUDA API and then decides what to do. In that style, a helper like `xfail_if_mempool_oom(err, api_name, device)` is more natural than building an exception-oriented context manager around a return-code check.
- The local helper also keeps the `xfail` narrowly scoped. Unaffected systems still pass normally, affected systems only `xfail` when the known bug actually reproduces, and once the underlying issue is fixed the test can begin passing immediately instead of remaining broadly pre-marked.
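As a rough illustration of that return-code-driven helper style, here is a hedged sketch. The numeric error value, the `is_windows_mcdm` parameter, and the exact signature are hypothetical; the real helper lives in the shared test module and consults NVML itself:

```python
import pytest

CUDA_ERROR_OUT_OF_MEMORY = 2  # illustrative value, not taken from the real headers


def xfail_if_mempool_oom(err, api_name, device_id, *, is_windows_mcdm):
    """Sketch: convert only the known Windows MCDM OOM into an xfail.

    Any other error code (or success) is left untouched, so the caller's
    normal assertions still run and unrelated regressions stay visible.
    """
    if is_windows_mcdm and err == CUDA_ERROR_OUT_OF_MEMORY:
        pytest.xfail(
            f"{api_name} returned OOM on Windows MCDM device {device_id} "
            "(nvbugs5815123)"
        )


# On an unaffected system (or a success code), the helper is a no-op:
xfail_if_mempool_oom(0, "cuMemPoolCreate", 0, is_windows_mcdm=False)
```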
I agree this follows the existing pattern. I'd be interested in exploring options to diminish the reliance on these helpers.
At this particular line of code, errors are being checked manually, so a helper makes sense. More broadly, it would be better if the tests could be written directly and some other mechanism could translate failures into skips or xfails, as needed. An aspiration.
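One possible shape for that aspiration is a context manager that lets the test body be written directly and translates only a specific expected failure into an xfail. This is purely illustrative; nothing like it exists in the repo, and the exception type shown is a placeholder:

```python
from contextlib import contextmanager

import pytest


@contextmanager
def xfail_on(exc_type, reason):
    """Run the wrapped test code as written; convert only the expected
    failure type into an xfail, letting everything else propagate."""
    try:
        yield
    except exc_type:
        pytest.xfail(reason)


# Usage sketch: the test body stays direct, the translation happens outside it.
# with xfail_on(MemoryError, "known Windows MCDM mempool OOM (nvbugs5815123)"):
#     pool = create_mempool()
```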
```diff
  skip_if_pinned_memory_unsupported(device)
- mr = PinnedMemoryResource()
+ mr = create_pinned_memory_resource_or_skip(xfail_device=device)
```
This sequence looks odd. It checks whether PinnedMemoryResource is supported, but even when that check passes an additional "guarded construction" is necessary.
Ah, sorry, I was too rushed last night and got mixed up with the skip/xfail name change. Fixed in these commits:
The diff to main is smaller now, and I believe the "first skip if the capability is missing, then xfail if there is a known issue" sequence here looks much clearer now.
Use getattr for the shared mempool helper so it accepts device objects and raw ordinals without extra branching. Co-authored-by: Cursor <cursoragent@cursor.com>
Keep the established managed-memory test helper name so call sites stay readable, while documenting that Windows MCDM mempool OOM setup failures are xfailed rather than skipped. Co-authored-by: Cursor <cursoragent@cursor.com>
Clarify pinned mempool test setup by keeping skip for capability checks and using xfail naming for the Windows MCDM constructor workaround. Co-authored-by: Cursor <cursoragent@cursor.com>
Oh, I'm stumbling over the new helper missing in the older `cuda.bindings` artifacts. I'll put this PR back in draft mode while I figure out how to resolve that problem.
Allow cuda_core tests to run against older cuda.bindings artifacts by falling back when the mempool xfail helper is unavailable, so collection succeeds without the new OOM xfail behavior. Co-authored-by: Cursor <cursoragent@cursor.com>
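The fallback described in that commit can be sketched as an import guard. The import path shown is hypothetical; the point is only that collection succeeds against older bindings, with the helper degrading to a no-op:

```python
try:
    # Newer cuda.bindings ships the shared helper (hypothetical import path).
    from cuda.bindings._test_helpers import xfail_if_mempool_oom
except ImportError:
    def xfail_if_mempool_oom(err, api_name, device_id):
        # Older bindings: degrade to a no-op so test collection still
        # succeeds; the OOM xfail behavior is simply absent, and callers
        # fall through to their normal error assertions.
        return None
```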
/ok to test
Summary
This PR hardens a small set of CUDA memory pool tests on Windows MCDM systems where the device reports memory pool support, but the actual memory pool setup path can still fail with `CUDA_ERROR_OUT_OF_MEMORY`.

Instead of treating those environment-specific setup failures as ordinary test failures, the affected tests now mark them as expected failures only when:

- the test is running on Windows,
- NVML confirms the CUDA device is running under the MCDM driver model, and
- the mempool setup call actually returns `CUDA_ERROR_OUT_OF_MEMORY`.

If the MCDM check cannot be completed, the original test failure remains visible.
Why `xfail`

`xfail` is the best fit here because the feature is expected to be supported by the device, and the tests should pass once the underlying platform behavior is resolved. A skip would make the tests look unsupported or not applicable, while an xfail preserves the signal that this is a known failing path and will turn into an XPASS if the behavior starts working.

The xfail is also scoped narrowly. Other platforms, non-MCDM Windows configurations, and unexpected errors still fail normally.
Alternatives Considered
Approximate the risk from device properties.
We could estimate whether a system is likely to fail based on platform, driver model, and device memory size. This is useful for diagnostics, but it is not reliable enough for test gating because it predicts an internal allocation behavior rather than observing the actual API result.
Probe in a subprocess before running the tests.
A subprocess probe would avoid changing the parent pytest process state. However, it only proves that the operation succeeded in a fresh process. These failures can depend on process state and test ordering, so a subprocess result would not reliably predict whether the real test process can still perform the same setup later.
Use driver logging to identify the allocation behavior.
Logs can help diagnose the failure after it happens, but they require external setup and are not a practical in-test precondition. They also do not provide a stable public API for deciding whether a test should run.
The implemented approach uses the actual API result from the test path and only xfails the known Windows MCDM out-of-memory case. That keeps the workaround close to the behavior being tested and avoids hiding unrelated regressions.
Test Plan
6f3ac7f.