test: xfail Windows MCDM mempool OOM setup failures #2000

Merged: rwgk merged 11 commits into NVIDIA:main from rwgk:nvbugs5815123_xfail_if_mempool_oom, May 1, 2026

Conversation

@rwgk rwgk (Contributor) commented May 1, 2026

Summary

This PR hardens a small set of CUDA memory pool tests on Windows MCDM systems where the device reports memory pool support, but the actual memory pool setup path can still fail with CUDA_ERROR_OUT_OF_MEMORY.

Instead of treating those environment-specific setup failures as ordinary test failures, the affected tests now mark them as expected failures only when:

  • the failing call is one of the memory pool setup paths covered by the tests,
  • the error is an out-of-memory result, and
  • NVML confirms that the CUDA device is running under Windows MCDM.

If the MCDM check cannot be completed, the original test failure remains visible.
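For illustration, the guard's shape is roughly the following. This is a minimal sketch, not the exact PR code: the driver import path and the _is_windows_mcdm() check are assumptions, while the helper name and arguments match the snippets discussed in the review below.

import pytest
from cuda.bindings import driver as cuda  # assumed import path for the CUDA driver bindings


def xfail_if_mempool_oom(err, api_name, device):
    # Convert only the known environment-specific failure into an expected failure.
    if err != cuda.CUresult.CUDA_ERROR_OUT_OF_MEMORY:
        return  # success or an unrelated error: let the test proceed or fail normally
    try:
        on_mcdm = _is_windows_mcdm(device)  # hypothetical NVML-based check; sketched in the review thread below
    except Exception:
        return  # MCDM check could not be completed: keep the original test failure visible
    if not on_mcdm:
        return  # not Windows MCDM: keep the original failure visible
    pytest.xfail(f"{api_name} returned CUDA_ERROR_OUT_OF_MEMORY on Windows MCDM (nvbugs 5815123)")

In the tests, the guard is called immediately after the relevant setup call returns its error code, as shown in the review snippets further down.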

Why xfail

xfail is the best fit here because the feature is expected to be supported by the device, and the tests should pass once the underlying platform behavior is resolved. A skip would make the feature look unsupported or not applicable on these systems, while an xfail preserves the signal that this is a known failing path and will turn into an XPASS once the behavior starts working.

The xfail is also scoped narrowly. Other platforms, non-MCDM Windows configurations, and unexpected errors still fail normally.

Alternatives Considered

  1. Approximate the risk from device properties.
    We could estimate whether a system is likely to fail based on platform, driver model, and device memory size. This is useful for diagnostics, but it is not reliable enough for test gating because it predicts an internal allocation behavior rather than observing the actual API result.

  2. Probe in a subprocess before running the tests.
    A subprocess probe would avoid changing the parent pytest process state. However, it only proves that the operation succeeded in a fresh process. These failures can depend on process state and test ordering, so a subprocess result would not reliably predict whether the real test process can still perform the same setup later.

  3. Use driver logging to identify the allocation behavior.
    Logs can help diagnose the failure after it happens, but they require external setup and are not a practical in-test precondition. They also do not provide a stable public API for deciding whether a test should run.

The implemented approach uses the actual API result from the test path and only xfails the known Windows MCDM out-of-memory case. That keeps the workaround close to the behavior being tested and avoids hiding unrelated regressions.

Test Plan

  • Public CI passed for commit 6f3ac7f.
  • Additional target-system validation is pending before rerunning the full public CI.

rwgk added 2 commits April 30, 2026 17:33
Work around nvbugs5815123 by treating OOM returns from mempool setup in affected tests as expected failures on Windows. Unsupported configurations still skip normally, while other platforms continue to fail on unexpected OOMs.

Made-with: Cursor
Use NVML to confirm the CUDA device is running on Windows MCDM before treating mempool OOM setup failures as expected. If the MCDM check cannot be completed, leave the original test failure visible.

Made-with: Cursor
@rwgk rwgk added this to the "cuda.bindings next" milestone May 1, 2026
@rwgk rwgk self-assigned this May 1, 2026
@rwgk rwgk added the P0 (High priority - Must do!), test (Improvements or additions to tests), cuda.bindings (Everything related to the cuda.bindings module), and cuda.core (Everything related to the cuda.core module) labels May 1, 2026
copy-pr-bot (Bot) commented May 1, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@rwgk rwgk (Contributor, Author) commented May 1, 2026

/ok to test

@rwgk rwgk (Contributor, Author) commented May 1, 2026

@Andy-Jost @rluo8 for visibility

Thanks @rluo8 for sending me the logs!

rwgk added 3 commits April 30, 2026 19:59
Move the Windows MCDM detection and mempool OOM xfail handling into a shared test helper so cuda.bindings and cuda.core tests use the same workaround logic.

Made-with: Cursor
Let the MCDM detector report only the detected state and keep the broad fallback in the mempool OOM xfail path, where detection failures should leave the original test failure visible.

Made-with: Cursor
@rwgk rwgk changed the title from "WIP: nvbugs5815123_xfail_if_mempool_oom" to "test: xfail Windows MCDM mempool OOM setup failures" May 1, 2026
@rwgk rwgk marked this pull request as ready for review May 1, 2026 03:45
@rwgk rwgk requested a review from Andy-Jost May 1, 2026 03:45
@Andy-Jost Andy-Jost (Contributor) left a comment

Comments only, no blockers.

return False
import cuda.bindings.nvml as nvml

device_id = int(device.device_id if hasattr(device, "device_id") else device)
Contributor

nit: getattr(device, "device_id", device)?

Contributor Author

Done: commit 20c8a7a

Thanks for catching this!

current, _ = nvml.device_get_driver_model_v2(handle)
return current == nvml.DriverModel.DRIVER_MCDM
finally:
nvml.shutdown()
Contributor

Doesn't this assume that nvml was uninitialized on entry to this function? Would it break callers that initialized nvml?

Contributor Author

I checked the NVML API contract directly instead of relying on recollection. The short answer is that nvmlInit_v2() and nvmlShutdown() are reference-counted, so the balanced nvml.init_v2() / nvml.shutdown() pair in our helper should not break callers that had already initialized NVML.

The most relevant NVIDIA doc is the current NVML "Initialization and Cleanup" page: https://docs.nvidia.com/deploy/nvml-api/group__nvmlInitializationAndCleanup.html.

Cursor generated supporting details:

  • The current NVML docs for nvmlInit_v2() say: "A reference count of the number of initializations is maintained. Shutdown only occurs when the reference count reaches zero."
  • The current NVML docs for nvmlShutdown() say: "This method should be called ... once for each call to nvmlInit_v2(). A reference count of the number of initializations is maintained. Shutdown only occurs when the reference count reaches zero."
  • The same current docs also say this applies "For all products." Separately, the NVML API reference lists Windows as a supported OS platform, so there is no indication that the ref-count behavior is Linux-only.
  • The archived R525 docs use the same ref-count language, which suggests this is not a recent or unstable contract.
  • Our cuda.bindings.nvml layer is a thin pass-through here: init_v2() calls nvmlInit_v2() directly and shutdown() calls nvmlShutdown() directly, so there is no extra Python-side lifecycle logic changing the semantics.
  • The generated binding text in cuda_bindings/cuda/bindings/nvml.pyx also reflects the same contract: ERROR_ALREADY_INITIALIZED is described as deprecated because "Multiple initializations are now allowed through ref counting."
  • The repo already encodes this assumption in cuda_bindings/tests/nvml/test_init.py, whose test_init_ref_count() explicitly exercises repeated init_v2() / shutdown() calls and checks that NVML remains initialized until the matching final shutdown. That test is skipped on Windows, so it is not direct Windows coverage, but it does show the intended interpretation inside this repo.
  • One unrelated wrinkle: the current docs say extra nvmlShutdown() calls beyond the init count are tolerated for backwards compatibility, while our local test expects UninitializedError on a naked shutdown(). That mismatch is worth keeping in mind, but it does not affect this helper because the helper uses a balanced init/shutdown pair.
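For reference, the balanced pattern under discussion looks roughly like this. It is a minimal sketch: init_v2(), shutdown(), device_get_driver_model_v2(), and DriverModel.DRIVER_MCDM appear in the snippets above, while the handle-lookup name is an assumption.

import cuda.bindings.nvml as nvml


def _is_windows_mcdm(device) -> bool:
    nvml.init_v2()  # ref-counted, so this is safe even if the caller already initialized NVML
    try:
        device_id = int(getattr(device, "device_id", device))
        handle = nvml.device_get_handle_by_index(device_id)  # assumed name for the handle lookup
        current, _ = nvml.device_get_driver_model_v2(handle)
        return current == nvml.DriverModel.DRIVER_MCDM
    finally:
        nvml.shutdown()  # balances the init_v2() above by decrementing the NVML reference count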


attr_list = [None] * 8
err, pool = cuda.cuMemPoolCreate(poolProps)
xfail_if_mempool_oom(err, "cuMemPoolCreate", poolProps.location.id)
Contributor

I was (perhaps naively) expecting the xfail logic to appear as a decorator on the test itself, or, at worst, a context manager. I guess that's not practical?

Contributor Author

Helper-based local skip/xfail logic is a widely-used pattern in this repo, especially under cuda_core/tests. Also, I believe for this specific issue, a local helper is the right tool for the job because the condition is only knowable after a specific CUDA API call returns a specific failure on a specific runtime configuration.

Cursor generated supporting details:

  • Under cuda_core/tests, runtime gating is frequently factored into helpers and fixtures rather than only using decorators. Examples include skip_if_pinned_memory_unsupported() and skip_if_managed_memory_unsupported() in cuda_core/tests/conftest.py, plus local helpers like _skip_if_no_mempool() / _skip_if_no_managed_mempool() in cuda_core/tests/graph/test_graph_definition.py, similar _skip_if_no_mempool() helpers in several other graph/object-protocol modules, and fixture-style runtime gates like skip_if_no_tma in cuda_core/tests/test_tensor_map.py.
  • So I think it is fair to describe helper-based runtime skip/xfail logic as a commonly used pattern under cuda-python, with the strongest examples living in cuda_core/tests.
  • Decorators are most natural when the condition is static up front: platform, version, missing import, permanently absent feature, etc. Here the interesting condition is narrower: a particular mempool setup call fails with the known Windows MCDM OOM-like failure. A decorator would tend to mark the whole test based on environment rather than on the actually observed failure.
  • A context manager is more plausible than a decorator, but still not a great fit here because cuda_bindings/tests is largely return-code driven. The test gets an err back from the CUDA API and then decides what to do. In that style, a helper like xfail_if_mempool_oom(err, api_name, device) is more natural than building an exception-oriented context manager around a return-code check.
  • The local helper also keeps the xfail narrowly scoped. Unaffected systems still pass normally, affected systems only xfail when the known bug actually reproduces, and once the underlying issue is fixed the test can begin passing immediately instead of remaining broadly pre-marked.

Contributor

I agree this follows the existing pattern. I'd be interested in exploring options to diminish the reliance on these helpers.

At this particular line of code, errors are being checked manually, so a helper makes sense. More broadly, it would be better if the tests could be written directly and some other mechanism could translate failures into skips or xfails, as needed. An aspiration.
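For illustration only, and not part of this PR: one hypothetical shape for such a mechanism is a conftest-level report hook that rewrites a recognized failure into an xfail after the fact. The _is_known_mcdm_oom() predicate below is invented for the sketch and would itself be the hard part.

import pytest


@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_makereport(item, call):
    outcome = yield
    report = outcome.get_result()
    if report.when == "call" and report.failed and _is_known_mcdm_oom(call.excinfo):
        # pytest records imperative xfails as skipped reports that carry a wasxfail reason
        report.outcome = "skipped"
        report.wasxfail = "known Windows MCDM mempool OOM (nvbugs 5815123)"

The catch is that in return-code-driven tests the failure surfaces as a generic assertion, so the predicate would have to recognize it from the exception text, which is exactly the fragility the in-test helper avoids.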

Comment thread on cuda_core/tests/test_memory.py (outdated), lines +642 to +643
skip_if_pinned_memory_unsupported(device)
mr = PinnedMemoryResource()
mr = create_pinned_memory_resource_or_skip(xfail_device=device)
Contributor

This sequence looks odd. It checks whether PinnedMemoryResource is supported, but even when that check passes an additional "guarded construction" is necessary.

Contributor Author

Ah, sorry I was too rushed last night and got mixed up with the skip/xfail name change. Fixed in these commits:

The diff to main is smaller, and I believe the "first skip if the capability is missing, then xfail if there is a known issue" sequence here is much clearer now.

rwgk and others added 4 commits May 1, 2026 09:57
Use getattr for the shared mempool helper so it accepts device objects and raw ordinals without extra branching.

Co-authored-by: Cursor <cursoragent@cursor.com>
Keep the established managed-memory test helper name so call sites stay readable, while documenting that Windows MCDM mempool OOM setup failures are xfailed rather than skipped.

Co-authored-by: Cursor <cursoragent@cursor.com>
Clarify pinned mempool test setup by keeping skip for capability checks and using xfail naming for the Windows MCDM constructor workaround.

Co-authored-by: Cursor <cursoragent@cursor.com>
@rwgk rwgk (Contributor, Author) commented May 1, 2026

Oh, I'm stumbling over the new helper missing in the 12.9.x branch:

tests/conftest.py:20: in <module>
    from cuda.bindings._test_helpers.mempool import xfail_if_mempool_oom
E   ModuleNotFoundError: No module named 'cuda.bindings._test_helpers.mempool'
Error: Process completed with exit code 4.

I'll put this PR back in draft mode while I figure out how to resolve that problem.

@rwgk rwgk marked this pull request as draft May 1, 2026 18:32
Allow cuda_core tests to run against older cuda.bindings artifacts by falling back when the mempool xfail helper is unavailable, so collection succeeds without the new OOM xfail behavior.

Co-authored-by: Cursor <cursoragent@cursor.com>
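For reference, the fallback described above can be sketched roughly as follows; the no-op shape is an assumption, and only the module path and helper name come from the traceback and review discussion above.

# conftest.py (sketch)
try:
    from cuda.bindings._test_helpers.mempool import xfail_if_mempool_oom
except ModuleNotFoundError:
    # Older cuda.bindings artifacts (e.g. the 12.9.x branch) do not ship the helper.
    # Fall back to a no-op so test collection still succeeds, just without the new OOM xfail behavior.
    def xfail_if_mempool_oom(err, api_name, device):
        pass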
@rwgk rwgk (Contributor, Author) commented May 1, 2026

/ok to test

@rwgk rwgk marked this pull request as ready for review May 1, 2026 20:38
@rwgk rwgk merged commit 2849053 into NVIDIA:main May 1, 2026
96 checks passed
@rwgk rwgk deleted the nvbugs5815123_xfail_if_mempool_oom branch May 1, 2026 20:50
github-actions (Bot) commented May 3, 2026

Doc Preview CI
Preview removed because the pull request was closed or merged.
