test: xfail Windows MCDM mempool OOM setup failures #2000

Merged: rwgk merged 11 commits into NVIDIA:main from rwgk:nvbugs5815123_xfail_if_mempool_oom, May 1, 2026

Conversation

@rwgk rwgk (Contributor) commented May 1, 2026

Summary

This PR hardens a small set of CUDA memory pool tests on Windows MCDM systems where the device reports memory pool support, but the actual memory pool setup path can still fail with CUDA_ERROR_OUT_OF_MEMORY.

Instead of treating those environment-specific setup failures as ordinary test failures, the affected tests now mark them as expected failures only when:

  • the failing call is one of the memory pool setup paths covered by the tests,
  • the error is an out-of-memory result, and
  • NVML confirms that the CUDA device is running under Windows MCDM.

If the MCDM check cannot be completed, the original test failure remains visible.
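For illustration, the guard's shape is roughly the following. This is a minimal sketch, not the exact PR code: the driver import path and the _is_windows_mcdm() check are assumptions, while the helper name and arguments match the snippets discussed in the review below.

import pytest
from cuda.bindings import driver as cuda  # assumed import path for the CUDA driver bindings


def xfail_if_mempool_oom(err, api_name, device):
    # Convert only the known environment-specific failure into an expected failure.
    if err != cuda.CUresult.CUDA_ERROR_OUT_OF_MEMORY:
        return  # success or an unrelated error: let the test proceed or fail normally
    try:
        on_mcdm = _is_windows_mcdm(device)  # hypothetical NVML-based check; sketched in the review thread below
    except Exception:
        return  # MCDM check could not be completed: keep the original test failure visible
    if not on_mcdm:
        return  # not Windows MCDM: keep the original failure visible
    pytest.xfail(f"{api_name} returned CUDA_ERROR_OUT_OF_MEMORY on Windows MCDM (nvbugs 5815123)")

In the tests, the guard is called immediately after the relevant setup call returns its error code, as shown in the review snippets further down.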

Why xfail

xfail is the best fit here because the feature is expected to be supported by the device, and the tests should pass once the underlying platform behavior is resolved. A skip would make the feature look unsupported or not applicable on these systems, while an xfail preserves the signal that this is a known failing path and will turn into an XPASS once the behavior starts working.

The xfail is also scoped narrowly. Other platforms, non-MCDM Windows configurations, and unexpected errors still fail normally.

Alternatives Considered

  1. Approximate the risk from device properties.
    We could estimate whether a system is likely to fail based on platform, driver model, and device memory size. This is useful for diagnostics, but it is not reliable enough for test gating because it predicts an internal allocation behavior rather than observing the actual API result.

  2. Probe in a subprocess before running the tests.
    A subprocess probe would avoid changing the parent pytest process state. However, it only proves that the operation succeeded in a fresh process. These failures can depend on process state and test ordering, so a subprocess result would not reliably predict whether the real test process can still perform the same setup later.

  3. Use driver logging to identify the allocation behavior.
    Logs can help diagnose the failure after it happens, but they require external setup and are not a practical in-test precondition. They also do not provide a stable public API for deciding whether a test should run.

The implemented approach uses the actual API result from the test path and only xfails the known Windows MCDM out-of-memory case. That keeps the workaround close to the behavior being tested and avoids hiding unrelated regressions.

Test Plan

  • Public CI passed for commit 6f3ac7f.
  • Additional target-system validation is pending before rerunning the full public CI.

rwgk added 2 commits April 30, 2026 17:33
Work around nvbugs5815123 by treating OOM returns from mempool setup in affected tests as expected failures on Windows. Unsupported configurations still skip normally, while other platforms continue to fail on unexpected OOMs.

Made-with: Cursor
Use NVML to confirm the CUDA device is running on Windows MCDM before treating mempool OOM setup failures as expected. If the MCDM check cannot be completed, leave the original test failure visible.

Made-with: Cursor
@rwgk rwgk added this to the "cuda.bindings next" milestone May 1, 2026
@rwgk rwgk self-assigned this May 1, 2026
@rwgk rwgk added the P0 (High priority - Must do!), test (Improvements or additions to tests), cuda.bindings (Everything related to the cuda.bindings module), and cuda.core (Everything related to the cuda.core module) labels May 1, 2026
copy-pr-bot (Bot) commented May 1, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@rwgk rwgk (Contributor, Author) commented May 1, 2026

/ok to test

@rwgk rwgk (Contributor, Author) commented May 1, 2026

@Andy-Jost @rluo8 for visibility

Thanks @rluo8 for sending me the logs!

rwgk added 3 commits April 30, 2026 19:59
Move the Windows MCDM detection and mempool OOM xfail handling into a shared test helper so cuda.bindings and cuda.core tests use the same workaround logic.

Made-with: Cursor
Let the MCDM detector report only the detected state and keep the broad fallback in the mempool OOM xfail path, where detection failures should leave the original test failure visible.

Made-with: Cursor
@rwgk rwgk changed the title from "WIP: nvbugs5815123_xfail_if_mempool_oom" to "test: xfail Windows MCDM mempool OOM setup failures" May 1, 2026
@rwgk rwgk marked this pull request as ready for review May 1, 2026 03:45
@rwgk rwgk requested a review from Andy-Jost May 1, 2026 03:45
@Andy-Jost Andy-Jost (Contributor) left a comment

Comments only, no blockers.

return False
import cuda.bindings.nvml as nvml

device_id = int(device.device_id if hasattr(device, "device_id") else device)
Contributor

nit: getattr(device, "device_id", device)?

Contributor Author

Done: commit 20c8a7a

Thanks for catching this!

current, _ = nvml.device_get_driver_model_v2(handle)
return current == nvml.DriverModel.DRIVER_MCDM
finally:
nvml.shutdown()
Contributor

Doesn't this assume that nvml was uninitialized on entry to this function? Would it break callers that initialized nvml?

Contributor Author

I checked the NVML API contract directly instead of relying on recollection. The short answer is that nvmlInit_v2() and nvmlShutdown() are reference-counted, so the balanced nvml.init_v2() / nvml.shutdown() pair in our helper should not break callers that had already initialized NVML.

The most relevant NVIDIA doc is the current NVML "Initialization and Cleanup" page: https://docs.nvidia.com/deploy/nvml-api/group__nvmlInitializationAndCleanup.html.

Cursor generated supporting details:

  • The current NVML docs for nvmlInit_v2() say: "A reference count of the number of initializations is maintained. Shutdown only occurs when the reference count reaches zero."
  • The current NVML docs for nvmlShutdown() say: "This method should be called ... once for each call to nvmlInit_v2(). A reference count of the number of initializations is maintained. Shutdown only occurs when the reference count reaches zero."
  • The same current docs also say this applies "For all products." Separately, the NVML API reference lists Windows as a supported OS platform, so there is no indication that the ref-count behavior is Linux-only.
  • The archived R525 docs use the same ref-count language, which suggests this is not a recent or unstable contract.
  • Our cuda.bindings.nvml layer is a thin pass-through here: init_v2() calls nvmlInit_v2() directly and shutdown() calls nvmlShutdown() directly, so there is no extra Python-side lifecycle logic changing the semantics.
  • The generated binding text in cuda_bindings/cuda/bindings/nvml.pyx also reflects the same contract: ERROR_ALREADY_INITIALIZED is described as deprecated because "Multiple initializations are now allowed through ref counting."
  • The repo already encodes this assumption in cuda_bindings/tests/nvml/test_init.py, whose test_init_ref_count() explicitly exercises repeated init_v2() / shutdown() calls and checks that NVML remains initialized until the matching final shutdown. That test is skipped on Windows, so it is not direct Windows coverage, but it does show the intended interpretation inside this repo.
  • One unrelated wrinkle: the current docs say extra nvmlShutdown() calls beyond the init count are tolerated for backwards compatibility, while our local test expects UninitializedError on a naked shutdown(). That mismatch is worth keeping in mind, but it does not affect this helper because the helper uses a balanced init/shutdown pair.
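For reference, the balanced pattern under discussion looks roughly like this. It is a minimal sketch: init_v2(), shutdown(), device_get_driver_model_v2(), and DriverModel.DRIVER_MCDM appear in the snippets above, while the handle-lookup name is an assumption.

import cuda.bindings.nvml as nvml


def _is_windows_mcdm(device) -> bool:
    nvml.init_v2()  # ref-counted, so this is safe even if the caller already initialized NVML
    try:
        device_id = int(getattr(device, "device_id", device))
        handle = nvml.device_get_handle_by_index(device_id)  # assumed name for the handle lookup
        current, _ = nvml.device_get_driver_model_v2(handle)
        return current == nvml.DriverModel.DRIVER_MCDM
    finally:
        nvml.shutdown()  # balances the init_v2() above by decrementing the NVML reference count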


attr_list = [None] * 8
err, pool = cuda.cuMemPoolCreate(poolProps)
xfail_if_mempool_oom(err, "cuMemPoolCreate", poolProps.location.id)
Contributor

I was (perhaps naively) expecting the xfail logic to appear as a decorator on the test itself, or, at worst, a context manager. I guess that's not practical?

Contributor Author

Helper-based local skip/xfail logic is a widely-used pattern in this repo, especially under cuda_core/tests. Also, I believe for this specific issue, a local helper is the right tool for the job because the condition is only knowable after a specific CUDA API call returns a specific failure on a specific runtime configuration.

Cursor generated supporting details:

  • Under cuda_core/tests, runtime gating is frequently factored into helpers and fixtures rather than only using decorators. Examples include skip_if_pinned_memory_unsupported() and skip_if_managed_memory_unsupported() in cuda_core/tests/conftest.py, plus local helpers like _skip_if_no_mempool() / _skip_if_no_managed_mempool() in cuda_core/tests/graph/test_graph_definition.py, similar _skip_if_no_mempool() helpers in several other graph/object-protocol modules, and fixture-style runtime gates like skip_if_no_tma in cuda_core/tests/test_tensor_map.py.
  • So I think it is fair to describe helper-based runtime skip/xfail logic as a commonly used pattern under cuda-python, with the strongest examples living in cuda_core/tests.
  • Decorators are most natural when the condition is static up front: platform, version, missing import, permanently absent feature, etc. Here the interesting condition is narrower: a particular mempool setup call fails with the known Windows MCDM OOM-like failure. A decorator would tend to mark the whole test based on environment rather than on the actually observed failure.
  • A context manager is more plausible than a decorator, but still not a great fit here because cuda_bindings/tests is largely return-code driven. The test gets an err back from the CUDA API and then decides what to do. In that style, a helper like xfail_if_mempool_oom(err, api_name, device) is more natural than building an exception-oriented context manager around a return-code check.
  • The local helper also keeps the xfail narrowly scoped. Unaffected systems still pass normally, affected systems only xfail when the known bug actually reproduces, and once the underlying issue is fixed the test can begin passing immediately instead of remaining broadly pre-marked.

Contributor

I agree this follows the existing pattern. I'd be interested in exploring options to diminish the reliance on these helpers.

At this particular line of code, errors are being checked manually, so a helper makes sense. More broadly, it would be better if the tests could be written directly and some other mechanism could translate failures into skips or xfails, as needed. An aspiration.
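For illustration only, and not part of this PR: one hypothetical shape for such a mechanism is a conftest-level report hook that rewrites a recognized failure into an xfail after the fact. The _is_known_mcdm_oom() predicate below is invented for the sketch and would itself be the hard part.

import pytest


@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_makereport(item, call):
    outcome = yield
    report = outcome.get_result()
    if report.when == "call" and report.failed and _is_known_mcdm_oom(call.excinfo):
        # pytest records imperative xfails as skipped reports that carry a wasxfail reason
        report.outcome = "skipped"
        report.wasxfail = "known Windows MCDM mempool OOM (nvbugs 5815123)"

The catch is that in return-code-driven tests the failure surfaces as a generic assertion, so the predicate would have to recognize it from the exception text, which is exactly the fragility the in-test helper avoids.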

Comment thread on cuda_core/tests/test_memory.py (outdated), lines +642 to +643
skip_if_pinned_memory_unsupported(device)
mr = PinnedMemoryResource()
mr = create_pinned_memory_resource_or_skip(xfail_device=device)
Contributor

This sequence looks odd. It checks whether PinnedMemoryResource is supported, but even when that check passes an additional "guarded construction" is necessary.

Contributor Author

Ah, sorry I was too rushed last night and got mixed up with the skip/xfail name change. Fixed in these commits:

The diff to main is smaller, and I believe the "first skip if the capability is missing, then xfail if there is a known issue" sequence here is much clearer now.

rwgk and others added 4 commits May 1, 2026 09:57
Use getattr for the shared mempool helper so it accepts device objects and raw ordinals without extra branching.

Co-authored-by: Cursor <cursoragent@cursor.com>
Keep the established managed-memory test helper name so call sites stay readable, while documenting that Windows MCDM mempool OOM setup failures are xfailed rather than skipped.

Co-authored-by: Cursor <cursoragent@cursor.com>
Clarify pinned mempool test setup by keeping skip for capability checks and using xfail naming for the Windows MCDM constructor workaround.

Co-authored-by: Cursor <cursoragent@cursor.com>
@rwgk rwgk (Contributor, Author) commented May 1, 2026

Oh, I'm stumbling over the new helper missing in the 12.9.x branch:

tests/conftest.py:20: in <module>
    from cuda.bindings._test_helpers.mempool import xfail_if_mempool_oom
E   ModuleNotFoundError: No module named 'cuda.bindings._test_helpers.mempool'
Error: Process completed with exit code 4.

I'll put this PR back in draft mode while I figure out how to resolve that problem.

@rwgk rwgk marked this pull request as draft May 1, 2026 18:32
Allow cuda_core tests to run against older cuda.bindings artifacts by falling back when the mempool xfail helper is unavailable, so collection succeeds without the new OOM xfail behavior.

Co-authored-by: Cursor <cursoragent@cursor.com>
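For reference, the fallback described above can be sketched roughly as follows; the no-op shape is an assumption, and only the module path and helper name come from the traceback and review discussion above.

# conftest.py (sketch)
try:
    from cuda.bindings._test_helpers.mempool import xfail_if_mempool_oom
except ModuleNotFoundError:
    # Older cuda.bindings artifacts (e.g. the 12.9.x branch) do not ship the helper.
    # Fall back to a no-op so test collection still succeeds, just without the new OOM xfail behavior.
    def xfail_if_mempool_oom(err, api_name, device):
        pass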
@rwgk rwgk (Contributor, Author) commented May 1, 2026

/ok to test

@rwgk rwgk marked this pull request as ready for review May 1, 2026 20:38
@rwgk rwgk merged commit 2849053 into NVIDIA:main May 1, 2026
96 checks passed
@rwgk rwgk deleted the nvbugs5815123_xfail_if_mempool_oom branch May 1, 2026 20:50
github-actions (Bot) commented May 3, 2026

Doc Preview CI
Preview removed because the pull request was closed or merged.
