# test: xfail Windows MCDM mempool OOM setup failures #2000
A new test-helper module is added (imported in the test changes below as `cuda.bindings._test_helpers.mempool`):

```python
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE

import sys

import pytest

from cuda.bindings import driver, runtime


def is_windows_mcdm_device(device=0):
    """Return True if `device` runs under the Windows MCDM driver model."""
    if sys.platform != "win32":
        return False
    import cuda.bindings.nvml as nvml

    # Accept either a device ordinal or an object exposing `device_id`.
    device_id = int(getattr(device, "device_id", device))
    (err,) = driver.cuInit(0)
    if err != driver.CUresult.CUDA_SUCCESS:
        return False
    # 13 bytes fits a "0000:00:00.0" PCI bus id plus its NUL terminator.
    err, pci_bus_id = driver.cuDeviceGetPCIBusId(13, device_id)
    if err != driver.CUresult.CUDA_SUCCESS:
        return False
    pci_bus_id = pci_bus_id.split(b"\x00", 1)[0].decode("ascii")
    nvml.init_v2()
    try:
        handle = nvml.device_get_handle_by_pci_bus_id_v2(pci_bus_id)
        current, _ = nvml.device_get_driver_model_v2(handle)
        return current == nvml.DriverModel.DRIVER_MCDM
    finally:
        nvml.shutdown()


def xfail_if_mempool_oom(err_or_exc, api_name=None, device=0):
    """xfail the current test if `err_or_exc` is an OOM error on a Windows MCDM device."""
    # Shim: a non-string second positional argument is taken as the device,
    # so callers may omit `api_name`.
    if api_name is not None and not isinstance(api_name, str):
        device = api_name
        api_name = None

    is_oom = err_or_exc in (
        driver.CUresult.CUDA_ERROR_OUT_OF_MEMORY,
        runtime.cudaError_t.cudaErrorMemoryAllocation,
    ) or "CUDA_ERROR_OUT_OF_MEMORY" in str(err_or_exc)

    if not is_oom:
        return
    try:
        is_windows_mcdm = is_windows_mcdm_device(device)
    except Exception:
        # If MCDM detection fails, leave the primary test failure visible.
        return
    if not is_windows_mcdm:
        return

    api_context = f"{api_name} " if api_name else ""
    pytest.xfail(f"{api_context}could not reserve VA for mempool operations on Windows MCDM")
```
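The argument shim is worth a note: a non-string second positional argument is treated as the device ordinal. A small illustration (hypothetical calls, not from the diff):

```python
# Illustrative only: on a non-MCDM machine both calls return silently;
# on a Windows MCDM device both would xfail the current test.
from cuda.bindings import driver
from cuda.bindings._test_helpers.mempool import xfail_if_mempool_oom

err = driver.CUresult.CUDA_ERROR_OUT_OF_MEMORY

# Full form: error, API name for the xfail message, device ordinal.
xfail_if_mempool_oom(err, "cuMemPoolCreate", 0)

# Short form: the non-str second argument is shifted into `device`.
xfail_if_mempool_oom(err, 0)
```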
The helper is then wired into the existing tests:

```diff
@@ -12,6 +12,7 @@
 import cuda.bindings.driver as cuda
 import cuda.bindings.runtime as cudart
 from cuda.bindings import driver
+from cuda.bindings._test_helpers.mempool import xfail_if_mempool_oom


 def driverVersionLessThan(target):
@@ -270,6 +271,7 @@ def test_cuda_memPool_attr():

     attr_list = [None] * 8
     err, pool = cuda.cuMemPoolCreate(poolProps)
+    xfail_if_mempool_oom(err, "cuMemPoolCreate", poolProps.location.id)
     assert err == cuda.CUresult.CUDA_SUCCESS

     for idx, attr in enumerate(
```

Review thread on the new `xfail_if_mempool_oom(err, "cuMemPoolCreate", ...)` line:

> **Contributor:** I was (perhaps naively) expecting the …
>
> **Contributor (author):** Helper-based local … Cursor generated supporting details: …
>
> **Contributor:** I agree this follows the existing pattern. I'd be interested in exploring options to diminish the reliance on these helpers. At this particular line of code, errors are being checked manually, so a helper makes sense. More broadly, it would be better if the tests could be written directly and some other mechanism could translate failures into skips or xfails, as needed. An aspiration.
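One way that "other mechanism" aspiration could look is a conftest-level hook that lets tests assert directly and rewrites known OOM failures into xfails after the fact. This is a hypothetical sketch, not part of the PR: it reuses `is_windows_mcdm_device` from the new helper, and it assumes the failure message names `CUDA_ERROR_OUT_OF_MEMORY`.

```python
# Hypothetical conftest.py: let tests assert directly; rewrite known
# Windows-MCDM OOM failures into xfails after the test has run.
import pytest

from cuda.bindings._test_helpers.mempool import is_windows_mcdm_device


@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_makereport(item, call):
    outcome = yield
    report = outcome.get_result()
    if report.when != "call" or not report.failed or call.excinfo is None:
        return
    # Only rewrite failures whose message names the known OOM error.
    if "CUDA_ERROR_OUT_OF_MEMORY" not in str(call.excinfo.value):
        return
    try:
        if is_windows_mcdm_device():
            # Turn the failed report into an xfail: mark it skipped and
            # attach the wasxfail reason pytest uses for xfail reporting.
            report.outcome = "skipped"
            report.wasxfail = "mempool OOM on Windows MCDM"
    except Exception:
        # If MCDM detection fails, keep the original failure visible.
        pass
```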
In `test_cuda_graphMem_attr`, the graph and stream are destroyed before the helper runs, because `pytest.xfail()` raises immediately and nothing after it would execute:

```diff
@@ -468,6 +470,12 @@ def test_cuda_graphMem_attr(device):
     params.bytesize = allocSize

     err, allocNode = cuda.cuGraphAddMemAllocNode(graph, None, 0, params)
+    if err == cuda.CUresult.CUDA_ERROR_OUT_OF_MEMORY:
+        (destroy_err,) = cuda.cuGraphDestroy(graph)
+        assert destroy_err == cuda.CUresult.CUDA_SUCCESS
+        (destroy_err,) = cuda.cuStreamDestroy(stream)
+        assert destroy_err == cuda.CUresult.CUDA_SUCCESS
+        xfail_if_mempool_oom(err, "cuGraphAddMemAllocNode", device)
     assert err == cuda.CUresult.CUDA_SUCCESS
     err, freeNode = cuda.cuGraphAddMemFreeNode(graph, [allocNode], 1, params.dptr)
     assert err == cuda.CUresult.CUDA_SUCCESS
```
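A hedged alternative sketch (not what the PR does): registering the cleanup with `contextlib.ExitStack` up front would make the early `pytest.xfail()` exit safe without duplicating the destroy calls. `add_alloc_node_or_xfail` is a hypothetical name.

```python
# Sketch only (not the PR's approach): register cleanup before the risky
# call so that pytest.xfail(), which raises, cannot leak the graph/stream.
import contextlib

import cuda.bindings.driver as cuda
from cuda.bindings._test_helpers.mempool import xfail_if_mempool_oom


def add_alloc_node_or_xfail(graph, stream, params, device):
    with contextlib.ExitStack() as cleanup:
        # Callbacks run in reverse order on any exit, including xfail's raise.
        cleanup.callback(lambda: cuda.cuStreamDestroy(stream))
        cleanup.callback(lambda: cuda.cuGraphDestroy(graph))
        err, alloc_node = cuda.cuGraphAddMemAllocNode(graph, None, 0, params)
        xfail_if_mempool_oom(err, "cuGraphAddMemAllocNode", device)
        assert err == cuda.CUresult.CUDA_SUCCESS
        cleanup.pop_all()  # success path: leave teardown to the caller
    return alloc_node
```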
Review thread on the helper's `nvml.init_v2()` / `nvml.shutdown()` pair:

> Doesn't this assume that `nvml` was uninitialized on entry to this function? Would it break callers that initialized `nvml`?

> I checked the NVML API contract directly instead of relying on memory. The short answer is that `nvmlInit_v2()` and `nvmlShutdown()` are reference-counted, so the balanced `nvml.init_v2()`/`nvml.shutdown()` pair in our helper should not break callers that had already initialized NVML. The most relevant NVIDIA doc is the current NVML "Initialization and Cleanup" page: https://docs.nvidia.com/deploy/nvml-api/group__nvmlInitializationAndCleanup.html
>
> Cursor generated supporting details:
>
> - The docs for `nvmlInit_v2()` say: "A reference count of the number of initializations is maintained. Shutdown only occurs when the reference count reaches zero."
> - The docs for `nvmlShutdown()` say: "This method should be called ... once for each call to `nvmlInit_v2()`. A reference count of the number of initializations is maintained. Shutdown only occurs when the reference count reaches zero."
> - The `cuda.bindings.nvml` layer is a thin pass-through here: `init_v2()` calls `nvmlInit_v2()` directly and `shutdown()` calls `nvmlShutdown()` directly, so there is no extra Python-side lifecycle logic changing the semantics.
> - `cuda_bindings/cuda/bindings/nvml.pyx` also reflects the same contract: `ERROR_ALREADY_INITIALIZED` is described as deprecated because "Multiple initializations are now allowed through ref counting."
> - There is also `cuda_bindings/tests/nvml/test_init.py`, whose `test_init_ref_count()` explicitly exercises repeated `init_v2()`/`shutdown()` calls and checks that NVML remains initialized until the matching final shutdown. That test is skipped on Windows, so it is not direct Windows coverage, but it does show the intended interpretation inside this repo.
> - One caveat: `nvmlShutdown()` calls beyond the init count are tolerated for backwards compatibility, while our local test expects `UninitializedError` on a naked `shutdown()`. That mismatch is worth keeping in mind, but it does not affect this helper because the helper uses a balanced init/shutdown pair.
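A tiny demonstration of that ref-counted contract (a sketch under the assumptions above; `device_get_count_v2` is an assumed name, following the binding's convention for NVML's `nvmlDeviceGetCount_v2`):

```python
# Sketch: nested init/shutdown pairs are safe because NVML ref-counts them.
import cuda.bindings.nvml as nvml

nvml.init_v2()                            # caller's init (ref count: 1)
nvml.init_v2()                            # helper's nested init (ref count: 2)
assert nvml.device_get_count_v2() >= 0    # assumed API; NVML is usable here
nvml.shutdown()                           # helper's balanced shutdown (ref count: 1)
assert nvml.device_get_count_v2() >= 0    # still usable: caller's init remains
nvml.shutdown()                           # caller's shutdown (ref count: 0, NVML torn down)
```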