Add opt-in GPU spreading for the parallel test suite #588

Draft
michel2323 wants to merge 1 commit into main from test-gpu-spreading

@michel2323

Member

Summary

Adds an opt-in mechanism to spread the parallel test suite across multiple GPU tiles instead of oversubscribing device 0.

With ONEAPI_TEST_SPREAD_GPUS=1, each test worker process is pinned to a distinct GPU via ZE_AFFINITY_MASK, passed through the ParallelTestRunner env kwarg. Devices are claimed round-robin through an atomic mkdir counter, and the mask is set before `using oneAPI` so the Level Zero driver picks it up at init.

device() is task-local and Malt runs each test in a fresh task, so a device! call in init_worker_code would not stick. Process-level pinning is the robust approach.
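
For readers unfamiliar with the mkdir-counter trick, here is a minimal, language-neutral sketch in Python. It is not the PR's Julia code. The names `claim_slot` and `affinity_for`, the counter directory layout, and the device count are all hypothetical, and the sketch assumes the mask is a plain device index (the PR may use a different mask form). The idea is that creating a directory either succeeds or fails with "already exists", so each worker can claim a unique slot without any locking and then map that slot to a device round-robin.

```python
import os
import tempfile


def claim_slot(counter_dir: str, max_slots: int = 4096) -> int:
    """Claim the lowest free integer slot by creating a directory named after it.

    os.mkdir either creates the directory or raises FileExistsError, so two
    processes racing for the same slot cannot both win it.
    """
    os.makedirs(counter_dir, exist_ok=True)
    for slot in range(max_slots):
        try:
            os.mkdir(os.path.join(counter_dir, str(slot)))
            return slot
        except FileExistsError:
            continue  # another worker already owns this slot, try the next one
    raise RuntimeError("no free slot left")


def affinity_for(slot: int, num_devices: int) -> str:
    """Map a claimed slot to a device index, wrapping around (round-robin)."""
    return str(slot % num_devices)


if __name__ == "__main__":
    counter_dir = os.path.join(tempfile.mkdtemp(), "gpu-slots")  # hypothetical location
    num_devices = 4  # hypothetical device count
    slot = claim_slot(counter_dir)
    # The mask must be in the environment before the GPU runtime initializes.
    os.environ["ZE_AFFINITY_MASK"] = affinity_for(slot, num_devices)
    print(slot, os.environ["ZE_AFFINITY_MASK"])
```

Because the claim happens once per process and the mask lives in the process environment, it does not depend on any task-local device selection inside the worker.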

Notes

Fully opt-in: when ONEAPI_TEST_SPREAD_GPUS is unset, behavior is identical to current main (every worker stays on the first device). ZE_AFFINITY_MASK is a standard Level Zero variable, stack-agnostic.
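
The opt-in behavior can be illustrated with a small Python sketch (again, not the PR's Julia code). The helper `worker_env` is hypothetical, and treating any value other than "1" as "off" is an assumption on my part. When the flag is absent, the worker environment is passed through untouched, so nothing changes relative to main.

```python
def worker_env(parent_env: dict, device_index: int) -> dict:
    """Return the environment for one worker process.

    ZE_AFFINITY_MASK is added only when the opt-in flag is "1"; otherwise the
    parent's environment is copied through unchanged.
    """
    env = dict(parent_env)
    if parent_env.get("ONEAPI_TEST_SPREAD_GPUS") == "1":
        env["ZE_AFFINITY_MASK"] = str(device_index)
    return env
```

A launcher could then start each worker with something like `subprocess.run(cmd, env=worker_env(dict(os.environ), index))`, which plays the same role as handing an env to the test runner.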

🤖 Generated with Claude Code

ONEAPI_TEST_SPREAD_GPUS=1 pins each test worker process to a distinct GPU via
ZE_AFFINITY_MASK (claimed round-robin through an atomic mkdir counter, set before
`using oneAPI` so the Level Zero driver picks it up at init). This spreads the suite
across all tiles instead of oversubscribing device 0.

device() is task-local and Malt runs each test in a fresh task, so a device! in
init_worker_code would not stick — process-level pinning is the robust approach.

Default (unset) keeps every worker on the first device, preserving single-tile
oversubscription which is useful for surfacing contention bugs.

Verified: 6 concurrent claimers -> 6 distinct device UUIDs; real harness with
--jobs=4 spreads cleanly (SUCCESS).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.89%. Comparing base (e995a63) to head (dd1ba6b).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #588      +/-   ##
==========================================
- Coverage   80.92%   80.89%   -0.04%     
==========================================
  Files          48       48              
  Lines        3234     3234              
==========================================
- Hits         2617     2616       -1     
- Misses        617      618       +1     

