ark-dev: examples/qwen3: TP all-reduce component — mscclpp fused-packet all-reduce at Qwen3 attn-output and MLP-output shapes, equivalence test, microbenchmark #268
Open
chhwang wants to merge 11 commits into
…et all-reduce at Qwen3 attn-output and MLP-output shapes, equivalence test, microbenchmark
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #268 +/- ##
==========================================
+ Coverage 85.70% 85.83% +0.12%
==========================================
Files 129 129
Lines 6457 6495 +38
==========================================
+ Hits 5534 5575 +41
+ Misses 923 920 -3
…tion shapes — root-cause planner/executor bug, fix, add regression tests
The bare 'apt-get install -y lcov' pulls 78 recommended packages (fontconfig, fonts-dejavu, libgd, etc.), exhausting runner memory. The runner receives SIGKILL during unpack before any test runs. Adding --no-install-recommends limits the install to lcov and its hard dependencies only.
…, doc comments
- ut.yml: move 'Run Qwen3 Example Tests' step before coverage collection so Qwen3 tests run with the build artifacts still in place.
- ark_allreduce.py: add world_size >= 1 guard, call ark.set_rank() and ark.set_world_size() before ark.init().
- test_allreduce.py: add tests for world_size=0 and world_size=-1.
- ops_cast.cpp, ops_rope.cpp: add cross-reference comments noting the identical default_config bodies.
- examples/qwen3/__init__.py: add submodule index comment.
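The `world_size >= 1` guard and its two negative tests could look roughly like the sketch below. The function name `check_world_size`, the helper `is_rejected`, and the choice of `ValueError` are assumptions made for illustration; the commit message does not say which exception `ark_allreduce.py` raises.

```python
def check_world_size(world_size):
    """Return world_size unchanged if it is a positive integer.

    Hypothetical helper: the name and the exception types are assumptions,
    not taken from ark_allreduce.py.
    """
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(world_size, bool) or not isinstance(world_size, int):
        raise TypeError(f"world_size must be an int, got {type(world_size).__name__}")
    if world_size < 1:
        raise ValueError(f"world_size must be >= 1, got {world_size}")
    return world_size


def is_rejected(world_size):
    """Report whether the guard refuses this value with ValueError."""
    try:
        check_world_size(world_size)
    except ValueError:
        return True
    return False


# The two cases the commit adds tests for, plus one valid value.
print(is_rejected(0), is_rejected(-1), is_rejected(2))
```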
The worker subprocess in bench_allreduce.py used sys.path.insert(0, ".") which, when run from the repo root, caused Python to import the C++ ark/ directory as a namespace package instead of python/ark/. This produced:
    AttributeError: module 'ark' has no attribute 'set_rank'
Remove the sys.path manipulation; the environment PYTHONPATH already points to the correct Python package.
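The failure mode described above can be reproduced in isolation. The sketch below builds a throwaway directory tree that mimics the layout named in the commit message (an `ark/` directory with no `__init__.py` next to `python/ark/__init__.py`); the file names inside it and the helper `import_ark` are invented for the demonstration. It assumes no other package named `ark` is importable in the environment where it runs.

```python
import importlib
import os
import sys
import tempfile


def import_ark(search_dir):
    """Import `ark` with `search_dir` prepended to sys.path, then restore state."""
    saved_path = list(sys.path)
    saved_mod = sys.modules.pop("ark", None)
    try:
        sys.path.insert(0, search_dir)
        importlib.invalidate_caches()
        return importlib.import_module("ark")
    finally:
        sys.path[:] = saved_path
        sys.modules.pop("ark", None)
        if saved_mod is not None:
            sys.modules["ark"] = saved_mod


with tempfile.TemporaryDirectory() as repo_root:
    # Stand-in for the C++ sources: a directory named ark/ without __init__.py.
    os.makedirs(os.path.join(repo_root, "ark"))
    open(os.path.join(repo_root, "ark", "ops.cpp"), "w").close()

    # Stand-in for the Python package: python/ark/__init__.py with set_rank.
    py_root = os.path.join(repo_root, "python")
    os.makedirs(os.path.join(py_root, "ark"))
    with open(os.path.join(py_root, "ark", "__init__.py"), "w") as f:
        f.write("def set_rank(rank):\n    return rank\n")

    # Only the repo root is searchable: ark/ resolves as a namespace package.
    from_repo_root = import_ark(repo_root)
    # Only python/ is searchable: the regular package is found.
    from_python_dir = import_ark(py_root)

    namespace_has_set_rank = hasattr(from_repo_root, "set_rank")
    package_has_set_rank = hasattr(from_python_dir, "set_rank")

print(namespace_has_set_rank, package_has_set_rank)
```

The import from the repo root succeeds silently, which is why the problem only shows up later as an `AttributeError` on the first attribute access.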
…itation
ARK codegen rejects OFFSET arguments referencing external buffers created by all_reduce_packet (codegen.cpp:318). Mark the 4 multi-GPU tests as xfail(strict=True) so CI passes while preserving the tests as documentation. Tests will surface as xpass when the limitation is resolved.
…ilures
bench_allreduce.py crashed without printing a PERF_GATE line because workers hit AttributeError on ark.set_rank (codegen limitation blocks all_reduce_packet). The bench now:
- handles worker failures gracefully
- always prints an honest PERF_GATE line (ratio=999999 when ARK path cannot execute, real ratio when it can)
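One way to guarantee that a gate line is always printed is to turn worker failures into `None` and map `None` to the sentinel. The sketch below is illustrative only: the exact `PERF_GATE` line format, the definition of the ratio as ARK time divided by baseline time, and the helper names `time_or_none` and `perf_gate_line` are all assumptions, since the commit message gives only the `ratio=999999` sentinel.

```python
import sys

SENTINEL_RATIO = 999999  # value named in the commit message


def time_or_none(measure):
    """Run a timing callable; return None instead of raising on failure."""
    try:
        return measure()
    except Exception as exc:  # a worker crash must not abort the bench
        print(f"worker failed: {exc!r}", file=sys.stderr)
        return None


def perf_gate_line(ark_us, baseline_us):
    """Format the gate line (hypothetical format), using the sentinel when no ratio exists."""
    if ark_us is None or baseline_us is None or baseline_us <= 0:
        return f"PERF_GATE ratio={SENTINEL_RATIO}"
    return f"PERF_GATE ratio={ark_us / baseline_us:.3f}"


def failing_worker():
    # Mimics the AttributeError the workers hit.
    raise AttributeError("module 'ark' has no attribute 'set_rank'")


failed_line = perf_gate_line(time_or_none(failing_worker), 10.0)
ok_line = perf_gate_line(time_or_none(lambda: 12.0), 10.0)
print(failed_line)
print(ok_line)
```

Using an obviously out-of-range sentinel keeps the line parseable by whatever consumes it while making it impossible to mistake a failed run for a passing one.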
…nput in all_reduce_packet
codegen.cpp: emit moff.value() for external/unmapped buffers instead of erroring. Internal buffers still resolve via buffer_id_to_offset_.
ops_all_reduce.cpp: copy input into an internal buffer so putPackets reads from mscclpp-registered memory.
ops_all_reduce_test.cpp: add 3 fused-packet all-reduce tests (2/4/8 GPU).
test_allreduce.py: fix init ordering (ark.init() before set_rank/set_world_size); remove 4 xfail decorators.
Move ark.init() before ark.set_rank()/ark.set_world_size() so Model.reset() runs before rank/world_size are configured. Previous order reset rank to 0 after setting it.
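The ordering problem can be illustrated with a small stand-in object. `FakeRuntime` below is hypothetical and is not the ARK API; it only models the behaviour the commit message describes, namely that initialization resets rank. The post-reset default of `world_size = 1` is an assumption.

```python
class FakeRuntime:
    """Hypothetical stand-in for the ark module; models only the reset behaviour."""

    def __init__(self):
        self.rank = 0
        self.world_size = 1

    def init(self):
        # Assumed from the commit message: init triggers a reset of rank
        # (and, by assumption here, of world_size as well).
        self.rank = 0
        self.world_size = 1

    def set_rank(self, rank):
        self.rank = rank

    def set_world_size(self, world_size):
        self.world_size = world_size


def configure_then_init(rt, rank, world_size):
    """Previous order: the settings are wiped by the reset inside init()."""
    rt.set_rank(rank)
    rt.set_world_size(world_size)
    rt.init()
    return rt.rank, rt.world_size


def init_then_configure(rt, rank, world_size):
    """Fixed order: reset first, then apply the settings."""
    rt.init()
    rt.set_rank(rank)
    rt.set_world_size(world_size)
    return rt.rank, rt.world_size


broken = configure_then_init(FakeRuntime(), 3, 4)
fixed = init_then_configure(FakeRuntime(), 3, 4)
print(broken, fixed)
```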
TP all-reduce component (mscclpp fused-packet)
Wraps ark.all_reduce_packet for 2-D Qwen3 shapes. Includes CPU-only validation tests (run in CI), multi-GPU equivalence tests (skipped when device_count < 2), and a microbenchmark.
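The split between always-run CPU tests and GPU-gated tests can be sketched as follows. This uses the standard library's `unittest.skipIf` rather than whatever decorator the PR's tests actually use, and `device_count` here is a hypothetical stand-in that always reports zero GPUs, so the gated test is skipped.

```python
import unittest


def device_count():
    """Hypothetical stand-in for a real GPU-count query; reports no GPUs."""
    return 0


class AllReduceTests(unittest.TestCase):
    def test_cpu_only_validation(self):
        # Runs everywhere, including CPU-only CI: a 2-D shape has two dims.
        self.assertEqual(len((4, 8)), 2)

    @unittest.skipIf(device_count() < 2, "needs at least 2 GPUs")
    def test_multi_gpu_equivalence(self):
        # Placeholder body; a real test would launch one worker per GPU.
        self.fail("not reachable when fewer than 2 GPUs are reported")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(AllReduceTests)
result = unittest.TestResult()
suite.run(result)
print(len(result.skipped), result.wasSuccessful())
```

A skipped test is reported separately from a pass, so a CPU-only CI run stays green without hiding the fact that the equivalence check never executed.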
Carries the composed-graph cudaErrorMisalignedAddress fix from PR #269 and the --no-install-recommends CI fix.
Merge guard
Depends on #269; do not merge until #269 is on main. This branch carries the QB fix set via merge commit.
Changes (vs main)