ark-dev: examples/qwen3: TP all-reduce component -- mscclpp fused-packet all-reduce at Qwen3 attn-output and MLP-output shapes, equivalence test, microbenchmark #268

Open: chhwang wants to merge 11 commits into main from qwen3-q7-allreduce

chhwang (Contributor) commented Jun 13, 2026

TP all-reduce component (mscclpp fused-packet)

Wraps ark.all_reduce_packet for 2-D Qwen3 shapes. Includes CPU-only validation tests (run in CI), multi-GPU equivalence tests (skipped when device_count < 2), and a microbenchmark.
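
For readers unfamiliar with what an equivalence test for an all-reduce compares against, the following is a minimal pure-Python sketch, assuming a sum reduction over same-shaped 2-D inputs. The names `reference_all_reduce` and `allclose_2d` are hypothetical and are not taken from test_allreduce.py; the tolerances are placeholders, not the PR's values.

```python
from math import isclose


def reference_all_reduce(per_rank):
    """Reference semantics (assumed sum reduction): every rank receives the
    element-wise sum of all ranks' 2-D inputs."""
    if not per_rank:
        raise ValueError("need at least one rank")
    rows, cols = len(per_rank[0]), len(per_rank[0][0])
    for tensor in per_rank:
        if len(tensor) != rows or any(len(row) != cols for row in tensor):
            raise ValueError("all ranks must hold the same 2-D shape")
    total = [
        [sum(tensor[i][j] for tensor in per_rank) for j in range(cols)]
        for i in range(rows)
    ]
    # Each rank gets its own copy of the reduced tensor.
    return [[row[:] for row in total] for _ in per_rank]


def allclose_2d(a, b, rel_tol=1e-3, abs_tol=1e-3):
    """Tolerance-based comparison of two 2-D lists (placeholder tolerances)."""
    return len(a) == len(b) and all(
        len(ra) == len(rb)
        and all(isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol) for x, y in zip(ra, rb))
        for ra, rb in zip(a, b)
    )


# Two ranks, each holding a 2x3 tensor.
rank0 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
rank1 = [[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]]
outputs = reference_all_reduce([rank0, rank1])
```

An equivalence test would then compare each rank's device output against `outputs[rank]` with `allclose_2d` or an equivalent tolerance check.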

Carries the composed-graph cudaErrorMisalignedAddress fix from PR #269 and the --no-install-recommends CI fix.

Merge guard

Depends on #269 -- do not merge until #269 is on main. This branch carries the QB fix set via a merge commit.

Changes (vs main)

  • examples/qwen3/ark_allreduce.py -- all-reduce wrapper
  • examples/qwen3/test_allreduce.py -- 11 CPU-only + 4 multi-GPU tests
  • examples/qwen3/bench_allreduce.py -- microbenchmark
  • examples/qwen3/__init__.py, qwen3_config.py, equiv.py, microbench.py -- Q3 harness (carried)
  • .github/workflows/ut.yml -- Qwen3 test discovery + --no-install-recommends
  • QB fix: cast.h, ops_cast.cpp, ops_rope.cpp, test_composed_graph_shapes.py

…et all-reduce at Qwen3 attn-output and MLP-output shapes, equivalence test, microbenchmark
codecov Bot commented Jun 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.83%. Comparing base (c257202) to head (1e5592c).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #268      +/-   ##
==========================================
+ Coverage   85.70%   85.83%   +0.12%     
==========================================
  Files         129      129              
  Lines        6457     6495      +38     
==========================================
+ Hits         5534     5575      +41     
+ Misses        923      920       -3     

ark-dev and others added 10 commits June 14, 2026 07:20
…et all-reduce at Qwen3 attn-output and MLP-output shapes, equivalence test, microbenchmark
…tion shapes — root-cause planner/executor bug, fix, add regression tests
The bare 'apt-get install -y lcov' pulls 78 recommended packages
(fontconfig, fonts-dejavu, libgd, etc.), exhausting runner memory.
The runner receives SIGKILL during unpack before any test runs.
Adding --no-install-recommends limits the install to lcov and its
hard dependencies only.
…, doc comments

- ut.yml: move 'Run Qwen3 Example Tests' step before coverage collection
  so Qwen3 tests run with the build artifacts still in place.
- ark_allreduce.py: add world_size >= 1 guard, call ark.set_rank() and
  ark.set_world_size() before ark.init().
- test_allreduce.py: add tests for world_size=0 and world_size=-1.
- ops_cast.cpp, ops_rope.cpp: add cross-reference comments noting the
  identical default_config bodies.
- examples/qwen3/__init__.py: add submodule index comment.
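
The world_size guard and its two new tests can be sketched roughly as follows. This is only an illustration: `check_allreduce_args` is a hypothetical name, and the exception types and messages are assumptions, not the behavior of ark_allreduce.py.

```python
def check_allreduce_args(shape, world_size):
    """Hypothetical argument guard: require a 2-D shape and world_size >= 1."""
    if isinstance(world_size, bool) or not isinstance(world_size, int):
        raise TypeError(f"world_size must be an int, got {type(world_size).__name__}")
    if world_size < 1:
        raise ValueError(f"world_size must be >= 1, got {world_size}")
    if len(shape) != 2 or any(not isinstance(d, int) or d <= 0 for d in shape):
        raise ValueError(f"expected a 2-D shape with positive dims, got {shape!r}")
    return tuple(shape)


def rejects(shape, world_size):
    """Return True if the guard raises ValueError for these arguments."""
    try:
        check_allreduce_args(shape, world_size)
    except ValueError:
        return True
    return False
```

With this sketch, `rejects((8, 4096), 0)` and `rejects((8, 4096), -1)` cover the two cases named in the commit message. The shape `(8, 4096)` is illustrative only.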
The worker subprocess in bench_allreduce.py used
sys.path.insert(0, ".") which, when run from the repo root,
caused Python to import the C++ ark/ directory as a namespace
package instead of python/ark/. This produced:
  AttributeError: module 'ark' has no attribute 'set_rank'

Remove the sys.path manipulation; the environment PYTHONPATH
already points to the correct Python package.
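
One way to diagnose this kind of failure is to ask Python where an import actually resolved. The sketch below is a generic diagnostic, not code from bench_allreduce.py: `describe_import` and the throwaway directory name `demo_ns_pkg_xyz` are hypothetical. It shows that a directory without `__init__.py` imports as a namespace package, which has no `__file__` and none of the attributes a real package would define.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def describe_import(name):
    """Import `name` and report where it was resolved from."""
    module = importlib.import_module(name)
    origin = getattr(module, "__file__", None)
    return {
        # Namespace packages have a __path__ but no file of origin.
        "is_namespace": origin is None and hasattr(module, "__path__"),
        "origin": origin,
        "search_path": list(getattr(module, "__path__", [])),
    }


def demo():
    """Create a bare directory, put its parent first on sys.path, import it."""
    name = "demo_ns_pkg_xyz"  # hypothetical stand-in for a source-only directory
    with tempfile.TemporaryDirectory() as root:
        (Path(root) / name / "include").mkdir(parents=True)
        sys.path.insert(0, root)
        try:
            importlib.invalidate_caches()
            info = describe_import(name)
            has_set_rank = hasattr(sys.modules[name], "set_rank")
        finally:
            # Undo the path change so the demo leaves no trace.
            sys.path.remove(root)
            sys.modules.pop(name, None)
    return info["is_namespace"], has_set_rank
```

Calling `describe_import("ark")` in the failing worker would show whether `origin` points into python/ark/ or is `None`, which is the symptom described above.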
…itation

ARK codegen rejects OFFSET arguments referencing external buffers
created by all_reduce_packet (codegen.cpp:318). Mark the 4 multi-GPU
tests as xfail(strict=True) so CI passes while preserving the tests
as documentation. Tests will surface as xpass when the limitation is
resolved.
…ilures

bench_allreduce.py crashed without printing a PERF_GATE line because
workers hit AttributeError on ark.set_rank (codegen limitation blocks
all_reduce_packet). The bench now:
- handles worker failures gracefully
- always prints an honest PERF_GATE line (ratio=999999 when ARK path
  cannot execute, real ratio when it can)
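
The two behaviors listed above can be sketched as follows. Only the `PERF_GATE` marker and the `ratio=999999` sentinel come from the commit message. The rest of the line format, the direction of the ratio (ARK time over baseline time), and the helper names are assumptions.

```python
SENTINEL_RATIO = 999999  # value named in the commit message


def safe_measure(measure):
    """Run a measurement callable; return None instead of raising on failure."""
    try:
        return measure()
    except Exception:
        return None


def perf_gate_line(ark_us, baseline_us):
    """Always produce a PERF_GATE line, even when the ARK path did not run.

    Assumed convention: ratio = ARK latency / baseline latency.
    """
    if ark_us is None or baseline_us is None or baseline_us <= 0:
        return f"PERF_GATE ratio={SENTINEL_RATIO}"
    return f"PERF_GATE ratio={ark_us / baseline_us:.3f}"


def failing_worker():
    raise AttributeError("module 'ark' has no attribute 'set_rank'")


print(perf_gate_line(safe_measure(failing_worker), 10.0))
print(perf_gate_line(safe_measure(lambda: 12.0), 10.0))
```

A large sentinel makes a failed run fail any "ratio below threshold" gate rather than silently passing it.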
…nput in all_reduce_packet

codegen.cpp: emit moff.value() for external/unmapped buffers instead of
erroring. Internal buffers still resolve via buffer_id_to_offset_.

ops_all_reduce.cpp: copy input into an internal buffer so putPackets
reads from mscclpp-registered memory.

ops_all_reduce_test.cpp: add 3 fused-packet all-reduce tests (2/4/8 GPU).

test_allreduce.py: fix init ordering (ark.init() before set_rank/
set_world_size); remove 4 xfail decorators.
Move ark.init() before ark.set_rank()/ark.set_world_size() so
Model.reset() runs before rank/world_size are configured. Previous
order reset rank to 0 after setting it.
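
The ordering problem can be illustrated with a stand-in object. `FakeRuntime` below is not ARK's API. It only mimics the behavior described in the commit message, where initialization resets rank, and it assumes world_size resets to 1.

```python
class FakeRuntime:
    """Stand-in for a runtime whose init() resets rank and world size."""

    def __init__(self):
        self.rank = 0
        self.world_size = 1

    def init(self):
        # Mimics the reset described above: rank returns to 0.
        # Resetting world_size to 1 is an assumption for this sketch.
        self.rank = 0
        self.world_size = 1

    def set_rank(self, rank):
        self.rank = rank

    def set_world_size(self, world_size):
        self.world_size = world_size


def configure(runtime, rank, world_size, init_first):
    """Apply init and the setters in the requested order; return final state."""
    if init_first:
        runtime.init()
    runtime.set_rank(rank)
    runtime.set_world_size(world_size)
    if not init_first:
        runtime.init()  # the earlier ordering: wipes the values just set
    return runtime.rank, runtime.world_size


fixed = configure(FakeRuntime(), rank=3, world_size=4, init_first=True)
broken = configure(FakeRuntime(), rank=3, world_size=4, init_first=False)
```

With init first the configured values survive. With init last every worker ends up as rank 0.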