Fix composed-graph cudaErrorMisalignedAddress at 4D production shapes and CI OOM on lcov install #269

Merged
chhwang merged 2 commits into main from qwen3-qb-graphbug on Jun 16, 2026

@chhwang commented Jun 15, 2026

Contributor

Fix composed-graph cudaErrorMisalignedAddress at 4D shapes; fix CI OOM

Two bugs in the planner/executor caused cudaErrorMisalignedAddress or
silent data corruption when composing ops on 4D tensors where H > W
(e.g., (1,4,128,32)).

Bug 1 — Cast _InShape::W == 1 branch (cast.h:23): scalar path
processed 1 element while NelemPerThread = 2, skipping every other
H element. Triggered at shapes like (1,4,128,1).
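
As a rough illustration of Bug 1 (a pure-Python model of the behavior described above, not the kernel code in cast.h), the sketch below assumes each thread owns `NELEM_PER_THREAD` consecutive elements of a W == 1 column. The function `cast_column` and its parameter are hypothetical names introduced only for this example.

```python
NELEM_PER_THREAD = 2  # value stated in the PR description


def cast_column(src, elems_written_per_thread):
    """Model a cast over a W == 1 tensor holding len(src) elements along H.

    Each simulated thread owns NELEM_PER_THREAD consecutive elements but
    converts only the first `elems_written_per_thread` of them.
    """
    dst = [None] * len(src)  # None marks an element that was never written
    for base in range(0, len(src), NELEM_PER_THREAD):
        for i in range(elems_written_per_thread):
            if base + i < len(src):
                dst[base + i] = float(src[base + i])
    return dst


src = list(range(8))
# Scalar path as described: 1 element converted per thread.
buggy = cast_column(src, elems_written_per_thread=1)
# Path that converts every element the thread owns.
fixed = cast_column(src, elems_written_per_thread=NELEM_PER_THREAD)

print(buggy)  # every other entry stays None
print(fixed)
```

In this model the odd-indexed H elements are never written when only one element is converted per thread, which matches the "skipping every other H element" symptom.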

Bug 2 — Default tile (1, 64) for Cast and RoPE: at (1,4,128,32)
(H=128, W=32), the tile spans 2× W, forcing H-consecutive access that
misaligns addresses.

Fix: remove the dead W==1 scalar branch in cast.h; add
default_config overrides in ops_cast.cpp and ops_rope.cpp selecting
tile (32,2) when H > W and W ≥ 2. Add 23 regression tests in
test_composed_graph_shapes.py covering rope, composed rmsnorm, and
composed silu·gate at the exact production shapes that crashed.
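
The selection rule described in the fix can be sketched as follows. This is a minimal Python model, not the C++ `default_config` code in ops_cast.cpp or ops_rope.cpp; `pick_tile` and `tile_width_over_w` are hypothetical helpers, and the sketch assumes that the second tile entry is the extent along W and that shapes are ordered (N, C, H, W).

```python
DEFAULT_TILE = (1, 64)  # default tile named in the PR description
TALL_TILE = (32, 2)     # override named in the PR description


def pick_tile(shape):
    """Hypothetical helper: choose a tile for a 4D shape (N, C, H, W)."""
    _, _, h, w = shape
    if h > w and w >= 2:
        return TALL_TILE
    return DEFAULT_TILE


def tile_width_over_w(tile, shape):
    """Ratio of the tile's second extent to W (assumed to be the W extent)."""
    return tile[1] / shape[3]


prod_shape = (1, 4, 128, 32)  # production shape quoted in the description
print(pick_tile(prod_shape))
print(tile_width_over_w(DEFAULT_TILE, prod_shape))  # default tile is wider than W
print(tile_width_over_w(pick_tile(prod_shape), prod_shape))
```

Under these assumptions the default tile is twice as wide as W at (1,4,128,32), while the override stays within a single row. Shapes with W < 2 or H ≤ W keep the default tile.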

CI fix: add --no-install-recommends to apt-get install lcov in
ut.yml. The bare install pulled 78 recommended packages (fontconfig,
fonts-dejavu, etc.), causing OOM/SIGKILL on the self-hosted runner.

Files changed:

  • ark/include/kernels/cast.h — remove dead W==1 branch
  • ark/ops/ops_cast.{cpp,hpp} — add default_config for H > W
  • ark/ops/ops_rope.{cpp,hpp} — add default_config for H > W
  • python/unittest/ops/test_composed_graph_shapes.py — 23 regression tests
  • .github/workflows/ut.yml — --no-install-recommends for lcov

chhwang added 2 commits June 15, 2026 02:14
…tion shapes — root-cause planner/executor bug, fix, add regression tests
The bare 'apt-get install -y lcov' pulls 78 recommended packages
(fontconfig, fonts-dejavu, libgd, etc.), exhausting runner memory.
The runner receives SIGKILL during unpack before any test runs.
Adding --no-install-recommends limits the install to lcov and its
hard dependencies only.

codecov Bot commented Jun 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.78%. Comparing base (c257202) to head (c9c5872).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #269      +/-   ##
==========================================
+ Coverage   85.70%   85.78%   +0.08%     
==========================================
  Files         129      129              
  Lines        6457     6495      +38     
==========================================
+ Hits         5534     5572      +38     
  Misses        923      923              

@chhwang changed the title from "ark: fix composed-graph cudaErrorMisalignedAddress crash at 4D production shapes — root-cause planner/executor bug, fix, add regression tests" to "Fix composed-graph cudaErrorMisalignedAddress at 4D production shapes and CI OOM on lcov install" on Jun 15, 2026
@chhwang merged commit c619d64 into main on Jun 16, 2026
10 of 11 checks passed
@chhwang deleted the qwen3-qb-graphbug branch on June 16, 2026 06:53