[AMD] perf: enable FlyDSL w4a16 MoE for Kimi INT4 #1777

Open
amd-asalykov wants to merge 3 commits into SemiAnalysisAI:main from amd-asalykov:flydsl-moe


@amd-asalykov amd-asalykov commented Jun 15, 2026

Collaborator

Replace the default Triton w4a16 MoE kernel with the more performant FlyDSL implementation for Kimi INT4 on MI355X.


Note

Low Risk
Benchmark and serving-flag changes only; no application auth or data paths. Risk is limited to reproducibility and CI cost from expanded sweeps and a nightly container pin.

Overview
Updates the Kimi K2.5 INT4 vLLM MI355X benchmark to use FlyDSL for w4a16 MoE instead of the default Triton path, and pins a digest-suffixed ROCm nightly image (vllm-openai-rocm:nightly-b8336c3…).

The runner script kimik2.5_int4_mi355x.sh adds --moe-backend flydsl and a compilation pass that sets fuse_allreduce_rms to false. CI config expands the fixed-seq-len sweep: concurrency up to 128 (from 64) and an additional TP=4 row for both 1k/1k and 8k/1k scenarios.
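For illustration, the serving-flag change described above could look roughly like the sketch below. This is not the actual content of kimik2.5_int4_mi355x.sh: the model identifier and tensor-parallel size are hypothetical placeholders, and the JSON shape (fuse_allreduce_rms nested under pass_config and passed via --compilation-config) is an assumption about how the flag is wired. Only --moe-backend flydsl and the fuse_allreduce_rms=false setting come from the summary. The sketch prints the command instead of launching a server.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical placeholders: the real script's model path and TP size
# are not shown in this PR summary.
MODEL="${MODEL:-example-org/kimi-k2.5-int4}"
TP="${TP:-8}"

# Assumed JSON shape: fuse_allreduce_rms nested under pass_config.
COMPILATION_CONFIG='{"pass_config": {"fuse_allreduce_rms": false}}'

VLLM_ARGS=(
  --tensor-parallel-size "$TP"
  --moe-backend flydsl                        # from the PR summary
  --compilation-config "$COMPILATION_CONFIG"  # assumed way of passing the setting
)

# Print the command rather than starting a server, so the sketch runs anywhere.
printf 'vllm serve %s' "$MODEL"
printf ' %q' "${VLLM_ARGS[@]}"
printf '\n'
```

Keeping the flags in an array makes it straightforward to toggle the MoE backend when comparing against the default Triton path.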

perf-changelog.yaml records the config-key change for PR #1777.

Reviewed by Cursor Bugbot for commit be23347.

Comment thread .github/configs/amd-master.yaml Outdated
Comment thread perf-changelog.yaml

3 participants