[DNM][AMD] agentx-v0.4 rebased from commit chore/agentx-v0.4 commit 7f61 by seungrokj · Pull Request #1709 · SemiAnalysisAI/InferenceX

seungrokj · 2026-06-11T06:08:14Z

Summary

Add qwen3.5-fp4-mi355x-sglang-agentic-hicache config: SGLang agentic-coding sweep with and without hicache offloading (TP2, EP1)
Add minimaxm2.5-fp4-mi355x-vllm-agentic-lmcache config: vLLM agentic-coding sweep with lmcache
Add new agentic benchmark scripts: minimaxm2.5_fp4_mi355x.sh, qwen3.5_fp4_mi355x.sh
Update existing agentic scripts: glm5.1_fp4_mi355x.sh, kimik2.5_fp4_mi355x.sh, minimaxm2.5_fp8_mi355x.sh, qwen3.5_fp8_mi355x.sh
Update launch_mi355x-amds.sh

Test plan

Verify hicache/lmcache agentic configs run correctly on MI355X
Confirm new agentic scripts launch without errors

🤖 Generated with Claude Code

Note

Medium Risk
Large benchmark-only change with many new long-running sweeps and LMCache built from source at job time; mis-tuned DRAM/offload settings could cause flaky CI rather than affecting production services.

Overview
Extends agentx v0.4 MI355X coverage by wiring agentic-coding matrix entries that compare GPU-only vs CPU-tier KV offload (hicache on SGLang, lmcache on vLLM/ATOM), plus targeted image and concurrency grid updates.

CI config (amd-master.yaml) adds several # target sweeps (e.g. Qwen3.5 FP4/FP8 HiCache, GLM5.1 HiCache, Kimi/MiniMax LMCache, DSv4 Atom/SGLang HiCache) and tweaks existing agentic rows (Qwen3.5 FP8 HiCache moves to TP4; MiniMax FP8 agentic images pin to v0.22.0).

Launch scripts gain offload plumbing: HiCache sizing/ratio logic for SGLang agentic runs, and LMCache MP startup (often git clone + HIP build) with larger host DRAM budgets, longer read TTLs, and revised vLLM/ATOM server flags. Kimi agentic drops the prior in-repo ROCm LMCache Python patches in favor of the upstream LMCache install path. New agentic entrypoints include DSv4 (SGLang + Atom), MiniMax FP4, and Qwen3.5 FP4; DSv4 fixed-seq and agentic SGLang recipes shift to the newer dsv4 attention backend / sglang serve style launch. Slurm launcher excludes node mia1-p01-g37.

Reviewed by Cursor Bugbot for commit 76d90e0. Bugbot is set up for automated code reviews on this repo.

github-actions · 2026-06-11T06:08:23Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please first open a PR there before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get approval from the respective company's CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

cursor · 2026-06-11T06:09:45Z

+    $ASYNC_SCHEDULING_ARGS 
+    "${PREFIX_CACHE_ARGS[@]}"
+    "${OFFLOAD_ARGS[@]}"
+)


vLLM uses wrong model

High Severity

The vLLM command serves "$MODEL" and omits --served-model-name, while the script downloads weights into MODEL_PATH and build_replay_cmd sends --model $MODEL to aiperf. That breaks the usual MODEL_PATH + served-name pairing used by sibling agentic scripts and can fail when MODEL is a Hub id but weights live under MODEL_PATH.
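A minimal sketch of the pairing the review asks for, assuming MODEL holds the Hugging Face Hub id and MODEL_PATH (when set) holds the downloaded weights: serve the local directory but advertise it under the Hub id, so that aiperf's --model $MODEL still resolves. --served-model-name is an existing vllm serve option; the build_vllm_cmd helper, the fallback to MODEL when MODEL_PATH is empty, and the example paths are hypothetical. The sketch only builds and prints the command.

```bash
#!/usr/bin/env bash
# Sketch: serve the prepared weights, but expose them under the Hub id that
# aiperf sends as --model. Helper name and example paths are hypothetical.
build_vllm_cmd() {
  local model="$1"          # Hub id (what MODEL holds)
  local model_path="${2:-}" # local weights dir (what MODEL_PATH holds)
  # Assumption: fall back to the Hub id when no local directory was prepared.
  local weights="${model_path:-$model}"
  # Global array so the caller can splice it into the full vllm serve line.
  VLLM_CMD=(vllm serve "$weights" --served-model-name "$model")
}

build_vllm_cmd "example-org/example-model" "/data/models/example-model"
printf '%q ' "${VLLM_CMD[@]}"
echo
```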

Reviewed by Cursor Bugbot for commit 01cc2af.

cursor · 2026-06-11T06:09:45Z

    --mem-fraction-static 0.8 \
-    --context-length $MAX_MODEL_LEN \
+    "${CACHE_ARGS[@]}" \
+    "${WARMUP_ARGS[@]}" \


SGLang ignores MODEL_PATH

Medium Severity

SGLang is started with --model-path $MODEL and no --served-model-name, even though the script may download weights into MODEL_PATH. Matrix jobs that set a local MODEL_PATH can still point the server at the Hub id, and the OpenAI model name may not match the MODEL value used by aiperf.

Additional Locations (1)

benchmarks/single_node/agentic/qwen3.5_fp4_mi355x.sh#L123-L141
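A sketch of the corresponding SGLang fix, assuming MODEL is the Hub id and MODEL_PATH (when set) is the local weights directory. --model-path and --served-model-name are existing sglang.launch_server arguments; build_sglang_cmd and the placeholder model id are illustrative only.

```bash
#!/usr/bin/env bash
# Sketch: launch SGLang from the directory the job prepared and keep the
# OpenAI-facing name equal to MODEL. Helper name and defaults are hypothetical.
build_sglang_cmd() {
  local model="$1" model_path="${2:-}"
  SGLANG_CMD=(
    python3 -m sglang.launch_server
    --model-path "${model_path:-$model}"
    --served-model-name "$model"
  )
}

build_sglang_cmd "${MODEL:-example-org/example-model}" "${MODEL_PATH:-}"
printf '%q ' "${SGLANG_CMD[@]}"
echo
```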

Reviewed by Cursor Bugbot for commit 01cc2af.

cursor · 2026-06-11T06:09:45Z

+        cd LMCache
+        pip install -r requirements/build.txt 
+        CXX=hipcc BUILD_WITH_HIP=1 pip install -e .   --no-build-isolation
+        cd ..


LMCache clone not idempotent

Medium Severity

The lmcache path runs git clone https://github.com/LMCache/LMCache.git unconditionally. With set -e, a second run in the same working directory aborts because git clone fails when the LMCache directory already exists, so lmcache agentic jobs fail on retry or whenever the job's working directory is reused.

Additional Locations (1)

benchmarks/single_node/agentic/minimaxm2.5_fp4_mi355x.sh#L149-L154
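One way to make the checkout safe to re-run, as a sketch: skip git clone when a checkout already exists. The clone_lmcache_once helper is hypothetical, and treating a .git directory as "already cloned" is an assumption; it does not check that the existing checkout is at the expected revision.

```bash
#!/usr/bin/env bash
set -e
# Sketch: reuse an existing LMCache checkout instead of letting git clone
# fail on an existing directory (which aborts the script under set -e).
clone_lmcache_once() {
  local dest="${1:-LMCache}"
  if [ -d "$dest/.git" ]; then
    echo "reusing existing checkout in $dest"
  else
    git clone https://github.com/LMCache/LMCache.git "$dest"
  fi
}
```

The existing build steps (cd LMCache, pip install -r requirements/build.txt, then the HIP editable install) would run after a call such as clone_lmcache_once LMCache.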

Reviewed by Cursor Bugbot for commit 01cc2af.

cursor · 2026-06-12T06:15:23Z

+
+python3 -m sglang.launch_server \
+    --attention-backend aiter \
+    --model-path $MODEL \


Server ignores MODEL_PATH

Medium Severity

Weights are downloaded into MODEL_PATH when the workflow sets that directory, but SGLang is started with --model-path $MODEL (Hub id) instead of MODEL_PATH. The server may load a different cache path than the one prepared for the job.

Reviewed by Cursor Bugbot for commit 32f5007.

cursor · 2026-06-12T06:15:23Z

+        OFFLOAD_ARGS=(
+            --kv-transfer-config
+            "{\"kv_connector\":\"LMCacheMPConnector\",\"kv_connector_module_path\":\"lmcache.integration.vllm.lmcache_mp_connector\",\"kv_role\":\"kv_both\",\"kv_connector_extra_config\":{\"lmcache.mp.host\":\"$LMCACHE_CONNECT_HOST\",\"lmcache.mp.port\":$LMCACHE_PORT}}"
+        )


LMCache missing hybrid disable

High Severity

The lmcache branch omits --disable-hybrid-kv-cache-manager on vllm serve, while the new minimaxm2.5-fp8-mi355x-vllm-agentic-lmcache config exercises that path. The sibling FP4 script documents that LMCache is incompatible without disabling the hybrid KV manager.
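A sketch of the missing wiring, assuming the offload arm is selected by a single string: whenever the LMCache connector is configured, --disable-hybrid-kv-cache-manager is appended to the same argument array. The connector JSON is abbreviated (the real script also passes lmcache.mp.host and lmcache.mp.port in kv_connector_extra_config), and build_offload_args is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: keep the hybrid-KV-manager switch next to the LMCache connector so
# one cannot be enabled without the other. Helper name is hypothetical.
build_offload_args() {
  local offloading="$1"
  OFFLOAD_ARGS=()
  if [[ "$offloading" == "lmcache" ]]; then
    # Abbreviated connector config; kv_connector_extra_config omitted here.
    local kv_cfg='{"kv_connector":"LMCacheMPConnector","kv_connector_module_path":"lmcache.integration.vllm.lmcache_mp_connector","kv_role":"kv_both"}'
    OFFLOAD_ARGS+=(
      --kv-transfer-config "$kv_cfg"
      --disable-hybrid-kv-cache-manager
    )
  fi
}

build_offload_args "${OFFLOADING:-lmcache}"
printf '%q\n' "${OFFLOAD_ARGS[@]}"
```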

Reviewed by Cursor Bugbot for commit 32f5007.

cursor · 2026-06-12T08:06:30Z

-        module = _orig_import(name, globals, locals, fromlist, level)
-        if name == "lmcache.v1.lazy_memory_allocator" or (
-            name.startswith("lmcache") and "lmcache.v1.lazy_memory_allocator" in sys.modules
-        ):


Kimi LMCache ROCm fixes removed

High Severity

The Kimi MI355X agentic script replaces the prior ROCm LMCache install (ROCm CuPy, nixl cleanup, demand-pinned allocator, MLA block fallback, chunked connector, scheduler KV-transfer patch) with a bare git clone and HIP build. New kimik2.5-fp4-mi355x-vllm-agentic-lmcache sweeps depend on this path for Kimi MLA KV on AMD.

Reviewed by Cursor Bugbot for commit 351e729.

cursor · 2026-06-12T08:06:30Z

+
+# ---- Resolve traces and install deps ----------------------------------------
+# https://huggingface.co/datasets/semianalysisai/cc-traces-weka-with-subagents-060826
+export WEKA_LOADER_OVERRIDE=semianalysis_cc_traces_weka_with_subagents_060826


DSv4 atom uncapped traces

Medium Severity

This new DSv4 ATOM agentic script sets WEKA_LOADER_OVERRIDE to the uncapped 060826 trace set, while peer MI355X agentic scripts in the same PR use 060226_256k to avoid ~1M-token traces that are rejected and skew sweeps.

Reviewed by Cursor Bugbot for commit 351e729.

cursor · 2026-06-12T08:45:54Z

+    $ASYNC_SCHEDULING_ARGS 
+    "${PREFIX_CACHE_ARGS[@]}"
+    "${OFFLOAD_ARGS[@]}"
+)


MiniMax FP8 launcher regressed

High Severity

The MI355X MiniMax FP8 agentic launcher was replaced with a Kimi-style vLLM recipe. Existing minimaxm2.5-fp8-mi355x-vllm-agentic jobs (TP4/EP4, offloading=cpu) lose the prior --max-model-len, ROCM_AITER_UNIFIED_ATTN backend, MODEL_PATH-based serve, and SimpleCPU offload wiring they depended on.

Reviewed by Cursor Bugbot for commit faba18f.

cursor · 2026-06-12T08:45:54Z

-                device,
-            )
-            return torch.as_strided(
-                base,


Kimi context length dropped

Medium Severity

The launcher no longer normalizes MAX_MODEL_LEN to 262144 or passes --max-model-len to vLLM. Agentic sweeps typically leave MAX_MODEL_LEN at 0, so the replay harness and Kimi’s enforced context window can disagree and traces may be filtered or rejected differently than the server allows.
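A sketch of the normalization the review says was dropped, assuming 0 means "unset" and that 262144 is the intended Kimi window (both taken from the review text). --max-model-len is an existing vllm serve option; normalize_max_model_len and CONTEXT_ARGS are hypothetical names.

```bash
#!/usr/bin/env bash
# Sketch: treat MAX_MODEL_LEN=0 (the usual agentic default) as "use Kimi's
# 262144-token window" and pass the value to vLLM explicitly, so the server
# and the replay harness filter traces against the same limit.
normalize_max_model_len() {
  local len="${1:-0}"
  if (( len == 0 )); then
    len=262144
  fi
  echo "$len"
}

MAX_MODEL_LEN="$(normalize_max_model_len "${MAX_MODEL_LEN:-0}")"
CONTEXT_ARGS=(--max-model-len "$MAX_MODEL_LEN")
printf '%q ' "${CONTEXT_ARGS[@]}"
echo
```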

Reviewed by Cursor Bugbot for commit faba18f.

cursor · 2026-06-12T08:50:14Z

        # ZMQ-style host string.
        LMCACHE_CONNECT_HOST="${LMCACHE_CONNECT_HOST:-tcp://$LMCACHE_HOST}"
-        LMCACHE_L1_SIZE_GB="${LMCACHE_L1_SIZE_GB:-$TOTAL_CPU_DRAM_GB}"
+        LMCACHE_L1_SIZE_GB="${LMCACHE_L1_SIZE_GB:-$((TOTAL_CPU_DRAM_GB / (8 / TP)))}"


LMCache pool wrongly partitioned

Medium Severity

LMCACHE_L1_SIZE_GB for the external LMCache MP server is derived with TOTAL_CPU_DRAM_GB / (8 / TP), the same formula used for per-rank vLLM CPU offload. The MP server owns a single node-wide pool, so at TP=4 this halves L1 from ~3 TB to ~1.5 TB compared with the prior default of the full TOTAL_CPU_DRAM_GB.

Additional Locations (1)

benchmarks/single_node/agentic/dsv4_fp4_mi355x_atom.sh#L162-L163
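The arithmetic behind the review, as a sketch: the per-rank formula scales the node budget by TP / 8, while the single MP server should default to the whole pool (the prior LMCACHE_L1_SIZE_GB default). The 3072 GB value is an illustrative stand-in for the "~3 TB" in the review, and per_rank_offload_gb is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch contrasting the two sizing rules. The 3072 GB default is an
# illustrative node budget, not a measurement.
TOTAL_CPU_DRAM_GB="${TOTAL_CPU_DRAM_GB:-3072}"
TP="${TP:-4}"

# Per-rank vLLM CPU offload share: TOTAL / (8 / TP), using bash integer math.
per_rank_offload_gb() {
  local total="$1" tp="$2"
  echo $(( total / (8 / tp) ))
}

# External LMCache MP server: one process owns the whole node pool, so its
# L1 default is the full budget (the behavior before this change).
LMCACHE_L1_SIZE_GB="${LMCACHE_L1_SIZE_GB:-$TOTAL_CPU_DRAM_GB}"

echo "per-rank offload: $(per_rank_offload_gb "$TOTAL_CPU_DRAM_GB" "$TP") GB"
echo "LMCache MP L1:    ${LMCACHE_L1_SIZE_GB} GB"
```

Because 8 / TP is integer division in bash, the per-rank formula only splits evenly for TP values that divide 8 (for example, TP=3 gives 8 / 3 = 2).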

Reviewed by Cursor Bugbot for commit 8ca4bc1.

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 12 total unresolved issues (including 10 from previous reviews).

Reviewed by Cursor Bugbot for commit 76d90e0.

cursor · 2026-06-15T00:22:06Z

-    --cuda-graph-max-bs "$PER_ENGINE_MAX_RUNNING" \
+    --disable-radix-cache \
+    --attention-backend dsv4 \
+    --max-running-requests ${CONC} \


Radix cache disabled for agentic

High Severity

The DSv4 MI355X agentic SGLang launcher passes --disable-radix-cache, while the none/hicache branches in the same file state that agentic replay depends on RadixAttention prefix reuse. That mismatch can drive the prefix hit rate to zero and skew agentic-coding results relative to the other offload arms.
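A sketch of keeping the radix cache on by default, assuming the dsv4 attention backend does not itself require it to be off (the review does not say either way). --disable-radix-cache is an existing SGLang argument; DISABLE_RADIX_CACHE and build_radix_args are hypothetical names for an explicit opt-out.

```bash
#!/usr/bin/env bash
# Sketch: keep RadixAttention prefix reuse on for agentic replay and disable
# it only on explicit request. DISABLE_RADIX_CACHE is a hypothetical knob.
build_radix_args() {
  local disable="${1:-false}"
  RADIX_ARGS=()
  if [[ "$disable" == "true" ]]; then
    RADIX_ARGS+=(--disable-radix-cache)
  fi
}

build_radix_args "${DISABLE_RADIX_CACHE:-false}"
if (( ${#RADIX_ARGS[@]} == 0 )); then
  echo "radix cache enabled"
else
  echo "radix args: ${RADIX_ARGS[*]}"
fi
```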

Reviewed by Cursor Bugbot for commit 76d90e0.

cursor · 2026-06-15T00:22:06Z

-    --cuda-graph-max-bs "$PER_ENGINE_MAX_RUNNING" \
+    --disable-radix-cache \
+    --attention-backend dsv4 \
+    --max-running-requests ${CONC} \


DP max-running requests wrong

Medium Severity

When DP_ATTENTION=true, the script computes PER_ENGINE_MAX_RUNNING as CONC/TP for per-engine limits, but the server is started with --max-running-requests ${CONC}, so each DP engine may accept more sequences than the harness's load-balancing assumption allows.
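A sketch of passing the per-engine value instead of CONC, following the review's premise that --max-running-requests applies per DP engine; that interpretation should be checked against the SGLang version in use. compute_max_running, the placeholder defaults, and the clamp to at least 1 (so CONC < TP never yields 0) are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Sketch: hand SGLang the per-engine limit when DP attention is on.
# CONC/TP/DP_ATTENTION defaults below are placeholders.
compute_max_running() {
  local conc="$1" tp="$2" dp_attention="$3"
  local limit="$conc"
  if [[ "$dp_attention" == "true" ]]; then
    # Per-engine split as described in the review (CONC / TP).
    limit=$(( conc / tp ))
    # Assumption: never pass 0 when CONC < TP.
    if (( limit < 1 )); then
      limit=1
    fi
  fi
  echo "$limit"
}

PER_ENGINE_MAX_RUNNING="$(compute_max_running "${CONC:-64}" "${TP:-8}" "${DP_ATTENTION:-false}")"
SERVER_ARGS=(--max-running-requests "$PER_ENGINE_MAX_RUNNING")
printf '%q ' "${SERVER_ARGS[@]}"
echo
```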

Reviewed by Cursor Bugbot for commit 76d90e0.

[AMD] agentic: add hicache/lmcache configs, update agentic scripts for mi355x models

01cc2af

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

seungrokj requested review from 1am9trash, billishyahao, chunfangamd and yctseng0211 as code owners June 11, 2026 06:08

github-project-automation Bot added this to InferenceMAX Board Jun 11, 2026

seungrokj mentioned this pull request Jun 11, 2026

[DNM][AMD] agentx-v0.4 #1654

Closed

cursor Bot reviewed Jun 11, 2026

seungrokj changed the title ~~[AMD] agentic: add hicache/lmcache configs, update agentic scripts for mi355x models~~ [DNM][AMD] agentx-v0.4 rebased from commit chore/agentx-v0.4 commit 7f61 Jun 11, 2026

ajith-sirra-amd and others added 3 commits June 11, 2026 12:54

Add GLM5.1 & Qwen3.5 MI300 Agentic Scripts

ba1bb37

Signed-off-by: ajith-sirra-amd <ajith.sirra@amd.com>

[AMD] add DSV4-FP4-MI355x atom agentic benchmark and master yaml config

eba4233

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

[AMD] update DSV4-FP4-MI355x atom agentic benchmark and master yaml config

32f5007

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cursor Bot reviewed Jun 12, 2026

[AMD] dsv4_fp4_mi355x_atom.sh: update agentic benchmark script

351e729

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cursor Bot reviewed Jun 12, 2026

ajith-sirra-amd added 2 commits June 12, 2026 14:12

Add DSV4 MI355X Agentic Scripts

64ce90c

Signed-off-by: ajith-sirra-amd <ajith.sirra@amd.com>

Merge branch 'amd/agentx-v0.4_rebase0611' of https://github.com/SemiAnalysisAI/InferenceX into amd/agentx-v0.4_rebase0611

faba18f

cursor Bot reviewed Jun 12, 2026

Add DSV4 MI355X Agentic Scripts

8ca4bc1

Signed-off-by: ajith-sirra-amd <ajith.sirra@amd.com>

cursor Bot reviewed Jun 12, 2026

[AMD] update DSV4-FP4-MI355X SGLang agentic benchmark and master yaml config

37f57a7

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cursor Bot reviewed Jun 14, 2026

Comment thread benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_sglang.sh Outdated

Comment thread benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_sglang.sh Outdated

[AMD] update DSV4-FP4-MI355X SGLang agentic/fixed-seq-len benchmark scripts and master yaml

76d90e0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cursor Bot reviewed Jun 15, 2026
