[OMNIML-4740] synth_support#1496
Conversation
📝 Walkthrough
Adds a new Kimi-K2.5 vLLM offline data-synthesis pipeline YAML, adjusts MPI/local-rank fallbacks in service utilities, and makes dependency installation coordinated across local ranks via a shared marker file.
Changes: Launcher examples and service utilities
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Closing — OMNIML-4735 Epic wrap-up. Used as the first real-release dispatch on the fully-consolidated nmm-sandbox stack (Phase 3 production validation); not pursuing the YAML merge today.
Re-opening — closing this PR earlier was premature cleanup. OMNIML-4735 is now being driven to an actual release; this PR is the synth_support deliverable awaiting real review + merge. |
Force-pushed 1b02102 → 006e8b2.
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Actionable comments posted: 2
🧹 Nitpick comments (1)
tools/launcher/examples/moonshotai/Kimi-K2.5 DFlash/hf_offline_eagle3.yaml (1)
Line 104: ⚡ Quick win: Pin the VLLM container to an immutable version/digest.
Using vllm/vllm-openai:latest makes benchmark results non-reproducible across runs; pin to a tested tag or digest.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `tools/launcher/examples/moonshotai/Kimi-K2.5 DFlash/hf_offline_eagle3.yaml` at line 104, the container image is pinned to the floating tag "vllm/vllm-openai:latest" under the container key; replace this with an immutable tag or digest (for example a specific semver tag or a sha256 digest) that you've tested for benchmarks so results are reproducible, e.g. change the container value from "vllm/vllm-openai:latest" to a concrete string like "vllm/vllm-openai:<tested-tag>" or "vllm/vllm-openai@sha256:<digest>".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/launcher/examples/moonshotai/Kimi-K2.5` DFlash/hf_offline_eagle3.yaml:
- Line 73: The training output path (training.output_dir) is set to
/scratchspace/eagle3 but later steps (task_3 / the draft weights reader) expect
artifacts at /scratchspace/export; change training.output_dir to match the
benchmark input path (/scratchspace/export) or update the task_3 draft weights
path to /scratchspace/eagle3 so both produce and consume the same directory
(refer to the training.output_dir key and the task_3 draft weights
reference/train_eagle.sh usage to locate where to update).
- Around line 39-40: The YAML is missing required environment variables for the
new model config: add MLM_MODEL_CFG (set to the HuggingFace repo ID for the
model) and QUANT_CFG (e.g., NVFP4_DEFAULT_CFG or INT8_DEFAULT_CFG) to every
environment block that defines task envs (the existing blocks that contain
HF_LOCAL) and create an environment block for task_2 that includes HF_LOCAL plus
MLM_MODEL_CFG and QUANT_CFG so the launcher YAML conforms to the model-config
contract; ensure the variable names are spelled exactly as MLM_MODEL_CFG and
QUANT_CFG in each task's environment list.
---
Nitpick comments:
In `@tools/launcher/examples/moonshotai/Kimi-K2.5` DFlash/hf_offline_eagle3.yaml:
- Line 104: The container image is pinned to the floating tag
"vllm/vllm-openai:latest" under the container key; replace this with an
immutable tag or digest (for example a specific semver tag or a sha256 digest)
that you've tested for benchmarks so results are reproducible, e.g. change the
container value from "vllm/vllm-openai:latest" to a concrete string like
"vllm/vllm-openai:<tested-tag>" or "vllm/vllm-openai@sha256:<digest>".
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 66ea5a13-65b0-4b34-81d6-f8fa01cc5581
📒 Files selected for processing (1)
tools/launcher/examples/moonshotai/Kimi-K2.5 DFlash/hf_offline_eagle3.yaml
environment:
  - HF_LOCAL: /hf-local
Add required MLM_MODEL_CFG and QUANT_CFG env vars for this new model config.
Line 39, Line 57, and Line 97 define task envs, but MLM_MODEL_CFG and QUANT_CFG are missing, and task_2 (Line 67-83) has no environment block at all. This violates the model-config contract for launcher YAMLs.
Proposed patch
global_vars:
hf_model: /hf-local/moonshotai/Kimi-K2.5 DFlash
+ hf_repo_id: moonshotai/Kimi-K2.5-DFlash
@@
task_0:
@@
environment:
- HF_LOCAL: /hf-local
+ - MLM_MODEL_CFG: <<global_vars.hf_repo_id>>
+ - QUANT_CFG: NVFP4_DEFAULT_CFG
@@
task_1:
@@
environment:
- HF_MODEL_CKPT: <<global_vars.hf_model>>
+ - MLM_MODEL_CFG: <<global_vars.hf_repo_id>>
+ - QUANT_CFG: NVFP4_DEFAULT_CFG
@@
task_2:
@@
+ environment:
+ - MLM_MODEL_CFG: <<global_vars.hf_repo_id>>
+ - QUANT_CFG: NVFP4_DEFAULT_CFG
@@
task_3:
@@
environment:
- HF_MODEL_CKPT: <<global_vars.hf_model>>
+ - MLM_MODEL_CFG: <<global_vars.hf_repo_id>>
+ - QUANT_CFG: NVFP4_DEFAULT_CFG
As per coding guidelines, "Set MLM_MODEL_CFG environment variable to the HuggingFace repo ID when adding a new model config" and "Set QUANT_CFG environment variable (e.g., NVFP4_DEFAULT_CFG, INT8_DEFAULT_CFG) when adding a new model config".
Also applies to: 57-58, 67-83, 97-99
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/examples/moonshotai/Kimi-K2.5` DFlash/hf_offline_eagle3.yaml
around lines 39 - 40, The YAML is missing required environment variables for the
new model config: add MLM_MODEL_CFG (set to the HuggingFace repo ID for the
model) and QUANT_CFG (e.g., NVFP4_DEFAULT_CFG or INT8_DEFAULT_CFG) to every
environment block that defines task envs (the existing blocks that contain
HF_LOCAL) and create an environment block for task_2 that includes HF_LOCAL plus
MLM_MODEL_CFG and QUANT_CFG so the launcher YAML conforms to the model-config
contract; ensure the variable names are spelled exactly as MLM_MODEL_CFG and
QUANT_CFG in each task's environment list.
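The contract described above (every task environment must carry MLM_MODEL_CFG and QUANT_CFG) lends itself to a pre-submit lint. A crude grep-based sketch, assuming the `- VAR: value` list shape shown in the patches; a real check would parse the YAML and verify the vars per task, and the function name is illustrative.

```shell
#!/usr/bin/env bash
set -euo pipefail

# File-level lint: fail unless both required env vars appear as
# "- VAR: value" list entries somewhere in the launcher YAML.
lint_env_contract() {
  local yaml="$1" var bad=0
  for var in MLM_MODEL_CFG QUANT_CFG; do
    if ! grep -Eq "^[[:space:]]*-[[:space:]]*${var}:" "$yaml"; then
      echo "missing required env var ${var} in ${yaml}" >&2
      bad=1
    fi
  done
  return "$bad"
}

yaml="$(mktemp)"
printf 'environment:\n  - HF_LOCAL: /hf-local\n' > "$yaml"
lint_env_contract "$yaml" || echo "contract violated"
rm -f "$yaml"
```

Run against the example above (HF_LOCAL only), it reports both vars missing and prints "contract violated".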
- --config modules/Model-Optimizer/modelopt_recipes/general/speculative_decoding/eagle3.yaml
- model.model_name_or_path=<<global_vars.hf_model>>
- data.offline_data_path=/scratchspace/offline_hidden_states
- training.output_dir=/scratchspace/eagle3
Align the training output path with benchmark input path.
Line 73 writes artifacts to /scratchspace/eagle3, but Line 88 reads draft weights from /scratchspace/export. Unless train_eagle.sh exports to /scratchspace/export implicitly, task_3 will not find the draft model.
Proposed patch
- - --draft_model_dir /scratchspace/export
+ - --draft_model_dir /scratchspace/eagle3
Also applies to: 88-88
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/examples/moonshotai/Kimi-K2.5` DFlash/hf_offline_eagle3.yaml
at line 73, The training output path (training.output_dir) is set to
/scratchspace/eagle3 but later steps (task_3 / the draft weights reader) expect
artifacts at /scratchspace/export; change training.output_dir to match the
benchmark input path (/scratchspace/export) or update the task_3 draft weights
path to /scratchspace/eagle3 so both produce and consume the same directory
(refer to the training.output_dir key and the task_3 draft weights
reference/train_eagle.sh usage to locate where to update).
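Producer/consumer mismatches like this one (training writes /scratchspace/eagle3, benchmarking reads /scratchspace/export) could also be caught before submission. A sketch under the assumption that the producer path comes from training.output_dir and the consumer path from a --draft_model_dir flag; the function name and argument shape are illustrative, not part of the launcher.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Compare the training output dir against the benchmark task's
# --draft_model_dir; fail loudly if they differ (or the flag is absent).
check_artifact_handoff() {
  local output_dir="$1"; shift
  local draft_dir="" prev="" arg
  for arg in "$@"; do
    if [[ "$prev" == "--draft_model_dir" ]]; then draft_dir="$arg"; fi
    prev="$arg"
  done
  if [[ -z "$draft_dir" ]]; then
    echo "benchmark task has no --draft_model_dir argument" >&2
    return 1
  fi
  if [[ "${draft_dir%/}" != "${output_dir%/}" ]]; then
    echo "training writes to ${output_dir} but benchmark reads ${draft_dir}" >&2
    return 1
  fi
}

check_artifact_handoff /scratchspace/eagle3 --draft_model_dir /scratchspace/eagle3 \
  && echo "handoff consistent"
```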
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1496   +/-   ##
=======================================
- Coverage   77.44%   76.93%   -0.51%
=======================================
  Files         473      473
  Lines       51418    52473    +1055
=======================================
+ Hits        39819    40370    +551
- Misses      11599    12103    +504
Flags with carried forward coverage won't be shown.
☔ View full report in Codecov by Sentry.
🚧 Blocked on input checkpoint — not yet ready to merge.
What's blocking: this PR's YAML references an input checkpoint that is not yet staged on the cluster.
Unblock path: a human needs to stage the checkpoint and re-run the pipeline to produce real cluster-test evidence.
The pre-merge mandate of "real cluster test evidence" is the policy this babysit run is enforcing — and is being codified in pensieve-intern MR !4 so the next synth_support agent run can't ship a YAML without it. (Driven by /babysit-jira on OMNIML-4740.)
Force-pushed 006e8b2 → c3eef71.
Actionable comments posted: 1
♻️ Duplicate comments (2)
tools/launcher/examples/moonshotai/Kimi-K2.5-DFlash/hf_offline_eagle3.yaml (2)
Lines 85, 100: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win: Fix task_2 → task_3 artifact path mismatch.
Line 85 writes to /scratchspace/eagle3, but Line 100 reads from /scratchspace/export. This can break stage 4 input resolution.
Proposed patch
- - --draft_model_dir /scratchspace/export
+ - --draft_model_dir /scratchspace/eagle3
Also applies to: 100-100
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/launcher/examples/moonshotai/Kimi-K2.5-DFlash/hf_offline_eagle3.yaml` at line 85, The artifact path for the training output is inconsistent between tasks: training.output_dir is set to /scratchspace/eagle3 while later steps expect /scratchspace/export, causing task_2 → task_3 input resolution to fail; update the value of training.output_dir (the training.output_dir key in the YAML) to match the downstream artifact path (/scratchspace/export) or change the downstream read path to /scratchspace/eagle3 so both producer (training.output_dir) and consumer use the same artifact directory.
Lines 47-49: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win: Add required MLM_MODEL_CFG and QUANT_CFG to every task env.
These required model-config env vars are missing in task_0, task_1, and task_3, and task_2 has no environment block at all.
Proposed patch
 global_vars:
   hf_model: /hf-local/moonshotai/Kimi-K2.6
+  hf_repo_id: moonshotai/Kimi-K2.6
@@
 task_0:
@@
   environment:
     - HF_LOCAL: /hf-local
+    - MLM_MODEL_CFG: <<global_vars.hf_repo_id>>
+    - QUANT_CFG: NVFP4_DEFAULT_CFG
     - VLLM_STARTUP_TIMEOUT: "1800" # Kimi-K2.6 weight load alone is ~7.7 min
@@
 task_1:
@@
   environment:
     - HF_MODEL_CKPT: <<global_vars.hf_model>>
+    - MLM_MODEL_CFG: <<global_vars.hf_repo_id>>
+    - QUANT_CFG: NVFP4_DEFAULT_CFG
@@
 task_2:
@@
+  environment:
+    - MLM_MODEL_CFG: <<global_vars.hf_repo_id>>
+    - QUANT_CFG: NVFP4_DEFAULT_CFG
   slurm_config:
@@
 task_3:
@@
   environment:
     - HF_MODEL_CKPT: <<global_vars.hf_model>>
+    - MLM_MODEL_CFG: <<global_vars.hf_repo_id>>
+    - QUANT_CFG: NVFP4_DEFAULT_CFG
As per coding guidelines, "Set MLM_MODEL_CFG environment variable to the HuggingFace repo ID when adding a new model config" and "Set QUANT_CFG environment variable (e.g., NVFP4_DEFAULT_CFG, INT8_DEFAULT_CFG) when adding a new model config".
Also applies to: 69-70, 79-95, 109-110
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/launcher/examples/moonshotai/Kimi-K2.5-DFlash/hf_offline_eagle3.yaml` around lines 47 - 49, The tasks in this YAML are missing required environment vars: add MLM_MODEL_CFG and QUANT_CFG to the environment lists for task_0, task_1, and task_3, and create an environment block for task_2 with those same vars; specifically set MLM_MODEL_CFG to the HuggingFace repo ID used by this model and QUANT_CFG to the chosen quantization profile (e.g., NVFP4_DEFAULT_CFG or INT8_DEFAULT_CFG), ensuring the variable names match exactly (MLM_MODEL_CFG, QUANT_CFG) in each task's environment array so downstream code that reads these vars (task_0, task_1, task_2, task_3) will find them.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/launcher/examples/moonshotai/Kimi-K2.5-DFlash/hf_offline_eagle3.yaml`:
- Line 22: The comment that mentions vLLM loading "Kimi-K2.5" is stale relative
to the configured checkpoint key hf_model (currently set to Kimi-K2.6); update
that comment text to reference "Kimi-K2.6" so it matches the hf_model value, and
make the same change for the other occurrences referenced (the comments around
the hf_model key and the comments at the other occurrences noted). Ensure all
comments in the file that mention Kimi-K2.5 are changed to Kimi-K2.6 to keep
configuration and documentation consistent.
---
Duplicate comments:
In `@tools/launcher/examples/moonshotai/Kimi-K2.5-DFlash/hf_offline_eagle3.yaml`:
- Line 85: The artifact path for the training output is inconsistent between
tasks: training.output_dir is set to /scratchspace/eagle3 while later steps
expect /scratchspace/export, causing task_2 → task_3 input resolution to fail;
update the value of training.output_dir (the training.output_dir key in the
YAML) to match the downstream artifact path (/scratchspace/export) or change the
downstream read path to /scratchspace/eagle3 so both producer
(training.output_dir) and consumer use the same artifact directory.
- Around line 47-49: The tasks in this YAML are missing required environment
vars: add MLM_MODEL_CFG and QUANT_CFG to the environment lists for task_0,
task_1, and task_3, and create an environment block for task_2 with those same
vars; specifically set MLM_MODEL_CFG to the HuggingFace repo ID used by this
model and QUANT_CFG to the chosen quantization profile (e.g., NVFP4_DEFAULT_CFG
or INT8_DEFAULT_CFG), ensuring the variable names match exactly (MLM_MODEL_CFG,
QUANT_CFG) in each task's environment array so downstream code that reads these
vars (task_0, task_1, task_2, task_3) will find them.
📒 Files selected for processing (1)
tools/launcher/examples/moonshotai/Kimi-K2.5-DFlash/hf_offline_eagle3.yaml
note:

global_vars:
  hf_model: /hf-local/moonshotai/Kimi-K2.6
Update stale model reference in comments.
The comment says vLLM loads Kimi-K2.5, but current configured checkpoint is Kimi-K2.6 (Line 22). Please align the comment to avoid operator confusion.
Also applies to: 27-28
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/examples/moonshotai/Kimi-K2.5-DFlash/hf_offline_eagle3.yaml`
at line 22, The comment that mentions vLLM loading "Kimi-K2.5" is stale relative
to the configured checkpoint key hf_model (currently set to Kimi-K2.6); update
that comment text to reference "Kimi-K2.6" so it matches the hf_model value, and
make the same change for the other occurrences referenced (the comments around
the hf_model key and the comments at the other occurrences noted). Ensure all
comments in the file that mention Kimi-K2.5 are changed to Kimi-K2.6 to keep
configuration and documentation consistent.
Should be Kimi-K2.5/hf_offline_dflash.yaml instead.
gpus_per_node: 8
container: vllm/vllm-openai:latest

# Step 2: Dump hidden states from target model
Remove this task; it has not been tested.
container: nvcr.io/nvidia/tensorrt-llm/release:1.2.0

# Step 3: Train EAGLE3 draft head (offline, single task)
task_2:
Remove this task; it has not been tested.
gpus_per_node: 8
container: nvcr.io/nvidia/tensorrt-llm/release:1.2.0

# Step 4: Benchmark speculative decoding (VLLM backend)
Remove this task; it has not been tested.
…hotai/Kimi-K2.5
Single-task synth-only YAML at the new canonical path:
tools/launcher/examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml
- Directory uses the *input* model name (Kimi-K2.5, what task_0
queries), not the *output* (Kimi-K2.5-DFlash, what the full
pipeline produces). DFlash is what the pipeline does, not what
it queries.
- Filename uses the algorithm name (hf_offline_dflash.yaml), not
the generic hf_offline_eagle3.yaml.
- Only task_0 is included. Tasks 1-3 (hidden-state dump, offline
training, benchmark) are added by their respective downstream
stages (hidden_state_dump_support, training_support,
run_pipeline) once each is independently cluster-tested. Per
review: don't ship tasks without evidence.
task_0 specifics:
- vLLM (not TRT-LLM — release:1.2.0 doesn't register
KimiK25ForConditionalGeneration as of 2026-05-14).
- Container: vllm/vllm-openai:latest.
- ntasks_per_node: 1 (vLLM is single-process; TP via flag, not MPI).
- vLLM knobs tuned for Kimi-K2 (595 GB bf16 on 8x80 GB):
--enforce-eager (vllm-openai:latest is missing torch/bin/ptxas)
--gpu-memory-utilization 0.95 + --max-model-len 4096 (default
0.9 left -1.1 GiB for KV cache)
VLLM_STARTUP_TIMEOUT=1800 env (Kimi load is ~7.7 min, default
600s in query.sh is too short)
- Input dataset: in-repo synthetic_conversations_1k.jsonl
(packager-shipped to /nemo_run/code/...) rather than a
cluster-mounted Lustre path that may not be staged on cw-dfw.
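Taken together, the knobs above correspond roughly to a serve invocation like the following. This is a sketch reconstructed from the commit message, not the actual launcher wiring: the launcher's query.sh composes the real command, VLLM_STARTUP_TIMEOUT is consumed by that wrapper rather than by vLLM, and the tensor-parallel size of 8 is inferred from the 8-GPU node.

```shell
# Sketch only; flag values taken from the commit message above.
# VLLM_STARTUP_TIMEOUT=1800 is read by query.sh's wait loop, not by vLLM.
vllm serve /hf-local/moonshotai/Kimi-K2.5 \
  --tensor-parallel-size 8 \
  --enforce-eager \
  --gpu-memory-utilization 0.95 \
  --max-model-len 4096
```

The 0.95/4096 pairing reflects the trade-off noted above: at the default 0.9 utilization, the 595 GB bf16 model left no headroom for the KV cache.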
Cluster-test evidence (cw-dfw, Slurm 11782946, experiment
cicd_1778864959, elapsed 1:02:11, exit 0):
$ SLURM_CLUSTER=cw_dfw uv run slurm.py \
--yaml '.../moonshotai/Kimi-K2.5/hf_offline_dflash.yaml' \
pipeline.task_1.skip=true \
pipeline.task_2.skip=true \
pipeline.task_3.skip=true \
--yes --detach
Loading weights took 461.45 seconds
Model loading took 71.44 GiB memory and 465.10 seconds
Map (num_proc=32): 100%|##########| 100/100 [06:42<00:00, 4.02s/example]
Saved 10 shards to /scratchspace/data/train-{1..10}-00010.jsonl
Slurm: COMPLETED 01:02:11
Real assistant response verified end-to-end (Kimi correctly
answers the "bat and ball" CRT problem).
Replaces the prior pensieve-intern agent draft (1b02102) which
had: output-named input path, TRT-LLM container lacking
KimiK25 support, sidecar pollution, and uv.lock churn.
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Force-pushed c3eef71 → b661cef.
Review addressed in the force-push (b661cef).
Also rolling the lessons into the synth_support spec template via pensieve-intern MR !5 (https://gitlab-master.nvidia.com/omniml/integration/pensieve-intern/-/merge_requests/5) — that MR will be updated with the directory/filename/per-stage-scope conventions so the next agent run on a new model doesn't ship untested tasks or use the output name as the directory.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/launcher/examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml`:
- Around line 46-51: The launcher YAML's environment block is missing required
model keys; add MLM_MODEL_CFG set to the HuggingFace repo ID for this model and
QUANT_CFG set to the chosen quantization profile (e.g., NVFP4_DEFAULT_CFG or
INT8_DEFAULT_CFG) inside the existing environment array so the new launcher
config provides both MLM_MODEL_CFG and QUANT_CFG alongside HF_LOCAL and
VLLM_STARTUP_TIMEOUT.
📒 Files selected for processing (1)
tools/launcher/examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml
environment:
  - HF_LOCAL: /hf-local
  - VLLM_STARTUP_TIMEOUT: "1800" # Kimi-K2.6 weight load alone is ~7.7 min
                                 # at 71 GiB/GPU; default 600s in query.sh
                                 # is not enough to also cover KV cache
                                 # profiling + encoder cache init
Add required model environment keys for new launcher configs.
environment is missing MLM_MODEL_CFG and QUANT_CFG, which are required for new model configs in this launcher YAML family.
Suggested patch
environment:
- HF_LOCAL: /hf-local
+ - MLM_MODEL_CFG: moonshotai/Kimi-K2.5
+ - QUANT_CFG: NVFP4_DEFAULT_CFG
- VLLM_STARTUP_TIMEOUT: "1800" # Kimi-K2.6 weight load alone is ~7.7 min
# at 71 GiB/GPU; default 600s in query.sh
# is not enough to also cover KV cache
As per coding guidelines, "tools/launcher/**/*.yaml: Set MLM_MODEL_CFG environment variable to the HuggingFace repo ID when adding a new model config" and "Set QUANT_CFG environment variable (e.g., NVFP4_DEFAULT_CFG, INT8_DEFAULT_CFG) when adding a new model config".
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Before:
environment:
  - HF_LOCAL: /hf-local
  - VLLM_STARTUP_TIMEOUT: "1800" # Kimi-K2.6 weight load alone is ~7.7 min
                                 # at 71 GiB/GPU; default 600s in query.sh
                                 # is not enough to also cover KV cache
                                 # profiling + encoder cache init
After:
environment:
  - HF_LOCAL: /hf-local
  - MLM_MODEL_CFG: moonshotai/Kimi-K2.5
  - QUANT_CFG: NVFP4_DEFAULT_CFG
  - VLLM_STARTUP_TIMEOUT: "1800" # Kimi-K2.6 weight load alone is ~7.7 min
                                 # at 71 GiB/GPU; default 600s in query.sh
                                 # is not enough to also cover KV cache
                                 # profiling + encoder cache init
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml` around
lines 46 - 51, The launcher YAML's environment block is missing required model
keys; add MLM_MODEL_CFG set to the HuggingFace repo ID for this model and
QUANT_CFG set to the chosen quantization profile (e.g., NVFP4_DEFAULT_CFG or
INT8_DEFAULT_CFG) inside the existing environment array so the new launcher
config provides both MLM_MODEL_CFG and QUANT_CFG alongside HF_LOCAL and
VLLM_STARTUP_TIMEOUT.
Force-pushed b661cef → d35357d.
Force-pushed d35357d → b661cef.
Re-cleaned via /babysit-jira on OMNIML-4740. The 2026-05-15 14:26 agent re-fire (pipeline 51447198, running pre-MR-!4 pensieve-intern v0.33.48) force-pushed over the clean cluster-tested commit (b661cef) with a polluted draft: uv.lock churn, a VERIFICATION_COMMENT.txt sidecar, and a regressed YAML pointing at a stale checkpoint path.
Reset to b661cef — the single clean cluster-validated commit (Slurm 11782946 on cw-dfw, COMPLETED 1:02:11, real end-to-end vLLM output verified).
Paired infra fix to prevent re-occurrence: nmm-sandbox!143 — bumps the pinned pensieve-intern version so a pre-MR-!4 build can no longer fire.
This PR is back to ready-for-review state; awaiting CODEOWNERS approval from @kevalmorabia97 + modelopt-setup-codeowners.
Force-pushed b661cef → f0d20f2.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/launcher/common/service_utils.sh`:
- Around line 51-54: The current fixed marker (/tmp/.nmm_extra_dep_installed)
stored in the _marker variable can be stale across runs; change _marker in
tools/launcher/common/service_utils.sh to be job/session-scoped (e.g., include
$TMPDIR if set or append a unique token like $$, $CI_JOB_ID, or a timestamp) or
generate a unique path with mktemp and persist that same path for the check and
creation; update any code that references _marker to use the new session-scoped
marker so the if [[ -f "$_marker" ]] check and subsequent creation are
consistent.
- Around line 55-67: The install block currently may create the marker even if
installs fail and non-zero ranks keep running after the 600s wait; update the
code around the mpi_local_rank check to fail fast: in the rank 0 branch (where
pip install diskcache, git clone and pip install "${_nvrx_dir}" run) check each
command's exit status and only touch/emit "$_marker" after all installs succeed,
otherwise log the error and exit non-zero; in the non-zero branch (the while
loop using _waited and checking "$_marker") detect the timeout (when _waited >=
600) and abort with a non-zero exit and error message instead of silently
continuing so downstream steps don't run without dependencies.
- Around line 59-60: The git clone in service_utils.sh clones
nvidia-resiliency-ext without pinning, so add an environment variable (e.g.,
NVRX_REF) and use it when cloning to pin to a tag/commit; modify the clone
invocation that uses _nvrx_dir to include --branch "$NVRX_REF" (or fall back to
a sensible default like main when NVRX_REF is empty) and document/export
NVRX_REF so CI/ops can set an immutable ref before running the script.
⛔ Files ignored due to path filters (1)
uv.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
BLOCKED_ON_UPSTREAM.txt
tools/launcher/common/service_utils.sh
tools/launcher/examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml
✅ Files skipped from review due to trivial changes (1)
- BLOCKED_ON_UPSTREAM.txt
🚧 Files skipped from review as they are similar to previous changes (1)
- tools/launcher/examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml
local _marker=/tmp/.nmm_extra_dep_installed
if [[ -f "$_marker" ]]; then
  return 0
fi
Scope the marker file to the job/session.
Line 51 uses a fixed /tmp marker name, so stale files from previous runs can cause this run to skip installation incorrectly.
Suggested fix
- local _marker=/tmp/.nmm_extra_dep_installed
+ local _marker="/tmp/.nmm_extra_dep_installed.${SLURM_JOB_ID:-$$}.${USER:-unknown}"
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Before:
local _marker=/tmp/.nmm_extra_dep_installed
if [[ -f "$_marker" ]]; then
  return 0
fi
After:
local _marker="/tmp/.nmm_extra_dep_installed.${SLURM_JOB_ID:-$$}.${USER:-unknown}"
if [[ -f "$_marker" ]]; then
  return 0
fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/common/service_utils.sh` around lines 51 - 54, The current
fixed marker (/tmp/.nmm_extra_dep_installed) stored in the _marker variable can
be stale across runs; change _marker in tools/launcher/common/service_utils.sh
to be job/session-scoped (e.g., include $TMPDIR if set or append a unique token
like $$, $CI_JOB_ID, or a timestamp) or generate a unique path with mktemp and
persist that same path for the check and creation; update any code that
references _marker to use the new session-scoped marker so the if [[ -f
"$_marker" ]] check and subsequent creation are consistent.
| if [[ "$mpi_local_rank" -eq 0 ]]; then | ||
| pip install diskcache | ||
| local _nvrx_dir | ||
| _nvrx_dir="$(mktemp -d)/nvidia-resiliency-ext" | ||
| git clone --depth 1 https://github.com/NVIDIA/nvidia-resiliency-ext "${_nvrx_dir}" \ | ||
| && pip install "${_nvrx_dir}" | ||
| touch "$_marker" | ||
| else | ||
| local _waited=0 | ||
| while [[ ! -f "$_marker" && $_waited -lt 600 ]]; do | ||
| sleep 1 | ||
| _waited=$((_waited + 1)) | ||
| done |
There was a problem hiding this comment.
Fail fast on install errors and wait timeout.
Line 61 can publish success even if dependency install failed, and Lines 64-67 let non-zero ranks continue after 600s timeout with no marker. Both paths can silently proceed with missing deps.
Suggested fix
if [[ "$mpi_local_rank" -eq 0 ]]; then
- pip install diskcache
+ pip install diskcache || return 1
local _nvrx_dir
_nvrx_dir="$(mktemp -d)/nvidia-resiliency-ext"
git clone --depth 1 https://github.com/NVIDIA/nvidia-resiliency-ext "${_nvrx_dir}" \
- && pip install "${_nvrx_dir}"
- touch "$_marker"
+ && pip install "${_nvrx_dir}" \
+ && touch "$_marker" \
+ || return 1
else
local _waited=0
while [[ ! -f "$_marker" && $_waited -lt 600 ]]; do
sleep 1
_waited=$((_waited + 1))
done
+ [[ -f "$_marker" ]] || return 1
fi
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if [[ "$mpi_local_rank" -eq 0 ]]; then | |
| pip install diskcache | |
| local _nvrx_dir | |
| _nvrx_dir="$(mktemp -d)/nvidia-resiliency-ext" | |
| git clone --depth 1 https://github.com/NVIDIA/nvidia-resiliency-ext "${_nvrx_dir}" \ | |
| && pip install "${_nvrx_dir}" | |
| touch "$_marker" | |
| else | |
| local _waited=0 | |
| while [[ ! -f "$_marker" && $_waited -lt 600 ]]; do | |
| sleep 1 | |
| _waited=$((_waited + 1)) | |
| done | |
| if [[ "$mpi_local_rank" -eq 0 ]]; then | |
| pip install diskcache || return 1 | |
| local _nvrx_dir | |
| _nvrx_dir="$(mktemp -d)/nvidia-resiliency-ext" | |
| git clone --depth 1 https://github.com/NVIDIA/nvidia-resiliency-ext "${_nvrx_dir}" \ | |
| && pip install "${_nvrx_dir}" \ | |
| && touch "$_marker" \ | |
| || return 1 | |
| else | |
| local _waited=0 | |
| while [[ ! -f "$_marker" && $_waited -lt 600 ]]; do | |
| sleep 1 | |
| _waited=$((_waited + 1)) | |
| done | |
| [[ -f "$_marker" ]] || return 1 | |
| fi |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/common/service_utils.sh` around lines 55 - 67, The install
block currently may create the marker even if installs fail and non-zero ranks
keep running after the 600s wait; update the code around the mpi_local_rank
check to fail fast: in the rank 0 branch (where pip install diskcache, git clone
and pip install "${_nvrx_dir}" run) check each command's exit status and only
touch/emit "$_marker" after all installs succeed, otherwise log the error and
exit non-zero; in the non-zero branch (the while loop using _waited and checking
"$_marker") detect the timeout (when _waited >= 600) and abort with a non-zero
exit and error message instead of silently continuing so downstream steps don't
run without dependencies.
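Outside of Slurm, the marker-file handshake described above can be exercised locally. The sketch below is illustrative, not code from this repo: the rank argument and a 3-second timeout stand in for the real `mpi_local_rank` and 600 s budget, and a no-op replaces the actual installs.

```shell
#!/usr/bin/env bash
# Sketch of rank-coordinated setup with fail-fast semantics.
# The rank argument and 3 s timeout are stand-ins for mpi_local_rank
# and the 600 s budget in service_utils.sh; the payload here is a no-op.
set -u

coordinated_install() {
  local marker="$1" rank="$2"
  if [[ "$rank" -eq 0 ]]; then
    # Rank 0 does the work; the marker is written only on success,
    # so followers never proceed past a failed install.
    true \
      && touch "$marker" \
      || return 1
  else
    local waited=0
    while [[ ! -f "$marker" && $waited -lt 3 ]]; do
      sleep 1
      waited=$((waited + 1))
    done
    [[ -f "$marker" ]] || return 1   # abort on timeout instead of continuing
  fi
}

tmp="$(mktemp -d)"
coordinated_install "$tmp/.deps_done" 0 && echo "rank 0 done"
coordinated_install "$tmp/.deps_done" 1 && echo "rank 1 released"
```

Replacing `true` with the real `pip install`/`git clone` chain preserves the property the review asks for: the marker exists only if every install step succeeded, and followers fail loudly on timeout.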
```bash
git clone --depth 1 https://github.com/NVIDIA/nvidia-resiliency-ext "${_nvrx_dir}" \
  && pip install "${_nvrx_dir}"
```
🧩 Analysis chain

🏁 Script executed:

```shell
cat -n tools/launcher/common/service_utils.sh | sed -n '50,70p'
```

Repository: NVIDIA/Model-Optimizer
Length of output: 837

🏁 Script executed:

```shell
rg "NVRX_REF|nvidia-resiliency-ext" --type sh
```

Repository: NVIDIA/Model-Optimizer
Length of output: 292

🏁 Script executed:

```shell
rg "nvidia-resiliency-ext" -i
```

Repository: NVIDIA/Model-Optimizer
Length of output: 292

🏁 Script executed:

```shell
rg "resiliency|reproducibility|version" tools/launcher/common/service_utils.sh -B2 -A2
```

Repository: NVIDIA/Model-Optimizer
Length of output: 569

🏁 Script executed:

```shell
grep -r "NVRX" tools/launcher/ 2>/dev/null || echo "No NVRX references found"
```

Repository: NVIDIA/Model-Optimizer
Length of output: 91
Pin nvidia-resiliency-ext to an immutable ref.
Line 59 clones from the default branch at runtime without specifying a Git reference, creating a supply-chain and reproducibility risk for cluster jobs. The branch pointer can move unexpectedly, making builds non-deterministic.
Add an environment variable to allow pinning to a specific tag or commit:
Suggested fix

```diff
+local _nvrx_ref="${NVRX_REF:-<tag-or-commit>}"
-git clone --depth 1 https://github.com/NVIDIA/nvidia-resiliency-ext "${_nvrx_dir}" \
+git clone --depth 1 --branch "${_nvrx_ref}" https://github.com/NVIDIA/nvidia-resiliency-ext "${_nvrx_dir}" \
   && pip install "${_nvrx_dir}"
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/common/service_utils.sh` around lines 59 - 60, The git clone
in service_utils.sh clones nvidia-resiliency-ext without pinning, so add an
environment variable (e.g., NVRX_REF) and use it when cloning to pin to a
tag/commit; modify the clone invocation that uses _nvrx_dir to include --branch
"$NVRX_REF" (or fall back to a sensible default like main when NVRX_REF is
empty) and document/export NVRX_REF so CI/ops can set an immutable ref before
running the script.
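The pinning pattern the prompt describes amounts to Bash default expansion plus `--branch`. A dry sketch, with assumptions labeled: `NVRX_REF` is the proposed variable, the `v0.0.0` fallback is a placeholder rather than a real nvidia-resiliency-ext tag, and `echo` stands in for `git` so nothing touches the network.

```shell
#!/usr/bin/env bash
# Dry sketch of env-pinned cloning: NVRX_REF selects an immutable tag or
# commit, with an explicit placeholder fallback. echo stands in for git,
# so no network access happens here.
set -u

nvrx_clone_cmd() {
  local ref="${NVRX_REF:-v0.0.0}"   # operator-set immutable ref; v0.0.0 is a placeholder
  echo git clone --depth 1 --branch "$ref" \
    https://github.com/NVIDIA/nvidia-resiliency-ext "$1"
}

nvrx_clone_cmd /tmp/nvrx                   # falls back to the placeholder ref
NVRX_REF=abc1234 nvrx_clone_cmd /tmp/nvrx  # operator override pins a specific ref
```

Documenting `NVRX_REF` next to the launcher config would let CI and operators set an immutable ref before the script runs, which is all the finding asks for.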
f0d20f2 to b661cef
Compare
Summary
Adds the EAGLE3 offline pipeline YAML for `moonshotai/Kimi-K2.5-DFlash`, adapted from `Qwen/Qwen3-8B`'s offline YAML. `task_0` cluster-tested green on cw-dfw (Slurm 11782946, experiment `cicd_1778864959`, elapsed 1:02:11).

Driven by /babysit-jira on OMNIML-4740. Replaces the original pensieve-intern synth_support agent draft (1b02102), which had three structural issues:

- `global_vars.hf_model` pointed to the output path (Kimi-K2.5 DFlash) instead of the input checkpoint.
- A container pin (`release:1.2.0`) that doesn't register `KimiK25ForConditionalGeneration` as of 2026-05-14.
- Committed `VERIFICATION_COMMENT.txt` (a runner sidecar, not artifact code) and a `uv.lock` regen (+1037/-81, the source of the prior `mergeable_state: dirty`).

This PR is the cleaned + cluster-validated replacement.
Changes
- `Kimi-K2.5 DFlash` → `Kimi-K2.5-DFlash` (Slurm tar packaging breaks on spaces in `job_name`/path).
- `global_vars.hf_model: /hf-local/moonshotai/Kimi-K2.6` — the canonical Kimi-K2.5 input stand-in staged by the operator on cw-dfw.
- `task_0` migrated from TRT-LLM to vLLM:
  - `script: common/vllm/query.sh`
  - `container: vllm/vllm-openai:latest`
  - `ntasks_per_node: 1` (vLLM is single-process)
  - `--tensor-parallel-size 8`, `--trust-remote-code`
  - `--enforce-eager` (vllm-openai:latest is missing `torch/bin/ptxas` for inductor autotuning)
  - `--gpu-memory-utilization 0.95` + `--max-model-len 4096` (Kimi weights are 595 GB bf16 on 8×80 GB = 93% weight occupancy; default 0.9 leaves -1.1 GiB for KV cache)
  - `VLLM_STARTUP_TIMEOUT=1800` env (Kimi load is ~7.7 min; the default 600 s in `query.sh` is not enough)
- `--data` switched to the in-repo `synthetic_conversations_1k.jsonl` — the canonical `Speculative-Decoding-Prompt-Samples` isn't on cw-dfw; the in-repo dataset is the portable, packager-shipped input for smoke testing.
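The memory arithmetic behind the 0.95 choice can be checked back-of-envelope. This sketch assumes 595 GB of bf16 weights sharded evenly over 8×80 GB GPUs and ignores activation and runtime overhead, so the exact headroom figure will differ from vLLM's own accounting (which is why it lands near, not exactly at, the -1.1 GiB quoted above):

```shell
# Back-of-envelope check of the --gpu-memory-utilization choice.
# Assumes 595 GB of bf16 weights sharded evenly over 8 x 80 GB GPUs;
# activation/runtime overhead is ignored, so real headroom is smaller.
awk 'BEGIN {
  weights = 595.0; gpus = 8; hbm = 80.0
  per_gpu = weights / gpus                               # ~74.4 GB per GPU
  printf "weight occupancy: %.0f%%\n", 100 * weights / (gpus * hbm)
  printf "KV headroom @0.90: %+.2f GB/GPU\n", 0.90 * hbm - per_gpu
  printf "KV headroom @0.95: %+.2f GB/GPU\n", 0.95 * hbm - per_gpu
}'
```

At the default 0.90, the per-GPU weight shard alone exceeds the memory budget (negative headroom), which is why the YAML raises utilization rather than shrinking the model.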
Real assistant response verified end-to-end — Kimi correctly answered the "bat and ball" CRT problem:
Test plan
- `slurm.py --yaml ... --dry-run` exits 0
- `task_0` only (`task_1/2/3.skip=true`) — green, 1000 prompts processed, 10 synth-data shards written
- The `run_pipeline` stage of OMNIML-4735 will exercise task_1..3 end-to-end with the produced synth data

🤖 Generated with Claude Code
Summary by CodeRabbit