Skip to content

[OMNIML-4740] synth_support#1496

Open
ChenhanYu wants to merge 1 commit into
mainfrom
pensieve-intern/OMNIML-4735/synth-support
Open

[OMNIML-4740] synth_support#1496
ChenhanYu wants to merge 1 commit into
mainfrom
pensieve-intern/OMNIML-4735/synth-support

Conversation

@ChenhanYu
Copy link
Copy Markdown
Collaborator

@ChenhanYu ChenhanYu commented May 14, 2026

Summary

Adds the EAGLE3 offline pipeline YAML for moonshotai/Kimi-K2.5-DFlash, adapted from Qwen/Qwen3-8B's offline YAML. task_0 cluster-tested green on cw-dfw (Slurm 11782946, experiment cicd_1778864959, elapsed 1:02:11).

Driven by /babysit-jira on OMNIML-4740. Replaces the original pensieve-intern synth_support agent draft (1b02102) which had three structural issues:

  1. Set global_vars.hf_model to the output path (Kimi-K2.5 DFlash) instead of the input checkpoint.
  2. Used a TRT-LLM container (release:1.2.0) that doesn't register KimiK25ForConditionalGeneration as of 2026-05-14.
  3. Committed VERIFICATION_COMMENT.txt (a runner sidecar, not artifact code) and uv.lock regen (+1037/-81, the source of the prior mergeable_state: dirty).

This PR is the cleaned + cluster-validated replacement.

Changes

  • Directory Kimi-K2.5 DFlashKimi-K2.5-DFlash (Slurm tar packaging breaks on spaces in job_name/path).
  • global_vars.hf_model: /hf-local/moonshotai/Kimi-K2.6 — the canonical Kimi-K2.5 input stand-in staged by the operator on cw-dfw.
  • task_0 migrated from TRT-LLM to vLLM:
    • script: common/vllm/query.sh
    • container: vllm/vllm-openai:latest
    • ntasks_per_node: 1 (vLLM is single-process)
    • --tensor-parallel-size 8, --trust-remote-code
    • --enforce-eager (vllm-openai:latest is missing torch/bin/ptxas for inductor autotuning)
    • --gpu-memory-utilization 0.95 + --max-model-len 4096 (Kimi weights are 595 GB bf16 on 8×80 GB = 93% weight occupancy; default 0.9 leaves -1.1 GiB for KV cache)
    • VLLM_STARTUP_TIMEOUT=1800 env (Kimi load is ~7.7 min, default 600s in query.sh is not enough)
  • --data switched to the in-repo synthetic_conversations_1k.jsonl — the canonical Speculative-Decoding-Prompt-Samples isn't on cw-dfw; the in-repo dataset is the portable, packager-shipped input for smoke testing.
  • Sidecars + uv.lock removed from the diff.

Cluster-test evidence (mandatory per the refined synth_support spec)

$ SLURM_CLUSTER=cw_dfw uv run slurm.py \
    --yaml '.../moonshotai/Kimi-K2.5-DFlash/hf_offline_eagle3.yaml' \
    pipeline.task_1.skip=true pipeline.task_2.skip=true pipeline.task_3.skip=true \
    --yes --detach

Slurm 11782946: COMPLETED, elapsed 1:02:11
Loading weights took 461.45 seconds
Model loading took 71.44 GiB memory and 465.10 seconds
Map (num_proc=32): 100%|██████████| 100/100 [06:42<00:00, 4.02s/example]
Saved 10 shards: /scratchspace/data/train-{1..10}-00010.jsonl

Real assistant response verified end-to-end — Kimi correctly answered the "bat and ball" CRT problem:

The ball costs $0.05 (5 cents). Here's why: If the ball costs $0.05, then the bat costs $1.05 (which is $1.00 more). Together they cost $1.10.

Test plan

  • Dry-run validation: slurm.py --yaml ... --dry-run exits 0
  • Cluster test on cw-dfw: task_0 only (task_1/2/3.skip=true) — green, 1000 prompts processed, 10 synth-data shards written
  • Downstream run_pipeline stage of OMNIML-4735 will exercise task_1..3 end-to-end with the produced synth data

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added an offline speculative decoding pipeline for Kimi-K2.5 to synthesize conversation datasets via a configurable vLLM serving workflow.
  • Bug Fixes / Reliability
    • Improved launcher startup robustness with additional MPI/SLURM rank fallbacks and coordinated one-time dependency installation across local ranks to avoid duplicate installs and startup races.

Review Change Stack

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 14, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 14, 2026

📝 Walkthrough

Walkthrough

Adds a new Kimi-K2.5 vLLM offline data-synthesis pipeline YAML, adjusts MPI/local-rank fallbacks in service utilities, makes dependency installation coordinated across local ranks via a /tmp marker, and adds a blocker note about a missing upstream slurm.py.

Changes

Launcher examples and service utilities

Layer / File(s) Summary
vLLM offline data synthesis configuration
tools/launcher/examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml
Adds Kimi-K2.5_DFlash_offline pipeline with single task_0 that launches vLLM via common/vllm/query.sh, passing model/serve args, dataset input/output, env overrides, and SLURM/container execution settings (1 node, 8 GPUs, vllm/vllm-openai:latest).
MPI/local-rank fallback resolution
tools/launcher/common/service_utils.sh
Updates mpi_rank and mpi_local_rank resolution to fall back through PMIX_*, native ranks, then SLURM_PROCID/SLURM_LOCALID, and finally 0 when unset.
Coordinated extra-dependency installer
tools/launcher/common/service_utils.sh
Reworks util_install_extra_dep to use a /tmp marker file: local rank 0 installs diskcache and clones/installs NVIDIA/nvidia-resiliency-ext, then writes the marker; non-zero local ranks poll up to 600s for the marker before continuing.
Blocked upstream note
BLOCKED_ON_UPSTREAM.txt
Adds a one-line status explaining the dry-run cannot be performed due to missing slurm.py and restricted sandbox access.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title '[OMNIML-4740] synth_support' is vague and does not clearly describe the main change. While it references a ticket, it lacks specifics about adding a Kimi-K2.5-DFlash offline pipeline YAML or the key modifications (vLLM migration, directory rename, configuration updates). Consider a more descriptive title such as '[OMNIML-4740] Add Kimi-K2.5-DFlash offline vLLM pipeline' or similar, which clearly conveys the primary change and feature being added.
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed No Python code is modified in this PR. The security check for Python anti-patterns is not applicable to YAML config, text, and bash script changes.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch pensieve-intern/OMNIML-4735/synth-support

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

@ChenhanYu ChenhanYu closed this May 14, 2026
@ChenhanYu
Copy link
Copy Markdown
Collaborator Author

Closing — OMNIML-4735 Epic wrap-up. Used as the first real-release dispatch on the fully-consolidated nmm-sandbox stack (Phase 3 production validation); not pursuing the YAML merge today.

@github-actions
Copy link
Copy Markdown
Contributor

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-05-14 20:33 UTC

@ChenhanYu ChenhanYu reopened this May 14, 2026
@ChenhanYu
Copy link
Copy Markdown
Collaborator Author

Re-opening — closing this PR earlier was premature cleanup. OMNIML-4735 is now being driven to an actual release; this PR is the synth_support deliverable awaiting real review + merge.

@ChenhanYu ChenhanYu force-pushed the pensieve-intern/OMNIML-4735/synth-support branch 2 times, most recently from 1b02102 to 006e8b2 Compare May 14, 2026 22:48
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 14, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@ChenhanYu ChenhanYu marked this pull request as ready for review May 14, 2026 22:49
@ChenhanYu ChenhanYu enabled auto-merge (squash) May 14, 2026 22:49
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
tools/launcher/examples/moonshotai/Kimi-K2.5 DFlash/hf_offline_eagle3.yaml (1)

104-104: ⚡ Quick win

Pin the VLLM container to an immutable version/digest.

Using vllm/vllm-openai:latest makes benchmark results non-reproducible across runs; pin to a tested tag or digest.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/examples/moonshotai/Kimi-K2.5` DFlash/hf_offline_eagle3.yaml
at line 104, The container image is pinned to the floating tag
"vllm/vllm-openai:latest" under the container key; replace this with an
immutable tag or digest (for example a specific semver tag or a sha256 digest)
that you've tested for benchmarks so results are reproducible, e.g. change the
container value from "vllm/vllm-openai:latest" to a concrete string like
"vllm/vllm-openai:<tested-tag>" or "vllm/vllm-openai@sha256:<digest>".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/launcher/examples/moonshotai/Kimi-K2.5` DFlash/hf_offline_eagle3.yaml:
- Line 73: The training output path (training.output_dir) is set to
/scratchspace/eagle3 but later steps (task_3 / the draft weights reader) expect
artifacts at /scratchspace/export; change training.output_dir to match the
benchmark input path (/scratchspace/export) or update the task_3 draft weights
path to /scratchspace/eagle3 so both produce and consume the same directory
(refer to the training.output_dir key and the task_3 draft weights
reference/train_eagle.sh usage to locate where to update).
- Around line 39-40: The YAML is missing required environment variables for the
new model config: add MLM_MODEL_CFG (set to the HuggingFace repo ID for the
model) and QUANT_CFG (e.g., NVFP4_DEFAULT_CFG or INT8_DEFAULT_CFG) to every
environment block that defines task envs (the existing blocks that contain
HF_LOCAL) and create an environment block for task_2 that includes HF_LOCAL plus
MLM_MODEL_CFG and QUANT_CFG so the launcher YAML conforms to the model-config
contract; ensure the variable names are spelled exactly as MLM_MODEL_CFG and
QUANT_CFG in each task's environment list.

---

Nitpick comments:
In `@tools/launcher/examples/moonshotai/Kimi-K2.5` DFlash/hf_offline_eagle3.yaml:
- Line 104: The container image is pinned to the floating tag
"vllm/vllm-openai:latest" under the container key; replace this with an
immutable tag or digest (for example a specific semver tag or a sha256 digest)
that you've tested for benchmarks so results are reproducible, e.g. change the
container value from "vllm/vllm-openai:latest" to a concrete string like
"vllm/vllm-openai:<tested-tag>" or "vllm/vllm-openai@sha256:<digest>".
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 66ea5a13-65b0-4b34-81d6-f8fa01cc5581

📥 Commits

Reviewing files that changed from the base of the PR and between e27f76f and 006e8b2.

📒 Files selected for processing (1)
  • tools/launcher/examples/moonshotai/Kimi-K2.5 DFlash/hf_offline_eagle3.yaml

Comment on lines +39 to +40
environment:
- HF_LOCAL: /hf-local
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add required MLM_MODEL_CFG and QUANT_CFG env vars for this new model config.

Line 39, Line 57, and Line 97 define task envs, but MLM_MODEL_CFG and QUANT_CFG are missing, and task_2 (Line 67-83) has no environment block at all. This violates the model-config contract for launcher YAMLs.

Proposed patch
   global_vars:
     hf_model: /hf-local/moonshotai/Kimi-K2.5 DFlash
+    hf_repo_id: moonshotai/Kimi-K2.5-DFlash

@@
   task_0:
@@
     environment:
       - HF_LOCAL: /hf-local
+      - MLM_MODEL_CFG: <<global_vars.hf_repo_id>>
+      - QUANT_CFG: NVFP4_DEFAULT_CFG

@@
   task_1:
@@
     environment:
       - HF_MODEL_CKPT: <<global_vars.hf_model>>
+      - MLM_MODEL_CFG: <<global_vars.hf_repo_id>>
+      - QUANT_CFG: NVFP4_DEFAULT_CFG

@@
   task_2:
@@
+    environment:
+      - MLM_MODEL_CFG: <<global_vars.hf_repo_id>>
+      - QUANT_CFG: NVFP4_DEFAULT_CFG

@@
   task_3:
@@
     environment:
       - HF_MODEL_CKPT: <<global_vars.hf_model>>
+      - MLM_MODEL_CFG: <<global_vars.hf_repo_id>>
+      - QUANT_CFG: NVFP4_DEFAULT_CFG

As per coding guidelines, "Set MLM_MODEL_CFG environment variable to the HuggingFace repo ID when adding a new model config" and "Set QUANT_CFG environment variable (e.g., NVFP4_DEFAULT_CFG, INT8_DEFAULT_CFG) when adding a new model config".

Also applies to: 57-58, 67-83, 97-99

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/examples/moonshotai/Kimi-K2.5` DFlash/hf_offline_eagle3.yaml
around lines 39 - 40, The YAML is missing required environment variables for the
new model config: add MLM_MODEL_CFG (set to the HuggingFace repo ID for the
model) and QUANT_CFG (e.g., NVFP4_DEFAULT_CFG or INT8_DEFAULT_CFG) to every
environment block that defines task envs (the existing blocks that contain
HF_LOCAL) and create an environment block for task_2 that includes HF_LOCAL plus
MLM_MODEL_CFG and QUANT_CFG so the launcher YAML conforms to the model-config
contract; ensure the variable names are spelled exactly as MLM_MODEL_CFG and
QUANT_CFG in each task's environment list.

- --config modules/Model-Optimizer/modelopt_recipes/general/speculative_decoding/eagle3.yaml
- model.model_name_or_path=<<global_vars.hf_model>>
- data.offline_data_path=/scratchspace/offline_hidden_states
- training.output_dir=/scratchspace/eagle3
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Align the training output path with benchmark input path.

Line 73 writes artifacts to /scratchspace/eagle3, but Line 88 reads draft weights from /scratchspace/export. Unless train_eagle.sh exports to /scratchspace/export implicitly, task_3 will not find the draft model.

Proposed patch
-      - --draft_model_dir /scratchspace/export
+      - --draft_model_dir /scratchspace/eagle3

Also applies to: 88-88

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/examples/moonshotai/Kimi-K2.5` DFlash/hf_offline_eagle3.yaml
at line 73, The training output path (training.output_dir) is set to
/scratchspace/eagle3 but later steps (task_3 / the draft weights reader) expect
artifacts at /scratchspace/export; change training.output_dir to match the
benchmark input path (/scratchspace/export) or update the task_3 draft weights
path to /scratchspace/eagle3 so both produce and consume the same directory
(refer to the training.output_dir key and the task_3 draft weights
reference/train_eagle.sh usage to locate where to update).

@codecov
Copy link
Copy Markdown

codecov Bot commented May 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.93%. Comparing base (e27f76f) to head (b661cef).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1496      +/-   ##
==========================================
- Coverage   77.44%   76.93%   -0.51%     
==========================================
  Files         473      473              
  Lines       51418    52473    +1055     
==========================================
+ Hits        39819    40370     +551     
- Misses      11599    12103     +504     
Flag Coverage Δ
regression 14.98% <ø> (+0.07%) ⬆️
unit 52.67% <ø> (+0.12%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ChenhanYu ChenhanYu disabled auto-merge May 14, 2026 23:04
@ChenhanYu ChenhanYu marked this pull request as draft May 14, 2026 23:04
@ChenhanYu
Copy link
Copy Markdown
Collaborator Author

🚧 Blocked on input checkpoint — not yet ready to merge.

Status

  • Diff cleaned: removed VERIFICATION_COMMENT.txt (sidecar) and uv.lock (incidental regen). Now just the 104-line YAML.
  • slurm.py --yaml ... --dry-run exits 0 — schema valid, all 4 task blocks parse, all <<global_vars.X>> interpolations resolve.
  • DCO sign-off added.
  • Converted back to draft + auto-merge disabled while cluster-test evidence is pending.

What's blocking

This PR's YAML references /hf-local/moonshotai/Kimi-K2.5 DFlash as the input target model in global_vars.hf_model. That's wrong on two counts:

  1. DFlash is the output of this pipeline, not the input. The input is plain Kimi-K2.5.
  2. The Kimi-K2.5 input checkpoint isn't on any reachable cluster yet:
    • cw-dfw: moonshotai/Kimi-K2-Instruct, moonshotai/Kimi-K2-Thinking only — no K2.5.
    • oci-hsg: has Kimi-K2.5-nolayer (test variant), Kimi-K2.6, Kimi-K2-Thinking — but no plain K2.5.
    • NAS (team-swdl-nas:hf-local/moonshotai/): empty.

Unblock path

A human needs to:

  1. Stage Kimi-K2.5 (the input checkpoint) to cw-dfw's /hf-local/moonshotai/Kimi-K2.5/ (or confirm the agreed-upon input name).
  2. Either edit this PR's global_vars.hf_model to the corrected path, OR re-fire the synth_support agent against the now-present checkpoint via @Pensieve intern_retry OMNIML-4740.
  3. The agent (or human) re-runs slurm.py --yaml ... --dry-run AND launches task_0 on cw-dfw with task_1/2/3.skip=true to produce real cluster-test evidence before this PR is marked ready and merged.

The pre-merge mandate of "real cluster test evidence" is the policy this babysit run is enforcing — and is being codified in pensieve-intern MR !4 so the next synth_support agent run can't ship a YAML without it.

(Driven by /babysit-jira on OMNIML-4740.)

@ChenhanYu ChenhanYu force-pushed the pensieve-intern/OMNIML-4735/synth-support branch from 006e8b2 to c3eef71 Compare May 15, 2026 18:32
@ChenhanYu ChenhanYu marked this pull request as ready for review May 15, 2026 18:33
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
tools/launcher/examples/moonshotai/Kimi-K2.5-DFlash/hf_offline_eagle3.yaml (2)

85-85: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix task_2 → task_3 artifact path mismatch.

Line 85 writes to /scratchspace/eagle3, but Line 100 reads from /scratchspace/export. This can break stage 4 input resolution.

Proposed patch
-      - --draft_model_dir /scratchspace/export
+      - --draft_model_dir /scratchspace/eagle3

Also applies to: 100-100

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/examples/moonshotai/Kimi-K2.5-DFlash/hf_offline_eagle3.yaml`
at line 85, The artifact path for the training output is inconsistent between
tasks: training.output_dir is set to /scratchspace/eagle3 while later steps
expect /scratchspace/export, causing task_2 → task_3 input resolution to fail;
update the value of training.output_dir (the training.output_dir key in the
YAML) to match the downstream artifact path (/scratchspace/export) or change the
downstream read path to /scratchspace/eagle3 so both producer
(training.output_dir) and consumer use the same artifact directory.

47-49: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add required MLM_MODEL_CFG and QUANT_CFG to every task env.

These required model-config env vars are missing in task_0, task_1, and task_3, and task_2 has no environment block at all.

Proposed patch
   global_vars:
     hf_model: /hf-local/moonshotai/Kimi-K2.6
+    hf_repo_id: moonshotai/Kimi-K2.6

@@
   task_0:
@@
     environment:
       - HF_LOCAL: /hf-local
+      - MLM_MODEL_CFG: <<global_vars.hf_repo_id>>
+      - QUANT_CFG: NVFP4_DEFAULT_CFG
       - VLLM_STARTUP_TIMEOUT: "1800"  # Kimi-K2.6 weight load alone is ~7.7 min

@@
   task_1:
@@
     environment:
       - HF_MODEL_CKPT: <<global_vars.hf_model>>
+      - MLM_MODEL_CFG: <<global_vars.hf_repo_id>>
+      - QUANT_CFG: NVFP4_DEFAULT_CFG

@@
   task_2:
@@
+    environment:
+      - MLM_MODEL_CFG: <<global_vars.hf_repo_id>>
+      - QUANT_CFG: NVFP4_DEFAULT_CFG
     slurm_config:

@@
   task_3:
@@
     environment:
       - HF_MODEL_CKPT: <<global_vars.hf_model>>
+      - MLM_MODEL_CFG: <<global_vars.hf_repo_id>>
+      - QUANT_CFG: NVFP4_DEFAULT_CFG

As per coding guidelines, "Set MLM_MODEL_CFG environment variable to the HuggingFace repo ID when adding a new model config" and "Set QUANT_CFG environment variable (e.g., NVFP4_DEFAULT_CFG, INT8_DEFAULT_CFG) when adding a new model config".

Also applies to: 69-70, 79-95, 109-110

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/examples/moonshotai/Kimi-K2.5-DFlash/hf_offline_eagle3.yaml`
around lines 47 - 49, The tasks in this YAML are missing required environment
vars: add MLM_MODEL_CFG and QUANT_CFG to the environment lists for task_0,
task_1, and task_3, and create an environment block for task_2 with those same
vars; specifically set MLM_MODEL_CFG to the HuggingFace repo ID used by this
model and QUANT_CFG to the chosen quantization profile (e.g., NVFP4_DEFAULT_CFG
or INT8_DEFAULT_CFG), ensuring the variable names match exactly (MLM_MODEL_CFG,
QUANT_CFG) in each task's environment array so downstream code that reads these
vars (task_0, task_1, task_2, task_3) will find them.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/launcher/examples/moonshotai/Kimi-K2.5-DFlash/hf_offline_eagle3.yaml`:
- Line 22: The comment that mentions vLLM loading "Kimi-K2.5" is stale relative
to the configured checkpoint key hf_model (currently set to Kimi-K2.6); update
that comment text to reference "Kimi-K2.6" so it matches the hf_model value, and
make the same change for the other occurrences referenced (the comments around
the hf_model key and the comments at the other occurrences noted). Ensure all
comments in the file that mention Kimi-K2.5 are changed to Kimi-K2.6 to keep
configuration and documentation consistent.

---

Duplicate comments:
In `@tools/launcher/examples/moonshotai/Kimi-K2.5-DFlash/hf_offline_eagle3.yaml`:
- Line 85: The artifact path for the training output is inconsistent between
tasks: training.output_dir is set to /scratchspace/eagle3 while later steps
expect /scratchspace/export, causing task_2 → task_3 input resolution to fail;
update the value of training.output_dir (the training.output_dir key in the
YAML) to match the downstream artifact path (/scratchspace/export) or change the
downstream read path to /scratchspace/eagle3 so both producer
(training.output_dir) and consumer use the same artifact directory.
- Around line 47-49: The tasks in this YAML are missing required environment
vars: add MLM_MODEL_CFG and QUANT_CFG to the environment lists for task_0,
task_1, and task_3, and create an environment block for task_2 with those same
vars; specifically set MLM_MODEL_CFG to the HuggingFace repo ID used by this
model and QUANT_CFG to the chosen quantization profile (e.g., NVFP4_DEFAULT_CFG
or INT8_DEFAULT_CFG), ensuring the variable names match exactly (MLM_MODEL_CFG,
QUANT_CFG) in each task's environment array so downstream code that reads these
vars (task_0, task_1, task_2, task_3) will find them.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 39023861-07f0-4fbf-a6bb-da4e88f3160a

📥 Commits

Reviewing files that changed from the base of the PR and between 006e8b2 and c3eef71.

📒 Files selected for processing (1)
  • tools/launcher/examples/moonshotai/Kimi-K2.5-DFlash/hf_offline_eagle3.yaml

note:

global_vars:
hf_model: /hf-local/moonshotai/Kimi-K2.6
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update stale model reference in comments.

The comment says vLLM loads Kimi-K2.5, but current configured checkpoint is Kimi-K2.6 (Line 22). Please align the comment to avoid operator confusion.

Also applies to: 27-28

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/examples/moonshotai/Kimi-K2.5-DFlash/hf_offline_eagle3.yaml`
at line 22, The comment that mentions vLLM loading "Kimi-K2.5" is stale relative
to the configured checkpoint key hf_model (currently set to Kimi-K2.6); update
that comment text to reference "Kimi-K2.6" so it matches the hf_model value, and
make the same change for the other occurrences referenced (the comments around
the hf_model key and the comments at the other occurrences noted). Ensure all
comments in the file that mention Kimi-K2.5 are changed to Kimi-K2.6 to keep
configuration and documentation consistent.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be Kimi-K2.5/hf_offline_dflash.yaml instead.

gpus_per_node: 8
container: vllm/vllm-openai:latest

# Step 2: Dump hidden states from target model
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the task is not been tested

container: nvcr.io/nvidia/tensorrt-llm/release:1.2.0

# Step 3: Train EAGLE3 draft head (offline, single task)
task_2:
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the task is not been tested

gpus_per_node: 8
container: nvcr.io/nvidia/tensorrt-llm/release:1.2.0

# Step 4: Benchmark speculative decoding (VLLM backend)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the task is not been tested

…hotai/Kimi-K2.5

Single-task synth-only YAML at the new canonical path:

  tools/launcher/examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml

- Directory uses the *input* model name (Kimi-K2.5, what task_0
  queries), not the *output* (Kimi-K2.5-DFlash, what the full
  pipeline produces). DFlash is what the pipeline does, not what
  it queries.
- Filename uses the algorithm name (hf_offline_dflash.yaml), not
  the generic hf_offline_eagle3.yaml.
- Only task_0 is included. Tasks 1-3 (hidden-state dump, offline
  training, benchmark) are added by their respective downstream
  stages (hidden_state_dump_support, training_support,
  run_pipeline) once each is independently cluster-tested. Per
  review: don't ship tasks without evidence.

task_0 specifics:

- vLLM (not TRT-LLM — release:1.2.0 doesn't register
  KimiK25ForConditionalGeneration as of 2026-05-14).
- Container: vllm/vllm-openai:latest.
- ntasks_per_node: 1 (vLLM is single-process; TP via flag, not MPI).
- vLLM knobs tuned for Kimi-K2 (595 GB bf16 on 8x80 GB):
    --enforce-eager (vllm-openai:latest is missing torch/bin/ptxas)
    --gpu-memory-utilization 0.95 + --max-model-len 4096 (default
      0.9 left -1.1 GiB for KV cache)
    VLLM_STARTUP_TIMEOUT=1800 env (Kimi load is ~7.7 min, default
      600s in query.sh is too short)
- Input dataset: in-repo synthetic_conversations_1k.jsonl
  (packager-shipped to /nemo_run/code/...) rather than a
  cluster-mounted Lustre path that may not be staged on cw-dfw.

Cluster-test evidence (cw-dfw, Slurm 11782946, experiment
cicd_1778864959, elapsed 1:02:11, exit 0):

  $ SLURM_CLUSTER=cw_dfw uv run slurm.py \
      --yaml '.../moonshotai/Kimi-K2.5/hf_offline_dflash.yaml' \
      pipeline.task_1.skip=true \
      pipeline.task_2.skip=true \
      pipeline.task_3.skip=true \
      --yes --detach

  Loading weights took 461.45 seconds
  Model loading took 71.44 GiB memory and 465.10 seconds
  Map (num_proc=32): 100%|##########| 100/100 [06:42<00:00, 4.02s/example]
  Saved 10 shards to /scratchspace/data/train-{1..10}-00010.jsonl
  Slurm: COMPLETED 01:02:11

Real assistant response verified end-to-end (Kimi correctly
answers the "bat and ball" CRT problem).

Replaces the prior pensieve-intern agent draft (1b02102) which
had: output-named input path, TRT-LLM container lacking
KimiK25 support, sidecar pollution, and uv.lock churn.

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
@ChenhanYu ChenhanYu force-pushed the pensieve-intern/OMNIML-4735/synth-support branch from c3eef71 to b661cef Compare May 15, 2026 20:47
@ChenhanYu
Copy link
Copy Markdown
Collaborator Author

Review addressed in the force-push (b661cef):

  • ✅ Renamed to moonshotai/Kimi-K2.5/hf_offline_dflash.yaml (input-model directory, DFlash-algorithm filename — matches the spec convention).
  • ✅ Removed task_1, task_2, task_3. Only task_0 ships in this PR (the one with cluster-test evidence). Downstream stages will add their own tasks each with their own evidence.
  • job_name updated to Kimi-K2.5_DFlash_offline.
  • Dry-run re-validated against the new path: exit 0.
  • Existing cluster-test evidence still applies — only the file path + scope changed, not task_0 contents.

The CodeRabbit comments about MLM_MODEL_CFG/QUANT_CFG (megatron quantize specific, doesn't apply to vLLM task_0), the task_2/task_3 artifact-path mismatch (n/a — both removed), and pinning vllm/vllm-openai:latest to a digest are noted but not addressed in this push.

Also rolling the lessons into the synth_support spec template via pensieve-intern MR !5 (https://gitlab-master.nvidia.com/omniml/integration/pensieve-intern/-/merge_requests/5) — that MR will be updated with the directory/filename/per-stage-scope conventions so the next agent run on a new model doesn't ship untested tasks or use the output name as the directory.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/launcher/examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml`:
- Around line 46-51: The launcher YAML's environment block is missing required
model keys; add MLM_MODEL_CFG set to the HuggingFace repo ID for this model and
QUANT_CFG set to the chosen quantization profile (e.g., NVFP4_DEFAULT_CFG or
INT8_DEFAULT_CFG) inside the existing environment array so the new launcher
config provides both MLM_MODEL_CFG and QUANT_CFG alongside HF_LOCAL and
VLLM_STARTUP_TIMEOUT.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5475f944-f3fe-4118-96b5-bf9384022088

📥 Commits

Reviewing files that changed from the base of the PR and between c3eef71 and b661cef.

📒 Files selected for processing (1)
  • tools/launcher/examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml

Comment on lines +46 to +51
environment:
- HF_LOCAL: /hf-local
- VLLM_STARTUP_TIMEOUT: "1800" # Kimi-K2.6 weight load alone is ~7.7 min
# at 71 GiB/GPU; default 600s in query.sh
# is not enough to also cover KV cache
# profiling + encoder cache init
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add required model environment keys for new launcher configs.

environment is missing MLM_MODEL_CFG and QUANT_CFG, which are required for new model configs in this launcher YAML family.

Suggested patch
     environment:
       - HF_LOCAL: /hf-local
+      - MLM_MODEL_CFG: moonshotai/Kimi-K2.5
+      - QUANT_CFG: NVFP4_DEFAULT_CFG
       - VLLM_STARTUP_TIMEOUT: "1800"  # Kimi-K2.6 weight load alone is ~7.7 min
                                       # at 71 GiB/GPU; default 600s in query.sh
                                       # is not enough to also cover KV cache

As per coding guidelines, "tools/launcher/**/*.yaml: Set MLM_MODEL_CFG environment variable to the HuggingFace repo ID when adding a new model config" and "Set QUANT_CFG environment variable (e.g., NVFP4_DEFAULT_CFG, INT8_DEFAULT_CFG) when adding a new model config".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
environment:
- HF_LOCAL: /hf-local
- VLLM_STARTUP_TIMEOUT: "1800" # Kimi-K2.6 weight load alone is ~7.7 min
# at 71 GiB/GPU; default 600s in query.sh
# is not enough to also cover KV cache
# profiling + encoder cache init
environment:
- HF_LOCAL: /hf-local
- MLM_MODEL_CFG: moonshotai/Kimi-K2.5
- QUANT_CFG: NVFP4_DEFAULT_CFG
- VLLM_STARTUP_TIMEOUT: "1800" # Kimi-K2.6 weight load alone is ~7.7 min
# at 71 GiB/GPU; default 600s in query.sh
# is not enough to also cover KV cache
# profiling + encoder cache init
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml` around
lines 46 - 51, The launcher YAML's environment block is missing required model
keys; add MLM_MODEL_CFG set to the HuggingFace repo ID for this model and
QUANT_CFG set to the chosen quantization profile (e.g., NVFP4_DEFAULT_CFG or
INT8_DEFAULT_CFG) inside the existing environment array so the new launcher
config provides both MLM_MODEL_CFG and QUANT_CFG alongside HF_LOCAL and
VLLM_STARTUP_TIMEOUT.

@ChenhanYu ChenhanYu force-pushed the pensieve-intern/OMNIML-4735/synth-support branch from b661cef to d35357d Compare May 15, 2026 21:26
@ChenhanYu ChenhanYu requested a review from a team as a code owner May 15, 2026 21:26
@ChenhanYu ChenhanYu requested a review from kevalmorabia97 May 15, 2026 21:26
@ChenhanYu ChenhanYu force-pushed the pensieve-intern/OMNIML-4735/synth-support branch from d35357d to b661cef Compare May 16, 2026 04:16
@ChenhanYu
Copy link
Copy Markdown
Collaborator Author

Re-cleaned via /babysit-jira on OMNIML-4740.

The 2026-05-15 14:26 agent re-fire (pipeline 51447198, running pre-MR-!4 pensieve-intern v0.33.48) force-pushed over the clean cluster-tested commit (b661cef) with a polluted draft — uv.lock churn + VERIFICATION_COMMENT.txt sidecar + a regressed YAML pointing at /hf-local/modelopt/Speculative-Decoding-Prompt-Samples which isn'''t staged on cw-dfw. The agent reported SSH auth failure on the cluster test step and fell back to the sidecar-only verify-task path.

Reset to b661cef — the single clean cluster-validated commit (Slurm 11782946 on cw-dfw, COMPLETED 1:02:11, real end-to-end vLLM output verified).

Paired infra fix to prevent re-occurrence: nmm-sandbox!143 — bumps PENSIEVE_INTERN_REF v0.33.48 → v0.33.57 (which ships the auto-strip uv.lock/sidecar pollution logic from pensieve-intern !4) and adds SSH key + known_hosts to the intern-agent CI before_script (so the next intern_retry can actually run the mandated cw-dfw cluster test instead of falling back to a sidecar).

This PR is back to ready-for-review state; awaiting CODEOWNERS approval from @kevalmorabia97 + modelopt-setup-codeowners.

@ChenhanYu ChenhanYu force-pushed the pensieve-intern/OMNIML-4735/synth-support branch from b661cef to f0d20f2 Compare May 16, 2026 05:08
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/launcher/common/service_utils.sh`:
- Around line 51-54: The current fixed marker (/tmp/.nmm_extra_dep_installed)
stored in the _marker variable can be stale across runs; change _marker in
tools/launcher/common/service_utils.sh to be job/session-scoped (e.g., include
$TMPDIR if set or append a unique token like $$, $CI_JOB_ID, or a timestamp) or
generate a unique path with mktemp and persist that same path for the check and
creation; update any code that references _marker to use the new session-scoped
marker so the if [[ -f "$_marker" ]] check and subsequent creation are
consistent.
- Around line 55-67: The install block currently may create the marker even if
installs fail and non-zero ranks keep running after the 600s wait; update the
code around the mpi_local_rank check to fail fast: in the rank 0 branch (where
pip install diskcache, git clone and pip install "${_nvrx_dir}" run) check each
command's exit status and only touch/emit "$_marker" after all installs succeed,
otherwise log the error and exit non-zero; in the non-zero branch (the while
loop using _waited and checking "$_marker") detect the timeout (when _waited >=
600) and abort with a non-zero exit and error message instead of silently
continuing so downstream steps don't run without dependencies.
- Around line 59-60: The git clone in service_utils.sh clones
nvidia-resiliency-ext without pinning, so add an environment variable (e.g.,
NVRX_REF) and use it when cloning to pin to a tag/commit; modify the clone
invocation that uses _nvrx_dir to include --branch "$NVRX_REF" (or fall back to
a sensible default like main when NVRX_REF is empty) and document/export
NVRX_REF so CI/ops can set an immutable ref before running the script.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1759c510-8bfc-419f-8829-c46dd02aeb5e

📥 Commits

Reviewing files that changed from the base of the PR and between b661cef and f0d20f2.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • BLOCKED_ON_UPSTREAM.txt
  • tools/launcher/common/service_utils.sh
  • tools/launcher/examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml
✅ Files skipped from review due to trivial changes (1)
  • BLOCKED_ON_UPSTREAM.txt
🚧 Files skipped from review as they are similar to previous changes (1)
  • tools/launcher/examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml

Comment thread tools/launcher/common/service_utils.sh Outdated
Comment on lines +51 to +54
local _marker=/tmp/.nmm_extra_dep_installed
if [[ -f "$_marker" ]]; then
return 0
fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Scope the marker file to the job/session.

Line 51 uses a fixed /tmp marker name, so stale files from previous runs can cause this run to skip installation incorrectly.

Suggested fix
-    local _marker=/tmp/.nmm_extra_dep_installed
+    local _marker="/tmp/.nmm_extra_dep_installed.${SLURM_JOB_ID:-$$}.${USER:-unknown}"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
local _marker=/tmp/.nmm_extra_dep_installed
if [[ -f "$_marker" ]]; then
return 0
fi
local _marker="/tmp/.nmm_extra_dep_installed.${SLURM_JOB_ID:-$$}.${USER:-unknown}"
if [[ -f "$_marker" ]]; then
return 0
fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/common/service_utils.sh` around lines 51 - 54, The current
fixed marker (/tmp/.nmm_extra_dep_installed) stored in the _marker variable can
be stale across runs; change _marker in tools/launcher/common/service_utils.sh
to be job/session-scoped (e.g., include $TMPDIR if set or append a unique token
like $$, $CI_JOB_ID, or a timestamp) or generate a unique path with mktemp and
persist that same path for the check and creation; update any code that
references _marker to use the new session-scoped marker so the if [[ -f
"$_marker" ]] check and subsequent creation are consistent.

Comment thread tools/launcher/common/service_utils.sh Outdated
Comment on lines +55 to +67
if [[ "$mpi_local_rank" -eq 0 ]]; then
pip install diskcache
local _nvrx_dir
_nvrx_dir="$(mktemp -d)/nvidia-resiliency-ext"
git clone --depth 1 https://github.com/NVIDIA/nvidia-resiliency-ext "${_nvrx_dir}" \
&& pip install "${_nvrx_dir}"
touch "$_marker"
else
local _waited=0
while [[ ! -f "$_marker" && $_waited -lt 600 ]]; do
sleep 1
_waited=$((_waited + 1))
done
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast on install errors and wait timeout.

Line 61 can publish success even if dependency install failed, and Lines 64-67 let non-zero ranks continue after 600s timeout with no marker. Both paths can silently proceed with missing deps.

Suggested fix
     if [[ "$mpi_local_rank" -eq 0 ]]; then
-        pip install diskcache
+        pip install diskcache || return 1
         local _nvrx_dir
         _nvrx_dir="$(mktemp -d)/nvidia-resiliency-ext"
         git clone --depth 1 https://github.com/NVIDIA/nvidia-resiliency-ext "${_nvrx_dir}" \
-            && pip install "${_nvrx_dir}"
-        touch "$_marker"
+            && pip install "${_nvrx_dir}" \
+            && touch "$_marker" \
+            || return 1
     else
         local _waited=0
         while [[ ! -f "$_marker" && $_waited -lt 600 ]]; do
             sleep 1
             _waited=$((_waited + 1))
         done
+        [[ -f "$_marker" ]] || return 1
     fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if [[ "$mpi_local_rank" -eq 0 ]]; then
pip install diskcache
local _nvrx_dir
_nvrx_dir="$(mktemp -d)/nvidia-resiliency-ext"
git clone --depth 1 https://github.com/NVIDIA/nvidia-resiliency-ext "${_nvrx_dir}" \
&& pip install "${_nvrx_dir}"
touch "$_marker"
else
local _waited=0
while [[ ! -f "$_marker" && $_waited -lt 600 ]]; do
sleep 1
_waited=$((_waited + 1))
done
if [[ "$mpi_local_rank" -eq 0 ]]; then
pip install diskcache || return 1
local _nvrx_dir
_nvrx_dir="$(mktemp -d)/nvidia-resiliency-ext"
git clone --depth 1 https://github.com/NVIDIA/nvidia-resiliency-ext "${_nvrx_dir}" \
&& pip install "${_nvrx_dir}" \
&& touch "$_marker" \
|| return 1
else
local _waited=0
while [[ ! -f "$_marker" && $_waited -lt 600 ]]; do
sleep 1
_waited=$((_waited + 1))
done
[[ -f "$_marker" ]] || return 1
fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/common/service_utils.sh` around lines 55 - 67, The install
block currently may create the marker even if installs fail and non-zero ranks
keep running after the 600s wait; update the code around the mpi_local_rank
check to fail fast: in the rank 0 branch (where pip install diskcache, git clone
and pip install "${_nvrx_dir}" run) check each command's exit status and only
touch/emit "$_marker" after all installs succeed, otherwise log the error and
exit non-zero; in the non-zero branch (the while loop using _waited and checking
"$_marker") detect the timeout (when _waited >= 600) and abort with a non-zero
exit and error message instead of silently continuing so downstream steps don't
run without dependencies.

Comment thread tools/launcher/common/service_utils.sh Outdated
Comment on lines +59 to +60
git clone --depth 1 https://github.com/NVIDIA/nvidia-resiliency-ext "${_nvrx_dir}" \
&& pip install "${_nvrx_dir}"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

cat -n tools/launcher/common/service_utils.sh | sed -n '50,70p'

Repository: NVIDIA/Model-Optimizer

Length of output: 837


🏁 Script executed:

rg "NVRX_REF|nvidia-resiliency-ext" --type sh

Repository: NVIDIA/Model-Optimizer

Length of output: 292


🏁 Script executed:

rg "nvidia-resiliency-ext" -i

Repository: NVIDIA/Model-Optimizer

Length of output: 292


🏁 Script executed:

rg "resiliency|reproducibility|version" tools/launcher/common/service_utils.sh -B2 -A2

Repository: NVIDIA/Model-Optimizer

Length of output: 569


🏁 Script executed:

grep -r "NVRX" tools/launcher/ 2>/dev/null || echo "No NVRX references found"

Repository: NVIDIA/Model-Optimizer

Length of output: 91


Pin nvidia-resiliency-ext to an immutable ref.

Line 59 clones from the default branch at runtime without specifying a Git reference, creating a supply-chain and reproducibility risk for cluster jobs. The branch pointer can move unexpectedly, making builds non-deterministic.

Add an environment variable to allow pinning to a specific tag or commit:

Suggested fix
+        local _nvrx_ref="${NVRX_REF:-<tag-or-commit>}"
-        git clone --depth 1 https://github.com/NVIDIA/nvidia-resiliency-ext "${_nvrx_dir}" \
+        git clone --depth 1 --branch "${_nvrx_ref}" https://github.com/NVIDIA/nvidia-resiliency-ext "${_nvrx_dir}" \
             && pip install "${_nvrx_dir}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/common/service_utils.sh` around lines 59 - 60, The git clone
in service_utils.sh clones nvidia-resiliency-ext without pinning, so add an
environment variable (e.g., NVRX_REF) and use it when cloning to pin to a
tag/commit; modify the clone invocation that uses _nvrx_dir to include --branch
"$NVRX_REF" (or fall back to a sensible default like main when NVRX_REF is
empty) and document/export NVRX_REF so CI/ops can set an immutable ref before
running the script.

@ChenhanYu ChenhanYu force-pushed the pensieve-intern/OMNIML-4735/synth-support branch from f0d20f2 to b661cef Compare May 16, 2026 05:17
@kevalmorabia97 kevalmorabia97 removed the request for review from a team May 16, 2026 11:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant