
WIP: Agent Skills Updates From Live Trials #1493

Open

chadvoegele wants to merge 15 commits into main from cvoegele/agent_evals

Conversation


@chadvoegele chadvoegele commented May 14, 2026

What does this PR do?

Type of change: bug fix

Usage

Ask Claude Code:

Quantize `mistralai/Mistral-Medium-3.5-128B` to NVFP4 using the ModelOpt NVFP4 experts-only recipe.

Run on $cluster

Evaluate the resulting quantized checkpoint on:
- GPQA Diamond AA v3
- SciCode AA v2

Complete the quantization and evaluation workflow end to end. Prompt when you require user input, otherwise keep going.

Testing

I'm running the full loop with the above prompt, and iterating on skills to resolve undesired agent behavior.

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅
  • Did you write any new necessary tests?: N/A
  • Did you update Changelog?: TODO
  • Did you get Claude approval on this PR?: TODO

Additional Information

See trials log for details.

Summary by CodeRabbit

  • New Features

    • Added Slurm QoS support for submitted jobs.
    • Replaced scheduled checks with a continuous durable job monitor.
    • Added multiple new evaluation task recipes (AIME 2025, GPQA, IFBench, LiveCodeBench, SciCode, MMLU-Pro, AA variants).
  • Bug Fixes

    • Ensure output directory is created before writing quantization summaries.
  • Documentation

    • Expanded evaluation workflow (dry-run/canary steps, NEL timeout/resume), quantization-detection rules, mandatory baseline-comparison preflight, SLURM registry auth flow, PTQ preflight, and strict checkpoint-validation gate.
  • Chore

    • Ignore local CLAUDE note file.


@chadvoegele chadvoegele requested a review from a team as a code owner May 14, 2026 16:43
@chadvoegele chadvoegele requested a review from jingyu-ml May 14, 2026 16:43
@chadvoegele chadvoegele changed the title Agent Skills Updates From Live Trials WIP: Agent Skills Updates From Live Trials May 14, 2026

coderabbitai Bot commented May 14, 2026

📝 Walkthrough


Refactors evaluation and PTQ guidance with stricter quantization detection and baseline gating; replaces cron monitoring with a durable poller; requires explicit cluster selection for multi-cluster configs; adds SLURM QoS support plus numerous evaluation recipe doc additions and removals; and includes small utility fixes.

Changes

Evaluation / Quantization & Task Flow

| Layer / File(s) | Summary |
|---|---|
| Quantization auto-detection & test expectations<br>`.claude/skills/evaluation/SKILL.md`, `.claude/skills/evaluation/tests/evals.json` | Check `config.json` first for ModelOpt `quantization_config.quant_method`, fall back to `hf_quant_config.json`, extract `quant_algo` from supported keys, treat absent config as unquantized, and avoid adding vLLM `--quantization` unless required; tests updated (detection order sketched below the table). |
| Baseline preflight and comparability checks<br>`.claude/skills/evaluation/SKILL.md`, `.claude/skills/launching-evals/references/analyze-results.md` | Add baseline-identification/creation preflight for quantized runs and an explicit comparability checklist (prompt/chat-template parity, generation settings, reasoning-trace handling, sample/metric alignment) separate from final-run validity checks. |
| Registry auth decision flow & task confirmation<br>`.claude/skills/evaluation/SKILL.md` | Step 7.5: registry auth decision path for private images (enroot credential checks, add credentials or fallback image, single retry). Task confirmation prefers `recipes/tasks/<name>.md` fragments and preserves repeat counts. |
| NEL timeout/resume & gated run sequence<br>`.claude/skills/evaluation/SKILL.md` | Document NEL timeout/resume behavior for SLURM dependency chains and enforce gated sequence: dry-run → limited-samples canary → full run, with log and sample validation between stages and Steps 9–10 for run validation and comparability. |
| Evaluation recipe docs: add/replace Markdown recipes<br>`.claude/skills/evaluation/recipes/tasks/*.md`, `*.yaml`, `recipes/examples/example_eval.yaml` | Add many `.md` task recipes (GPQA, LiveCodeBench, SciCode, AIME2025, IFBench, MMLU-Pro, AA-LCR, HLE-AA, MMLU-Pro AA v3) with YAML/snippets and score extraction helpers; remove corresponding YAML task configs for several tasks; update example_eval comment to “Canary (2 samples)”. |
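The quantization auto-detection order in the first row is the most mechanical of these changes. A minimal sketch of that order follows, with file and key names taken from the summary above; the authoritative logic lives in `.claude/skills/evaluation/SKILL.md` and may differ in detail.

```python
# Hedged sketch of the detection order described above: config.json first,
# then hf_quant_config.json, and an absent config means "unquantized".
# Key names follow the walkthrough summary, not the exact SKILL.md wording.
import json
from pathlib import Path


def detect_quantization(ckpt_dir: str):
    ckpt = Path(ckpt_dir)

    config = ckpt / "config.json"
    if config.is_file():
        cfg = json.loads(config.read_text())
        quant_cfg = cfg.get("quantization_config") or {}
        if quant_cfg.get("quant_method"):            # e.g. a ModelOpt-produced config
            return quant_cfg.get("quant_algo") or quant_cfg["quant_method"]

    hf_quant = ckpt / "hf_quant_config.json"
    if hf_quant.is_file():
        cfg = json.loads(hf_quant.read_text())
        # quant_algo may sit at the top level or under a nested "quantization" block.
        return cfg.get("quant_algo") or cfg.get("quantization", {}).get("quant_algo")

    return None  # no quantization config present -> treat the checkpoint as unquantized
```

A `None` result is the signal not to add a vLLM `--quantization` flag unless the serving stack actually requires one.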

PTQ & Checkpoint Validation

| Layer / File(s) | Summary |
|---|---|
| PTQ preflight coverage check<br>`.claude/skills/ptq/SKILL.md` | Preflight inspects recipe include/exclude patterns against model structure to estimate quantized-layer coverage; require stop or confirmation when coverage is zero/very low or when input is already quantized and would lose coverage. |
| Post-quantization validation gate & scripts<br>`.claude/skills/ptq/references/checkpoint-validation.md`, `.claude/skills/ptq/tests.json` | Require size/ratio check (source vs output safetensors), layer-coverage & precision counts (detect unexpected unquantized layers/mismatches), and non-weight metadata consistency; provide size-check snippet and layer-coverage script (illustrated below); add tests enforcing gating for FP8→NVFP4 scenario. |
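To make the size/ratio gate concrete, here is a rough sketch of that kind of check. The authoritative snippet and thresholds are in `checkpoint-validation.md`; the `expected` ratio below is a caller-supplied assumption (roughly 0.25–0.3 for FP16→NVFP4 weights, for example), not a value taken from the PR.

```python
# Rough sketch only: compare total safetensors bytes between the source and the
# quantized output and flag ratios that suggest nothing was actually quantized.
from pathlib import Path


def safetensors_bytes(model_dir: str) -> int:
    return sum(p.stat().st_size for p in Path(model_dir).glob("*.safetensors"))


def check_size_ratio(source_dir: str, output_dir: str, expected: float, tol: float = 0.15) -> bool:
    src = safetensors_bytes(source_dir)
    out = safetensors_bytes(output_dir)
    ratio = out / src
    print(f"source={src / 1e9:.1f} GB  output={out / 1e9:.1f} GB  ratio={ratio:.2f}")
    # A ratio close to 1.0 for a weight-quantizing recipe is the "unexpected
    # unquantized layers" smell that the validation gate is meant to catch.
    return abs(ratio - expected) <= tol
```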

Monitoring, Environment, and Launcher

| Layer / File(s) | Summary |
|---|---|
| Durable monitor loop & SLURM status detection<br>`.claude/skills/monitor/SKILL.md` | Replace recurring cron with durable monitor polling `.claude/active_jobs.json`, emit state-change events, update `last_status`, remove terminal jobs, and exit when registry empty; expand terminal-state list. |
| SLURM raw-job check switch to sacct<br>`.claude/skills/monitor/SKILL.md` | Switch raw SLURM job termination/status detection from squeue absence checks to `sacct` queries with filtering to ignore extern/batch and step rows (a combined monitor/sacct sketch follows the table). |
| Cluster selection prompt<br>`.claude/skills/common/environment-setup.md` | When a cluster config defines multiple clusters and no explicit target is provided, require asking the user which cluster to use before calling `remote_load_cluster`. |
| SLURM QoS support in launcher<br>`tools/launcher/slurm_config.py`, `tools/launcher/core.py` | Add `qos: str = None` to `SlurmConfig`, accept `qos` from `SLURM_QOS` in `slurm_factory`, and pass `qos=slurm_config.qos` into the `run.SlurmExecutor` construction. |
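The durable-monitor and sacct rows describe one polling behavior; a sketch of how they could fit together is below. The registry layout, field names, and terminal-state list here are assumptions for illustration; the documented behavior lives in `.claude/skills/monitor/SKILL.md`.

```python
# Illustrative poller only; registry layout, field names, and the terminal-state
# list are assumptions, not the SKILL.md wording.
import json
import subprocess
import time
from pathlib import Path

REGISTRY = Path(".claude/active_jobs.json")
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT", "OUT_OF_MEMORY", "NODE_FAIL"}


def slurm_state(job_id: str) -> str:
    """Query sacct and ignore .batch/.extern/step rows, as described above."""
    out = subprocess.run(
        ["sacct", "-j", job_id, "--format=JobID,State", "--noheader", "--parsable2"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        if not line.strip():
            continue
        jid, state = line.split("|")[:2]
        if "." in jid:                    # skip job-step rows such as 123.batch / 123.extern
            continue
        return state.split()[0]           # "CANCELLED by <uid>" -> "CANCELLED"
    return "UNKNOWN"


def poll_once() -> bool:
    jobs = json.loads(REGISTRY.read_text()) if REGISTRY.exists() else {}
    for job_id, entry in list(jobs.items()):
        state = slurm_state(job_id)
        if state != entry.get("last_status"):
            print(f"{job_id}: {entry.get('last_status')} -> {state}")   # state-change event
        entry["last_status"] = state
        if state in TERMINAL:
            del jobs[job_id]              # terminal jobs drop out of the registry
    REGISTRY.write_text(json.dumps(jobs, indent=2))
    return bool(jobs)                     # keep polling while anything is still tracked


if __name__ == "__main__":
    while poll_once():
        time.sleep(60)
```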

Utilities & Misc

| Layer / File(s) | Summary |
|---|---|
| Quant summary file creation fix<br>`modelopt/torch/quantization/model_quant.py` | `print_quant_summary()` now creates `output_dir` (`os.makedirs` with `exist_ok`) before writing `.quant_summary.txt` when `output_dir` is provided (sketched below). |
| Ignore local debug file<br>`.gitignore` | Add `CLAUDE.local.md` to `.gitignore`. |
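The `model_quant.py` change is small; its shape (not the verbatim ModelOpt code) is essentially:

```python
# Shape of the fix only: create output_dir before writing the summary file so
# the write no longer fails when the directory does not exist yet.
import os


def write_quant_summary(summary_text: str, output_dir=None):
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)   # the newly added guard
        with open(os.path.join(output_dir, ".quant_summary.txt"), "w") as f:
            f.write(summary_text)
```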

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ❓ Inconclusive | The title 'WIP: Agent Skills Updates From Live Trials' is vague and non-descriptive, using generic terms like 'Updates' and 'Live Trials' without clarifying the specific changes or main focus. | Replace with a more specific title that describes the primary change, such as 'Update evaluation skill documentation and task recipes' or 'Refactor evaluation workflow and add AIME/GPQA/SciCode task recipes'. |

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | Safe Python changes in 3 files: model_quant.py (os.makedirs), core.py (qos param), slurm_config.py (qos field). No torch.load, numpy.load, eval/exec, trust_remote_code, or new # nosec issues. |





Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude/skills/monitor/SKILL.md:
- Around line 54-59: The documentation/logic currently enforces "report only
state changes" universally; update it so that user-initiated checks (e.g., when
the user explicitly asks "check status") return the full current status for each
job rather than only deltas—leave monitor-driven checks to still compare against
`last_status` in `.claude/active_jobs.json` and report only changes. Adjust the
wording and any associated pseudocode/implementation notes to branch on the
trigger type ("monitor output" vs "user-initiated") and on user-initiated flows
ensure you read the registry, check each job, return current state for each job,
and then update `last_status` accordingly. Ensure references to `last_status`
and `.claude/active_jobs.json` remain consistent so maintainers can find and
implement the conditional behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 27a4dd3f-7246-45c7-8699-ec80a318c50a

📥 Commits

Reviewing files that changed from the base of the PR and between 229ba61 and ddf58b5.

📒 Files selected for processing (11)
  • .claude/skills/common/environment-setup.md
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/evaluation/tests/evals.json
  • .claude/skills/launching-evals/references/analyze-results.md
  • .claude/skills/monitor/SKILL.md
  • .claude/skills/ptq/SKILL.md
  • .claude/skills/ptq/references/checkpoint-validation.md
  • .gitignore
  • modelopt/torch/quantization/model_quant.py
  • tools/launcher/core.py
  • tools/launcher/slurm_config.py

Comment on lines +54 to 59
Whether triggered by monitor output or by the user asking "check status":

1. **Read the registry** from `.claude/active_jobs.json`
2. **Check each job** using the appropriate method (see below)
3. **Report only state changes** — compare against `last_status` in registry
4. **Update `last_status`** in the registry

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

User-initiated checks should return current status, not only deltas.

Right now this section applies “report only state changes” even when the user explicitly asks for status. That can yield no meaningful response if nothing changed since last poll.

✏️ Suggested update
-Whether triggered by monitor output or by the user asking "check status":
+Whether triggered by monitor output or by the user asking "check status":

 1. **Read the registry** from `.claude/active_jobs.json`
 2. **Check each job** using the appropriate method (see below)
-3. **Report only state changes** — compare against `last_status` in registry
-4. **Update `last_status`** in the registry
+3. **If user asked for status, report current status for all tracked jobs** (even if unchanged)
+4. **If triggered by monitor output, report only state changes** — compare against `last_status` in registry
+5. **Update `last_status`** in the registry

@github-actions

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1493/

Built to branch gh-pages at 2026-05-14 16:47 UTC.
Preview will be ready when the GitHub Pages deployment is complete.


codecov Bot commented May 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.03%. Comparing base (229ba61) to head (29bbb20).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1493      +/-   ##
==========================================
- Coverage   72.75%   67.03%   -5.72%     
==========================================
  Files         473      473              
  Lines       51418    51419       +1     
==========================================
- Hits        37407    34470    -2937     
- Misses      14011    16949    +2938     
| Flag | Coverage | Δ |
|---|---|---|
| examples | 41.61% <100.00%> | (+2.63%) ⬆️ |
| gpu | 27.16% <0.00%> | (-24.49%) ⬇️ |
| regression | 14.98% <0.00%> | (+0.07%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.claude/skills/evaluation/SKILL.md (1)

203-212: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix task snippet schema mismatch (tasks vs evaluation.tasks).

The Step 5 example contradicts earlier instructions to edit evaluation.tasks. Keeping tasks: here can make generated configs invalid or ignored.

Suggested fix
-  tasks:
-    - name: <task>
-      nemo_evaluator_config:
-        config:
-          params:
-            temperature: <value>
-            max_new_tokens: <value>
-            ...
+  evaluation:
+    tasks:
+      - name: <task>
+        nemo_evaluator_config:
+          config:
+            params:
+              temperature: <value>
+              max_new_tokens: <value>
+              ...
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude/skills/evaluation/recipes/tasks/gpqa.md:
- Around line 73-75: The extractor extract_gpqa_score currently can raise
IndexError when called without args and uses raw open(...) with yaml.safe_load;
add basic argument validation (ensure path is provided and repeats, if given, is
an int) and use a safe file context (with open(path, "r") as f) and
yaml.safe_load(f) while catching FileNotFoundError and yaml.YAMLError and
re-raising a clear ValueError; also validate that the expected keys exist in the
loaded dict (results -> groups -> gpqa -> metrics) and raise ValueError if
missing. Apply the same validation and safe-loading pattern to the similar
extractor function around lines 94-97 to ensure consistent error handling.

In @.claude/skills/evaluation/recipes/tasks/scicode.md:
- Around line 105-108: The extract_score function currently assumes a valid path
and opens the YAML without a context manager; fix it by validating the path
argument (raise ValueError or return a clear error if path is falsy), check the
file exists (catch FileNotFoundError), and read the YAML using a context manager
(with open(path) as f: data = yaml.safe_load(f)); then safely access
TASKS[group] and data["results"]["groups"][group]["metrics"] (use .get or catch
KeyError to provide a clearer error). Apply the same changes to the other
identical snippet that reads the YAML and accesses metrics.

---

Outside diff comments:
In @.claude/skills/evaluation/SKILL.md:
- Around line 203-212: The YAML example uses a top-level "tasks:" key which
conflicts with the expected "evaluation.tasks" namespace; update the snippet so
the tasks list is nested under "evaluation.tasks" (e.g., replace "tasks:" with
"evaluation.tasks:" and keep the existing task entries like "name" and
"nemo_evaluator_config" intact), and verify any references to "tasks" in the
surrounding text or examples are corrected to "evaluation.tasks" to keep schema
consistent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1c2365f0-8719-4c16-9dd8-7640d57849f7

📥 Commits

Reviewing files that changed from the base of the PR and between ddf58b5 and 0633922.

📒 Files selected for processing (18)
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/evaluation/recipes/examples/example_eval.yaml
  • .claude/skills/evaluation/recipes/tasks/aime2025.md
  • .claude/skills/evaluation/recipes/tasks/aime2025.yaml
  • .claude/skills/evaluation/recipes/tasks/gpqa.md
  • .claude/skills/evaluation/recipes/tasks/gpqa.yaml
  • .claude/skills/evaluation/recipes/tasks/ifbench.md
  • .claude/skills/evaluation/recipes/tasks/ifbench.yaml
  • .claude/skills/evaluation/recipes/tasks/livecodebench.md
  • .claude/skills/evaluation/recipes/tasks/livecodebench.yaml
  • .claude/skills/evaluation/recipes/tasks/mmlu_pro.md
  • .claude/skills/evaluation/recipes/tasks/mmlu_pro.yaml
  • .claude/skills/evaluation/recipes/tasks/scicode.md
  • .claude/skills/evaluation/recipes/tasks/scicode.yaml
  • .claude/skills/evaluation/tests/evals.json
  • .claude/skills/ptq/SKILL.md
  • .claude/skills/ptq/references/checkpoint-validation.md
  • .claude/skills/ptq/tests.json
💤 Files with no reviewable changes (6)
  • .claude/skills/evaluation/recipes/tasks/ifbench.yaml
  • .claude/skills/evaluation/recipes/tasks/mmlu_pro.yaml
  • .claude/skills/evaluation/recipes/tasks/aime2025.yaml
  • .claude/skills/evaluation/recipes/tasks/scicode.yaml
  • .claude/skills/evaluation/recipes/tasks/gpqa.yaml
  • .claude/skills/evaluation/recipes/tasks/livecodebench.yaml
✅ Files skipped from review due to trivial changes (3)
  • .claude/skills/evaluation/recipes/examples/example_eval.yaml
  • .claude/skills/evaluation/recipes/tasks/ifbench.md
  • .claude/skills/evaluation/recipes/tasks/mmlu_pro.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • .claude/skills/evaluation/tests/evals.json

Comment on lines +73 to +75
def extract_gpqa_score(path, repeats=None):
data = yaml.safe_load(open(path))
metrics = data["results"]["groups"]["gpqa"]["metrics"]

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add basic argument validation and safe file loading in the extractor.

This snippet can throw IndexError if invoked without CLI args, and it should avoid passing a bare open(...) handle to yaml.safe_load without a context manager.

Suggested fix
 def extract_gpqa_score(path, repeats=None):
-    data = yaml.safe_load(open(path))
+    with open(path, "r", encoding="utf-8") as f:
+        data = yaml.safe_load(f)
@@
 if __name__ == "__main__":
-    path = sys.argv[1]
+    if len(sys.argv) < 2:
+        raise SystemExit("Usage: python extract_gpqa_score.py <results.yaml> [repeats]")
+    path = sys.argv[1]
     repeats = int(sys.argv[2]) if len(sys.argv) > 2 else None
     print(extract_gpqa_score(path, repeats))

Also applies to: 94-97
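For reference, a self-contained variant of the extractor hardened along these lines is shown below. Only the argument validation, context-managed read, and key checks come from the review suggestion; what exactly to return from `metrics` is recipe-specific and the snippet truncates before that point, so the return value here is left as the raw `metrics` mapping.

```python
# Assumption-laden sketch: traversal path mirrors the recipe snippet
# (results -> groups -> gpqa -> metrics); the final metric selection is
# left to the original recipe code.
import sys

import yaml


def extract_gpqa_score(path, repeats=None):
    if not path:
        raise ValueError("path to the results YAML is required")
    if repeats is not None and not isinstance(repeats, int):
        raise ValueError("repeats must be an int when provided")
    try:
        with open(path, "r", encoding="utf-8") as f:
            data = yaml.safe_load(f)
    except (FileNotFoundError, yaml.YAMLError) as exc:
        raise ValueError(f"could not read results YAML {path!r}: {exc}") from exc
    try:
        metrics = data["results"]["groups"]["gpqa"]["metrics"]
    except (KeyError, TypeError) as exc:
        raise ValueError(f"unexpected results layout in {path!r}") from exc
    return metrics


if __name__ == "__main__":
    if len(sys.argv) < 2:
        raise SystemExit("Usage: python extract_gpqa_score.py <results.yaml> [repeats]")
    repeats = int(sys.argv[2]) if len(sys.argv) > 2 else None
    print(extract_gpqa_score(sys.argv[1], repeats))
```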


Comment on lines +105 to +108
def extract_score(path, group="scicode"):
spec = TASKS[group]
data = yaml.safe_load(open(path))
metrics = data["results"]["groups"][group]["metrics"]

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Harden CLI/file handling in the score extractor snippet.

The snippet crashes with IndexError when no path is passed, and it opens the YAML file without a context manager.

Suggested fix
 def extract_score(path, group="scicode"):
     spec = TASKS[group]
-    data = yaml.safe_load(open(path))
+    with open(path, "r", encoding="utf-8") as f:
+        data = yaml.safe_load(f)
@@
 if __name__ == "__main__":
-    path = sys.argv[1]
+    if len(sys.argv) < 2:
+        raise SystemExit("Usage: python extract_score.py <results.yaml> [scicode|gpqa]")
+    path = sys.argv[1]
     group = sys.argv[2] if len(sys.argv) > 2 else "scicode"
     print(extract_score(path, group))

Also applies to: 135-138



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude/skills/evaluation/SKILL.md:
- Around line 260-267: The preflight check currently tests for any credential
entry using the generic grep command; change it to verify credentials per
registry host used by the selected images by searching for the specific registry
hostnames (not just any "machine" entry) in ~/.config/enroot/.credentials.
Update the documented check (the grep invocation shown) to demonstrate matching
the actual registry host(s) (e.g., loop or run grep for each selected image's
registry host) so the preflight returns true only when credentials exist for
those specific hosts.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 466a2d74-5bf4-40f5-a94e-78d17d191a59

📥 Commits

Reviewing files that changed from the base of the PR and between 0633922 and 29bbb20.

📒 Files selected for processing (7)
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/evaluation/recipes/env.example
  • .claude/skills/evaluation/recipes/tasks/aa_lcr.md
  • .claude/skills/evaluation/recipes/tasks/aime2025.md
  • .claude/skills/evaluation/recipes/tasks/hle_aa.md
  • .claude/skills/evaluation/recipes/tasks/ifbench.md
  • .claude/skills/evaluation/recipes/tasks/mmlu_pro_aa_v3.md
✅ Files skipped from review due to trivial changes (6)
  • .claude/skills/evaluation/recipes/env.example
  • .claude/skills/evaluation/recipes/tasks/mmlu_pro_aa_v3.md
  • .claude/skills/evaluation/recipes/tasks/aa_lcr.md
  • .claude/skills/evaluation/recipes/tasks/hle_aa.md
  • .claude/skills/evaluation/recipes/tasks/ifbench.md
  • .claude/skills/evaluation/recipes/tasks/aime2025.md

Comment on lines 260 to +267
```bash
ssh <host> "grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null"
```

**Decision flow (check before submitting):**
1. Check if the cluster has credentials for the default DockerHub image (see command above)
2. If DockerHub credentials exist → use the default image and submit
3. If DockerHub credentials are missing but can be added → add them (see `slurm-setup.md` section 6), then submit
4. If DockerHub credentials cannot be added → override `deployment.image` to the NGC alternative and submit:
1. If the selected images are public → submit without an auth preflight
2. If any selected image is private or access-restricted → check for credentials for that image's registry (see command above)
3. If credentials exist → use the selected image and submit

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make credential verification registry-specific.

Line 261 currently checks whether any credential entry exists, not whether credentials exist for the selected registry host(s). That can pass preflight but still fail image pulls.

Suggested doc fix
-ssh <host> "grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null"
+ssh <host> "awk '/^\s*machine\s+/ {print $2}' ~/.config/enroot/.credentials 2>/dev/null"
+# Verify the required registry host(s) from selected images are present (e.g., docker.io, nvcr.io, registry.internal).
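As one way to implement the registry-specific check, here is a hypothetical helper; the function name and the ssh/cat approach are assumptions for illustration, not part of the PR. It relies on the fact (visible in the existing grep) that enroot's `.credentials` file uses netrc-style `machine <host>` entries.

```python
# Hypothetical helper, not part of the PR: list the "machine <host>" entries on
# the cluster and report which required registry hosts are still missing.
import subprocess


def missing_registry_credentials(host: str, registries: list[str]) -> list[str]:
    out = subprocess.run(
        ["ssh", host, "cat ~/.config/enroot/.credentials 2>/dev/null"],
        capture_output=True, text=True,
    ).stdout
    machines = set()
    for line in out.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "machine":
            machines.add(parts[1])
    return [reg for reg in registries if reg not in machines]


# Example: missing_registry_credentials("my-cluster-login", ["nvcr.io", "docker.io"])
# returns the registry hosts that still need credentials before submitting.
```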

