WIP: Agent Skills Updates From Live Trials #1493
Signed-off-by: Chad Voegele <cvoegele@nvidia.com>
📝 Walkthrough
Refactors evaluation and PTQ guidance with stricter quantization detection and baseline gating; replaces cron monitoring with a durable poller; requires explicit cluster selection for multi-cluster configs; adds SLURM QoS support and numerous evaluation recipe docs/removals; includes small utility fixes.
Changes
- Evaluation / Quantization & Task Flow
- PTQ & Checkpoint Validation
- Monitoring, Environment, and Launcher
- Utilities & Misc
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 5 | ❌ 1
❌ Failed checks (1 inconclusive)
✅ Passed checks (5 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.claude/skills/monitor/SKILL.md:
- Around line 54-59: The documentation/logic currently enforces "report only
state changes" universally; update it so that user-initiated checks (e.g., when
the user explicitly asks "check status") return the full current status for each
job rather than only deltas—leave monitor-driven checks to still compare against
`last_status` in `.claude/active_jobs.json` and report only changes. Adjust the
wording and any associated pseudocode/implementation notes to branch on the
trigger type ("monitor output" vs "user-initiated") and on user-initiated flows
ensure you read the registry, check each job, return current state for each job,
and then update `last_status` accordingly. Ensure references to `last_status`
and `.claude/active_jobs.json` remain consistent so maintainers can find and
implement the conditional behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 27a4dd3f-7246-45c7-8699-ec80a318c50a
📒 Files selected for processing (11)
- .claude/skills/common/environment-setup.md
- .claude/skills/evaluation/SKILL.md
- .claude/skills/evaluation/tests/evals.json
- .claude/skills/launching-evals/references/analyze-results.md
- .claude/skills/monitor/SKILL.md
- .claude/skills/ptq/SKILL.md
- .claude/skills/ptq/references/checkpoint-validation.md
- .gitignore
- modelopt/torch/quantization/model_quant.py
- tools/launcher/core.py
- tools/launcher/slurm_config.py
Whether triggered by monitor output or by the user asking "check status":

1. **Read the registry** from `.claude/active_jobs.json`
2. **Check each job** using the appropriate method (see below)
3. **Report only state changes** — compare against `last_status` in registry
4. **Update `last_status`** in the registry
User-initiated checks should return current status, not only deltas.
Right now this section applies “report only state changes” even when the user explicitly asks for status. That can yield no meaningful response if nothing changed since last poll.
✏️ Suggested update
-Whether triggered by monitor output or by the user asking "check status":
+Whether triggered by monitor output or by the user asking "check status":
1. **Read the registry** from `.claude/active_jobs.json`
2. **Check each job** using the appropriate method (see below)
-3. **Report only state changes** — compare against `last_status` in registry
-4. **Update `last_status`** in the registry
+3. **If user asked for status, report current status for all tracked jobs** (even if unchanged)
+4. **If triggered by monitor output, report only state changes** — compare against `last_status` in registry
+5. **Update `last_status`** in the registry
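The branching requested in this comment can be sketched in a few lines. This is only an illustration under assumptions: the registry layout (`{"jobs": {job_id: {"last_status": ...}}}`), the function name `build_report`, and the trigger labels `"user"` and `"monitor"` are hypothetical and are not taken from the skill itself.

```python
def build_report(registry, current, trigger):
    """Return the lines to report and update last_status in place.

    registry: {"jobs": {job_id: {"last_status": str}}}  (assumed layout)
    current:  {job_id: str}, the freshly checked state of each job
    trigger:  "user" (explicit status request) or "monitor" (poller output)
    """
    if trigger not in ("user", "monitor"):
        raise ValueError(f"unknown trigger: {trigger!r}")

    lines = []
    for job_id, state in current.items():
        entry = registry["jobs"].setdefault(job_id, {"last_status": None})
        previous = entry["last_status"]
        if trigger == "user":
            # Explicit request: always report, even if nothing changed.
            lines.append(f"{job_id}: {state}")
        elif previous != state:
            # Monitor-driven check: report deltas only.
            lines.append(f"{job_id}: {previous} -> {state}")
        # Either way, record what was just observed.
        entry["last_status"] = state
    return lines
```

With this shape, a user-initiated check never returns an empty answer for tracked jobs, while monitor-driven checks stay quiet unless something changed.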
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1493      +/-   ##
==========================================
- Coverage   72.75%   67.03%   -5.72%
==========================================
  Files         473      473
  Lines       51418    51419       +1
==========================================
- Hits        37407    34470    -2937
- Misses      14011    16949    +2938
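The figures in the coverage diff can be cross-checked with a little arithmetic. The sketch below assumes that the displayed percentages are hits divided by lines, truncated to two decimals, and that the delta is taken between the displayed values; that truncation is an inference from the numbers in this report, not documented Codecov behavior.

```python
def basis_points(hits, lines):
    """Coverage in hundredths of a percent, truncated via integer division."""
    return hits * 10_000 // lines


base = {"lines": 51418, "hits": 37407, "misses": 14011}  # main
head = {"lines": 51419, "hits": 34470, "misses": 16949}  # this PR

# Hits and misses should account for every line on both sides.
for side in (base, head):
    assert side["hits"] + side["misses"] == side["lines"]

base_bp = basis_points(base["hits"], base["lines"])
head_bp = basis_points(head["hits"], head["lines"])
delta_bp = head_bp - base_bp

print(base_bp / 100, head_bp / 100, delta_bp / 100)
```

Under that assumption the values reproduce the 72.75, 67.03, and -5.72 shown in the report, and the changes in hits and misses (-2937 and +2938) sum to the one added line.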
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.claude/skills/evaluation/SKILL.md (1)
203-212: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Fix task snippet schema mismatch (`tasks` vs `evaluation.tasks`).
The Step 5 example contradicts earlier instructions to edit `evaluation.tasks`. Keeping `tasks:` here can make generated configs invalid or ignored.
Suggested fix
-tasks:
-  - name: <task>
-    nemo_evaluator_config:
-      config:
-        params:
-          temperature: <value>
-          max_new_tokens: <value>
-          ...
+evaluation:
+  tasks:
+    - name: <task>
+      nemo_evaluator_config:
+        config:
+          params:
+            temperature: <value>
+            max_new_tokens: <value>
+            ...
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.claude/skills/evaluation/SKILL.md around lines 203 - 212, The YAML example
uses a top-level "tasks:" key which conflicts with the expected
"evaluation.tasks" namespace; update the snippet so the tasks list is nested
under "evaluation.tasks" (e.g., replace "tasks:" with "evaluation.tasks:" and
keep the existing task entries like "name" and "nemo_evaluator_config" intact),
and verify any references to "tasks" in the surrounding text or examples are
corrected to "evaluation.tasks" to keep schema consistent.
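A small guard can catch the schema mismatch described in this comment before a config is submitted. The sketch below works on an already-parsed config dict; the function name `find_tasks` and the choice to reject a top-level `tasks` key outright are assumptions for illustration, not behavior of the actual launcher.

```python
def find_tasks(config):
    """Return the task list, insisting that it lives under evaluation.tasks."""
    if "tasks" in config:
        # A top-level key is the mistake flagged in the review.
        raise ValueError("top-level 'tasks' found; nest it under 'evaluation.tasks'")
    evaluation = config.get("evaluation")
    if not isinstance(evaluation, dict) or not isinstance(evaluation.get("tasks"), list):
        raise ValueError("config has no 'evaluation.tasks' list")
    return evaluation["tasks"]


# Hypothetical config in the corrected shape.
good = {
    "evaluation": {
        "tasks": [
            {"name": "example_task",
             "nemo_evaluator_config": {"config": {"params": {"temperature": 0.0}}}}
        ]
    }
}
print([task["name"] for task in find_tasks(good)])
```

Failing loudly on the top-level key is one design choice; silently accepting both spellings would hide exactly the drift the review is trying to prevent.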
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.claude/skills/evaluation/recipes/tasks/gpqa.md:
- Around line 73-75: The extractor extract_gpqa_score currently can raise
IndexError when called without args and uses raw open(...) with yaml.safe_load;
add basic argument validation (ensure path is provided and repeats, if given, is
an int) and use a safe file context (with open(path, "r") as f) and
yaml.safe_load(f) while catching FileNotFoundError and yaml.YAMLError and
re-raising a clear ValueError; also validate that the expected keys exist in the
loaded dict (results -> groups -> gpqa -> metrics) and raise ValueError if
missing. Apply the same validation and safe-loading pattern to the similar
extractor function around lines 94-97 to ensure consistent error handling.
In @.claude/skills/evaluation/recipes/tasks/scicode.md:
- Around line 105-108: The extract_score function currently assumes a valid path
and opens the YAML without a context manager; fix it by validating the path
argument (raise ValueError or return a clear error if path is falsy), check the
file exists (catch FileNotFoundError), and read the YAML using a context manager
(with open(path) as f: data = yaml.safe_load(f)); then safely access
TASKS[group] and data["results"]["groups"][group]["metrics"] (use .get or catch
KeyError to provide a clearer error). Apply the same changes to the other
identical snippet that reads the YAML and accesses metrics.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1c2365f0-8719-4c16-9dd8-7640d57849f7
📒 Files selected for processing (18)
- .claude/skills/evaluation/SKILL.md
- .claude/skills/evaluation/recipes/examples/example_eval.yaml
- .claude/skills/evaluation/recipes/tasks/aime2025.md
- .claude/skills/evaluation/recipes/tasks/aime2025.yaml
- .claude/skills/evaluation/recipes/tasks/gpqa.md
- .claude/skills/evaluation/recipes/tasks/gpqa.yaml
- .claude/skills/evaluation/recipes/tasks/ifbench.md
- .claude/skills/evaluation/recipes/tasks/ifbench.yaml
- .claude/skills/evaluation/recipes/tasks/livecodebench.md
- .claude/skills/evaluation/recipes/tasks/livecodebench.yaml
- .claude/skills/evaluation/recipes/tasks/mmlu_pro.md
- .claude/skills/evaluation/recipes/tasks/mmlu_pro.yaml
- .claude/skills/evaluation/recipes/tasks/scicode.md
- .claude/skills/evaluation/recipes/tasks/scicode.yaml
- .claude/skills/evaluation/tests/evals.json
- .claude/skills/ptq/SKILL.md
- .claude/skills/ptq/references/checkpoint-validation.md
- .claude/skills/ptq/tests.json
💤 Files with no reviewable changes (6)
- .claude/skills/evaluation/recipes/tasks/ifbench.yaml
- .claude/skills/evaluation/recipes/tasks/mmlu_pro.yaml
- .claude/skills/evaluation/recipes/tasks/aime2025.yaml
- .claude/skills/evaluation/recipes/tasks/scicode.yaml
- .claude/skills/evaluation/recipes/tasks/gpqa.yaml
- .claude/skills/evaluation/recipes/tasks/livecodebench.yaml
✅ Files skipped from review due to trivial changes (3)
- .claude/skills/evaluation/recipes/examples/example_eval.yaml
- .claude/skills/evaluation/recipes/tasks/ifbench.md
- .claude/skills/evaluation/recipes/tasks/mmlu_pro.md
🚧 Files skipped from review as they are similar to previous changes (1)
- .claude/skills/evaluation/tests/evals.json
def extract_gpqa_score(path, repeats=None):
    data = yaml.safe_load(open(path))
    metrics = data["results"]["groups"]["gpqa"]["metrics"]
Add basic argument validation and safe file loading in the extractor.
This snippet can throw IndexError if invoked without args and should avoid raw open(...) in yaml.safe_load.
Suggested fix
 def extract_gpqa_score(path, repeats=None):
-    data = yaml.safe_load(open(path))
+    with open(path, "r", encoding="utf-8") as f:
+        data = yaml.safe_load(f)
@@
 if __name__ == "__main__":
-    path = sys.argv[1]
+    if len(sys.argv) < 2:
+        raise SystemExit("Usage: python extract_gpqa_score.py <results.yaml> [repeats]")
+    path = sys.argv[1]
     repeats = int(sys.argv[2]) if len(sys.argv) > 2 else None
     print(extract_gpqa_score(path, repeats))

Also applies to: 94-97
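The review also asks that the extractor confirm the expected keys exist (results -> groups -> gpqa -> metrics) and raise a clear `ValueError` otherwise. A minimal sketch of that lookup follows; it operates on an already-loaded dict so it stays independent of the YAML library, and the helper name `get_group_metrics` and the sample metric name are hypothetical.

```python
def get_group_metrics(data, group="gpqa"):
    """Walk results -> groups -> <group> -> metrics, failing with a clear message."""
    node = data
    walked = []
    for key in ("results", "groups", group, "metrics"):
        if not isinstance(node, dict) or key not in node:
            where = " -> ".join(walked) or "<top level>"
            raise ValueError(f"results file is missing {key!r} under {where}")
        node = node[key]
        walked.append(key)
    return node


# Made-up data in the shape the recipe snippet indexes into.
sample = {"results": {"groups": {"gpqa": {"metrics": {"example_metric": 0.5}}}}}
print(get_group_metrics(sample))
```

Compared with chained indexing, the error names the first missing key and where the walk stopped, which is usually enough to tell a wrong file from a wrong group name.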
def extract_score(path, group="scicode"):
    spec = TASKS[group]
    data = yaml.safe_load(open(path))
    metrics = data["results"]["groups"][group]["metrics"]
Harden CLI/file handling in the score extractor snippet.
The snippet crashes with IndexError when no path is passed, and it opens the YAML file without a context manager.
Suggested fix
 def extract_score(path, group="scicode"):
     spec = TASKS[group]
-    data = yaml.safe_load(open(path))
+    with open(path, "r", encoding="utf-8") as f:
+        data = yaml.safe_load(f)
@@
 if __name__ == "__main__":
-    path = sys.argv[1]
+    if len(sys.argv) < 2:
+        raise SystemExit("Usage: python extract_score.py <results.yaml> [scicode|gpqa]")
+    path = sys.argv[1]
     group = sys.argv[2] if len(sys.argv) > 2 else "scicode"
     print(extract_score(path, group))

Also applies to: 135-138
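An alternative to checking `len(sys.argv)` by hand is to let `argparse` enforce the usage string from the suggested fix. This is a sketch of that option, not what the recipe does; the `parse_cli` name is hypothetical, and the two allowed group values are taken from the usage message above.

```python
import argparse


def parse_cli(argv):
    """Parse `<results.yaml> [scicode|gpqa]`; argparse exits with status 2 on misuse."""
    parser = argparse.ArgumentParser(
        prog="extract_score.py",
        description="Extract a benchmark score from a results YAML file.",
    )
    parser.add_argument("path", help="path to the results YAML file")
    parser.add_argument(
        "group",
        nargs="?",                      # optional positional
        default="scicode",
        choices=["scicode", "gpqa"],    # rejects unknown groups before TASKS[group]
        help="task group to read (default: scicode)",
    )
    return parser.parse_args(argv)


args = parse_cli(["results.yaml"])
print(args.path, args.group)
```

Restricting `group` with `choices` also turns a later `KeyError` on `TASKS[group]` into an immediate usage error.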
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.claude/skills/evaluation/SKILL.md:
- Around line 260-267: The preflight check currently tests for any credential
entry using the generic grep command; change it to verify credentials per
registry host used by the selected images by searching for the specific registry
hostnames (not just any "machine" entry) in ~/.config/enroot/.credentials.
Update the documented check (the grep invocation shown) to demonstrate matching
the actual registry host(s) (e.g., loop or run grep for each selected image's
registry host) so the preflight returns true only when credentials exist for
those specific hosts.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 466a2d74-5bf4-40f5-a94e-78d17d191a59
📒 Files selected for processing (7)
- .claude/skills/evaluation/SKILL.md
- .claude/skills/evaluation/recipes/env.example
- .claude/skills/evaluation/recipes/tasks/aa_lcr.md
- .claude/skills/evaluation/recipes/tasks/aime2025.md
- .claude/skills/evaluation/recipes/tasks/hle_aa.md
- .claude/skills/evaluation/recipes/tasks/ifbench.md
- .claude/skills/evaluation/recipes/tasks/mmlu_pro_aa_v3.md
✅ Files skipped from review due to trivial changes (6)
- .claude/skills/evaluation/recipes/env.example
- .claude/skills/evaluation/recipes/tasks/mmlu_pro_aa_v3.md
- .claude/skills/evaluation/recipes/tasks/aa_lcr.md
- .claude/skills/evaluation/recipes/tasks/hle_aa.md
- .claude/skills/evaluation/recipes/tasks/ifbench.md
- .claude/skills/evaluation/recipes/tasks/aime2025.md
```bash
ssh <host> "grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null"
```

**Decision flow (check before submitting):**
-1. Check if the cluster has credentials for the default DockerHub image (see command above)
-2. If DockerHub credentials exist → use the default image and submit
-3. If DockerHub credentials are missing but can be added → add them (see `slurm-setup.md` section 6), then submit
-4. If DockerHub credentials cannot be added → override `deployment.image` to the NGC alternative and submit:
+1. If the selected images are public → submit without an auth preflight
+2. If any selected image is private or access-restricted → check for credentials for that image's registry (see command above)
+3. If credentials exist → use the selected image and submit
Make credential verification registry-specific.
Line 261 currently checks whether any credential entry exists, not whether credentials exist for the selected registry host(s). That can pass preflight but still fail image pulls.
Suggested doc fix
-ssh <host> "grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null"
+ssh <host> "awk '/^\s*machine\s+/ {print \$2}' ~/.config/enroot/.credentials 2>/dev/null"
+# Verify the required registry host(s) from selected images are present (e.g., docker.io, nvcr.io, registry.internal).
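The registry-specific check can also be done on the text of the credentials file once it has been fetched, which avoids nested shell quoting. The sketch below assumes netrc-style lines of the form `machine <host> login <user> password <secret>`, matching the `machine` pattern in the command above. The function names and the sample entries are made up, and how registry hosts are derived from image references is left out.

```python
def credential_hosts(credentials_text):
    """Collect the host from every line that starts with `machine <host>`."""
    hosts = set()
    for line in credentials_text.splitlines():
        tokens = line.split()
        if len(tokens) >= 2 and tokens[0] == "machine":
            hosts.add(tokens[1])  # keep only the host; never echo login or password
    return hosts


def missing_registry_hosts(credentials_text, required_hosts):
    """Return the required registry hosts that have no credential entry, sorted."""
    present = credential_hosts(credentials_text)
    return sorted(host for host in set(required_hosts) if host not in present)


# Made-up file contents for illustration only.
sample_credentials = """\
# enroot credentials (example)
machine nvcr.io login example-user password example-secret
"""
print(missing_registry_hosts(sample_credentials, ["nvcr.io", "docker.io"]))
```

The preflight passes only when the returned list is empty, so an entry for one registry no longer masks a missing entry for another.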
What does this PR do?
Type of change: bug fix
Usage
Ask Claude Code:
Testing
I'm running the full loop with the above prompt, and iterating on skills to resolve undesired agent behavior.
Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: ✅
Additional Information
See trials log for details.
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Chore