Distributed run pipeline #22
Open
seanrivera wants to merge 124 commits into
…o feature/run-pipeline
…into feature/run-pipeline
# Conflicts:
#   ogbench
# Conflicts:
#   interface/agents.py
#   pyproject.toml
PR #18 added last_usage telemetry to the old single-file interface/agents.py, which no longer exists (now an interface/agents/ package). Port the pattern: ClaudeAnthropicAgent captures usage from the Anthropic response via normalize_token_usage; Qwen35VLAgent records prompt/output token counts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Wire the canonical pipeline stages over the interface/ runner (Stack A) and the
scorer/ package into a single inspectable orchestrator:
- pipeline/run_stage3.py: run one live-model episode -> episode.json
- pipeline/episode_metrics.py: derive path_choice (test2),
mechanism_interaction_order + failure_point (test3), token totals, and the
Appendix A.3 episode_runs.jsonl row; enrich runs for the scorer
- pipeline/reports.py: scoring_calibration_summary /
complexity_distance_summary / mechanism_ordering_pairs aggregators
- scripts/run_pipeline.py: Stage 1->5 CLI (multinet-run-pipeline)
- scripts/validate_fixtures.py: validate fixtures + derive test2 route cells
- gridworld/fixtures/: manifest + test2 shortcut maze + test3 ordering pairs
(test1 reuses the existing validation_10 set)
- tests for episode metrics, reports, and an end-to-end pipeline run
Baselines (BFS/greedy) stay Stage-2 difficulty/canonical-path generators via the
scorer; Stage-3 episodes are live-model-only. No DAG runner (kept sequential).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
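For reviewers, a minimal sketch of what "kept sequential, no DAG runner" could look like. This is an illustration only, not the code in scripts/run_pipeline.py: the name run_stages, the stage names, and the context-dict convention are all hypothetical.

```python
def run_stages(stages, context):
    """Run (name, callable) stages strictly in order.

    Each stage receives the accumulated context dict and returns a dict of
    updates. There is no dependency graph: order is simply list order.
    """
    done = []
    for name, stage in stages:
        context = {**context, **stage(context)}
        done.append(name)
    return {**context, "completed": done}


# Hypothetical stand-ins for the real stages (load tasks, run episode, score).
stages = [
    ("load", lambda ctx: {"tasks": ["t1"]}),
    ("episode", lambda ctx: {"episodes": [{"task": t} for t in ctx["tasks"]]}),
    ("score", lambda ctx: {"n_scored": len(ctx["episodes"])}),
]
result = run_stages(stages, {})
```

Because every stage sees the full context, intermediate artifacts stay inspectable, at the cost of no parallelism between stages.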
…m prompt template
Add a run-config layer that maps each model to its own task selection and
provider/params, keeping the manifest as a separate metadata catalog:
- scripts/run_pipeline.py: load_run_config + resolve_task_rows (entries may be
task-file paths, catalog task_ids, or experiment keywords; catalog metadata is
attached by path so test2/test3 signals survive); run_from_config drives
multiple models, scoring the union suite once and aggregating one
episode_runs.jsonl + report set. _build_agent_from_spec constructs claude/qwen
agents from the model entry (provider/model/temperature/max_tokens).
- CLI: --run-config is the primary path; --agent/--experiment remain a
single-model fallback.
- gridworld/fixtures/run_config.example.json: sample config.
- tests for task resolution and a config-driven multi-model run (stub factory).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
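A rough sketch of the entry resolution described above, for discussion only. It is not the actual resolve_task_rows: the CatalogRow type, the resolve_entries name, the lookup precedence (task_id, then path, then experiment keyword), the fallback for uncatalogued paths, and the fixture paths are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogRow:
    """Hypothetical manifest row; the real catalog schema may differ."""
    task_id: str
    path: str
    experiment: str
    metadata: dict = field(default_factory=dict)


def resolve_entries(entries, catalog):
    """Resolve config entries to catalog rows, de-duplicated by task path.

    An entry may be a catalog task_id, a task-file path, or an experiment
    keyword. Unknown entries are assumed to be uncatalogued task paths and
    get empty metadata.
    """
    by_id = {row.task_id: row for row in catalog}
    by_path = {row.path: row for row in catalog}
    resolved = {}  # path -> row; dicts keep insertion order
    for entry in entries:
        if entry in by_id:
            matches = [by_id[entry]]
        elif entry in by_path:
            matches = [by_path[entry]]
        else:
            matches = [row for row in catalog if row.experiment == entry]
            if not matches:
                matches = [CatalogRow(task_id=entry, path=entry, experiment="")]
        for row in matches:
            resolved.setdefault(row.path, row)
    return list(resolved.values())


# Hypothetical catalog and config entries.
catalog = [
    CatalogRow("t2_maze", "fixtures/test2/maze.json", "test2", {"shortcut": True}),
    CatalogRow("t3_a", "fixtures/test3/a.json", "test3"),
    CatalogRow("t3_b", "fixtures/test3/b.json", "test3"),
]
rows = resolve_entries(
    ["test3", "t2_maze", "fixtures/test3/a.json", "other/task.json"], catalog
)
```

Keying the result by path is what lets catalog metadata follow a task even when the config names it by file path rather than task_id.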
Cached artifacts are now reused only when their inputs hash still matches,
instead of skipping purely on file existence:
- Stage 2: reuse scored_static.json/canonical_paths.json only when the stored
inputs_hash equals the hash recomputed from the current task spec + scorer
config; otherwise regenerate the bundle. _expected_static_hash mirrors the
scorer recipe (guarded by a parity test).
- Stage 3 (model calls, the expensive stage): stamp each episode with a sidecar
run_inputs.json carrying an inputs_hash over {task spec, model_id, seed, prompt
config, backend, pipeline_version}; reuse the cached episode only on a hash
match. Scorer-config changes intentionally do NOT invalidate the episode.
- Stage 4 (cheap, deterministic): always re-score from the cached/fresh episode,
so scorer-config / static / canonical changes propagate to run_score.json.
- canonical_paths.json now carries its own inputs_hash (scorer/artifacts.py +
solvers.py), closing the last unhashed scorer artifact.
Tests: hash parity with the scorer, episode cache hit on unchanged re-run, task
edit invalidating both static and episode, and scorer-config change re-scoring
without re-running the model.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
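The hash-gated reuse for Stage 3 can be sketched roughly as follows. This is a simplified illustration, not the pipeline's code: the function names, the SHA-256-over-sorted-JSON recipe, and the run_model callable are assumptions; only the file names episode.json and run_inputs.json and the inputs_hash key come from the commit message.

```python
import hashlib
import json
from pathlib import Path


def inputs_hash(inputs: dict) -> str:
    """Stable SHA-256 over a JSON-serialisable dict (key order is irrelevant)."""
    blob = json.dumps(inputs, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def load_or_run_episode(episode_dir: Path, inputs: dict, run_model):
    """Return (episode, reused); reuse the cache only on a sidecar hash match."""
    episode_path = episode_dir / "episode.json"
    sidecar_path = episode_dir / "run_inputs.json"
    expected = inputs_hash(inputs)

    if episode_path.exists() and sidecar_path.exists():
        stored = json.loads(sidecar_path.read_text()).get("inputs_hash")
        if stored == expected:
            return json.loads(episode_path.read_text()), True

    # Cache miss or stale inputs: make the expensive model call and re-stamp.
    episode = run_model(inputs)
    episode_dir.mkdir(parents=True, exist_ok=True)
    episode_path.write_text(json.dumps(episode))
    sidecar_path.write_text(json.dumps({"inputs_hash": expected}))
    return episode, False
```

Leaving the scorer config out of the hashed inputs is what allows Stage 4 to re-score a cached episode without triggering another model call.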
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Keep dev specs/plans local-only; they pollute the pushed/release branch. docs/future_directions.md (product doc) retained. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This reverts commit 08366f1.
…ry_summary
# Conflicts:
#   gridworld/custom_env.py
#   prompting_experiments/prompt_templates/system.py
#   prompting_experiments/prompt_templates/user.py
#   prompting_experiments/prompts.txt
#   tests/test_prompt_observation_text.py
# Conflicts:
#   .gitignore
#   interface/runner.py
#   pyproject.toml
#   scorer/artifacts.py
#   scorer/runtime.py
#   scorer/solvers.py
This PR lands the final run pipeline before we kick off the runs.
Once the pipeline lands, the remaining work is configuration and management changes.