Distributed run pipeline by seanrivera · Pull Request #22 · ManifoldRG/MultiNet-v2.0

seanrivera · 2026-06-18T21:03:23Z

This is just a commit for the final run pipeline before we kick things off.

Once the pipeline lands, the remaining work is configuration and management changes.


PR #18 added last_usage telemetry to the old single-file interface/agents.py, which no longer exists (now an interface/agents/ package). Port the pattern: ClaudeAnthropicAgent captures usage from the Anthropic response via normalize_token_usage; Qwen35VLAgent records prompt/output token counts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
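The telemetry pattern described in this commit can be pictured as an agent that stores normalized token counts after each model call. The following is a minimal sketch under stated assumptions: `normalize_usage_sketch` and `UsageTrackingAgent` are hypothetical names, not the repository's `normalize_token_usage` or agent classes, and the output shape (`prompt_tokens`, `output_tokens`, `total_tokens`) is chosen only for illustration.

```python
from typing import Any, Callable, Mapping, Optional, Tuple


def normalize_usage_sketch(usage: Any) -> dict:
    """Map a provider usage object or dict onto one common shape (hypothetical)."""

    def get(name: str) -> int:
        # Accept both dict-like payloads and attribute-style response objects.
        if isinstance(usage, Mapping):
            value = usage.get(name)
        else:
            value = getattr(usage, name, None)
        return int(value or 0)

    # Try input/output naming first, then prompt/completion naming.
    prompt = get("input_tokens") or get("prompt_tokens")
    output = get("output_tokens") or get("completion_tokens")
    return {
        "prompt_tokens": prompt,
        "output_tokens": output,
        "total_tokens": prompt + output,
    }


class UsageTrackingAgent:
    """Hypothetical agent that records the usage of its most recent call."""

    def __init__(self, call_model: Callable[[str], Tuple[str, Any]]):
        # call_model is an assumed callable returning (text, raw_usage).
        self._call_model = call_model
        self.last_usage: Optional[dict] = None

    def act(self, prompt: str) -> str:
        text, raw_usage = self._call_model(prompt)
        self.last_usage = normalize_usage_sketch(raw_usage)
        return text
```

Keeping the normalized counts on the agent instance lets a runner read `last_usage` after each step without knowing which provider produced the response.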

Wire the canonical pipeline stages over the interface/ runner (Stack A) and the scorer/ package into a single inspectable orchestrator:

- pipeline/run_stage3.py: run one live-model episode -> episode.json
- pipeline/episode_metrics.py: derive path_choice (test2), mechanism_interaction_order + failure_point (test3), token totals, and the Appendix A.3 episode_runs.jsonl row; enrich runs for the scorer
- pipeline/reports.py: scoring_calibration_summary / complexity_distance_summary / mechanism_ordering_pairs aggregators
- scripts/run_pipeline.py: Stage 1->5 CLI (multinet-run-pipeline)
- scripts/validate_fixtures.py: validate fixtures + derive test2 route cells
- gridworld/fixtures/: manifest + test2 shortcut maze + test3 ordering pairs (test1 reuses the existing validation_10 set)
- tests for episode metrics, reports, and an end-to-end pipeline run

Baselines (BFS/greedy) stay Stage-2 difficulty/canonical-path generators via the scorer; Stage-3 episodes are live-model-only. No DAG runner (kept sequential).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
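The "no DAG runner, kept sequential" choice in this commit can be pictured as a plain loop over stage functions. This is a minimal sketch, assuming each stage is a function from a context dict to new entries; `run_stages` and the three stage bodies are hypothetical placeholders and do not reflect the actual code in scripts/run_pipeline.py.

```python
from typing import Callable, Dict, List, Tuple

Context = Dict[str, object]
Stage = Tuple[str, Callable[[Context], Context]]


def run_stages(stages: List[Stage], context: Context) -> Context:
    """Run stages strictly in list order, merging each stage's output."""
    completed: List[str] = []
    for name, stage_fn in stages:
        context = {**context, **stage_fn(context)}
        completed.append(name)
    return {**context, "completed_stages": completed}


# Hypothetical stage bodies; real stages would read and write artifact files.
def score_static(ctx: Context) -> Context:
    return {"canonical_length": 4}


def run_episode(ctx: Context) -> Context:
    return {"episode_length": 6}


def score_run(ctx: Context) -> Context:
    # A later stage can rely on earlier outputs because the order is fixed.
    return {"extra_steps": ctx["episode_length"] - ctx["canonical_length"]}


result = run_stages(
    [("stage2", score_static), ("stage3", run_episode), ("stage4", score_run)],
    {"task_id": "example"},
)
```

A fixed list keeps the execution order visible in one place, which is the main thing a sequential orchestrator offers over a dependency graph.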


Add a run-config layer that maps each model to its own task selection and provider/params, keeping the manifest as a separate metadata catalog:

- scripts/run_pipeline.py: load_run_config + resolve_task_rows (entries may be task-file paths, catalog task_ids, or experiment keywords; catalog metadata is attached by path so test2/test3 signals survive); run_from_config drives multiple models, scoring the union suite once and aggregating one episode_runs.jsonl + report set. _build_agent_from_spec constructs claude/qwen agents from the model entry (provider/model/temperature/max_tokens).
- CLI: --run-config is the primary path; --agent/--experiment remain a single-model fallback.
- gridworld/fixtures/run_config.example.json: sample config.
- tests for task resolution and a config-driven multi-model run (stub factory).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
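The three kinds of task entry named in this commit (task-file paths, catalog task_ids, experiment keywords) suggest a small resolution step. The sketch below is illustrative only: `resolve_entries`, the catalog row keys (`task_id`, `path`, `experiment`), and the `.json` fallback rule are assumptions, not the behavior of the repository's `resolve_task_rows`.

```python
from typing import Dict, List


def resolve_entries(entries: List[str], catalog: List[Dict]) -> List[Dict]:
    """Resolve config entries to task rows, attaching catalog metadata by path."""
    by_path = {row["path"]: row for row in catalog}
    by_id = {row["task_id"]: row for row in catalog}
    resolved: List[Dict] = []
    seen_paths = set()

    def add(row: Dict) -> None:
        # De-duplicate by path so overlapping entries yield one row per task.
        if row["path"] not in seen_paths:
            seen_paths.add(row["path"])
            resolved.append(row)

    for entry in entries:
        if entry in by_path:  # task-file path known to the catalog
            add(by_path[entry])
        elif entry in by_id:  # catalog task_id
            add(by_id[entry])
        else:
            matches = [r for r in catalog if r.get("experiment") == entry]
            if matches:  # experiment keyword expands to all its tasks
                for row in matches:
                    add(row)
            elif entry.endswith(".json"):  # uncatalogued path: no metadata
                add({"path": entry})
            else:
                raise ValueError(f"cannot resolve task entry: {entry!r}")
    return resolved
```

Keying the final rows by path means a task selected both by keyword and by explicit path appears once and keeps its catalog metadata.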

Cached artifacts are now reused only when their inputs hash still matches, instead of skipping purely on file existence:

- Stage 2: reuse scored_static.json/canonical_paths.json only when the stored inputs_hash equals the hash recomputed from the current task spec + scorer config; otherwise regenerate the bundle. _expected_static_hash mirrors the scorer recipe (guarded by a parity test).
- Stage 3 (model calls, the expensive stage): stamp each episode with a sidecar run_inputs.json carrying an inputs_hash over {task spec, model_id, seed, prompt config, backend, pipeline_version}; reuse the cached episode only on a hash match. Scorer-config changes intentionally do NOT invalidate the episode.
- Stage 4 (cheap, deterministic): always re-score from the cached/fresh episode, so scorer-config / static / canonical changes propagate to run_score.json.
- canonical_paths.json now carries its own inputs_hash (scorer/artifacts.py + solvers.py), closing the last unhashed scorer artifact.

Tests: hash parity with the scorer, episode cache hit on unchanged re-run, task edit invalidating both static and episode, and scorer-config change re-scoring without re-running the model.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
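The Stage 3 rule in this commit (reuse a cached episode only when the sidecar hash matches) can be sketched as follows. This is a hedged illustration, not the repository's implementation: `inputs_hash` and `load_or_run_episode` are hypothetical helpers, and SHA-256 over sorted-key JSON is an assumed hashing recipe that the source does not specify. Only the file names episode.json and run_inputs.json come from the commit message.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Tuple


def inputs_hash(inputs: dict) -> str:
    """Hash a canonical JSON form so key order does not change the digest."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def load_or_run_episode(
    episode_dir: Path, inputs: dict, run_model: Callable[[dict], dict]
) -> Tuple[dict, bool]:
    """Return (episode, cache_hit); rerun the model when the hash differs."""
    episode_path = episode_dir / "episode.json"
    sidecar_path = episode_dir / "run_inputs.json"
    expected = inputs_hash(inputs)

    # File existence alone is not enough: the stored hash must also match.
    if episode_path.exists() and sidecar_path.exists():
        stored = json.loads(sidecar_path.read_text()).get("inputs_hash")
        if stored == expected:
            return json.loads(episode_path.read_text()), True

    episode = run_model(inputs)  # the expensive call
    episode_dir.mkdir(parents=True, exist_ok=True)
    episode_path.write_text(json.dumps(episode))
    sidecar_path.write_text(json.dumps({"inputs_hash": expected, "inputs": inputs}))
    return episode, False
```

Because the scorer config is left out of `inputs` in this sketch, changing it would not alter the hash, which matches the stated intent that scorer-config changes re-score without re-running the model.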


…ry_summary # Conflicts: # gridworld/custom_env.py # prompting_experiments/prompt_templates/system.py # prompting_experiments/prompt_templates/user.py # prompting_experiments/prompts.txt # tests/test_prompt_observation_text.py


helenlu66 and others added 30 commits May 29, 2026 01:27

made sure exp 3 prompts interface with the existing model interface m…

b47e2bc

…odules

added prompting_experiments

9ed266f

removed cardinal direction condition

410f5f7

removed the part of the prompt that tells the agent the number of ste…

314baf2

…ps left

got rid of the minimal prompt condition

5567780

Merge remote-tracking branch 'origin/codex/add-ogbench-submodule' int…

c5c589a

…o feature/run-pipeline

Merge remote-tracking branch 'origin/interface-prompts-consolidated' …

d6a5c4b

…into feature/run-pipeline # Conflicts: # ogbench

Merge remote-tracking branch 'origin/scorer' into feature/run-pipeline

ea109ea

# Conflicts: # interface/agents.py # pyproject.toml

moved the prose about the direction the agent's facing into the syste…

15dba61

…m prompt template

removed initial maze desc from standard prompts

6c7e54a

removed unnecessary NL desc from system prompt

a3516c2

take 3 steps in the maze when previewing prompts to see context_windo…

9d74919

…w=3 prompt

fixed the prompt for full sequence of actions

652d686

added support for subgoal planning, moved output format cue into user…

9222d40

… prompt

renamed many conditions to standard

fcedf3c

added ogbench as a submodule

82ef138

fixed some D2 mazes

b71e7ad

Move S and M1 mazes to parent repo

0b06c3b

use the bfs solver in baselines

150e373

replaced 3D maze images with 2D ones

e31e28d

Remove VS Code workspace files

aef362b

added a description of the inventory to every prompt

f6aa8d7

fix one inconsistency between maze.goal and goal.target

3467c47

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

ensure maze.goal and goal.target are the same

8bc1660

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

ensuring the repo root is on sys.path

a614c7c

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

ensuring the repo root is on sys.path

6e68414

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

seanrivera and others added 30 commits June 24, 2026 13:25

qwen: prefer Qwen3.6 model class + no-quant smoke run-config

e5aadaa

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

deploy: Qwen3.6 load/throughput smoke (reports tok/s vs target)

f47ae6c

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

deploy: warn on 0-token smoke + assert MEETS tok/s value

7af4ba4

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

deploy: single invasive Qwen3.6 VM setup script

3c8731d

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

chore: drop dev planning docs from branch + gitignore docs/superpowers

d29d8a9

Keep dev specs/plans local-only; they pollute the pushed/release branch. docs/future_directions.md (product doc) retained. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

using public API plan_bfs_path to avoid merge conflict

fc222d1

gitmodules should always point to the main branch of the submodule

503c8da

resolve merge conflicts for submodules

6d8f563

Add standalone scorer artifacts

21f308c

Move scorer into standalone package

a96cff3

Split scorer modules and use merged planners

b1dd9d4

rebasing

05fd7c2

Harden scorer telemetry and artifact contracts

63594f1

Complete scorer runtime diagnostics and aggregation

c2305ef

Revert "Complete scorer runtime diagnostics and aggregation"

2ed8a3b

This reverts commit 08366f1.

Tighten scorer branch scope

d16f007

Fixed the comments, added more reliability to the run pipeline, and n…

91d76b4

…ew tests

Add kimi

6274470

Address comments

c34773e

Move S and M1 mazes to parent repo

5b0cd6c

added complete set of M mazes and D mazes

63bb919

fixed maze desc metadata

7102525

D tests now test for inactive switch

1f1394c

deleted duplicate images

a526805

Merge remote-tracking branch 'origin/pr/19' into feature/run-pipeline

bee26db

# Conflicts: # .gitignore # interface/runner.py # pyproject.toml # scorer/artifacts.py # scorer/runtime.py # scorer/solvers.py

Merge remote-tracking branch 'origin/pr/23' into feature/run-pipeline

915dad0

Fix PR 23 merge regressions

7fe131a

Tune distributed pipeline runtime and API usage

743a89b

Merge feature/run-pipeline into distributed pipeline

b748ce7

seanrivera wants to merge 124 commits into main from Distributed-run-pipeline


4 participants
