Standalone evaluation harness for tuning FreeFlow-style dictation prompts against curated fixtures.
The runner mirrors the app's prompt flow:
- Context synthesis from window metadata and an optional screenshot.
- Dictation post-processing from raw transcript + context summary.
It is intentionally separate from the Swift app and uses only the Python standard library.
The latest saved recommendation in this repo is:
- Model:
openai/gpt-oss-20b - System prompt:
system-gptoss-multilingual-email-v25
Recent result artifacts:
eval/results/v24-vs-v25-pairwise-hybrid-concurrency25-throughput-all-cases-2026-05-20-5run-aggregate.mdeval/results/v24-vs-v25-pairwise-hybrid-concurrency25-throughput-all-cases-2026-05-20-run1.jsoneval/results/v24-vs-v25-pairwise-hybrid-concurrency25-throughput-all-cases-2026-05-20-run2.jsoneval/results/v24-vs-v25-pairwise-hybrid-concurrency25-throughput-all-cases-2026-05-20-run3.jsoneval/results/v24-vs-v25-pairwise-hybrid-concurrency25-throughput-all-cases-2026-05-20-run4.jsoneval/results/v24-vs-v25-pairwise-hybrid-concurrency25-throughput-all-cases-2026-05-20-run5.jsoneval/results/model-compare-v24-hybrid-gpt54nano-judge-2026-04-05.jsoneval/results/model-compare-v24-hybrid-gpt54nano-judge-2026-04-05.mdeval/results/v24-vs-v9-hybrid-concurrency25-en-context-2026-04-01.jsoneval/results/v9-vs-v24-concurrency25-en-context-2026-04-01.md
Key numbers from the latest v24 vs v25 pairwise-hybrid runs:
- Five-run mean on the 37-case English-context suite:
v250.8875vsv240.8712. - Mean delta for
v25:+0.0163, with delta standard deviation0.0264;v25won 3 runs andv24won 2. - Instruction-preservation slice mean:
v250.9763vsv240.9397. - Because the mean margin is small relative to run-to-run variance, the recommendation is pragmatic rather than decisive:
v25is now default because it adds explicit instruction-preservation coverage without a clear aggregate regression.
Historical v24 reference numbers:
v24+openai/gpt-oss-20b:0.8901average hybrid score on the 32-case English-context suite withopenai/gpt-5.4-nanoas judge.v24beatv9head-to-head on the same 32-case suite:0.8717vs0.8300.- On the
v24model comparison run,openai/gpt-oss-20bbeatmeta-llama/llama-4-scoutunder both hybrid and heuristic scoring.
The current tradeoff is that v25 keeps the v24 behavior and adds a stricter instruction-preservation block for transcripts that describe requests to another person or AI assistant. Several Romanian, formal-email, and technical-prose cases still show run-to-run variance worth improving further.
eval_groq_prompts.py: standalone runner for context, post-process, and end-to-end pipeline evalseval/prompt_variants.json: context and system prompt candidates, including the currentv25system prompteval/prompt_eval_cases*.json: main and focused fixture suiteseval/results/: saved JSON outputs plus a few Markdown comparison notestests/test_eval_groq_prompts.py: regression tests for CLI parsing and output scoring behavior
contextmode: evaluate only the context-summary stagepostprocessmode: evaluate only transcript cleanuppipelinemode: run both stages together- OpenAI-compatible chat APIs via
--base-url - Groq direct and OpenRouter routing
- optional OpenRouter provider routing via
--provider-order,--provider-sort, and--allow-provider-fallbacks - heuristic, LLM-judge, hybrid, pairwise LLM, and pairwise-hybrid scoring
- separate judge model selection via
--judge-model - optional parallel case execution via
--max-concurrency - screenshot-backed context cases when
screenshot_pathis present in a fixture
Post-process scoring is no longer just raw reference similarity.
Heuristic scoring now combines:
- reference similarity
- required-term coverage
- forbidden-term penalties
- exact-match bonus
- formatting checks for email structure, explicit lists, and self-correction cleanup
- output-contract checks that penalize wrappers like
Here is the clean transcript
LLM scoring is available with:
--scoring-mode llm--scoring-mode hybrid--scoring-mode pairwise-llm--scoring-mode pairwise-hybrid
When LLM scoring is enabled, the judge defaults to the candidate model unless --judge-model is set explicitly.
Pairwise modes compare exactly two system variants in one judge call per case and tie normalized-identical outputs without calling the judge.
-
eval/prompt_eval_cases_system_only_en_context.jsonMain post-processing suite with English context summaries and multilingual transcripts. -
eval/prompt_eval_cases_system_only.jsonEarlier system-only suite. -
eval/prompt_eval_cases_wispr_claims.jsonWispr-inspired behavior suite. -
eval/prompt_eval_cases_productivity.jsonOlder productivity-focused Slack / email / prompt-writing cases. -
eval/prompt_eval_cases.jsonMixed pipeline-oriented cases.
There are also a few temporary focused case files under eval/tmp_*.json used during prompt iteration.
Python 3 is enough. There are no third-party dependencies.
Set one of these environment variables, or pass --api-key directly:
LLM_API_KEYOPENROUTER_API_KEYOPENAI_API_KEYGROQ_API_KEY
Optional base URL env vars:
LLM_BASE_URLOPENAI_BASE_URLGROQ_BASE_URL
Run the current recommended setup through OpenRouter with hybrid scoring and an explicit judge model:
python3 eval_groq_prompts.py \
--api-key "$OPENROUTER_API_KEY" \
--base-url https://openrouter.ai/api/v1 \
--mode postprocess \
--cases eval/prompt_eval_cases_system_only_en_context.json \
--models openai/gpt-oss-20b \
--system-variants system-gptoss-multilingual-email-v25 \
--scoring-mode hybrid \
--judge-model openai/gpt-5.4-nano \
--min-request-interval 0 \
--max-concurrency 6 \
--output-json eval/results/example-v25-hybrid.jsonCompare two prompt variants head to head:
python3 eval_groq_prompts.py \
--api-key "$OPENROUTER_API_KEY" \
--base-url https://openrouter.ai/api/v1 \
--mode postprocess \
--cases eval/prompt_eval_cases_system_only_en_context.json \
--models openai/gpt-oss-20b \
--system-variants system-gptoss-multilingual-email-v24 system-gptoss-multilingual-email-v25 \
--scoring-mode pairwise-hybrid \
--judge-model openai/gpt-5.4-nano \
--min-request-interval 0 \
--max-concurrency 6 \
--output-json eval/results/v24-vs-v25-pairwise-example.jsonRun an end-to-end pipeline smoke test:
python3 eval_groq_prompts.py \
--api-key "$OPENROUTER_API_KEY" \
--base-url https://openrouter.ai/api/v1 \
--mode pipeline \
--cases eval/prompt_eval_cases.json \
--models openai/gpt-oss-20b \
--context-variants app-default-context \
--system-variants app-default-system \
--max-postprocess-cases 3 \
--min-request-interval 0 \
--output-json eval/results/pipeline-smoke.jsonRun directly against Groq:
python3 eval_groq_prompts.py \
--api-key "$GROQ_API_KEY" \
--base-url https://api.groq.com/openai/v1 \
--mode postprocess \
--cases eval/prompt_eval_cases_system_only_en_context.json \
--models meta-llama/llama-4-scout-17b-16e-instruct openai/gpt-oss-20b \
--system-variants app-default-system system-gptoss-multilingual-email-v25 \
--scoring-mode heuristic \
--output-json eval/results/direct-groq-example.jsonThe runner prints the top summary rows to stdout and can optionally save a full JSON payload with:
- run metadata
- routing settings
- scoring mode
- summary table
- per-case outputs
- per-case score breakdowns
- raw judge responses when LLM scoring is enabled
Run the regression tests with:
python3 -m unittest discover -s tests -vCurrent tests cover:
- CLI parsing for
--judge-model - email formatting penalties
- wrapper / boilerplate penalties
- explicit list formatting expectations
- self-correction cleanup penalties
- dictated email closing structure
- literal handling of meta-instruction transcripts
- The script name is historical. It now targets generic OpenAI-compatible chat APIs, not just Groq.
- OpenRouter runs default to
provider.sort=throughputunless you override it. eval/results/is the main reproducibility archive for this repo.