FreeFlow Prompt Eval

Standalone evaluation harness for tuning FreeFlow-style dictation prompts against curated fixtures.

The runner mirrors the app's prompt flow:

Context synthesis from window metadata and an optional screenshot.
Dictation post-processing from raw transcript + context summary.

It is intentionally separate from the Swift app and uses only the Python standard library.

Current Status

The latest saved recommendation in this repo is:

Model: openai/gpt-oss-20b
System prompt: system-gptoss-multilingual-email-v25

What This Repo Contains

eval_groq_prompts.py: standalone runner for context, post-process, and end-to-end pipeline evals
eval/prompt_variants.json: context and system prompt candidates, including the current v25 system prompt
eval/prompt_eval_cases*.json: main and focused fixture suites
eval/results/: saved JSON outputs plus a few Markdown comparison notes
tests/test_eval_groq_prompts.py: regression tests for CLI parsing and output scoring behavior

What The Runner Supports

context mode: evaluate only the context-summary stage
postprocess mode: evaluate only transcript cleanup
pipeline mode: run both stages together
OpenAI-compatible chat APIs via --base-url
Groq direct and OpenRouter routing
optional OpenRouter provider routing via --provider-order, --provider-sort, and --allow-provider-fallbacks
heuristic, LLM-judge, hybrid, pairwise LLM, and pairwise-hybrid scoring
separate judge model selection via --judge-model
optional parallel case execution via --max-concurrency
screenshot-backed context cases when screenshot_path is present in a fixture

Scoring

Post-process scoring is no longer just raw reference similarity.

Heuristic scoring now combines:

reference similarity
required-term coverage
forbidden-term penalties
exact-match bonus
formatting checks for email structure, explicit lists, and self-correction cleanup
output-contract checks that penalize wrappers like Here is the clean transcript

LLM scoring is available with:

--scoring-mode llm
--scoring-mode hybrid
--scoring-mode pairwise-llm
--scoring-mode pairwise-hybrid

When LLM scoring is enabled, the judge defaults to the candidate model unless --judge-model is set explicitly. Pairwise modes compare exactly two system variants in one judge call per case and tie normalized-identical outputs without calling the judge.

Main Suites

eval/prompt_eval_cases_system_only_en_context.json Main post-processing suite with English context summaries and multilingual transcripts.
eval/prompt_eval_cases_system_only.json Earlier system-only suite.
eval/prompt_eval_cases_wispr_claims.json Wispr-inspired behavior suite.
eval/prompt_eval_cases_productivity.json Older productivity-focused Slack / email / prompt-writing cases.
eval/prompt_eval_cases.json Mixed pipeline-oriented cases.

There are also a few temporary focused case files under eval/tmp_*.json used during prompt iteration.

Quick Start

Python 3 is enough. There are no third-party dependencies.

Set one of these environment variables, or pass --api-key directly:

LLM_API_KEY
OPENROUTER_API_KEY
OPENAI_API_KEY
GROQ_API_KEY

Optional base URL env vars:

LLM_BASE_URL
OPENAI_BASE_URL
GROQ_BASE_URL

Example Commands

Run the current recommended setup through OpenRouter with hybrid scoring and an explicit judge model:

python3 eval_groq_prompts.py \
  --api-key "$OPENROUTER_API_KEY" \
  --base-url https://openrouter.ai/api/v1 \
  --mode postprocess \
  --cases eval/prompt_eval_cases_system_only_en_context.json \
  --models openai/gpt-oss-20b \
  --system-variants system-gptoss-multilingual-email-v25 \
  --scoring-mode hybrid \
  --judge-model openai/gpt-5.4-nano \
  --min-request-interval 0 \
  --max-concurrency 6 \
  --output-json eval/results/example-v25-hybrid.json

Compare two prompt variants head to head:

python3 eval_groq_prompts.py \
  --api-key "$OPENROUTER_API_KEY" \
  --base-url https://openrouter.ai/api/v1 \
  --mode postprocess \
  --cases eval/prompt_eval_cases_system_only_en_context.json \
  --models openai/gpt-oss-20b \
  --system-variants system-gptoss-multilingual-email-v24 system-gptoss-multilingual-email-v25 \
  --scoring-mode pairwise-hybrid \
  --judge-model openai/gpt-5.4-nano \
  --min-request-interval 0 \
  --max-concurrency 6 \
  --output-json eval/results/v24-vs-v25-pairwise-example.json

Run an end-to-end pipeline smoke test:

python3 eval_groq_prompts.py \
  --api-key "$OPENROUTER_API_KEY" \
  --base-url https://openrouter.ai/api/v1 \
  --mode pipeline \
  --cases eval/prompt_eval_cases.json \
  --models openai/gpt-oss-20b \
  --context-variants app-default-context \
  --system-variants app-default-system \
  --max-postprocess-cases 3 \
  --min-request-interval 0 \
  --output-json eval/results/pipeline-smoke.json

Run directly against Groq:

python3 eval_groq_prompts.py \
  --api-key "$GROQ_API_KEY" \
  --base-url https://api.groq.com/openai/v1 \
  --mode postprocess \
  --cases eval/prompt_eval_cases_system_only_en_context.json \
  --models meta-llama/llama-4-scout-17b-16e-instruct openai/gpt-oss-20b \
  --system-variants app-default-system system-gptoss-multilingual-email-v25 \
  --scoring-mode heuristic \
  --output-json eval/results/direct-groq-example.json

Output Format

The runner prints the top summary rows to stdout and can optionally save a full JSON payload with:

run metadata
routing settings
scoring mode
summary table
per-case outputs
per-case score breakdowns
raw judge responses when LLM scoring is enabled

Tests

Run the regression tests with:

python3 -m unittest discover -s tests -v

Current tests cover:

CLI parsing for --judge-model
email formatting penalties
wrapper / boilerplate penalties
explicit list formatting expectations
self-correction cleanup penalties
dictated email closing structure
literal handling of meta-instruction transcripts

Notes

The script name is historical. It now targets generic OpenAI-compatible chat APIs, not just Groq.
OpenRouter runs default to provider.sort=throughput unless you override it.
eval/results/ is the main reproducibility archive for this repo.

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
eval		eval
tests		tests
.gitignore		.gitignore
README.md		README.md
eval_groq_prompts.py		eval_groq_prompts.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

FreeFlow Prompt Eval

Current Status

What This Repo Contains

What The Runner Supports

Scoring

Main Suites

Quick Start

Example Commands

Output Format

Tests

Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

FreeFlow Prompt Eval

Current Status

What This Repo Contains

What The Runner Supports

Scoring

Main Suites

Quick Start

Example Commands

Output Format

Tests

Notes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages