A benchmarking framework and command-line tool.
Two ways to use it:
- bench run - ad-hoc command benchmarking from the shell.
- run(suite, ...) - declarative Python scripts for repeatable configurations.
$ cat fib.py
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)
fib(35)
$ bench run --runs 5 'python3.9 fib.py' 'python3.14 fib.py'
run                  elapsed [ms]
matrix               mean ± σ        min … max
python3.9 fib.py     1392.4 ± 9.8    (1381.0 … 1403.2)    (5 runs)
python3.14 fib.py     947.1 ± 5.6    (940.2 … 954.9)      (5 runs)
Summary - run (elapsed)
python3.14 fib.py was
1.47 ± 0.01× better than python3.9 fib.py (5 runs)

(5 runs) is the count of successful runs. Here, Python 3.14 runs the recursive
fib(35) ~1.5x faster than 3.9. A run that crashes never enters the stats; it
is listed under a separate Failures: block instead.
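The summary ratio can be reproduced from the table above. The sketch below assumes the uncertainty comes from first-order error propagation for a quotient of independent quantities; that is a common choice, but the source does not say which formula bench uses, and `speedup` is a hypothetical helper, not part of bench.

```python
import math


def speedup(slow_mean, slow_sd, fast_mean, fast_sd):
    """Return (ratio, uncertainty) for slow/fast.

    Assumption: relative uncertainties add in quadrature, i.e. standard
    first-order error propagation for a quotient. bench may compute this
    differently.
    """
    ratio = slow_mean / fast_mean
    rel = math.sqrt((slow_sd / slow_mean) ** 2 + (fast_sd / fast_mean) ** 2)
    return ratio, ratio * rel


# Means and standard deviations taken from the example output above.
ratio, err = speedup(1392.4, 9.8, 947.1, 5.6)
print(f"{ratio:.2f} ± {err:.2f}×")  # prints: 1.47 ± 0.01×
```

Under that assumption the numbers match the Summary line in the example output.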
from bench import Time, bench, run, suite
s = (
    suite("fib",
        bench("py3.9").with_command(["python3.9", "fib.py"]),
        bench("py3.14").with_command(["python3.14", "fib.py"]),
    )
    .with_process_metric(Time())  # measure wall-clock elapsed
    .with_runs(5)                 # 5 measured runs each
)

if __name__ == "__main__":
    run(s)

python compare.py --json out.json

Every CLI flag (--jobs, --json, --csv, --dir, --dry, --verbose,
--include/--exclude, --list, --show, ...) also works on a run(...)
script, and CLI flags override what the script set in code. See
examples/ for one runnable script per capability.
A run is a pipeline: you build suites of benchmarks, the framework resolves them into concrete variants, executes each one collecting measurements, and produces a report.
The building blocks:
- Suite - a named collection of benchmarks plus the defaults they inherit.
- Benchmark - command + cwd + env + metrics + stopping policy + variants.
- Run - one process execution: identity + outcome + its iterations (plus any whole-process samples).
- Iteration - one measured iteration: a bag of samples (or a failure).
- Sample - one metric value (name, value, unit).
- Report - the list of runs, JSON round-trippable.
A failed run still produces a Run (with failure set and no samples), so a
crashing benchmark is listed under a Failures: block with its reason - and a
variant that only sometimes fails shows a (failures|runs) count next to its
stats - instead of poisoning the stats with fake numbers.
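To make the failure handling concrete, here is a minimal sketch of the idea using stand-in dataclasses. These are hypothetical and simplified (iterations are omitted); they are not bench's actual classes, and the field names are assumptions chosen to mirror the descriptions above.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Optional


@dataclass
class Sample:
    name: str
    value: float
    unit: str


@dataclass
class Run:
    variant: str
    samples: list = field(default_factory=list)
    failure: Optional[str] = None  # reason text when the process failed


def summarize(runs, metric="elapsed"):
    """Compute stats from successful runs only; count failures separately."""
    ok = [r for r in runs if r.failure is None]
    failed = [r for r in runs if r.failure is not None]
    values = [s.value for r in ok for s in r.samples if s.name == metric]
    return {
        "mean": mean(values) if values else None,
        "successful": len(ok),
        "failures": len(failed),
    }


runs = [
    Run("py3.14", [Sample("elapsed", 940.0, "ms")]),
    Run("py3.14", [Sample("elapsed", 950.0, "ms")]),
    Run("py3.14", failure="exit code 1"),  # failed: no samples, reason recorded
]
summary = summarize(runs)
print(summary)
```

The failed run is still present in the list, so it can be reported, but it contributes no value to the mean.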
The pieces you plug in:
- Metric - turns a run into samples. Two kinds: an IterationMetric parses
  each iteration's text (Regex, FloatPerLine, Rebench), a ProcessMetric reads
  the finished process (Time, RUsage, max_rss(), and PerfStat for perf stat
  hardware counters).
- StoppingPolicy - decides when enough runs have been collected:
  FixedRuns(n), MaxDuration(seconds), CoefficientOfVariation(...), combined
  with & / | / .at_least(n) / .at_most(n).
- Runner - Sequential (default), Parallel(workers) (per-benchmark
  concurrency, only sound when wall-clock time isn't the metric), or Dry
  (print what would run, no subprocess).
- OutlierDetection - flags anomalous samples (kept in the report, not
  dropped): ModifiedZScore, or NoDetection to switch it off.
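As an illustration of the outlier-flagging idea, the sketch below implements the textbook modified z-score (median and median absolute deviation, with the conventional 0.6745 constant and 3.5 cutoff). Whether bench's ModifiedZScore uses the same constants or the same zero-MAD handling is an assumption; the function names here are hypothetical.

```python
from statistics import median


def modified_z_scores(values):
    """Textbook modified z-score: 0.6745 * (x - median) / MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Simplification for this sketch: with zero spread, flag nothing.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


def flag_outliers(values, threshold=3.5):
    """Return one boolean per value; values are flagged, never dropped."""
    return [abs(z) > threshold for z in modified_z_scores(values)]


# Illustrative timings in ms; the last one is an obvious straggler.
samples = [947.1, 940.2, 954.9, 945.0, 1210.0]
flags = flag_outliers(samples)
print(list(zip(samples, flags)))
```

Because the score is built on the median rather than the mean, a single slow run does not inflate the spread estimate and hide itself.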
Matrix benchmarks (.with_matrix(**dims)) expand into the cross-product of
dimension values, each cell a variant. Harness benchmarks
(.with_harness()) run the command once and let the process iterate
internally (e.g. for JIT warmup), each reported iteration becoming an
Iteration of that single run.
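The cross-product expansion can be pictured with a few lines of standard-library Python. This is only a sketch of the concept: `expand_matrix` is a hypothetical helper, the dimension names are made up, and how bench names or orders its variants is not specified here.

```python
from itertools import product


def expand_matrix(**dims):
    """Expand named dimensions into one dict per cell of the cross-product."""
    names = list(dims)
    return [
        dict(zip(names, combo))
        for combo in product(*(dims[name] for name in names))
    ]


# Two dimensions with two values each give four variants.
variants = expand_matrix(python=["3.9", "3.14"], n=[30, 35])
for variant in variants:
    print(variant)
```

Adding a third dimension with three values would multiply the count again, to twelve variants, which is worth keeping in mind when choosing run counts.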
Outputs: live progress to the terminal, plus optional --json, --csv, and
--dir (per-execution tree). bench compare a.json b.json ... diffs saved
reports after the fact (the first file is the baseline), and bench show report.json re-renders a single saved report.
Environment & noise: bench doctor snapshots the machine and flags noise
sources (CPU governor, turbo, ASLR, ...), exiting non-zero on a high-severity
issue so it can gate a session. On Linux as root, bench denoise minimize|restore|status quiets those knobs and reverts them. Per run,
--check-environment records the snapshot and runs the checks (off by default)
and --denoise minimizes the knobs for the run and restores them after.
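Because bench doctor exits non-zero on a high-severity issue, a wrapper script can refuse to start a session on a noisy machine. The sketch below shows one way to do that from Python; `gated` is a hypothetical helper, and it relies only on the exit status described above.

```python
import subprocess


def gated(check_cmd, bench_cmd):
    """Run bench_cmd only if check_cmd exits with status 0.

    Returns the exit status of whichever command ran last.
    """
    check = subprocess.run(check_cmd)
    if check.returncode != 0:
        return check.returncode  # environment check failed: skip the benchmark
    return subprocess.run(bench_cmd).returncode


# Example call (not executed here, since it needs bench installed):
# gated(["bench", "doctor"],
#       ["bench", "run", "--runs", "20", "sleep 0.1", "sleep 0.2"])
```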
uv run pytest
uv run bench run --runs 20 'sleep 0.1' 'sleep 0.2'

Much of bench has been inspired by these great tools: