
Add benchmark pipeline with Rust-native A/B validation #912

Merged
YuanyuanTian-hh merged 42 commits into main from user/tianyuanyuan/benchmark-regression
Apr 17, 2026


@YuanyuanTian-hh YuanyuanTian-hh commented Apr 7, 2026

Add benchmark regression pipeline with Rust-native A/B validation

Summary

Adds an automated benchmark regression pipeline to GitHub Actions. This PR builds on PR #900 (benchmark A/B test framework) by implementing the Regression trait for disk-index benchmarks and wiring it into CI workflows. The pipeline builds and searches two public 100K ANN datasets, compares before/after performance, and validates the results against configurable tolerances, all in typed Rust code with no Python dependencies.

What's Changed

Benchmark Runner (from PR #900)

  • Rust-native A/B test framework in diskann-benchmark-runner
  • Regression trait, tolerance matching, check run CLI subcommand

Disk-Index Regression Support (this PR)

| File | Description |
| --- | --- |
| `diskann-benchmark/src/backend/disk_index/benchmarks.rs` | Implements `Regression` trait for `DiskIndex<T>` with typed before/after comparison of 7 metrics |
| `diskann-benchmark/src/backend/disk_index/search.rs` | Added `Deserialize` to `DiskSearchStats`, `DiskSearchResult` |
| `diskann-benchmark/src/backend/disk_index/build.rs` | Added `Deserialize` to `DiskBuildStats`, exposed `build_time_seconds()` |
| `diskann-benchmark/perf_test_inputs/disk-index-tolerances.json` | Tolerance config: 10% build/QPS, 1% recall/IOs/comps, 15% latency |
| `diskann-benchmark/perf_test_inputs/wikipedia-100K-disk-index.json` | Benchmark config: 768-dim, inner_product, search_list=[200] |
| `diskann-benchmark/perf_test_inputs/openai-100K-disk-index.json` | Benchmark config: 1536-dim, squared_l2, SQ_1_2.0, search_list=[200] |
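
The tolerance config listed above can be pictured as a typed struct. The following is a minimal sketch only: a later commit in this PR names a `DiskIndexTolerance` type with thresholds for seven metrics, but the field names, the `pr_defaults` constructor, and the `as_rows` helper below are hypothetical, and the actual JSON schema is not shown in this description.

```rust
/// Hypothetical mirror of the thresholds in disk-index-tolerances.json.
/// Each value is a relative tolerance (0.10 means 10%).
#[derive(Debug, Clone, PartialEq)]
struct DiskIndexTolerance {
    build_time: f64,
    qps: f64,
    recall: f64,
    mean_ios: f64,
    mean_comparisons: f64,
    mean_latency: f64,
    p95_latency: f64,
}

impl DiskIndexTolerance {
    /// Values quoted in the PR description: 10% build/QPS,
    /// 1% recall/IOs/comparisons, 15% latency.
    fn pr_defaults() -> Self {
        Self {
            build_time: 0.10,
            qps: 0.10,
            recall: 0.01,
            mean_ios: 0.01,
            mean_comparisons: 0.01,
            mean_latency: 0.15,
            p95_latency: 0.15,
        }
    }

    /// Flatten into (metric name, tolerance) rows for reporting.
    fn as_rows(&self) -> [(&'static str, f64); 7] {
        [
            ("build_time", self.build_time),
            ("qps", self.qps),
            ("recall", self.recall),
            ("mean_ios", self.mean_ios),
            ("mean_comparisons", self.mean_comparisons),
            ("mean_latency", self.mean_latency),
            ("p95_latency", self.p95_latency),
        ]
    }
}

fn main() {
    // Print each threshold as a percentage.
    for (name, tol) in DiskIndexTolerance::pr_defaults().as_rows() {
        println!("{name:<18} {:>5.1}%", tol * 100.0);
    }
}
```

Keeping the thresholds in named, typed fields means a misspelled or missing metric is caught when the config is loaded rather than silently skipped during comparison.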

CI Workflows

| File | Description |
| --- | --- |
| `.github/workflows/benchmarks.yml` | PR regression workflow: triggers on pull_request to main, validates with `cargo ... check run` |
| `.github/workflows/benchmarks-aa.yml` | Daily A/A stability test (main vs main) with GitHub issue on failure |

How It Works

  1. Checkout both current branch and baseline (defaults to main)
  2. Download datasets from big-ann-benchmarks v0.4.0
  3. Build & search disk index on both branches
  4. Validate with cargo run -p diskann-benchmark --features disk-index --release -- check run --tolerances ... --before ... --after ...
  5. Upload JSON artifacts for 30-day retention

Regression Checks

The Regression trait implementation compares 7 metrics between before and after runs:

| Metric | Direction | Tolerance | Rationale |
| --- | --- | --- | --- |
| build_time | Lower is better | 10% | CPU-bound, moderate noise |
| QPS | Higher is better | 10% | CPU-bound, moderate noise |
| recall | Higher is better | 1% | Algorithmic, near-deterministic |
| mean_ios | Lower is better | 1% | Algorithmic, near-deterministic |
| mean_comparisons | Lower is better | 1% | Algorithmic, near-deterministic |
| mean_latency | Lower is better | 15% | Timing, noisy on shared runners |
| p95_latency | Lower is better | 15% | Timing, noisy on shared runners |

All checks use relative_change(), which handles a zero baseline by returning an error rather than reporting a 0% change.
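
A minimal sketch of how such a check could look, assuming a hypothetical `relative_change(before, after)` signature and a hypothetical `check_metric` helper with a `Direction` enum. Those two names appear in a later commit message of this PR, but their real signatures are not shown here, so everything below is illustrative. The point is that a zero baseline surfaces as an error instead of a silent 0% change, and that the direction is a type rather than a string.

```rust
use std::fmt;

/// Which direction counts as an improvement for a metric.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Direction {
    LowerIsBetter,
    HigherIsBetter,
}

/// Hypothetical error type for an unusable baseline.
#[derive(Debug, PartialEq)]
enum CheckError {
    NonPositiveBaseline(f64),
}

impl fmt::Display for CheckError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CheckError::NonPositiveBaseline(v) => write!(f, "before must be > 0 (got {v})"),
        }
    }
}

/// Relative change from `before` to `after` (0.10 means +10%).
/// A zero, negative, or NaN baseline is an error, never 0%.
fn relative_change(before: f64, after: f64) -> Result<f64, CheckError> {
    if before.is_nan() || before <= 0.0 {
        return Err(CheckError::NonPositiveBaseline(before));
    }
    Ok((after - before) / before)
}

/// Returns Ok(true) when the metric stays within `tolerance`
/// in the direction that counts as a regression.
fn check_metric(
    before: f64,
    after: f64,
    tolerance: f64,
    direction: Direction,
) -> Result<bool, CheckError> {
    let change = relative_change(before, after)?;
    // Normalize so that a positive value always means "got worse".
    let regression = match direction {
        Direction::LowerIsBetter => change,
        Direction::HigherIsBetter => -change,
    };
    Ok(regression <= tolerance)
}

fn main() {
    // build_time: 100 s -> 105 s is +5%, inside a 10% tolerance.
    println!("{:?}", check_metric(100.0, 105.0, 0.10, Direction::LowerIsBetter));
    // QPS: 1000 -> 850 is a 15% drop, outside a 10% tolerance.
    println!("{:?}", check_metric(1000.0, 850.0, 0.10, Direction::HigherIsBetter));
    // A zero baseline is reported as an error.
    if let Err(e) = relative_change(0.0, 5.0) {
        println!("error: {e}");
    }
}
```

Because `Direction` is an enum, an unknown direction cannot reach the comparison at all, which is the property the description contrasts with string-based direction values.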

Datasets

| Dataset | Dimensions | Distance | Vectors | Queries | search_list |
| --- | --- | --- | --- | --- | --- |
| Wikipedia-100K | 768 | inner_product | 100K | 5,000 | 200 |
| OpenAI ArXiv-100K | 1,536 | squared_l2 | 100K | 20,000 | 200 |

Improvements over PR #857

This PR addresses all blocking review comments from PR #857:

| Concern | PR #857 (Python) | PR #912 (Rust) |
| --- | --- | --- |
| Phantom thresholds / silent success | Orphaned categories silently skipped | Runner errors if tolerances don't match inputs |
| Bad direction values pass silently | Falls through to else branch | Type-safe, no string directions |
| Division by zero masked as 0% | Returns 0 when baseline=0 | Returns error ("before must be > 0") |
| Missing fields default to 0 | `.get(field, 0)` | Rust deserialization fails on missing fields |
| Manual trigger only | workflow_dispatch only | pull_request trigger with path filters |
| CI time ~70 min | search_list=2000 | search_list=200, ~10 min per job |
| Hardcoded Rust version | `rust_stable: "1.92"` | toolchain: stable, reads rust-toolchain.toml |
| Python dependencies | benchmark_validate.py | Zero Python, pure Rust |
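
The first concern, where the runner refuses to proceed when tolerances and inputs disagree, can be illustrated with a small sketch. This is a simplification under stated assumptions: the real runner matches tolerance entries to inputs by tag and also reports ambiguous matches (per the test fixtures listed in the review overview), whereas the hypothetical `match_tolerances` function and `MatchError` type below only compare plain string tags.

```rust
use std::collections::BTreeSet;

/// Hypothetical error type: both mismatch directions are hard errors.
#[derive(Debug, PartialEq)]
enum MatchError {
    /// Tolerance entries that no input refers to.
    Orphaned(Vec<String>),
    /// Inputs that no tolerance entry covers.
    Uncovered(Vec<String>),
}

/// Succeeds only if every tolerance tag has an input and every
/// input tag has a tolerance. Nothing is silently skipped.
fn match_tolerances(input_tags: &[&str], tolerance_tags: &[&str]) -> Result<(), MatchError> {
    let inputs: BTreeSet<&str> = input_tags.iter().copied().collect();
    let tolerances: BTreeSet<&str> = tolerance_tags.iter().copied().collect();

    let orphaned: Vec<String> = tolerances
        .difference(&inputs)
        .map(|t| t.to_string())
        .collect();
    if !orphaned.is_empty() {
        return Err(MatchError::Orphaned(orphaned));
    }

    let uncovered: Vec<String> = inputs
        .difference(&tolerances)
        .map(|t| t.to_string())
        .collect();
    if !uncovered.is_empty() {
        return Err(MatchError::Uncovered(uncovered));
    }

    Ok(())
}

fn main() {
    // Matching sets pass.
    println!("{:?}", match_tolerances(&["disk-index"], &["disk-index"]));
    // An extra tolerance entry is an error, not a skipped category.
    println!("{:?}", match_tolerances(&["disk-index"], &["disk-index", "simd"]));
    // An input without a tolerance is also an error.
    println!("{:?}", match_tolerances(&["disk-index", "simd"], &["disk-index"]));
}
```

Failing on both orphaned and uncovered entries is what prevents a "phantom threshold" from producing a green check without any comparison having run.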

@YuanyuanTian-hh YuanyuanTian-hh changed the title User/tianyuanyuan/benchmark regression Add benchmark regression pipeline with Rust-native A/B validation Apr 7, 2026
@YuanyuanTian-hh YuanyuanTian-hh marked this pull request as ready for review April 7, 2026 08:31
@YuanyuanTian-hh YuanyuanTian-hh requested review from a team and Copilot April 7, 2026 08:31
@YuanyuanTian-hh YuanyuanTian-hh changed the title Add benchmark regression pipeline with Rust-native A/B validation Add benchmark pipeline with Rust-native A/B validation Apr 7, 2026

Copilot AI left a comment

Pull request overview

Adds a Rust-native benchmark regression pipeline that compares “before vs after” benchmark outputs in CI using typed deserialization + a Regression trait, removing the need for external Python validation.

Changes:

  • Implement regression checking for disk-index benchmarks (tolerance type + metric comparisons) and make disk-index build/search stats deserializable for A/B comparison.
  • Extend diskann-benchmark-runner with check CLI subcommands and internal tolerance matching/dispatch + numerous regression/UX tests.
  • Add GitHub Actions workflows and benchmark/tolerance JSON inputs to run and validate regressions on PRs and via daily A/A runs.

Reviewed changes

Copilot reviewed 111 out of 151 changed files in this pull request and generated 7 comments.

File Description
diskann-benchmark/src/backend/disk_index/search.rs Adds Deserialize to disk search stats/results for typed A/B validation.
diskann-benchmark/src/backend/disk_index/build.rs Adds Deserialize and exposes build time in seconds for regression checks.
diskann-benchmark/src/backend/disk_index/benchmarks.rs Registers disk-index benchmarks as regression-capable and implements metric-based regression checking + tolerance input type.
diskann-benchmark/perf_test_inputs/wikipedia-100K-disk-index.json Adds Wikipedia-100K disk-index benchmark configuration for CI runs.
diskann-benchmark/perf_test_inputs/openai-100K-disk-index.json Adds OpenAI ArXiv-100K disk-index benchmark configuration for CI runs.
diskann-benchmark/perf_test_inputs/disk-index-tolerances.json Adds default tolerance thresholds for disk-index regression validation.
diskann-benchmark-simd/src/lib.rs Wires SIMD benchmarks into regression framework + adds tolerance type and regression check implementation + tests.
diskann-benchmark-simd/src/bin.rs Updates SIMD binary tests to include check verify invocation.
diskann-benchmark-simd/examples/tolerance.json Adds example tolerance file for SIMD regression checks.
diskann-benchmark-runner/tests/regression/check-verify-4/tolerances.json Adds regression UX fixture for incompatible input/tolerance tag error.
diskann-benchmark-runner/tests/regression/check-verify-4/stdout.txt Expected output for check-verify-4.
diskann-benchmark-runner/tests/regression/check-verify-4/stdin.txt Command script for check-verify-4.
diskann-benchmark-runner/tests/regression/check-verify-4/README.md Describes the check-verify-4 scenario.
diskann-benchmark-runner/tests/regression/check-verify-4/input.json Input fixture for check-verify-4.
diskann-benchmark-runner/tests/regression/check-verify-3/tolerances.json Adds regression UX fixture for ambiguous/uncovered/orphaned tolerance matching.
diskann-benchmark-runner/tests/regression/check-verify-3/stdout.txt Expected output for check-verify-3.
diskann-benchmark-runner/tests/regression/check-verify-3/stdin.txt Command script for check-verify-3.
diskann-benchmark-runner/tests/regression/check-verify-3/input.json Input fixture for check-verify-3.
diskann-benchmark-runner/tests/regression/check-verify-2/tolerances.json Adds regression UX fixture for “no matching benchmark” during verify.
diskann-benchmark-runner/tests/regression/check-verify-2/stdout.txt Expected output for check-verify-2.
diskann-benchmark-runner/tests/regression/check-verify-2/stdin.txt Command script for check-verify-2.
diskann-benchmark-runner/tests/regression/check-verify-2/README.md Describes the check-verify-2 scenario.
diskann-benchmark-runner/tests/regression/check-verify-2/input.json Input fixture for check-verify-2.
diskann-benchmark-runner/tests/regression/check-verify-1/tolerances.json Adds regression UX fixture for unrecognized tolerance tag.
diskann-benchmark-runner/tests/regression/check-verify-1/stdout.txt Expected output for check-verify-1.
diskann-benchmark-runner/tests/regression/check-verify-1/stdin.txt Command script for check-verify-1.
diskann-benchmark-runner/tests/regression/check-verify-1/README.md Describes the check-verify-1 scenario.
diskann-benchmark-runner/tests/regression/check-verify-1/input.json Input fixture for check-verify-1.
diskann-benchmark-runner/tests/regression/check-verify-0/tolerances.json Adds regression UX fixture for successful verify (no stdout).
diskann-benchmark-runner/tests/regression/check-verify-0/stdin.txt Command script for check-verify-0.
diskann-benchmark-runner/tests/regression/check-verify-0/README.md Describes the check-verify-0 scenario.
diskann-benchmark-runner/tests/regression/check-verify-0/input.json Input fixture for check-verify-0.
diskann-benchmark-runner/tests/regression/check-tolerances-2/stdout.txt Expected output for requesting a nonexistent tolerance kind.
diskann-benchmark-runner/tests/regression/check-tolerances-2/stdin.txt Command script for check-tolerances-2.
diskann-benchmark-runner/tests/regression/check-tolerances-2/README.md Describes the check-tolerances-2 scenario.
diskann-benchmark-runner/tests/regression/check-tolerances-1/stdout.txt Expected output for describing a specific tolerance kind.
diskann-benchmark-runner/tests/regression/check-tolerances-1/stdin.txt Command script for check-tolerances-1.
diskann-benchmark-runner/tests/regression/check-tolerances-1/README.md Describes the check-tolerances-1 scenario.
diskann-benchmark-runner/tests/regression/check-tolerances-0/stdout.txt Expected output for listing all tolerance kinds.
diskann-benchmark-runner/tests/regression/check-tolerances-0/stdin.txt Command script for check-tolerances-0.
diskann-benchmark-runner/tests/regression/check-tolerances-0/README.md Describes the check-tolerances-0 scenario.
diskann-benchmark-runner/tests/regression/check-skeleton-0/stdout.txt Expected output for tolerance skeleton printing.
diskann-benchmark-runner/tests/regression/check-skeleton-0/stdin.txt Command script for check-skeleton-0.
diskann-benchmark-runner/tests/regression/check-skeleton-0/README.md Describes the check-skeleton-0 scenario.
diskann-benchmark-runner/tests/regression/check-run-pass-0/tolerances.json Adds regression UX fixture for successful check run execution.
diskann-benchmark-runner/tests/regression/check-run-pass-0/stdout.txt Expected output for successful check run.
diskann-benchmark-runner/tests/regression/check-run-pass-0/stdin.txt Command script for pass-case check run.
diskann-benchmark-runner/tests/regression/check-run-pass-0/README.md Describes the check-run-pass-0 scenario.
diskann-benchmark-runner/tests/regression/check-run-pass-0/output.json Output fixture used as both before/after.
diskann-benchmark-runner/tests/regression/check-run-pass-0/input.json Input fixture for pass-case run.
diskann-benchmark-runner/tests/regression/check-run-pass-0/checks.json Expected JSON output from pass-case checks.
diskann-benchmark-runner/tests/regression/check-run-fail-0/tolerances.json Adds regression UX fixture for a failing check run result.
diskann-benchmark-runner/tests/regression/check-run-fail-0/stdout.txt Expected output for failing check run.
diskann-benchmark-runner/tests/regression/check-run-fail-0/stdin.txt Command script for fail-case check run.
diskann-benchmark-runner/tests/regression/check-run-fail-0/README.md Describes the check-run-fail-0 scenario.
diskann-benchmark-runner/tests/regression/check-run-fail-0/output.json Output fixture for fail-case run.
diskann-benchmark-runner/tests/regression/check-run-fail-0/input.json Input fixture for fail-case run.
diskann-benchmark-runner/tests/regression/check-run-fail-0/checks.json Expected JSON output from fail-case checks.
diskann-benchmark-runner/tests/regression/check-run-error-3/tolerances.json Adds regression UX fixture for before/after schema drift error reporting.
diskann-benchmark-runner/tests/regression/check-run-error-3/stdout.txt Expected output for schema drift error.
diskann-benchmark-runner/tests/regression/check-run-error-3/stdin.txt Command script for schema drift error.
diskann-benchmark-runner/tests/regression/check-run-error-3/regression_input.json Regression input fixture to force schema mismatch.
diskann-benchmark-runner/tests/regression/check-run-error-3/README.md Describes the check-run-error-3 scenario.
diskann-benchmark-runner/tests/regression/check-run-error-3/output.json Output fixture used to trigger schema mismatch.
diskann-benchmark-runner/tests/regression/check-run-error-3/input.json Input fixture used to generate output.json.
diskann-benchmark-runner/tests/regression/check-run-error-3/checks.json Expected JSON output from error-case checks.
diskann-benchmark-runner/tests/regression/check-run-error-2/tolerances.json Adds regression UX fixture for “input drift” dispatch failure in check run.
diskann-benchmark-runner/tests/regression/check-run-error-2/stdout.txt Expected output for input drift dispatch failure.
diskann-benchmark-runner/tests/regression/check-run-error-2/stdin.txt Command script for input drift dispatch failure.
diskann-benchmark-runner/tests/regression/check-run-error-2/regression_input.json Regression input fixture that drifts to unsupported type.
diskann-benchmark-runner/tests/regression/check-run-error-2/README.md Describes the check-run-error-2 scenario.
diskann-benchmark-runner/tests/regression/check-run-error-2/output.json Output fixture for error-case run.
diskann-benchmark-runner/tests/regression/check-run-error-2/input.json Input fixture for error-case run.
diskann-benchmark-runner/tests/regression/check-run-error-1/tolerances.json Adds regression UX fixture for before/after length mismatch.
diskann-benchmark-runner/tests/regression/check-run-error-1/stdout.txt Expected output for length mismatch.
diskann-benchmark-runner/tests/regression/check-run-error-1/stdin.txt Command script for length mismatch.
diskann-benchmark-runner/tests/regression/check-run-error-1/regression_input.json Regression input fixture with different job count.
diskann-benchmark-runner/tests/regression/check-run-error-1/README.md Describes the check-run-error-1 scenario.
diskann-benchmark-runner/tests/regression/check-run-error-1/output.json Output fixture with mismatched job count.
diskann-benchmark-runner/tests/regression/check-run-error-1/input.json Input fixture used to generate output.json.
diskann-benchmark-runner/tests/regression/check-run-error-0/tolerances.json Adds regression UX fixture for infrastructure error propagation.
diskann-benchmark-runner/tests/regression/check-run-error-0/stdout.txt Expected output for infrastructure errors.
diskann-benchmark-runner/tests/regression/check-run-error-0/stdin.txt Command script for infrastructure errors.
diskann-benchmark-runner/tests/regression/check-run-error-0/README.md Describes the check-run-error-0 scenario.
diskann-benchmark-runner/tests/regression/check-run-error-0/output.json Output fixture for infrastructure errors.
diskann-benchmark-runner/tests/regression/check-run-error-0/input.json Input fixture for infrastructure errors.
diskann-benchmark-runner/tests/regression/check-run-error-0/checks.json Expected JSON output from error-case checks.
diskann-benchmark-runner/tests/benchmark/test-success-1/stdout.txt Adds expected output for run --dry-run success.
diskann-benchmark-runner/tests/benchmark/test-success-1/stdin.txt Adds command script for run --dry-run.
diskann-benchmark-runner/tests/benchmark/test-success-1/README.md Describes the dry-run behavior expectation.
diskann-benchmark-runner/tests/benchmark/test-success-1/input.json Input fixture for dry-run test.
diskann-benchmark-runner/tests/benchmark/test-success-0/stdout.txt Updates expected stdout for successful run output text changes.
diskann-benchmark-runner/tests/benchmark/test-success-0/stdin.txt Adds command script for benchmark success test.
diskann-benchmark-runner/tests/benchmark/test-success-0/README.md Describes benchmark success test.
diskann-benchmark-runner/tests/benchmark/test-success-0/output.json Adds expected output.json for benchmark success test.
diskann-benchmark-runner/tests/benchmark/test-success-0/input.json Adds input fixture for benchmark success test.
diskann-benchmark-runner/tests/benchmark/test-overload-0/stdout.txt Adds expected output for overload/dispatch scoring test.
diskann-benchmark-runner/tests/benchmark/test-overload-0/stdin.txt Adds command script for overload test.
diskann-benchmark-runner/tests/benchmark/test-overload-0/README.md Describes overload/dispatch selection behavior.
diskann-benchmark-runner/tests/benchmark/test-overload-0/output.json Adds expected output.json for overload test.
diskann-benchmark-runner/tests/benchmark/test-overload-0/input.json Adds input fixture for overload test.
diskann-benchmark-runner/tests/benchmark/test-mismatch-1/stdout.txt Adds expected diagnostics for mismatch description paths.
diskann-benchmark-runner/tests/benchmark/test-mismatch-1/stdin.txt Adds command script for mismatch test.
diskann-benchmark-runner/tests/benchmark/test-mismatch-1/README.md Describes mismatch diagnostics scenario.
diskann-benchmark-runner/tests/benchmark/test-mismatch-1/input.json Adds input fixture for mismatch test.
diskann-benchmark-runner/tests/benchmark/test-mismatch-0/stdout.txt Adds expected diagnostics for “closest matches” reporting.
diskann-benchmark-runner/tests/benchmark/test-mismatch-0/stdin.txt Adds command script for mismatch test.
diskann-benchmark-runner/tests/benchmark/test-mismatch-0/README.md Describes mismatch “closest matches” behavior.
diskann-benchmark-runner/tests/benchmark/test-mismatch-0/input.json Adds input fixture for mismatch test.
diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/stdout.txt Adds expected output for input deserialization error reporting.
diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/stdin.txt Adds command script for deserialization error test.
diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/README.md Describes deserialization error behavior.
diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/input.json Adds input fixture with invalid enum value.
diskann-benchmark-runner/tests/benchmark/test-4/stdout.txt Updates benchmark listing output to include new simple bench.
diskann-benchmark-runner/tests/benchmark/test-4/stdin.txt Adds command script for benchmark listing.
diskann-benchmark-runner/tests/benchmark/test-4/README.md Describes benchmark listing test.
diskann-benchmark-runner/tests/benchmark/test-3/stdout.txt Adds expected output for describing a specific input kind.
diskann-benchmark-runner/tests/benchmark/test-3/stdin.txt Adds command script for input describe test.
diskann-benchmark-runner/tests/benchmark/test-3/README.md Describes input describe test.
diskann-benchmark-runner/tests/benchmark/test-2/stdout.txt Adds expected output for describing a specific input kind.
diskann-benchmark-runner/tests/benchmark/test-2/stdin.txt Adds command script for input describe test.
diskann-benchmark-runner/tests/benchmark/test-2/README.md Describes input describe test.
diskann-benchmark-runner/tests/benchmark/test-1/stdout.txt Adds expected output for listing available input kinds.
diskann-benchmark-runner/tests/benchmark/test-1/stdin.txt Adds command script for input listing.
diskann-benchmark-runner/tests/benchmark/test-1/README.md Describes input listing test.
diskann-benchmark-runner/tests/benchmark/test-0/stdout.txt Adds expected output for skeleton input printing.
diskann-benchmark-runner/tests/benchmark/test-0/stdin.txt Adds command script for skeleton test.
diskann-benchmark-runner/tests/benchmark/test-0/README.md Describes skeleton test.
diskann-benchmark-runner/src/ux.rs Adds scrub_path helper and improves backtrace stripping logic for deterministic test output.
diskann-benchmark-runner/src/utils/percentiles.rs Adds minimum percentile field and marks Percentiles non-exhaustive.
diskann-benchmark-runner/src/utils/num.rs Adds constrained numeric deserialization + relative_change helper for regression checks.
diskann-benchmark-runner/src/utils/mod.rs Exposes new num utilities module.
diskann-benchmark-runner/src/utils/fmt.rs Adds clippy expectation annotation for bounds-checked panic.
diskann-benchmark-runner/src/test/typed.rs Refactors test benches and adds regression-capable typed benchmark checks.
diskann-benchmark-runner/src/test/mod.rs Centralizes registration of test inputs/benchmarks including regression variants.
diskann-benchmark-runner/src/test/dim.rs Adds dimensional test benchmarks including a non-regression “simple bench”.
diskann-benchmark-runner/src/result.rs Adds RawResult loader for reuse in regression checking pipeline.
diskann-benchmark-runner/src/registry.rs Extends registry with regression benchmark registration + tolerance discovery.
diskann-benchmark-runner/src/lib.rs Exposes benchmark module publicly and adds internal module plumbing.
diskann-benchmark-runner/src/jobs.rs Refactors job loading/parsing and improves error messages + exposes raw job accessors.
diskann-benchmark-runner/src/internal/regression.rs Implements tolerance parsing, subset matching, regression job assembly, and execution reporting.
diskann-benchmark-runner/src/internal/mod.rs Adds shared load_from_disk helper and internal module structure.
diskann-benchmark-runner/src/input.rs Adds const INSTANCE for Input wrapper to support regression tolerance typing.
diskann-benchmark-runner/src/checker.rs Adds clippy expectation annotation for internal tag invariants.
diskann-benchmark-runner/src/benchmark.rs Introduces Regression trait + internal object-safe regression plumbing for the runner.
diskann-benchmark-runner/src/app.rs Adds check subcommands (skeleton/tolerances/verify/run) and upgrades UX test harness.
diskann-benchmark-runner/Cargo.toml Adjusts clippy lint configuration for unwrap/expect/panic, etc.
diskann-benchmark-runner/.clippy.toml Allows unwrap/expect/panic in tests for this crate.
.github/workflows/benchmarks.yml Adds PR-triggered and manual benchmark regression workflow for two datasets with Rust-native validation.
.github/workflows/benchmarks-aa.yml Adds daily scheduled A/A stability workflow and issue creation on failure.

Comment thread diskann-benchmark/src/backend/disk_index/benchmarks.rs
Comment thread diskann-benchmark/src/backend/disk_index/benchmarks.rs Outdated
Comment thread diskann-benchmark-runner/src/utils/percentiles.rs
Comment thread diskann-benchmark-simd/src/bin.rs
Comment thread .github/workflows/disk-benchmarks.yml
Comment thread .github/workflows/disk-benchmarks.yml
Comment thread .github/workflows/disk-benchmarks.yml Outdated
codecov-commenter commented Apr 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.32%. Comparing base (54577b2) to head (a553fff).

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #912   +/-   ##
=======================================
  Coverage   89.32%   89.32%           
=======================================
  Files         448      448           
  Lines       83563    83563           
=======================================
+ Hits        74643    74645    +2     
+ Misses       8920     8918    -2     
Flag Coverage Δ
miri 89.32% <ø> (+<0.01%) ⬆️
unittests 89.16% <ø> (+<0.01%) ⬆️

see 1 file with indirect coverage changes


Comment thread .github/workflows/disk-benchmarks.yml
@YuanyuanTian-hh YuanyuanTian-hh requested a review from arrayka April 8, 2026 02:49
Comment thread .github/workflows/disk-benchmarks-aa.yml
Comment thread diskann-benchmark/src/backend/disk_index/benchmarks.rs
Comment thread diskann-benchmark/src/backend/disk_index/benchmarks.rs
Comment thread diskann-benchmark/perf_test_inputs/openai-100K-disk-index.json
Comment thread diskann-benchmark/perf_test_inputs/openai-100K-disk-index.json Outdated
Comment thread diskann-benchmark/perf_test_inputs/wikipedia-100K-disk-index.json Outdated
Comment thread diskann-benchmark/perf_test_inputs/wikipedia-100K-disk-index.json Outdated
Yuanyuan Tian (from Dev Box) added 18 commits April 9, 2026 10:45
- Add benchmarks.yml workflow using workflow_dispatch, comparing current
  branch against a configurable baseline ref
- Add compare_disk_index_json_output.py to diff benchmark crate JSON outputs
  into a CSV suitable for benchmark_result_parse.py
- Add benchmark_result_parse.py for validating results and posting PR comments
- Add wikipedia-100K-disk-index.json benchmark config using the public
  Wikipedia-100K dataset from big-ann-benchmarks (100K Cohere embeddings,
  768-dim, cosine distance) to replace internal ADO datasets
…or ADO mimir-enron, not applicable to public datasets on GitHub runners. Threshold calibration tracked in PBI.
Yuanyuan Tian (from Dev Box) added 6 commits April 9, 2026 10:45
- Add Deserialize to DiskIndexStats, DiskSearchStats, DiskSearchResult, DiskBuildStats
- Implement Regression trait for DiskIndex<T> with typed before/after comparison
- Add DiskIndexTolerance type with configurable thresholds for 7 metrics
- Create disk-index-tolerances.json (10% build/QPS, 1% recall/IOs/comps, 15% latency)
- Switch registration from register() to register_regression()
- Replace Python benchmark_validate.py with Rust-native check run in both workflows
- Delete benchmark_validate.py (no longer needed)
@YuanyuanTian-hh YuanyuanTian-hh force-pushed the user/tianyuanyuan/benchmark-regression branch from 257aeda to f2c14e8 Compare April 9, 2026 02:47

@hildebrandmw hildebrandmw left a comment

Thanks! This is headed in a good direction. In addition to the individual comments below, I have a couple small requests regarding the CI .yml files.

  • Can these be named more accurately as disk-benchmarks[-aa].yml? There will presumably be other benchmarks and we might as well make the distinction early.
  • There is considerable duplication in the .yml structure as well. We should strive to keep our CI system as tidy as we can from the get go. Is it possible to factor out some reusable components (like dataset downloading) so other pipelines can be brought up more easily and if we need to change URLs, there is one place to do so.
  • On tidiness, the file paths for finding datasets are pretty verbose and repetitive. Job-level environment variables could go a long way towards cutting this down and making it easier to comprehend at a glance.

Comment thread diskann-benchmark/src/backend/disk_index/benchmarks.rs Outdated
Comment thread diskann-benchmark/src/backend/disk_index/benchmarks.rs Outdated
Comment thread diskann-benchmark/src/backend/disk_index/benchmarks.rs Outdated
Comment thread .github/workflows/benchmarks-aa.yml Outdated
@YuanyuanTian-hh
Contributor Author

All addressed:

  • Renamed to disk-benchmarks.yml / disk-benchmarks-aa.yml.
  • Factored out shared setup into .github/actions/setup-disk-benchmark/action.yml composite action (Rust toolchain, cargo cache, system deps install, dataset download with BAB_RELEASE_URL as single source of truth).
  • Added PERF_INPUTS workflow-level env var to reduce path repetition.

Yuanyuan Tian (from Dev Box) added 2 commits April 10, 2026 11:14
… action, workflow renames

- Fuse check_lower/check_higher into check_metric with Direction enum
- Use Table for aligned regression output
- Prefix metric names with L{value}: for multiple search_l entries
- Rename benchmarks[-aa].yml to disk-benchmarks[-aa].yml
- Factor shared setup into .github/actions/setup-disk-benchmark/action.yml
- A/A: build once, run twice (no duplicate clone/compile)
- Add PERF_INPUTS workflow-level env var
Comment thread .github/workflows/disk-benchmarks.yml

@hildebrandmw hildebrandmw left a comment

Thanks - one more round of feedback, then this is ready to go.

Comment thread diskann-benchmark/src/backend/disk_index/benchmarks.rs Outdated
Comment thread .github/workflows/disk-benchmarks.yml Outdated
Comment thread .github/actions/setup-disk-benchmark/action.yml Outdated
Comment thread .github/actions/setup-disk-benchmark/action.yml Outdated
Comment thread .github/workflows/disk-benchmarks-aa.yml Outdated

@hildebrandmw hildebrandmw left a comment

Thanks!

@YuanyuanTian-hh YuanyuanTian-hh enabled auto-merge (squash) April 17, 2026 01:50
@YuanyuanTian-hh
Contributor Author

@microsoft-github-policy-service agree

@YuanyuanTian-hh YuanyuanTian-hh enabled auto-merge (squash) April 17, 2026 02:12
@YuanyuanTian-hh YuanyuanTian-hh merged commit 99e0b7d into main Apr 17, 2026
26 checks passed
@YuanyuanTian-hh YuanyuanTian-hh deleted the user/tianyuanyuan/benchmark-regression branch April 17, 2026 02:13
@arkrishn94 arkrishn94 mentioned this pull request Apr 22, 2026
arkrishn94 added a commit that referenced this pull request Apr 22, 2026
Bumping to 0.50.1 to propagate changes to consumers.

Changes since previous bump: 

## What's Changed
* Add more agentic guard rails by @hildebrandmw in
#871
* Cleanup `diskann-benchmark-runner` and friends. by @hildebrandmw in
#865
* Use `--all-targets` for the no-default-features CI run. by
@hildebrandmw in #874
* Remove unused `normalizing_util.rs` from `diskann-providers` by
@Copilot in #902
* Benchmark Support for A/B Tests by @hildebrandmw in
#900
* [diskann-garnet] Bump diskann-garnet to 1.0.26 by @tiagonapoli in
#925
* Remove the `AdjacencyList` from `diskann-providers` by @hildebrandmw
in #915
* [PQ cleanup] Part 1: Move pq_scratch, quantizer_preprocess and
pq_dataset to `diskann-disk` by @arkrishn94 in
#930
* Forbid Debug in diskann-benchmark by @arrayka in
#914
* Remove DebugProvider by @JordanMaples in
#923
* [diskann-garnet] Create workflow to publish to nuget by @tiagonapoli
in #926
* Move k-means implementation from diskann-providers to diskann-disk by
@Copilot in #933
* Inline minmax distance evaluations by @arkrishn94 in
#935
* Use `rust-toolchain.toml` in CI by @hildebrandmw in
#934
* Add a globally blocking CI gate. by @hildebrandmw in
#932
* Remove `utils/math_util.rs` from `diskann-providers` by @Copilot in
#921
* Bump rand from 0.9.2 to 0.9.3 by @dependabot[bot] in
#945
* Remove OPQ and friends by @arkrishn94 in
#947
* Migrate test_flaky_consolidate from diskann_providers to diskann by
@JordanMaples in #942
* Remove GraphDataType from diskann-providers by @wuw92 in
#950
* Remove unused method extract_best_l_candidates in
NeighborPriorityQueue by @doliawu in
#951
* Add `Debug` bounds to `VectorRepr`'s distance GATs. by @hildebrandmw
in #948
* Add benchmark pipeline with Rust-native A/B validation by
@YuanyuanTian-hh in #912
* Remove unnecessary `Default` bound from `Neighbor`'s `VectorIdType` by
@doliawu in #956
* Replace `AlignedBoxWithSlice` with plain `Vec` / `Matrix` where
alignment is unused by @wuw92 in
#955
* [minmax] 8-bit benchmark by @arkrishn94 in
#959
* Add `MultiInsertStrategy` implementations for `BfTreeProvider` by
@hildebrandmw in #949
* Replace `AlignedBoxWithSlice` with `Vec` in PQScratch and disk fp
vector caches by @wuw92 in #960
* Adding unit tests for paged_search by @JordanMaples in
#962
* Remove AlignedBoxWithSlice wrapper and add alias to Poly<[T],
AlignedAllocator> by @JordanMaples in
#965
* Remove synthetic/structured data generation from diskann-providers by
@JordanMaples in #963
* added tests and some baselines for range_search by @JordanMaples in
#961

## New Contributors
* @JordanMaples made their first contribution in
#923
* @wuw92 made their first contribution in
#950
* @doliawu made their first contribution in
#951
* @YuanyuanTian-hh made their first contribution in
#912

**Full Changelog**:
v0.50.0...v0.50.1