feat: Add VectorDBBench Cloud Leaderboard benchmark cases and client support by jamesgao-jpg · Pull Request #775 · zilliztech/VectorDBBench

jamesgao-jpg · 2026-05-11T09:34:35Z

Summary

This PR adds the VectorDBBench Cloud Leaderboard benchmark surface. The goal is to complement the existing raw-performance leaderboard with cloud-oriented cases that capture production behaviors managed vector database users care about: ingest readiness, payload-aware search, tenant-shaped traffic, cold latency, and cost-aware interpretation.

What is added

New Cloud Leaderboard cases

CloudPayloadSearchCase: measures search performance with explicit response payload profiles: ids_only, scalar_label, and vector. It supports unfiltered search, integer-filter search, and scalar-label filter search.
CloudInsertCase: measures insert throughput and separates client insert completion from downstream readiness signals such as fully searchable and fully indexed.
CloudColdLatencyCase: measures cold and warm serial latency so first-query and cache-sensitive serving behavior are visible instead of hidden by warm concurrent loops.
CloudMultiTenantSearchCase: models SaaS-style tenant-routed workloads with deterministic tenant assignment and tenant-aware query routing.

Runtime and metric plumbing

Threads payload profile configuration through case config, runners, client search calls, metrics, and JSON result output.
Adds payload byte estimation to result metrics so payload-heavy searches can be compared more explicitly.
Adds a cold/warm search runner and task-runner integration for cold latency measurement.
Adds concurrent insert readiness polling for cloud insert runs.
Adds first-class result fields for cloud insert, cold latency, payload profile, and related cloud case metadata.

Client support

Milvus/Zilliz Cloud: adds payload output handling, scalar-label support, multitenant partition-key validation, and related schema checks.
Pinecone: adds payload-profile search behavior, metadata/vector return handling, insert readiness polling through write/index LSNs where available, and namespace-based multitenant routing.
turbopuffer: adds payload-profile search behavior, scalar payload label configuration, multitenant namespace support, write backpressure control, namespace pin/unpin CLI support for pinned benchmark runs, and explicit non-retryable partial tenant-write failure reporting.

CLI, frontend, and docs

Adds CLI options for payload profile, cloud filters, cold query count, insert batch size/duration, tenant settings, and turbopuffer pinning.
Adds frontend case config entries for the cloud payload search cases.
Adds a May 2026 VectorDBBench Cloud Leaderboard release note under docs/release/2026-05-cloud-leaderboard.md.
Updates README.md to mention the new Cloud Leaderboard benchmark cases and link to the release note.

Notes

The raw Cloud Leaderboard result dump was intentionally removed from this PR. Result artifacts can be added later in a smaller, dedicated update or kept outside the source tree.
Cloud cost and Pareto interpretation are described in the release note, but this PR focuses on benchmark case support and documentation rather than publishing a full result dataset.

Test Plan

.venv/bin/python -m pytest tests/test_cloud_payload_search.py tests/test_cloud_payload_case.py tests/test_cloud_insert_case.py tests/test_cloud_cold_latency_case.py tests/test_multitenant_case.py tests/test_pinecone_multitenant.py tests/test_milvus.py -q -k 'not test_performance_1536d_50k'
.venv/bin/python -m ruff check vectordb_bench/backend/clients/api.py vectordb_bench/backend/clients/turbopuffer/turbopuffer.py vectordb_bench/backend/runner/concurrent_runner.py vectordb_bench/cli/cli.py

sre-ci-robot · 2026-05-11T09:34:42Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jamesgao-jpg
To complete the pull request process, please assign xuanyang-cn after the PR has been reviewed.
You can assign the PR to them by writing /assign @xuanyang-cn in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…h-case # Conflicts: # vectordb_bench/backend/clients/milvus/milvus.py # vectordb_bench/backend/clients/turbopuffer/turbopuffer.py # vectordb_bench/backend/dataset.py # vectordb_bench/backend/runner/serial_runner.py

Ported onto cloud-payload-search-case while preserving TurboPuffer payload, backpressure, and multitenant support. (cherry picked from commit adb1ae4)

(cherry picked from commit d9e2f5d)

…lution # Conflicts: # tests/test_milvus.py

This reverts commit 9648add.

foxspy · 2026-05-22T09:30:48Z

+
    @property
    def with_scalar_labels(self) -> bool:
        return self.filters.type == FilterOp.StrEqual


This looks like a correctness issue: with_scalar_labels is currently derived only from FilterOp.StrEqual, but PayloadProfile.SCALAR_LABEL also requires scalar label data/schema even when there is no label filter.

For example, CloudPayloadSearchCase(payload_profile="scalar_label") uses NonFilter, so with_scalar_labels becomes false. The load path then does not create/load label data, while the search path still requests the label payload via output_fields / metadata / attributes. That means the benchmark may either fail at query time or measure a run where the requested scalar payload was never actually stored.

The multitenant path has a related issue: tenant routing labels and scalar payload/filter labels are treated as the same scalar-label concept in some insert paths. These should be separate fields/metadata concepts so a multitenant case can write both tenant routing information and the scalar payload label.

Suggested fix: make scalar-label materialization depend on both label filtering and payload profile, e.g. filters.type == FilterOp.StrEqual or payload_profile == PayloadProfile.SCALAR_LABEL, and keep tenant routing labels separate from scalar payload labels in the provider insert schemas.

Fixed in 9343d01.

Changes made:

with_scalar_labels now includes PayloadProfile.SCALAR_LABEL, so scalar-label payload runs materialize label schema/data even without a label filter.

ConcurrentInsertRunner now loads scalar labels when scalar-label materialization is required, not only for FilterOp.StrEqual.

Milvus/Zilliz, Pinecone, and turbopuffer insert paths now keep tenant routing labels separate from scalar payload labels.

Added regression coverage for unfiltered scalar-label payload loads and multitenant scalar payload inserts.

Updated docs/release/2026-05-cloud-leaderboard.md to document the scalar-label and tenant-routing semantics.

Verification:

.venv/bin/python -m pytest tests/test_cloud_payload_search.py tests/test_cloud_insert_case.py tests/test_multitenant_case.py tests/test_pinecone_multitenant.py tests/test_milvus.py -q -k 'not test_performance_1536d_50k' -> 55 passed, 1 deselected

.venv/bin/python -m ruff check on touched production files -> All checks passed

foxspy · 2026-05-22T09:38:35Z

            "label_percentage": parameters["label_percentage"],
        }
+    elif parameters["case_type"] == "CloudPayloadSearchCase":
+        custom_case_config = {


This looks like a default/config propagation issue for the new cloud cases.

CloudPayloadSearchCase does not appear to receive dataset_with_size_type from the CLI custom case config, so a user-selected dataset can be silently ignored and the case falls back to its constructor default dataset. That would make the benchmark run against a different dataset than the CLI/config implies.

There is also an inconsistency in how missing defaults are handled across the cloud cases. CloudMultiTenantSearchCase receives parameters["dataset_with_size_type"] directly, which may be None, but the constructor later converts non-DatasetWithSizeType values with DatasetWithSizeType(dataset_with_size_type). If None is passed through, the default CLI path can fail during case construction instead of using the case default.

Suggested fix: normalize optional CLI values before building custom_case_config. Either omit dataset_with_size_type when it is not explicitly provided, or pass a concrete default consistently. Also add CLI regression tests for omitted dataset/default behavior for CloudPayloadSearchCase, CloudMultiTenantSearchCase, and CloudInsertCase.

Fixed in 552155c.

I verified the issue was real on the PR branch:

CloudPayloadSearchCase dropped an explicit CLI dataset_with_size_type, so the case fell back to LAION 100M.

CloudMultiTenantSearchCase passed dataset_with_size_type=None, which failed construction instead of using the case default.

Changes made:

CloudPayloadSearchCase now propagates an explicitly selected dataset and omits the field when absent, preserving its LAION 100M constructor default.

CloudMultiTenantSearchCase now omits absent dataset config, preserving its CohereLarge constructor default instead of passing None.

CloudInsertCase keeps the existing CLI default of Medium Cohere when omitted.

Added CLI regression coverage for payload explicit/omitted dataset behavior, multitenant omitted dataset behavior, and insert omitted/default behavior.

Verification:

.venv/bin/python -m pytest tests/test_cloud_payload_search.py tests/test_cloud_payload_case.py tests/test_cloud_insert_case.py tests/test_cloud_cold_latency_case.py tests/test_multitenant_case.py tests/test_pinecone_multitenant.py tests/test_milvus.py -q -k 'not test_performance_1536d_50k' -> 82 passed, 1 deselected

.venv/bin/python -m ruff check vectordb_bench/cli/cli.py -> passed

foxspy · 2026-05-22T09:57:13Z

                )
        except Exception as e:
            log.warning(f"Failed to insert. Error: {e}")
+            return 0, e


This looks like a partial-failure accounting issue in the multitenant insert path.

When tenant_labels_data is present, this method writes one tenant namespace at a time and increments inserted after each successful tenant write. However, if a later tenant write fails, the outer except returns 0, e even though earlier tenant namespaces may already have been written successfully.

That makes the caller's retry/statistics semantics misleading: the benchmark reports zero inserted rows for a batch that may have partially modified remote state, and a retry may re-write tenants that already succeeded.

Suggested fix: preserve the actual partial insert count on failure, or make partial tenant-write failures explicit/non-retryable with details about which tenants/counts were already written. A regression test where the first tenant write succeeds and the second tenant write fails would cover this behavior.

Fixed in c1164ce.

I verified the concern was real: TurboPuffer's multitenant insert path could successfully write one tenant namespace, then return (0, error) if a later tenant write failed. The concurrent insert runner also retried any returned error, so this could re-write tenants that had already succeeded.

Changes made:

Added explicit non-retryable partial-insert error metadata: inserted_count, successful_tenants, failed_tenant, and failed_tenant_count.

TurboPuffer multitenant insert now returns the actual partial count and that detailed non-retryable error when a tenant write fails after prior tenant writes.

ConcurrentInsertRunner now stops immediately on non-retryable insert errors instead of retrying the whole batch.

Added regression coverage for first-tenant-success/second-tenant-failure and for runner non-retry behavior.

Verification:

.venv/bin/python -m pytest tests/test_cloud_payload_search.py tests/test_cloud_payload_case.py tests/test_cloud_insert_case.py tests/test_cloud_cold_latency_case.py tests/test_multitenant_case.py tests/test_pinecone_multitenant.py tests/test_milvus.py -q -k 'not test_performance_1536d_50k' -> 84 passed, 1 deselected

.venv/bin/python -m ruff check vectordb_bench/backend/clients/api.py vectordb_bench/backend/clients/turbopuffer/turbopuffer.py vectordb_bench/backend/runner/concurrent_runner.py -> passed

XuanYang-cn

Requesting changes. I left the blocking findings inline.

XuanYang-cn · 2026-05-26T09:05:33Z

+            label_field = self.filters.label_field if self.filters.type == FilterOp.StrEqual else "labels"
            if self.dataset.data.scalar_labels_file_separated:
-                labels_data = self.dataset.scalar_labels[self.filters.label_field][all_metadata].to_list()
+                labels_data = self.dataset.scalar_labels[label_field][all_metadata].to_list()


Scalar-label payload runs still fail for datasets whose labels live in scalar_labels.parquet.

CloudPayloadSearchCase(payload_profile="scalar_label") sets with_scalar_labels=True, but DatasetManager.prepare() only hydrates dataset.scalar_labels when the active filter is StrEqual. For the default unfiltered scalar-label payload case, this line dereferences None for separated-label datasets like LAION/Cohere.

Please load separated scalar labels during dataset preparation whenever the case requires scalar-label materialization, not only when the active filter is StrEqual, and add a regression test for the separated-label path.

Fixed in d1b545b. DatasetManager.prepare() now accepts a with_scalar_labels signal and hydrates separated scalar_labels.parquet when either a StrEqual label filter or scalar-label payload materialization requires it. CaseRunner._pre_run() passes self.ca.with_scalar_labels into dataset preparation, so the default unfiltered CloudPayloadSearchCase(payload_profile="scalar_label") path loads dataset.scalar_labels before ConcurrentInsertRunner reads it.

Added regression coverage for both pieces:

test_dataset_prepare_loads_separated_scalar_labels_for_scalar_payload

test_pre_run_prepares_separated_scalar_labels_for_scalar_payload

Verification: python3.11 -m pytest tests/test_cloud_payload_search.py tests/test_cloud_insert_case.py -q -> 40 passed.

XuanYang-cn · 2026-05-26T09:05:33Z

+                            for insert_data, tenant_label in zip(insert_datas, batch_tenant_labels, strict=True)
+                            if tenant_label == tenant
+                        ]
+                        self._multitenant_insert_counts[tenant] = self._multitenant_insert_counts.get(tenant, 0) + len(


Pinecone multitenant inserts still account partial tenant writes incorrectly.

This increments _multitenant_insert_counts before the tenant namespace upsert succeeds, while insert_count is advanced only after all tenant writes in the batch complete. If tenant A succeeds and tenant B fails, the method returns (0, error) even though tenant A was written, and _multitenant_insert_counts includes the failed tenant too.

Please mirror the TurboPuffer partial-insert handling here: update tenant counts only after successful writes and return a non-retryable partial failure with the actual inserted count and tenant details.

Fixed in 7d2a4d6. Pinecone multitenant inserts now update _multitenant_insert_counts only after a tenant namespace upsert succeeds. The insert path tracks actual inserted rows and successful tenants; if a later tenant upsert fails, it returns a non-retryable PartialInsertError with inserted_count, successful_tenants, failed_tenant, and failed_tenant_count, matching the TurboPuffer handling. This prevents the concurrent insert runner from retrying a partially written tenant batch and prevents readiness expectations from including the failed tenant write.

Added regression coverage in test_pinecone_multitenant_partial_insert_failure_is_explicit.

Verification: python3.11 -m pytest tests/test_pinecone_multitenant.py tests/test_cloud_insert_case.py::test_concurrent_insert_runner_does_not_retry_non_retryable_insert_errors -q -> 4 passed.

XuanYang-cn · 2026-05-26T09:05:33Z

+        with self.db.init():
+            status = self.db.poll_insert_readiness(count)
+            searchable_started = time.perf_counter()
+            while not status["fully_searchable"]:


Cloud insert readiness polling can hang forever here.

Both readiness loops poll without any deadline. A service-side indexing stall, an API/status mismatch, or incorrect expected-count state will leave the benchmark process stuck indefinitely after the insert phase has completed.

Please add a readiness timeout/deadline, preferably configurable from the cloud insert case, and include the last readiness status in the raised error so failed runs are diagnosable.

Fixed in 589b854. Cloud insert readiness polling now supports an explicit timeout/deadline and includes the last readiness status in the raised TimeoutError for diagnosis. The timeout is intentionally opt-in: CloudInsertCase.readiness_timeout defaults to None, preserving the previous behavior of waiting indefinitely unless the user sets --cloud-insert-readiness-timeout or CLOUD_INSERT_READINESS_TIMEOUT. The poll interval is also configurable via --cloud-insert-readiness-poll-interval / CLOUD_INSERT_READINESS_POLL_INTERVAL and defaults to the prior 5s interval.

Added regression coverage for:

default CloudInsertCase has no readiness timeout

CLI propagation of explicit timeout/poll interval

stalled readiness raises TimeoutError with last_status when a timeout is explicitly configured

Verification: python3.11 -m pytest tests/test_cloud_insert_case.py -q -> 34 passed.

XuanYang-cn · 2026-05-26T09:05:33Z

    from .config import TurboPufferConfig, TurboPufferIndexConfig

+    pin_target_namespace_count = len(target_namespaces_for_pinning(parameters)) if parameters["pin_namespace"] else 0
+    if parameters["pin_namespace"]:


--dry-run --pin-namespace still mutates remote TurboPuffer state.

This calls pin_namespaces_once() before run() sees dry_run, and pin_namespaces_once() PATCHes namespace metadata and waits for pinning. Since pinning reserves persistent service resources, a dry run can still create billing/state side effects.

Please skip pinning when parameters["dry_run"] is true, or move the pinning action behind the same dry-run guard used for benchmark execution. A CLI test should assert that dry-run with --pin-namespace does not call the metadata PATCH helper.

XuanYang-cn

I found one additional required issue in the result read/write path.

These existing unresolved review threads still apply on the current head and should remain blockers: scalar-label payload loads with separated scalar_labels.parquet, Pinecone multitenant partial-insert accounting, cloud insert readiness polling without a timeout, and TurboPuffer --dry-run --pin-namespace mutating remote state before dry-run handling.

XuanYang-cn · 2026-05-26T12:24:03Z

+    def model_dump_for_output(self) -> dict:
+        output = self.model_dump(mode="json", serialize_as_any=True)
+        for idx, case_result in enumerate(self.results):
+            output["results"][idx]["metrics"] = self._output_metrics_for_case(case_result)


This projection makes the new cloud result files unreadable by the existing result collector. ResultCollector.collect() calls TestResult.read_file(..., trans_unit=True) for every result file, and that conversion still indexes max_load_count, serial_latency_p99, and related legacy metric keys. Since the CloudInsert and CloudColdLatency branches now write only case-specific metrics, collecting a cloud insert result raises KeyError: 'max_load_count' before model validation fills defaults. Please either keep the legacy metric keys/defaults in the serialized output or make the trans_unit conversion case-aware, and add a regression that reads a cloud insert/cold-latency result through ResultCollector.collect().

jamesgao-jpg added 3 commits May 8, 2026 10:12

Add cloud payload search benchmark case

8d83582

Add cloud payload scalar label support

1637484

Add cloud insert readiness case

829d7f4

jamesgao-jpg changed the title ~~Add cloud leaderboard payload and insert readiness cases~~ (feat) Add cloud leaderboard payload and insert readiness cases May 11, 2026

jamesgao-jpg force-pushed the cloudLeadboard branch 2 times, most recently from e258154 to f909086 Compare May 11, 2026 09:42

jamesgao-jpg force-pushed the cloudLeadboard branch from f909086 to c943513 Compare May 11, 2026 09:51

Fix cloud insert CI lint failures

519f01c

jamesgao-jpg changed the title ~~(feat) Add cloud leaderboard payload and insert readiness cases~~ [WIP] (feat) Add cloud leaderboard payload and insert readiness cases May 12, 2026

jamesgao-jpg added 19 commits May 12, 2026 13:17

Add multitenant VDBBench design spec

4df4b8f

Add multitenant implementation plan

20befac

Add cloud insert concurrency and Pinecone readiness support

76ad386

Add cloud cold latency case design

5fb9930

Add cloud cold latency implementation plan

e611187

Add cloud cold latency case model

2a99d70

Add cold warm search runner

f61b447

Fix cold warm runner lint issues

46c5c94

Wire cloud cold latency runner into tasks

a9f7668

Fix cloud cold latency task integration lint

33361d2

Fix cloud cold latency CLI defaults

52b7158

Add cloud multitenant search case

fc66bc1

Emit first-class cloud case result metrics

dd2485d

Pretty-print cloud result JSON output

2f92bbf

fix: validate multitenant search schema

77dcb72

fix: support turbopuffer multitenant payload runs

6cede04

fix: configure turbopuffer scalar payload field

df15217

Add Turbopuffer namespace pinning CLI

66475be

Ported onto cloud-payload-search-case while preserving TurboPuffer payload, backpressure, and multitenant support. (cherry picked from commit adb1ae4)

Document TurboPuffer unpin command

1994211

(cherry picked from commit d9e2f5d)

jamesgao-jpg added 2 commits May 14, 2026 09:13

Add cloud insert benchmark raw results

0bf0fef

Remove internal docs and raw cloud results from PR

c4620d1

jamesgao-jpg changed the title ~~[WIP] (feat) Add cloud leaderboard payload and insert readiness cases~~ Add cloud benchmark cases for payload, insert, cold latency, and multitenancy May 15, 2026

jamesgao-jpg changed the title ~~Add cloud benchmark cases for payload, insert, cold latency, and multitenancy~~ Add cloud leaderboard benchmark cases and client support May 18, 2026

jamesgao-jpg added 2 commits May 18, 2026 06:39

Merge remote-tracking branch 'upstream/main' into pr775-conflict-reso…

c623e17

…lution # Conflicts: # tests/test_milvus.py

Format cloud leaderboard changes

b7ec21e

jamesgao-jpg changed the title ~~Add cloud leaderboard benchmark cases and client support~~ [Feature] Add cloud leaderboard benchmark cases and client support May 18, 2026

jamesgao-jpg added 4 commits May 19, 2026 07:14

Add cloud leaderboard test results

9648add

Revert "Add cloud leaderboard test results"

379f81e

This reverts commit 9648add.

Add CloudLeadboard release note

7bfa991

Document CloudLeadboard in README

b353ac0

foxspy reviewed May 22, 2026

View reviewed changes

jamesgao-jpg added 3 commits May 25, 2026 06:42

Fix scalar label payload materialization

9343d01

Fix cloud case CLI dataset defaults

552155c

Make turbopuffer partial tenant insert failures explicit

c1164ce

jamesgao-jpg changed the title ~~[Feature] Add cloud leaderboard benchmark cases and client support~~ [Feature] Add VectorDBBench Cloud Leaderboard benchmark cases and client support May 25, 2026

jamesgao-jpg added 4 commits May 25, 2026 09:19

Use Cloud Leaderboard terminology

62ea5d4

Handle cloud result metrics in frontend reads

1ff51e8

Allow display reads with sanitized db config

4fd1708

Remove frontend-scoped cloud leaderboard changes

2bae0e9

XuanYang-cn requested changes May 26, 2026

View reviewed changes

Add Cloud Leaderboard result staging folders

3e625e8

XuanYang-cn requested changes May 26, 2026

View reviewed changes

jamesgao-jpg added 3 commits May 27, 2026 03:43

Fix scalar payload separated label preparation

d1b545b

Handle Pinecone multitenant partial inserts

7d2a4d6

Add opt-in cloud insert readiness timeout

589b854

jamesgao-jpg changed the title ~~[Feature] Add VectorDBBench Cloud Leaderboard benchmark cases and client support~~ feat: Add VectorDBBench Cloud Leaderboard benchmark cases and client support May 27, 2026

Add LAION 100M cloud insert dataset option

48d90af

Conversation

jamesgao-jpg commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What is added

New Cloud Leaderboard cases

Runtime and metric plumbing

Client support

CLI, frontend, and docs

Notes

Test Plan

Uh oh!

sre-ci-robot commented May 11, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jamesgao-jpg May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

XuanYang-cn left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

XuanYang-cn left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

jamesgao-jpg commented May 11, 2026 •

edited

Loading

jamesgao-jpg May 25, 2026 •

edited

Loading