feat: Add VectorDBBench Cloud Leaderboard benchmark cases and client support#775

Open
jamesgao-jpg wants to merge 45 commits into zilliztech:main from jamesgao-jpg:cloudLeadboard

Conversation

@jamesgao-jpg
Collaborator

@jamesgao-jpg jamesgao-jpg commented May 11, 2026

Summary

This PR adds the VectorDBBench Cloud Leaderboard benchmark surface. The goal is to complement the existing raw-performance leaderboard with cloud-oriented cases that capture production behaviors managed vector database users care about: ingest readiness, payload-aware search, tenant-shaped traffic, cold latency, and cost-aware interpretation.

What is added

New Cloud Leaderboard cases

  • CloudPayloadSearchCase: measures search performance under three explicit response payload profiles (ids_only, scalar_label, and vector). It supports unfiltered search, integer-filter search, and scalar-label filter search.
  • CloudInsertCase: measures insert throughput and separates client insert completion from downstream readiness signals such as fully searchable and fully indexed.
  • CloudColdLatencyCase: measures cold and warm serial latency so first-query and cache-sensitive serving behavior are visible instead of hidden by warm concurrent loops.
  • CloudMultiTenantSearchCase: models SaaS-style tenant-routed workloads with deterministic tenant assignment and tenant-aware query routing.
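
As a rough illustration of what "deterministic tenant assignment" can mean, here is a minimal sketch. It assumes a stable hash of the row id is reduced modulo the tenant count; the function names `tenant_for_id` and `route_query` and the `tenant_<n>` label format are hypothetical and are not taken from this PR.

```python
import hashlib


def tenant_for_id(row_id: int, num_tenants: int) -> str:
    """Map a row id to a tenant label; the same inputs always give the same label."""
    if num_tenants <= 0:
        raise ValueError("num_tenants must be positive")
    # A cryptographic digest is stable across processes and machines,
    # so load and search phases agree on the assignment.
    digest = hashlib.sha256(str(row_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % num_tenants
    return f"tenant_{bucket}"  # hypothetical label format


def route_query(query_id: int, num_tenants: int) -> str:
    """Pick the tenant a query is sent to, using the same deterministic rule."""
    return tenant_for_id(query_id, num_tenants)
```

Because the mapping depends only on its inputs, a rerun of the benchmark sends each row and each query to the same tenant as before.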

Runtime and metric plumbing

  • Threads payload profile configuration through case config, runners, client search calls, metrics, and JSON result output.
  • Adds payload byte estimation to result metrics so payload-heavy searches can be compared more explicitly.
  • Adds a cold/warm search runner and task-runner integration for cold latency measurement.
  • Adds concurrent insert readiness polling for cloud insert runs.
  • Adds first-class result fields for cloud insert, cold latency, payload profile, and related cloud case metadata.

Client support

  • Milvus/Zilliz Cloud: adds payload output handling, scalar-label support, multitenant partition-key validation, and related schema checks.
  • Pinecone: adds payload-profile search behavior, metadata/vector return handling, insert readiness polling through write/index LSNs where available, and namespace-based multitenant routing.
  • turbopuffer: adds payload-profile search behavior, scalar payload label configuration, multitenant namespace support, write backpressure control, namespace pin/unpin CLI support for pinned benchmark runs, and explicit non-retryable partial tenant-write failure reporting.

CLI, frontend, and docs

  • Adds CLI options for payload profile, cloud filters, cold query count, insert batch size/duration, tenant settings, and turbopuffer pinning.
  • Adds frontend case config entries for the cloud payload search cases.
  • Adds a May 2026 VectorDBBench Cloud Leaderboard release note under docs/release/2026-05-cloud-leaderboard.md.
  • Updates README.md to mention the new Cloud Leaderboard benchmark cases and link to the release note.

Notes

  • The raw Cloud Leaderboard result dump was intentionally removed from this PR. Result artifacts can be added later in a smaller, dedicated update or kept outside the source tree.
  • Cloud cost and Pareto interpretation are described in the release note, but this PR focuses on benchmark case support and documentation rather than publishing a full result dataset.

Test Plan

  • .venv/bin/python -m pytest tests/test_cloud_payload_search.py tests/test_cloud_payload_case.py tests/test_cloud_insert_case.py tests/test_cloud_cold_latency_case.py tests/test_multitenant_case.py tests/test_pinecone_multitenant.py tests/test_milvus.py -q -k 'not test_performance_1536d_50k'
  • .venv/bin/python -m ruff check vectordb_bench/backend/clients/api.py vectordb_bench/backend/clients/turbopuffer/turbopuffer.py vectordb_bench/backend/runner/concurrent_runner.py vectordb_bench/cli/cli.py

@sre-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jamesgao-jpg
To complete the pull request process, please assign xuanyang-cn after the PR has been reviewed.
You can assign the PR to them by writing /assign @xuanyang-cn in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jamesgao-jpg jamesgao-jpg changed the title from "Add cloud leaderboard payload and insert readiness cases" to "(feat) Add cloud leaderboard payload and insert readiness cases" May 11, 2026
@jamesgao-jpg jamesgao-jpg force-pushed the cloudLeadboard branch 2 times, most recently from e258154 to f909086 May 11, 2026 09:42
…h-case

# Conflicts:
#	vectordb_bench/backend/clients/milvus/milvus.py
#	vectordb_bench/backend/clients/turbopuffer/turbopuffer.py
#	vectordb_bench/backend/dataset.py
#	vectordb_bench/backend/runner/serial_runner.py
@jamesgao-jpg jamesgao-jpg changed the title from "(feat) Add cloud leaderboard payload and insert readiness cases" to "[WIP] (feat) Add cloud leaderboard payload and insert readiness cases" May 12, 2026
@jamesgao-jpg jamesgao-jpg changed the title from "[WIP] (feat) Add cloud leaderboard payload and insert readiness cases" to "Add cloud benchmark cases for payload, insert, cold latency, and multitenancy" May 15, 2026
@jamesgao-jpg jamesgao-jpg changed the title from "Add cloud benchmark cases for payload, insert, cold latency, and multitenancy" to "Add cloud leaderboard benchmark cases and client support" May 18, 2026
@jamesgao-jpg jamesgao-jpg changed the title from "Add cloud leaderboard benchmark cases and client support" to "[Feature] Add cloud leaderboard benchmark cases and client support" May 18, 2026
Comment thread vectordb_bench/backend/cases.py Outdated

    @property
    def with_scalar_labels(self) -> bool:
        return self.filters.type == FilterOp.StrEqual
Collaborator

This looks like a correctness issue: with_scalar_labels is currently derived only from FilterOp.StrEqual, but PayloadProfile.SCALAR_LABEL also requires scalar label data/schema even when there is no label filter.

For example, CloudPayloadSearchCase(payload_profile="scalar_label") uses NonFilter, so with_scalar_labels becomes false. The load path then does not create/load label data, while the search path still requests the label payload via output_fields / metadata / attributes. That means the benchmark may either fail at query time or measure a run where the requested scalar payload was never actually stored.

The multitenant path has a related issue: tenant routing labels and scalar payload/filter labels are treated as the same scalar-label concept in some insert paths. These should be separate fields/metadata concepts so a multitenant case can write both tenant routing information and the scalar payload label.

Suggested fix: make scalar-label materialization depend on both label filtering and payload profile, e.g. filters.type == FilterOp.StrEqual or payload_profile == PayloadProfile.SCALAR_LABEL, and keep tenant routing labels separate from scalar payload labels in the provider insert schemas.
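
A minimal sketch of the suggested condition, assuming stand-in definitions: `FilterOp` and `PayloadProfile` below are simplified enums written for this example (only the payload profile strings come from the PR description), and `CaseSketch` is a hypothetical substitute for the real case class.

```python
from dataclasses import dataclass
from enum import Enum


class FilterOp(str, Enum):
    # Stand-in values; the real enum lives in the project.
    NonFilter = "NonFilter"
    StrEqual = "StrEqual"


class PayloadProfile(str, Enum):
    IDS_ONLY = "ids_only"
    SCALAR_LABEL = "scalar_label"
    VECTOR = "vector"


@dataclass
class CaseSketch:
    """Hypothetical case holding only the two fields the property needs."""

    filter_type: FilterOp = FilterOp.NonFilter
    payload_profile: PayloadProfile = PayloadProfile.IDS_ONLY

    @property
    def with_scalar_labels(self) -> bool:
        # Labels must be stored if they are filtered on OR returned as payload.
        return (
            self.filter_type == FilterOp.StrEqual
            or self.payload_profile == PayloadProfile.SCALAR_LABEL
        )
```

With this shape, an unfiltered case that requests the scalar_label payload still reports that label data has to be loaded.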

Collaborator Author

@jamesgao-jpg jamesgao-jpg May 25, 2026

Fixed in 9343d01.

Changes made:

  • with_scalar_labels now includes PayloadProfile.SCALAR_LABEL, so scalar-label payload runs materialize label schema/data even without a label filter.
  • ConcurrentInsertRunner now loads scalar labels when scalar-label materialization is required, not only for FilterOp.StrEqual.
  • Milvus/Zilliz, Pinecone, and turbopuffer insert paths now keep tenant routing labels separate from scalar payload labels.
  • Added regression coverage for unfiltered scalar-label payload loads and multitenant scalar payload inserts.
  • Updated docs/release/2026-05-cloud-leaderboard.md to document the scalar-label and tenant-routing semantics.

Verification:

  • .venv/bin/python -m pytest tests/test_cloud_payload_search.py tests/test_cloud_insert_case.py tests/test_multitenant_case.py tests/test_pinecone_multitenant.py tests/test_milvus.py -q -k 'not test_performance_1536d_50k' -> 55 passed, 1 deselected
  • .venv/bin/python -m ruff check on touched production files -> All checks passed

Comment thread vectordb_bench/cli/cli.py
"label_percentage": parameters["label_percentage"],
}
elif parameters["case_type"] == "CloudPayloadSearchCase":
custom_case_config = {
Collaborator

This looks like a default/config propagation issue for the new cloud cases.

CloudPayloadSearchCase does not appear to receive dataset_with_size_type from the CLI custom case config, so a user-selected dataset can be silently ignored and the case falls back to its constructor default dataset. That would make the benchmark run against a different dataset than the CLI/config implies.

There is also an inconsistency in how missing defaults are handled across the cloud cases. CloudMultiTenantSearchCase receives parameters["dataset_with_size_type"] directly, which may be None, but the constructor later converts non-DatasetWithSizeType values with DatasetWithSizeType(dataset_with_size_type). If None is passed through, the default CLI path can fail during case construction instead of using the case default.

Suggested fix: normalize optional CLI values before building custom_case_config. Either omit dataset_with_size_type when it is not explicitly provided, or pass a concrete default consistently. Also add CLI regression tests for omitted dataset/default behavior for CloudPayloadSearchCase, CloudMultiTenantSearchCase, and CloudInsertCase.
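
One way to picture the "omit when not provided" option is the sketch below. `build_custom_case_config` and `PayloadCaseSketch` are hypothetical names, and the `"default-dataset"` value is a placeholder for whatever the real constructor default is.

```python
from typing import Any


def build_custom_case_config(parameters: dict[str, Any], optional_keys: tuple[str, ...]) -> dict[str, Any]:
    """Copy only the optional CLI values that were actually provided."""
    config: dict[str, Any] = {}
    for key in optional_keys:
        value = parameters.get(key)
        if value is not None:  # None means "not given on the command line"
            config[key] = value
    return config


class PayloadCaseSketch:
    """Hypothetical case whose constructor default applies only when the key is omitted."""

    def __init__(self, dataset_with_size_type: str = "default-dataset", **_: Any) -> None:
        self.dataset_with_size_type = dataset_with_size_type


# Omitted on the CLI: the key is dropped, so the constructor default is used.
omitted = PayloadCaseSketch(**build_custom_case_config({"dataset_with_size_type": None}, ("dataset_with_size_type",)))
# Explicit on the CLI: the selected dataset is propagated.
explicit = PayloadCaseSketch(**build_custom_case_config({"dataset_with_size_type": "user-choice"}, ("dataset_with_size_type",)))
```

Dropping the key, rather than passing None through, leaves each case free to keep its own default.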

Collaborator Author

Fixed in 552155c.

I verified the issue was real on the PR branch:

  • CloudPayloadSearchCase dropped an explicit CLI dataset_with_size_type, so the case fell back to LAION 100M.
  • CloudMultiTenantSearchCase passed dataset_with_size_type=None, which failed construction instead of using the case default.

Changes made:

  • CloudPayloadSearchCase now propagates an explicitly selected dataset and omits the field when absent, preserving its LAION 100M constructor default.
  • CloudMultiTenantSearchCase now omits absent dataset config, preserving its CohereLarge constructor default instead of passing None.
  • CloudInsertCase keeps the existing CLI default of Medium Cohere when omitted.
  • Added CLI regression coverage for payload explicit/omitted dataset behavior, multitenant omitted dataset behavior, and insert omitted/default behavior.

Verification:

  • .venv/bin/python -m pytest tests/test_cloud_payload_search.py tests/test_cloud_payload_case.py tests/test_cloud_insert_case.py tests/test_cloud_cold_latency_case.py tests/test_multitenant_case.py tests/test_pinecone_multitenant.py tests/test_milvus.py -q -k 'not test_performance_1536d_50k' -> 82 passed, 1 deselected
  • .venv/bin/python -m ruff check vectordb_bench/cli/cli.py -> passed

        )
    except Exception as e:
        log.warning(f"Failed to insert. Error: {e}")
        return 0, e
Collaborator

This looks like a partial-failure accounting issue in the multitenant insert path.

When tenant_labels_data is present, this method writes one tenant namespace at a time and increments inserted after each successful tenant write. However, if a later tenant write fails, the outer except returns 0, e even though earlier tenant namespaces may already have been written successfully.

That makes the caller's retry/statistics semantics misleading: the benchmark reports zero inserted rows for a batch that may have partially modified remote state, and a retry may re-write tenants that already succeeded.

Suggested fix: preserve the actual partial insert count on failure, or make partial tenant-write failures explicit/non-retryable with details about which tenants/counts were already written. A regression test where the first tenant write succeeds and the second tenant write fails would cover this behavior.
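
The second option (an explicit, non-retryable partial failure) could look roughly like this. It is a sketch under stated assumptions: `insert_by_tenant` and the `write_tenant` callback are hypothetical, and the error class and its fields only mirror the names used in this thread, not the project's actual implementation.

```python
from __future__ import annotations

from collections.abc import Callable


class PartialInsertError(Exception):
    """Illustrative error: some tenants were written before a later tenant failed."""

    retryable = False  # retrying the whole batch would re-write tenants that succeeded

    def __init__(self, inserted_count, successful_tenants, failed_tenant, failed_tenant_count, cause):
        super().__init__(
            f"partial insert: {inserted_count} rows written to {successful_tenants}; "
            f"tenant {failed_tenant!r} ({failed_tenant_count} rows) failed: {cause}"
        )
        self.inserted_count = inserted_count
        self.successful_tenants = successful_tenants
        self.failed_tenant = failed_tenant
        self.failed_tenant_count = failed_tenant_count


def insert_by_tenant(
    rows_by_tenant: dict[str, list[int]],
    write_tenant: Callable[[str, list[int]], None],
) -> tuple[int, Exception | None]:
    """Write one tenant at a time and report what was really written on failure."""
    inserted = 0
    successful: list[str] = []
    for tenant, rows in rows_by_tenant.items():
        try:
            write_tenant(tenant, rows)
        except Exception as e:
            if inserted == 0:
                return 0, e  # nothing was written, so a plain retry is still safe
            return inserted, PartialInsertError(inserted, successful, tenant, len(rows), e)
        # Count rows only after the tenant write has succeeded.
        inserted += len(rows)
        successful.append(tenant)
    return inserted, None
```

A caller can then check `retryable` on the returned error and stop instead of resubmitting the batch.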

Collaborator Author

Fixed in c1164ce.

I verified the concern was real: TurboPuffer's multitenant insert path could successfully write one tenant namespace, then return (0, error) if a later tenant write failed. The concurrent insert runner also retried any returned error, so this could re-write tenants that had already succeeded.

Changes made:

  • Added explicit non-retryable partial-insert error metadata: inserted_count, successful_tenants, failed_tenant, and failed_tenant_count.
  • TurboPuffer multitenant insert now returns the actual partial count and that detailed non-retryable error when a tenant write fails after prior tenant writes.
  • ConcurrentInsertRunner now stops immediately on non-retryable insert errors instead of retrying the whole batch.
  • Added regression coverage for first-tenant-success/second-tenant-failure and for runner non-retry behavior.

Verification:

  • .venv/bin/python -m pytest tests/test_cloud_payload_search.py tests/test_cloud_payload_case.py tests/test_cloud_insert_case.py tests/test_cloud_cold_latency_case.py tests/test_multitenant_case.py tests/test_pinecone_multitenant.py tests/test_milvus.py -q -k 'not test_performance_1536d_50k' -> 84 passed, 1 deselected
  • .venv/bin/python -m ruff check vectordb_bench/backend/clients/api.py vectordb_bench/backend/clients/turbopuffer/turbopuffer.py vectordb_bench/backend/runner/concurrent_runner.py -> passed

@jamesgao-jpg jamesgao-jpg changed the title from "[Feature] Add cloud leaderboard benchmark cases and client support" to "[Feature] Add VectorDBBench Cloud Leaderboard benchmark cases and client support" May 25, 2026
Collaborator

@XuanYang-cn XuanYang-cn left a comment

Requesting changes. I left the blocking findings inline.

label_field = self.filters.label_field if self.filters.type == FilterOp.StrEqual else "labels"
if self.dataset.data.scalar_labels_file_separated:
    labels_data = self.dataset.scalar_labels[self.filters.label_field][all_metadata].to_list()
    labels_data = self.dataset.scalar_labels[label_field][all_metadata].to_list()
Collaborator

Scalar-label payload runs still fail for datasets whose labels live in scalar_labels.parquet.

CloudPayloadSearchCase(payload_profile="scalar_label") sets with_scalar_labels=True, but DatasetManager.prepare() only hydrates dataset.scalar_labels when the active filter is StrEqual. For the default unfiltered scalar-label payload case, this line dereferences None for separated-label datasets like LAION/Cohere.

Please load separated scalar labels during dataset preparation whenever the case requires scalar-label materialization, not only when the active filter is StrEqual, and add a regression test for the separated-label path.

Collaborator Author

Fixed in d1b545b. DatasetManager.prepare() now accepts a with_scalar_labels signal and hydrates separated scalar_labels.parquet when either a StrEqual label filter or scalar-label payload materialization requires it. CaseRunner._pre_run() passes self.ca.with_scalar_labels into dataset preparation, so the default unfiltered CloudPayloadSearchCase(payload_profile="scalar_label") path loads dataset.scalar_labels before ConcurrentInsertRunner reads it.

Added regression coverage for both pieces:

  • test_dataset_prepare_loads_separated_scalar_labels_for_scalar_payload
  • test_pre_run_prepares_separated_scalar_labels_for_scalar_payload

Verification: python3.11 -m pytest tests/test_cloud_payload_search.py tests/test_cloud_insert_case.py -q -> 40 passed.

        for insert_data, tenant_label in zip(insert_datas, batch_tenant_labels, strict=True)
        if tenant_label == tenant
    ]
    self._multitenant_insert_counts[tenant] = self._multitenant_insert_counts.get(tenant, 0) + len(
Collaborator

Pinecone multitenant inserts still account partial tenant writes incorrectly.

This increments _multitenant_insert_counts before the tenant namespace upsert succeeds, while insert_count is advanced only after all tenant writes in the batch complete. If tenant A succeeds and tenant B fails, the method returns (0, error) even though tenant A was written, and _multitenant_insert_counts includes the failed tenant too.

Please mirror the TurboPuffer partial-insert handling here: update tenant counts only after successful writes and return a non-retryable partial failure with the actual inserted count and tenant details.

Collaborator Author

Fixed in 7d2a4d6. Pinecone multitenant inserts now update _multitenant_insert_counts only after a tenant namespace upsert succeeds. The insert path tracks actual inserted rows and successful tenants; if a later tenant upsert fails, it returns a non-retryable PartialInsertError with inserted_count, successful_tenants, failed_tenant, and failed_tenant_count, matching the TurboPuffer handling. This prevents the concurrent insert runner from retrying a partially written tenant batch and prevents readiness expectations from including the failed tenant write.

Added regression coverage in test_pinecone_multitenant_partial_insert_failure_is_explicit.

Verification: python3.11 -m pytest tests/test_pinecone_multitenant.py tests/test_cloud_insert_case.py::test_concurrent_insert_runner_does_not_retry_non_retryable_insert_errors -q -> 4 passed.

with self.db.init():
    status = self.db.poll_insert_readiness(count)
    searchable_started = time.perf_counter()
    while not status["fully_searchable"]:
Collaborator

Cloud insert readiness polling can hang forever here.

Both readiness loops poll without any deadline. A service-side indexing stall, an API/status mismatch, or incorrect expected-count state will leave the benchmark process stuck indefinitely after the insert phase has completed.

Please add a readiness timeout/deadline, preferably configurable from the cloud insert case, and include the last readiness status in the raised error so failed runs are diagnosable.
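
For illustration, a deadline-bounded polling loop might be shaped like the sketch below. `wait_until_ready` and its parameters are hypothetical; `poll` stands in for a status call such as the `poll_insert_readiness` shown in the diff, and `sleep` and `clock` are injectable only to keep the sketch testable.

```python
from __future__ import annotations

import time
from collections.abc import Callable


def wait_until_ready(
    poll: Callable[[], dict],
    key: str,
    timeout: float | None = None,
    poll_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll until status[key] is truthy, or raise TimeoutError with the last status."""
    # timeout=None keeps the unbounded behaviour; any number sets a deadline.
    deadline = None if timeout is None else clock() + timeout
    status = poll()
    while not status[key]:
        if deadline is not None and clock() >= deadline:
            raise TimeoutError(f"{key} not reached within {timeout}s; last_status={status}")
        sleep(poll_interval)
        status = poll()
    return status
```

Using a monotonic clock for the deadline avoids surprises if the wall clock is adjusted during a long run, and carrying the last status in the message makes a stalled run easier to diagnose.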

Collaborator Author

Fixed in 589b854. Cloud insert readiness polling now supports an explicit timeout/deadline and includes the last readiness status in the raised TimeoutError for diagnosis. The timeout is intentionally opt-in: CloudInsertCase.readiness_timeout defaults to None, preserving the previous behavior of waiting indefinitely unless the user sets --cloud-insert-readiness-timeout or CLOUD_INSERT_READINESS_TIMEOUT. The poll interval is also configurable via --cloud-insert-readiness-poll-interval / CLOUD_INSERT_READINESS_POLL_INTERVAL and defaults to the prior 5s interval.

Added regression coverage for:

  • default CloudInsertCase has no readiness timeout
  • CLI propagation of explicit timeout/poll interval
  • stalled readiness raises TimeoutError with last_status when a timeout is explicitly configured

Verification: python3.11 -m pytest tests/test_cloud_insert_case.py -q -> 34 passed.

from .config import TurboPufferConfig, TurboPufferIndexConfig

pin_target_namespace_count = len(target_namespaces_for_pinning(parameters)) if parameters["pin_namespace"] else 0
if parameters["pin_namespace"]:
Collaborator

--dry-run --pin-namespace still mutates remote TurboPuffer state.

This calls pin_namespaces_once() before run() sees dry_run, and pin_namespaces_once() PATCHes namespace metadata and waits for pinning. Since pinning reserves persistent service resources, a dry run can still create billing/state side effects.

Please skip pinning when parameters["dry_run"] is true, or move the pinning action behind the same dry-run guard used for benchmark execution. A CLI test should assert that dry-run with --pin-namespace does not call the metadata PATCH helper.
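
A sketch of the requested guard, assuming the pinning side effect can be passed in as a callable: `maybe_pin_namespaces` is a hypothetical helper, and `pin` stands in for whatever performs the metadata PATCH.

```python
from collections.abc import Callable
from typing import Any


def maybe_pin_namespaces(parameters: dict[str, Any], pin: Callable[[], None]) -> bool:
    """Run the pinning side effect only for real runs; return True if it ran."""
    if not parameters.get("pin_namespace"):
        return False
    if parameters.get("dry_run"):
        # A dry run must not touch remote state, so skip the PATCH entirely.
        return False
    pin()
    return True


# Usage: record calls instead of contacting a service.
calls: list[str] = []
dry_run_pinned = maybe_pin_namespaces({"pin_namespace": True, "dry_run": True}, lambda: calls.append("pin"))
real_run_pinned = maybe_pin_namespaces({"pin_namespace": True, "dry_run": False}, lambda: calls.append("pin"))
```

A CLI test along the lines the review asks for would assert that the recorded call list stays empty in the dry-run case.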

Collaborator

@XuanYang-cn XuanYang-cn left a comment

I found one additional required issue in the result read/write path.

These existing unresolved review threads still apply on the current head and should remain blockers: scalar-label payload loads with separated scalar_labels.parquet, Pinecone multitenant partial-insert accounting, cloud insert readiness polling without a timeout, and TurboPuffer --dry-run --pin-namespace mutating remote state before dry-run handling.

Comment thread vectordb_bench/models.py
def model_dump_for_output(self) -> dict:
    output = self.model_dump(mode="json", serialize_as_any=True)
    for idx, case_result in enumerate(self.results):
        output["results"][idx]["metrics"] = self._output_metrics_for_case(case_result)
Collaborator

This projection makes the new cloud result files unreadable by the existing result collector. ResultCollector.collect() calls TestResult.read_file(..., trans_unit=True) for every result file, and that conversion still indexes max_load_count, serial_latency_p99, and related legacy metric keys. Since the CloudInsert and CloudColdLatency branches now write only case-specific metrics, collecting a cloud insert result raises KeyError: 'max_load_count' before model validation fills defaults. Please either keep the legacy metric keys/defaults in the serialized output or make the trans_unit conversion case-aware, and add a regression that reads a cloud insert/cold-latency result through ResultCollector.collect().
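
The first option (keeping legacy keys readable) can be sketched as a defaults merge applied before any unit conversion. This is an assumption-laden illustration: `with_legacy_defaults`, the default values, and the `insert_qps` key are hypothetical; only `max_load_count` and `serial_latency_p99` are named in the comment above.

```python
from typing import Any

# Hypothetical defaults for legacy keys that older readers index directly.
LEGACY_METRIC_DEFAULTS: dict[str, Any] = {
    "max_load_count": 0,
    "serial_latency_p99": 0.0,
}


def with_legacy_defaults(metrics: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of metrics where every legacy key is present."""
    # Values from the case-specific metrics win over the defaults.
    return {**LEGACY_METRIC_DEFAULTS, **metrics}


# A cloud-insert style result with only case-specific metrics (key name is made up).
cloud_metrics = with_legacy_defaults({"insert_qps": 1200.0})
```

After the merge, code that reads the legacy keys by direct indexing no longer raises KeyError on case-specific result files.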

@jamesgao-jpg jamesgao-jpg changed the title from "[Feature] Add VectorDBBench Cloud Leaderboard benchmark cases and client support" to "feat: Add VectorDBBench Cloud Leaderboard benchmark cases and client support" May 27, 2026

4 participants