feat: Add VectorDBBench Cloud Leaderboard benchmark cases and client support#775
feat: Add VectorDBBench Cloud Leaderboard benchmark cases and client support#775jamesgao-jpg wants to merge 45 commits into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: jamesgao-jpg The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
e258154 to
f909086
Compare
…h-case # Conflicts: # vectordb_bench/backend/clients/milvus/milvus.py # vectordb_bench/backend/clients/turbopuffer/turbopuffer.py # vectordb_bench/backend/dataset.py # vectordb_bench/backend/runner/serial_runner.py
f909086 to
c943513
Compare
…lution # Conflicts: # tests/test_milvus.py
|
|
||
| @property | ||
| def with_scalar_labels(self) -> bool: | ||
| return self.filters.type == FilterOp.StrEqual |
There was a problem hiding this comment.
This looks like a correctness issue: with_scalar_labels is currently derived only from FilterOp.StrEqual, but PayloadProfile.SCALAR_LABEL also requires scalar label data/schema even when there is no label filter.
For example, CloudPayloadSearchCase(payload_profile="scalar_label") uses NonFilter, so with_scalar_labels becomes false. The load path then does not create/load label data, while the search path still requests the label payload via output_fields / metadata / attributes. That means the benchmark may either fail at query time or measure a run where the requested scalar payload was never actually stored.
The multitenant path has a related issue: tenant routing labels and scalar payload/filter labels are treated as the same scalar-label concept in some insert paths. These should be separate fields/metadata concepts so a multitenant case can write both tenant routing information and the scalar payload label.
Suggested fix: make scalar-label materialization depend on both label filtering and payload profile, e.g. filters.type == FilterOp.StrEqual or payload_profile == PayloadProfile.SCALAR_LABEL, and keep tenant routing labels separate from scalar payload labels in the provider insert schemas.
There was a problem hiding this comment.
Fixed in 9343d01.
Changes made:
with_scalar_labelsnow includesPayloadProfile.SCALAR_LABEL, so scalar-label payload runs materialize label schema/data even without a label filter.ConcurrentInsertRunnernow loads scalar labels when scalar-label materialization is required, not only forFilterOp.StrEqual.- Milvus/Zilliz, Pinecone, and turbopuffer insert paths now keep tenant routing labels separate from scalar payload labels.
- Added regression coverage for unfiltered scalar-label payload loads and multitenant scalar payload inserts.
- Updated
docs/release/2026-05-cloud-leaderboard.mdto document the scalar-label and tenant-routing semantics.
Verification:
.venv/bin/python -m pytest tests/test_cloud_payload_search.py tests/test_cloud_insert_case.py tests/test_multitenant_case.py tests/test_pinecone_multitenant.py tests/test_milvus.py -q -k 'not test_performance_1536d_50k'->55 passed, 1 deselected.venv/bin/python -m ruff checkon touched production files ->All checks passed
| "label_percentage": parameters["label_percentage"], | ||
| } | ||
| elif parameters["case_type"] == "CloudPayloadSearchCase": | ||
| custom_case_config = { |
There was a problem hiding this comment.
This looks like a default/config propagation issue for the new cloud cases.
CloudPayloadSearchCase does not appear to receive dataset_with_size_type from the CLI custom case config, so a user-selected dataset can be silently ignored and the case falls back to its constructor default dataset. That would make the benchmark run against a different dataset than the CLI/config implies.
There is also an inconsistency in how missing defaults are handled across the cloud cases. CloudMultiTenantSearchCase receives parameters["dataset_with_size_type"] directly, which may be None, but the constructor later converts non-DatasetWithSizeType values with DatasetWithSizeType(dataset_with_size_type). If None is passed through, the default CLI path can fail during case construction instead of using the case default.
Suggested fix: normalize optional CLI values before building custom_case_config. Either omit dataset_with_size_type when it is not explicitly provided, or pass a concrete default consistently. Also add CLI regression tests for omitted dataset/default behavior for CloudPayloadSearchCase, CloudMultiTenantSearchCase, and CloudInsertCase.
There was a problem hiding this comment.
Fixed in 552155c.
I verified the issue was real on the PR branch:
CloudPayloadSearchCasedropped an explicit CLIdataset_with_size_type, so the case fell back to LAION 100M.CloudMultiTenantSearchCasepasseddataset_with_size_type=None, which failed construction instead of using the case default.
Changes made:
CloudPayloadSearchCasenow propagates an explicitly selected dataset and omits the field when absent, preserving its LAION 100M constructor default.CloudMultiTenantSearchCasenow omits absent dataset config, preserving itsCohereLargeconstructor default instead of passingNone.CloudInsertCasekeeps the existing CLI default of Medium Cohere when omitted.- Added CLI regression coverage for payload explicit/omitted dataset behavior, multitenant omitted dataset behavior, and insert omitted/default behavior.
Verification:
.venv/bin/python -m pytest tests/test_cloud_payload_search.py tests/test_cloud_payload_case.py tests/test_cloud_insert_case.py tests/test_cloud_cold_latency_case.py tests/test_multitenant_case.py tests/test_pinecone_multitenant.py tests/test_milvus.py -q -k 'not test_performance_1536d_50k'-> 82 passed, 1 deselected.venv/bin/python -m ruff check vectordb_bench/cli/cli.py-> passed
| ) | ||
| except Exception as e: | ||
| log.warning(f"Failed to insert. Error: {e}") | ||
| return 0, e |
There was a problem hiding this comment.
This looks like a partial-failure accounting issue in the multitenant insert path.
When tenant_labels_data is present, this method writes one tenant namespace at a time and increments inserted after each successful tenant write. However, if a later tenant write fails, the outer except returns 0, e even though earlier tenant namespaces may already have been written successfully.
That makes the caller's retry/statistics semantics misleading: the benchmark reports zero inserted rows for a batch that may have partially modified remote state, and a retry may re-write tenants that already succeeded.
Suggested fix: preserve the actual partial insert count on failure, or make partial tenant-write failures explicit/non-retryable with details about which tenants/counts were already written. A regression test where the first tenant write succeeds and the second tenant write fails would cover this behavior.
There was a problem hiding this comment.
Fixed in c1164ce.
I verified the concern was real: TurboPuffer's multitenant insert path could successfully write one tenant namespace, then return (0, error) if a later tenant write failed. The concurrent insert runner also retried any returned error, so this could re-write tenants that had already succeeded.
Changes made:
- Added explicit non-retryable partial-insert error metadata:
inserted_count,successful_tenants,failed_tenant, andfailed_tenant_count. - TurboPuffer multitenant insert now returns the actual partial count and that detailed non-retryable error when a tenant write fails after prior tenant writes.
ConcurrentInsertRunnernow stops immediately on non-retryable insert errors instead of retrying the whole batch.- Added regression coverage for first-tenant-success/second-tenant-failure and for runner non-retry behavior.
Verification:
.venv/bin/python -m pytest tests/test_cloud_payload_search.py tests/test_cloud_payload_case.py tests/test_cloud_insert_case.py tests/test_cloud_cold_latency_case.py tests/test_multitenant_case.py tests/test_pinecone_multitenant.py tests/test_milvus.py -q -k 'not test_performance_1536d_50k'-> 84 passed, 1 deselected.venv/bin/python -m ruff check vectordb_bench/backend/clients/api.py vectordb_bench/backend/clients/turbopuffer/turbopuffer.py vectordb_bench/backend/runner/concurrent_runner.py-> passed
XuanYang-cn
left a comment
There was a problem hiding this comment.
Requesting changes. I left the blocking findings inline.
| label_field = self.filters.label_field if self.filters.type == FilterOp.StrEqual else "labels" | ||
| if self.dataset.data.scalar_labels_file_separated: | ||
| labels_data = self.dataset.scalar_labels[self.filters.label_field][all_metadata].to_list() | ||
| labels_data = self.dataset.scalar_labels[label_field][all_metadata].to_list() |
There was a problem hiding this comment.
Scalar-label payload runs still fail for datasets whose labels live in scalar_labels.parquet.
CloudPayloadSearchCase(payload_profile="scalar_label") sets with_scalar_labels=True, but DatasetManager.prepare() only hydrates dataset.scalar_labels when the active filter is StrEqual. For the default unfiltered scalar-label payload case, this line dereferences None for separated-label datasets like LAION/Cohere.
Please load separated scalar labels during dataset preparation whenever the case requires scalar-label materialization, not only when the active filter is StrEqual, and add a regression test for the separated-label path.
There was a problem hiding this comment.
Fixed in d1b545b. DatasetManager.prepare() now accepts a with_scalar_labels signal and hydrates separated scalar_labels.parquet when either a StrEqual label filter or scalar-label payload materialization requires it. CaseRunner._pre_run() passes self.ca.with_scalar_labels into dataset preparation, so the default unfiltered CloudPayloadSearchCase(payload_profile="scalar_label") path loads dataset.scalar_labels before ConcurrentInsertRunner reads it.
Added regression coverage for both pieces:
test_dataset_prepare_loads_separated_scalar_labels_for_scalar_payloadtest_pre_run_prepares_separated_scalar_labels_for_scalar_payload
Verification: python3.11 -m pytest tests/test_cloud_payload_search.py tests/test_cloud_insert_case.py -q -> 40 passed.
| for insert_data, tenant_label in zip(insert_datas, batch_tenant_labels, strict=True) | ||
| if tenant_label == tenant | ||
| ] | ||
| self._multitenant_insert_counts[tenant] = self._multitenant_insert_counts.get(tenant, 0) + len( |
There was a problem hiding this comment.
Pinecone multitenant inserts still account partial tenant writes incorrectly.
This increments _multitenant_insert_counts before the tenant namespace upsert succeeds, while insert_count is advanced only after all tenant writes in the batch complete. If tenant A succeeds and tenant B fails, the method returns (0, error) even though tenant A was written, and _multitenant_insert_counts includes the failed tenant too.
Please mirror the TurboPuffer partial-insert handling here: update tenant counts only after successful writes and return a non-retryable partial failure with the actual inserted count and tenant details.
There was a problem hiding this comment.
Fixed in 7d2a4d6. Pinecone multitenant inserts now update _multitenant_insert_counts only after a tenant namespace upsert succeeds. The insert path tracks actual inserted rows and successful tenants; if a later tenant upsert fails, it returns a non-retryable PartialInsertError with inserted_count, successful_tenants, failed_tenant, and failed_tenant_count, matching the TurboPuffer handling. This prevents the concurrent insert runner from retrying a partially written tenant batch and prevents readiness expectations from including the failed tenant write.
Added regression coverage in test_pinecone_multitenant_partial_insert_failure_is_explicit.
Verification: python3.11 -m pytest tests/test_pinecone_multitenant.py tests/test_cloud_insert_case.py::test_concurrent_insert_runner_does_not_retry_non_retryable_insert_errors -q -> 4 passed.
| with self.db.init(): | ||
| status = self.db.poll_insert_readiness(count) | ||
| searchable_started = time.perf_counter() | ||
| while not status["fully_searchable"]: |
There was a problem hiding this comment.
Cloud insert readiness polling can hang forever here.
Both readiness loops poll without any deadline. A service-side indexing stall, an API/status mismatch, or incorrect expected-count state will leave the benchmark process stuck indefinitely after the insert phase has completed.
Please add a readiness timeout/deadline, preferably configurable from the cloud insert case, and include the last readiness status in the raised error so failed runs are diagnosable.
There was a problem hiding this comment.
Fixed in 589b854. Cloud insert readiness polling now supports an explicit timeout/deadline and includes the last readiness status in the raised TimeoutError for diagnosis. The timeout is intentionally opt-in: CloudInsertCase.readiness_timeout defaults to None, preserving the previous behavior of waiting indefinitely unless the user sets --cloud-insert-readiness-timeout or CLOUD_INSERT_READINESS_TIMEOUT. The poll interval is also configurable via --cloud-insert-readiness-poll-interval / CLOUD_INSERT_READINESS_POLL_INTERVAL and defaults to the prior 5s interval.
Added regression coverage for:
- default
CloudInsertCasehas no readiness timeout - CLI propagation of explicit timeout/poll interval
- stalled readiness raises
TimeoutErrorwithlast_statuswhen a timeout is explicitly configured
Verification: python3.11 -m pytest tests/test_cloud_insert_case.py -q -> 34 passed.
| from .config import TurboPufferConfig, TurboPufferIndexConfig | ||
|
|
||
| pin_target_namespace_count = len(target_namespaces_for_pinning(parameters)) if parameters["pin_namespace"] else 0 | ||
| if parameters["pin_namespace"]: |
There was a problem hiding this comment.
--dry-run --pin-namespace still mutates remote TurboPuffer state.
This calls pin_namespaces_once() before run() sees dry_run, and pin_namespaces_once() PATCHes namespace metadata and waits for pinning. Since pinning reserves persistent service resources, a dry run can still create billing/state side effects.
Please skip pinning when parameters["dry_run"] is true, or move the pinning action behind the same dry-run guard used for benchmark execution. A CLI test should assert that dry-run with --pin-namespace does not call the metadata PATCH helper.
XuanYang-cn
left a comment
There was a problem hiding this comment.
I found one additional required issue in the result read/write path.
These existing unresolved review threads still apply on the current head and should remain blockers: scalar-label payload loads with separated scalar_labels.parquet, Pinecone multitenant partial-insert accounting, cloud insert readiness polling without a timeout, and TurboPuffer --dry-run --pin-namespace mutating remote state before dry-run handling.
| def model_dump_for_output(self) -> dict: | ||
| output = self.model_dump(mode="json", serialize_as_any=True) | ||
| for idx, case_result in enumerate(self.results): | ||
| output["results"][idx]["metrics"] = self._output_metrics_for_case(case_result) |
There was a problem hiding this comment.
This projection makes the new cloud result files unreadable by the existing result collector. ResultCollector.collect() calls TestResult.read_file(..., trans_unit=True) for every result file, and that conversion still indexes max_load_count, serial_latency_p99, and related legacy metric keys. Since the CloudInsert and CloudColdLatency branches now write only case-specific metrics, collecting a cloud insert result raises KeyError: 'max_load_count' before model validation fills defaults. Please either keep the legacy metric keys/defaults in the serialized output or make the trans_unit conversion case-aware, and add a regression that reads a cloud insert/cold-latency result through ResultCollector.collect().
Summary
This PR adds the VectorDBBench Cloud Leaderboard benchmark surface. The goal is to complement the existing raw-performance leaderboard with cloud-oriented cases that capture production behaviors managed vector database users care about: ingest readiness, payload-aware search, tenant-shaped traffic, cold latency, and cost-aware interpretation.
What is added
New Cloud Leaderboard cases
CloudPayloadSearchCase: measures search performance with explicit response payload profiles:ids_only,scalar_label, andvector. It supports unfiltered search, integer-filter search, and scalar-label filter search.CloudInsertCase: measures insert throughput and separates client insert completion from downstream readiness signals such as fully searchable and fully indexed.CloudColdLatencyCase: measures cold and warm serial latency so first-query and cache-sensitive serving behavior are visible instead of hidden by warm concurrent loops.CloudMultiTenantSearchCase: models SaaS-style tenant-routed workloads with deterministic tenant assignment and tenant-aware query routing.Runtime and metric plumbing
Client support
CLI, frontend, and docs
docs/release/2026-05-cloud-leaderboard.md.README.mdto mention the new Cloud Leaderboard benchmark cases and link to the release note.Notes
Test Plan
.venv/bin/python -m pytest tests/test_cloud_payload_search.py tests/test_cloud_payload_case.py tests/test_cloud_insert_case.py tests/test_cloud_cold_latency_case.py tests/test_multitenant_case.py tests/test_pinecone_multitenant.py tests/test_milvus.py -q -k 'not test_performance_1536d_50k'.venv/bin/python -m ruff check vectordb_bench/backend/clients/api.py vectordb_bench/backend/clients/turbopuffer/turbopuffer.py vectordb_bench/backend/runner/concurrent_runner.py vectordb_bench/cli/cli.py