[Feature] Add happy-path execution tests for vectorSearch()#5366
Open
mengweieric wants to merge 42 commits intoopensearch-project:feature/vector-search-p0from
Open
[Feature] Add happy-path execution tests for vectorSearch()#5366mengweieric wants to merge 42 commits intoopensearch-project:feature/vector-search-p0from
mengweieric wants to merge 42 commits intoopensearch-project:feature/vector-search-p0from
Conversation
…ch-project#5318) * [Feature] Add table function relation to SQL grammar for vectorSearch() Add table function relation support to the SQL parser: - New `tableFunctionRelation` alternative in `relation` grammar rule - Named argument syntax: `key=value` (e.g., table='index', field='vec') - Alias is required by grammar (FROM func(...) AS alias) - AstBuilder emits existing TableFunction + SubqueryAlias AST nodes - 3 parser unit tests: basic parse, with WHERE/ORDER BY/LIMIT, alias required This is a pure grammar change — no execution support yet. Queries will parse successfully but fail at the Analyzer with "unsupported function". Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Address review feedback on table function relation grammar 1. Canonicalize argument names at parser boundary: unquoteIdentifier + toLowerCase(Locale.ROOT) in visitTableFunctionRelation so FIELD='x' and `field`='x' both produce argName="field" 2. Make AS keyword optional (AS? alias) for consistency with tableAsRelation and subqueryAsRelation grammar rules 3. Strengthen test coverage: - Full structural AST assertion for WHERE + ORDER BY + LIMIT (verifies Sort, Limit, Filter nodes, not just toString) - Argument reorder test proves names resolve by name not position - Case canonicalization test (TABLE= → table=) - Alias-without-AS test (FROM func(...) v) Signed-off-by: Eric Wei <mengwei.eric@gmail.com> * Apply spotless formatting Signed-off-by: Eric Wei <mengwei.eric@gmail.com> --------- Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Maps knn_vector fields to ExprCoreType.ARRAY so they appear in DESCRIBE output and can be referenced in projections. This is a visibility shim — not a full vector type. Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
VectorSearchIndex.createScanBuilder() needs to construct an OpenSearchIndexScanBuilder with a custom VectorSearchQueryBuilder delegate. The existing constructor was protected (test-only). Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Introduces the core execution pipeline for vectorsearch(): - VectorSearchTableFunctionResolver: registers vectorsearch with 4 STRING args - VectorSearchTableFunctionImplementation: parses named args, vector literal, options string, validates search mode (k/max_distance/min_score) - VectorSearchIndex: extends OpenSearchIndex with knn query seeding, score tracking, and WrapperQueryBuilder DSL construction - VectorSearchQueryBuilder: keeps knn in must (scoring) context, WHERE filters in filter (non-scoring) context Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Override getFunctions() to expose vectorsearch() table function to the query analysis pipeline. Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Verifies knn query is placed in scoring (must) context, not wrapped in bool.filter when no WHERE clause is present. Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- Add pushDownFilter() unit test asserting knn stays in bool.must (scoring) and WHERE predicate goes to bool.filter (non-scoring) - Add option key allowlist (k, max_distance, min_score) to reject unknown/unsupported keys before they reach DSL generation - Add field name validation to reject characters that could corrupt the WrapperQueryBuilder JSON (allows alphanumeric, dots, underscores, hyphens) - Add named-arg type guard to reject non-NamedArgumentExpression args early with a clear error message Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Parse k as integer, max_distance and min_score as double before they reach buildKnnQuery(). Rejects non-numeric and non-finite values with clear errors. This closes the residual JSON-injection path through option values without requiring full XContent migration. Also fixes toString() to be consistent with the named-arg guard (no longer blindly casts to NamedArgumentExpression). Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- parseOptions: reject malformed segments and duplicate keys - parseVector: wrap errors in ExpressionEvaluationException, reject non-finite floats (Infinity, NaN) - VectorSearchIndex: default requestedTotalSize to k via pushDownLimitToRequestTotal so queries without LIMIT return k results - Add 5 new tests: malformed option, duplicate key, empty vector, malformed vector component, non-finite vector component Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- validateNamedArgs() now rejects null/empty arg names defensively, closing a potential NPE if the shared table-function path is later wired into PPL - OpenSearchStorageEngineTest uses contains-check instead of exact collection size assertion - Add testNullArgNameThrows test Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- Remove unused VECTOR_OPTION constant from VectorSearchIndex - Clarify buildKnnQuery() comment: quoted fallback is for forward compatibility, all P0 values are already canonicalized as numeric - Rename testMissingSearchModeOptionThrows to testUnknownOptionKeyOnlyThrows to match what it actually tests Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- Enforce exactly one of k, max_distance, or min_score - Validate k is in [1, 10000] range - Add 6 tests: mutual exclusivity (3 combos), k too small, k too large, k boundary values (1 and 10000) Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
VectorSearchQueryBuilder now accepts options map and rejects pushDownLimit when LIMIT exceeds k. Radial modes (max_distance, min_score) have no LIMIT restriction. Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- Create VectorSearchIndexTest: 7 tests covering buildKnnQueryJson() for top-k, max_distance, min_score, nested fields, multi-element and single-element vectors, numeric option rendering - Add edge case tests to VectorSearchTableFunctionImplementationTest: NaN vector component, empty option key/value, negative k, NaN for max_distance and min_score (6 new tests) - Add VectorSearchQueryBuilderTest: min_score radial mode LIMIT, pushDownSort delegation to parent (2 new tests) - Extract buildKnnQueryJson() as package-private for direct testing Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Test too-many (5) and zero arguments paths in VectorSearchTableFunctionResolver to complement existing too-few (2) test. Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- Cap radial mode (max_distance/min_score) results at maxResultWindow to prevent unbounded result sets - Reject ORDER BY on non-_score fields and _score ASC in vectorSearch since knn results are naturally sorted by _score DESC - Add 12 integration tests: 4 _explain DSL shape verification tests and 8 validation error path tests Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- Add multi-sort expression test: ORDER BY _score DESC, name ASC correctly rejects the non-_score field (VectorSearchQueryBuilderTest) - Add case-insensitive argument name lookup test to verify TABLE='x' resolves same as table='x' (Implementation test) - Add non-numeric option fallback test: verifies string options are quoted in JSON output (VectorSearchIndexTest) - Add 4 integration tests: ORDER BY _score DESC succeeds, ORDER BY non-score rejects, ORDER BY _score ASC rejects, LIMIT within k succeeds (VectorSearchIT, now 16 tests) Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
The base OpenSearchIndexScanQueryBuilder.pushDownSort() pushes sort.getCount() as a limit when non-zero. Our override validated _score DESC and returned true, but did not preserve this contract. SQL always sets count=0, so this was not reachable today, but PPL or future callers may set a non-zero count to combine sort+limit in one LogicalSort node. Preserve the behavior defensively. Add focused test: LogicalSort(count=7) with _score DESC verifies the count is pushed down as request size. Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- Unit test: compound AND predicate survives pushdown into bool.filter - Integration test: compound WHERE (term + range) produces bool query - Integration test: radial max_distance with WHERE produces bool query Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
pushDownSort() called requestBuilder.pushDownLimit() directly, bypassing the LIMIT > k guard in pushDownLimit(). Extract validateLimitWithinK() helper and call it from both paths so the invariant holds when PPL or future callers set a non-zero sort count. Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Move all explainQuery()-based DSL shape tests into a dedicated VectorSearchExplainIT suite. VectorSearchIT now contains only validation and error-path tests. Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
…SearchIndex Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
…on in VectorSearchQueryBuilder Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
…fficient mode Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
…matting Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Reject construction of VectorSearchQueryBuilder with FilterType.EFFICIENT and a null rebuildKnnWithFilter callback at construction time instead of deferring to an NPE in pushDownFilter. Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
… test, rename constructor comment - Reword filter_type error message to be user-friendly and actionable (no longer leaks internal ScriptQueryUnSupportedException text) - Strengthen efficient-mode explain IT: assert no bool/must (proves not post-filter shape), decode base64 knn payload to verify filter and predicate field are embedded inside knn query - Rename "Backward-compatible constructor" to clarify intent Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
…are-aggregate ITs; drop subquery workaround text Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
The explain suite previously relied on substring checks (contains("wrapper"),
contains("bool"), contains("filter")) that could false-positive across distinct
push-down shapes — notably, post-filter vs. efficient-mode filter_type output
look similar at the substring level.
Add shared helpers to (1) decode every base64 WrapperQueryBuilder payload into
its inner k-NN JSON and (2) extract and unescape the outer sourceBuilder JSON
from explain output. Strengthen the top-k, radial (max_distance, min_score),
post-filter, compound-predicate, radial-with-WHERE, and filter_type=post tests
to assert both the outer bool/must/filter shape and that WHERE predicates
stay out of the inner k-NN payload. Refactor the efficient-mode test onto the
same helpers so its no-outer-bool/must guard is verified against the unescaped
sourceBuilder JSON rather than accidentally passing on escaped output.
Leave the ORDER BY _score DESC and LIMIT <= k smoke tests untouched — those
are guard-rail "does planning succeed" checks where shape is covered elsewhere.
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
…R BY
- Add assertFalse(contains("\"bool\"")) on the outer sourceBuilder JSON for the
plain top-k and radial (max_distance, min_score) cases so a regression that
wrapped the knn in an outer bool while preserving the inner payload is
caught.
- Document the SOURCE_BUILDER_JSON helper's test-only coupling to the
surrounding "sourceBuilder=...", "pitId=" tokens in the request-string
toString() output — if that format changes, update the regex anchors.
- Strengthen testEfficientFilterWithOrderByScoreDescSucceeds onto the same
helpers as testExplainFilterTypeEfficientProducesKnnWithFilter: verify no
outer bool/must in the sourceBuilder JSON and that the WHERE predicate is
embedded inside the decoded knn payload.
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
VectorSearchIT covered only rejection paths; this adds positive end-to-end coverage for top-k, POST-filter, EFFICIENT-filter, and both radial modes (max_distance / min_score). The k-NN plugin is not provisioned by the default integ-test cluster, so each test guards init() with Assume.assumeTrue(isKnnPluginInstalled()) — tests skip cleanly when k-NN is absent and run when it is (e.g. locally via scripts/setup-knn-local.sh). Provisioning k-NN in CI is a separate follow-up tracked outside this PR. Test data is a 6-doc 2D knn_vector index with two well-separated clusters so filter correctness is assertable by document id. Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Strengthen the happy-path IT so the efficient-filter test actually
discriminates EFFICIENT from POST and so the other tests assert
exact id sets instead of "at least one row, all in expected range".
- Pin the test index to Lucene HNSW + L2 so efficient filtering is
deterministic (k-NN only supports efficient filtering on lucene+hnsw
and faiss+hnsw/ivf) and the L2 -> 1/(1+d) scoring used by min_score
is well-defined.
- Rework the efficient-filter test to query near TX with k=3 and
WHERE state='CA'. A POST-filter implementation would return 0 rows
here (the 3 nearest candidates are all TX and get filtered out), so
the assertion on exactly {4,5,6} is discriminative between modes.
- Tighten POST and radial tests to exact id-set assertions.
- Drop the scripts/setup-knn-local.sh reference in the class javadoc
(the helper is local-only working-tree tooling, not a tracked script).
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
9aa62fc to
fa444fe
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
VectorSearchExecutionITwith end-to-end execution coverage for top-k, POST-filter, EFFICIENT-filter, and both radial modes (max_distance/min_score).VectorSearchITalready covers rejection paths; this adds conditional happy-path coverage that runs locally against a k-NN-enabled cluster.init()withAssume.assumeTrue(isKnnPluginInstalled()). Tests skip cleanly when k-NN is absent and run when it is. Ongoing CI regression protection depends on a separate follow-up to provision k-NN in the integ-test cluster.knn_vectorindex with two well-separated clusters, pinned to Lucene HNSW + L2 so efficient filtering is deterministic and L2 scoring is well-defined.What the tests assert
WHERE state='TX'withk=10returns exactly the 3 near-cluster docs{1,2,3}(proves the filter is applied, not just vector similarity).WHERE state='CA'withk=3,filter_type=efficientreturns exactly the 3 CA docs{4,5,6}. This is discriminative: a POST-filter implementation atk=3would return 0 rows (all 3 nearest candidates are TX and get filtered out), so only an efficient-filter path that counts matching neighbors towardkproduces this result.max_distance— returns exactly{1,2,3}(cluster A within L2 distance 1.0).min_score— returned docs satisfy the score floor and match exactly{1,2,3}.Follow-up
Provisioning k-NN in integ-test CI is a separate change tracked outside this PR; once k-NN is on the test cluster the skip becomes a no-op and these tests light up automatically.
Test plan
./gradlew :integ-test:integTest --tests "org.opensearch.sql.sql.VectorSearchExecutionIT"— 5 skipped on default cluster (no k-NN), 0 failures./gradlew spotlessCheck