Skip to content

[Feature] Add happy-path execution tests for vectorSearch()#5366

Open
mengweieric wants to merge 42 commits intoopensearch-project:feature/vector-search-p0from
mengweieric:pr/E-execution-its
Open

[Feature] Add happy-path execution tests for vectorSearch()#5366
mengweieric wants to merge 42 commits intoopensearch-project:feature/vector-search-p0from
mengweieric:pr/E-execution-its

Conversation

@mengweieric
Copy link
Copy Markdown
Collaborator

@mengweieric mengweieric commented Apr 17, 2026

Summary

  • Adds VectorSearchExecutionIT with end-to-end execution coverage for top-k, POST-filter, EFFICIENT-filter, and both radial modes (max_distance / min_score). VectorSearchIT already covers rejection paths; this adds conditional happy-path coverage that runs locally against a k-NN-enabled cluster.
  • The k-NN plugin is not provisioned by the default integ-test cluster, so each test guards init() with Assume.assumeTrue(isKnnPluginInstalled()). Tests skip cleanly when k-NN is absent and run when it is. Ongoing CI regression protection depends on a separate follow-up to provision k-NN in the integ-test cluster.
  • Test fixture is a 6-doc 2D knn_vector index with two well-separated clusters, pinned to Lucene HNSW + L2 so efficient filtering is deterministic and L2 scoring is well-defined.

What the tests assert

  • Top-k — all 3 returned docs come from the near cluster; scores are non-increasing.
  • POST filter — vector from far cluster + WHERE state='TX' with k=10 returns exactly the 3 near-cluster docs {1,2,3} (proves the filter is applied, not just vector similarity).
  • EFFICIENT filter — vector near TX + WHERE state='CA' with k=3,filter_type=efficient returns exactly the 3 CA docs {4,5,6}. This is discriminative: a POST-filter implementation at k=3 would return 0 rows (all 3 nearest candidates are TX and get filtered out), so only an efficient-filter path that counts matching neighbors toward k produces this result.
  • Radial max_distance — returns exactly {1,2,3} (cluster A within L2 distance 1.0).
  • Radial min_score — returned docs satisfy the score floor and match exactly {1,2,3}.

Follow-up

Provisioning k-NN in integ-test CI is a separate change tracked outside this PR; once k-NN is on the test cluster the skip becomes a no-op and these tests light up automatically.

Test plan

  • ./gradlew :integ-test:integTest --tests "org.opensearch.sql.sql.VectorSearchExecutionIT" — 5 skipped on default cluster (no k-NN), 0 failures
  • ./gradlew spotlessCheck
  • Run locally with k-NN installed to confirm all 5 tests pass

mengweieric and others added 30 commits April 7, 2026 15:30
…ch-project#5318)

* [Feature] Add table function relation to SQL grammar for vectorSearch()

Add table function relation support to the SQL parser:
- New `tableFunctionRelation` alternative in `relation` grammar rule
- Named argument syntax: `key=value` (e.g., table='index', field='vec')
- Alias is required by grammar (FROM func(...) AS alias)
- AstBuilder emits existing TableFunction + SubqueryAlias AST nodes
- 3 parser unit tests: basic parse, with WHERE/ORDER BY/LIMIT, alias required

This is a pure grammar change — no execution support yet. Queries will
parse successfully but fail at the Analyzer with "unsupported function".

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Address review feedback on table function relation grammar

1. Canonicalize argument names at parser boundary:
   unquoteIdentifier + toLowerCase(Locale.ROOT) in visitTableFunctionRelation
   so FIELD='x' and `field`='x' both produce argName="field"

2. Make AS keyword optional (AS? alias) for consistency with
   tableAsRelation and subqueryAsRelation grammar rules

3. Strengthen test coverage:
   - Full structural AST assertion for WHERE + ORDER BY + LIMIT
     (verifies Sort, Limit, Filter nodes, not just toString)
   - Argument reorder test proves names resolve by name not position
   - Case canonicalization test (TABLE= → table=)
   - Alias-without-AS test (FROM func(...) v)

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

* Apply spotless formatting

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>

---------

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Maps knn_vector fields to ExprCoreType.ARRAY so they appear in
DESCRIBE output and can be referenced in projections. This is a
visibility shim — not a full vector type.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
VectorSearchIndex.createScanBuilder() needs to construct an
OpenSearchIndexScanBuilder with a custom VectorSearchQueryBuilder
delegate. The existing constructor was protected (test-only).

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Introduces the core execution pipeline for vectorsearch():
- VectorSearchTableFunctionResolver: registers vectorsearch with 4 STRING args
- VectorSearchTableFunctionImplementation: parses named args, vector literal,
  options string, validates search mode (k/max_distance/min_score)
- VectorSearchIndex: extends OpenSearchIndex with knn query seeding,
  score tracking, and WrapperQueryBuilder DSL construction
- VectorSearchQueryBuilder: keeps knn in must (scoring) context,
  WHERE filters in filter (non-scoring) context

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Override getFunctions() to expose vectorsearch() table function
to the query analysis pipeline.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Verifies knn query is placed in scoring (must) context, not wrapped
in bool.filter when no WHERE clause is present.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- Add pushDownFilter() unit test asserting knn stays in bool.must
  (scoring) and WHERE predicate goes to bool.filter (non-scoring)
- Add option key allowlist (k, max_distance, min_score) to reject
  unknown/unsupported keys before they reach DSL generation
- Add field name validation to reject characters that could corrupt
  the WrapperQueryBuilder JSON (allows alphanumeric, dots, underscores,
  hyphens)
- Add named-arg type guard to reject non-NamedArgumentExpression args
  early with a clear error message

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Parse k as integer, max_distance and min_score as double before they
reach buildKnnQuery(). Rejects non-numeric and non-finite values with
clear errors. This closes the residual JSON-injection path through
option values without requiring full XContent migration.

Also fixes toString() to be consistent with the named-arg guard
(no longer blindly casts to NamedArgumentExpression).

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- parseOptions: reject malformed segments and duplicate keys
- parseVector: wrap errors in ExpressionEvaluationException, reject
  non-finite floats (Infinity, NaN)
- VectorSearchIndex: default requestedTotalSize to k via
  pushDownLimitToRequestTotal so queries without LIMIT return k results
- Add 5 new tests: malformed option, duplicate key, empty vector,
  malformed vector component, non-finite vector component

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- validateNamedArgs() now rejects null/empty arg names defensively,
  closing a potential NPE if the shared table-function path is later
  wired into PPL
- OpenSearchStorageEngineTest uses contains-check instead of exact
  collection size assertion
- Add testNullArgNameThrows test

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- Remove unused VECTOR_OPTION constant from VectorSearchIndex
- Clarify buildKnnQuery() comment: quoted fallback is for forward
  compatibility, all P0 values are already canonicalized as numeric
- Rename testMissingSearchModeOptionThrows to
  testUnknownOptionKeyOnlyThrows to match what it actually tests

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- Enforce exactly one of k, max_distance, or min_score
- Validate k is in [1, 10000] range
- Add 6 tests: mutual exclusivity (3 combos), k too small, k too
  large, k boundary values (1 and 10000)

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
VectorSearchQueryBuilder now accepts options map and rejects
pushDownLimit when LIMIT exceeds k. Radial modes (max_distance,
min_score) have no LIMIT restriction.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- Create VectorSearchIndexTest: 7 tests covering buildKnnQueryJson()
  for top-k, max_distance, min_score, nested fields, multi-element
  and single-element vectors, numeric option rendering
- Add edge case tests to VectorSearchTableFunctionImplementationTest:
  NaN vector component, empty option key/value, negative k, NaN for
  max_distance and min_score (6 new tests)
- Add VectorSearchQueryBuilderTest: min_score radial mode LIMIT,
  pushDownSort delegation to parent (2 new tests)
- Extract buildKnnQueryJson() as package-private for direct testing

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Test too-many (5) and zero arguments paths in
VectorSearchTableFunctionResolver to complement existing
too-few (2) test.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- Cap radial mode (max_distance/min_score) results at maxResultWindow
  to prevent unbounded result sets
- Reject ORDER BY on non-_score fields and _score ASC in vectorSearch
  since knn results are naturally sorted by _score DESC
- Add 12 integration tests: 4 _explain DSL shape verification tests
  and 8 validation error path tests

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- Add multi-sort expression test: ORDER BY _score DESC, name ASC
  correctly rejects the non-_score field (VectorSearchQueryBuilderTest)
- Add case-insensitive argument name lookup test to verify
  TABLE='x' resolves same as table='x' (Implementation test)
- Add non-numeric option fallback test: verifies string options
  are quoted in JSON output (VectorSearchIndexTest)
- Add 4 integration tests: ORDER BY _score DESC succeeds,
  ORDER BY non-score rejects, ORDER BY _score ASC rejects,
  LIMIT within k succeeds (VectorSearchIT, now 16 tests)

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
The base OpenSearchIndexScanQueryBuilder.pushDownSort() pushes
sort.getCount() as a limit when non-zero. Our override validated
_score DESC and returned true, but did not preserve this contract.

SQL always sets count=0, so this was not reachable today, but PPL
or future callers may set a non-zero count to combine sort+limit
in one LogicalSort node. Preserve the behavior defensively.

Add focused test: LogicalSort(count=7) with _score DESC verifies
the count is pushed down as request size.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- Unit test: compound AND predicate survives pushdown into bool.filter
- Integration test: compound WHERE (term + range) produces bool query
- Integration test: radial max_distance with WHERE produces bool query

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
pushDownSort() called requestBuilder.pushDownLimit() directly, bypassing
the LIMIT > k guard in pushDownLimit(). Extract validateLimitWithinK()
helper and call it from both paths so the invariant holds when PPL or
future callers set a non-zero sort count.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Move all explainQuery()-based DSL shape tests into a dedicated
VectorSearchExplainIT suite. VectorSearchIT now contains only
validation and error-path tests.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
…SearchIndex

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
…on in VectorSearchQueryBuilder

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
…fficient mode

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
…matting

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Reject construction of VectorSearchQueryBuilder with
FilterType.EFFICIENT and a null rebuildKnnWithFilter callback at
construction time instead of deferring to an NPE in pushDownFilter.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
… test, rename constructor comment

- Reword filter_type error message to be user-friendly and actionable
  (no longer leaks internal ScriptQueryUnSupportedException text)
- Strengthen efficient-mode explain IT: assert no bool/must (proves not
  post-filter shape), decode base64 knn payload to verify filter and
  predicate field are embedded inside knn query
- Rename "Backward-compatible constructor" to clarify intent

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
…are-aggregate ITs; drop subquery workaround text

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
The explain suite previously relied on substring checks (contains("wrapper"),
contains("bool"), contains("filter")) that could false-positive across distinct
push-down shapes — notably, post-filter vs. efficient-mode filter_type output
look similar at the substring level.

Add shared helpers to (1) decode every base64 WrapperQueryBuilder payload into
its inner k-NN JSON and (2) extract and unescape the outer sourceBuilder JSON
from explain output. Strengthen the top-k, radial (max_distance, min_score),
post-filter, compound-predicate, radial-with-WHERE, and filter_type=post tests
to assert both the outer bool/must/filter shape and that WHERE predicates
stay out of the inner k-NN payload. Refactor the efficient-mode test onto the
same helpers so its no-outer-bool/must guard is verified against the unescaped
sourceBuilder JSON rather than accidentally passing on escaped output.

Leave the ORDER BY _score DESC and LIMIT <= k smoke tests untouched — those
are guard-rail "does planning succeed" checks where shape is covered elsewhere.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
…R BY

- Add assertFalse(contains("\"bool\"")) on the outer sourceBuilder JSON for the
  plain top-k and radial (max_distance, min_score) cases so a regression that
  wrapped the knn in an outer bool while preserving the inner payload is
  caught.
- Document the SOURCE_BUILDER_JSON helper's test-only coupling to the
  surrounding "sourceBuilder=...", "pitId=" tokens in the request-string
  toString() output — if that format changes, update the regex anchors.
- Strengthen testEfficientFilterWithOrderByScoreDescSucceeds onto the same
  helpers as testExplainFilterTypeEfficientProducesKnnWithFilter: verify no
  outer bool/must in the sourceBuilder JSON and that the WHERE predicate is
  embedded inside the decoded knn payload.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
VectorSearchIT covered only rejection paths; this adds positive end-to-end
coverage for top-k, POST-filter, EFFICIENT-filter, and both radial modes
(max_distance / min_score).

The k-NN plugin is not provisioned by the default integ-test cluster, so
each test guards init() with Assume.assumeTrue(isKnnPluginInstalled()) —
tests skip cleanly when k-NN is absent and run when it is (e.g. locally
via scripts/setup-knn-local.sh). Provisioning k-NN in CI is a separate
follow-up tracked outside this PR.

Test data is a 6-doc 2D knn_vector index with two well-separated clusters
so filter correctness is assertable by document id.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
@mengweieric mengweieric added feature skip-diff-analyzer Maintainer to skip code-diff-analyzer check, after reviewing issues in AI analysis. skip-diff-reviewer Maintainer to skip code-diff-reviewer check, after reviewing issues in AI analysis. SQL labels Apr 17, 2026
Strengthen the happy-path IT so the efficient-filter test actually
discriminates EFFICIENT from POST and so the other tests assert
exact id sets instead of "at least one row, all in expected range".

- Pin the test index to Lucene HNSW + L2 so efficient filtering is
  deterministic (k-NN only supports efficient filtering on lucene+hnsw
  and faiss+hnsw/ivf) and the L2 -> 1/(1+d) scoring used by min_score
  is well-defined.
- Rework the efficient-filter test to query near TX with k=3 and
  WHERE state='CA'. A POST-filter implementation would return 0 rows
  here (the 3 nearest candidates are all TX and get filtered out), so
  the assertion on exactly {4,5,6} is discriminative between modes.
- Tighten POST and radial tests to exact id-set assertions.
- Drop the scripts/setup-knn-local.sh reference in the class javadoc
  (the helper is local-only working-tree tooling, not a tracked script).

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
@mengweieric mengweieric force-pushed the feature/vector-search-p0 branch from 9aa62fc to fa444fe Compare April 20, 2026 22:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature skip-diff-analyzer Maintainer to skip code-diff-analyzer check, after reviewing issues in AI analysis. skip-diff-reviewer Maintainer to skip code-diff-reviewer check, after reviewing issues in AI analysis. SQL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant