[BugFix] Set per-field missingBucket=false for non-nullable group-by fields (#4463)#5336

Draft
songkant-aws wants to merge 5 commits into opensearch-project:main from songkant-aws:fix/agg-isnotnull-missing-bucket-4463

Conversation

@songkant-aws (Collaborator)

Summary

Resolves #4463

  • When isnotnull() filter is used with stats ... by on a text field whose keyword subfield has ignore_above, null buckets appear in aggregation results because the exists query operates on the root text field while the composite aggregation groups by the keyword subfield
  • This fix propagates NOT NULL constraints from pushed filters into the aggregate pushdown phase, setting missingBucket=false per-field for group-by fields known to be non-null
  • Uses Calcite's Strong.isNotTrue() to detect implicit NOT NULL constraints (e.g., x <> '' implies x IS NOT NULL), which handles Calcite's RexSimplify optimization that removes redundant IS_NOT_NULL predicates before filter pushdown; a minimal sketch of this inference follows below
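
A minimal standalone sketch of the inference this relies on (not code from this PR; it uses Calcite's public Strong, RexBuilder, and ImmutableBitSet APIs, and substitutes a non-empty literal for the PR's x <> '' case):

import org.apache.calcite.jdbc.JavaTypeFactoryImpl;
import org.apache.calcite.plan.Strong;
import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rex.RexBuilder;
import org.apache.calcite.rex.RexNode;
import org.apache.calcite.sql.fun.SqlStdOperatorTable;
import org.apache.calcite.sql.type.SqlTypeName;
import org.apache.calcite.util.ImmutableBitSet;

public class StrongNotNullDemo {
  public static void main(String[] args) {
    JavaTypeFactoryImpl typeFactory = new JavaTypeFactoryImpl();
    RexBuilder rexBuilder = new RexBuilder(typeFactory);
    RelDataType nullableVarchar =
        typeFactory.createTypeWithNullability(
            typeFactory.createSqlType(SqlTypeName.VARCHAR), true);
    // Build $0 <> 'x' -- analogous to the residual $0 <> '' that
    // RexSimplify leaves behind after folding away a redundant IS NOT NULL($0).
    RexNode condition =
        rexBuilder.makeCall(
            SqlStdOperatorTable.NOT_EQUALS,
            rexBuilder.makeInputRef(nullableVarchar, 0),
            rexBuilder.makeLiteral("x"));
    // If field 0 were null, the comparison could not evaluate to TRUE,
    // so the pushed filter implies $0 IS NOT NULL.
    System.out.println(Strong.isNotTrue(condition, ImmutableBitSet.of(0))); // true
  }
}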

Changes

  • FilterDigest.java: add notNullFieldIndices field to carry NOT NULL constraint info
  • CalciteLogicalIndexScan.java: extract NOT NULL indices at filter push time, pass to aggregate helper
  • AggregateAnalyzer.java: add nonNullableGroupIndices + isBucketNullable(int) for per-field control
  • AggregateAnalyzerTest.java: unit tests for partial and full non-nullable group fields
  • CalcitePPLAggregationIT.java: integration test with custom index mapping
  • issues/4463.yml: YAML REST test reproducing the bug

Test plan

  • Unit tests: AggregateAnalyzerTest (analyze_groupBy_nonNullableGroupField, analyze_groupBy_allNonNullableGroupFields)
  • Integration test: CalcitePPLAggregationIT.testIsNotNullFilterExcludesNullBucketsInAggregation (pushdown-only)
  • YAML REST test: issues/4463.yml
  • ./gradlew :opensearch:test — all pass
  • ./gradlew spotlessCheck — clean
  • Full CI

Signed-off-by: Songkan Tang <songkant@amazon.com>

…fields (opensearch-project#4463)

When a PPL query uses isnotnull() filter with stats group-by on a text
field that has a keyword subfield with ignore_above, the exists query
operates on the root text field while the aggregation uses the keyword
subfield. Documents with text values exceeding ignore_above have null
keyword values, causing unexpected null buckets in aggregation results
despite the isnotnull filter.

This fix propagates NOT NULL constraints from pushed filters to the
aggregate pushdown phase. It uses Calcite's Strong.isNotTrue() to
detect implicit NOT NULL constraints (covering both explicit IS_NOT_NULL
and implied constraints like x <> ''), stores them in FilterDigest, and
sets missingBucket=false per-field for constrained group-by fields.

Signed-off-by: Songkan Tang <songkant@amazon.com>
github-actions Bot commented Apr 13, 2026

PR Reviewer Guide 🔍

(Review updated until commit a00fb07)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Core fix - propagate NOT NULL constraints to aggregate pushdown (missingBucket per-field)

Relevant files:

  • opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/context/FilterDigest.java
  • opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/context/PushDownContext.java
  • opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java
  • opensearch/src/main/java/org/opensearch/sql/opensearch/request/AggregateAnalyzer.java
  • opensearch/src/test/java/org/opensearch/sql/opensearch/request/AggregateAnalyzerTest.java

Sub-PR theme: Integration and YAML REST tests for issue 4463 fix

Relevant files:

  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLAggregationIT.java
  • integ-test/src/test/resources/expectedOutput/calcite/chart_multiple_group_keys.yaml
  • integ-test/src/test/resources/expectedOutput/calcite/chart_timestamp_span_and_category.yaml
  • integ-test/src/test/resources/expectedOutput/calcite/chart_use_other.yaml
  • integ-test/src/test/resources/expectedOutput/calcite/chart_with_limit.yaml
  • integ-test/src/test/resources/expectedOutput/calcite/explain_scalar_correlated_subquery_in_select.yaml
  • integ-test/src/test/resources/expectedOutput/calcite/explain_scalar_correlated_subquery_in_where.yaml
  • integ-test/src/yamlRestTest/resources/rest-api-spec/test/issues/4463.yml

⚡ Recommended focus areas for review

Index Remapping Bug

In remapFilterDigest, the remapping logic uses selectedColumns.indexOf(oldIdx) which finds the position of oldIdx in selectedColumns. However, selectedColumns contains the original scan-level column indices, and indexOf returns the position (new index) in the projected schema. This logic may be incorrect if multiple project layers are composed — the composed currentColumns list contains already-remapped indices, not original scan indices. Verify that the remapping is correct when multiple project pushdowns are chained.

private static PushDownOperation remapFilterDigest(
    PushDownOperation operation, FilterDigest fd, List<Integer> selectedColumns) {
  Set<Integer> remapped = new HashSet<>();
  for (int oldIdx : fd.notNullFieldIndices()) {
    int newIdx = selectedColumns.indexOf(oldIdx);
    if (newIdx >= 0) {
      remapped.add(newIdx);
    }
  }
  FilterDigest newDigest = new FilterDigest(fd.scriptCount(), fd.condition(), remapped);
  return new PushDownOperation(operation.type(), newDigest, operation.action());
}
Group Index Mismatch

In extractNonNullableGroupIndices, the nonNullableGroupIndices set is populated with groupIndex values from aggregate.getGroupSet().asList(), but these are passed directly to AggregateBuilderHelper as nonNullableGroupIndices. In isBucketNullable(int groupIndex), the check is nonNullableGroupIndices.contains(groupIndex) where groupIndex is the loop variable in createCompositeBucket. Verify that the group index used in createCompositeBucket matches the indices stored in nonNullableGroupIndices — specifically, whether the index passed to createCompositeBucket is the position in the group list (0, 1, 2...) or the actual field index from getGroupSet().

private static Set<Integer> extractNonNullableGroupIndices(
    PushDownContext context, Aggregate aggregate, @Nullable Project project) {
  Set<Integer> allNotNullIndices = new HashSet<>();
  for (PushDownOperation op : context) {
    // FilterDigest is attached to both FILTER and SCRIPT push-down operations;
    // the type only reflects whether a script is involved, not whether NOT NULL metadata exists.
    if ((op.type() == PushDownType.FILTER || op.type() == PushDownType.SCRIPT)
        && op.digest() instanceof FilterDigest fd) {
      allNotNullIndices.addAll(fd.notNullFieldIndices());
    }
  }
  if (allNotNullIndices.isEmpty()) {
    return Collections.emptySet();
  }
  // Map scan-level not-null indices to aggregate group indices, resolving through project
  List<Integer> groupList = aggregate.getGroupSet().asList();
  Set<Integer> result = new HashSet<>();
  for (int groupIndex : groupList) {
    int scanIndex = resolveGroupToScanIndex(groupIndex, project);
    if (scanIndex >= 0 && allNotNullIndices.contains(scanIndex)) {
      result.add(groupIndex);
    }
  }
  return result;
}
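
As a side note on the index spaces involved (illustrative, using Calcite's real ImmutableBitSet API): asList() returns the set bit indices themselves, not their 0-based positions, so the field-index and position spaces only coincide when the group set is contiguous and starts at 0.

import org.apache.calcite.util.ImmutableBitSet;

public class GroupSetIndexDemo {
  public static void main(String[] args) {
    // A non-contiguous group set {0, 2}: asList() yields the bit indices [0, 2],
    // while a positional loop over the group list would use indices 0 and 1.
    System.out.println(ImmutableBitSet.of(0, 2).asList()); // [0, 2]
  }
}
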
Overly Broad NOT NULL

extractNotNullFieldIndices uses Strong.isNotTrue to detect fields that are implicitly non-null. This may mark fields as non-null in cases where the filter is only partially pushed down to OpenSearch (e.g., script predicates). If the NOT NULL constraint comes from a part of the condition that is NOT actually pushed to OpenSearch (only evaluated in Calcite), setting missingBucket=false in the aggregation DSL would be incorrect and could exclude valid null-keyword buckets. The code uses digestCondition (the pushed portion) rather than the full filter condition, which mitigates this, but the interaction with partial pushdown in script cases should be carefully validated.

private static Set<Integer> extractNotNullFieldIndices(RexNode condition) {
  Set<Integer> allRefs = new HashSet<>();
  condition.accept(
      new RexVisitorImpl<Void>(true) {
        @Override
        public Void visitInputRef(RexInputRef inputRef) {
          allRefs.add(inputRef.getIndex());
          return null;
        }
      });
  Set<Integer> result = new HashSet<>();
  for (int idx : allRefs) {
    if (Strong.isNotTrue(condition, ImmutableBitSet.of(idx))) {
      result.add(idx);
    }
  }
  return result;
}
Test Isolation

The issue4463Index field is an instance variable on the test class. If setupIssue4463Index() is called but cleanupIssue4463Index() fails or is skipped (e.g., due to a test framework issue), the index may persist and affect other tests. Consider using @Before/@After annotations or ensuring cleanup is always called. Also, if enabledOnlyWhenPushdownIsEnabled() throws/skips before setupIssue4463Index(), the index is never created, which is fine — but if it throws after setup, cleanup won't run.

public void testIsNotNullFilterExcludesNullBucketsInAggregation() throws IOException {
  enabledOnlyWhenPushdownIsEnabled();
  setupIssue4463Index();
  try {
    // isnotnull + != '' → RexSimplify folds into just x<>''; Strong still infers non-null
    JSONObject actual =
        executeQuery(
            "source="
                + issue4463Index
                + " | where isnotnull(description) and description != ''"
                + " | stats count() by description");
    assertEquals(3, actual.getInt("total"));
  } finally {
    cleanupIssue4463Index();
  }
}

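One way to address this isolation concern, sketched as a hypothetical JUnit-style rework (a class fragment, not code from this PR; it assumes the PR's setupIssue4463Index/cleanupIssue4463Index helpers and adds a guard flag so teardown never deletes an index that was never created):

// Hypothetical fragment; requires org.junit.After in scope.
private boolean issue4463IndexCreated = false;

@After
public void tearDownIssue4463Index() throws IOException {
  // Runs even when the test body throws; the guard handles early skips
  // (e.g., enabledOnlyWhenPushdownIsEnabled() aborting before setup).
  if (issue4463IndexCreated) {
    cleanupIssue4463Index();
    issue4463IndexCreated = false;
  }
}

public void testIsNotNullFilterExcludesNullBucketsInAggregation() throws IOException {
  enabledOnlyWhenPushdownIsEnabled();
  setupIssue4463Index();
  issue4463IndexCreated = true;
  JSONObject actual =
      executeQuery(
          "source="
              + issue4463Index
              + " | where isnotnull(description) and description != ''"
              + " | stats count() by description");
  assertEquals(3, actual.getInt("total"));
}
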
github-actions Bot commented Apr 13, 2026

PR Code Suggestions ✨

Latest suggestions up to a00fb07

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Fix group index vs. field index confusion in mapping

The method maps groupIndex (from aggregate.getGroupSet().asList()) directly to
resolveGroupToScanIndex, but groupIndex here is the position within the group set
list, not the actual field index in the aggregate's input row type. The actual field
index should come from aggregate.getGroupSet().asList().get(i) (which is what
groupList already contains), but then resolveGroupToScanIndex is called with that
value as both the group-set field index and the project lookup index. When a project
is present, project.getProjects().get(groupIndex) uses the group-set field index to
index into the project's expressions, which is only correct if the group fields are
the first N fields of the project — this may not always hold. Verify that groupIndex
correctly indexes into project.getProjects(), or use the position within groupList
(loop variable i) to index the project.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java [472-480]

-private static Set<Integer> extractNonNullableGroupIndices(
-    PushDownContext context, Aggregate aggregate, @Nullable Project project) {
-  Set<Integer> allNotNullIndices = new HashSet<>();
-  for (PushDownOperation op : context) {
-    if ((op.type() == PushDownType.FILTER || op.type() == PushDownType.SCRIPT)
-        && op.digest() instanceof FilterDigest fd) {
-      allNotNullIndices.addAll(fd.notNullFieldIndices());
-    }
+List<Integer> groupList = aggregate.getGroupSet().asList();
+Set<Integer> result = new HashSet<>();
+for (int i = 0; i < groupList.size(); i++) {
+  int fieldIndex = groupList.get(i);
+  int scanIndex = resolveGroupToScanIndex(i, project);
+  if (scanIndex >= 0 && allNotNullIndices.contains(scanIndex)) {
+    result.add(fieldIndex);
   }
-  if (allNotNullIndices.isEmpty()) {
-    return Collections.emptySet();
-  }
-  // Map scan-level not-null indices to aggregate group indices, resolving through project
-  List<Integer> groupList = aggregate.getGroupSet().asList();
-  Set<Integer> result = new HashSet<>();
-  for (int groupIndex : groupList) {
-    int scanIndex = resolveGroupToScanIndex(groupIndex, project);
-    if (scanIndex >= 0 && allNotNullIndices.contains(scanIndex)) {
-      result.add(groupIndex);
-    }
-  }
-  return result;
 }
+return result;
Suggestion importance[1-10]: 5


Why: The suggestion raises a valid concern about whether groupIndex from aggregate.getGroupSet().asList() correctly indexes into project.getProjects(). However, in the typical Calcite plan structure, the aggregate's group fields do correspond to the first positions in the project's output, making the current code likely correct in practice. The improved code changes the semantics by using position i instead of fieldIndex for project lookup, which may actually introduce a different bug if the group set indices don't start at 0.

Low
Ensure index extraction occurs before aggregate pushdown

extractNonNullableGroupIndices is called with newScan.pushDownContext, which at this
point already has the aggregate pushed down (via cloneForAggregate earlier in the
method). However, the FilterDigest entries in the context may have already been
remapped through cloneForAggregate's project-remapping logic. Ensure that
extractNonNullableGroupIndices is called before the aggregate is added to
newScan.pushDownContext, or verify that the indices in FilterDigest at this point
correctly correspond to the aggregate's group field space and not the original scan
field space.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java [415-416]

+// Extract non-nullable group indices before pushing down the aggregate,
+// so FilterDigest indices still refer to the pre-aggregate field space.
 Set<Integer> nonNullableGroupIndices =
     extractNonNullableGroupIndices(newScan.pushDownContext, aggregate, project);
+// ... then proceed with aggregate pushdown
Suggestion importance[1-10]: 3


Why: The suggestion asks to verify ordering of operations, but looking at the PR code, extractNonNullableGroupIndices is called on newScan.pushDownContext which is populated via cloneForAggregate earlier — the improved_code is essentially the same as existing_code with just a comment added, so this is more of a verification request than an actionable fix.

Low
General
Use inverse map for efficient index remapping

When multiple project pushdowns are chained, currentColumns is already remapped
iteratively in the loop above this call, so selectedColumns passed here represents
the composed mapping from the original scan field space to the post-project field
space. However, fd.notNullFieldIndices() contains scan-level indices (pre-project),
and selectedColumns.indexOf(oldIdx) correctly finds the new position. This is fine
for a single project, but if currentColumns was built by composing multiple projects
(via selectedColumns.stream().map(currentColumns::get).toList()), the final
currentColumns list contains scan-level indices at positions corresponding to the
aggregate's input. The indexOf lookup is O(n) per index; for correctness and
performance, consider building an inverse map (Map<Integer, Integer>) from
selectedColumns before the loop.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/context/PushDownContext.java [133-144]

 private static PushDownOperation remapFilterDigest(
     PushDownOperation operation, FilterDigest fd, List<Integer> selectedColumns) {
+  Map<Integer, Integer> inverseMap = new HashMap<>();
+  for (int newIdx = 0; newIdx < selectedColumns.size(); newIdx++) {
+    inverseMap.put(selectedColumns.get(newIdx), newIdx);
+  }
   Set<Integer> remapped = new HashSet<>();
   for (int oldIdx : fd.notNullFieldIndices()) {
-    int newIdx = selectedColumns.indexOf(oldIdx);
-    if (newIdx >= 0) {
+    Integer newIdx = inverseMap.get(oldIdx);
+    if (newIdx != null) {
       remapped.add(newIdx);
     }
   }
   FilterDigest newDigest = new FilterDigest(fd.scriptCount(), fd.condition(), remapped);
   return new PushDownOperation(operation.type(), newDigest, operation.action());
 }
Suggestion importance[1-10]: 3


Why: The suggestion to use a HashMap inverse map instead of indexOf is a valid performance improvement, converting O(n*m) to O(n+m). However, in practice the number of notNullFieldIndices and selectedColumns is small (typically single digits), making this a minor optimization with negligible real-world impact.

Low

Previous suggestions

Suggestions up to commit 5c5e02b
Category | Suggestion | Impact
Possible issue
Fix group index mismatch in non-nullable set

The groupIndex used in result.add(groupIndex) is the position within the group set
list, but isBucketNullable(groupIndex) in AggregateAnalyzer is called with the
actual group field index from aggregate.getGroupSet(). The group set list position
and the actual field index may differ when the group set is not contiguous (e.g.,
{0, 2}). You should add the actual field index from the group set, not the list
position.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java [472-479]

-List<Integer> groupList = aggregate.getGroupSet().asList();
+ImmutableBitSet groupSet = aggregate.getGroupSet();
+List<Integer> groupList = groupSet.asList();
 Set<Integer> result = new HashSet<>();
 for (int groupIndex : groupList) {
   int scanIndex = resolveGroupToScanIndex(groupIndex, project);
   if (scanIndex >= 0 && allNotNullIndices.contains(scanIndex)) {
     result.add(groupIndex);
   }
 }
Suggestion importance[1-10]: 3


Why: The suggestion claims groupIndex (the actual field index from getGroupSet().asList()) differs from the list position, but asList() on an ImmutableBitSet already returns the actual set bit indices (e.g., for {0, 2} it returns [0, 2]), so groupIndex is already the correct field index. The improved code adds ImmutableBitSet groupSet = aggregate.getGroupSet(); but doesn't actually use it, making the change a no-op. The suggestion is based on a misunderstanding.

Low
General
Fix misleading comment about filtered documents

The comment says "3 short (len ~12)" but the short descriptions in the test data are
"Short desc 1/2/3" which are 12 characters, and the test also inserts an empty
string document. The empty string has length 0, so it is filtered out by
length(description) > 5. However, the 3 long documents have null keyword values and
should be excluded from the aggregation result. The expected count of 3 (only the 3
short descriptions) appears correct, but the comment should clarify that the empty
string is also excluded by the length filter.

integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLAggregationIT.java [1766-1767]

-// 3 short (len ~12) + 3 long (len >50) pass length>5; null keyword bucket excluded → 3
+// 3 short (len ~12) pass length>5; empty string (len 0) filtered out; null keyword bucket excluded → 3
 assertEquals(3, actual.getInt("total"));
Suggestion importance[1-10]: 2


Why: This is a minor comment improvement that clarifies the empty string document is also excluded by the length > 5 filter. The logic and assertion are correct; only the comment is slightly misleading. Low impact as it doesn't affect correctness.

Low
Suggestions up to commit eaaa6e4
Category | Suggestion | Impact
Possible issue
Fix group index vs positional index mismatch

The groupIndex used in result.add(groupIndex) is the position within the group list
(0-based iteration index), not the actual group field index from
aggregate.getGroupSet(). The groupList contains the actual field indices from the
aggregate's group set, so you should be adding groupIndex (the field index from the
group set) to result, but the loop variable groupIndex already holds the correct
value from groupList. However, the result set is later used in
isBucketNullable(groupIndex) where groupIndex is the position in
createCompositeBucket, which iterates over group fields positionally. Verify that
the indices stored in result (from aggregate.getGroupSet().asList()) match the
positional indices used in isBucketNullable calls to avoid off-by-one mismatches.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java [472-479]

 List<Integer> groupList = aggregate.getGroupSet().asList();
 Set<Integer> result = new HashSet<>();
-for (int groupIndex : groupList) {
-  int scanIndex = resolveGroupToScanIndex(groupIndex, project);
+for (int pos = 0; pos < groupList.size(); pos++) {
+  int fieldIndex = groupList.get(pos);
+  int scanIndex = resolveGroupToScanIndex(fieldIndex, project);
   if (scanIndex >= 0 && allNotNullIndices.contains(scanIndex)) {
-    result.add(groupIndex);
+    result.add(pos); // store positional index used in isBucketNullable(groupIndex)
   }
 }
Suggestion importance[1-10]: 7


Why: This is a valid concern: aggregate.getGroupSet().asList() returns the actual field indices (e.g., {2, 5} yields [2, 5]), while isBucketNullable is called with a positional index (0, 1, 2...) in createCompositeBucket. If the group set contains non-contiguous indices, storing the field index instead of the positional index would cause a mismatch. The suggested fix of using pos instead of groupIndex (field index) is likely correct and important for correctness.

Medium
General
Ensure immutability of mutable set in record

The notNullFieldIndices field is a mutable Set stored in a record, which can be
mutated externally after construction. Consider storing it as an unmodifiable set to
preserve immutability guarantees of the record.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/context/FilterDigest.java [11]

 public record FilterDigest(int scriptCount, RexNode condition, Set<Integer> notNullFieldIndices) {
+  public FilterDigest {
+    notNullFieldIndices = Collections.unmodifiableSet(new HashSet<>(notNullFieldIndices));
+  }
Suggestion importance[1-10]: 4


Why: While making notNullFieldIndices unmodifiable is a good defensive practice for record immutability, the callers in this PR use Set.of() or Collections.emptySet() (already unmodifiable) or HashSet instances that aren't shared after construction, so the practical risk is low. It's a minor quality improvement.

Low
Suggestions up to commit 8d27262
Category | Suggestion | Impact
Possible issue
Fix group index mapping for non-contiguous group sets

The groupIndex used as the key in result is the position within the group set, but
isBucketNullable(groupIndex) in AggregateBuilderHelper is called with the index
passed to createCompositeBucket, which iterates over group fields sequentially (0,
1, 2...). If aggregate.getGroupSet() contains non-contiguous bit indices (e.g., {0,
2}), groupList returns [0, 2], but the composite bucket loop uses sequential indices
0, 1. You should store the sequential position within the group list, not the raw
bit index.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java [472-479]

 List<Integer> groupList = aggregate.getGroupSet().asList();
 Set<Integer> result = new HashSet<>();
-for (int groupIndex : groupList) {
-  int scanIndex = resolveGroupToScanIndex(groupIndex, project);
+for (int seqIndex = 0; seqIndex < groupList.size(); seqIndex++) {
+  int groupBitIndex = groupList.get(seqIndex);
+  int scanIndex = resolveGroupToScanIndex(groupBitIndex, project);
   if (scanIndex >= 0 && allNotNullIndices.contains(scanIndex)) {
-    result.add(groupIndex);
+    result.add(seqIndex);
   }
 }
Suggestion importance[1-10]: 7


Why: This is a valid concern: if aggregate.getGroupSet() contains non-contiguous bit indices (e.g., {0, 2}), the raw bit indices stored in result won't match the sequential indices used in the composite bucket loop. However, the actual behavior depends on how createCompositeBucket is called and whether group sets are always contiguous in practice. This could be a real bug for non-contiguous group sets.

Medium
Guard against out-of-bounds project list access

When project is not null, project.getProjects().get(groupIndex) uses the raw bit
index from the group set as a list index into the project's expressions. However,
the project list is indexed by its own output position, not by the aggregate's group
bit index. If the project has fewer outputs than the group bit index value, this
will throw an IndexOutOfBoundsException. The method should use the sequential
position within the group list, which should be passed in rather than the raw bit
index.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java [509-518]

-private static int resolveGroupToScanIndex(int groupIndex, @Nullable Project project) {
+private static int resolveGroupToScanIndex(int groupBitIndex, @Nullable Project project) {
   if (project == null) {
-    return groupIndex;
+    return groupBitIndex;
   }
-  RexNode projected = project.getProjects().get(groupIndex);
+  if (groupBitIndex >= project.getProjects().size()) {
+    return -1;
+  }
+  RexNode projected = project.getProjects().get(groupBitIndex);
   if (projected instanceof RexInputRef scanRef) {
     return scanRef.getIndex();
   }
   return -1;
 }
Suggestion importance[1-10]: 4


Why: Adding a bounds check on project.getProjects().size() is a defensive improvement, but in practice the aggregate's group bit indices should always be valid indices into the project's output list since the aggregate sits on top of the project. The improved_code doesn't address the root concern from suggestion 1 about using sequential vs. bit indices.

Low
General
Ensure immutability of stored index set

The notNullFieldIndices field is a mutable Set stored in a record, which can be
mutated externally after construction. Consider using Set.copyOf() in the canonical
constructor to ensure immutability and prevent accidental modification of the stored
set.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/context/FilterDigest.java [11]

 public record FilterDigest(int scriptCount, RexNode condition, Set<Integer> notNullFieldIndices) {
+  public FilterDigest {
+    notNullFieldIndices = Set.copyOf(notNullFieldIndices);
+  }
Suggestion importance[1-10]: 4


Why: Using Set.copyOf() in the canonical constructor is a good defensive practice to prevent external mutation of the stored set. However, the callers in this PR use Collections.emptySet() or Set.of() which are already immutable, so the practical risk is low.

Low
Suggestions up to commit afcd173
Category | Suggestion | Impact
Possible issue
Fix group index mismatch in non-nullable index resolution

The groupIndex used as the key in result is the index within the aggregate's group
set (i.e., the position in the row type of the aggregate's input), but
AggregateBuilderHelper.isBucketNullable is called with the position in groupList
(0-based iteration index). If aggregate.getGroupSet() is not contiguous starting
from 0, the indices stored in result won't match the iteration indices used in
createCompositeBucket. Consider storing the position in groupList (i.e.,
groupList.indexOf(groupIndex) or iterating with an explicit counter) rather than the
raw groupIndex value.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java [466-473]

 List<Integer> groupList = aggregate.getGroupSet().asList();
 Set<Integer> result = new HashSet<>();
-for (int groupIndex : groupList) {
+for (int pos = 0; pos < groupList.size(); pos++) {
+  int groupIndex = groupList.get(pos);
   int scanIndex = resolveGroupToScanIndex(groupIndex, project);
   if (scanIndex >= 0 && allNotNullIndices.contains(scanIndex)) {
-    result.add(groupIndex);
+    result.add(pos);
   }
 }
Suggestion importance[1-10]: 7


Why: This is a valid concern: if aggregate.getGroupSet() is not contiguous from 0, the groupIndex stored in result may not match the iteration position used in createCompositeBucket. However, in practice Calcite's ImmutableBitSet.asList() for a standard group-by typically produces contiguous indices starting from 0, so this may not manifest in common cases. Still, the fix is logically sound and prevents a subtle index mismatch bug.

Medium
General
Strengthen test to assert no null bucket values

The test only asserts the total count of rows but does not verify that none of the
returned rows have a null description value. A regression could still pass this
check if, for example, one short description is missing and a null bucket appears
instead. Add an assertion that iterates over the result rows and confirms no null
description values are present.

integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLAggregationIT.java [1713-1726]

 try {
-  // isnotnull + non-empty filter with stats group-by should exclude null keyword buckets
   JSONObject actual =
       executeQuery(
           "source="
               + index
               + " | where isnotnull(description) and description != ''"
               + " | stats count() by description");
-  // Should return exactly 3 rows (one per short description), no null bucket
   assertEquals(3, actual.getInt("total"));
+  // Verify no null description bucket is present
+  org.json.JSONArray dataRows = actual.getJSONArray("datarows");
+  for (int i = 0; i < dataRows.length(); i++) {
+    org.json.JSONArray row = dataRows.getJSONArray(i);
+    // description column should not be null
+    assertNotNull(row.get(1));
+  }
 } finally {
-  // Cleanup
   client().performRequest(new Request("DELETE", "/" + index));
 }
Suggestion importance[1-10]: 4


Why: Adding a check that no row contains a null description value would make the regression test more robust. However, the current assertEquals(3, actual.getInt("total")) already catches the primary regression (null bucket appearing as an extra row), so this is a moderate improvement to test thoroughness.

Low
Document index validity assumption for pushed filter indices

The notNullFieldIndices stored in FilterDigest are indices relative to the scan's
row type at the time the filter was pushed down. However, if a Project was pushed
down before the filter, the scan's row type may have changed, making these indices
stale or misaligned. Ensure that the indices are resolved against the correct row
type, or document clearly that this assumption holds.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java [456-461]

+// Note: notNullFieldIndices are relative to the scan row type at filter push-down time.
+// This is correct only if no project reordering occurred before the filter push-down.
 Set<Integer> allNotNullIndices = new HashSet<>();
 for (PushDownOperation op : context) {
   if (op.type() == PushDownType.FILTER && op.digest() instanceof FilterDigest fd) {
     allNotNullIndices.addAll(fd.notNullFieldIndices());
   }
 }
Suggestion importance[1-10]: 2


Why: This suggestion only adds a comment to document an assumption rather than fixing actual code. The improved_code is functionally identical to the existing_code, making this a documentation-only change with minimal impact.

Low

…undary tests

- Read FilterDigest from both FILTER and SCRIPT PushDownType operations,
  so isnotnull + script predicates (e.g. length()) still propagate
  NOT NULL constraints to the aggregate phase.
- Extract NOT NULL indices from digestCondition (the actually-pushed
  condition) instead of the original filter.getCondition(), ensuring
  partial pushdown consistency.
- Add boundary tests: isnotnull-only, isnotnull + script predicate,
  DSL-level missing_bucket:false verification via explain API.

Signed-off-by: Songkan Tang <songkant@amazon.com>
Persistent review updated to latest commit 8d27262

Non-nullable group-by fields now emit missing_bucket:false in the
composite aggregation DSL. Update 6 YAML expected-output files to
reflect the changed serialization.

Signed-off-by: Songkan Tang <songkant@amazon.com>
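
For orientation, this is roughly the shape of a composite-aggregation source in the serialized DSL after this change (illustrative only; the source and field names here are hypothetical, not taken from the updated YAML files):

{
  "composite": {
    "sources": [
      {
        "description": {
          "terms": {
            "field": "description.keyword",
            "missing_bucket": false
          }
        }
      }
    ]
  }
}
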
Persistent review updated to latest commit eaaa6e4

…gate

When a project pushdown occurs between a filter and an aggregate, the
FilterDigest.notNullFieldIndices become stale because the project
re-numbers the fields. This caused NfwPplDashboardIT failures where
valid non-null rows were incorrectly excluded from aggregation results.

Remap notNullFieldIndices through the project column mapping in
PushDownContext.cloneForAggregate() and update YAML expected outputs
for the corrected missing_bucket values.

Signed-off-by: Songkan Tang <songkant@amazon.com>
Persistent review updated to latest commit 5c5e02b

Persistent review updated to latest commit a00fb07

Successfully merging this pull request may close these issues:

[BUG] isnotnull() filter not applied in aggregation pushdown with text fields having keyword subfields