[BugFix] Set per-field missingBucket=false for non-nullable group-by fields (#4463)#5336

Draft
songkant-aws wants to merge 5 commits into opensearch-project:main from songkant-aws:fix/agg-isnotnull-missing-bucket-4463

Conversation

@songkant-aws (Collaborator)

Summary

Resolves #4463

  • When isnotnull() filter is used with stats ... by on a text field whose keyword subfield has ignore_above, null buckets appear in aggregation results because the exists query operates on the root text field while the composite aggregation groups by the keyword subfield
  • This fix propagates NOT NULL constraints from pushed filters into the aggregate pushdown phase, setting missingBucket=false per-field for group-by fields known to be non-null
  • Uses Calcite's Strong.isNotTrue() to detect implicit NOT NULL constraints (e.g., x <> '' implies x IS NOT NULL), which handles Calcite's RexSimplify optimization that removes redundant IS_NOT_NULL predicates before filter pushdown; a minimal sketch of this inference follows below
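
A minimal standalone sketch of the inference this relies on (not code from this PR; it uses Calcite's public Strong, RexBuilder, and ImmutableBitSet APIs, and substitutes a non-empty literal for the PR's x <> '' case):

import org.apache.calcite.jdbc.JavaTypeFactoryImpl;
import org.apache.calcite.plan.Strong;
import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rex.RexBuilder;
import org.apache.calcite.rex.RexNode;
import org.apache.calcite.sql.fun.SqlStdOperatorTable;
import org.apache.calcite.sql.type.SqlTypeName;
import org.apache.calcite.util.ImmutableBitSet;

public class StrongNotNullDemo {
  public static void main(String[] args) {
    JavaTypeFactoryImpl typeFactory = new JavaTypeFactoryImpl();
    RexBuilder rexBuilder = new RexBuilder(typeFactory);
    RelDataType nullableVarchar =
        typeFactory.createTypeWithNullability(
            typeFactory.createSqlType(SqlTypeName.VARCHAR), true);
    // Build $0 <> 'x' -- analogous to the residual $0 <> '' that
    // RexSimplify leaves behind after folding away a redundant IS NOT NULL($0).
    RexNode condition =
        rexBuilder.makeCall(
            SqlStdOperatorTable.NOT_EQUALS,
            rexBuilder.makeInputRef(nullableVarchar, 0),
            rexBuilder.makeLiteral("x"));
    // If field 0 were null, the comparison could not evaluate to TRUE,
    // so the pushed filter implies $0 IS NOT NULL.
    System.out.println(Strong.isNotTrue(condition, ImmutableBitSet.of(0))); // true
  }
}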

Changes

  • FilterDigest.java: add notNullFieldIndices field to carry NOT NULL constraint info
  • CalciteLogicalIndexScan.java: extract NOT NULL indices at filter push time, pass to aggregate helper
  • AggregateAnalyzer.java: add nonNullableGroupIndices + isBucketNullable(int) for per-field control
  • AggregateAnalyzerTest.java: unit tests for partial and full non-nullable group fields
  • CalcitePPLAggregationIT.java: integration test with custom index mapping
  • issues/4463.yml: YAML REST test reproducing the bug

Test plan

  • Unit tests: AggregateAnalyzerTest (analyze_groupBy_nonNullableGroupField, analyze_groupBy_allNonNullableGroupFields)
  • Integration test: CalcitePPLAggregationIT.testIsNotNullFilterExcludesNullBucketsInAggregation (pushdown-only)
  • YAML REST test: issues/4463.yml
  • ./gradlew :opensearch:test — all pass
  • ./gradlew spotlessCheck — clean
  • Full CI

Signed-off-by: Songkan Tang <songkant@amazon.com>

…fields (opensearch-project#4463)

When a PPL query uses isnotnull() filter with stats group-by on a text
field that has a keyword subfield with ignore_above, the exists query
operates on the root text field while the aggregation uses the keyword
subfield. Documents with text values exceeding ignore_above have null
keyword values, causing unexpected null buckets in aggregation results
despite the isnotnull filter.

This fix propagates NOT NULL constraints from pushed filters to the
aggregate pushdown phase. It uses Calcite's Strong.isNotTrue() to
detect implicit NOT NULL constraints (covering both explicit IS_NOT_NULL
and implied constraints like x <> ''), stores them in FilterDigest, and
sets missingBucket=false per-field for constrained group-by fields.

Signed-off-by: Songkan Tang <songkant@amazon.com>
github-actions Bot commented Apr 13, 2026

PR Reviewer Guide 🔍

(Review updated until commit a00fb07)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Core fix - propagate NOT NULL constraints to aggregate pushdown (missingBucket per-field)

Relevant files:

  • opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/context/FilterDigest.java
  • opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/context/PushDownContext.java
  • opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java
  • opensearch/src/main/java/org/opensearch/sql/opensearch/request/AggregateAnalyzer.java
  • opensearch/src/test/java/org/opensearch/sql/opensearch/request/AggregateAnalyzerTest.java

Sub-PR theme: Integration and YAML REST tests for issue 4463 fix

Relevant files:

  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLAggregationIT.java
  • integ-test/src/test/resources/expectedOutput/calcite/chart_multiple_group_keys.yaml
  • integ-test/src/test/resources/expectedOutput/calcite/chart_timestamp_span_and_category.yaml
  • integ-test/src/test/resources/expectedOutput/calcite/chart_use_other.yaml
  • integ-test/src/test/resources/expectedOutput/calcite/chart_with_limit.yaml
  • integ-test/src/test/resources/expectedOutput/calcite/explain_scalar_correlated_subquery_in_select.yaml
  • integ-test/src/test/resources/expectedOutput/calcite/explain_scalar_correlated_subquery_in_where.yaml
  • integ-test/src/yamlRestTest/resources/rest-api-spec/test/issues/4463.yml

⚡ Recommended focus areas for review

Index Remapping Bug

In remapFilterDigest, the remapping logic uses selectedColumns.indexOf(oldIdx) which finds the position of oldIdx in selectedColumns. However, selectedColumns contains the original scan-level column indices, and indexOf returns the position (new index) in the projected schema. This logic may be incorrect if multiple project layers are composed — the composed currentColumns list contains already-remapped indices, not original scan indices. Verify that the remapping is correct when multiple project pushdowns are chained.

private static PushDownOperation remapFilterDigest(
    PushDownOperation operation, FilterDigest fd, List<Integer> selectedColumns) {
  Set<Integer> remapped = new HashSet<>();
  for (int oldIdx : fd.notNullFieldIndices()) {
    int newIdx = selectedColumns.indexOf(oldIdx);
    if (newIdx >= 0) {
      remapped.add(newIdx);
    }
  }
  FilterDigest newDigest = new FilterDigest(fd.scriptCount(), fd.condition(), remapped);
  return new PushDownOperation(operation.type(), newDigest, operation.action());
}
Group Index Mismatch

In extractNonNullableGroupIndices, the nonNullableGroupIndices set is populated with groupIndex values from aggregate.getGroupSet().asList(), but these are passed directly to AggregateBuilderHelper as nonNullableGroupIndices. In isBucketNullable(int groupIndex), the check is nonNullableGroupIndices.contains(groupIndex) where groupIndex is the loop variable in createCompositeBucket. Verify that the group index used in createCompositeBucket matches the indices stored in nonNullableGroupIndices — specifically, whether the index passed to createCompositeBucket is the position in the group list (0, 1, 2...) or the actual field index from getGroupSet().

private static Set<Integer> extractNonNullableGroupIndices(
    PushDownContext context, Aggregate aggregate, @Nullable Project project) {
  Set<Integer> allNotNullIndices = new HashSet<>();
  for (PushDownOperation op : context) {
    // FilterDigest is attached to both FILTER and SCRIPT push-down operations;
    // the type only reflects whether a script is involved, not whether NOT NULL metadata exists.
    if ((op.type() == PushDownType.FILTER || op.type() == PushDownType.SCRIPT)
        && op.digest() instanceof FilterDigest fd) {
      allNotNullIndices.addAll(fd.notNullFieldIndices());
    }
  }
  if (allNotNullIndices.isEmpty()) {
    return Collections.emptySet();
  }
  // Map scan-level not-null indices to aggregate group indices, resolving through project
  List<Integer> groupList = aggregate.getGroupSet().asList();
  Set<Integer> result = new HashSet<>();
  for (int groupIndex : groupList) {
    int scanIndex = resolveGroupToScanIndex(groupIndex, project);
    if (scanIndex >= 0 && allNotNullIndices.contains(scanIndex)) {
      result.add(groupIndex);
    }
  }
  return result;
}
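
As a side note on the index spaces involved (illustrative, using Calcite's real ImmutableBitSet API): asList() returns the set bit indices themselves, not their 0-based positions, so the field-index and position spaces only coincide when the group set is contiguous and starts at 0.

import org.apache.calcite.util.ImmutableBitSet;

public class GroupSetIndexDemo {
  public static void main(String[] args) {
    // A non-contiguous group set {0, 2}: asList() yields the bit indices [0, 2],
    // while a positional loop over the group list would use indices 0 and 1.
    System.out.println(ImmutableBitSet.of(0, 2).asList()); // [0, 2]
  }
}
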
Overly Broad NOT NULL

extractNotNullFieldIndices uses Strong.isNotTrue to detect fields that are implicitly non-null. This may mark fields as non-null in cases where the filter is only partially pushed down to OpenSearch (e.g., script predicates). If the NOT NULL constraint comes from a part of the condition that is NOT actually pushed to OpenSearch (only evaluated in Calcite), setting missingBucket=false in the aggregation DSL would be incorrect and could exclude valid null-keyword buckets. The code uses digestCondition (the pushed portion) rather than the full filter condition, which mitigates this, but the interaction with partial pushdown in script cases should be carefully validated.

private static Set<Integer> extractNotNullFieldIndices(RexNode condition) {
  Set<Integer> allRefs = new HashSet<>();
  condition.accept(
      new RexVisitorImpl<Void>(true) {
        @Override
        public Void visitInputRef(RexInputRef inputRef) {
          allRefs.add(inputRef.getIndex());
          return null;
        }
      });
  Set<Integer> result = new HashSet<>();
  for (int idx : allRefs) {
    if (Strong.isNotTrue(condition, ImmutableBitSet.of(idx))) {
      result.add(idx);
    }
  }
  return result;
}
Test Isolation

The issue4463Index field is an instance variable on the test class. If setupIssue4463Index() is called but cleanupIssue4463Index() fails or is skipped (e.g., due to a test framework issue), the index may persist and affect other tests. Consider using @Before/@After annotations or ensuring cleanup is always called. Also, if enabledOnlyWhenPushdownIsEnabled() throws/skips before setupIssue4463Index(), the index is never created, which is fine — but if it throws after setup, cleanup won't run.

public void testIsNotNullFilterExcludesNullBucketsInAggregation() throws IOException {
  enabledOnlyWhenPushdownIsEnabled();
  setupIssue4463Index();
  try {
    // isnotnull + != '' → RexSimplify folds into just x<>''; Strong still infers non-null
    JSONObject actual =
        executeQuery(
            "source="
                + issue4463Index
                + " | where isnotnull(description) and description != ''"
                + " | stats count() by description");
    assertEquals(3, actual.getInt("total"));
  } finally {
    cleanupIssue4463Index();
  }
}

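One way to address this isolation concern, sketched as a hypothetical JUnit-style rework (a class fragment, not code from this PR; it assumes the PR's setupIssue4463Index/cleanupIssue4463Index helpers and adds a guard flag so teardown never deletes an index that was never created):

// Hypothetical fragment; requires org.junit.After in scope.
private boolean issue4463IndexCreated = false;

@After
public void tearDownIssue4463Index() throws IOException {
  // Runs even when the test body throws; the guard handles early skips
  // (e.g., enabledOnlyWhenPushdownIsEnabled() aborting before setup).
  if (issue4463IndexCreated) {
    cleanupIssue4463Index();
    issue4463IndexCreated = false;
  }
}

public void testIsNotNullFilterExcludesNullBucketsInAggregation() throws IOException {
  enabledOnlyWhenPushdownIsEnabled();
  setupIssue4463Index();
  issue4463IndexCreated = true;
  JSONObject actual =
      executeQuery(
          "source="
              + issue4463Index
              + " | where isnotnull(description) and description != ''"
              + " | stats count() by description");
  assertEquals(3, actual.getInt("total"));
}
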
github-actions Bot commented Apr 13, 2026

PR Code Suggestions ✨

Latest suggestions up to a00fb07

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Fix group index vs. field index confusion in mapping

The method maps groupIndex (from aggregate.getGroupSet().asList()) directly to
resolveGroupToScanIndex, but groupIndex here is the position within the group set
list, not the actual field index in the aggregate's input row type. The actual field
index should come from aggregate.getGroupSet().asList().get(i) (which is what
groupList already contains), but then resolveGroupToScanIndex is called with that
value as both the group-set field index and the project lookup index. When a project
is present, project.getProjects().get(groupIndex) uses the group-set field index to
index into the project's expressions, which is only correct if the group fields are
the first N fields of the project — this may not always hold. Verify that groupIndex
correctly indexes into project.getProjects(), or use the position within groupList
(loop variable i) to index the project.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java [472-480]

-private static Set<Integer> extractNonNullableGroupIndices(
-    PushDownContext context, Aggregate aggregate, @Nullable Project project) {
-  Set<Integer> allNotNullIndices = new HashSet<>();
-  for (PushDownOperation op : context) {
-    if ((op.type() == PushDownType.FILTER || op.type() == PushDownType.SCRIPT)
-        && op.digest() instanceof FilterDigest fd) {
-      allNotNullIndices.addAll(fd.notNullFieldIndices());
-    }
+List<Integer> groupList = aggregate.getGroupSet().asList();
+Set<Integer> result = new HashSet<>();
+for (int i = 0; i < groupList.size(); i++) {
+  int fieldIndex = groupList.get(i);
+  int scanIndex = resolveGroupToScanIndex(i, project);
+  if (scanIndex >= 0 && allNotNullIndices.contains(scanIndex)) {
+    result.add(fieldIndex);
   }
-  if (allNotNullIndices.isEmpty()) {
-    return Collections.emptySet();
-  }
-  // Map scan-level not-null indices to aggregate group indices, resolving through project
-  List<Integer> groupList = aggregate.getGroupSet().asList();
-  Set<Integer> result = new HashSet<>();
-  for (int groupIndex : groupList) {
-    int scanIndex = resolveGroupToScanIndex(groupIndex, project);
-    if (scanIndex >= 0 && allNotNullIndices.contains(scanIndex)) {
-      result.add(groupIndex);
-    }
-  }
-  return result;
 }
+return result;
Suggestion importance[1-10]: 5


Why: The suggestion raises a valid concern about whether groupIndex from aggregate.getGroupSet().asList() correctly indexes into project.getProjects(). However, in the typical Calcite plan structure, the aggregate's group fields do correspond to the first positions in the project's output, making the current code likely correct in practice. The improved code changes the semantics by using position i instead of fieldIndex for project lookup, which may actually introduce a different bug if the group set indices don't start at 0.

Low
Ensure index extraction occurs before aggregate pushdown

extractNonNullableGroupIndices is called with newScan.pushDownContext, which at this
point already has the aggregate pushed down (via cloneForAggregate earlier in the
method). However, the FilterDigest entries in the context may have already been
remapped through cloneForAggregate's project-remapping logic. Ensure that
extractNonNullableGroupIndices is called before the aggregate is added to
newScan.pushDownContext, or verify that the indices in FilterDigest at this point
correctly correspond to the aggregate's group field space and not the original scan
field space.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java [415-416]

+// Extract non-nullable group indices before pushing down the aggregate,
+// so FilterDigest indices still refer to the pre-aggregate field space.
 Set<Integer> nonNullableGroupIndices =
     extractNonNullableGroupIndices(newScan.pushDownContext, aggregate, project);
+// ... then proceed with aggregate pushdown
Suggestion importance[1-10]: 3


Why: The suggestion asks to verify ordering of operations, but looking at the PR code, extractNonNullableGroupIndices is called on newScan.pushDownContext which is populated via cloneForAggregate earlier — the improved_code is essentially the same as existing_code with just a comment added, so this is more of a verification request than an actionable fix.

Low
General
Use inverse map for efficient index remapping

When multiple project pushdowns are chained, currentColumns is already remapped
iteratively in the loop above this call, so selectedColumns passed here represents
the composed mapping from the original scan field space to the post-project field
space. However, fd.notNullFieldIndices() contains scan-level indices (pre-project),
and selectedColumns.indexOf(oldIdx) correctly finds the new position. This is fine
for a single project, but if currentColumns was built by composing multiple projects
(via selectedColumns.stream().map(currentColumns::get).toList()), the final
currentColumns list contains scan-level indices at positions corresponding to the
aggregate's input. The indexOf lookup is O(n) per index; for correctness and
performance, consider building an inverse map (Map<Integer, Integer>) from
selectedColumns before the loop.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/context/PushDownContext.java [133-144]

 private static PushDownOperation remapFilterDigest(
     PushDownOperation operation, FilterDigest fd, List<Integer> selectedColumns) {
+  Map<Integer, Integer> inverseMap = new HashMap<>();
+  for (int newIdx = 0; newIdx < selectedColumns.size(); newIdx++) {
+    inverseMap.put(selectedColumns.get(newIdx), newIdx);
+  }
   Set<Integer> remapped = new HashSet<>();
   for (int oldIdx : fd.notNullFieldIndices()) {
-    int newIdx = selectedColumns.indexOf(oldIdx);
-    if (newIdx >= 0) {
+    Integer newIdx = inverseMap.get(oldIdx);
+    if (newIdx != null) {
       remapped.add(newIdx);
     }
   }
   FilterDigest newDigest = new FilterDigest(fd.scriptCount(), fd.condition(), remapped);
   return new PushDownOperation(operation.type(), newDigest, operation.action());
 }
Suggestion importance[1-10]: 3


Why: The suggestion to use a HashMap inverse map instead of indexOf is a valid performance improvement, converting O(n*m) to O(n+m). However, in practice the number of notNullFieldIndices and selectedColumns is small (typically single digits), making this a minor optimization with negligible real-world impact.

Low

Previous suggestions

Suggestions up to commit 5c5e02b
Category | Suggestion | Impact
Possible issue
Fix group index mismatch in non-nullable set

The groupIndex used in result.add(groupIndex) is the position within the group set
list, but isBucketNullable(groupIndex) in AggregateAnalyzer is called with the
actual group field index from aggregate.getGroupSet(). The group set list position
and the actual field index may differ when the group set is not contiguous (e.g.,
{0, 2}). You should add the actual field index from the group set, not the list
position.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java [472-479]

-List<Integer> groupList = aggregate.getGroupSet().asList();
+ImmutableBitSet groupSet = aggregate.getGroupSet();
+List<Integer> groupList = groupSet.asList();
 Set<Integer> result = new HashSet<>();
 for (int groupIndex : groupList) {
   int scanIndex = resolveGroupToScanIndex(groupIndex, project);
   if (scanIndex >= 0 && allNotNullIndices.contains(scanIndex)) {
     result.add(groupIndex);
   }
 }
Suggestion importance[1-10]: 3


Why: The suggestion claims groupIndex (the actual field index from getGroupSet().asList()) differs from the list position, but asList() on an ImmutableBitSet already returns the actual set bit indices (e.g., for {0, 2} it returns [0, 2]), so groupIndex is already the correct field index. The improved code adds ImmutableBitSet groupSet = aggregate.getGroupSet(); but doesn't actually use it, making the change a no-op. The suggestion is based on a misunderstanding.

Low
General
Fix misleading comment about filtered documents

The comment says "3 short (len ~12)" but the short descriptions in the test data are
"Short desc 1/2/3" which are 12 characters, and the test also inserts an empty
string document. The empty string has length 0, so it is filtered out by
length(description) > 5. However, the 3 long documents have null keyword values and
should be excluded from the aggregation result. The expected count of 3 (only the 3
short descriptions) appears correct, but the comment should clarify that the empty
string is also excluded by the length filter.

integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLAggregationIT.java [1766-1767]

-// 3 short (len ~12) + 3 long (len >50) pass length>5; null keyword bucket excluded → 3
+// 3 short (len ~12) pass length>5; empty string (len 0) filtered out; null keyword bucket excluded → 3
 assertEquals(3, actual.getInt("total"));
Suggestion importance[1-10]: 2


Why: This is a minor comment improvement that clarifies the empty string document is also excluded by the length > 5 filter. The logic and assertion are correct; only the comment is slightly misleading. Low impact as it doesn't affect correctness.

Low
Suggestions up to commit eaaa6e4
Category | Suggestion | Impact
Possible issue
Fix group index vs positional index mismatch

The groupIndex used in result.add(groupIndex) is the position within the group list
(0-based iteration index), not the actual group field index from
aggregate.getGroupSet(). The groupList contains the actual field indices from the
aggregate's group set, so you should be adding groupIndex (the field index from the
group set) to result, but the loop variable groupIndex already holds the correct
value from groupList. However, the result set is later used in
isBucketNullable(groupIndex) where groupIndex is the position in
createCompositeBucket, which iterates over group fields positionally. Verify that
the indices stored in result (from aggregate.getGroupSet().asList()) match the
positional indices used in isBucketNullable calls to avoid off-by-one mismatches.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java [472-479]

 List<Integer> groupList = aggregate.getGroupSet().asList();
 Set<Integer> result = new HashSet<>();
-for (int groupIndex : groupList) {
-  int scanIndex = resolveGroupToScanIndex(groupIndex, project);
+for (int pos = 0; pos < groupList.size(); pos++) {
+  int fieldIndex = groupList.get(pos);
+  int scanIndex = resolveGroupToScanIndex(fieldIndex, project);
   if (scanIndex >= 0 && allNotNullIndices.contains(scanIndex)) {
-    result.add(groupIndex);
+    result.add(pos); // store positional index used in isBucketNullable(groupIndex)
   }
 }
Suggestion importance[1-10]: 7


Why: This is a valid concern: aggregate.getGroupSet().asList() returns the actual field indices (e.g., {2, 5} yields [2, 5]), while isBucketNullable is called with a positional index (0, 1, 2...) in createCompositeBucket. If the group set contains non-contiguous indices, storing the field index instead of the positional index would cause a mismatch. The suggested fix of using pos instead of groupIndex (field index) is likely correct and important for correctness.

Medium
General
Ensure immutability of mutable set in record

The notNullFieldIndices field is a mutable Set stored in a record, which can be
mutated externally after construction. Consider storing it as an unmodifiable set to
preserve immutability guarantees of the record.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/context/FilterDigest.java [11]

 public record FilterDigest(int scriptCount, RexNode condition, Set<Integer> notNullFieldIndices) {
+  public FilterDigest {
+    notNullFieldIndices = Collections.unmodifiableSet(new HashSet<>(notNullFieldIndices));
+  }
Suggestion importance[1-10]: 4


Why: While making notNullFieldIndices unmodifiable is a good defensive practice for record immutability, the callers in this PR use Set.of() or Collections.emptySet() (already unmodifiable) or HashSet instances that aren't shared after construction, so the practical risk is low. It's a minor quality improvement.

Low
Suggestions up to commit 8d27262
Category | Suggestion | Impact
Possible issue
Fix group index mapping for non-contiguous group sets

The groupIndex used as the key in result is the position within the group set, but
isBucketNullable(groupIndex) in AggregateBuilderHelper is called with the index
passed to createCompositeBucket, which iterates over group fields sequentially (0,
1, 2...). If aggregate.getGroupSet() contains non-contiguous bit indices (e.g., {0,
2}), groupList returns [0, 2], but the composite bucket loop uses sequential indices
0, 1. You should store the sequential position within the group list, not the raw
bit index.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java [472-479]

 List<Integer> groupList = aggregate.getGroupSet().asList();
 Set<Integer> result = new HashSet<>();
-for (int groupIndex : groupList) {
-  int scanIndex = resolveGroupToScanIndex(groupIndex, project);
+for (int seqIndex = 0; seqIndex < groupList.size(); seqIndex++) {
+  int groupBitIndex = groupList.get(seqIndex);
+  int scanIndex = resolveGroupToScanIndex(groupBitIndex, project);
   if (scanIndex >= 0 && allNotNullIndices.contains(scanIndex)) {
-    result.add(groupIndex);
+    result.add(seqIndex);
   }
 }
Suggestion importance[1-10]: 7


Why: This is a valid concern: if aggregate.getGroupSet() contains non-contiguous bit indices (e.g., {0, 2}), the raw bit indices stored in result won't match the sequential indices used in the composite bucket loop. However, the actual behavior depends on how createCompositeBucket is called and whether group sets are always contiguous in practice. This could be a real bug for non-contiguous group sets.

Medium
Guard against out-of-bounds project list access

When project is not null, project.getProjects().get(groupIndex) uses the raw bit
index from the group set as a list index into the project's expressions. However,
the project list is indexed by its own output position, not by the aggregate's group
bit index. If the project has fewer outputs than the group bit index value, this
will throw an IndexOutOfBoundsException. The method should use the sequential
position within the group list, which should be passed in rather than the raw bit
index.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java [509-518]

-private static int resolveGroupToScanIndex(int groupIndex, @Nullable Project project) {
+private static int resolveGroupToScanIndex(int groupBitIndex, @Nullable Project project) {
   if (project == null) {
-    return groupIndex;
+    return groupBitIndex;
   }
-  RexNode projected = project.getProjects().get(groupIndex);
+  if (groupBitIndex >= project.getProjects().size()) {
+    return -1;
+  }
+  RexNode projected = project.getProjects().get(groupBitIndex);
   if (projected instanceof RexInputRef scanRef) {
     return scanRef.getIndex();
   }
   return -1;
 }
Suggestion importance[1-10]: 4


Why: Adding a bounds check on project.getProjects().size() is a defensive improvement, but in practice the aggregate's group bit indices should always be valid indices into the project's output list since the aggregate sits on top of the project. The improved_code doesn't address the root concern from suggestion 1 about using sequential vs. bit indices.

Low
General
Ensure immutability of stored index set

The notNullFieldIndices field is a mutable Set stored in a record, which can be
mutated externally after construction. Consider using Set.copyOf() in the canonical
constructor to ensure immutability and prevent accidental modification of the stored
set.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/context/FilterDigest.java [11]

 public record FilterDigest(int scriptCount, RexNode condition, Set<Integer> notNullFieldIndices) {
+  public FilterDigest {
+    notNullFieldIndices = Set.copyOf(notNullFieldIndices);
+  }
Suggestion importance[1-10]: 4


Why: Using Set.copyOf() in the canonical constructor is a good defensive practice to prevent external mutation of the stored set. However, the callers in this PR use Collections.emptySet() or Set.of() which are already immutable, so the practical risk is low.

Low
Suggestions up to commit afcd173
Category | Suggestion | Impact
Possible issue
Fix group index mismatch in non-nullable index resolution

The groupIndex used as the key in result is the index within the aggregate's group
set (i.e., the position in the row type of the aggregate's input), but
AggregateBuilderHelper.isBucketNullable is called with the position in groupList
(0-based iteration index). If aggregate.getGroupSet() is not contiguous starting
from 0, the indices stored in result won't match the iteration indices used in
createCompositeBucket. Consider storing the position in groupList (i.e.,
groupList.indexOf(groupIndex) or iterating with an explicit counter) rather than the
raw groupIndex value.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java [466-473]

 List<Integer> groupList = aggregate.getGroupSet().asList();
 Set<Integer> result = new HashSet<>();
-for (int groupIndex : groupList) {
+for (int pos = 0; pos < groupList.size(); pos++) {
+  int groupIndex = groupList.get(pos);
   int scanIndex = resolveGroupToScanIndex(groupIndex, project);
   if (scanIndex >= 0 && allNotNullIndices.contains(scanIndex)) {
-    result.add(groupIndex);
+    result.add(pos);
   }
 }
Suggestion importance[1-10]: 7


Why: This is a valid concern: if aggregate.getGroupSet() is not contiguous from 0, the groupIndex stored in result may not match the iteration position used in createCompositeBucket. However, in practice Calcite's ImmutableBitSet.asList() for a standard group-by typically produces contiguous indices starting from 0, so this may not manifest in common cases. Still, the fix is logically sound and prevents a subtle index mismatch bug.

Medium
General
Strengthen test to assert no null bucket values

The test only asserts the total count of rows but does not verify that none of the
returned rows have a null description value. A regression could still pass this
check if, for example, one short description is missing and a null bucket appears
instead. Add an assertion that iterates over the result rows and confirms no null
description values are present.

integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLAggregationIT.java [1713-1726]

 try {
-  // isnotnull + non-empty filter with stats group-by should exclude null keyword buckets
   JSONObject actual =
       executeQuery(
           "source="
               + index
               + " | where isnotnull(description) and description != ''"
               + " | stats count() by description");
-  // Should return exactly 3 rows (one per short description), no null bucket
   assertEquals(3, actual.getInt("total"));
+  // Verify no null description bucket is present
+  org.json.JSONArray dataRows = actual.getJSONArray("datarows");
+  for (int i = 0; i < dataRows.length(); i++) {
+    org.json.JSONArray row = dataRows.getJSONArray(i);
+    // description column should not be null
+    assertNotNull(row.get(1));
+  }
 } finally {
-  // Cleanup
   client().performRequest(new Request("DELETE", "/" + index));
 }
Suggestion importance[1-10]: 4


Why: Adding a check that no row contains a null description value would make the regression test more robust. However, the current assertEquals(3, actual.getInt("total")) already catches the primary regression (null bucket appearing as an extra row), so this is a moderate improvement to test thoroughness.

Low
Document index validity assumption for pushed filter indices

The notNullFieldIndices stored in FilterDigest are indices relative to the scan's
row type at the time the filter was pushed down. However, if a Project was pushed
down before the filter, the scan's row type may have changed, making these indices
stale or misaligned. Ensure that the indices are resolved against the correct row
type, or document clearly that this assumption holds.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java [456-461]

+// Note: notNullFieldIndices are relative to the scan row type at filter push-down time.
+// This is correct only if no project reordering occurred before the filter push-down.
 Set<Integer> allNotNullIndices = new HashSet<>();
 for (PushDownOperation op : context) {
   if (op.type() == PushDownType.FILTER && op.digest() instanceof FilterDigest fd) {
     allNotNullIndices.addAll(fd.notNullFieldIndices());
   }
 }
Suggestion importance[1-10]: 2


Why: This suggestion only adds a comment to document an assumption rather than fixing actual code. The improved_code is functionally identical to the existing_code, making this a documentation-only change with minimal impact.

Low

…undary tests

- Read FilterDigest from both FILTER and SCRIPT PushDownType operations,
  so isnotnull + script predicates (e.g. length()) still propagate
  NOT NULL constraints to the aggregate phase.
- Extract NOT NULL indices from digestCondition (the actually-pushed
  condition) instead of the original filter.getCondition(), ensuring
  partial pushdown consistency.
- Add boundary tests: isnotnull-only, isnotnull + script predicate,
  DSL-level missing_bucket:false verification via explain API.

Signed-off-by: Songkan Tang <songkant@amazon.com>
Persistent review updated to latest commit 8d27262

Non-nullable group-by fields now emit missing_bucket:false in the
composite aggregation DSL. Update 6 YAML expected-output files to
reflect the changed serialization.

Signed-off-by: Songkan Tang <songkant@amazon.com>
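
For orientation, this is roughly the shape of a composite-aggregation source in the serialized DSL after this change (illustrative only; the source and field names here are hypothetical, not taken from the updated YAML files):

{
  "composite": {
    "sources": [
      {
        "description": {
          "terms": {
            "field": "description.keyword",
            "missing_bucket": false
          }
        }
      }
    ]
  }
}
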
Persistent review updated to latest commit eaaa6e4

…gate

When a project pushdown occurs between a filter and an aggregate, the
FilterDigest.notNullFieldIndices become stale because the project
re-numbers the fields. This caused NfwPplDashboardIT failures where
valid non-null rows were incorrectly excluded from aggregation results.

Remap notNullFieldIndices through the project column mapping in
PushDownContext.cloneForAggregate() and update YAML expected outputs
for the corrected missing_bucket values.

Signed-off-by: Songkan Tang <songkant@amazon.com>
Persistent review updated to latest commit 5c5e02b

Persistent review updated to latest commit a00fb07

Successfully merging this pull request may close these issues:

[BUG] isnotnull() filter not applied in aggregation pushdown with text fields having keyword subfields