Add SelectivityTracker adaptive filter cost model #22236
Draft
adriangb wants to merge 3 commits into
Conversation
Introduce `OptionalFilterPhysicalExpr`, a transparent `PhysicalExpr` wrapper that marks a filter as *optional* — droppable without affecting query correctness. It delegates every `PhysicalExpr` method to the inner expression, so it is behavior-neutral until a consumer explicitly checks for the marker.

This is the foundation for adaptive filter scheduling: a scan can detect the wrapper and drop a performance-hint filter (e.g. a hash-join dynamic filter) when it is not cost-effective, knowing correctness is enforced elsewhere. Also adds proto serialization (`PhysicalOptionalFilterNode`) so physical plans containing the wrapper round-trip faithfully.

No caller wraps anything yet — that arrives with the adaptive parquet scan later in the stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
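The marker-wrapper pattern described above can be sketched in miniature. This is a hypothetical, self-contained illustration — `Expr`, `OptionalFilterExpr`, and `GreaterThan` are stand-ins invented here, not DataFusion's actual `PhysicalExpr` API, which has many more methods that the real wrapper delegates:

```rust
use std::sync::Arc;

// Stand-in for `PhysicalExpr`: one method instead of the real trait's many.
trait Expr {
    fn evaluate(&self, row: i64) -> bool;
}

// Transparent wrapper marking a filter as droppable without affecting
// correctness. Behavior-neutral: every call delegates to the inner expression.
struct OptionalFilterExpr {
    inner: Arc<dyn Expr>,
}

impl OptionalFilterExpr {
    fn new(inner: Arc<dyn Expr>) -> Self {
        Self { inner }
    }

    // Consumers that understand the marker can unwrap (or drop) the filter.
    fn inner(&self) -> &Arc<dyn Expr> {
        &self.inner
    }
}

impl Expr for OptionalFilterExpr {
    fn evaluate(&self, row: i64) -> bool {
        self.inner.evaluate(row) // pure delegation
    }
}

// A concrete expression used only to exercise the wrapper.
struct GreaterThan(i64);
impl Expr for GreaterThan {
    fn evaluate(&self, row: i64) -> bool {
        row > self.0
    }
}
```

A scan that does not know about the wrapper sees identical behavior; one that does can check for the concrete type and skip evaluating the filter entirely.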
Add an opt-in way to learn, per individual conjunct, how effective each predicate was during pruning — without running any extra pruning passes.

- `PruningPredicate::try_new_tagged_conjuncts` builds a predicate from AND-conjuncts, each carrying a caller-supplied tag.
- `PruningPredicate::prune_per_conjunct` returns the usual prune mask plus per-conjunct `PerConjunctPruneStats` (rows/containers seen vs. skipped) as a side effect of the pruning iteration that already runs.
- `RowGroupAccessPlanFilter::prune_by_statistics_with_per_conjunct_stats` and `PagePruningAccessPlanFilter::prune_plan_with_per_conjunct_stats` surface those stats for row-group and page-index pruning respectively.

The existing untagged `prune` / `prune_by_statistics` / `prune_plan_with_page_index` paths are preserved and unchanged; the new methods return empty stats on the untagged path.

No in-tree caller uses the tagged path yet — the adaptive parquet scan consumes it later in the stack as a selectivity prior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
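The "stats as a side effect of the pass that already runs" idea can be sketched as follows. Everything here is an illustrative assumption — `TaggedConjunct`, the `prunes_container` callback, and the simplified stats struct are invented for this sketch and do not match DataFusion's actual types:

```rust
// Simplified per-conjunct counters (the real struct also tracks rows).
#[derive(Default, Debug, Clone, PartialEq)]
struct PerConjunctPruneStats {
    containers_seen: u64,
    containers_skipped: u64,
}

// An AND-conjunct carrying a caller-supplied tag. `prunes_container`
// returns true when statistics prove the container cannot match.
struct TaggedConjunct<T> {
    tag: T,
    prunes_container: fn(&[i64]) -> bool,
}

/// One pruning pass over all containers that returns the usual keep-mask
/// *and* per-conjunct effectiveness counters — no extra passes.
fn prune_per_conjunct<T: Clone>(
    containers: &[&[i64]],
    conjuncts: &[TaggedConjunct<T>],
) -> (Vec<bool>, Vec<(T, PerConjunctPruneStats)>) {
    let mut stats: Vec<(T, PerConjunctPruneStats)> = conjuncts
        .iter()
        .map(|c| (c.tag.clone(), PerConjunctPruneStats::default()))
        .collect();
    let mask = containers
        .iter()
        .map(|&container| {
            let mut keep = true;
            for (c, (_, s)) in conjuncts.iter().zip(stats.iter_mut()) {
                s.containers_seen += 1;
                if (c.prunes_container)(container) {
                    s.containers_skipped += 1;
                    keep = false; // any conjunct pruning it skips the container
                }
            }
            keep
        })
        .collect();
    (mask, stats)
}
```

The skipped/seen ratio per tag is exactly the kind of signal a later consumer can use as a selectivity prior.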
Introduce `SelectivityTracker`, the cross-file cost model behind adaptive filter pushdown. It accumulates per-filter selectivity and throughput statistics and, given a confidence interval, decides whether each filter conjunct should be evaluated at row level, deferred to post-scan, or dropped entirely (for optional filters).

This commit adds the module, its `total_compressed_bytes` helper in `row_filter`, a criterion benchmark, and ~45 unit tests covering the partitioning / promote / demote / drop logic.

Nothing wires it into the parquet scan yet — that integration is the final commit in this stack. A couple of `pub(crate)` items (`count_skippable_bytes`, `skip_flag`, `is_filter_skipped`) are only exercised by that integration and carry a temporary `#[expect(dead_code)]` until then.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
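The shape of such a confidence-interval decision rule can be sketched as below. The type name, thresholds, and the normal-approximation interval are all illustrative assumptions — the PR's actual model, constants, and statistics differ:

```rust
#[derive(Debug, PartialEq)]
enum Placement {
    RowLevel, // confidently selective: evaluate during the scan
    PostScan, // not clearly selective: defer until after the scan
    Drop,     // optional filter that passes nearly everything
}

// Hypothetical accumulator: observed pass rate per filter conjunct.
#[derive(Default)]
struct SelectivitySketch {
    rows_seen: u64,
    rows_passed: u64,
}

impl SelectivitySketch {
    fn observe(&mut self, seen: u64, passed: u64) {
        self.rows_seen += seen;
        self.rows_passed += passed;
    }

    /// Upper bound of a ~95% normal-approximation confidence interval
    /// on the pass rate (illustrative choice of interval).
    fn selectivity_upper_bound(&self) -> f64 {
        if self.rows_seen == 0 {
            return 1.0; // no evidence yet: assume the filter passes everything
        }
        let n = self.rows_seen as f64;
        let p = self.rows_passed as f64 / n;
        (p + 1.96 * (p * (1.0 - p) / n).sqrt()).min(1.0)
    }

    fn placement(&self, optional: bool) -> Placement {
        let ub = self.selectivity_upper_bound();
        if optional && ub > 0.99 {
            Placement::Drop // filters almost nothing; not cost-effective
        } else if ub < 0.5 {
            Placement::RowLevel
        } else {
            Placement::PostScan
        }
    }
}
```

Using the interval's upper bound rather than the point estimate makes the model conservative: a filter is only promoted to row level (or dropped) once enough rows have been seen to be confident, which is what makes promote/demote decisions stable across files.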
This was referenced May 15, 2026
Thank you for opening this pull request! Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).
Which issue does this PR close?
Rationale for this change
The cost model that decides where each filter conjunct runs (row-level, post-scan, or dropped) is large enough to review on its own, separate from the scan plumbing that consumes it.
What changes are included in this PR?
- `SelectivityTracker`: a cross-file cost model that accumulates per-filter selectivity and throughput statistics and, using a confidence interval, partitions filter conjuncts into row-level / post-scan / dropped buckets.
- `total_compressed_bytes` helper in `row_filter` (column-byte sizing used by the tracker).

Nothing wires the tracker into the parquet scan yet — that is the final PR in the stack. A few `pub(crate)` items only exercised by that integration carry a temporary `#[expect(dead_code)]`, removed in PR 4.

Are these changes tested?
Yes — ~45 unit tests cover the partition / promote / demote / drop logic.
Are there any user-facing changes?
New `pub` module `datafusion-datasource-parquet::selectivity`. No behavior change — no production code path uses it yet.

Stacked PR — diff is cumulative against `main`. Review the top commit "feat: add SelectivityTracker adaptive filter cost model"; the commits below it are PRs #22234 and #22235.

Stack (review/merge in order):