feat(datafusion): enable parallel file-level scanning via one partition per file#2298
toutane wants to merge 6 commits into apache:main
Conversation
timsaucer left a comment:
I'm no expert on Iceberg, but I've worked a lot on DataFusion, particularly table providers. I recently wrote a blog post on the DataFusion site, though only after you first put this PR up. In case it's in any way useful: https://datafusion.apache.org/blog/2026/03/31/writing-table-providers/
Overall I think the approach here is definitely reasonable. My comments are mostly around opportunities to squeeze out a little more performance based on having done something similar at my work.
```rust
    self: Arc<Self>,
    _children: Vec<Arc<dyn ExecutionPlan>>,
) -> DFResult<Arc<dyn ExecutionPlan>> {
    Ok(self)
```
Since this doesn't support children, I'd recommend an error if _children is not empty. Not a blocker for merge.
Yes, you're right, thanks! I pushed a fix that returns a `DataFusionError::Internal`, matching the pattern used in `IcebergCommitExec::with_new_children`.
Side note: IcebergTableScan::with_new_children has the same issue. This could be the subject of another PR.
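The guard pattern discussed here can be sketched with a std-only analogue (the method name mirrors DataFusion's `ExecutionPlan::with_new_children`, but `Plan` and the `String` error are hypothetical stand-ins, not the real DataFusion types):

```rust
// Minimal sketch of the "reject non-empty children" guard for a leaf node.
// `Plan` and the String error stand in for Arc<dyn ExecutionPlan> and
// DataFusionError::Internal in the real code.
#[derive(Debug, Clone)]
struct Plan {
    name: String,
}

impl Plan {
    // A leaf node has no children, so any attempt to attach children
    // indicates a planner bug; surface it as an internal error.
    fn with_new_children(self, children: Vec<Plan>) -> Result<Plan, String> {
        if !children.is_empty() {
            return Err(format!(
                "Internal: {} does not accept children, got {}",
                self.name,
                children.len()
            ));
        }
        Ok(self)
    }
}
```

The point of erroring rather than silently returning `self` is that a planner handing children to a leaf is always a bug worth surfacing.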
```rust
    &self,
    filters: &[&Expr],
) -> DFResult<Vec<TableProviderFilterPushDown>> {
    Ok(vec![TableProviderFilterPushDown::Inexact; filters.len()])
```
Can we do better than this? If we have a partitioned scan and the filter is on the partitions, I would expect to be able to get an exact pushdown. That would entirely remove a filter operation for cases where it matches, and I think that's a big win and a common use case I've seen in other work.
Yes, you're right, there's something to do here; I agree.
I'd prefer to tackle this in a follow-up PR: doing it correctly requires a per-filter conversion API (currently `convert_filters_to_predicate` collapses everything into a single combined predicate and silently drops non-convertible filters) and a partition-spec-aware check: only Identity-transformed partition columns can be safely marked Exact; bucket, truncate, year/month/etc. are lossy and must stay Inexact to avoid incorrect results.
Happy to open a tracking issue. However, if you think it's simple enough, I can go ahead and make the changes directly in the PR.
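The partition-spec-aware check could look roughly like this (a std-only sketch, assuming we classify by column name against the set of Identity-transformed partition columns; `classify_filter` and `PushDown` are hypothetical names, not the actual iceberg-rust or DataFusion API):

```rust
// Sketch: a filter can be marked Exact only when every column it touches is
// an Identity-transformed partition column. Bucket/truncate/year/month/...
// transforms are lossy, so filters on those columns must stay Inexact.
#[derive(Debug, PartialEq)]
enum PushDown {
    Exact,
    Inexact,
}

fn classify_filter(filter_columns: &[&str], identity_partition_columns: &[&str]) -> PushDown {
    let all_identity = !filter_columns.is_empty()
        && filter_columns
            .iter()
            .all(|c| identity_partition_columns.contains(c));
    if all_identity {
        PushDown::Exact
    } else {
        PushDown::Inexact
    }
}
```

The real version would additionally need the per-filter conversion API mentioned above, since a filter that cannot be converted to an Iceberg predicate at all must stay Inexact regardless of which columns it touches.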
```rust
    .map_err(to_datafusion_error)?
    .try_collect::<Vec<_>>()
    .await
    .map_err(to_datafusion_error)?;
```
It looks like the number of output partitions will be the number of files, right? I'm wondering if there's an opportunity to do better than that. We're specifying that the output partitioning in the exec is unknown, but don't we have information about the partitioning we could utilize?
By "better" I mean: could we be more performant if we were to go ahead and get the target partitions from the session and already output that number of partitions with hashing?
Thanks for raising this, please push back if any of the below is off.
For context, the long-term direction for this is tracked in the EPIC #1604 (row-group-based parallel scan with a GroupPruner that can split/merge FileScanTask below the file grain). What I was hoping to land with this PR is a more immediate, scoped optimization that stays within the current file-grain contract, so we don't preempt the design choices in #1604. The file-grouping step you're pointing at is essentially what #2220 describes as the intermediate improvement on the path toward #1604.
If you think it's appropriate, I'd be happy to pick up a short-term follow-up along these lines:
- Switch `IcebergPartitionedScan` from `tasks: Vec<FileScanTask>` to `file_groups: Vec<Vec<FileScanTask>>`, following the convention used by DataFusion's own `FileScanConfig`: each group = one DataFusion partition that streams its files sequentially through `ArrowReaderBuilder::read`.
- In `IcebergPartitionedTableProvider::scan`, read `state.config().target_partitions()` and group tasks into `min(n_files, target_partitions)` buckets.
- When `n_files < target_partitions`, parallelism is still capped at `n_files`. I think that's inherent to the file grain, but let me know if I'm missing something.
I'm happy to open the follow-up issue/PR myself, or defer to you if you'd rather frame it, whatever works best.
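The grouping step sketched in that list could look roughly like this (a std-only sketch under the assumptions above; `group_tasks` is a hypothetical helper, with `T` standing in for `FileScanTask`):

```rust
// Round-robin tasks into min(n_files, target_partitions) groups; each group
// would become one DataFusion partition that streams its files sequentially.
// This is a sketch, not the actual iceberg-datafusion implementation.
fn group_tasks<T>(tasks: Vec<T>, target_partitions: usize) -> Vec<Vec<T>> {
    if tasks.is_empty() || target_partitions == 0 {
        return Vec::new();
    }
    let n_groups = tasks.len().min(target_partitions);
    let mut groups: Vec<Vec<T>> = (0..n_groups).map(|_| Vec::new()).collect();
    for (i, task) in tasks.into_iter().enumerate() {
        groups[i % n_groups].push(task);
    }
    groups
}
```

Round-robin is the simplest policy; a size-aware variant (balancing groups by file byte counts, as the DataFusion discussion suggests) would slot into the same signature.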
I suppose I'd need to understand those conversations. I think I mentioned this in one of the other comments on this PR, but I found the whole discussion difficult to track. Maybe I can find some time this weekend to look through that size-based partitioning they mention.
Thanks for the PR, @toutane! One thing I noticed:
Commits:
- …itionedScan for parallel file scanning (Co-authored-by: Tim Saucer <timsaucer@gmail.com>)
- …:with_new_children

Force-pushed from 0a7af45 to fde61f6.
More broadly, is adding a second path really the best answer? It seems like now you're going to increase your maintenance load. Is there any reason not to have a single path, with the fallback being a partitioned scan of N=1?

I am going to spend a little more time trying to understand the issues. It's difficult because some of them are marked as unplanned or stale, and some of the links do not have good descriptions. I suppose I'll need to look at the Java source to get a better idea of what the long-term goal is.
Hey Tim, I think you're absolutely right about consolidating everything into a single path. The only reason I kept separate paths was to avoid introducing breaking changes. I am going to explore a design where the partitioned file scan becomes the default behavior, with the current provider's logic as the fallback, as you suggested.

On a related note, it could be worth thinking about the next step: exposing
I understand a desire to not introduce breaking changes. Is the concern that the API is changing, or do you have implementation concerns? If it's just the API change, then good upgrade documentation is often sufficient, especially since it looks like the change would be fairly straightforward for a downstream consumer. Please correct me if that's wrong. If it's a concern about the implementation, then I think the real solution is to make sure there's robust testing, both in the repo and against some real-life workloads, to verify performance at different scales and partitioning structures.

With respect to the question about output partitioning, I think any time you can do that, you should. Any time we can give more information about these kinds of things, we're going to see performance gains, and sometimes significant gains.
Which issue does this PR close?
What changes are included in this PR?
Approach
The issue proposed modifying `IcebergTableScan` directly to accept `Vec<Vec<FileScanTask>>` and return `UnknownPartitioning(n)`. This PR takes a different approach: rather than changing the existing scan path, it introduces two new types. This preserves full backward compatibility with `IcebergTableProvider`/`IcebergTableScan` and lets users explicitly choose parallel file scanning when they need it.

Adds two new public types to `iceberg-datafusion`:
- `IcebergPartitionedScan`: a DataFusion `ExecutionPlan` where each `FileScanTask` maps to exactly one partition, enabling DataFusion to dispatch file reads in parallel
- `IcebergPartitionedTableProvider`: a catalog-backed `TableProvider` that builds an `IcebergPartitionedScan` on every query, always fetching the latest snapshot

Design choices
One file = one partition
`IcebergTableScan` uses `UnknownPartitioning(1)` and streams all files sequentially through a single partition. `IcebergPartitionedScan` uses `UnknownPartitioning(n_files)`, giving DataFusion the information it needs to schedule `execute(i)` calls concurrently, one per file.

Table reloaded on every scan
`IcebergPartitionedTableProvider` loads the table twice: once at construction to snapshot the Arrow schema for DataFusion planning, and once at scan time to guarantee the freshest snapshot. This mirrors the behavior of `IcebergTableProvider`.

No stored projection/predicate fields

The struct is intentionally self-contained: its full state is `(tasks, file_io, schema)`.

Known limitations
No limit pushdown: `_limit` is not forwarded to `IcebergPartitionedScan`. DataFusion inserts a `GlobalLimitExec` above any leaf that does not implement pushdown, so correctness is maintained.

No writes: `insert_into` returns `FeatureUnsupported`. Use `IcebergTableProvider` for write operations.

Schema staleness on projection: projection indices are resolved against the schema captured at construction time. This is inherited behavior from `IcebergTableProvider`.

Are these changes tested?
Two unit tests are added in `table/partitioned.rs`:
- `test_empty_table_zero_partitions`: verifies that an empty table produces a zero-partition scan, guarding against an out-of-bounds panic on `execute(0)`
- `test_one_partition_per_file`: verifies that N data files produce exactly N DataFusion partitions in `IcebergPartitionedScan`
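The contract those tests pin down can be modeled with a small std-only sketch (`PartitionedScan` here is a hypothetical stand-in, not the real `IcebergPartitionedScan`): the partition count equals the file count, and an out-of-range `execute(i)` — including `execute(0)` on an empty table — errors instead of panicking.

```rust
// Std-only model of the one-file-one-partition contract: partition count
// equals the number of files, and execute(i) on an out-of-range index
// (including index 0 on an empty table) returns an error, never panics.
struct PartitionedScan {
    n_files: usize,
}

impl PartitionedScan {
    fn partition_count(&self) -> usize {
        self.n_files
    }

    fn execute(&self, partition: usize) -> Result<usize, String> {
        if partition >= self.n_files {
            return Err(format!("partition {partition} out of range"));
        }
        Ok(partition) // partition i streams exactly file i
    }
}
```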