fix: propagate inner-field metadata through composite-type constructors #21984
adriangb wants to merge 5 commits into
Conversation
Pull request overview
Fixes loss of Arrow inner-Field metadata (notably ARROW:extension:*) when DataFusion UDFs/UDAFs construct composite types like List, Struct, and Map, ensuring extension-type identity survives through planning and runtime construction.
Changes:
- Add `list_inner_field_from` / `struct_inner_fields_from` helpers and extend `SingleRowListArrayBuilder::with_field` to propagate metadata.
- Update composite constructors (`make_array`, `array_repeat`, `range` / `generate_series`, `arrays_zip`, `map`, `struct`) and `array_agg` accumulators to thread metadata-bearing `FieldRef`s through runtime array building.
- Add unit tests plus an end-to-end sqllogictest covering metadata propagation across affected functions.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| datafusion/common/src/utils/mod.rs | Adds helpers for metadata-preserving list/struct field construction; extends SingleRowListArrayBuilder to carry metadata. |
| datafusion/functions-nested/src/make_array.rs | Uses planning-time inner FieldRef in return-field and runtime construction to preserve metadata. |
| datafusion/functions-nested/src/repeat.rs | Threads inner FieldRef through array_repeat runtime paths to preserve list inner-field metadata. |
| datafusion/functions-nested/src/range.rs | Adds return_field_from_args and runtime grafting of the planned inner field for range/generate_series. |
| datafusion/functions-nested/src/arrays_zip.rs | Preserves element-field metadata in planned struct members and reuses planned struct fields at runtime. |
| datafusion/functions-nested/src/map.rs | Builds map entries/key/value fields using metadata-bearing planned fields when available. |
| datafusion/functions/src/core/struct.rs | Adds return_field_from_args to preserve per-member metadata in struct(...). |
| datafusion/functions-aggregate/src/array_agg.rs | Preserves metadata by storing a FieldRef in accumulators and using it at all list-construction sites. |
| datafusion/spark/src/function/aggregate/collect.rs | Updates Spark collect accumulators to pass FieldRef into array_agg accumulators. |
| datafusion/functions-aggregate/benches/array_agg.rs | Adjusts bench to new accumulator constructor signature (&FieldRef). |
| datafusion/sqllogictest/test_files/array_metadata_propagation.slt | New SLT regression coverage for metadata propagation. |
```diff
 fn accumulator(&self, acc_args: AccumulatorArgs) -> Result<Box<dyn Accumulator>> {
     let field = &acc_args.expr_fields[0];
     let data_type = field.data_type().clone();
     let ignore_nulls = true;
     Ok(Box::new(NullToEmptyListAccumulator::new(
-        ArrayAggAccumulator::try_new(&data_type, ignore_nulls)?,
+        ArrayAggAccumulator::try_new(field, ignore_nulls)?,
         data_type,
     )))
```
```diff
 fn accumulator(&self, acc_args: AccumulatorArgs) -> Result<Box<dyn Accumulator>> {
     let field = &acc_args.expr_fields[0];
     let data_type = field.data_type().clone();
     let ignore_nulls = true;
     Ok(Box::new(NullToEmptyListAccumulator::new(
-        DistinctArrayAggAccumulator::try_new(&data_type, None, ignore_nulls)?,
+        DistinctArrayAggAccumulator::try_new(field, None, ignore_nulls)?,
         data_type,
     )))
```
@paleolimbot I wonder if you’d have time to review this PR? I think these are places where we should preserve metadata.
paleolimbot left a comment
Thank you for the ping...I'm always happy to review this kind of thing. I hadn't hit these yet but the day was coming.
A few questions inline but in general the approach is great and I'm glad these are getting fixed!
```rust
/// (`ARROW:extension:name` / `ARROW:extension:metadata`) to round-trip
/// through SQL list constructors (e.g. `make_array`, `array_agg`).
pub fn with_field(mut self, field: &Field) -> Self {
    self.field_name = Some(field.name().to_owned());
```
Should this use the default list item name ("item") canonically? (Using a different name can sometimes cause equality issues)
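The reviewer's concern can be illustrated with a minimal sketch. This uses mock types (not the arrow-rs API) purely to show why the inner field name matters: Arrow field equality includes the name, so a list built with the input column's name does not compare equal to the declared `List(item: ...)` type even when type and metadata match.

```rust
// Mock sketch (not the arrow-rs API) of the equality issue behind the
// question above: field equality includes the field name, so copying the
// input column's name ("a") into the list's inner field breaks comparisons
// against the canonical List(item: ...) type.
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct MockField {
    name: String,
    data_type: String,
    metadata: HashMap<String, String>,
}

fn main() {
    let declared = MockField {
        name: "item".to_string(), // the canonical default list item name
        data_type: "Int64".to_string(),
        metadata: HashMap::new(),
    };
    // Built at runtime from an input column named "a":
    let built = MockField { name: "a".to_string(), ..declared.clone() };
    assert_ne!(declared, built); // same type and metadata, different name

    // Canonicalizing the name (what the review suggests) restores equality.
    let canonical = MockField { name: "item".to_string(), ..built };
    assert_eq!(declared, canonical);
}
```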
```rust
let inner_field = match args.return_field.data_type() {
    List(field) | LargeList(field) | FixedSizeList(field, _) => {
        Some(Arc::clone(field))
    }
    _ => None,
};
```
Should this return internal_err!() for the case where the return field is not a DataType::List?
```rust
// Prefer the planning-time struct fields (which preserve input metadata)
// when available; fall back to building bare fields from element types.
let struct_fields: Fields = match inner_field.as_ref().map(|f| f.data_type()) {
```
If this can ever happen it would be helpful for future readers to elaborate on how (as far as I know this is not supposed to happen?)
Yes this should not happen unless people are calling from rust and doing somewhat weird things.
```rust
// Create a MutableArrayData builder per column. For None (Null-typed)
// args we only need extend_nulls, so we track them separately.
let mut builders: Vec<Option<MutableArrayData>> = values_data
```
Not here, but for the perfect zip (all value arrays the same length, no nulls with non-zero element slot lengths, no null padding needed) this should ideally be just clones of the original arrayrefs
```rust
let a_inner = with_meta("ARROW:extension:name", "arrow.uuid", DataType::Int64);
let b_inner = with_meta("ARROW:extension:name", "arrow.json", DataType::Utf8);
```
If this uses official extension names it should probably use valid storage to avoid confusing future humans or LLMs. Probably arbitrary key/value metadata is equally illustrative here.
Replaced with arbitrary metadata
```rust
let key_inner: FieldRef =
    Arc::new(Field::new("e", DataType::Utf8, false).with_metadata(
        std::collections::HashMap::from([(
            "ARROW:extension:name".to_string(),
            "arrow.uuid".to_string(),
        )]),
    ));
```
This is my favourite way so far of constructing the test fields (no helper function).
Also worth using arbitrary k/v metadata or a valid UUID storage here
```rust
fn return_field_from_args(&self, args: ReturnFieldArgs) -> Result<FieldRef> {
    if args.arg_fields.iter().any(|f| f.data_type().is_null()) {
        return Ok(Arc::new(Field::new(self.name(), DataType::Null, true)));
    }
    // Reuse the data type computed by `return_type` so the date64→date32
    // coercion and timezone preservation logic stays in one place. Then
    // walk the result and replace the inner field with a metadata-bearing
    // copy of the start arg's field so extension types flow through.
    let arg_types: Vec<_> = args
```
Do you want range() to propagate field metadata? The one extension type example I have for this is a discrete grid system (S2) which is an extension on top of a u64 where adding the step would result in an invalid or meaninglessly related identifier. For extension types specifically this probably only makes sense where the "+" operation is meaningful (and identical to the storage type). I would personally rather this didn't propagate metadata.
You are right. I removed range()
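The reviewer's argument can be made concrete with a toy sketch. The "cell id" scheme below is invented for illustration (it is not S2): when an extension type's storage is an opaque identifier, the values `range()` produces (`start + k*step`) are generally not valid identifiers, so stamping the input's extension metadata onto the output would be wrong.

```rust
// Toy illustration (invented checksum scheme, not a real extension type) of
// why range() should not propagate extension metadata: arithmetic on opaque
// identifier storage produces values that are invalid under the extension.
fn checksum_ok(id: u64) -> bool {
    // Toy validity rule: the low 4 bits must encode a checksum of the high bits.
    (id & 0xF) == ((id >> 4) % 16)
}

fn main() {
    // Build one valid toy identifier.
    let high = 0x1234u64;
    let start = (high << 4) | (high % 16);
    assert!(checksum_ok(start));

    // range(start, stop, step) produces start + k*step ...
    let produced: Vec<u64> = (0..4).map(|k| start + k * 3).collect();
    // ... most of which are not valid identifiers of the toy extension type.
    assert!(produced.iter().any(|id| !checksum_ok(*id)));
}
```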
```rust
let metadata = HashMap::from([(
    "ARROW:extension:name".to_string(),
    "arrow.uuid".to_string(),
)]);
```
This should probably use a valid storage type or arbitrary key/value metadata
```rust
.arg_fields
.iter()
.enumerate()
.map(|(pos, f)| nullable_inner_field_from(f, &format!("c{pos}")))
```
Should the inner fields always be nullable or inherit the nullability of the argument?
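The two options in this question can be sketched with mock types (not the arrow-rs API; the function names are illustrative). Forcing nullability is the conservative choice here because zipping lists of unequal length pads the shorter inputs with nulls.

```rust
// Mock sketch of the review question: force arrays_zip's struct member
// fields to nullable, or inherit each argument's nullability flag.
#[derive(Debug, Clone, PartialEq)]
struct MockField {
    name: String,
    nullable: bool,
}

// Option A (what a force-nullable helper does): always nullable.
fn forced_nullable(_src: &MockField, name: &str) -> MockField {
    MockField { name: name.to_string(), nullable: true }
}

// Option B (the alternative raised in the comment): inherit the argument's flag.
fn inherited(src: &MockField, name: &str) -> MockField {
    MockField { name: name.to_string(), nullable: src.nullable }
}

fn main() {
    let arg = MockField { name: "a".to_string(), nullable: false };
    // With forced nullability a null-padded zip result is always representable.
    assert!(forced_nullable(&arg, "c0").nullable);
    // Inheriting would declare "c0" non-nullable even though padding adds nulls.
    assert!(!inherited(&arg, "c0").nullable);
}
```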
```
----
List(Int64, metadata: {"k": "v"})

# array_agg DISTINCT preserves inner field metadata
```
FWIW, "distinct" has varying degrees of meaningfulness for extension types (byte-for-byte storage equality is usually fine, if perhaps conservative in some cases)
…rs (apache#21982)

`make_array`, `array_agg`, `array_repeat`, `arrays_zip`, `map`, `range` / `generate_series`, and `struct` only overrode `return_type` and synthesized fresh inner fields at runtime, so each one silently dropped the input field's metadata when wrapping it into a `List`/`Struct`/`Map`. In practice this broke Arrow extension types (`ARROW:extension:name` / `ARROW:extension:metadata`) round-tripping through SQL list/struct/map constructors.

Add two helpers in `datafusion-common::utils` — `list_inner_field_from` and `struct_inner_fields_from` — that wrap a source `Field` into the inner field of a list/struct while preserving its metadata, and extend `SingleRowListArrayBuilder::with_field` to copy metadata too. Use these helpers consistently from each affected function's `return_field_from_args` / `return_field`, and thread the resulting metadata-bearing inner field into the runtime construction paths (including the `array_agg` accumulators, which now carry a `FieldRef` instead of a bare `DataType`).

Adds Rust unit tests for each affected function plus an end-to-end `array_metadata_propagation.slt` that asserts metadata survives every constructor by string-matching the rendered data type.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
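The helper described in this commit can be sketched in a self-contained way. The block below uses a mock `Field` type, not arrow-rs, and mirrors only the shape of the real helper: keep the source field's data type and metadata, but use the canonical list item name so output types compare equal regardless of the input column's name.

```rust
// Self-contained sketch (mock Field, not the arrow-rs API) of what a
// metadata-preserving list inner-field helper has to do. The exact
// signature here is illustrative, not the real DataFusion API.
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct MockField {
    name: String,
    data_type: String,
    nullable: bool,
    metadata: HashMap<String, String>,
}

const LIST_FIELD_DEFAULT_NAME: &str = "item";

fn list_inner_field_from(src: &MockField) -> MockField {
    MockField {
        name: LIST_FIELD_DEFAULT_NAME.to_string(), // canonical item name
        data_type: src.data_type.clone(),
        nullable: true,
        metadata: src.metadata.clone(), // the part the old code silently dropped
    }
}

fn main() {
    let src = MockField {
        name: "a".to_string(),
        data_type: "Int64".to_string(),
        nullable: false,
        metadata: HashMap::from([(
            "ARROW:extension:name".to_string(),
            "arrow.uuid".to_string(),
        )]),
    };
    let inner = list_inner_field_from(&src);
    assert_eq!(inner.name, "item");
    assert_eq!(
        inner.metadata.get("ARROW:extension:name").map(String::as_str),
        Some("arrow.uuid")
    );
}
```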
Replace `list_inner_field_from` and `struct_inner_fields_from` with a single `nullable_inner_field_from(inner, name)` that renames + forces nullable to match Arrow's list/struct member conventions while preserving metadata. `nullable_list_item_field_from` is a thin wrapper using `Field::LIST_FIELD_DEFAULT_NAME`.

The map function uses `Field::with_name` / `with_nullable` directly since key/value need different nullability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lure CI fix: rewrite the broken intra-doc link to `make_map_batch` (a `#[cfg(test)]` helper) as a regular block doc comment so `cargo doc -D warnings` passes.

Review feedback from @paleolimbot:

- map.rs: make `entries` non-optional (`FieldRef`) across all local helpers — `make_map_batch_with_entries`, `make_map_batch_internal`, `make_map_array_internal`, `make_map_array_from_fixed_size_list`, `build_map_array`, and `key_value_fields_from_entries` — so future contributors can't silently drop metadata. The `MapFunc::invoke_with_args` entry point now returns `internal_err!` if the return field isn't a `Map` (it always should be). `get_element_field` accepts `ListView` and `LargeListView` so support is wired up when those become valid `map(...)` inputs.
- arrays_zip.rs / make_array.rs: `invoke_with_args` now returns `internal_err!` for non-list return fields and threads a non-optional `FieldRef` through the runtime path. Test fields use arbitrary key/value metadata (`unit`/`ms`, `source`/`sensor`) instead of fake `ARROW:extension:name` values that wouldn't be valid extension-type storage anyway.
- range.rs: revert metadata propagation. `range(start, stop, step)`'s output values are constructed by adding `step` to `start`, which is meaningless for many extension types (e.g. discrete grid systems on top of `u64`), so propagating the start field's metadata onto the output list's inner field would surface invalid identifiers. Drop the corresponding SLT cases.
- repeat.rs: same test-metadata cleanup.
- collect.rs (spark): `NullToEmptyListAccumulator` now stores the input `FieldRef` instead of a bare `DataType`, so the empty list it synthesizes in the all-NULL case carries the same inner-field metadata as the non-NULL path. Both `SparkCollectList` and `SparkCollectSet` override `return_field` and propagate inner metadata through `state_fields`.
- array_metadata_propagation.slt: each constructor is now exercised twice — once with literal arguments (constant-folded at planning) and once over a `VALUES` table (physical execution) — since the two paths can synthesize fields differently.
- Update memory-size assertion in `does_not_over_account_memory` (282 → 266) to reflect `FieldRef` (`Arc<Field>`, pointer-sized) being smaller than the inlined `DataType`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
0cc687c to bcfad17
… list

`NullToEmptyListAccumulator::evaluate` builds the all-NULL empty list via `SingleRowListArrayBuilder::with_field`, which copies the field name verbatim. It was passing the raw input element field (e.g. named `"a"` for column `a`), so the empty list's inner field was `List(item: "a")` while `return_field` / `state_fields` declare the canonical `List(item: "item")`. The mismatch surfaced as Arrow "column types must match schema types" / concat errors for `collect_set` GROUP BY queries that mix all-NULL and non-NULL groups (regression in spark `collect.slt`).

Canonicalize the stored field once in `NullToEmptyListAccumulator::new` via `nullable_list_item_field_from` (renames to `"item"`, keeps metadata) so the empty list matches the declared type.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`new` accepted `value_field` by value but only borrowed it (passed `&value_field` to `nullable_list_item_field_from`), tripping `clippy::needless_pass_by_value`. Take `&FieldRef` and let the callers pass the `&acc_args.expr_fields[0]` reference directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Which issue does this PR close?
Rationale for this change
Several built-in UDFs and UDAFs that wrap an input column into a composite type (`List`, `Struct`, `Map`, …) drop the input field's metadata when constructing the output's inner `Field`. In practice this matters most for Arrow extension types (`ARROW:extension:name` / `ARROW:extension:metadata` — e.g. `arrow.json`, `arrow.uuid`), because SQL-constructed lists/structs/maps silently lose extension-type identity. Any downstream operation that compares the produced `DataType` against a type carrying inner-field metadata (union, aggregate merging, IPC roundtrip) sees them as different types.

Each affected function has the same shape: only `return_type` (`DataType`-only) is overridden, and the runtime path synthesizes a fresh inner `Field` with no metadata. The fix is therefore systematic rather than per-function.

What changes are included in this PR?
Two new helpers in `datafusion-common::utils`:

- `list_inner_field_from(&Field) -> FieldRef` — builds the canonical inner field of a `List`/`LargeList`/`FixedSizeList` from a source field, preserving the source's data type and metadata.
- `struct_inner_fields_from(...) -> Fields` — builds named struct member fields from a sequence of `(name, &Field)` pairs, preserving each input's metadata.
- `SingleRowListArrayBuilder::with_field` was extended to also propagate metadata.

These helpers are then used consistently from each affected function's `return_field_from_args` (UDF) / `return_field` (UDAF), and the metadata-bearing inner field is threaded into the runtime construction paths:

- `make_array` — `return_field_from_args`; runtime uses `args.return_field`'s inner field
- `array_agg` (incl. distinct, ordered, groups accumulator) — `return_field` and `state_fields`; accumulators now carry a `FieldRef` instead of a bare `DataType` and use `list_inner_field_from` at every list-construction site
- `array_repeat` — `return_field_from_args`; runtime threads the inner field through
- `arrays_zip` — `return_field_from_args` (preserves struct member metadata); runtime uses planning-time struct fields
- `map` — `return_field_from_args`; `entries` field threaded through `make_map_array_internal` / `make_map_array_from_fixed_size_list` / `build_map_array`
- `range` / `generate_series` — `return_field_from_args`; runtime grafts the planning-time inner field onto results from `ListBuilder`-based paths
- `struct` — `return_field_from_args` (preserves member field metadata)
- spark `collect_list` / `collect_set` (which reuse `ArrayAggAccumulator` / `DistinctArrayAggAccumulator`) were updated to pass the input `FieldRef` through.

Are these changes tested?
Yes:

- Unit tests for each affected function's `return_field_from_args` / `return_field` (e.g. `make_array_preserves_inner_field_metadata`, `array_agg_preserves_inner_field_metadata`, `struct_preserves_member_metadata`, `arrays_zip_preserves_struct_member_metadata`, `array_repeat_preserves_inner_field_metadata`, `map_preserves_key_value_field_metadata`, `range_preserves_inner_field_metadata`).
- `datafusion/sqllogictest/test_files/array_metadata_propagation.slt` exercises every affected constructor through `with_metadata` → wrapping function → `arrow_field(...)['data_type']`, asserting the rendered data type contains the expected inner-field metadata. This covers `make_array`, `array_repeat`, `range`, `generate_series`, `struct`, `arrays_zip`, `map`, and `array_agg` (default, `DISTINCT`, and `ORDER BY` paths).
- Existing tests in `datafusion-common`, `datafusion-functions`, `datafusion-functions-aggregate`, and `datafusion-functions-nested` continue to pass; the full SLT suite (465 files) passes; clippy is clean.

Are there any user-facing changes?
Yes — but they are bug fixes rather than breaking changes: SQL-constructed lists/structs/maps now retain Arrow extension-type identity from their input fields. The accumulator constructors (`ArrayAggAccumulator::try_new`, `DistinctArrayAggAccumulator::try_new`, `OrderSensitiveArrayAggAccumulator::try_new`, `ArrayAggGroupsAccumulator::new`) now take `&FieldRef` instead of `&DataType`; this is a `pub` API change for downstream code that constructs these accumulators directly.

🤖 Generated with Claude Code