
fix: propagate inner-field metadata through composite-type constructors #21984

Open
adriangb wants to merge 5 commits into apache:main from pydantic:fix-list-field-metadata

Conversation

adriangb (Contributor) commented May 2, 2026

Which issue does this PR close?

Rationale for this change

Several built-in UDFs and UDAFs that wrap an input column into a composite type (List, Struct, Map, …) drop the input field's metadata when constructing the output's inner Field. In practice this matters most for Arrow extension types (ARROW:extension:name / ARROW:extension:metadata — e.g. arrow.json, arrow.uuid), because SQL-constructed lists/structs/maps silently lose extension-type identity. Any downstream operation that compares the produced DataType against a type carrying inner-field metadata (union, aggregate merging, IPC roundtrip) sees them as different types.

Each affected function has the same shape: only return_type (DataType-only) is overridden, and the runtime path synthesizes a fresh inner Field with no metadata. The fix is therefore systematic rather than per-function.

What changes are included in this PR?

Two new helpers in datafusion-common::utils:

  • list_inner_field_from(&Field) -> FieldRef — builds the canonical inner field of a List/LargeList/FixedSizeList from a source field, preserving the source's data type and metadata.
  • struct_inner_fields_from(...) -> Fields — builds named struct member fields from a sequence of (name, &Field) pairs, preserving each input's metadata.

SingleRowListArrayBuilder::with_field was extended to also propagate metadata.

These helpers are then used consistently from each affected function's return_field_from_args (UDF) / return_field (UDAF), and the metadata-bearing inner field is threaded into the runtime construction paths:

  • make_array: new return_field_from_args; runtime uses args.return_field's inner field
  • array_agg (incl. distinct, ordered, groups accumulator): new return_field and state_fields; accumulators now carry a FieldRef instead of a bare DataType and use list_inner_field_from at every list-construction site
  • array_repeat: new return_field_from_args; runtime threads the inner field through
  • arrays_zip: new return_field_from_args (preserves struct member metadata); runtime uses planning-time struct fields
  • map: new return_field_from_args; entries field threaded through make_map_array_internal / make_map_array_from_fixed_size_list / build_map_array
  • range / generate_series: new return_field_from_args; runtime grafts the planning-time inner field onto results from ListBuilder-based paths
  • struct: new return_field_from_args (preserves member field metadata)

spark.collect_list / collect_set (which reuse ArrayAggAccumulator / DistinctArrayAggAccumulator) were updated to pass the input FieldRef through.

Are these changes tested?

Yes:

  • Rust unit tests in each affected file directly assert metadata propagation through return_field_from_args / return_field (e.g. make_array_preserves_inner_field_metadata, array_agg_preserves_inner_field_metadata, struct_preserves_member_metadata, arrays_zip_preserves_struct_member_metadata, array_repeat_preserves_inner_field_metadata, map_preserves_key_value_field_metadata, range_preserves_inner_field_metadata).
  • New end-to-end SLT datafusion/sqllogictest/test_files/array_metadata_propagation.slt exercises every affected constructor through with_metadata → wrapping function → arrow_field(...)['data_type'], asserting the rendered data type contains the expected inner-field metadata. This covers make_array, array_repeat, range, generate_series, struct, arrays_zip, map, and array_agg (default, DISTINCT, and ORDER BY paths).
  • All existing tests in datafusion-common, datafusion-functions, datafusion-functions-aggregate, datafusion-functions-nested continue to pass; the full SLT suite (465 files) passes; clippy is clean.

Are there any user-facing changes?

Yes. The behavioral change is a bug fix: SQL-constructed lists/structs/maps now retain Arrow extension-type identity from their input fields. There is also a small breaking API change: the accumulator constructors (ArrayAggAccumulator::try_new, DistinctArrayAggAccumulator::try_new, OrderSensitiveArrayAggAccumulator::try_new, ArrayAggGroupsAccumulator::new) now take &FieldRef instead of &DataType, which affects downstream code that constructs these accumulators directly.

🤖 Generated with Claude Code

@github-actions bot added the sqllogictest, common, functions, and spark labels on May 2, 2026
@adriangb adriangb marked this pull request as draft May 2, 2026 06:39
@adriangb adriangb requested a review from Copilot May 2, 2026 06:40
Copilot AI left a comment


Pull request overview

Fixes loss of Arrow inner-Field metadata (notably ARROW:extension:*) when DataFusion UDFs/UDAFs construct composite types like List, Struct, and Map, ensuring extension-type identity survives through planning and runtime construction.

Changes:

  • Add list_inner_field_from / struct_inner_fields_from helpers and extend SingleRowListArrayBuilder::with_field to propagate metadata.
  • Update composite constructors (make_array, array_repeat, range/generate_series, arrays_zip, map, struct) and array_agg accumulators to thread metadata-bearing FieldRefs through runtime array building.
  • Add unit tests plus an end-to-end sqllogictest covering metadata propagation across affected functions.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

  • datafusion/common/src/utils/mod.rs: Adds helpers for metadata-preserving list/struct field construction; extends SingleRowListArrayBuilder to carry metadata.
  • datafusion/functions-nested/src/make_array.rs: Uses planning-time inner FieldRef in return-field and runtime construction to preserve metadata.
  • datafusion/functions-nested/src/repeat.rs: Threads inner FieldRef through array_repeat runtime paths to preserve list inner-field metadata.
  • datafusion/functions-nested/src/range.rs: Adds return_field_from_args and runtime grafting of the planned inner field for range/generate_series.
  • datafusion/functions-nested/src/arrays_zip.rs: Preserves element-field metadata in planned struct members and reuses planned struct fields at runtime.
  • datafusion/functions-nested/src/map.rs: Builds map entries/key/value fields using metadata-bearing planned fields when available.
  • datafusion/functions/src/core/struct.rs: Adds return_field_from_args to preserve per-member metadata in struct(...).
  • datafusion/functions-aggregate/src/array_agg.rs: Preserves metadata by storing a FieldRef in accumulators and using it at all list-construction sites.
  • datafusion/spark/src/function/aggregate/collect.rs: Updates Spark collect accumulators to pass FieldRef into array_agg accumulators.
  • datafusion/functions-aggregate/benches/array_agg.rs: Adjusts bench to new accumulator constructor signature (&FieldRef).
  • datafusion/sqllogictest/test_files/array_metadata_propagation.slt: New SLT regression coverage for metadata propagation.


Comment on lines 83 to 90
fn accumulator(&self, acc_args: AccumulatorArgs) -> Result<Box<dyn Accumulator>> {
    let field = &acc_args.expr_fields[0];
    let data_type = field.data_type().clone();
    let ignore_nulls = true;
    Ok(Box::new(NullToEmptyListAccumulator::new(
-       ArrayAggAccumulator::try_new(&data_type, ignore_nulls)?,
+       ArrayAggAccumulator::try_new(field, ignore_nulls)?,
        data_type,
    )))
Comment on lines 141 to 148
fn accumulator(&self, acc_args: AccumulatorArgs) -> Result<Box<dyn Accumulator>> {
    let field = &acc_args.expr_fields[0];
    let data_type = field.data_type().clone();
    let ignore_nulls = true;
    Ok(Box::new(NullToEmptyListAccumulator::new(
-       DistinctArrayAggAccumulator::try_new(&data_type, None, ignore_nulls)?,
+       DistinctArrayAggAccumulator::try_new(field, None, ignore_nulls)?,
        data_type,
    )))
adriangb (Author) commented:

@paleolimbot I wonder if you'd have time to review this PR? I think these are places where we should preserve metadata.

paleolimbot (Member) left a comment


Thank you for the ping...I'm always happy to review this kind of thing. I hadn't hit these yet but the day was coming.

A few questions inline but in general the approach is great and I'm glad these are getting fixed!

/// (`ARROW:extension:name` / `ARROW:extension:metadata`) to round-trip
/// through SQL list constructors (e.g. `make_array`, `array_agg`).
pub fn with_field(mut self, field: &Field) -> Self {
    self.field_name = Some(field.name().to_owned());
paleolimbot (Member):

Should this use the default list item name ("item") canonically? (Using a different name can sometimes cause equality issues)

Comment on lines +173 to +178
let inner_field = match args.return_field.data_type() {
    List(field) | LargeList(field) | FixedSizeList(field, _) => {
        Some(Arc::clone(field))
    }
    _ => None,
};
paleolimbot (Member):

Should this return internal_err!() for the case where the return field is not a DataType::List?

Comment on lines +267 to +269
// Prefer the planning-time struct fields (which preserve input metadata)
// when available; fall back to building bare fields from element types.
let struct_fields: Fields = match inner_field.as_ref().map(|f| f.data_type()) {
paleolimbot (Member):

If this can ever happen it would be helpful for future readers to elaborate on how (as far as I know this is not supposed to happen?)

adriangb (Author):

Yes, this should not happen unless people are calling from Rust and doing somewhat weird things.

Comment on lines 279 to 281
// Create a MutableArrayData builder per column. For None (Null-typed)
// args we only need extend_nulls, so we track them separately.
let mut builders: Vec<Option<MutableArrayData>> = values_data
paleolimbot (Member):

Not here, but for the perfect zip (all value arrays the same length, no nulls with non-zero element slot lengths, no null padding needed) this should ideally be just clones of the original arrayrefs

adriangb (Author):

Good point. I opened #22225

Comment on lines +393 to +394
let a_inner = with_meta("ARROW:extension:name", "arrow.uuid", DataType::Int64);
let b_inner = with_meta("ARROW:extension:name", "arrow.json", DataType::Utf8);
paleolimbot (Member):

If this uses official extension names it should probably use valid storage to avoid confusing future humans or LLMs. Probably arbitrary key/value metadata is equally illustrative here.

adriangb (Author):

Replaced with arbitrary metadata

Comment on lines +782 to +788
let key_inner: FieldRef =
    Arc::new(Field::new("e", DataType::Utf8, false).with_metadata(
        std::collections::HashMap::from([(
            "ARROW:extension:name".to_string(),
            "arrow.uuid".to_string(),
        )]),
    ));
paleolimbot (Member):

This is my favourite way so far of constructing the test fields (no helper function).

Also worth using arbitrary k/v metadata or a valid UUID storage here

Comment on lines +250 to +258
fn return_field_from_args(&self, args: ReturnFieldArgs) -> Result<FieldRef> {
    if args.arg_fields.iter().any(|f| f.data_type().is_null()) {
        return Ok(Arc::new(Field::new(self.name(), DataType::Null, true)));
    }
    // Reuse the data type computed by `return_type` so the date64→date32
    // coercion and timezone preservation logic stays in one place. Then
    // walk the result and replace the inner field with a metadata-bearing
    // copy of the start arg's field so extension types flow through.
    let arg_types: Vec<_> = args
paleolimbot (Member):

Do you want range() to propagate field metadata? The one extension type example I have for this is a discrete grid system (S2) which is an extension on top of a u64 where adding the step would result in an invalid or meaninglessly related identifier. For extension types specifically this probably only makes sense where the "+" operation is meaningful (and identical to the storage type). I would personally rather this didn't propagate metadata.

adriangb (Author):

You are right. I removed range()

Comment on lines +370 to +373
let metadata = HashMap::from([(
    "ARROW:extension:name".to_string(),
    "arrow.uuid".to_string(),
)]);
paleolimbot (Member):

This should probably use a valid storage type or arbitrary key/value metadata

.arg_fields
.iter()
.enumerate()
.map(|(pos, f)| nullable_inner_field_from(f, &format!("c{pos}")))
paleolimbot (Member):

Should the inner fields always be nullable or inherit the nullability of the argument?

----
List(Int64, metadata: {"k": "v"})

# array_agg DISTINCT preserves inner field metadata
paleolimbot (Member):

FWIW "distinct" has varying degrees of meaningfulness here (byte-for-byte storage equality is usually fine though, if perhaps conservative in some cases)

adriangb and others added 3 commits May 15, 2026 00:08
…rs (apache#21982)

`make_array`, `array_agg`, `array_repeat`, `arrays_zip`, `map`, `range` /
`generate_series`, and `struct` only overrode `return_type` and synthesized
fresh inner fields at runtime, so each one silently dropped the input
field's metadata when wrapping it into a `List`/`Struct`/`Map`. In practice
this broke Arrow extension types (`ARROW:extension:name` /
`ARROW:extension:metadata`) round-tripping through SQL list/struct/map
constructors.

Add two helpers in `datafusion-common::utils` — `list_inner_field_from`
and `struct_inner_fields_from` — that wrap a source `Field` into the inner
field of a list/struct while preserving its metadata, and extend
`SingleRowListArrayBuilder::with_field` to copy metadata too. Use these
helpers consistently from each affected function's
`return_field_from_args` / `return_field`, and thread the resulting
metadata-bearing inner field into the runtime construction paths
(including the `array_agg` accumulators, which now carry a `FieldRef`
instead of a bare `DataType`).

Adds Rust unit tests for each affected function plus an end-to-end
`array_metadata_propagation.slt` that asserts metadata survives every
constructor by string-matching the rendered data type.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace `list_inner_field_from` and `struct_inner_fields_from` with a single
`nullable_inner_field_from(inner, name)` that renames + forces nullable to
match Arrow's list/struct member conventions while preserving metadata.
`nullable_list_item_field_from` is a thin wrapper using
`Field::LIST_FIELD_DEFAULT_NAME`. The map function uses `Field::with_name` /
`with_nullable` directly since key/value need different nullability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lure

CI fix: rewrite the broken intra-doc link to `make_map_batch` (a
`#[cfg(test)]` helper) as a regular block doc comment so
`cargo doc -D warnings` passes.

Review feedback from @paleolimbot:

- map.rs: make `entries` non-optional (`FieldRef`) across all local
  helpers — `make_map_batch_with_entries`, `make_map_batch_internal`,
  `make_map_array_internal`, `make_map_array_from_fixed_size_list`,
  `build_map_array`, and `key_value_fields_from_entries` — so future
  contributors can't silently drop metadata. The `MapFunc::invoke_with_args`
  entry point now returns `internal_err!` if the return field isn't a
  `Map` (it always should be). `get_element_field` accepts `ListView`
  and `LargeListView` so support is wired up when those become valid
  `map(...)` inputs.

- arrays_zip.rs / make_array.rs: `invoke_with_args` now returns
  `internal_err!` for non-list return fields and threads a non-optional
  `FieldRef` through the runtime path. Test fields use arbitrary
  key/value metadata (`unit`/`ms`, `source`/`sensor`) instead of fake
  `ARROW:extension:name` values that wouldn't be valid extension-type
  storage anyway.

- range.rs: revert metadata propagation. `range(start, stop, step)`'s
  output values are constructed by adding `step` to `start`, which is
  meaningless for many extension types (e.g. discrete grid systems on
  top of `u64`), so propagating the start field's metadata onto the
  output list's inner field would surface invalid identifiers. Drop
  the corresponding SLT cases.

- repeat.rs: same test-metadata cleanup.

- collect.rs (spark): `NullToEmptyListAccumulator` now stores the input
  `FieldRef` instead of a bare `DataType`, so the empty list it
  synthesizes in the all-NULL case carries the same inner-field
  metadata as the non-NULL path. Both `SparkCollectList` and
  `SparkCollectSet` override `return_field` and propagate inner
  metadata through `state_fields`.

- array_metadata_propagation.slt: each constructor is now exercised
  twice — once with literal arguments (constant-folded at planning) and
  once over a `VALUES` table (physical execution) — since the two
  paths can synthesize fields differently.

- Update memory-size assertion in `does_not_over_account_memory`
  (282 → 266) to reflect `FieldRef` (Arc<Field>, pointer-sized) being
  smaller than the inlined `DataType`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
adriangb and others added 2 commits May 15, 2026 15:38
… list

`NullToEmptyListAccumulator::evaluate` builds the all-NULL empty list via
`SingleRowListArrayBuilder::with_field`, which copies the field name
verbatim. It was passing the raw input element field (e.g. named `"a"`
for column `a`), so the empty list's inner field was `List(item: "a")`
while `return_field` / `state_fields` declare the canonical
`List(item: "item")`. The mismatch surfaced as Arrow "column types must
match schema types" / concat errors for `collect_set` GROUP BY queries
that mix all-NULL and non-NULL groups (regression in spark
`collect.slt`).

Canonicalize the stored field once in `NullToEmptyListAccumulator::new`
via `nullable_list_item_field_from` (renames to `"item"`, keeps
metadata) so the empty list matches the declared type.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`new` accepted `value_field` by value but only borrowed it (passed
`&value_field` to `nullable_list_item_field_from`), tripping
`clippy::needless_pass_by_value`. Take `&FieldRef` and let the callers
pass the `&acc_args.expr_fields[0]` reference directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangb adriangb marked this pull request as ready for review May 16, 2026 06:55

Labels

  • common: Related to common crate
  • functions: Changes to functions implementation
  • spark
  • sqllogictest: SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

make_array / array_agg drop inner-Field metadata when constructing List<T>

3 participants