Skip to content

add native union type casting support#9544

Open
friendlymatthew wants to merge 2 commits intoapache:mainfrom
pydantic:friendlymatthew/cast-union
Open

add native union type casting support#9544
friendlymatthew wants to merge 2 commits intoapache:mainfrom
pydantic:friendlymatthew/cast-union

Conversation

@friendlymatthew
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

This PR adds native support for casting Union arrays in the cast kernel. Previously, can_cast_types and cast_with_options had no handling for union types at all

@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 12, 2026
Copy link
Contributor Author

@friendlymatthew friendlymatthew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self review

Comment on lines +1209 to +1211
(_, Union(_, _)) => Err(ArrowError::CastError(format!(
"Casting from {from_type} to {to_type} not supported"
))),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR does not support casting scalar types into a Union type

It's not a feature that we currently need and will probably need more complex work to get it correct. Though, I think this is a cool/valid feature. I can file an issue if liked

Comment on lines +2316 to +2333
let type_ids = array.type_ids().clone();
let offsets = array.offsets().cloned();

let new_children: Vec<ArrayRef> = from_fields
.iter()
.map(|(from_id, _from_field)| {
let (_, to_field) = to_fields
.iter()
.find(|(to_id, _)| *to_id == from_id)
.ok_or_else(|| {
ArrowError::CastError(format!(
"Cannot cast union: type_id {from_id} not found in target union"
))
})?;
let child = array.child(from_id);
cast_with_options(child.as_ref(), to_field.data_type(), cast_options)
})
.collect::<Result<_, _>>()?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look ups like this can go from n^2 to nlogn if we implement #8937, since sorting by type_id will give us logn searching

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/cast-union branch 2 times, most recently from 9c19cb1 to 8dbe39f Compare March 12, 2026 18:24
@friendlymatthew
Copy link
Contributor Author

cc @Jefffrey @alamb curious to get your thoughts

This is a feature we use in production at Pydantic, and we actively maintain a vendored version that we're eager to upstream

@alamb
Copy link
Contributor

alamb commented Mar 16, 2026

Thanks -- I hope to go through arrow prs more carefully starting tomorrow

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you @friendlymatthew

I like where this is headed.

I left some suggestions - let me know if it makes sense

array: &UnionArray,
from_fields: &UnionFields,
to_fields: &UnionFields,
_to_mode: UnionMode,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the _to_mode parameter ignored?


#[test]
fn test_cast_union_prefers_exact_type_match() {
// Union(Int32, Int64) → Int64: Int64 is an exact match, so only
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please document the expected union casting behavior in the documentation?

Specifically, somewhere here:

https://docs.rs/arrow/latest/arrow/compute/fn.cast_with_options.html#durations-and-intervals

That will make it easier to understand the intended semantics for union casting as well as verify that this code implements the semantics correctly

/// Since union extraction inherently introduces nulls for non-matching rows,
/// the target type's inner fields are made nullable to avoid validation errors
/// (e.g., casting to `List(non-nullable Utf8)` becomes `List(nullable Utf8)`).
fn cast_union_to_type(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do wonder on the semantics being introduced here; specifically about how we apparently allow casting if at least one field can be cast, but we null the other fields. It might not seem intuitive; I do wonder if we have some existing precedent in other systems we could follow 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another potential option is to introduce the behavior into union_extract: https://docs.rs/arrow/latest/arrow/compute/kernels/union_extract/fn.union_extract.html

For example, maybe union_extract_with_option that takes some sort of options struct that allows better control over the extraction

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/cast-union branch 2 times, most recently from 096984a to 3056c06 Compare March 23, 2026 16:08
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/cast-union branch from 3056c06 to 1cf2085 Compare March 23, 2026 16:13
Copy link
Contributor Author

@friendlymatthew friendlymatthew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jefffrey @alamb

I restarted this PR to cut down on the scope.

This PR adds support for casting union arrays to scalar types only. The casting doesn't reimplement the extraction logic, but now builds on the existing union_extract kernel

The core addition is union_extract_by_type which selects the best-matching variant from a union for a given target type.

The variant selection uses a 3-pass heuristic: exact type match, same type family, then any castable variant

Curious to hear your thoughts!

Copy link
Contributor Author

@friendlymatthew friendlymatthew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self review

Comment on lines +28 to +46
// this is used during variant selection to prefer a "close" type over a distant cast
// for example: when targeting Utf8View, a Utf8 variant is preferred over Int32 despite both being castable
fn same_type_family(a: &DataType, b: &DataType) -> bool {
use DataType::*;
matches!(
(a, b),
(Utf8 | LargeUtf8 | Utf8View, Utf8 | LargeUtf8 | Utf8View)
| (
Binary | LargeBinary | BinaryView,
Binary | LargeBinary | BinaryView
)
| (Int8 | Int16 | Int32 | Int64, Int8 | Int16 | Int32 | Int64)
| (
UInt8 | UInt16 | UInt32 | UInt64,
UInt8 | UInt16 | UInt32 | UInt64
)
| (Float16 | Float32 | Float64, Float16 | Float32 | Float64)
)
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this is something that would be useful as a DataType api...

Equivalence classes let you answer "are these the same logical data" without caring about the physical repr

Comment on lines +48 to +57
// variant selection heuristic — 3 passes with decreasing specificity:
//
// first pass: field type == target type
// second pass: field and target are in the same equivalence class
// (e.g., Utf8 and Utf8View are both strings)
// third pass: field can be cast to target
// note: this is the most permissive and may lose information
// also, the matching logic is greedy so it will pick the first 'castable' variant
//
// each pass picks the first matching variant by type_id order.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the main heuristic!

I think pass 2 is necessary because can_cast_types is permissive enough that without it, Union(Int32, Utf8) cast to a Utf8View would pick Int32 over Utf8...

Though we can fix this by defining the Utf8 variant first. But I wonder if we should leave it up to the user...

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/cast-union branch from 0dc9c7b to b47a20a Compare March 23, 2026 16:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Union type casting support to arrow-cast

3 participants