add native union type casting support#9544
add native union type casting support#9544friendlymatthew wants to merge 2 commits intoapache:mainfrom
Conversation
friendlymatthew
left a comment
There was a problem hiding this comment.
self review
arrow-cast/src/cast/mod.rs
Outdated
| (_, Union(_, _)) => Err(ArrowError::CastError(format!( | ||
| "Casting from {from_type} to {to_type} not supported" | ||
| ))), |
There was a problem hiding this comment.
This PR does not support casting scalar types into a Union type
It's not a feature that we currently need and will probably need more complex work to get it correct. Though, I think this is a cool/valid feature. I can file an issue if liked
arrow-cast/src/cast/mod.rs
Outdated
| let type_ids = array.type_ids().clone(); | ||
| let offsets = array.offsets().cloned(); | ||
|
|
||
| let new_children: Vec<ArrayRef> = from_fields | ||
| .iter() | ||
| .map(|(from_id, _from_field)| { | ||
| let (_, to_field) = to_fields | ||
| .iter() | ||
| .find(|(to_id, _)| *to_id == from_id) | ||
| .ok_or_else(|| { | ||
| ArrowError::CastError(format!( | ||
| "Cannot cast union: type_id {from_id} not found in target union" | ||
| )) | ||
| })?; | ||
| let child = array.child(from_id); | ||
| cast_with_options(child.as_ref(), to_field.data_type(), cast_options) | ||
| }) | ||
| .collect::<Result<_, _>>()?; |
There was a problem hiding this comment.
look ups like this can go from n^2 to nlogn if we implement #8937, since sorting by type_id will give us logn searching
9c19cb1 to
8dbe39f
Compare
|
Thanks -- I hope to go through arrow prs more carefully starting tomorrow |
alamb
left a comment
There was a problem hiding this comment.
thank you @friendlymatthew
I like where this is headed.
I left some suggestions - let me know if it makes sense
arrow-cast/src/cast/mod.rs
Outdated
| array: &UnionArray, | ||
| from_fields: &UnionFields, | ||
| to_fields: &UnionFields, | ||
| _to_mode: UnionMode, |
There was a problem hiding this comment.
why is the _to_mode parameter ignored?
arrow-cast/src/cast/mod.rs
Outdated
|
|
||
| #[test] | ||
| fn test_cast_union_prefers_exact_type_match() { | ||
| // Union(Int32, Int64) → Int64: Int64 is an exact match, so only |
There was a problem hiding this comment.
Can you please document the expected union casting behavior in the documentation?
Specifically, somewhere here:
https://docs.rs/arrow/latest/arrow/compute/fn.cast_with_options.html#durations-and-intervals
That will make it easier to understand the intended semantics for union casting as well as verify that this code implements the semantics correctly
arrow-cast/src/cast/mod.rs
Outdated
| /// Since union extraction inherently introduces nulls for non-matching rows, | ||
| /// the target type's inner fields are made nullable to avoid validation errors | ||
| /// (e.g., casting to `List(non-nullable Utf8)` becomes `List(nullable Utf8)`). | ||
| fn cast_union_to_type( |
There was a problem hiding this comment.
I do wonder on the semantics being introduced here; specifically about how we apparently allow casting if at least one field can be cast, but we null the other fields. It might not seem intuitive; I do wonder if we have some existing precedent in other systems we could follow 🤔
There was a problem hiding this comment.
Another potential option is to introduce the behavior into union_extract: https://docs.rs/arrow/latest/arrow/compute/kernels/union_extract/fn.union_extract.html
For example, maybe union_extract_with_option that takes some sort of options struct that allows better control over the extraction
096984a to
3056c06
Compare
3056c06 to
1cf2085
Compare
friendlymatthew
left a comment
There was a problem hiding this comment.
I restarted this PR to cut down on the scope.
This PR adds support for casting union arrays to scalar types only. The casting doesn't reimplement the extraction logic, but now builds on the existing union_extract kernel
The core addition is union_extract_by_type which selects the best-matching variant from a union for a given target type.
The variant selection uses a 3-pass heuristic: exact type match, same type family, then any castable variant
Curious to hear your thoughts!
friendlymatthew
left a comment
There was a problem hiding this comment.
self review
| // this is used during variant selection to prefer a "close" type over a distant cast | ||
| // for example: when targeting Utf8View, a Utf8 variant is preferred over Int32 despite both being castable | ||
| fn same_type_family(a: &DataType, b: &DataType) -> bool { | ||
| use DataType::*; | ||
| matches!( | ||
| (a, b), | ||
| (Utf8 | LargeUtf8 | Utf8View, Utf8 | LargeUtf8 | Utf8View) | ||
| | ( | ||
| Binary | LargeBinary | BinaryView, | ||
| Binary | LargeBinary | BinaryView | ||
| ) | ||
| | (Int8 | Int16 | Int32 | Int64, Int8 | Int16 | Int32 | Int64) | ||
| | ( | ||
| UInt8 | UInt16 | UInt32 | UInt64, | ||
| UInt8 | UInt16 | UInt32 | UInt64 | ||
| ) | ||
| | (Float16 | Float32 | Float64, Float16 | Float32 | Float64) | ||
| ) | ||
| } |
There was a problem hiding this comment.
I wonder if this is something that would be useful as a DataType api...
Equivalence classes let you answer "are these the same logical data" without caring about the physical repr
| // variant selection heuristic — 3 passes with decreasing specificity: | ||
| // | ||
| // first pass: field type == target type | ||
| // second pass: field and target are in the same equivalence class | ||
| // (e.g., Utf8 and Utf8View are both strings) | ||
| // third pass: field can be cast to target | ||
| // note: this is the most permissive and may lose information | ||
| // also, the matching logic is greedy so it will pick the first 'castable' variant | ||
| // | ||
| // each pass picks the first matching variant by type_id order. |
There was a problem hiding this comment.
the main heuristic!
I think pass 2 is necessary because can_cast_types is permissive enough that without it, Union(Int32, Utf8) cast to a Utf8View would pick Int32 over Utf8...
Though we can fix this by defining the Utf8 variant first. But I wonder if we should leave it up to the user...
0dc9c7b to
b47a20a
Compare
Which issue does this PR close?
arrow-cast#9225Rationale for this change
This PR adds native support for casting Union arrays in the cast kernel. Previously,
can_cast_typesandcast_with_optionshad no handling for union types at all