[Variant] Support Binary/LargeBinary children #9610
[Variant] Support Binary/LargeBinary children #9610AdamGS wants to merge 4 commits intoapache:mainfrom
Conversation
| impl PartialEq for VariantArray { | ||
| fn eq(&self, other: &Self) -> bool { | ||
| self.inner == other.inner | ||
| } | ||
| } |
There was a problem hiding this comment.
This replaces the derived implementation, I think it should be correct as all other members are derived from it.
scovich
left a comment
There was a problem hiding this comment.
I wonder... should we introduce a new BinaryLikeArray (similar to existing ListLikeArray) and use Arc<dyn BinaryLikeArray> here? That would give better type safety while hopefully also harmonizing array operations behind a single interface, simplifying the code that interacts with it.
The one bummer is, I don't think there's an easy way to convert that back to a normal ArrayRef, other than a round trip through ArrayData (see #8794). But the same was true for BinaryViewArray this code was already using.
| DataType::Binary => array.as_binary::<i32>().value(index), | ||
| DataType::LargeBinary => array.as_binary::<i64>().value(index), | ||
| DataType::BinaryView => array.as_binary_view().value(index), | ||
| other => panic!("Expected Binary, LargeBinary, or BinaryView array, got {other}"), |
There was a problem hiding this comment.
I don't love panics... is there something that makes this one structurally impossible to trigger?
There was a problem hiding this comment.
Instead of a panic, could we return Option<&[u8]> and let the caller deal with it?
In many cases they could simply .unwrap() with a safety note about match arms etc guaranteeing safety?
There was a problem hiding this comment.
.expect is still a panic if the expectation fails... do we have some guarantee that that each call site can never fail the expectation?
| let array = typed_value.as_binary_view(); | ||
| let value = array.value(index); | ||
| DataType::Binary | DataType::LargeBinary | DataType::BinaryView => { | ||
| let value = binary_array_value(typed_value.as_ref(), index); |
There was a problem hiding this comment.
Oh, this one is special. Unlike value and metadata, typed_value is not always a binary-like type. So we couldn't rely on a hypothetical BinaryLikeArray trait for type safety.
There was a problem hiding this comment.
can fit value and metadata right? was looking for something like it, but I didn't know about ListLikeArray, I'll look into it. One annoying difference here is that the spec doesn't seem to accept fixed sized binary, which might be weird but prevents us from fully assuming its a binary datatype.
4bf9db8 to
c5afc06
Compare
scovich
left a comment
There was a problem hiding this comment.
Getting close. We just need to get rid of the .expect calls and it should be ready to go.
| } else { | ||
| let variant = Variant::new_with_metadata(metadata.clone(), self.value.value(index)); | ||
| let value_bytes = binary_array_value(self.value.as_ref(), index) | ||
| .expect("value field must be a binary-like array"); |
There was a problem hiding this comment.
This will panic if the expectation fails. Do we have a guarantee that the expectation always holds?
Also: this method returns Result, why not take advantage of ??
| DataType::Binary => array.as_binary::<i32>().value(index), | ||
| DataType::LargeBinary => array.as_binary::<i64>().value(index), | ||
| DataType::BinaryView => array.as_binary_view().value(index), | ||
| other => panic!("Expected Binary, LargeBinary, or BinaryView array, got {other}"), |
There was a problem hiding this comment.
.expect is still a panic if the expectation fails... do we have some guarantee that that each call site can never fail the expectation?
| let value = value.map(|v| Variant::new_with_metadata($metadata.clone(), v.value($index))); | ||
| let value = value.map(|v| { | ||
| let bytes = binary_array_value(v.as_ref(), $index) | ||
| .expect("value field must be a binary-like array"); |
There was a problem hiding this comment.
handle_unshredded_case! is expected to return Err(...) in case of a problem. See L374 below, for example.
| let value = array.value(index); | ||
| DataType::Binary | DataType::LargeBinary | DataType::BinaryView => { | ||
| let value = binary_array_value(typed_value.as_ref(), index) | ||
| .expect("match arm guarantees the array is binary-like"); |
There was a problem hiding this comment.
another panic where Err would suffice
Tho in this case it is true (albeit somewhat brittle) that the panic should be unreachable.
There was a problem hiding this comment.
I think my general instinct is to panic in invalid states, but honestly not sure how similar code paths are treated elsewhere, so I'll change everything to errors.
There was a problem hiding this comment.
in this case that's completely avoidable, no need to use the binary_array_value here.
c5afc06 to
a695fbd
Compare
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
a695fbd to
cf0d40e
Compare
|
I think I addressed all comments, and added a test and a fix for a bug with how top-level nullability was propagated. |
scovich
left a comment
There was a problem hiding this comment.
This looks great!
Three things to consider before merge (see comments for details):
- The perfect shredding null handling bug should ideally be fixed in a separate PR (separation of concerns, etc)
- The new binary value get produces a repetitive boilerplate in both prod and test code that should ideally be encapsulated with a helper function.
Arc<T: Foo>coerces toArc<dyn Foo>(casting seldom needed)
| Ok(VariantArray::from_parts( | ||
| array.metadata_field().clone(), | ||
| Some(value), | ||
| Some(Arc::new(value) as ArrayRef), |
There was a problem hiding this comment.
nit: I'm pretty sure the cast is unnecessary (automatic coercion by compiler)
(several more below)
| Variant::new(metadata.value(3), value.value(3)), | ||
| Variant::new( | ||
| binary_array_value(metadata.as_ref(), 3).unwrap(), | ||
| binary_array_value(value.as_ref(), 3).unwrap() | ||
| ), |
There was a problem hiding this comment.
There are a lot of these, spread across several test modules. Do you think it would be helpful to define a test helper function that takes a pair of arrays (applying deref) plus an index, and encapsulates the binary array value+unwrap calls?
assert_eq!(
variant_from_arrays_at(metadata, value, 3),
Ok(Variant::Null)
);| assert_eq!( | ||
| expected_score_value, | ||
| get_value(i, metadata, score_value) | ||
| get_value(i, metadata.as_ref(), score_value.as_ref()) |
There was a problem hiding this comment.
Does deref coercion somehow not kick in here?
metadata: &Arc<dyn Array> (returned by VariantArray::metadata_field) should coerce automatically to &dyn Array.
There was a problem hiding this comment.
Ah... that only works for concrete types, not dyn trait refs. Blech.
| })?; | ||
| Result::Ok(Variant::new_with_metadata($metadata.clone(), bytes)) | ||
| }) | ||
| .transpose()?; |
There was a problem hiding this comment.
Out of curiosity, why not just return the error directly? What does the map+transpose pair accomplish?
| if matches!( | ||
| array.data_type(), | ||
| DataType::Binary | DataType::LargeBinary | DataType::BinaryView | ||
| ) { | ||
| Ok(()) | ||
| } else { | ||
| Err(ArrowError::InvalidArgumentError(format!( | ||
| "VariantArray '{field_name}' field must be Binary, LargeBinary, or BinaryView, got {}", | ||
| array.data_type() | ||
| ))) | ||
| } |
There was a problem hiding this comment.
nit: Simpler as an actual match?
| if matches!( | |
| array.data_type(), | |
| DataType::Binary | DataType::LargeBinary | DataType::BinaryView | |
| ) { | |
| Ok(()) | |
| } else { | |
| Err(ArrowError::InvalidArgumentError(format!( | |
| "VariantArray '{field_name}' field must be Binary, LargeBinary, or BinaryView, got {}", | |
| array.data_type() | |
| ))) | |
| } | |
| match array.data_type() { | |
| DataType::Binary | DataType::LargeBinary | DataType::BinaryView => Ok(()), | |
| _ => Err(ArrowError::InvalidArgumentError(format!( | |
| "VariantArray '{field_name}' field must be Binary, LargeBinary, or BinaryView, got {}", | |
| array.data_type() | |
| ))) | |
| } |
| let metadata = | ||
| binary_array_value(self.metadata.as_ref(), index).ok_or_else(|| { | ||
| ArrowError::InvalidArgumentError(format!( | ||
| "metadata field must be a binary-like array, instead got {}", | ||
| self.metadata.data_type(), | ||
| )) | ||
| })?; | ||
| let value = binary_array_value(value.as_ref(), index).ok_or_else(|| { | ||
| ArrowError::InvalidArgumentError(format!( | ||
| "value field must be a binary-like array, instead got {}", | ||
| value.data_type(), | ||
| )) | ||
| })?; | ||
| Ok(Variant::new(metadata, value)) |
There was a problem hiding this comment.
This matches the pattern I flagged in unit tests earlier. Should we consider adding a new Variant::try_new_from_arrays_at constructor instead of the test helper I had suggested there?
There was a problem hiding this comment.
Oh... arrow-array is not available in the parquet-variant crate, and we can't define new constructors on Variant in this parquet-variant-compute crate. So we'd need to keep the stand-alone variant_from_arrays_at helper after all, unless we got fancy with extension traits.
| let Some(binary_view) = value_col.as_binary_view_opt() else { | ||
| return Err(ArrowError::NotYetImplemented(format!( | ||
| "VariantArray 'value' field must be BinaryView, got {}", | ||
| value_col.data_type() | ||
| ))); | ||
| }; | ||
| Some(binary_view) | ||
| validate_binary_array(value_col.as_ref(), "value")?; |
There was a problem hiding this comment.
Nice side effect, seeing this repeated code pattern simplified and encapsulated!
| // This is a perfect shredding, where the value is entirely shredded out, | ||
| // so we can just return the typed value. | ||
| return Some(typed_value.clone()); | ||
| // so we can just return the typed value after merging the accumulated nulls. |
There was a problem hiding this comment.
Yelp! This is the bug you mentioned finding? We may want to consider merging the fix+test as a separate PR, since unrelated to binary-like arrays?
Which issue does this PR close?
BinaryArray/LargeBinaryArrayin addition toBinaryViewArrayfor Variant #8387Rationale for this change
Improves spec compliance and improves performance, allowing more zero-copy copies between parquet and/or other implementation.
What changes are included in this PR?
The main change is replacing
VariantArray's children with opaqueArrayRef, and handling validation accordingly.Are these changes tested?
All existing tests are still in place.
Are there any user-facing changes?
Yes, some functions on
VariantArraynow take or returnArrayRef,from_parts,value_field,metadata_field.