Skip to content

[Variant] Support Binary/LargeBinary children #9610

Open
AdamGS wants to merge 4 commits intoapache:mainfrom
AdamGS:adamg/variant-all-binary-children
Open

[Variant] Support Binary/LargeBinary children #9610
AdamGS wants to merge 4 commits intoapache:mainfrom
AdamGS:adamg/variant-all-binary-children

Conversation

@AdamGS
Copy link
Copy Markdown
Contributor

@AdamGS AdamGS commented Mar 24, 2026

Which issue does this PR close?

Rationale for this change

Improves spec compliance and improves performance, allowing more zero-copy copies between parquet and/or other implementation.

What changes are included in this PR?

The main change is replacing VariantArray's children with opaque ArrayRef, and handling validation accordingly.

Are these changes tested?

All existing tests are still in place.

Are there any user-facing changes?

Yes, some functions on VariantArray now take or return ArrayRef, from_parts, value_field, metadata_field.

@github-actions github-actions bot added the parquet-variant parquet-variant* crates label Mar 24, 2026
Comment on lines +484 to +488
impl PartialEq for VariantArray {
fn eq(&self, other: &Self) -> bool {
self.inner == other.inner
}
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This replaces the derived implementation, I think it should be correct as all other members are derived from it.

@sdf-jkl
Copy link
Copy Markdown
Contributor

sdf-jkl commented Mar 27, 2026

LGTM, this will be useful for #8354.

@scovich please take a look

Copy link
Copy Markdown
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder... should we introduce a new BinaryLikeArray (similar to existing ListLikeArray) and use Arc<dyn BinaryLikeArray> here? That would give better type safety while hopefully also harmonizing array operations behind a single interface, simplifying the code that interacts with it.

The one bummer is, I don't think there's an easy way to convert that back to a normal ArrayRef, other than a round trip through ArrayData (see #8794). But the same was true for BinaryViewArray this code was already using.

DataType::Binary => array.as_binary::<i32>().value(index),
DataType::LargeBinary => array.as_binary::<i64>().value(index),
DataType::BinaryView => array.as_binary_view().value(index),
other => panic!("Expected Binary, LargeBinary, or BinaryView array, got {other}"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love panics... is there something that makes this one structurally impossible to trigger?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of a panic, could we return Option<&[u8]> and let the caller deal with it?
In many cases they could simply .unwrap() with a safety note about match arms etc guaranteeing safety?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.expect is still a panic if the expectation fails... do we have some guarantee that that each call site can never fail the expectation?

let array = typed_value.as_binary_view();
let value = array.value(index);
DataType::Binary | DataType::LargeBinary | DataType::BinaryView => {
let value = binary_array_value(typed_value.as_ref(), index);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this one is special. Unlike value and metadata, typed_value is not always a binary-like type. So we couldn't rely on a hypothetical BinaryLikeArray trait for type safety.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can fit value and metadata right? was looking for something like it, but I didn't know about ListLikeArray, I'll look into it. One annoying difference here is that the spec doesn't seem to accept fixed sized binary, which might be weird but prevents us from fully assuming its a binary datatype.

@AdamGS AdamGS force-pushed the adamg/variant-all-binary-children branch from 4bf9db8 to c5afc06 Compare March 30, 2026 13:09
Copy link
Copy Markdown
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting close. We just need to get rid of the .expect calls and it should be ready to go.

} else {
let variant = Variant::new_with_metadata(metadata.clone(), self.value.value(index));
let value_bytes = binary_array_value(self.value.as_ref(), index)
.expect("value field must be a binary-like array");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will panic if the expectation fails. Do we have a guarantee that the expectation always holds?

Also: this method returns Result, why not take advantage of ??

DataType::Binary => array.as_binary::<i32>().value(index),
DataType::LargeBinary => array.as_binary::<i64>().value(index),
DataType::BinaryView => array.as_binary_view().value(index),
other => panic!("Expected Binary, LargeBinary, or BinaryView array, got {other}"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.expect is still a panic if the expectation fails... do we have some guarantee that that each call site can never fail the expectation?

let value = value.map(|v| Variant::new_with_metadata($metadata.clone(), v.value($index)));
let value = value.map(|v| {
let bytes = binary_array_value(v.as_ref(), $index)
.expect("value field must be a binary-like array");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handle_unshredded_case! is expected to return Err(...) in case of a problem. See L374 below, for example.

let value = array.value(index);
DataType::Binary | DataType::LargeBinary | DataType::BinaryView => {
let value = binary_array_value(typed_value.as_ref(), index)
.expect("match arm guarantees the array is binary-like");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another panic where Err would suffice

Tho in this case it is true (albeit somewhat brittle) that the panic should be unreachable.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my general instinct is to panic in invalid states, but honestly not sure how similar code paths are treated elsewhere, so I'll change everything to errors.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this case that's completely avoidable, no need to use the binary_array_value here.

@AdamGS AdamGS force-pushed the adamg/variant-all-binary-children branch from c5afc06 to a695fbd Compare April 1, 2026 12:32
AdamGS added 4 commits April 9, 2026 13:16
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
@AdamGS AdamGS force-pushed the adamg/variant-all-binary-children branch from a695fbd to cf0d40e Compare April 9, 2026 12:53
@AdamGS
Copy link
Copy Markdown
Contributor Author

AdamGS commented Apr 9, 2026

I think I addressed all comments, and added a test and a fix for a bug with how top-level nullability was propagated.

Copy link
Copy Markdown
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great!

Three things to consider before merge (see comments for details):

  • The perfect shredding null handling bug should ideally be fixed in a separate PR (separation of concerns, etc)
  • The new binary value get produces a repetitive boilerplate in both prod and test code that should ideally be encapsulated with a helper function.
  • Arc<T: Foo> coerces to Arc<dyn Foo> (casting seldom needed)

Ok(VariantArray::from_parts(
array.metadata_field().clone(),
Some(value),
Some(Arc::new(value) as ArrayRef),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'm pretty sure the cast is unnecessary (automatic coercion by compiler)

(several more below)

Comment on lines -1409 to +1440
Variant::new(metadata.value(3), value.value(3)),
Variant::new(
binary_array_value(metadata.as_ref(), 3).unwrap(),
binary_array_value(value.as_ref(), 3).unwrap()
),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a lot of these, spread across several test modules. Do you think it would be helpful to define a test helper function that takes a pair of arrays (applying deref) plus an index, and encapsulates the binary array value+unwrap calls?

assert_eq!(
    variant_from_arrays_at(metadata, value, 3), 
    Ok(Variant::Null)
);

assert_eq!(
expected_score_value,
get_value(i, metadata, score_value)
get_value(i, metadata.as_ref(), score_value.as_ref())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does deref coercion somehow not kick in here?
metadata: &Arc<dyn Array> (returned by VariantArray::metadata_field) should coerce automatically to &dyn Array.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah... that only works for concrete types, not dyn trait refs. Blech.

})?;
Result::Ok(Variant::new_with_metadata($metadata.clone(), bytes))
})
.transpose()?;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, why not just return the error directly? What does the map+transpose pair accomplish?

Comment on lines +56 to +66
if matches!(
array.data_type(),
DataType::Binary | DataType::LargeBinary | DataType::BinaryView
) {
Ok(())
} else {
Err(ArrowError::InvalidArgumentError(format!(
"VariantArray '{field_name}' field must be Binary, LargeBinary, or BinaryView, got {}",
array.data_type()
)))
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Simpler as an actual match?

Suggested change
if matches!(
array.data_type(),
DataType::Binary | DataType::LargeBinary | DataType::BinaryView
) {
Ok(())
} else {
Err(ArrowError::InvalidArgumentError(format!(
"VariantArray '{field_name}' field must be Binary, LargeBinary, or BinaryView, got {}",
array.data_type()
)))
}
match array.data_type() {
DataType::Binary | DataType::LargeBinary | DataType::BinaryView => Ok(()),
_ => Err(ArrowError::InvalidArgumentError(format!(
"VariantArray '{field_name}' field must be Binary, LargeBinary, or BinaryView, got {}",
array.data_type()
)))
}

Comment on lines +395 to +408
let metadata =
binary_array_value(self.metadata.as_ref(), index).ok_or_else(|| {
ArrowError::InvalidArgumentError(format!(
"metadata field must be a binary-like array, instead got {}",
self.metadata.data_type(),
))
})?;
let value = binary_array_value(value.as_ref(), index).ok_or_else(|| {
ArrowError::InvalidArgumentError(format!(
"value field must be a binary-like array, instead got {}",
value.data_type(),
))
})?;
Ok(Variant::new(metadata, value))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This matches the pattern I flagged in unit tests earlier. Should we consider adding a new Variant::try_new_from_arrays_at constructor instead of the test helper I had suggested there?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh... arrow-array is not available in the parquet-variant crate, and we can't define new constructors on Variant in this parquet-variant-compute crate. So we'd need to keep the stand-alone variant_from_arrays_at helper after all, unless we got fancy with extension traits.

Comment on lines -865 to +900
let Some(binary_view) = value_col.as_binary_view_opt() else {
return Err(ArrowError::NotYetImplemented(format!(
"VariantArray 'value' field must be BinaryView, got {}",
value_col.data_type()
)));
};
Some(binary_view)
validate_binary_array(value_col.as_ref(), "value")?;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice side effect, seeing this repeated code pattern simplified and encapsulated!

// This is a perfect shredding, where the value is entirely shredded out,
// so we can just return the typed value.
return Some(typed_value.clone());
// so we can just return the typed value after merging the accumulated nulls.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yelp! This is the bug you mentioned finding? We may want to consider merging the fix+test as a separate PR, since unrelated to binary-like arrays?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet-variant parquet-variant* crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Variant] Support BinaryArray/LargeBinaryArray in addition to BinaryViewArray for Variant

3 participants