Support extracting struct fields as Variant using ExtensionType #9598
codephage2020 wants to merge 5 commits into apache:main
Conversation
scovich
left a comment
Thanks for looking into this. A bunch of questions.
Thank you for the review. Looks like #9612 needs to be merged first. |
@codephage2020 #9612 was merged |
Thanks! I'll pick this up again later tonight. |
Force-pushed 1439645 to 39bc464, then to 677ceca, then to 97e687e.
scovich
left a comment
This PR is turning out to be more subtle than I initially realized. A couple big points to consider in the comments.
```rust
// followed further. Fall back to the value column if present,
// otherwise the path is missing.
```
I don't understand this comment. What value column fallback does it refer to? The only caller of this follow_shredded_path_element function simply creates an all-null array if we return missing_path_step().
Update: I think I finally figured it out... this code is attempting to traverse through a shredded variant object. Say we're following the path `x.y.z` and the current path step is `y`:

- (L72) The `typed_value` we start with should ideally be a struct. If it's any other type, then `x` was not shredded as struct. Two sub-cases:
  - `x.value` is present. Some entries may be `Variant::Object`. In fact, it's possible that every entry is `Variant::Object` (or `Variant::Null`), e.g. if the data match the query schema but the wrong shredding schema was applied when writing it out. So it's actually incorrect to blow up here, regardless of JSONpath semantics, because we haven't proved there's a problem yet (we have to row-shred `x.value` and see what it contains).
  - The `x.value` column is just plain missing. This means the column shredded perfectly as some other type, and we know there are no nulls and no structs. This is currently treated as an error case, but we should really just return `missing_path_step()` instead (variant_get should follow JSONpath semantics #9606).
- (L86) `x.typed_value` was a struct as expected, but `x.typed_value.y` doesn't exist => `missing_path_step()`
- (L91) `x.typed_value.y` exists, but it's not a struct. This is a malformed data scenario because we're in a shredded variant object and the spec mandates that every field must be a struct (containing `value` and/or `typed_value` fields). So the original code was correct to error out upon discovering the problem, regardless of JSONpath semantics (it just failed to explain why it was correct).
- (L101) `x.typed_value.y` is a struct, so we attempt to return a shredding state for its `value` and/or `typed_value` fields.
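The four outcomes above can be condensed into a small decision function. This is a hypothetical sketch for illustration only; the enum, function, and parameter names are stand-ins, not the crate's real API:

```rust
/// Hypothetical classification of one shredded-path traversal step;
/// the variants mirror the four cases described above, not real crate types.
#[derive(Debug, PartialEq)]
enum StepOutcome {
    /// `typed_value` is not a struct: fall back to row-shredding `value`,
    /// or treat the path as missing if `value` is absent too.
    FallBackToValue,
    /// `typed_value` is a struct but has no field for this step: missing_path_step()
    MissingPathStep,
    /// The field exists but is not a struct: malformed shredded data, error out
    Malformed,
    /// The field is a struct: recurse into its value/typed_value fields
    Recurse,
}

fn classify_step(
    typed_value_is_struct: bool,
    field_present: bool,
    field_is_struct: bool,
) -> StepOutcome {
    if !typed_value_is_struct {
        StepOutcome::FallBackToValue
    } else if !field_present {
        StepOutcome::MissingPathStep
    } else if !field_is_struct {
        StepOutcome::Malformed
    } else {
        StepOutcome::Recurse
    }
}
```

Only the `Malformed` arm is a genuine error under JSONpath semantics; the first two arms must not blow up.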
Great analysis, thank you. In a shredded variant object, every field must be a struct containing `value` and/or `typed_value` sub-fields per the spec.
A non-struct field here means malformed data, and the original error was correct.
I've reverted to returning an `InvalidArgumentError`:
```rust
let struct_array = field.as_struct_opt().ok_or_else(|| {
    ArrowError::InvalidArgumentError(format!(
        "Expected shredded variant struct for field '{}', got {}",
        name,
        field.data_type(),
    ))
})?;
```
The error message now names the field for easier debugging.
I agree with @codephage2020. This is malformed data rather than a JSONPath-handling case.
Spark appears to treat this as `malformedVariant` as well.
What Codex spewed out for me:
Path-extraction route (closest to variant_get path-step behavior):

- `ParquetRowConverter.end` calls `assembleVariantStruct(...)`
  https://github.com/apache/spark/blob/c8695495569e4058a758b111ebc942fc4906494c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala#L1036-L1043
- `assembleVariantStruct` calls `extractField(...)`
  https://github.com/apache/spark/blob/c8695495569e4058a758b111ebc942fc4906494c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/SparkShreddingUtils.scala#L827-L844
- In `extractField`, a null object-field slot triggers malformed variant
  https://github.com/apache/spark/blob/c8695495569e4058a758b111ebc942fc4906494c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/SparkShreddingUtils.scala#L795-L799
- That throws `QueryExecutionErrors.malformedVariant()`
  https://github.com/apache/spark/blob/c8695495569e4058a758b111ebc942fc4906494c/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala#L2995-L2998

Rebuild route (directly via common/variant/.../ShreddingUtils.java):

- `ParquetRowConverter.end` calls `assembleVariant(...)`
  https://github.com/apache/spark/blob/c8695495569e4058a758b111ebc942fc4906494c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala#L1039-L1040
- `assembleVariant` calls `ShreddingUtils.rebuild(...)`
  https://github.com/apache/spark/blob/c8695495569e4058a758b111ebc942fc4906494c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/SparkShreddingUtils.scala#L821-L823
- `ShreddingUtils.rebuild` throws `malformedVariant()` on invalid shredded layout
  https://github.com/apache/spark/blob/c8695495569e4058a758b111ebc942fc4906494c/common/variant/src/main/java/org/apache/spark/types/variant/ShreddingUtils.java#L49-L63
  https://github.com/apache/spark/blob/c8695495569e4058a758b111ebc942fc4906494c/common/variant/src/main/java/org/apache/spark/types/variant/ShreddingUtils.java#L131-L135
  https://github.com/apache/spark/blob/c8695495569e4058a758b111ebc942fc4906494c/common/variant/src/main/java/org/apache/spark/types/variant/ShreddingUtils.java#L167-L170
- `malformedVariant()` definition
  https://github.com/apache/spark/blob/c8695495569e4058a758b111ebc942fc4906494c/common/variant/src/main/java/org/apache/spark/types/variant/VariantUtil.java#L177-L180
```rust
let mut new_field = field
    .as_ref()
    .clone()
    .with_data_type(child.data_type().clone());
```
Rescuing #9598 (comment) from github oblivion:
Why needed? Doesn't it already have that type?
For example, a VariantArray's internal StructArray has concrete fields like `metadata` and `value`, not empty fields. So we need `with_data_type` to align the field's declared type with the actual child array's type, otherwise `StructArray::try_new` will reject the mismatch.
I think there are two different scenarios here:
- Caller requested a strongly typed struct. Then the input schema and child data type already match. Even for missing columns, the shredding code returns a properly typed array full of nulls, rather than a `NullArray`. No variant extension type attached.
- Caller requested a (possibly shredded) variant. Then the input schema and child data type may not match and we have to re-apply it and re-attach the variant extension type.
Right now, the code treats everything as case 2/, which is extra work but otherwise (mostly?) harmless.
Re "mostly harmless": How do we handle recursive variant shredding? For example, if I construct a shredding schema corresponding to `x.y.z::INT`, the resulting "happy path" is `x.typed_value.y.typed_value.z.typed_value: INT`. If I then pass that schema to variant_get, I have specifically requested that specific shredding schema, even if that means shredding or unshred-then-reshredding the underlying data whose own shredding schema doesn't match. But only the top-level column `x` has the variant extension type. It is incorrect for `x.typed_value.y` to be annotated as variant. Perhaps we still pass that annotation in order to guide the recursion, but the schema that comes back to the user afterward must not include it.
I'm not sure if we're handling these cases correctly right now?
Good question. VariantType does NOT leak to intermediate fields. Here's why:
The `is_variant_field` check is driven by the user's requested schema, not by the data's shredding structure. When a user marks field "metadata" with VariantType, we pass `as_type = None` to `shredded_get_path`, which returns a VariantArray directly without recursing into the struct branch. So the recursion stops at that level: inner fields of the VariantArray's internal structure never go through the VariantType annotation path.
For case 1 (strongly typed struct), I've also addressed your observation that `with_data_type` is redundant: strongly typed fields now use the original field directly, since the child data type already matches the requested type.
To make this clearer, I split the handling into two explicit branches:
```rust
if is_variant_field {
    // Update data type + attach VariantType extension
    // (or build all-null VariantArray when field is absent)
} else {
    // Strongly typed: use original field as-is
}
```
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…riant field handling

- Restore InvalidArgumentError for malformed shredded data (non-struct field in shredded variant)
- Separate variant vs strongly-typed field handling for clarity
- Skip redundant with_data_type for strongly-typed fields
- Remove test belonging to upstream PR apache#9599

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed 52376bc to 9c6ee55.
sdf-jkl
left a comment
Added a few suggestions, but overall LGTM.
Add a unit test in parquet-variant-compute/src/variant_get.rs that verifies VariantType field extraction distinguishes between an explicit JSON null (which should yield a present Variant::Null) and a missing field (which should be SQL NULL).

Co-Authored-By: Konstantin Tarasov <33369833+sdf-jkl@users.noreply.github.com>
```rust
if target.typed_value_field().is_none() {
    let has_variant_fields = fields
        .iter()
        .any(|f| f.try_extension_type::<VariantType>().is_ok());
```
scovich
left a comment
I'm getting all twisted up trying to grok specific requests for shredded or binary variant (vs. "generic" variant that is equivalent to as_field=None). Left a bunch of thoughts but I don't have any good answers at the moment.
```diff
  // prepared to handle?
  ArrowError::InvalidArgumentError(format!(
-     "Expected Struct array while following path, got {}",
+     "Expected shredded variant struct for field '{}', got {}",
```
Is this perhaps clearer?
```diff
- "Expected shredded variant struct for field '{}', got {}",
+ "Physical layout of shredded variant field '{}' should be a struct, got {}",
```
I'm struggling with the right wording to convey that this particular struct is part of shredding's internal layout, unrelated to the logical data type the user supplied. Similar to how maps are physically represented as structs.
(the code comment is very nice and clear btw -- I'm only worried about this error message)
```rust
if target.typed_value_field().is_none() {
    let has_variant_fields = fields
        .iter()
        .any(|f| f.try_extension_type::<VariantType>().is_ok());
```
I'm fine either way. Whichever moves first, the other will follow.
```rust
if is_variant_field {
    // Variant field: the child's actual data type (VariantArray's internal
    // StructArray) differs from the user's declared type (empty Struct),
```
Wait... is it ever (supposed to be) an empty struct?
The extension type blindly accepts any struct, so I guess that does work. And I don't see any code that would help a user create a correct physical variant struct (binary variant, shredded, etc).
Which opens up an interesting design question, because a request for variant_get to return variant could be any of three "flavors":

1. Generic
   - "give me back variant, in whatever form it takes"
   - If the underlying path is already variant, return it unchanged
   - If strongly typed, there's a very fast path to shred it -- especially if the column contains no nulls
   - Probably the most common, and also what this PR does (I think)
2. Unshredded
   - "give me back binary variant no matter what form the data takes"
   - If the underlying path is variant, unshred it as needed
   - If strongly typed, convert to binary variant
   - Useful for a caller who doesn't want to deal with (or cannot handle) the complexity and physical schema heterogeneity of shredding
3. Shredded
   - "give me back shredded variant following the specific shredding schema I provided"
   - If the underlying path is variant, shred or reshred as needed
   - If the underlying path is strongly typed, shred it using the provided shredding schema
   - Useful for a caller who expects a specific format to the data but needs to handle outliers

I guess the best way to handle this is for users to send an empty struct for 1/, with a non-empty struct triggering 2/ and 3/ as appropriate? With the usual recursive pathing to handle each leaf field as efficiently as possible? But we don't currently have a way to pass a shredding schema, so everyone is forced to 1/.
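One way to make the three flavors explicit is a request enum. This is a hypothetical sketch, not a proposed API; `ShreddingSchema` is a placeholder since there is currently no way to pass a shredding schema at all:

```rust
/// Placeholder for a real shredding-schema type (assumed, not in the crate).
#[derive(Debug)]
struct ShreddingSchema(String);

/// Hypothetical request type for the three "flavors" described above.
#[derive(Debug)]
enum VariantRequest {
    /// Empty struct: return variant in whatever form it already takes.
    Generic,
    /// Binary variant only: unshred as needed.
    Unshredded,
    /// Shred or reshred to the caller-provided shredding schema.
    Shredded(ShreddingSchema),
}

impl VariantRequest {
    /// Mirrors the disambiguation proposed above: an empty struct selects
    /// generic handling, a non-empty struct selects unshredded vs shredded.
    fn describe(&self) -> &'static str {
        match self {
            VariantRequest::Generic => "return as-is",
            VariantRequest::Unshredded => "unshred to binary variant",
            VariantRequest::Shredded(_) => "shred to the requested schema",
        }
    }
}
```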
This relates to #9598 (comment), where I was worried that variant extension type would "leak" -- because I had been assuming we covered all three cases above, and that variant extension type was used recursively as the marker to distinguish between strongly typed vs. shredding schemas (which becomes a problem once we recurse into a shredded schema).
In effect, we have several different "flavors" of calls to shredded_path_get:

1. No specific type requested (return whatever we find)
2. Strong type requested (parse out the variant values as needed)
3. Specific variant type requested (provided field is a struct with variant extension type metadata)
4. Recursing into the shredded field of a specific requested variant type

The problem is, a strongly typed struct (2/) is indistinguishable from a shredded variant field (4/) -- both fields are structs lacking variant extension type metadata.
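The collision between 2/ and 4/ can be seen by looking at the only thing that could tell the fields apart: extension-type metadata. A sketch (the `"ARROW:extension:name"` key follows Arrow's extension-type convention; the `"arrow.variant"` name is an assumption here, not verified against the crate):

```rust
use std::collections::HashMap;

/// Sketch of why cases 2/ and 4/ collide: a plain struct field carries no
/// variant extension metadata, so a strongly typed struct and a shredded
/// variant field both fail this check and look identical.
/// "arrow.variant" is an assumed extension name, not the crate's verified one.
fn is_variant_annotated(field_metadata: &HashMap<String, String>) -> bool {
    field_metadata
        .get("ARROW:extension:name")
        .map(|name| name == "arrow.variant")
        .unwrap_or(false)
}
```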
```rust
for _ in 0..child.len() {
    builder.append_null();
}
```
There's been a recent bunch of PRs adding (or expanding) support for bulk-appending NULL values to array builders, e.g. `PrimitiveArrayBuilder::append_nulls`, `StructBuilder::append_nulls`, etc. Should we take a follow-on item to add that support to the variant array builder as well? (Some builders have `append_values` methods as well.)
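A bulk helper for the variant builder could take roughly this shape. The builder type below is a stand-in used only to illustrate the pattern; the real variant array builder has a different internal representation:

```rust
/// Stand-in builder used only to illustrate the bulk-append shape;
/// not the real variant array builder.
struct NullTrackingBuilder {
    validity: Vec<bool>,
}

impl NullTrackingBuilder {
    fn new() -> Self {
        Self { validity: Vec::new() }
    }

    fn append_null(&mut self) {
        self.validity.push(false);
    }

    /// Bulk variant: reserve capacity once instead of growing push-by-push,
    /// replacing the `for _ in 0..child.len()` loop quoted above.
    fn append_nulls(&mut self, n: usize) {
        self.validity.reserve(n);
        for _ in 0..n {
            self.append_null();
        }
    }

    fn len(&self) -> usize {
        self.validity.len()
    }
}
```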
```rust
let new_field = field
    .as_ref()
    .clone()
    .with_data_type(child.data_type().clone())
    .with_extension_type(VariantType);
updated_fields.push(new_field);
```
nit: We're cloning and then discarding the field's existing data type.
Suggested change:

```rust
let new_field = Field::new(
    field.name().clone(),
    child.data_type().clone(),
    field.is_nullable(),
);
updated_fields.push(Arc::new(new_field.with_extension_type(VariantType)));
```
```rust
// StructArray) differs from the user's declared type (empty Struct),
// so we must update the field's data type and re-attach VariantType.
//
// When the field is entirely absent, shredded_get_path returns a
```
```diff
- // When the field is entirely absent, shredded_get_path returns a
+ // Additionally, when the field is entirely absent, shredded_get_path returns a
```
(it took me a minute to grok that these are two mostly-separate issues)
Which issue does this PR close?
Rationale for this change
variant_get cannot extract a struct where some fields remain as VariantArrays. Callers must call variant_get per leaf and assemble results manually.
What changes are included in this PR?
Support extracting struct fields as VariantArray via VariantType extension metadata in variant_get, and fix panics on absent fields and silent data loss on non-struct typed values.
Are these changes tested?
Yes: `test_struct_extraction_with_variant_fields` and `test_struct_extraction_missing_variant_field_no_panic`.
Are there any user-facing changes?
No breaking changes. Fields marked with VariantType extension metadata are extracted as VariantArrays in struct extraction.