Skip to content

Support extracting struct fields as Variant using ExtensionType#9598

Open
codephage2020 wants to merge 5 commits intoapache:mainfrom
codephage2020:issue-9519
Open

Support extracting struct fields as Variant using ExtensionType#9598
codephage2020 wants to merge 5 commits intoapache:mainfrom
codephage2020:issue-9519

Conversation

@codephage2020
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

variant_get cannot extract a struct where some fields remain as VariantArrays. Callers must call variant_get per leaf and assemble results manually.

What changes are included in this PR?

Support extracting struct fields as VariantArray via VariantType extension metadata in variant_get, and fix panics on absent fields and silent data loss on non-struct typed values.

Are these changes tested?

Yes — test_struct_extraction_with_variant_fields and test_struct_extraction_missing_variant_field_no_panic.

Are there any user-facing changes?

No breaking changes. Fields marked with VariantType extension metadata are extracted as VariantArrays in struct extraction.

@github-actions github-actions bot added the parquet-variant parquet-variant* crates label Mar 21, 2026
@codephage2020
Copy link
Copy Markdown
Contributor Author

@scovich @sdf-jkl @klion26 Feel free to give me some advice when you have time. Thanks.

Copy link
Copy Markdown
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this. A bunch of questions.

@codephage2020
Copy link
Copy Markdown
Contributor Author

Thank you for the review. Looks like #9612 needs to be merged first.

@codephage2020 codephage2020 marked this pull request as draft March 25, 2026 06:45
@sdf-jkl
Copy link
Copy Markdown
Contributor

sdf-jkl commented Mar 30, 2026

@codephage2020 #9612 was merged

@codephage2020
Copy link
Copy Markdown
Contributor Author

@codephage2020 #9612 was merged

Thanks! I'll pick this up again later tonight.

@codephage2020 codephage2020 marked this pull request as ready for review March 31, 2026 11:30
@github-actions github-actions bot added parquet Changes to the parquet crate arrow Changes to the arrow crate arrow-avro arrow-avro crate labels Mar 31, 2026
@codephage2020 codephage2020 marked this pull request as draft March 31, 2026 11:38
@github-actions github-actions bot removed parquet Changes to the parquet crate arrow Changes to the arrow crate arrow-avro arrow-avro crate labels Mar 31, 2026
@codephage2020 codephage2020 marked this pull request as ready for review March 31, 2026 11:42
Copy link
Copy Markdown
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is turning out to be more subtle than I initially realized. A couple big points to consider in the comments.

Comment on lines +95 to +96
// followed further. Fall back to the value column if present,
// otherwise the path is missing.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment. What value column fallback does it refer to? The only caller of this follow_shredded_path_element function simply creates an all-null array if we return missing_path_step().

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: I think I finally figured it out... this code is attempting to traverse through a shredded variant object. Say we're following the path x.y.z and the current path step is y:

  1. (L72) The typed_value we start with should ideally be a struct. If it's any other type, then x was not shredded as struct. Two sub-cases:
    1. x.value is present. Some entries may be Variant::Object. In fact, it's possible that every entry is Variant::Object (or Variant::Null), e.g. if the data match the query schema but the wrong shredding schema was applied when writing it out. So it's actually incorrect to blow up here, regardless of JSONpath semantics, because we didn't prove there's a problem yet (have to row-shred x.value and see what it contains).
    2. x.value column is just plain missing. This means the column shredded perfectly as some other type, and we know there are no nulls and no structs. This is currently treated as an error case, but we should really just return missing_path_step() instead (variant_get should follow JSONpath semantics #9606).
  2. (L86) x.typed_value was a struct as expected, but x.typed_value.y doesn't exist => missing_path_step()
  3. (L91) The x.typed_value.y exists, but it's not a struct. This is a malformed data scenario because we're in a shredded variant object and the spec mandates that every field must be a struct (containing value and/or typed_value fields). So the original code was correct to error out upon discovering the problem, regardless of JSONpath semantics (it just failed to explain why it was correct).
  4. (L101) x.typed_value.y is a struct, so we attempt to return a shredding state for its value and/or typed_value fields.

Copy link
Copy Markdown
Contributor Author

@codephage2020 codephage2020 Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great analysis, thank you. In a shredded variant object, every field must be a struct containing value and/or typed_value sub-fields per the spec.
A non-struct field here means malformed data, and the original error was correct.

I've reverted to returning an InvalidArgumentError:

let struct_array = field.as_struct_opt().ok_or_else(|| {
    ArrowError::InvalidArgumentError(format!(
        "Expected shredded variant struct for field '{}', got {}",
        name,
        field.data_type(),
    ))
})?;

The error message now names the field for easier debugging.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @codephage2020. This is malformed data rather than a JSONPath-handling case.

Spark appears to treat this as malformedVariant as well

What Codex spewed out for me:

Path-extraction route (closest to variant_get path-step behavior):

  1. ParquetRowConverter.end calls assembleVariantStruct(...)
    https://github.com/apache/spark/blob/c8695495569e4058a758b111ebc942fc4906494c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala#L1036-L1043
  2. assembleVariantStruct calls extractField(...)
    https://github.com/apache/spark/blob/c8695495569e4058a758b111ebc942fc4906494c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/SparkShreddingUtils.scala#L827-L844
  3. In extractField, a null object-field slot triggers malformed variant
    https://github.com/apache/spark/blob/c8695495569e4058a758b111ebc942fc4906494c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/SparkShreddingUtils.scala#L795-L799
  4. That throws QueryExecutionErrors.malformedVariant()
    https://github.com/apache/spark/blob/c8695495569e4058a758b111ebc942fc4906494c/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala#L2995-L2998

Rebuild route (directly via common/variant/.../ShreddingUtils.java):

  1. ParquetRowConverter.end calls assembleVariant(...)
    https://github.com/apache/spark/blob/c8695495569e4058a758b111ebc942fc4906494c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala#L1039-L1040
  2. assembleVariant calls ShreddingUtils.rebuild(...)
    https://github.com/apache/spark/blob/c8695495569e4058a758b111ebc942fc4906494c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/SparkShreddingUtils.scala#L821-L823
  3. ShreddingUtils.rebuild throws malformedVariant() on invalid shredded layout
    https://github.com/apache/spark/blob/c8695495569e4058a758b111ebc942fc4906494c/common/variant/src/main/java/org/apache/spark/types/variant/ShreddingUtils.java#L49-L63
    https://github.com/apache/spark/blob/c8695495569e4058a758b111ebc942fc4906494c/common/variant/src/main/java/org/apache/spark/types/variant/ShreddingUtils.java#L131-L135
    https://github.com/apache/spark/blob/c8695495569e4058a758b111ebc942fc4906494c/common/variant/src/main/java/org/apache/spark/types/variant/ShreddingUtils.java#L167-L170
  4. malformedVariant() definition
    https://github.com/apache/spark/blob/c8695495569e4058a758b111ebc942fc4906494c/common/variant/src/main/java/org/apache/spark/types/variant/VariantUtil.java#L177-L180

let mut new_field = field
.as_ref()
.clone()
.with_data_type(child.data_type().clone());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rescuing #9598 (comment) from github oblivion:

Why needed? Doesn't it already have that type?

For example, a VariantArray's internal StructArray has concrete fields like metadata and value, not empty fields. So we need with_data_type to align the field's declared type with the actual child array's type, otherwise StructArray::try_new will reject the mismatch.

I think there are two different scenarios here:

  1. Caller requested a strongly typed struct. Then the input schema and child data type already match. Even for missing columns, the shredding code returns a properly typed array full of nulls, rather than a NullArray. No variant extension type attached.
  2. Caller requested a (possibly shredded) variant. Then the input schema and child data type may not match and we have to re-apply it and re-attach the variant extension type.

Right now, the code treats everything as case 2/, which is extra work but otherwise (mostly?) harmless.

Re "mostly harmless": How do we handle recursive variant shredding? For example, if I construct a shredding schema corresponding to x.y.z::INT, the resulting "happy path" is x.typed_value.y.typed_value.z.typed_value: INT. If I then pass that schema to variant_get, I have specifically requested that specific shredding schema, even if that means shredding or unshred-then-reshredding the underlying data whose own shredding schema doesn't match. But only the top-level column x has the variant extension type. It is incorrect for x.typed_value.y to be annotated as variant. Perhaps we still pass that annotation in order to guide the recursion, but the schema that comes back to the user afterward must not include it.

I'm not sure if we're handling these cases correctly right now?

Copy link
Copy Markdown
Contributor Author

@codephage2020 codephage2020 Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. VariantType does NOT leak to intermediate fields. Here's why:

The is_variant_field check is driven by the user's requested schema, not by the data's shredding structure. When a user marks field "metadata" with VariantType, we pass as_type = None to shredded_get_path, which returns a VariantArray directly without recursing into the struct branch. So the recursion stops at that level — inner fields of the VariantArray's internal structure never go through the VariantType annotation path.

For case 1 (strongly typed struct), I've also addressed your observation that with_data_type is redundant — strongly typed fields now use the original field directly, since the child data type already matches the requested type.

To make this clearer, I split the handling into two explicit branches:

if is_variant_field {
    // Update data type + attach VariantType extension
    // (or build all-null VariantArray when field is absent)
} else {
    // Strongly typed: use original field as-is
}

codephage2020 and others added 2 commits April 2, 2026 10:32
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…riant field handling

- Restore InvalidArgumentError for malformed shredded data (non-struct field in shredded variant)
- Separate variant vs strongly-typed field handling for clarity
- Skip redundant with_data_type for strongly-typed fields
- Remove test belonging to upstream PR apache#9599

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@sdf-jkl sdf-jkl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few suggestion, but overall LGTM.

codephage2020 and others added 2 commits April 9, 2026 14:13
Add a unit test in parquet-variant-compute/src/variant_get.rs that verifies VariantType field extraction distinguishes between an explicit JSON null (which should yield a present Variant::Null) and a missing field (which should be SQL NULL).

Co-Authored-By: Konstantin Tarasov <33369833+sdf-jkl@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@sdf-jkl sdf-jkl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

if target.typed_value_field().is_none() {
let has_variant_fields = fields
.iter()
.any(|f| f.try_extension_type::<VariantType>().is_ok());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A reminder to clean this once #9677 is closed.

Copy link
Copy Markdown
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm getting all twisted up trying to grok specific requests for shredded or binary variant (vs. "generic" variant that is equivalent to as_field=None). Left a bunch of thoughts but I don't have any good answers at the moment.

// prepared to handle?
ArrowError::InvalidArgumentError(format!(
"Expected Struct array while following path, got {}",
"Expected shredded variant struct for field '{}', got {}",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this perhaps clearer?

Suggested change
"Expected shredded variant struct for field '{}', got {}",
"Physical layout of shredded variant field '{}' should be a struct, got {}",

I'm struggling with the right wording to convey that this particular struct is part of shredding's internal layout, unrelated to the logical data type the user supplied. Similar to how maps are physically represented as structs.

(the code comment is very nice and clear btw -- I'm only worried about this error message)

if target.typed_value_field().is_none() {
let has_variant_fields = fields
.iter()
.any(|f| f.try_extension_type::<VariantType>().is_ok());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This races "silently" with #9677 -- no obvious (physical) merge conflict.
Should we wait for the other to merge and take advantage of the new API?
Or coordinate with @sdf-jkl to merge this PR first and update the other PR to optimize this new call site?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine either way. Whichever moves first, the other will follow.


if is_variant_field {
// Variant field: the child's actual data type (VariantArray's internal
// StructArray) differs from the user's declared type (empty Struct),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait... is it ever (supposed to be) an empty struct?

The extension type blindly accepts any struct, so I guess that does work. And I don't see any code that would help a user create a correct physical variant struct (binary variant, shredded, etc).

Which opens up an interesting design question, because a request for variant_get to return variant could be any of three "flavors":

  1. Generic
    • "give me back variant, in whatever form it takes"
    • If the underlying path is already variant, return it unchanged
    • If strongly typed, there's a very fast path to shred it -- especially if the column contains no nulls
    • Probably the most common, and also what this PR does (I think)
  2. Unshredded
    • "give me back binary variant no matter what form the data takes"
    • If the underlying path is variant, unshred it as needed
    • If strongly typed, convert to binary variant
    • Useful for a caller who doesn't want to deal with (or cannot handle) the complexity and physical schema heterogeneity of shredding
  3. Shredded
    • "give me back shredded variant following the specific shredding schema I provided"
    • If the underlying path is variant, shred or reshred as needed
    • If the underlying path is strongly typed, shred it using the provided shredding schema
    • Useful for a caller who expects a specific format to the data but needs to handle outliers

I guess the best way to handle this is for users to send an empty struct for 1/, with a non-empty struct triggering 2/ and 3/ as appropriate? With the usual recursive pathing to handle each leaf field as efficiently as possible? But we don't currently have a way to pass a shredding schema, so everyone is forced to 1/.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This relates to #9598 (comment), where I was worried that variant extension type would "leak" -- because I had been assuming we covered all three cases above, and that variant extension type was used recursively as the marker to distinguish between strongly typed vs. shredding schemas (which becomes a problem once we recurse into a shredded schema).

In effect, we have several different "flavors" of calls to shredded_path_get:

  1. No specific type requested (return whatever we find)
  2. Strong type requested (parse out the variant values as needed)
  3. Specific variant type requested (provided field is a struct with variant extension type metadata)
  4. Recursing into the shredded field of a specific requested variant type

The problem is, a strongly typed struct (2/) is indistinguishable from a shredded variant field (4/) -- both fields are structs lacking variant extension type metadata.

Comment on lines +262 to +264
for _ in 0..child.len() {
builder.append_null();
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's been a recent bunch of PR adding (or expanding) support for bulk-appending NULL values to array builders, e.g. PrimitiveArrayBuilder::append_nulls, StructBuilder::append_nulls, etc. Should we take a follow-on item to add that support to variant array builder as well? (Some builders have append_values methods as well).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed here #9684. Should be a quick one.

Comment on lines +269 to +274
let new_field = field
.as_ref()
.clone()
.with_data_type(child.data_type().clone())
.with_extension_type(VariantType);
updated_fields.push(new_field);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: We're cloning and then discarding the field's existing data type.

Suggested change
let new_field = field
.as_ref()
.clone()
.with_data_type(child.data_type().clone())
.with_extension_type(VariantType);
updated_fields.push(new_field);
let new_field = Field::new(
field.name().clone(),
child.data_type().clone(),
field.is_nullable(),
);
updated_fields.push(Arc::new(new_field.with_extension_type(VariantType)));

// StructArray) differs from the user's declared type (empty Struct),
// so we must update the field's data type and re-attach VariantType.
//
// When the field is entirely absent, shredded_get_path returns a
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// When the field is entirely absent, shredded_get_path returns a
// Additionally, when the field is entirely absent, shredded_get_path returns a

(it took me a minute to grok that these are two mostly-separate issues)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet-variant parquet-variant* crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Variant] Support extracting struct fields as Variant using ExtensionType

4 participants