Skip to content

fix: use writer types in Skipper for resolved named record types#9605

Open
ariel-miculas wants to merge 2 commits intoapache:mainfrom
ariel-miculas:fix-named-type-ref-fields
Open

fix: use writer types in Skipper for resolved named record types#9605
ariel-miculas wants to merge 2 commits intoapache:mainfrom
ariel-miculas:fix-named-type-ref-fields

Conversation

@ariel-miculas
Copy link
Copy Markdown

Which issue does this PR close?

Rationale for this change

When a writer-only field references a named Avro type that was previously resolved against a reader schema, parse_type returns the registered reader-resolved type from the shared resolver. This caused two problems:

  1. The Skipper built its struct sub-skippers from the reader's field list, which omits writer-only fields. Their bytes were never consumed, leaving the cursor at the wrong position for all subsequent records.

  2. Reader fields carry resolution-induced nullability (e.g. a writer plain long matched against a reader ["null", long] gains nullability = Some(NullFirst)). The Skipper read a union-tag byte that was never written, causing "Unexpected EOF" errors.

Fix: store the writer's data type in ResolvedField::ToReader alongside the reader index. The Skipper's Codec::Struct arm now iterates rec.writer_fields and uses the writer type from every entry - both ToReader(_, wdt) and Skip(wdt) - so it always follows the writer's wire format.

What changes are included in this PR?

Are these changes tested?

Yes, added unit tests.

Are there any user-facing changes?

No

@github-actions github-actions bot added arrow Changes to the arrow crate arrow-avro arrow-avro crate labels Mar 23, 2026
@alamb
Copy link
Copy Markdown
Contributor

alamb commented Mar 26, 2026

FYI @jecsand838

When a writer-only field references a named Avro type that was previously
resolved against a reader schema, `parse_type` returns the registered
reader-resolved type from the shared resolver. This caused two problems:

1. The Skipper built its struct sub-skippers from the reader's field list,
which omits writer-only fields. Their bytes were never consumed, leaving
the cursor at the wrong position for all subsequent records.

2. Reader fields carry resolution-induced nullability (e.g. a writer plain
`long` matched against a reader `["null", long]` gains `nullability =
Some(NullFirst)`). The Skipper read a union-tag byte that was never
written, causing "Unexpected EOF" errors.

Fix: store the writer's data type in `ResolvedField::ToReader` alongside
the reader index. The Skipper's `Codec::Struct` arm now iterates
`rec.writer_fields` and uses the writer type from every entry - both
`ToReader(_, wdt)` and `Skip(wdt)` - so it always follows the writer's
wire format.
@ariel-miculas ariel-miculas force-pushed the fix-named-type-ref-fields branch from a768913 to 1902d38 Compare March 31, 2026 10:53
@ariel-miculas
Copy link
Copy Markdown
Author

fixed formatting and rebased onto main;
@jecsand838 can you please take a look?

@jecsand838
Copy link
Copy Markdown
Contributor

jecsand838 commented Apr 1, 2026

@ariel-miculas

I think the correctness fix here is right.

One improvement I'd recommend is pushing the writer-wire planning fully into codec.rs, instead of carrying full AvroDataTypes down into record.rs and having Skipper::from_avro reconstruct the skip tree there. However this can always be done in a follow-up as well.

@ariel-miculas
Copy link
Copy Markdown
Author

However this can always be done in a follow-up as well.

I'd prefer it this way, since refactoring would take a bit more consideration.

Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR @ariel-miculas 🙏

I defer to @jecsand838 's opinion on code structure

Before approving this PR I would would request that we:

  1. Try and make the tests easier to understand (see comemnts below)
  2. File a ticket that explains what the end user of this crate would see before this code fix (I can't quite tell from the PR description which focuses on what the code does, not the end user visible behavior)

It would also be nice to file a ticket to track @jecsand838 's suggestion for a different structure

/// writer's wire format for that type. Here the reader wraps the Timestamp's scalar
/// fields in nullable unions, but the writer wrote them as plain values; the Skipper
/// must not add a union-tag read for each field.
#[test]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these tests seem pretty repetitive and thus it is hard for me to understand what they are supposed to be testing and validate what is different between tehm

Can you please refactor the common functionality into helper functions so the tests themselves are easier to understand and validate ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate arrow-avro arrow-avro crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants