fix: use writer types in Skipper for resolved named record types#9605
fix: use writer types in Skipper for resolved named record types#9605ariel-miculas wants to merge 2 commits intoapache:mainfrom
Conversation
|
FYI @jecsand838 |
When a writer-only field references a named Avro type that was previously resolved against a reader schema, `parse_type` returns the registered reader-resolved type from the shared resolver. This caused two problems: 1. The Skipper built its struct sub-skippers from the reader's field list, which omits writer-only fields. Their bytes were never consumed, leaving the cursor at the wrong position for all subsequent records. 2. Reader fields carry resolution-induced nullability (e.g. a writer plain `long` matched against a reader `["null", long]` gains `nullability = Some(NullFirst)`). The Skipper read a union-tag byte that was never written, causing "Unexpected EOF" errors. Fix: store the writer's data type in `ResolvedField::ToReader` alongside the reader index. The Skipper's `Codec::Struct` arm now iterates `rec.writer_fields` and uses the writer type from every entry - both `ToReader(_, wdt)` and `Skip(wdt)` - so it always follows the writer's wire format.
a768913 to
1902d38
Compare
|
fixed formatting and rebased onto main; |
|
I think the correctness fix here is right. One improvement I'd recommend is pushing the writer-wire planning fully into |
I'd prefer it this way, since refactoring would take a bit more consideration. |
alamb
left a comment
There was a problem hiding this comment.
Thank you for this PR @ariel-miculas 🙏
I defer to @jecsand838 's opinion on code structure
Before approving this PR I would would request that we:
- Try and make the tests easier to understand (see comemnts below)
- File a ticket that explains what the end user of this crate would see before this code fix (I can't quite tell from the PR description which focuses on what the code does, not the end user visible behavior)
It would also be nice to file a ticket to track @jecsand838 's suggestion for a different structure
| /// writer's wire format for that type. Here the reader wraps the Timestamp's scalar | ||
| /// fields in nullable unions, but the writer wrote them as plain values; the Skipper | ||
| /// must not add a union-tag read for each field. | ||
| #[test] |
There was a problem hiding this comment.
these tests seem pretty repetitive and thus it is hard for me to understand what they are supposed to be testing and validate what is different between tehm
Can you please refactor the common functionality into helper functions so the tests themselves are easier to understand and validate ?
Which issue does this PR close?
Rationale for this change
When a writer-only field references a named Avro type that was previously resolved against a reader schema,
parse_typereturns the registered reader-resolved type from the shared resolver. This caused two problems:The Skipper built its struct sub-skippers from the reader's field list, which omits writer-only fields. Their bytes were never consumed, leaving the cursor at the wrong position for all subsequent records.
Reader fields carry resolution-induced nullability (e.g. a writer plain
longmatched against a reader["null", long]gainsnullability = Some(NullFirst)). The Skipper read a union-tag byte that was never written, causing "Unexpected EOF" errors.Fix: store the writer's data type in
ResolvedField::ToReaderalongside the reader index. The Skipper'sCodec::Structarm now iteratesrec.writer_fieldsand uses the writer type from every entry - bothToReader(_, wdt)andSkip(wdt)- so it always follows the writer's wire format.What changes are included in this PR?
Are these changes tested?
Yes, added unit tests.
Are there any user-facing changes?
No