parquet: use flatbuffers to store metadata (WIP)#9042
Draft
rok wants to merge 7 commits into apache:main
etseidl reviewed on Mar 30, 2026
    &(&flatbuf_bytes, schema_descr.clone()),
    |b, (bytes, schema)| {
        b.iter(|| {
            black_box(flatbuf_to_parquet_metadata(bytes, schema.clone()).unwrap());
It seems odd to me that you pass the schema descriptor into the decoder, when the schema is contained in the encoded bytes. Shouldn't flatbuf_to_parquet_metadata be able to operate without it? Otherwise I'd say pass the same descriptor as an option to the thrift decoder as well to compare apples to apples. When doing so, I see times like:
wide: Thrift size = 984448 bytes, FlatBuf size = 1076320 bytes, ratio = 0.91x
metadata_serialization/thrift_decode/wide
time: [3.6127 ms 3.6221 ms 3.6319 ms]
thrpt: [275.34 elem/s 276.08 elem/s 276.80 elem/s]
change:
time: [−1.4257% −0.8600% −0.3612%] (p = 0.00 < 0.05)
thrpt: [+0.3625% +0.8675% +1.4464%]
Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
metadata_serialization/flatbuf_decode/wide
time: [10.182 ms 10.206 ms 10.233 ms]
thrpt: [97.726 elem/s 97.979 elem/s 98.209 elem/s]
change:
time: [−1.7999% −1.3809% −0.9771%] (p = 0.00 < 0.05)
thrpt: [+0.9867% +1.4002% +1.8329%]
Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
5 (5.00%) high mild
2 (2.00%) high severe
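For scale, here is a quick sketch that redoes the arithmetic on the numbers quoted above. The helper names `size_ratio` and `slowdown` are made up for this comment and are not part of the PR or the benchmark. Using the median times, the FlatBuffers decode comes out at roughly 2.8x the Thrift decode time in this run.

```rust
/// Size ratio as printed by the benchmark: Thrift bytes over FlatBuffers bytes.
/// (Hypothetical helper, only for checking the quoted figures.)
fn size_ratio(thrift_bytes: u64, flatbuf_bytes: u64) -> f64 {
    thrift_bytes as f64 / flatbuf_bytes as f64
}

/// How many times slower one decode is than the other, given two median times
/// in the same unit. (Hypothetical helper.)
fn slowdown(slow_ms: f64, fast_ms: f64) -> f64 {
    slow_ms / fast_ms
}
```

Calling `size_ratio(984_448, 1_076_320)` reproduces the printed 0.91x, and `slowdown(10.206, 3.6221)` gives the decode-time gap between the two median estimates.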
        String::from_utf8_lossy(&max[..prefix_len]).to_string()
    } else {
        String::new()
    };
Why do we assume byte arrays are UTF-8 strings? FIXED_LEN_BYTE_ARRAY will almost certainly not be. I think prefix should be &[u8].
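To illustrate the concern, here is a minimal sketch assuming the prefix is the shared leading bytes of the min and max statistics. The functions `common_prefix` and `lossy_roundtrips` are hypothetical names for this comment, not code from the PR. The point is that `String::from_utf8_lossy` replaces invalid sequences with U+FFFD, so non-UTF-8 bytes cannot be recovered from the resulting `String`, while a `&[u8]` slice keeps them intact.

```rust
/// Longest common prefix of two byte strings, borrowed from `min`.
/// (Hypothetical helper; the PR computes `prefix_len` its own way.)
fn common_prefix<'a>(min: &'a [u8], max: &[u8]) -> &'a [u8] {
    let len = min
        .iter()
        .zip(max.iter())
        .take_while(|(a, b)| a == b)
        .count();
    &min[..len]
}

/// Returns true only if converting `bytes` through `from_utf8_lossy`
/// yields the same bytes back, i.e. the input was valid UTF-8.
fn lossy_roundtrips(bytes: &[u8]) -> bool {
    String::from_utf8_lossy(bytes).as_bytes() == bytes
}
```

For example, with `min = [0xff, 0xfe, 0x01, 0x02]` and `max = [0xff, 0xfe, 0x01, 0x03]`, `common_prefix` returns the three shared bytes unchanged, whereas `lossy_roundtrips` on that prefix is false because `0xff` and `0xfe` are each replaced by U+FFFD.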
Which issue does this PR close?
Rationale for this change
See
What changes are included in this PR?
Rough draft. WIP. Do not merge. This adds an fbs file, plus Thrift -> FlatBuffers and FlatBuffers -> Thrift converters.
The FBS file is from here (from parquet-format).
Are these changes tested?
Not yet. WIP.
Are there any user-facing changes?
This change should provide an option to store the flatbuffers metadata field to field, but otherwise be opaque to the user.