parquet: use flatbuffers to store metadata (WIP)#9042

Draft
rok wants to merge 7 commits into apache:main from rok:flatbuffers_parquet_metadata

Member

@rok rok commented Dec 24, 2025

Which issue does this PR close?

Rationale for this change

See

What changes are included in this PR?

Rough draft. WIP. Do not merge. This adds an fbs file plus thrift -> flatbuffers and flatbuffers -> thrift converters.

FBS File is from here (from parquet-format)

Are these changes tested?

Not yet. WIP.

Are there any user-facing changes?

This change should provide an option to store the flatbuffers metadata field to field, but otherwise be opaque to the user.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Dec 24, 2025
Contributor

alamb commented Dec 29, 2025

FYI @etseidl

Thank you @rok

@rok rok force-pushed the flatbuffers_parquet_metadata branch from 090b5ba to e54de63 Compare January 20, 2026 14:48
@rok rok force-pushed the flatbuffers_parquet_metadata branch from d70534b to 757f8c7 Compare February 11, 2026 01:45
@rok rok force-pushed the flatbuffers_parquet_metadata branch from 42b135d to 4575c3a Compare March 11, 2026 02:31
Contributor

@etseidl etseidl left a comment


Thanks @rok. I think having a real world implementation will help decide whether thrift serialization is the real bottleneck in Parquet processing.

    &(&flatbuf_bytes, schema_descr.clone()),
    |b, (bytes, schema)| {
        b.iter(|| {
            black_box(flatbuf_to_parquet_metadata(bytes, schema.clone()).unwrap());
Contributor


It seems odd to me that you pass the schema descriptor into the decoder, when the schema is contained in the encoded bytes. Shouldn't flatbuf_to_parquet_metadata be able to operate without it? Otherwise I'd say pass the same descriptor as an option to the thrift decoder as well to compare apples to apples. When doing so, I see times like:

wide: Thrift size = 984448 bytes, FlatBuf size = 1076320 bytes, ratio = 0.91x
metadata_serialization/thrift_decode/wide
                        time:   [3.6127 ms 3.6221 ms 3.6319 ms]
                        thrpt:  [275.34  elem/s 276.08  elem/s 276.80  elem/s]
                 change:
                        time:   [−1.4257% −0.8600% −0.3612%] (p = 0.00 < 0.05)
                        thrpt:  [+0.3625% +0.8675% +1.4464%]
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
metadata_serialization/flatbuf_decode/wide
                        time:   [10.182 ms 10.206 ms 10.233 ms]
                        thrpt:  [97.726  elem/s 97.979  elem/s 98.209  elem/s]
                 change:
                        time:   [−1.7999% −1.3809% −0.9771%] (p = 0.00 < 0.05)
                        thrpt:  [+0.9867% +1.4002% +1.8329%]
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe
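
To put the numbers above side by side, here is a small sketch that only redoes the arithmetic on the reported figures (encoded sizes and median decode times). The helper names `size_ratio` and `slowdown` are made up for illustration and are not part of the PR.

```rust
/// Ratio of Thrift-encoded size to FlatBuffers-encoded size.
/// A value below 1.0 means the Thrift encoding is the smaller one.
fn size_ratio(thrift_bytes: u64, flatbuf_bytes: u64) -> f64 {
    thrift_bytes as f64 / flatbuf_bytes as f64
}

/// How many times longer the candidate takes compared with the baseline.
fn slowdown(baseline_ms: f64, candidate_ms: f64) -> f64 {
    candidate_ms / baseline_ms
}

fn main() {
    // Sizes reported for the "wide" case.
    let ratio = size_ratio(984_448, 1_076_320);
    // Median decode times reported for thrift_decode/wide and flatbuf_decode/wide.
    let slow = slowdown(3.6221, 10.206);
    println!("size ratio = {ratio:.2}x, flatbuf decode takes {slow:.2}x as long");
}
```

On these figures the FlatBuffers decode takes roughly 2.8 times as long as the Thrift decode when both are given the schema descriptor, and the FlatBuffers encoding is about 9% larger.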

        String::from_utf8_lossy(&max[..prefix_len]).to_string()
    } else {
        String::new()
    };
Contributor


Why do we assume byte arrays are UTF-8 strings? FIXED_LEN_BYTE_ARRAY will almost certainly not be. I think prefix should be &[u8].
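
A minimal sketch of the concern, using only the standard library: when the statistic is not valid UTF-8, `String::from_utf8_lossy` substitutes U+FFFD for each invalid sequence, so the stored prefix no longer matches the original bytes. The helpers `prefix_bytes` and `prefix_lossy` are hypothetical names for illustration, not functions from the PR.

```rust
/// Hypothetical helper: keep the prefix as raw bytes, with no interpretation.
fn prefix_bytes(max: &[u8], prefix_len: usize) -> &[u8] {
    &max[..prefix_len.min(max.len())]
}

/// Hypothetical helper mirroring the excerpt above: lossy UTF-8 conversion.
fn prefix_lossy(max: &[u8], prefix_len: usize) -> String {
    String::from_utf8_lossy(&max[..prefix_len.min(max.len())]).to_string()
}

fn main() {
    // A FIXED_LEN_BYTE_ARRAY-style value that is not valid UTF-8.
    let max: [u8; 4] = [0xff, 0x00, 0x80, 0x41];

    let raw = prefix_bytes(&max, 3);
    let lossy = prefix_lossy(&max, 3);

    // The raw prefix is exactly the first three bytes.
    assert_eq!(raw, &[0xffu8, 0x00, 0x80][..]);
    // The lossy prefix has replaced 0xff and 0x80 with U+FFFD (3 bytes each).
    assert_ne!(lossy.as_bytes(), raw);

    println!("raw = {:02x?}, lossy = {:02x?}", raw, lossy.as_bytes());
}
```

Keeping the prefix as `&[u8]` avoids this, and callers that know the column is a UTF-8 string can still convert it themselves.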

Development

Successfully merging this pull request may close these issues.

Parquet metadata as flatbuffers

3 participants