GH-531: Add parquet flatbuf schema #544

Open · alkis wants to merge 5 commits into apache:master from alkis:patch-2

Conversation


@alkis (Contributor) commented Dec 12, 2025

Rationale for this change

Improve wide table support.

What changes are included in this PR?

Add parquet flatbuf schema.

Do these changes have PoC implementations?

apache/arrow#48431

codec: CompressionCodec;
num_values: long = null; // only present if not equal to rg.num_rows
total_uncompressed_size: long;
total_compressed_size: long;
Contributor

It would be nice to keep the total unencoded size here, which I think is generally useful. But I suppose it can be added after?
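For illustration, keeping it could look like this in the schema. The `total_unencoded_size` field name is hypothetical, mirroring `SizeStatistics.unencoded_byte_array_data_bytes` from the Thrift spec; the other fields are from the diff above:

```
codec: CompressionCodec;
num_values: long = null;            // only present if not equal to rg.num_rows
total_uncompressed_size: long;
total_compressed_size: long;
// Hypothetical: raw unencoded bytes for BYTE_ARRAY columns, kept optional
// so writers that do not track it pay nothing.
total_unencoded_size: long = null;
```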

dictionary_page_offset: long = null;
statistics: Statistics;
is_fully_dict_encoded: bool;
bloom_filter_offset: long = null;
Contributor

Should this be made a struct/value to make the bloom filter info more self-contained?
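A rough sketch of what a self-contained grouping could look like; the diff further down does show a `bloom_filter: BloomFilterInfo` field, but the field set here is illustrative, not taken from the PR:

```
// Hypothetical: group the bloom filter location into one table so future
// bloom-filter attributes can be added without touching ColumnMetadata.
table BloomFilterInfo {
  offset: long = null;   // file offset of the serialized bloom filter
  length: int = null;    // serialized size in bytes, if known
}
```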

row_groups: [RowGroup];
kv: [KV];
created_by: string;
// column_orders: [ColumnOrder]; // moved to SchemaElement
Contributor

remove this row for now?

@emkornfield (Contributor) left a comment

I think we also need to add an Apache header here, and CI to make sure this compiles?

@Jiayi-Wang-db

Hi @rok and @emkornfield, could you take another look at this PR?

min_lo4: uint;
min_lo8: ulong;
min_hi8: ulong;
min_len: byte = null;
Member

Suggested change
min_len: byte = null;
min_len: int = null;

The original suffix length could exceed the int8 range of the byte type.


Hi @rok, the previous comment is outdated. max_len and min_len store the truncated suffix length, which means the maximum value is 16. We use negative numbers to represent inexact values. I have updated the comment and the example, please take a look.
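Restating the agreed semantics as schema comments (my paraphrase of this thread, not wording from the PR):

```
min_lo4: uint;    // fixed-width pieces of the encoded min value
min_lo8: ulong;
min_hi8: ulong;
// Truncated suffix length, at most 16, so a byte suffices; a negative
// value marks the stored min as inexact (truncated).
min_len: byte = null;
```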

max_lo4: uint;
max_lo8: ulong;
max_hi8: ulong;
max_len: byte = null;
Member

As above:

Suggested change
max_len: byte = null;
max_len: int = null;


/** repetition of the field. The root of the schema does not have a repetition_type.
* All other nodes must have one */
repetition_type: FieldRepetitionType;
Member

To allow the root to not have a repetition type. In Thrift this field is optional:

3: optional FieldRepetitionType repetition_type;

Suggested change
repetition_type: FieldRepetitionType;
repetition_type: FieldRepetitionType = null;

@rok (Member) commented Mar 11, 2026

@Jiayi-Wang-db Feel free to resolve comments you feel were addressed to make this more readable?

@rok (Member) commented Mar 11, 2026

Is parquet3.bfbs required?

@Jiayi-Wang-db

Is parquet3.bfbs required?

Not at all. I pushed it accidentally. Removed.

@rok (Member) commented Mar 13, 2026

@emkornfield could you do a pass and perhaps close the no longer relevant comments?

@Jiayi-Wang-db

Hi @rok @emkornfield, it seems the only concern with this FlatBuffers spec is that we deprecated distinct_count. Could we agree to leave it as is for now? If there's a strong requirement in the future, we can always add it back, since FlatBuffers supports forward and backward compatibility for new fields.

@rok (Member) commented Mar 18, 2026

@Jiayi-Wang-db I think it'd be great if @emkornfield did another pass and I'll try to do one now. Sorry, it's taking a while.

Contributor

I forget if we closed on this but could we just call this parquet.fbs?

//
// Optimization notes:
// 1. Statistics use fixed-width integral types when possible; otherwise they are
// encoded as prefix + truncated suffix. SizeStatistics and Statistics.distinct_count
Contributor

distinct_count came up as something that is used by implementations, so let's keep it. I also think we should keep the variable-width size in bytes from SizeStatistics.


Added distinct_count back as optional field.

// GROUP_VAR_INT = 1,

/**
* Deprecated: Dictionary encoding. The values in the dictionary are encoded in the
Contributor

We should maybe comment on why this is kept when we are getting rid of ConvertedType?


Commented out PLAIN_DICTIONARY.

* width. Usable for definition/repetition levels encoding.
* This encoding is deprecated and is replaced by the RLE/bit-packing hybrid encoding.
*/
// BIT_PACKED = 4,
Contributor

Seems inconsistent to have this excluded but not PLAIN_DICTIONARY.


Commented out PLAIN_DICTIONARY.

* Scale must be zero or a positive integer less than or equal to the precision.
* Precision must be a non-zero positive integer.
*
* To maintain forward-compatibility in v1, implementations using this logical
Contributor

Maybe we should remove this constraint if we are cleaning up ConvertedType


Removed.

* LogicalType annotations to replace ConvertedType.
*/
union LogicalType {
StringType:Empty,
Contributor

Is it backwards compatible if we need to evolve away from the Empty type for some reason?

* - if it is a primitive type (leaf) then type is defined and num_children is undefined
* the nodes are listed in depth first traversal order.
*/
table SchemaElement {
Contributor

Nit: this follows directly after type; is it possible to keep the ordering consistent?


Moved.

/** optional statistics for this column chunk */
statistics: Statistics;

/** Indicates whether the column chunk pages are fully dictionary encoded. */
Contributor

we should maybe define this more formally.

bloom_filter: BloomFilterInfo;

/** Optional statistics specific for Geometry and Geography logical types */
geospatial_statistics: GeospatialStatistics;
Contributor

Is there a reason this is placed here and not on Statistics?


It is here to match the Thrift schema.

* Note: while marked as optional, this field is in fact required by most major
* Parquet implementations. As such, writers MUST populate this field.
**/
meta_data: ColumnMetadata;
Contributor

Suggested change
meta_data: ColumnMetadata;
metadata: ColumnMetadata;

Typo? Or is metadata not allowed for some reason?

**/
columns: [ColumnChunk];

/** Total byte size of all the uncompressed column data in this row group **/
Contributor

We should probably note that this represents the uncompressed but encoded size?

Contributor

Or is this the unencoded size?


This comment also comes from the Thrift spec, but I changed it to "Sum of total_uncompressed_size across all columns (uncompressed, encoded)".

@emkornfield (Contributor)

Could we agree to leave it as is for now? If there’s a strong requirement in the future, we can always add it back, since FlatBuffers supports forward and backward compatibility for new fields.

This was brought up at a Parquet sync: there are users that rely on the field, so I think we should include it.

@emkornfield (Contributor)

We still need to add a CI config that makes sure the flatbuffer file can properly be parsed by the flatbuffer compiler.

// ColumnMetaData.is_fully_dict_encoded.
// 4. ColumnMetaData.path_in_schema is removed since it can be derived from the schema.
// 5. ConvertedType is fully dropped as it is superseded by LogicalType.
// 6. Offset and column indexes are removed since they are small and their offsets
Contributor

I'm not sure I understand this exactly?


I will remove the comment line "Offset and column indexes are removed since they are small and their offsets" for now, because there's ongoing discussion about it, and we might add it to the FlatBuffer footer later.

@emkornfield (Contributor) left a comment

have a few more questions.

Current list of blockers:

  1. I think we need to document in the format how this is used (e.g. I believe we are packaging some additional metadata around compression). We can point to the extension doc.
  2. We need CI setup for this file to make sure it compiles.
  3. I think we need to at least add back distinct_count (it hasn't come up but tracking unencoded size is also something that I don't think we should drop).

/**
* Crypto metadata for files with encrypted footer.
*/
table FileCryptoMetaData {
@adamreeve commented Apr 7, 2026

Is there some documentation on how this will be used? My understanding is that for non-encrypted files, they will append an unknown Thrift field with a binary footer as defined here, which will use a "PA3" extension magic.

For encrypted footers, we'd need to store this serialized FileCryptoMetaData blob, as well as the serialized and encrypted FileMetaData, and readers would have to know to expect an encrypted footer rather than try to read a FileMetaData directly. For current Thrift footers, the FileCryptoMetaData and FileMetaData are stored one after the other as separate Thrift serialized structures, as described here.

I don't think this approach would work as-is with Flatbuffer footers though, as we'd need to know the offset to the encrypted FileMetaData, and flatbuffers don't serialize their length by default. Although I noticed that at least some Flatbuffer implementations have a FinishSizePrefixed method that does this.

Maybe this has all been figured out, but if not, here's one suggestion for how encrypted footers might work:

  • The extension magic is "P3E"
  • N in the trailer is the size of the serialized FileCryptoMetaData plus the size of the serialized and encrypted FileMetaData, which are written sequentially, with FileCryptoMetaData first.
  • FileCryptoMetaData has an extra field added, encrypted_footer_size, giving the size of the encrypted FileMetaData

When reading, first the FileCryptoMetaData is read, then the encrypted_footer_size can be used to compute the offset to the encrypted FileMetaData, which can be decrypted and read.

This doesn't cover compressed then encrypted footers as I'm not sure how footer compression works.
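A sketch of FileCryptoMetaData under this suggestion; encryption_algorithm and key_metadata mirror the Thrift FileCryptoMetaData, encrypted_footer_size is the proposed addition, and the exact types here are illustrative:

```
table FileCryptoMetaData {
  encryption_algorithm: EncryptionAlgorithm;  // as in the Thrift schema
  key_metadata: [ubyte];                      // optional key retrieval metadata
  // Proposed addition: size in bytes of the serialized and encrypted
  // FileMetaData written immediately after this structure, letting readers
  // compute its offset within the trailer.
  encrypted_footer_size: long = null;
}
```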

Member

This sounds reasonable to me.


Hi @adamreeve @rok, thanks for flagging this.
Since the FlatBuffer footer is now used as an extension within the original Thrift footer, things have become more complicated, and it doesn't make sense to add FileCryptoMetadata at this point.
In practice, this would mean reading a larger footer, decrypting a longer Thrift footer, and then extracting the FlatBuffer from it. Additionally, it doesn't make sense to replicate the same file crypto data inside the FlatBuffer after it has already been decrypted.
We can add FileCryptoMetadata back once we switch to PAR3. WDYT?


Ah good point, yes it doesn't make sense to separately encrypt the flatbuffer footer if it's contained within an encrypted Thrift footer. So only implementing flatbuffer footer encryption for files without a Thrift footer seems like the right approach.

Member

@ggershinsky could you take a quick look here to confirm this design makes sense regarding encryption?


I guess we don't have to use the same binary encoding layout as for unencrypted footers. We could store separate lengths for the FileCryptoMetadata and the encrypted FileMetaData. That way writers wouldn't need to know the length of the FileMetaData before writing FileCryptoMetadata.

- Rename parquet3.fbs to parquet.fbs
- Comment out deprecated PLAIN_DICTIONARY encoding (like BIT_PACKED)
- Add distinct_count back to Statistics
- Remove ConvertedType forward-compat constraint from DecimalOptions
- Add backward-compat note for LogicalType Empty union types
- Reorder SchemaElement fields to match Thrift ordering
- Expand is_fully_dict_encoded documentation
- Rename meta_data to metadata in ColumnChunk
- Clarify total_byte_size in RowGroup
- Remove FileCryptoMetaData (encrypted footer layout not yet specified)
@Jiayi-Wang-db

Hi @emkornfield, regarding the CI config, is it something I have permission to change?

@etseidl (Contributor) left a comment

Thanks for this, it's good to have something to evaluate. Just a few minor questions. And I do agree with @emkornfield that it would be nice to at least preserve the unencoded size of binary fields from the SizeStatistics. But if those move to an ancillary structure that works too.

* Note: while marked as optional, this field is in fact required by most major
* Parquet implementations. As such, writers MUST populate this field.
**/
metadata: ColumnMetadata;
@etseidl (Contributor) commented Apr 9, 2026

So this is one thing I've always found odd. The column metadata, it appears, was originally meant to be written elsewhere, and the copy in the ColumnChunk was a convenience. We've since moved to making it implicitly required and deprecated writing it elsewhere. So the question I have is: do we really want it here as a separate struct? I think the answer is likely yes due to encryption, but is it easier to maintain this scattershot distribution, or should we instead have a [ColumnMetadata] in the row group?

Just a thought.
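The alternative being floated would look roughly like this (illustrative only, not part of the PR):

```
// Hypothetical: hoist per-chunk metadata into the row group, parallel to
// the columns vector, instead of embedding it in each ColumnChunk.
table RowGroup {
  columns: [ColumnChunk];
  column_metadata: [ColumnMetadata];  // one entry per column chunk
  // ...remaining RowGroup fields unchanged...
}
```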

/**
* Description for column metadata
*/
table ColumnMetadata {
Contributor

I see the encodings list is gone, as well as encoding_stats (replaced by the boolean below). It is useful to know from the metadata whether an unknown encoding has been used so one can fail fast. I agree it's unnecessary for that information to be here, but perhaps a global version in the file metadata would be of use? This could be reduced to a bitmask if a list is unpalatable. The same could be said for compression codecs used.
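One hedged shape for such a file-level summary (field names and the bitmask convention are hypothetical):

```
// Hypothetical file-wide summary fields for fail-fast checks: bit i set
// means the Encoding (or CompressionCodec) with enum value i appears
// somewhere in the file.
encodings_used: ulong = null;
codecs_used: ulong = null;
```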


6 participants