GH-531: Add parquet flatbuf schema #544

Open · alkis wants to merge 5 commits into apache:master from alkis:patch-2

Conversation


@alkis (Contributor) commented Dec 12, 2025

Rationale for this change

Improve wide table support.

What changes are included in this PR?

Add parquet flatbuf schema.

Do these changes have PoC implementations?

apache/arrow#48431

codec: CompressionCodec;
num_values: long = null; // only present if not equal to rg.num_rows
total_uncompressed_size: long;
total_compressed_size: long;
Contributor

It would be nice to keep the total unencoded size here, which I think is generally useful. But I suppose it can be added after?
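For illustration, keeping it could look like this in the schema. The `total_unencoded_size` field name is hypothetical, mirroring `SizeStatistics.unencoded_byte_array_data_bytes` from the Thrift spec; the other fields are from the diff above:

```
codec: CompressionCodec;
num_values: long = null;            // only present if not equal to rg.num_rows
total_uncompressed_size: long;
total_compressed_size: long;
// Hypothetical: raw unencoded bytes for BYTE_ARRAY columns, kept optional
// so writers that do not track it pay nothing.
total_unencoded_size: long = null;
```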

dictionary_page_offset: long = null;
statistics: Statistics;
is_fully_dict_encoded: bool;
bloom_filter_offset: long = null;
Contributor

Should this be made a struct/value to make the bloom filter info more self-contained?
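A rough sketch of what a self-contained grouping could look like; the diff further down does show a `bloom_filter: BloomFilterInfo` field, but the field set here is illustrative, not taken from the PR:

```
// Hypothetical: group the bloom filter location into one table so future
// bloom-filter attributes can be added without touching ColumnMetadata.
table BloomFilterInfo {
  offset: long = null;   // file offset of the serialized bloom filter
  length: int = null;    // serialized size in bytes, if known
}
```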

row_groups: [RowGroup];
kv: [KV];
created_by: string;
// column_orders: [ColumnOrder]; // moved to SchemaElement
Contributor

remove this row for now?

@emkornfield (Contributor) left a comment

I think we also need to add an Apache header here, and CI to make sure this compiles?

@Jiayi-Wang-db

Hi @rok and @emkornfield, could you take another look at this PR?

min_lo4: uint;
min_lo8: ulong;
min_hi8: ulong;
min_len: byte = null;
Member

Suggested change
min_len: byte = null;
min_len: int = null;

The original suffix length could exceed the int8 range of the byte type.


Hi @rok, the previous comment is outdated. max_len and min_len store the truncated suffix length, which means the maximum value is 16. We use negative numbers to represent inexact values. I have updated the comment and the example, please take a look.
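Restating the agreed semantics as schema comments (my paraphrase of this thread, not wording from the PR):

```
min_lo4: uint;    // fixed-width pieces of the encoded min value
min_lo8: ulong;
min_hi8: ulong;
// Truncated suffix length, at most 16, so a byte suffices; a negative
// value marks the stored min as inexact (truncated).
min_len: byte = null;
```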

max_lo4: uint;
max_lo8: ulong;
max_hi8: ulong;
max_len: byte = null;
Member

As above:

Suggested change
max_len: byte = null;
max_len: int = null;


/** repetition of the field. The root of the schema does not have a repetition_type.
* All other nodes must have one */
repetition_type: FieldRepetitionType;
Member

To allow the root to not have a repetition type. In Thrift this field is optional:

3: optional FieldRepetitionType repetition_type;

Suggested change
repetition_type: FieldRepetitionType;
repetition_type: FieldRepetitionType = null;

@rok (Member) commented Mar 11, 2026

@Jiayi-Wang-db Feel free to resolve comments you feel were addressed to make this more readable?

@rok (Member) commented Mar 11, 2026

Is parquet3.bfbs required?

@Jiayi-Wang-db

Is parquet3.bfbs required?

Not at all. I pushed it accidentally. Removed.

@rok (Member) commented Mar 13, 2026

@emkornfield could you do a pass and perhaps close the no longer relevant comments?

@Jiayi-Wang-db

Hi @rok @emkornfield, it seems the only concern with this FlatBuffers spec is that we deprecated distinct_count. Could we agree to leave it as is for now? If there's a strong requirement in the future, we can always add it back, since FlatBuffers supports forward and backward compatibility for new fields.

@rok (Member) commented Mar 18, 2026

@Jiayi-Wang-db I think it'd be great if @emkornfield did another pass and I'll try to do one now. Sorry, it's taking a while.

Contributor

I forget if we closed on this but could we just call this parquet.fbs?

//
// Optimization notes:
// 1. Statistics use fixed-width integral types when possible; otherwise they are
// encoded as prefix + truncated suffix. SizeStatistics and Statistics.distinct_count
Contributor

distinct_count came up as something that is used by implementations, so let's keep it. I also think we should keep the variable-width size in bytes from SizeStatistics.


Added distinct_count back as optional field.

// GROUP_VAR_INT = 1,

/**
* Deprecated: Dictionary encoding. The values in the dictionary are encoded in the
Contributor

We should maybe comment on why this is kept when we are getting rid of ConvertedType?


Commented out PLAIN_DICTIONARY.

* width. Usable for definition/repetition levels encoding.
* This encoding is deprecated and is replaced by the RLE/bit-packing hybrid encoding.
*/
// BIT_PACKED = 4,
Contributor

Seems inconsistent to have this excluded but not PLAIN_DICTIONARY.


Commented out PLAIN_DICTIONARY.

* Scale must be zero or a positive integer less than or equal to the precision.
* Precision must be a non-zero positive integer.
*
* To maintain forward-compatibility in v1, implementations using this logical
Contributor

Maybe we should remove this constraint if we are cleaning up ConvertedType


Removed.

* LogicalType annotations to replace ConvertedType.
*/
union LogicalType {
StringType:Empty,
Contributor

Is it backwards compatible if we need to evolve away from the Empty type for some reason?

* - if it is a primitive type (leaf) then type is defined and num_children is undefined
* the nodes are listed in depth first traversal order.
*/
table SchemaElement {
Contributor

Nit: this follows directly after type; is it possible to keep the ordering consistent?


Moved.

/** optional statistics for this column chunk */
statistics: Statistics;

/** Indicates whether the column chunk pages are fully dictionary encoded. */
Contributor

we should maybe define this more formally.

bloom_filter: BloomFilterInfo;

/** Optional statistics specific for Geometry and Geography logical types */
geospatial_statistics: GeospatialStatistics;
Contributor

Is there a reason this is placed here and not on Statistics?


It is here to match the Thrift schema.

* Note: while marked as optional, this field is in fact required by most major
* Parquet implementations. As such, writers MUST populate this field.
**/
meta_data: ColumnMetadata;
Contributor

Suggested change
meta_data: ColumnMetadata;
metadata: ColumnMetadata;

Typo? Or is metadata not allowed for some reason?

**/
columns: [ColumnChunk];

/** Total byte size of all the uncompressed column data in this row group **/
Contributor

We should probably note that this represents the uncompressed but encoded size?

Contributor

Or is this the unencoded size?


This comment also comes from the Thrift spec, but I changed it to "Sum of total_uncompressed_size across all columns (uncompressed, encoded)".

@emkornfield (Contributor)

Could we agree to leave it as is for now? If there’s a strong requirement in the future, we can always add it back, since FlatBuffers supports forward and backward compatibility for new fields.

This was brought up at a Parquet sync: there are users that rely on the field, so I think we should include it.

@emkornfield (Contributor)

We still need to add a CI config that makes sure the flatbuffer file can properly be parsed by the flatbuffer compiler.

// ColumnMetaData.is_fully_dict_encoded.
// 4. ColumnMetaData.path_in_schema is removed since it can be derived from the schema.
// 5. ConvertedType is fully dropped as it is superseded by LogicalType.
// 6. Offset and column indexes are removed since they are small and their offsets
Contributor

I'm not sure I understand this exactly?


I will remove the comment line "Offset and column indexes are removed since they are small and their offsets" for now, because there's ongoing discussion about it, and we might add it to the FlatBuffer footer later.

@emkornfield (Contributor) left a comment

have a few more questions.

Current list of blockers:

  1. I think we need to document in the format how this is used (e.g. I believe we are packaging some additional metadata around compression). We can point to the extension doc.
  2. We need CI setup for this file to make sure it compiles.
  3. I think we need to at least add back distinct_count (it hasn't come up but tracking unencoded size is also something that I don't think we should drop).

/**
* Crypto metadata for files with encrypted footer.
*/
table FileCryptoMetaData {
@adamreeve commented Apr 7, 2026

Is there some documentation on how this will be used? My understanding is that for non-encrypted files, they will append an unknown Thrift field with a binary footer as defined here, which will use a "PA3" extension magic.

For encrypted footers, we'd need to store this serialized FileCryptoMetaData blob, as well as the serialized and encrypted FileMetaData, and readers would have to know to expect an encrypted footer rather than try to read a FileMetaData directly. For current Thrift footers, the FileCryptoMetaData and FileMetaData are stored one after the other as separate Thrift serialized structures, as described here.

I don't think this approach would work as-is with Flatbuffer footers though, as we'd need to know the offset to the encrypted FileMetaData, and flatbuffers don't serialize their length by default. Although I noticed that at least some Flatbuffer implementations have a FinishSizePrefixed method that does this.

Maybe this has all been figured out, but if not, here's one suggestion for how encrypted footers might work:

  • The extension magic is "P3E"
  • N in the trailer is the size of the serialized FileCryptoMetaData plus the size of the serialized and encrypted FileMetaData, which are written sequentially, with FileCryptoMetaData first.
  • FileCryptoMetaData has an extra field added, encrypted_footer_size, giving the size of the encrypted FileMetaData

When reading, first the FileCryptoMetaData is read, then the encrypted_footer_size can be used to compute the offset to the encrypted FileMetaData, which can be decrypted and read.

This doesn't cover compressed then encrypted footers as I'm not sure how footer compression works.
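A sketch of FileCryptoMetaData under this suggestion; encryption_algorithm and key_metadata mirror the Thrift FileCryptoMetaData, encrypted_footer_size is the proposed addition, and the exact types here are illustrative:

```
table FileCryptoMetaData {
  encryption_algorithm: EncryptionAlgorithm;  // as in the Thrift schema
  key_metadata: [ubyte];                      // optional key retrieval metadata
  // Proposed addition: size in bytes of the serialized and encrypted
  // FileMetaData written immediately after this structure, letting readers
  // compute its offset within the trailer.
  encrypted_footer_size: long = null;
}
```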

Member

This sounds reasonable to me.


Hi @adamreeve @rok, thanks for flagging this.
Since the FlatBuffer footer is now used as an extension within the original Thrift footer, things have become more complicated, and it doesn't make sense to add FileCryptoMetadata at this point.
In practice, this would mean reading a larger footer, decrypting a longer Thrift footer, and then extracting the FlatBuffer from it. Additionally, it doesn't make sense to replicate the same file crypto data inside the FlatBuffer after it has already been decrypted.
We can add FileCryptoMetadata back once we switch to PAR3. WDYT?


Ah good point, yes it doesn't make sense to separately encrypt the flatbuffer footer if it's contained within an encrypted Thrift footer. So only implementing flatbuffer footer encryption for files without a Thrift footer seems like the right approach.

Member

@ggershinsky could you take a quick look here to confirm this design makes sense regarding encryption?


I guess we don't have to use the same binary encoding layout as for unencrypted footers. We could store separate lengths for the FileCryptoMetadata and the encrypted FileMetaData. That way writers wouldn't need to know the length of the FileMetaData before writing FileCryptoMetadata.

- Rename parquet3.fbs to parquet.fbs
- Comment out deprecated PLAIN_DICTIONARY encoding (like BIT_PACKED)
- Add distinct_count back to Statistics
- Remove ConvertedType forward-compat constraint from DecimalOptions
- Add backward-compat note for LogicalType Empty union types
- Reorder SchemaElement fields to match Thrift ordering
- Expand is_fully_dict_encoded documentation
- Rename meta_data to metadata in ColumnChunk
- Clarify total_byte_size in RowGroup
- Remove FileCryptoMetaData (encrypted footer layout not yet specified)
@Jiayi-Wang-db

Hi @emkornfield, regarding the CI config, is it something I have permission to change?

@etseidl (Contributor) left a comment

Thanks for this, it's good to have something to evaluate. Just a few minor questions. And I do agree with @emkornfield that it would be nice to at least preserve the unencoded size of binary fields from the SizeStatistics. But if those move to an ancillary structure that works too.

* Note: while marked as optional, this field is in fact required by most major
* Parquet implementations. As such, writers MUST populate this field.
**/
metadata: ColumnMetadata;
@etseidl (Contributor) commented Apr 9, 2026

So this is one thing I've always found odd. The column metadata, it appears, was originally meant to be written elsewhere, and the copy in the ColumnChunk was a convenience. We've since moved to making it implicitly required and deprecated writing it elsewhere. So the question I have is: do we really want it here as a separate struct? I think the answer is likely yes due to encryption, but is it easier to maintain this scattershot distribution, or should we instead have a [ColumnMetadata] in the row group?

Just a thought.
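The alternative being floated would look roughly like this (illustrative only, not part of the PR):

```
// Hypothetical: hoist per-chunk metadata into the row group, parallel to
// the columns vector, instead of embedding it in each ColumnChunk.
table RowGroup {
  columns: [ColumnChunk];
  column_metadata: [ColumnMetadata];  // one entry per column chunk
  // ...remaining RowGroup fields unchanged...
}
```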

/**
* Description for column metadata
*/
table ColumnMetadata {
Contributor

I see the encodings list is gone, as well as encoding_stats (replaced by the boolean below). It is useful to know from the metadata whether an unknown encoding has been used so one can fail fast. I agree it's unnecessary for that information to be here, but perhaps a global version in the file metadata would be of use? This could be reduced to a bitmask if a list is unpalatable. The same could be said for compression codecs used.
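One hedged shape for such a file-level summary (field names and the bitmask convention are hypothetical):

```
// Hypothetical file-wide summary fields for fail-fast checks: bit i set
// means the Encoding (or CompressionCodec) with enum value i appears
// somewhere in the file.
encodings_used: ulong = null;
codecs_used: ulong = null;
```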


6 participants