Conversation
| codec: CompressionCodec; | ||
| num_values: long = null; // only present if not equal to rg.num_rows | ||
| total_uncompressed_size: long; | ||
| total_compressed_size: long; |
There was a problem hiding this comment.
It would be nice to keep total unencoded size here which I think is generally useful? But I suppose it can be added after?
src/main/flatbuf/parquet3.fbs
Outdated
| dictionary_page_offset: long = null; | ||
| statistics: Statistics; | ||
| is_fully_dict_encoded: bool; | ||
| bloom_filter_offset: long = null; |
There was a problem hiding this comment.
Should we this be made a struct/value to make the bloom filter info more self contained?
src/main/flatbuf/parquet3.fbs
Outdated
| row_groups: [RowGroup]; | ||
| kv: [KV]; | ||
| created_by: string; | ||
| // column_orders: [ColumnOrder]; // moved to SchemaElement |
There was a problem hiding this comment.
remove this row for now?
emkornfield
left a comment
There was a problem hiding this comment.
I think we also need to add an apache header here, and CI to make sure this compiles?
|
Hi @rok and @emkornfield , could you help to have another look of this pr? |
| min_lo4: uint; | ||
| min_lo8: ulong; | ||
| min_hi8: ulong; | ||
| min_len: byte = null; |
There was a problem hiding this comment.
| min_len: byte = null; | |
| min_len: int = null; |
Original suffix lenght could exceed int8 range of byte type.
There was a problem hiding this comment.
Hi @rok , the previous comment is outdated. max_len and min_len store the truncated suffix length, which means the maximum value is 16. We use negative numbers to represent inexact values. I have updated the comment and the example, please take a look.
| max_lo4: uint; | ||
| max_lo8: ulong; | ||
| max_hi8: ulong; | ||
| max_len: byte = null; |
There was a problem hiding this comment.
As above:
| max_len: byte = null; | |
| max_len: int = null; |
src/main/flatbuf/parquet3.fbs
Outdated
|
|
||
| /** repetition of the field. The root of the schema does not have a repetition_type. | ||
| * All other nodes must have one */ | ||
| repetition_type: FieldRepetitionType; |
There was a problem hiding this comment.
To allow for root to not have repetition type. In thrift we have optional:
parquet-format/src/main/thrift/parquet.thrift
Line 518 in 38818fa
| repetition_type: FieldRepetitionType; | |
| repetition_type: FieldRepetitionType = null; |
|
@Jiayi-Wang-db Feel free to resolve comments you feel were addressed to make this more readable? |
|
Is |
Not at all. I pushed it accidentally. Removed. |
|
@emkornfield could you do a pass and perhaps close the no longer relevant comments? |
|
Hi @rok @emkornfield , it seems the only concern with this FlatBuffers spec is that we deprecated |
|
@Jiayi-Wang-db I think it'd be great if @emkornfield did another pass and I'll try to do one now. Sorry, it's taking a while. |
There was a problem hiding this comment.
I forget if we closed on this but could we just call this parquet.fbs?
| // | ||
| // Optimization notes: | ||
| // 1. Statistics use fixed-width integral types when possible; otherwise they are | ||
| // encoded as prefix + truncated suffix. SizeStatistics and Statistics.distinct_count |
There was a problem hiding this comment.
distinct_count came up as something that is used by implementations lets keep this. I also think we should keep the variable with size in bytes from SizeStatistics.
There was a problem hiding this comment.
Added distinct_count back as optional field.
| // GROUP_VAR_INT = 1, | ||
|
|
||
| /** | ||
| * Deprecated: Dictionary encoding. The values in the dictionary are encoded in the |
There was a problem hiding this comment.
We should maybe comment on why this is kept when we are getting rid of converted type?
| * width. Usable for definition/repetition levels encoding. | ||
| * This encoding is deprecated and is replaced by the RLE/bit-packing hybrid encoding. | ||
| */ | ||
| // BIT_PACKED = 4, |
There was a problem hiding this comment.
seemes inconsistent to have this excluded but not PLAIN_DICTIONARY
src/main/flatbuf/parquet3.fbs
Outdated
| * Scale must be zero or a positive integer less than or equal to the precision. | ||
| * Precision must be a non-zero positive integer. | ||
| * | ||
| * To maintain forward-compatibility in v1, implementations using this logical |
There was a problem hiding this comment.
Maybe we should remove this constraint if we are cleaning up ConvertedType
src/main/flatbuf/parquet3.fbs
Outdated
| * LogicalType annotations to replace ConvertedType. | ||
| */ | ||
| union LogicalType { | ||
| StringType:Empty, |
There was a problem hiding this comment.
Is it backwards compatible if we need to evolve away from the Empty type for some reason?
| * - if it is a primitive type (leaf) then type is defined and num_children is undefined | ||
| * the nodes are listed in depth first traversal order. | ||
| */ | ||
| table SchemaElement { |
There was a problem hiding this comment.
nit, this follows directly after type, is it is possible to keep the ordering consistent?
src/main/flatbuf/parquet3.fbs
Outdated
| /** optional statistics for this column chunk */ | ||
| statistics: Statistics; | ||
|
|
||
| /** Indicates whether the column chunk pages are fully dictionary encoded. */ |
There was a problem hiding this comment.
we should maybe define this more formally.
| bloom_filter: BloomFilterInfo; | ||
|
|
||
| /** Optional statistics specific for Geometry and Geography logical types */ | ||
| geospatial_statistics: GeospatialStatistics; |
There was a problem hiding this comment.
Is there a reason this is placed here an not on Statistics?
There was a problem hiding this comment.
It is here to match the thrift schema.
src/main/flatbuf/parquet3.fbs
Outdated
| * Note: while marked as optional, this field is in fact required by most major | ||
| * Parquet implementations. As such, writers MUST populate this field. | ||
| **/ | ||
| meta_data: ColumnMetadata; |
There was a problem hiding this comment.
| meta_data: ColumnMetadata; | |
| metadata: ColumnMetadata; |
typo? or is metadata not allowed for some reason?
src/main/flatbuf/parquet3.fbs
Outdated
| **/ | ||
| columns: [ColumnChunk]; | ||
|
|
||
| /** Total byte size of all the uncompressed column data in this row group **/ |
There was a problem hiding this comment.
We should probably note this represent uncompressed but encoded size?
There was a problem hiding this comment.
Or is this the unencoded size?
There was a problem hiding this comment.
This comment is also come from thrift spec. But I changed it to Sum of total_uncompressed_size across all columns (uncompressed, encoded) .
This was brought up on at a parquet sync that there are users that rely on the field, I think we should include it. |
|
We still need to add a CI config that makes sure the flatbuffer file can properly be parsed by the flatbuffer compiler. |
src/main/flatbuf/parquet.fbs
Outdated
| // ColumnMetaData.is_fully_dict_encoded. | ||
| // 4. ColumnMetaData.path_in_schema is removed since it can be derived from the schema. | ||
| // 5. ConvertedType is fully dropped as it is superseded by LogicalType. | ||
| // 6. Offset and column indexes are removed since they are small and their offsets |
There was a problem hiding this comment.
I'm not sure I understand this exactly?
There was a problem hiding this comment.
I will remove this line of comment Offset and column indexes are removed since they are small and their offsets for now, because there’s ongoing discussion about it, and we might add it to the FlatBuffer footer later.
emkornfield
left a comment
There was a problem hiding this comment.
have a few more questions.
Current list of blockers:
- I think we need to document in the format how this is used (e.g. I believe we are packaging some additional metadata around compression). We can point to the extension doc.
- We need CI setup for this file to make sure it compiles.
- I think we need to at least add back distinct_count (it hasn't come up but tracking unencoded size is also something that I don't think we should drop).
src/main/flatbuf/parquet3.fbs
Outdated
| /** | ||
| * Crypto metadata for files with encrypted footer. | ||
| */ | ||
| table FileCryptoMetaData { |
There was a problem hiding this comment.
Is there some documentation on how this will be used? My understanding is that for non-encrypted files, they will append an unknown Thrift field with a binary footer as defined here, which will use a "PA3" extension magic.
For encrypted footers, we'd need to store this serialized FileCryptoMetaData blob, as well as the serialized and encrypted FileMetaData, and readers would have to know to expect an encrypted footer rather than try to read a FileMetaData directly. For current Thrift footers, the FileCryptoMetaData and FileMetaData are stored one after the other as separate Thrift serialized structures, as described here.
I don't think this approach would work as-is with Flatbuffer footers though, as we'd need to know the offset to the encrypted FileMetaData, and flatbuffers don't serialize their length by default. Although I noticed that at least some Flatbuffer implementations have a FinishSizePrefixed method that does this.
Maybe this has all been figured out, but if not, here's one suggestion for how encrypted footers might work:
- The extension magic is "P3E"
Nin the trailer is the size of the serializedFileCryptoMetaDataplus the size of the serialized and encryptedFileMetaData, which are written sequentially, withFileCryptoMetaDatafirst.FileCryptoMetaDatahas an extra field added,encrypted_footer_size, giving the size of the encryptedFileMetaData
When reading, first the FileCryptoMetaData is read, then the encrypted_footer_size can be used to compute the offset to the encypted FileMetaData, which can be decrypted and read.
This doesn't cover compressed then encrypted footers as I'm not sure how footer compression works.
There was a problem hiding this comment.
Hi @adamreeve @rok , thanks for flagging this.
Since the FlatBuffer footer is now used as an extension within the original Thrift footer, things have become more complicated.
It doesn’t make sense to add FileCryptoMetadata at this point.
In practice, this would mean reading a larger footer, decrypting a longer Thrift footer, and then extracting the FlatBuffer from it. Additionally, it doesn’t make sense to replicate the same file crypto data inside the FlatBuffer after it has already been decrypted.
We can add the FileCryptoMetadata back once we switch to PAR3. WDYT?
There was a problem hiding this comment.
Ah good point, yes it doesn't make sense to separately encrypt the flatbuffer footer if it's contained within an encrypted Thrift footer. So only implementing flatbuffer footer encryption for files without a Thrift footer seems like the right approach.
There was a problem hiding this comment.
@ggershinsky could you take a quick look here to confirm this design makes sense regarding encryption?
There was a problem hiding this comment.
I guess we don't have to use the same binary encoding layout as for unencrypted footers. We could store separate lengths for the FileCryptoMetadata and the encrypted FileMetaData. That way writers wouldn't need to know the length of the FileMetaData before writing FileCryptoMetadata.
- Rename parquet3.fbs to parquet.fbs - Comment out deprecated PLAIN_DICTIONARY encoding (like BIT_PACKED) - Add distinct_count back to Statistics - Remove ConvertedType forward-compat constraint from DecimalOptions - Add backward-compat note for LogicalType Empty union types - Reorder SchemaElement fields to match Thrift ordering - Expand is_fully_dict_encoded documentation - Rename meta_data to metadata in ColumnChunk - Clarify total_byte_size in RowGroup - Remove FileCryptoMetaData (encrypted footer layout not yet specified)
|
Hi @emkornfield , regarding the CI config, is it something I have the permission to change? |
etseidl
left a comment
There was a problem hiding this comment.
Thanks for this, it's good to have something to evaluate. Just a few minor questions. And I do agree with @emkornfield that it would be nice to at least preserve the unencoded size of binary fields from the SizeStatistics. But if those move to an ancillary structure that works too.
| * Note: while marked as optional, this field is in fact required by most major | ||
| * Parquet implementations. As such, writers MUST populate this field. | ||
| **/ | ||
| metadata: ColumnMetadata; |
There was a problem hiding this comment.
So this is one thing I've always found odd. The column metadata it appears was originally to be written elsewhere, and the copy in the ColumnChunk was a convenience. We've since moved to making it implicitly required, and deprecated writing it elsewhere. So the question I have is do we really want it here as a separate struct? I think the answer is likely yes due to encryption, but is it easier to maintain this scattershot distribution, or should we instead have a [ColumnMetadata] in the row group?
Just a thought.
| /** | ||
| * Description for column metadata | ||
| */ | ||
| table ColumnMetadata { |
There was a problem hiding this comment.
I see the encodings list is gone as well as encoding_stats (replaced by the boolean below). It is useful to know from the metadata whether an unknown encoding has been used so one can fail fast. I agree it's unnecessary for that information to be here, but perhaps a global version in the file metadata would be of use? This could be reduced to a bitmask if a list is unpalatable. Same could be said for compression codecs used.
Rationale for this change
Improve wide table support.
What changes are included in this PR?
Add parquet flatbuf schema.
Do these changes have PoC implementations?
apache/arrow#48431