Describe the enhancement requested
Following up on a discussion on the dev mailing list (https://lists.apache.org/thread/0mp06g0r27s0ynsg3pk54zl5bqc249wg), I'd like to propose making the `path_in_schema` field in `ColumnMetaData` optional. As has been pointed out elsewhere, this field carries information that is easily obtainable from the schema and is repeated for every column chunk, so files with many row groups hold many copies of the same information. This leads to a good bit of unnecessary bloat in the Parquet footer. Beyond the bloat, the field is also expensive to parse because it is typed as `list<string>`.
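To illustrate why the field is redundant, here is a minimal sketch of how a reader could rebuild each leaf column's path from the flattened, depth-first schema list in the footer. The `SchemaElement` dataclass is a hypothetical stand-in that keeps only `name` and `num_children`, and `leaf_paths` is an illustrative helper, not code from any existing implementation. It assumes the root element is excluded from the path and that any element without children is a leaf.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SchemaElement:
    # Hypothetical stand-in for the Thrift SchemaElement struct;
    # only the two fields needed for path reconstruction are kept.
    name: str
    num_children: Optional[int] = None


def leaf_paths(schema: List[SchemaElement]) -> List[List[str]]:
    """Return the path of every leaf column, in schema (depth-first) order."""
    paths: List[List[str]] = []
    pos = 1  # index 0 is the root, which is assumed not to appear in paths

    def walk(prefix: List[str], count: int) -> None:
        nonlocal pos
        for _ in range(count):
            elem = schema[pos]
            pos += 1
            path = prefix + [elem.name]
            if elem.num_children:
                # Group node: descend into its children.
                walk(path, elem.num_children)
            else:
                # No children: treat as a leaf column.
                paths.append(path)

    walk([], schema[0].num_children or 0)
    return paths


# Example schema: root { id, address { street, city } }
schema = [
    SchemaElement("schema", 2),
    SchemaElement("id"),
    SchemaElement("address", 2),
    SchemaElement("street"),
    SchemaElement("city"),
]
paths = leaf_paths(schema)
print(paths)
```

Because column chunks within a row group follow the same leaf order, a reader could compute these paths once per file and index into them by column position instead of decoding a `list<string>` for every chunk.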
I think it would be worthwhile to embark on the process of deprecating this field. In the short term (following the advice in CONTRIBUTING.md), we can mark the field `optional` in `parquet.thrift`, with the proviso that writers continue to emit it by default for some period of time. Users will be given configuration options allowing them to turn off this wasteful field if they so choose. Then, once a critical mass of implementations and downstream projects have transitioned, writers will be free to omit the field by default.
It's worth pointing out that @Jiayi-Wang-db has reported on the dev list that 3 out of 5 implementations tested already tolerate the field being missing, with arrow-rs (since 57.0.0) making a 4th implementation. arrow-cpp does not appear to rely on the field beyond needing it for Thrift to validate, and parquet-java needs only minor modifications to tolerate its absence.