Make path_in_schema optional #563

@etseidl

Describe the enhancement requested

Following up on a discussion on the dev mailing list (https://lists.apache.org/thread/0mp06g0r27s0ynsg3pk54zl5bqc249wg), I'd like to propose making the path_in_schema field in ColumnMetaData optional. As has been pointed out elsewhere, this field carries information that is easily obtainable from the schema, yet it is repeated for every column chunk, so a file with many row groups holds many copies of the same information. This adds a good bit of unnecessary bloat to the Parquet footer. Beyond the size cost, the field is also expensive to parse because it is typed as list<string>.
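To illustrate why the field is redundant, here is a minimal sketch of how a reader could rebuild each column's path from the flattened schema list in the footer. The `SchemaElement` dataclass and the `leaf_paths` and `column_path` helpers are hypothetical stand-ins written for this example, not the API of any Parquet implementation. The sketch assumes the schema is stored depth-first with the root first, and that the column chunks in a row group follow the same order as the leaf columns.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SchemaElement:
    """Hypothetical stand-in for the Thrift SchemaElement (only the fields used here)."""
    name: str
    num_children: Optional[int] = None  # unset for leaf (primitive) columns


def leaf_paths(schema: List[SchemaElement]) -> List[List[str]]:
    """Return the path of every leaf column, in schema (depth-first) order.

    The first element is the root; its name is not part of any column path.
    """
    paths: List[List[str]] = []
    pos = 1  # index of the next unvisited element; 0 is the root

    def visit(prefix: List[str]) -> None:
        nonlocal pos
        elem = schema[pos]
        pos += 1
        path = prefix + [elem.name]
        if not elem.num_children:
            paths.append(path)  # leaf: this is what path_in_schema would hold
        else:
            for _ in range(elem.num_children):
                visit(path)

    for _ in range(schema[0].num_children or 0):
        visit([])
    return paths


def column_path(path_in_schema: Optional[List[str]],
                column_index: int,
                derived: List[List[str]]) -> List[str]:
    """Prefer the stored field; fall back to the schema-derived path if absent."""
    return path_in_schema if path_in_schema is not None else derived[column_index]


# Example schema: id, name { first, last }, tags { list { element } }
example_schema = [
    SchemaElement("schema", 3),
    SchemaElement("id"),
    SchemaElement("name", 2),
    SchemaElement("first"),
    SchemaElement("last"),
    SchemaElement("tags", 1),
    SchemaElement("list", 1),
    SchemaElement("element"),
]
example_paths = leaf_paths(example_schema)
```

With this in place, a reader that encounters a column chunk without path_in_schema can call `column_path(None, i, example_paths)` for the i-th chunk of a row group. The derivation runs once per file, rather than parsing a list of strings for every column chunk.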

I think it would be worthwhile to begin deprecating this field. In the short term (following the advice in CONTRIBUTING.md), we can mark the field optional in parquet.thrift, with the proviso that writers continue to emit it by default for some period of time. Users will be given configuration options that let them turn off this wasteful field if they so choose. Then, once a critical mass of implementations and downstream projects have transitioned, writers will be free to omit the field by default.

It's worth pointing out that @Jiayi-Wang-db has reported on the dev list that 3 out of 5 implementations tested already tolerate the field being missing, and arrow-rs has done so since 57.0.0, making it a 4th. arrow-cpp does not appear to rely on the field beyond needing it for thrift to validate, and parquet-java needs only minor modifications to tolerate its absence.
