
[Improvement]: Implement row group merging using ParquetFileWriter#appendFile #4186

@zhangwl9

Description


Search before asking

  • I have searched in the issues and found no similar issues.

What would you like to be improved?

Amoro currently uses the `IcebergRewriteExecutor` to perform small-file merging. This executor follows a row-based rewrite model: read Parquet files -> deserialize into `Record` objects -> re-serialize -> write new files. The core logic lives in the `AbstractRewriteFilesExecutor#rewriterDataFiles()` method.
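
For illustration, the record-level path corresponds roughly to the following sketch built on Iceberg's generic Parquet reader/writer (a simplified stand-in for Amoro's executor, not its actual code; the class and method names are hypothetical):

```java
import java.io.IOException;

import org.apache.iceberg.Schema;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.data.parquet.GenericParquetReaders;
import org.apache.iceberg.data.parquet.GenericParquetWriter;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.io.FileAppender;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;
import org.apache.iceberg.parquet.Parquet;

public class RowBasedRewriteSketch {
  // Every row is decompressed, decoded into a Record object, then
  // re-encoded and re-compressed on write -- the costly path described above.
  static void rewrite(Schema schema, InputFile input, OutputFile output) throws IOException {
    try (CloseableIterable<Record> records =
            Parquet.read(input)
                .project(schema)
                .createReaderFunc(type -> GenericParquetReaders.buildReader(schema, type))
                .build();
        FileAppender<Record> appender =
            Parquet.write(output)
                .schema(schema)
                .createWriterFunc(GenericParquetWriter::buildWriter)
                .build()) {
      appender.addAll(records);
    }
  }
}
```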
Existing Issues:

  1. CPU-intensive: significant CPU time is spent decompressing Parquet pages, materializing objects, and then re-encoding and re-compressing them.
  2. Memory pressure: deserialization generates a large number of short-lived temporary objects, increasing GC pressure.
  3. Inefficient: when files are merely merged (no data is modified), a row-based rewrite wastes a large amount of resources.

How should we improve?

For Parquet-format data files in an Iceberg table, in scenarios where no delete files need to be applied, row-level read and write operations can be bypassed entirely. Using the Parquet Hadoop API's `ParquetFileWriter.appendFile()` method to copy row groups directly skips all encoding and decoding steps, enabling fast compaction of Parquet files.
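
A minimal sketch of the proposed copy path, assuming all inputs share the same Parquet schema (the class and method names here are hypothetical):

```java
import java.io.IOException;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.hadoop.util.HadoopOutputFile;
import org.apache.parquet.schema.MessageType;

public class RowGroupMergeSketch {
  // Concatenates the row groups of several same-schema Parquet files into one
  // output file; pages are copied as raw bytes, so nothing is decoded or re-encoded.
  static void mergeFiles(Configuration conf, List<Path> inputs, Path output) throws IOException {
    // appendFile requires every input to match the writer's schema,
    // so take the schema from the first input's footer.
    MessageType schema;
    try (ParquetFileReader reader =
            ParquetFileReader.open(HadoopInputFile.fromPath(inputs.get(0), conf))) {
      schema = reader.getFooter().getFileMetaData().getSchema();
    }

    ParquetFileWriter writer =
        new ParquetFileWriter(
            HadoopOutputFile.fromPath(output, conf),
            schema,
            ParquetFileWriter.Mode.CREATE,
            ParquetWriter.DEFAULT_BLOCK_SIZE,
            ParquetWriter.MAX_PADDING_SIZE_DEFAULT);
    writer.start();
    for (Path input : inputs) {
      // Copies each row group from the input file without decompressing its pages.
      writer.appendFile(HadoopInputFile.fromPath(input, conf));
    }
    writer.end(Collections.emptyMap());
  }
}
```

Note that `appendFile()` preserves the original row groups as-is, so merging many small files yields one file with many small row groups; the gain comes from skipping (de)compression and (de)serialization, not from row-group consolidation.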

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Subtasks

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
