Search before asking
What would you like to be improved?
Amoro currently uses the IcebergRewriteExecutor to perform small-file merging. This executor is based on a “row-based rewrite” model: read Parquet files -> deserialize into Record objects -> reserialize -> write to new files. The core logic is located in the 'AbstractRewriteFilesExecutor#rewriterDataFiles()' method.
Existing Issues:
- CPU-intensive: A significant amount of CPU time is consumed by decompressing Parquet files, creating objects, and re-encoding and compressing them.
- Memory pressure: The deserialization process generates a large number of temporary objects, increasing GC pressure.
- Inefficiency: When files are merely merged (without modifying data), the decode and re-encode work of a row-based rewrite is unnecessary and wastes resources.
How should we improve?
For Parquet-format files in an Iceberg table, in scenarios where deleteFile is not called, the proposed approach bypasses row-level read/write operations.
It uses the Hadoop Parquet API's ParquetFileWriter.appendFile() method to merge RowGroups directly, skipping all encoding and decoding steps and thereby achieving fast compaction of Parquet files.
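The eligibility rule described above could be sketched roughly as follows. This is only an illustration, not Amoro's implementation: `RewritePlanner`, `Format`, `Strategy`, `InputFile`, and `choose` are hypothetical names introduced here, and the delete-file count stands in for "deleteFile is not called". The sketch only decides which path a rewrite task would take; the actual merge on the fast path would go through ParquetFileWriter.appendFile().

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical planner for illustration; these types do not exist in Amoro or Iceberg. */
final class RewritePlanner {

    enum Format { PARQUET, ORC, AVRO }

    enum Strategy { APPEND_ROW_GROUPS, ROW_BASED_REWRITE }

    /** Minimal stand-in for one data file that a rewrite task would read. */
    static final class InputFile {
        final String path;
        final Format format;

        InputFile(String path, Format format) {
            this.path = path;
            this.format = format;
        }
    }

    /**
     * Picks the RowGroup-append fast path only when every input is Parquet
     * and no delete files have to be applied; otherwise falls back to the
     * existing row-based rewrite.
     */
    static Strategy choose(List<InputFile> dataFiles, int deleteFileCount) {
        // Nothing to merge, or deletes must be applied row by row.
        if (dataFiles.isEmpty() || deleteFileCount > 0) {
            return Strategy.ROW_BASED_REWRITE;
        }
        for (InputFile file : dataFiles) {
            if (file.format != Format.PARQUET) {
                return Strategy.ROW_BASED_REWRITE;
            }
        }
        // On this path the executor would call ParquetFileWriter.appendFile()
        // once per input file instead of decoding records.
        return Strategy.APPEND_ROW_GROUPS;
    }

    public static void main(String[] args) {
        List<InputFile> files = Arrays.asList(
            new InputFile("a.parquet", Format.PARQUET),
            new InputFile("b.parquet", Format.PARQUET));

        System.out.println(choose(files, 0)); // prints APPEND_ROW_GROUPS
        System.out.println(choose(files, 1)); // prints ROW_BASED_REWRITE
    }
}
```

Keeping the row-based rewrite as the fallback means tasks with delete files or non-Parquet inputs behave exactly as they do today.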
Are you willing to submit PR?
Subtasks
No response
Code of Conduct