I recently read the Amoro 10x Efficient blog on Medium:
https://medium.com/@jinsong.zhou1990/10x-efficiency-boost-compared-to-spark-rewritefiles-procedure-how-apache-amoro-efficiently-7e7a993950d7
The performance improvements demonstrated are highly impressive. However, a few details of the benchmark setup are not covered in the post. We would love to get more information on the following points to better understand the comparison.
- The CREATE TABLE statement shows the schema, but it does not specify the primary key. Was it a single primary key or a composite key?
- What mix of CDC operations was applied during the benchmark? Was it updates only, inserts only, deletes only, or a combination?
- Since only the schema is provided, could you share what dataset was used for this comparison?
- What were the specific Amoro optimizer parameters and configurations set during the benchmark?
- The blog does not mention partitioning. Was the table partitioned, and if so, what was the partition spec?
- Which exact version of Amoro was used for this benchmarking?
- What was the level of task parallelism configured for both Amoro and Spark during this test?
- Before running the compaction, was the data within the data files sorted by the primary key column?
- Which specific column or columns were used as the equality identifiers for the equality delete files generated during the CDC workload?
- Were the data files sorted based on the equality delete column?
- How much data was changed per CDC event? For instance, 100k rows updated per event. It would also be really helpful to know what kind of data was being changed during each CDC event.
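
To make the first few questions concrete, here is the kind of detail that would help. This is a purely hypothetical sketch: the table name, column names, key choice, partition spec, and rewrite options below are placeholders and are not taken from the blog:

```sql
-- Hypothetical DDL illustrating what we are asking about:
-- was the primary key single or composite, and was there a partition spec?
CREATE TABLE db.orders (
    order_id   BIGINT,
    user_id    BIGINT,
    amount     DECIMAL(10, 2),
    updated_at TIMESTAMP,
    PRIMARY KEY (order_id)          -- single key, or e.g. (order_id, user_id)?
) USING iceberg
PARTITIONED BY (days(updated_at));  -- partitioned at all? on which column?

-- And for the Spark baseline, was it a plain rewrite_data_files call,
-- or were options such as parallelism tuned? (values below are made up)
CALL catalog.system.rewrite_data_files(
    table => 'db.orders',
    options => map('max-concurrent-file-group-rewrites', '5')
);
```

Sharing the actual DDL and the exact `rewrite_data_files` invocation used as the baseline would answer several of the points above at once.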
Any additional context you can provide regarding the test environment and workload would be incredibly helpful for the community. Thank you for the great article and your hard work on the project!