
Add Hive and Iceberg Load benchmark #55

Open

PingLiuPing wants to merge 15 commits into prestodb:main from PingLiuPing:lpingbj_load_tpch

Conversation


@PingLiuPing PingLiuPing commented Apr 24, 2025

The loading (insert) benchmark is missing in pbench; this PR adds the initial files for a loading benchmark. It includes test files for the Hive and Iceberg connectors, both native and Java. The data is loaded from the TPCH connector on the fly.

Future enhancements are required to make the benchmark run in stages, such as a prepare stage, a main stage, and a cleanup stage.

@PingLiuPing PingLiuPing marked this pull request as ready for review April 24, 2025 16:55
@PingLiuPing PingLiuPing changed the title from "Add TPCH Load benchmark" to "Add Hive and Iceberg Load benchmark" Apr 24, 2025
@PingLiuPing PingLiuPing self-assigned this Aug 4, 2025
@PingLiuPing
Author

@wanglinsong @ethanyzhang Sorry for the late response; this PR slipped my mind. I have addressed your comments. Can you please take another look? Thanks.

Member

@wanglinsong wanglinsong left a comment


I believe the DDL to create tables is the same across all scale factors. Can you parameterize or remove the hardcoded schema prefix (tpch.sf100.)?

FROM tpch.sf100.customer;

@PingLiuPing
Author

I believe the DDL to create tables is the same across all scale factors. Can you parameterize or remove the hardcoded schema prefix (tpch.sf100.)?

FROM tpch.sf100.customer;

Thanks. I think supporting that in the current framework would require a lot of work.

@wanglinsong
Member

wanglinsong commented Aug 21, 2025

I believe the DDL to create tables is the same across all scale factors. Can you parameterize or remove the hardcoded schema prefix (tpch.sf100.)?
FROM tpch.sf100.customer;

Thanks. I think supporting that in the current framework would require a lot of work.

Oh, this is an embedded connector. This is not an issue at all. Please ignore.

@PingLiuPing
Author

Hi @wanglinsong, thanks for your comments. Do you think this PR is ready to be merged? Is there anything else you want me to change?

@PingLiuPing
Author

@wanglinsong Can you please have another look? Thanks.

@PingLiuPing
Author

@wanglinsong gentle ping.

@xpengahana
Contributor

Have we tested this PR for pbench? @PingLiuPing

@PingLiuPing
Author

Have we tested this PR for pbench? @PingLiuPing

Thanks. I didn't retest this after fixing the review comments; before that, I had tested it in pbench.
So commit aee3644 might not be correct.
I will verify this.

@ethanyzhang ethanyzhang force-pushed the main branch 2 times, most recently from 6f32a1a to b8de6a7 on February 13, 2026 04:02
@ethanyzhang
Collaborator

ethanyzhang commented Feb 23, 2026

Thanks for adding the TPC-H loading benchmarks! I reviewed the files and have some findings before we integrate.

Issues to fix

1. prepare_sf1000.json description typo
The description says "scaling factor 100"; it should say "scaling factor 1000".

2. expected_row_counts mismatch in insert stages
All 4 insert stages (insert_sf100_j.json, insert_sf100_n.json, insert_sf1000_j.json, insert_sf1000_n.json) have 10 values but only 8 query files (i.e. 8 queries). The extra 2 values are silently ignored; there should be exactly 8.

3. expected_row_counts off-by-one in prepare stages
schema/create_sf100.sql and schema/create_sf1000.sql each contain 2 SQL statements (CREATE SCHEMA + USE), so the total number of queries per prepare stage is 2 + 8 = 10. But expected_row_counts only has 9 values, so the last CREATE TABLE won't be validated. It should have 10.

4. Column type mismatches vs TPC-H spec
These columns should be INTEGER, not BIGINT:

  • lineitem.linenumber
  • orders.shippriority
  • part.size
  • partsupp.availqty

5. DECIMAL(12,2) should be DECIMAL(15,2)
The TPC-H spec defines decimal columns with precision 15. Precision 12 could cause truncation at larger scale factors.

@PingLiuPing @xpengahana Did we generate the right schema?
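Items 4 and 5 combined might look like the sketch below. This is a hypothetical excerpt, not the PR's actual DDL: the column names follow the TPC-H spec's prefixed convention and may be named differently in the PR.

```sql
-- Hypothetical excerpt showing only the affected columns.
CREATE TABLE lineitem (
    l_linenumber    INTEGER,        -- was BIGINT
    l_quantity      DECIMAL(15, 2), -- was DECIMAL(12, 2)
    l_extendedprice DECIMAL(15, 2)  -- was DECIMAL(12, 2)
    -- remaining columns as in the PR
);
```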

Suggestions (non-blocking)

6. No IF EXISTS on DROP statements
DROP TABLE and DROP SCHEMA will fail if tables/schema don't exist (e.g. after a partial run). Consider DROP TABLE IF EXISTS / DROP SCHEMA IF EXISTS.
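For example, a cleanup script along these lines would be safe to re-run after a partial load (the table and schema names here are placeholders, not necessarily the ones used in the PR):

```sql
DROP TABLE IF EXISTS customer;
DROP TABLE IF EXISTS lineitem;
DROP SCHEMA IF EXISTS load_sf100;
```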

7. Inconsistent SQL casing
drop schema (lowercase) in schema drop files vs DROP TABLE (uppercase) everywhere else.

8. Consider abort_on_error: true
For data-loading workflows, it is usually better to fail fast: if a CREATE TABLE or INSERT fails, continuing with missing tables wastes time. None of the stages set abort_on_error.
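As a sketch, each stage file could add the flag like this. Only abort_on_error is the suggested key; the other field is an assumption about the pbench stage format and may not match the actual schema.

```json
{
  "description": "Insert TPC-H tables at scale factor 100",
  "abort_on_error": true
}
```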

@ethanyzhang ethanyzhang force-pushed the main branch 2 times, most recently from 4b003e9 to 0d7d29c on March 25, 2026 06:31