
Add Hive and Iceberg Load benchmark #55

Open

PingLiuPing wants to merge 15 commits into prestodb:main from PingLiuPing:lpingbj_load_tpch

Conversation


@PingLiuPing PingLiuPing commented Apr 24, 2025

The loading (insert) benchmark is missing in pbench; this PR adds the initial files for a loading benchmark. It includes test files for the Hive and Iceberg connectors, both native and Java. The data is loaded from the TPCH connector on the fly.

Future enhancements are required to make the benchmark run in stages, such as a prepare stage, a main stage, and a cleanup stage.

@PingLiuPing PingLiuPing marked this pull request as ready for review April 24, 2025 16:55
@PingLiuPing PingLiuPing changed the title from "Add TPCH Load benchmark" to "Add Hive and Iceberg Load benchmark" Apr 24, 2025
@PingLiuPing PingLiuPing self-assigned this Aug 4, 2025
@PingLiuPing
Author

@wanglinsong @ethanyzhang Sorry for the late response; this PR slipped my mind. I have addressed your comments. Can you please take another look? Thanks.

Member

@wanglinsong wanglinsong left a comment


I believe the DDL to create tables is the same across all scale factors. Can you parameterize or remove the hardcoded schema prefix (tpch.sf100.)?

FROM tpch.sf100.customer;

@PingLiuPing
Author

I believe the DDL to create tables is the same across all scale factors. Can you parameterize or remove the hardcoded schema prefix (tpch.sf100.)?

FROM tpch.sf100.customer;

Thanks. I think supporting that in the current framework would require a lot of work.

@wanglinsong
Member

wanglinsong commented Aug 21, 2025

I believe the DDL to create tables is the same across all scale factors. Can you parameterize or remove the hardcoded schema prefix (tpch.sf100.)?
FROM tpch.sf100.customer;

Thanks. I think supporting that in the current framework would require a lot of work.

Oh, this is an embedded connector. This is not an issue at all. Please ignore.

@PingLiuPing
Author

Hi @wanglinsong, thanks for your comments. Do you think this PR is ready to be merged? Is there anything else you want me to change?

@PingLiuPing
Author

@wanglinsong Can you please have another look? Thanks.

@PingLiuPing
Author

@wanglinsong gentle ping.

@xpengahana
Contributor

Have we tested this PR for pbench? @PingLiuPing

@PingLiuPing
Author

Have we tested this PR for pbench? @PingLiuPing

Thanks. I didn't retest this after fixing the review comments; before that, I had tested it in pbench.
So commit aee3644 might not be correct.
I will verify this.

@ethanyzhang ethanyzhang force-pushed the main branch 2 times, most recently from 6f32a1a to b8de6a7 on February 13, 2026 04:02
@ethanyzhang
Collaborator

ethanyzhang commented Feb 23, 2026

Thanks for adding the TPC-H loading benchmarks! I reviewed the files and have some findings before we integrate.

Issues to fix

1. prepare_sf1000.json description typo
The description says "scaling factor 100"; it should say "scaling factor 1000".

2. expected_row_counts mismatch in insert stages
All 4 insert stages (insert_sf100_j.json, insert_sf100_n.json, insert_sf1000_j.json, insert_sf1000_n.json) have 10 values but only 8 query files (i.e. 8 queries). The extra 2 values are silently ignored; there should be exactly 8.

3. expected_row_counts off-by-one in prepare stages
schema/create_sf100.sql and schema/create_sf1000.sql each contain 2 SQL statements (CREATE SCHEMA + USE), so the total number of queries per prepare stage is 2 + 8 = 10. But expected_row_counts only has 9 values, so the last CREATE TABLE won't be validated. It should have 10.

4. Column type mismatches vs TPC-H spec
These columns should be INTEGER, not BIGINT:

  • lineitem.linenumber
  • orders.shippriority
  • part.size
  • partsupp.availqty

5. DECIMAL(12,2) should be DECIMAL(15,2)
The TPC-H spec defines decimal columns with precision 15. Precision 12 could cause truncation at larger scale factors.

@PingLiuPing @xpengahana Did we generate the right schema?
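Items 4 and 5 combined might look like the sketch below. This is a hypothetical excerpt, not the PR's actual DDL: the column names follow the TPC-H spec's prefixed convention and may be named differently in the PR.

```sql
-- Hypothetical excerpt showing only the affected columns.
CREATE TABLE lineitem (
    l_linenumber    INTEGER,        -- was BIGINT
    l_quantity      DECIMAL(15, 2), -- was DECIMAL(12, 2)
    l_extendedprice DECIMAL(15, 2)  -- was DECIMAL(12, 2)
    -- remaining columns as in the PR
);
```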

Suggestions (non-blocking)

6. No IF EXISTS on DROP statements
DROP TABLE and DROP SCHEMA will fail if tables/schema don't exist (e.g. after a partial run). Consider DROP TABLE IF EXISTS / DROP SCHEMA IF EXISTS.
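For example, a cleanup script along these lines would be safe to re-run after a partial load (the table and schema names here are placeholders, not necessarily the ones used in the PR):

```sql
DROP TABLE IF EXISTS customer;
DROP TABLE IF EXISTS lineitem;
DROP SCHEMA IF EXISTS load_sf100;
```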

7. Inconsistent SQL casing
drop schema (lowercase) in schema drop files vs DROP TABLE (uppercase) everywhere else.

8. Consider abort_on_error: true
For data-loading workflows, it is usually better to fail fast: if a CREATE TABLE or INSERT fails, continuing with missing tables wastes time. None of the stages set abort_on_error.
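As a sketch, each stage file could add the flag like this. Only abort_on_error is the suggested key; the other field is an assumption about the pbench stage format and may not match the actual schema.

```json
{
  "description": "Insert TPC-H tables at scale factor 100",
  "abort_on_error": true
}
```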

@ethanyzhang ethanyzhang force-pushed the main branch 2 times, most recently from 4b003e9 to 0d7d29c on March 25, 2026 06:31