Arrow: reject UINT32/UINT64 Parquet integer columns with clear error #16006
drexler-sky wants to merge 3 commits into apache:main
Conversation
The vectorized Arrow reader was silently reading unsigned Parquet integer columns (uint8, uint16, uint32, uint64) as signed, producing incorrect values for any value exceeding the signed maximum for that bit width. Since Iceberg has no unsigned integer type, throw UnsupportedOperationException when the Arrow reader encounters an unsigned integer logical type annotation, consistent with how the schema conversion layer already rejects uint64.

Fixes apache#14547
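A minimal sketch (illustration only, not Iceberg code) of the bug being described: the same physical bits produce a wrong value when an unsigned Parquet value above the signed maximum is reinterpreted as a signed Java int.

```java
// Illustration of reading unsigned bits as signed: a uint32 value of
// 3_000_000_000 exceeds Integer.MAX_VALUE, so the raw 32-bit pattern
// decodes to a negative number when treated as a signed int.
public class UnsignedMisread {
    public static void main(String[] args) {
        int signedView = (int) 3_000_000_000L;       // silently wrong: -1294967296
        long recovered = signedView & 0xFFFFFFFFL;   // widening mask restores 3000000000
        System.out.println(signedView + " " + recovered);
    }
}
```

This is why the failure is silent: no exception is thrown, the column simply contains wrong values.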
How do we behave when we use non-vectorized readers in Spark? Are we able to read them?
This is a good point. It looks like the non-vectorized readers are more lenient: they allow the conversion as long as it is not lossy, e.g. allow reading a
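A small sketch (assumptions mine, not Iceberg code) of the lossless-vs-lossy distinction behind that leniency: uint8 and uint16 values always fit in a signed 32-bit int, while uint32 needs a signed 64-bit long and uint64 fits no signed Java primitive.

```java
// The maximum value of each unsigned width, and the narrowest signed
// Java type that can hold it without loss:
public class LosslessWidening {
    public static void main(String[] args) {
        int fromUint8 = 0xFF;           // 255: fits a signed int
        int fromUint16 = 0xFFFF;        // 65535: fits a signed int
        long fromUint32 = 0xFFFFFFFFL;  // 4294967295: needs a signed long
        // uint64 max (18446744073709551615) exceeds Long.MAX_VALUE,
        // so no lossless widening exists in Java primitives.
        System.out.println(fromUint8 + " " + fromUint16 + " " + fromUint32);
    }
}
```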
```java
public void testUnsignedIntegerColumnThrowsException() throws Exception {
  tables = new HadoopTables();
  ...
  for (int[] spec : new int[][] {{8, 32}, {16, 32}, {32, 32}, {64, 64}}) {
```
Use a @ParameterizedTest instead of this loop such that we get better error messages if there are test failures.
Changed to @ParameterizedTest
```java
    .id(1)
    .named("col"));
...
File testFile = new File(tempDir, "unsigned-int" + unsignedBitWidth + ".parquet");
```
Minor: it's generally a good idea to append a System.nanoTime() to temp file names created in tests to guarantee uniqueness.
added System.nanoTime()
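A sketch of the naming suggestion, under my own assumptions about the surrounding variables (`tempDir` and `unsignedBitWidth` stand in for the test's locals):

```java
import java.io.File;

// Appending System.nanoTime() makes temp-file names unique even when
// several parameterized test cases run in quick succession.
public class UniqueTempFile {
    public static void main(String[] args) {
        File tempDir = new File(System.getProperty("java.io.tmpdir"));
        int unsignedBitWidth = 32;
        File testFile =
            new File(tempDir,
                "unsigned-int" + unsignedBitWidth + "-" + System.nanoTime() + ".parquet");
        System.out.println(testFile.getName());
    }
}
```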
Thanks @pvary @anoopj. Agreed, the blanket rejection was too aggressive. I've updated the PR to mirror the non-vectorized readers' behavior.
Tests updated accordingly: dropped uint8/uint16 from the rejection matrix, and added a positive test that round-trips uint8=250 / uint16=50000.
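Those two values are well chosen for a positive test, and a quick sketch (my illustration, not the test itself) shows why: each exceeds the signed maximum of its own bit width (127 for int8, 32767 for int16), so a sign-extension bug would surface, yet both widen losslessly into a 32-bit int.

```java
// 250 stored as a byte reads back as -6 under naive sign extension;
// 50000 stored as a short reads back as -15536. Masking restores both.
public class PositiveTestValues {
    public static void main(String[] args) {
        byte rawUint8 = (byte) 250;         // as a signed byte: -6
        short rawUint16 = (short) 50000;    // as a signed short: -15536
        int widened8 = rawUint8 & 0xFF;     // 250
        int widened16 = rawUint16 & 0xFFFF; // 50000
        System.out.println(widened8 + " " + widened16);
    }
}
```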
CI failed, but the failure is unrelated to the changes in this PR. I submitted PR #16026 to fix the CI failure.