
Arrow: reject UINT32/UINT64 Parquet integer columns with clear error#16006

Open
drexler-sky wants to merge 3 commits into apache:main from drexler-sky:issue_14547

Conversation

@drexler-sky
Contributor

The vectorized Arrow reader was silently reading unsigned Parquet integer columns (uint8, uint16, uint32, uint64) as signed, producing incorrect values for any value exceeding the signed maximum for that bit width.

Since Iceberg has no unsigned integer type, throw UnsupportedOperationException when the Arrow reader encounters an unsigned integer logical type annotation, consistent with how the schema conversion layer already rejects uint64.

Fixes #14547
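For illustration (standalone sketch, not code from this PR), here is the mechanics of the bug: the raw bit pattern of a large uint32 value, viewed as a signed 32-bit int, reads back negative, while a lossless widening to 64 bits recovers the unsigned value.

```java
// Standalone illustration of the sign-reinterpretation bug, not Iceberg code:
// the max uint32 value, reinterpreted as a signed int, reads back as -1.
public class UnsignedReinterpret {
  public static void main(String[] args) {
    long uint32Max = 4_294_967_295L;  // 0xFFFFFFFF, the max uint32 value
    int asSigned = (int) uint32Max;   // same 32 bits, signed interpretation
    System.out.println(asSigned);     // prints -1
    // A lossless widening to 64 bits recovers the unsigned value:
    System.out.println(Integer.toUnsignedLong(asSigned)); // prints 4294967295
  }
}
```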

@github-actions github-actions bot added the arrow label Apr 17, 2026
@pvary
Contributor

pvary commented Apr 17, 2026

How do we behave when we use non-vectorized readers in Spark? Are we able to read them?
How hard would it be to be more lenient and implement a conversion? What would be the performance cost?

@anoopj
Contributor

anoopj commented Apr 17, 2026

How do we behave when we use non-vectorized readers in Spark? Are we able to read them? How hard would it be to be more lenient and implement a conversion? What would be the performance cost?

This is a good point. It looks like the non-vectorized readers are more lenient: they allow the conversion as long as it's not lossy, e.g. reading a uint32 as an Iceberg long, a uint16 as an int, etc. This PR is blanket-rejecting them. Let's try to be consistent with the non-vectorized reader.
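The lossless-widening rule described here can be sketched as follows (class and method names are illustrative only, not the actual BaseParquetReaders API): widen when the unsigned range fits in a signed Iceberg type, and reject only uint64, which no Iceberg type can hold.

```java
// Hypothetical sketch of the lenient mapping the non-vectorized reader
// follows; names are illustrative, not the actual Iceberg API.
public class UnsignedMapping {
  static String icebergTypeFor(int bitWidth, boolean isSigned) {
    if (isSigned) {
      return bitWidth <= 32 ? "int" : "long"; // signed ints map directly
    }
    switch (bitWidth) {
      case 8:
      case 16:
        return "int";   // fits losslessly in a signed 32-bit int
      case 32:
        return "long";  // needs widening to a signed 64-bit long
      default:          // uint64 has no lossless Iceberg target
        throw new UnsupportedOperationException("Unsupported type: UINT64");
    }
  }
}
```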

public void testUnsignedIntegerColumnThrowsException() throws Exception {
  tables = new HadoopTables();

  for (int[] spec : new int[][] {{8, 32}, {16, 32}, {32, 32}, {64, 64}}) {
Contributor


Use a @ParameterizedTest instead of this loop such that we get better error messages if there are test failures.

Contributor Author


Changed to @ParameterizedTest

.id(1)
.named("col"));

File testFile = new File(tempDir, "unsigned-int" + unsignedBitWidth + ".parquet");
Contributor


Minor: Generally a good idea to append a System.nanoTime() to temp file names created in tests to guarantee uniqueness.
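The suggestion amounts to something like this (sketch; the name pattern is assumed from the hunk above):

```java
// Sketch of the reviewer's suggestion: suffix test temp file names with
// System.nanoTime() so repeated runs or parameterized cases never collide.
public class TempNames {
  static String tempParquetName(int unsignedBitWidth) {
    return "unsigned-int" + unsignedBitWidth + "-" + System.nanoTime() + ".parquet";
  }
}
```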

Contributor Author


added System.nanoTime()

@drexler-sky
Contributor Author

Thanks @pvary @anoopj — agreed, the blanket rejection was too aggressive. I've updated the PR to mirror BaseParquetReaders exactly:

  • uint8 / uint16 → IntegerType: accepted (lossless, same IntVector path — no perf cost)
  • uint32 → IntegerType: rejected with IllegalArgumentException("Cannot read UINT32 as an int value")
  • uint64 → LongType: rejected with IllegalArgumentException("Cannot read UINT64 as a long value")

Tests updated accordingly: dropped uint8/uint16 from the rejection matrix, and added a positive test that round-trips uint8=250 / uint16=50000.
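The updated behavior can be sketched as a single validation (illustrative only, not the actual Iceberg code; the real implementation throws IllegalArgumentException with these messages, returned here as strings for testability):

```java
// Illustrative sketch of the final check: returns null when the column is
// readable, otherwise the rejection message that the real implementation
// raises as an IllegalArgumentException.
public class UnsignedCheck {
  static String rejectionFor(int bitWidth, boolean isSigned) {
    if (isSigned || bitWidth <= 16) {
      return null; // signed ints plus uint8/uint16: normal IntVector path
    }
    return bitWidth == 32
        ? "Cannot read UINT32 as an int value"
        : "Cannot read UINT64 as a long value";
  }
}
```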

@drexler-sky drexler-sky changed the title Arrow reader: reject unsigned Parquet integer columns with clear error Arrow: reject UINT32/UINT64 Parquet integer columns with clear error Apr 18, 2026
@drexler-sky
Contributor Author

CI failed with

TestDynamicIcebergSink > testNoShuffleTopology() FAILED
    org.opentest4j.AssertionFailedError: 
    Expecting value to be true but was false
        at app//org.apache.iceberg.flink.sink.dynamic.TestDynamicIcebergSink.testNoShuffleTopology(TestDynamicIcebergSink.java:330)

The failure is unrelated to the changes in this PR.

I submitted PR #16026 to fix the CI failure.


Successfully merging this pull request may close these issues.

Arrow reader should reject unsigned Parquet integers

4 participants