Arrow: reject UINT32/UINT64 Parquet integer columns with clear error #16006
drexler-sky wants to merge 3 commits into apache:main
Conversation
The vectorized Arrow reader was silently reading unsigned Parquet integer columns (uint8, uint16, uint32, uint64) as signed, producing incorrect values for any value exceeding the signed maximum for that bit width. Since Iceberg has no unsigned integer type, throw UnsupportedOperationException when the Arrow reader encounters an unsigned integer logical type annotation, consistent with how the schema conversion layer already rejects uint64.

Fixes apache#14547
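A minimal sketch (illustration only, not Iceberg code) of the bug being described: the same physical bits produce a wrong value when an unsigned Parquet value above the signed maximum is reinterpreted as a signed Java int.

```java
// Illustration of reading unsigned bits as signed: a uint32 value of
// 3_000_000_000 exceeds Integer.MAX_VALUE, so the raw 32-bit pattern
// decodes to a negative number when treated as a signed int.
public class UnsignedMisread {
    public static void main(String[] args) {
        int signedView = (int) 3_000_000_000L;       // silently wrong: -1294967296
        long recovered = signedView & 0xFFFFFFFFL;   // widening mask restores 3000000000
        System.out.println(signedView + " " + recovered);
    }
}
```

This is why the failure is silent: no exception is thrown, the column simply contains wrong values.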
How do we behave when we use non-vectorized readers in Spark? Are we able to read them?
This is a good point. It looks like the non-vectorized readers are more lenient: they allow the conversion as long as it is not lossy, e.g. allow reading a
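A small sketch (assumptions mine, not Iceberg code) of the lossless-vs-lossy distinction behind that leniency: uint8 and uint16 values always fit in a signed 32-bit int, while uint32 needs a signed 64-bit long and uint64 fits no signed Java primitive.

```java
// The maximum value of each unsigned width, and the narrowest signed
// Java type that can hold it without loss:
public class LosslessWidening {
    public static void main(String[] args) {
        int fromUint8 = 0xFF;           // 255: fits a signed int
        int fromUint16 = 0xFFFF;        // 65535: fits a signed int
        long fromUint32 = 0xFFFFFFFFL;  // 4294967295: needs a signed long
        // uint64 max (18446744073709551615) exceeds Long.MAX_VALUE,
        // so no lossless widening exists in Java primitives.
        System.out.println(fromUint8 + " " + fromUint16 + " " + fromUint32);
    }
}
```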
```java
public void testUnsignedIntegerColumnThrowsException() throws Exception {
  tables = new HadoopTables();
  ...
  for (int[] spec : new int[][] {{8, 32}, {16, 32}, {32, 32}, {64, 64}}) {
```
Use a @ParameterizedTest instead of this loop such that we get better error messages if there are test failures.
Changed to @ParameterizedTest
```java
    .id(1)
    .named("col"));
...
File testFile = new File(tempDir, "unsigned-int" + unsignedBitWidth + ".parquet");
```
Minor: it's generally a good idea to append a System.nanoTime() to temp file names created in tests to guarantee uniqueness.
added System.nanoTime()
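A sketch of the naming suggestion, under my own assumptions about the surrounding variables (`tempDir` and `unsignedBitWidth` stand in for the test's locals):

```java
import java.io.File;

// Appending System.nanoTime() makes temp-file names unique even when
// several parameterized test cases run in quick succession.
public class UniqueTempFile {
    public static void main(String[] args) {
        File tempDir = new File(System.getProperty("java.io.tmpdir"));
        int unsignedBitWidth = 32;
        File testFile =
            new File(tempDir,
                "unsigned-int" + unsignedBitWidth + "-" + System.nanoTime() + ".parquet");
        System.out.println(testFile.getName());
    }
}
```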
Thanks @pvary @anoopj. Agreed, the blanket rejection was too aggressive. I've updated the PR to mirror the non-vectorized readers' behavior.
Tests updated accordingly: dropped uint8/uint16 from the rejection matrix, and added a positive test that round-trips uint8=250 / uint16=50000.
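Those two values are well chosen for a positive test, and a quick sketch (my illustration, not the test itself) shows why: each exceeds the signed maximum of its own bit width (127 for int8, 32767 for int16), so a sign-extension bug would surface, yet both widen losslessly into a 32-bit int.

```java
// 250 stored as a byte reads back as -6 under naive sign extension;
// 50000 stored as a short reads back as -15536. Masking restores both.
public class PositiveTestValues {
    public static void main(String[] args) {
        byte rawUint8 = (byte) 250;         // as a signed byte: -6
        short rawUint16 = (short) 50000;    // as a signed short: -15536
        int widened8 = rawUint8 & 0xFF;     // 250
        int widened16 = rawUint16 & 0xFFFF; // 50000
        System.out.println(widened8 + " " + widened16);
    }
}
```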
CI failed, but the failure is unrelated to the changes in this PR. I submitted PR #16026 to fix the CI failure.