API, Spark: Add direct UUID string-to-ByteBuffer conversion#16004
Open
abhishek593 wants to merge 1 commit into apache:main from
Conversation
Eliminate intermediate String and UUID object allocations in the Spark UUID write path by parsing hex digits directly from UTF8String raw bytes into a ByteBuffer. Uses getBytes() instead of toString() to avoid encoding costs, matching the existing StringWriter pattern.
abhishek593 force-pushed from 1759950 to 8e62288
Contributor
steveloughran
left a comment
LGTM. I was trying to think of any extra tests, including comparing with the old version, but I think you've got that all covered.
 * @return the reuse buffer (or a new one) containing the 16-byte UUID
 */
public static ByteBuffer convertToByteBuffer(byte[] uuidBytes, ByteBuffer reuse) {
  Preconditions.checkArgument(
Contributor
do you think a null check is worth it here? or just wasteful overkill?
Author
@steveloughran I think it's overkill, because null here would immediately NPE on uuidBytes.length, which is a clear signal. The existing methods in this class also follow the pattern of not adding null checks, so adding one here would be inconsistent.
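The fail-fast behavior described above can be illustrated with a small sketch. The class and method names here are hypothetical stand-ins mirroring the signature under review, not the actual Iceberg code:

```java
import java.nio.ByteBuffer;

public class NullCheckDemo {
  // Hypothetical stand-in for the method under review; the body is elided.
  static ByteBuffer convertToByteBuffer(byte[] uuidBytes, ByteBuffer reuse) {
    // No explicit null check: a null argument fails on the very next line,
    // because uuidBytes.length dereferences the array reference.
    if (uuidBytes.length != 36) {
      throw new IllegalArgumentException("Expected 36 bytes, got " + uuidBytes.length);
    }
    return ByteBuffer.allocate(16); // parsing elided for the demo
  }

  public static void main(String[] args) {
    try {
      convertToByteBuffer(null, null);
    } catch (NullPointerException e) {
      // The NPE surfaces immediately at uuidBytes.length, which is already
      // a clear signal of the misuse, so an explicit check adds little.
      System.out.println("NPE as expected");
    }
  }
}
```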
Fixes #16003

Resolves the TODO in SparkValueWriters.UUIDWriter requesting a direct conversion from string to byte buffer.

The Spark UUID write path previously went through:

UTF8String -> String -> UUID (UUID.fromString()) -> ByteBuffer

This created two unnecessary intermediate objects per row: a String (with UTF-8 decoding cost) and a UUID (with heavy fromString() parsing). Both are discarded immediately after the ByteBuffer is filled.

This PR adds UUIDUtil.convertToByteBuffer(byte[], ByteBuffer), which parses hex digits from raw ASCII bytes directly into two longs, written via putLong. Callers use UTF8String.getBytes() instead of toString(), following the same pattern as StringWriter in the same file, which notes that getBytes() avoids encoding costs and may return the backing array directly.

Updated all 12 Spark UUID writer call sites (Avro, Parquet, ORC across v3.4/v3.5/v4.0/v4.1).
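The direct parsing approach the PR describes could be sketched as follows. This is a hypothetical reimplementation, not the actual Iceberg UUIDUtil code: it assumes the input is the canonical 36-byte ASCII form ("xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"), accumulates the 32 hex nibbles into two longs, and writes them with putLong:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class UuidHexParse {

  // Hypothetical sketch: parse the 36-byte canonical UUID string directly
  // into a 16-byte ByteBuffer, with no intermediate String or UUID object.
  static ByteBuffer convertToByteBuffer(byte[] uuidBytes, ByteBuffer reuse) {
    if (uuidBytes.length != 36) {
      throw new IllegalArgumentException("Expected 36-byte UUID string");
    }
    ByteBuffer buffer = (reuse != null) ? reuse : ByteBuffer.allocate(16);
    buffer.clear();
    long msb = 0L;
    long lsb = 0L;
    int nibbles = 0;
    for (int i = 0; i < 36; i++) {
      byte b = uuidBytes[i];
      if (b == '-') {
        continue; // skip the four dash separators
      }
      int digit = hexValue(b);
      if (nibbles < 16) {
        msb = (msb << 4) | digit; // first 16 nibbles -> most significant bits
      } else {
        lsb = (lsb << 4) | digit; // last 16 nibbles -> least significant bits
      }
      nibbles++;
    }
    buffer.putLong(msb).putLong(lsb);
    buffer.flip();
    return buffer;
  }

  static int hexValue(byte b) {
    if (b >= '0' && b <= '9') return b - '0';
    if (b >= 'a' && b <= 'f') return b - 'a' + 10;
    if (b >= 'A' && b <= 'F') return b - 'A' + 10;
    throw new IllegalArgumentException("Invalid hex digit: " + (char) b);
  }

  public static void main(String[] args) {
    UUID uuid = UUID.randomUUID();
    byte[] ascii = uuid.toString().getBytes(StandardCharsets.US_ASCII);
    ByteBuffer buf = convertToByteBuffer(ascii, null);
    // Cross-check against the allocation-heavy UUID.fromString() path
    // that the PR removes.
    System.out.println(buf.getLong() == uuid.getMostSignificantBits());
    System.out.println(buf.getLong() == uuid.getLeastSignificantBits());
  }
}
```

Cross-checking the two longs against UUID.getMostSignificantBits()/getLeastSignificantBits() is the same kind of old-vs-new comparison the reviewer mentions.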