
API, Spark: Add direct UUID string-to-ByteBuffer conversion #16004

Open

abhishek593 wants to merge 1 commit into apache:main from abhishek593:byte-buffer

Conversation

@abhishek593

Fixes #16003

Resolves the TODO in SparkValueWriters.UUIDWriter requesting a direct conversion from string to byte buffer.

The Spark UUID write path previously went through:
UTF8String -> String -> UUID (UUID.fromString()) -> ByteBuffer

This created two unnecessary intermediate objects per row: a String (with UTF-8 decoding cost) and a UUID (with heavy fromString() parsing). Both are discarded immediately after the ByteBuffer is filled.
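The chain above can be approximated with JDK types alone (a hedged sketch; the real Spark code starts from a UTF8String rather than a String, and the exact Iceberg helper names may differ):

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class OldPathSketch {
  // Approximation of the previous per-row path: parse the canonical UUID
  // string into a java.util.UUID (an intermediate object), then copy its
  // two longs into a 16-byte ByteBuffer.
  public static ByteBuffer oldPath(String uuidString) {
    UUID uuid = UUID.fromString(uuidString); // heavy parse, discarded after use
    ByteBuffer buf = ByteBuffer.allocate(16);
    buf.putLong(uuid.getMostSignificantBits());
    buf.putLong(uuid.getLeastSignificantBits());
    buf.flip();
    return buf;
  }

  public static void main(String[] args) {
    ByteBuffer b = oldPath("123e4567-e89b-12d3-a456-426614174000");
    System.out.println(b.remaining()); // 16 bytes ready to read
  }
}
```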

This PR adds UUIDUtil.convertToByteBuffer(byte[], ByteBuffer) which parses hex digits from raw ASCII bytes directly into two longs, written via putLong. Callers use UTF8String.getBytes() instead of toString(), following the same pattern as StringWriter in the same file which notes that getBytes() avoids encoding costs and may return the backing array directly.
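A minimal sketch of what such a direct parse might look like (hypothetical class and helper names; the actual UUIDUtil implementation may differ in validation, error handling, and buffer management):

```java
import java.nio.ByteBuffer;

public class UuidHexSketch {
  // Parse the 36 ASCII bytes of a canonical UUID string (8-4-4-4-12 with
  // dashes) directly into two longs, skipping the dashes, and write them
  // into the reuse buffer (or a freshly allocated one).
  public static ByteBuffer convertToByteBuffer(byte[] uuidBytes, ByteBuffer reuse) {
    ByteBuffer buffer = reuse != null ? reuse : ByteBuffer.allocate(16);
    buffer.clear();
    long msb = 0L;
    long lsb = 0L;
    int digits = 0;
    for (int pos = 0; pos < 36; pos++) {
      byte b = uuidBytes[pos];
      if (b == '-') {
        continue; // dashes sit at positions 8, 13, 18, 23
      }
      int nibble = hexValue(b);
      if (digits < 16) {
        msb = (msb << 4) | nibble;
      } else {
        lsb = (lsb << 4) | nibble;
      }
      digits++;
    }
    buffer.putLong(msb).putLong(lsb);
    buffer.flip();
    return buffer;
  }

  private static int hexValue(byte b) {
    if (b >= '0' && b <= '9') {
      return b - '0';
    } else if (b >= 'a' && b <= 'f') {
      return b - 'a' + 10;
    } else if (b >= 'A' && b <= 'F') {
      return b - 'A' + 10;
    }
    throw new IllegalArgumentException("Invalid hex digit: " + (char) b);
  }

  public static void main(String[] args) {
    byte[] ascii = "123e4567-e89b-12d3-a456-426614174000".getBytes();
    ByteBuffer buf = convertToByteBuffer(ascii, null);
    System.out.println(Long.toHexString(buf.getLong())); // 123e4567e89b12d3
    System.out.println(Long.toHexString(buf.getLong())); // a456426614174000
  }
}
```

No String or UUID object is created: the only work per row is one pass over 36 bytes and two putLong calls.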

Updated all 12 Spark UUID writer call sites (Avro, Parquet, and ORC across v3.4/v3.5/v4.0/v4.1).

  Eliminate intermediate String and UUID object allocations in the Spark
  UUID write path by parsing hex digits directly from UTF8String raw bytes
  into a ByteBuffer. Uses getBytes() instead of toString() to avoid
  encoding costs, matching the existing StringWriter pattern.
@steveloughran (Contributor) left a comment


LGTM. I was trying to think of any extra tests, including comparing with the old version, but I think you've got that all covered.

* @return the reuse buffer (or a new one) containing the 16-byte UUID
*/
public static ByteBuffer convertToByteBuffer(byte[] uuidBytes, ByteBuffer reuse) {
Preconditions.checkArgument(
@steveloughran (Contributor)

do you think a null check is worth it here? or just wasteful overkill?

@abhishek593 (Author)

@steveloughran I think it's overkill, because null here would immediately throw an NPE on uuidBytes.length, which is a clear signal. The existing methods in this class also follow the pattern of not adding null checks, so adding one here would be inconsistent.

