API, Spark: Add direct UUID string-to-ByteBuffer conversion#16004
Open
abhishek593 wants to merge 1 commit into apache:main from
Conversation
Eliminate intermediate String and UUID object allocations in the Spark UUID write path by parsing hex digits directly from UTF8String raw bytes into a ByteBuffer. Uses getBytes() instead of toString() to avoid encoding costs, matching the existing StringWriter pattern.
abhishek593 force-pushed from 1759950 to 8e62288
Contributor
steveloughran
left a comment
LGTM. I was trying to think of any extra tests, including comparing with the old version, but I think you've got that all covered.
 * @return the reuse buffer (or a new one) containing the 16-byte UUID
 */
public static ByteBuffer convertToByteBuffer(byte[] uuidBytes, ByteBuffer reuse) {
  Preconditions.checkArgument(
Contributor
do you think a null check is worth it here? or just wasteful overkill?
Author
@steveloughran I think it's overkill, because null here would immediately NPE on uuidBytes.length, which is a clear signal. The existing methods in this class also follow the pattern of not adding null checks, so adding one here would be inconsistent.
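The fail-fast behavior described above can be illustrated with a small sketch. The class and method names here are hypothetical stand-ins mirroring the signature under review, not the actual Iceberg code:

```java
import java.nio.ByteBuffer;

public class NullCheckDemo {
  // Hypothetical stand-in for the method under review; the body is elided.
  static ByteBuffer convertToByteBuffer(byte[] uuidBytes, ByteBuffer reuse) {
    // No explicit null check: a null argument fails on the very next line,
    // because uuidBytes.length dereferences the array reference.
    if (uuidBytes.length != 36) {
      throw new IllegalArgumentException("Expected 36 bytes, got " + uuidBytes.length);
    }
    return ByteBuffer.allocate(16); // parsing elided for the demo
  }

  public static void main(String[] args) {
    try {
      convertToByteBuffer(null, null);
    } catch (NullPointerException e) {
      // The NPE surfaces immediately at uuidBytes.length, which is already
      // a clear signal of the misuse, so an explicit check adds little.
      System.out.println("NPE as expected");
    }
  }
}
```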
Fixes #16003

Resolves the TODO in SparkValueWriters.UUIDWriter requesting a direct conversion from string to byte buffer.

The Spark UUID write path previously went through:

UTF8String -> String -> UUID (UUID.fromString()) -> ByteBuffer

This created two unnecessary intermediate objects per row: a String (with UTF-8 decoding cost) and a UUID (with heavy fromString() parsing). Both are discarded immediately after the ByteBuffer is filled.

This PR adds UUIDUtil.convertToByteBuffer(byte[], ByteBuffer), which parses hex digits from raw ASCII bytes directly into two longs, written via putLong. Callers use UTF8String.getBytes() instead of toString(), following the same pattern as StringWriter in the same file, which notes that getBytes() avoids encoding costs and may return the backing array directly.

Updated all 12 Spark UUID writer call sites (Avro, Parquet, ORC across v3.4/v3.5/v4.0/v4.1).
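The direct parsing approach the PR describes could be sketched as follows. This is a hypothetical reimplementation, not the actual Iceberg UUIDUtil code: it assumes the input is the canonical 36-byte ASCII form ("xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"), accumulates the 32 hex nibbles into two longs, and writes them with putLong:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class UuidHexParse {

  // Hypothetical sketch: parse the 36-byte canonical UUID string directly
  // into a 16-byte ByteBuffer, with no intermediate String or UUID object.
  static ByteBuffer convertToByteBuffer(byte[] uuidBytes, ByteBuffer reuse) {
    if (uuidBytes.length != 36) {
      throw new IllegalArgumentException("Expected 36-byte UUID string");
    }
    ByteBuffer buffer = (reuse != null) ? reuse : ByteBuffer.allocate(16);
    buffer.clear();
    long msb = 0L;
    long lsb = 0L;
    int nibbles = 0;
    for (int i = 0; i < 36; i++) {
      byte b = uuidBytes[i];
      if (b == '-') {
        continue; // skip the four dash separators
      }
      int digit = hexValue(b);
      if (nibbles < 16) {
        msb = (msb << 4) | digit; // first 16 nibbles -> most significant bits
      } else {
        lsb = (lsb << 4) | digit; // last 16 nibbles -> least significant bits
      }
      nibbles++;
    }
    buffer.putLong(msb).putLong(lsb);
    buffer.flip();
    return buffer;
  }

  static int hexValue(byte b) {
    if (b >= '0' && b <= '9') return b - '0';
    if (b >= 'a' && b <= 'f') return b - 'a' + 10;
    if (b >= 'A' && b <= 'F') return b - 'A' + 10;
    throw new IllegalArgumentException("Invalid hex digit: " + (char) b);
  }

  public static void main(String[] args) {
    UUID uuid = UUID.randomUUID();
    byte[] ascii = uuid.toString().getBytes(StandardCharsets.US_ASCII);
    ByteBuffer buf = convertToByteBuffer(ascii, null);
    // Cross-check against the allocation-heavy UUID.fromString() path
    // that the PR removes.
    System.out.println(buf.getLong() == uuid.getMostSignificantBits());
    System.out.println(buf.getLong() == uuid.getLeastSignificantBits());
  }
}
```

Cross-checking the two longs against UUID.getMostSignificantBits()/getLeastSignificantBits() is the same kind of old-vs-new comparison the reviewer mentions.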