fix(apple): Handle partial UTF-8 sequences in streaming LLM output #16219
base: main
Conversation
ByteLevel BPE tokenizers (like GPT-2/SmolLM) can split multi-byte UTF-8 characters across token boundaries. For example, the Chinese character "清" (UTF-8: E6 B8 85) might be split into two tokens: "æ¸" (E6 B8) and "ħ" (85).

This commit adds a UTF8StreamingBuffer class that:
- Accumulates bytes until complete UTF-8 sequences are formed
- Rejects overlong encodings (0xC0, 0xC1) and out-of-range bytes (0xF5+)
- Flushes remaining bytes at generation end

This ensures correct display of CJK characters and other multi-byte Unicode content during streaming text generation on Apple platforms.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
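For readers unfamiliar with the technique, here is a minimal C++ sketch of the buffering idea only; it is not the PR's actual implementation, and it omits the invalid-byte rejection the PR also performs. Bytes are appended to a pending buffer, and only the prefix ending on a complete UTF-8 sequence is released; a partial trailing sequence is held back until its remaining bytes arrive or the stream ends.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Minimal sketch, not the PR's actual implementation: a streaming UTF-8
// buffer that releases only complete sequences and holds back a partial
// trailing sequence until the remaining bytes arrive (or flush() is called).
// Invalid-byte rejection (0xC0/0xC1, 0xF5+), which the PR also performs,
// is intentionally omitted here.
class Utf8StreamBuffer {
 public:
  // Appends new bytes and returns the longest prefix that ends on a
  // complete UTF-8 sequence; the partial tail stays buffered.
  std::string feed(const std::string& bytes) {
    pending_ += bytes;
    const size_t cut = completePrefixLength(pending_);
    std::string out = pending_.substr(0, cut);
    pending_.erase(0, cut);
    return out;
  }

  // At end of generation, emit whatever is still buffered.
  std::string flush() {
    std::string out;
    out.swap(pending_);
    return out;
  }

 private:
  static size_t completePrefixLength(const std::string& s) {
    if (s.empty()) return 0;
    size_t i = s.size();
    // Look for the last lead byte within the final (at most) 4 bytes.
    for (size_t back = 0; back < 4 && i > 0; ++back) {
      --i;
      const uint8_t b = static_cast<uint8_t>(s[i]);
      if ((b & 0xC0) == 0x80) continue;        // continuation byte, keep scanning
      size_t need = 1;                         // ASCII or unrecognized lead
      if ((b & 0xE0) == 0xC0) need = 2;        // 110xxxxx -> 2-byte sequence
      else if ((b & 0xF0) == 0xE0) need = 3;   // 1110xxxx -> 3-byte sequence
      else if ((b & 0xF8) == 0xF0) need = 4;   // 11110xxx -> 4-byte sequence
      // Hold back the trailing sequence if it is still incomplete.
      return (s.size() - i < need) ? i : s.size();
    }
    return s.size();  // no lead byte in the tail; nothing to hold back
  }

  std::string pending_;
};
```

With such a buffer, feeding the bytes E6 B8 from the first token returns an empty string (they stay buffered), and feeding 85 from the next token returns the complete three-byte sequence for "清".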
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16219
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit 6779fd2 with merge base 3a262ef. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hi @ShunL12324! Thank you for your pull request and welcome to our community.

Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA Signed. If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
This PR needs a
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
Pull request overview
This PR fixes garbled output when streaming text with CJK characters and other multi-byte Unicode content on Apple platforms. The issue occurs because ByteLevel BPE tokenizers can split multi-byte UTF-8 characters across token boundaries during streaming generation.
Key Changes:
- Adds a UTF8StreamingBuffer class to accumulate bytes until complete UTF-8 sequences are formed
- Implements validation and handling of partial UTF-8 sequences, invalid bytes, and continuation bytes
- Integrates the buffer into the streaming callback pipeline with proper flushing at generation end
    if (valid) {
      result.append(buffer_, i, seqLen);
      i += seqLen - 1; // -1 because loop will i++
Copilot AI · Dec 12, 2025:
The loop increments i by seqLen - 1 to account for the loop's i++, but the loop counter is incremented after the adjustment. This results in advancing by seqLen total, which is correct. However, when the sequence is invalid (skipped via continue), the loop only increments by 1. This creates an inconsistency where a valid multi-byte sequence advances correctly, but an invalid start byte only advances by 1, potentially causing the loop to re-examine bytes that are part of a previously skipped sequence. Consider restructuring the loop to avoid manual adjustment and use explicit indexing instead.
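A sketch of the kind of restructuring suggested here, written as a self-contained function with a hypothetical utf8LeadLength helper rather than the PR's actual members, so that the valid and invalid branches advance the index explicitly:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Sketch of one possible restructuring: a single explicit index that is
// advanced by the full sequence length for valid sequences and by one byte
// otherwise, with no "seqLen - 1" compensation for a for-loop increment.
// utf8LeadLength() is a hypothetical helper (0 means invalid lead byte);
// continuation-byte validation is omitted for brevity.
static size_t utf8LeadLength(uint8_t b) {
  if (b < 0x80) return 1;                // ASCII
  if (b >= 0xC2 && b <= 0xDF) return 2;  // rejects overlong leads 0xC0/0xC1
  if (b >= 0xE0 && b <= 0xEF) return 3;
  if (b >= 0xF0 && b <= 0xF4) return 4;  // rejects 0xF5 and above
  return 0;                              // continuation byte or invalid lead
}

std::string keepCompleteSequences(const std::string& buf) {
  std::string result;
  size_t i = 0;
  while (i < buf.size()) {
    const size_t seqLen = utf8LeadLength(static_cast<uint8_t>(buf[i]));
    if (seqLen == 0 || i + seqLen > buf.size()) {
      ++i;  // invalid or truncated sequence: skip exactly one byte and resync
      continue;
    }
    result.append(buf, i, seqLen);
    i += seqLen;  // advance past the entire valid sequence
  }
  return result;
}
```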
    // Flush any remaining bytes in the buffer
    if (callback) {
      std::string remaining = utf8Buffer->flush();
      if (!remaining.empty()) {
        NSString *remainingString = [[NSString alloc] initWithBytes:remaining.data()
                                                              length:remaining.size()
                                                            encoding:NSUTF8StringEncoding];
        if (remainingString) {
          callback(remainingString);
        }
      }
    }
Copilot AI · Dec 12, 2025:
The flush logic is executed regardless of whether generate() succeeded or failed. If status != Error::Ok, the buffer may contain incomplete data that should not be emitted. The flush should only execute when generation completes successfully. Move this block inside a check for status == Error::Ok before the error handling code.
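A minimal sketch of that suggestion, using hypothetical C++ stand-ins (Error, Utf8Buffer, finishGeneration) for the Objective-C++ types in the snippet above; the actual code uses NSString and the runner's callback type:

```cpp
#include <functional>
#include <string>

// Hypothetical stand-ins for the real types in ExecuTorchLLMTextRunner.mm;
// this only illustrates gating the flush on a successful generate() call.
enum class Error { Ok, Internal };

struct Utf8Buffer {
  std::string pending;
  std::string flush() {
    std::string out;
    out.swap(pending);
    return out;
  }
};

void finishGeneration(Error status,
                      Utf8Buffer& utf8Buffer,
                      const std::function<void(const std::string&)>& callback) {
  // Only emit remaining bytes after a successful run; on failure the buffer
  // may hold an incomplete sequence that should not be shown to the user.
  if (status == Error::Ok && callback) {
    const std::string remaining = utf8Buffer.flush();
    if (!remaining.empty()) {
      callback(remaining);
    }
  }
  // ...error handling for status != Error::Ok would follow here...
}
```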
Summary
ByteLevel BPE tokenizers (like GPT-2/SmolLM) can split multi-byte UTF-8 characters across token boundaries during streaming text generation. For example, the Chinese character "清" (UTF-8: E6 B8 85) might be split into two tokens: "æ¸" (E6 B8) and "ħ" (85). This causes garbled output when displaying CJK characters and other multi-byte Unicode content on Apple platforms.
Solution: Add a UTF8StreamingBuffer class in ExecuTorchLLMTextRunner.mm that:
- Accumulates bytes until complete UTF-8 sequences are formed
- Rejects overlong encodings (0xC0, 0xC1) and out-of-range bytes (0xF5+)
- Flushes remaining bytes at generation end

Test plan
- Built the Apple frameworks in Debug mode (./scripts/build_apple_frameworks.sh --Debug) and verified the fix on an iOS device