Skip to content

Conversation

@ShunL12324
Copy link

@ShunL12324 ShunL12324 commented Dec 12, 2025

Summary

ByteLevel BPE tokenizers (like GPT-2/SmolLM) can split multi-byte UTF-8 characters across token boundaries during streaming text generation. For example, the Chinese character "清" (UTF-8: E6 B8 85) might be split into two tokens: æ¸ (E6 B8) and ħ (85).

This causes garbled output when displaying CJK characters and other multi-byte Unicode content on Apple platforms.

Solution: Add a UTF8StreamingBuffer class in ExecuTorchLLMTextRunner.mm that:

  • Accumulates bytes until complete UTF-8 sequences are formed
  • Validates continuation bytes for each sequence
  • Rejects invalid bytes (overlong encodings 0xC0, 0xC1, out-of-range 0xF5+)
  • Skips lone continuation bytes to prevent buffer accumulation
  • Flushes remaining valid bytes at generation end

Test plan

  • Tested with SmolLM model generating Chinese text - characters now display correctly instead of garbled output
  • Tested with ASCII-only output - no regression
  • Built debug framework (./scripts/build_apple_frameworks.sh --Debug) and verified fix on iOS device

ByteLevel BPE tokenizers (like GPT-2/SmolLM) can split multi-byte UTF-8
characters across token boundaries. For example, the Chinese character
"清" (UTF-8: E6 B8 85) might be split into two tokens: "æ¸" (E6 B8)
and "ħ" (85).

This commit adds a UTF8StreamingBuffer class that:
- Accumulates bytes until complete UTF-8 sequences are formed
- Rejects overlong encodings (0xC0, 0xC1) and out-of-range bytes (0xF5+)
- Flushes remaining bytes at generation end

This ensures correct display of CJK characters and other multi-byte
Unicode content during streaming text generation on Apple platforms.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Copilot AI review requested due to automatic review settings December 12, 2025 10:56
@pytorch-bot
Copy link

pytorch-bot bot commented Dec 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16219

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 6779fd2 with merge base 3a262ef (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla
Copy link

meta-cla bot commented Dec 12, 2025

Hi @ShunL12324!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@github-actions
Copy link

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@meta-cla
Copy link

meta-cla bot commented Dec 12, 2025

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 12, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes garbled output when streaming text with CJK characters and other multi-byte Unicode content on Apple platforms. The issue occurs because ByteLevel BPE tokenizers can split multi-byte UTF-8 characters across token boundaries during streaming generation.

Key Changes:

  • Adds a UTF8StreamingBuffer class to accumulate bytes until complete UTF-8 sequences are formed
  • Implements validation and handling of partial UTF-8 sequences, invalid bytes, and continuation bytes
  • Integrates the buffer into the streaming callback pipeline with proper flushing at generation end

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


if (valid) {
result.append(buffer_, i, seqLen);
i += seqLen - 1; // -1 because loop will i++
Copy link

Copilot AI Dec 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop increments i by seqLen - 1 to account for the loop's i++, but the loop counter is incremented after the adjustment. This results in advancing by seqLen total, which is correct. However, when the sequence is invalid (skipped via continue), the loop only increments by 1. This creates an inconsistency where a valid multi-byte sequence advances correctly, but an invalid start byte only advances by 1, potentially causing the loop to re-examine bytes that are part of a previously skipped sequence. Consider restructuring the loop to avoid manual adjustment and use explicit indexing instead.

Copilot uses AI. Check for mistakes.
Comment on lines +249 to +261

// Flush any remaining bytes in the buffer
if (callback) {
std::string remaining = utf8Buffer->flush();
if (!remaining.empty()) {
NSString *remainingString = [[NSString alloc] initWithBytes:remaining.data()
length:remaining.size()
encoding:NSUTF8StringEncoding];
if (remainingString) {
callback(remainingString);
}
}
}
Copy link

Copilot AI Dec 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The flush logic is executed regardless of whether generate() succeeded or failed. If status != Error::Ok, the buffer may contain incomplete data that should not be emitted. The flush should only execute when generation completes successfully. Move this block inside a check for status == Error::Ok before the error handling code.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant