fix(apple): Handle partial UTF-8 sequences in streaming LLM output #16219

ShunL12324 · 2025-12-12T10:56:39Z

Summary

ByteLevel BPE tokenizers (like GPT-2/SmolLM) can split multi-byte UTF-8 characters across token boundaries during streaming text generation. For example, the Chinese character "清" (UTF-8: E6 B8 85) might be split into two tokens: æ¸ (E6 B8) and ħ (85).

This causes garbled output when displaying CJK characters and other multi-byte Unicode content on Apple platforms.
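The odd-looking token strings in the example ("æ¸" and "ħ") come from the byte-to-character table that byte-level BPE vocabularies use to make every byte printable. The sketch below reproduces that mapping under the assumption that the tokenizer follows the GPT-2 table (bytes in the ranges 0x21-0x7E, 0xA1-0xAC and 0xAE-0xFF map to themselves, all others are numbered upward from U+0100). The function names are illustrative and do not come from the PR.

```cpp
#include <cstdint>

// True for bytes that the GPT-2 style table keeps as their own code point.
constexpr bool isPrintableByte(unsigned b) {
  return (b >= 0x21 && b <= 0x7E) || (b >= 0xA1 && b <= 0xAC) ||
         (b >= 0xAE && b <= 0xFF);
}

// Maps a raw byte to the code point that represents it in the vocabulary.
// Assumption: the tokenizer uses the GPT-2 byte-to-unicode table.
char32_t byteToVocabCodePoint(std::uint8_t byte) {
  if (isPrintableByte(byte)) {
    return byte;
  }
  // Non-printable bytes are numbered in ascending order starting at U+0100.
  char32_t offset = 0;
  for (unsigned b = 0; b < byte; ++b) {
    if (!isPrintableByte(b)) {
      ++offset;
    }
  }
  return 0x100 + offset;
}
```

With this table, 0xE6 and 0xB8 stay as U+00E6 (æ) and U+00B8 (¸), while 0x85 becomes U+0127 (ħ), which matches the token strings quoted above.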

Solution: Add a UTF8StreamingBuffer class in ExecuTorchLLMTextRunner.mm that:

- Accumulates bytes until complete UTF-8 sequences are formed
- Validates continuation bytes for each sequence
- Rejects invalid lead bytes (0xC0 and 0xC1, which can only begin overlong encodings, and the out-of-range 0xF5+)
- Skips lone continuation bytes to prevent buffer accumulation
- Flushes remaining valid bytes at generation end
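The steps listed above can be sketched as follows. This is a minimal illustration and not the code from the PR: the class name `Utf8StreamBuffer`, the helper names, and the `process` signature are assumptions, and the sketch does not reject three-byte overlong forms or surrogate code points.

```cpp
#include <cstddef>
#include <string>

// Illustrative sketch only; names and details are assumptions, not the PR's code.
class Utf8StreamBuffer {
 public:
  // Appends one token's raw bytes and returns all complete sequences so far.
  // An incomplete trailing sequence stays buffered for the next call.
  std::string process(const std::string& token) {
    buffer_ += token;
    std::string out;
    std::size_t i = 0;
    while (i < buffer_.size()) {
      const std::size_t len = sequenceLength(byteAt(i));
      if (len == 0) {
        ++i;  // lone continuation byte or invalid lead byte: skip it
        continue;
      }
      // Count the lead byte plus the valid continuation bytes that follow it.
      std::size_t have = 1;
      while (have < len && i + have < buffer_.size() &&
             isContinuation(byteAt(i + have))) {
        ++have;
      }
      if (have == len) {
        out.append(buffer_, i, len);  // complete sequence: emit it
        i += len;
      } else if (i + have == buffer_.size()) {
        break;  // ran out of bytes mid-sequence: wait for the next token
      } else {
        ++i;  // sequence interrupted by a non-continuation byte: drop the lead
      }
    }
    buffer_.erase(0, i);
    return out;
  }

  // Returns whatever is still buffered and clears the buffer. In this sketch
  // the leftover is always an incomplete sequence, so callers must validate it.
  std::string flush() {
    std::string rest;
    rest.swap(buffer_);
    return rest;
  }

 private:
  unsigned char byteAt(std::size_t i) const {
    return static_cast<unsigned char>(buffer_[i]);
  }
  static bool isContinuation(unsigned char b) { return (b & 0xC0) == 0x80; }
  // Expected length for a lead byte; 0 for bytes that cannot start a sequence
  // (0x80-0xBF continuation bytes, 0xC0, 0xC1, and 0xF5 and above).
  static std::size_t sequenceLength(unsigned char b) {
    if (b < 0x80) return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
  }
  std::string buffer_;
};
```

Feeding the two tokens from the example (E6 B8, then 85) yields an empty string from the first call and the full three-byte character from the second.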

Test plan

- Tested with SmolLM model generating Chinese text: characters now display correctly instead of garbled output
- Tested with ASCII-only output: no regression
- Built debug framework (./scripts/build_apple_frameworks.sh --Debug) and verified fix on iOS device

ByteLevel BPE tokenizers (like GPT-2/SmolLM) can split multi-byte UTF-8 characters across token boundaries. For example, the Chinese character "清" (UTF-8: E6 B8 85) might be split into two tokens: "æ¸" (E6 B8) and "ħ" (85).

This commit adds a UTF8StreamingBuffer class that:
- Accumulates bytes until complete UTF-8 sequences are formed
- Rejects overlong encodings (0xC0, 0xC1) and out-of-range bytes (0xF5+)
- Flushes remaining bytes at generation end

This ensures correct display of CJK characters and other multi-byte Unicode content during streaming text generation on Apple platforms.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

pytorch-bot · 2025-12-12T10:56:42Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16219

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 6779fd2 with merge base 3a262ef:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla · 2025-12-12T10:56:45Z

Hi @ShunL12324!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

github-actions · 2025-12-12T10:57:14Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

meta-cla · 2025-12-12T10:59:27Z

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

Copilot

Pull request overview

This PR fixes garbled output when streaming text with CJK characters and other multi-byte Unicode content on Apple platforms. The issue occurs because ByteLevel BPE tokenizers can split multi-byte UTF-8 characters across token boundaries during streaming generation.

Key Changes:

Adds a UTF8StreamingBuffer class to accumulate bytes until complete UTF-8 sequences are formed
Implements validation and handling of partial UTF-8 sequences, invalid bytes, and continuation bytes
Integrates the buffer into the streaming callback pipeline with proper flushing at generation end

Copilot · 2025-12-12T11:00:53Z

extension/llm/apple/ExecuTorchLLM/Exported/ExecuTorchLLMTextRunner.mm

+
+      if (valid) {
+        result.append(buffer_, i, seqLen);
+        i += seqLen - 1;  // -1 because loop will i++


The loop adds seqLen - 1 to i and relies on the loop's own i++ for the final step, so a valid sequence advances i by seqLen in total, which is correct. When the start byte is invalid and the iteration is skipped via continue, however, i advances by only 1. The two paths therefore advance differently, and after an invalid start byte the loop may re-examine bytes that belonged to the skipped sequence. Consider restructuring the loop to drop the manual adjustment and advance the index explicitly.
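One way to read this suggestion is a `while` loop in which every path advances the index exactly once and by an explicit amount. The sketch below is a hypothetical illustration, not the PR's code: `copyWholeSequences` and the `lengthOf` callable are assumed names, and continuation-byte validation is left out to keep the focus on the index handling.

```cpp
#include <cstddef>
#include <string>

// `lengthOf` is a hypothetical callable: it returns the sequence length for a
// lead byte, or 0 when the byte cannot start a sequence.
template <typename LengthFn>
std::string copyWholeSequences(const std::string& buffer, LengthFn lengthOf) {
  std::string result;
  std::size_t i = 0;
  while (i < buffer.size()) {
    const std::size_t seqLen = lengthOf(static_cast<unsigned char>(buffer[i]));
    if (seqLen == 0) {
      i += 1;  // invalid start byte: skip exactly this byte
      continue;
    }
    if (i + seqLen > buffer.size()) {
      break;  // incomplete tail: leave it for the next chunk
    }
    result.append(buffer, i, seqLen);
    i += seqLen;  // valid sequence: advance past all of its bytes
  }
  return result;
}
```

Because there is no loop-level `i++`, there is no `- 1` correction to keep in sync, and the amount skipped on each path is visible at the point where it happens.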

Copilot · 2025-12-12T11:00:54Z

extension/llm/apple/ExecuTorchLLM/Exported/ExecuTorchLLMTextRunner.mm

+
+  // Flush any remaining bytes in the buffer
+  if (callback) {
+    std::string remaining = utf8Buffer->flush();
+    if (!remaining.empty()) {
+      NSString *remainingString = [[NSString alloc] initWithBytes:remaining.data()
+                                                           length:remaining.size()
+                                                         encoding:NSUTF8StringEncoding];
+      if (remainingString) {
+        callback(remainingString);
+      }
+    }
+  }


The flush logic is executed regardless of whether generate() succeeded or failed. If status != Error::Ok, the buffer may contain incomplete data that should not be emitted. The flush should only execute when generation completes successfully. Move this block inside a check for status == Error::Ok before the error handling code.
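A minimal sketch of the suggested guard, written in plain C++ so it stands alone. The `Status` enum is a hypothetical stand-in for the runtime's error type, and `finishGeneration` and its parameters are assumed names rather than anything from the PR.

```cpp
#include <functional>
#include <string>

// Hypothetical stand-in for the runtime's status/error type.
enum class Status { Ok, Failed };

// Clears the pending bytes in every case, but forwards them to the callback
// only when generation finished successfully.
void finishGeneration(Status status,
                      std::string& pending,
                      const std::function<void(const std::string&)>& callback) {
  std::string tail;
  tail.swap(pending);  // the buffer is emptied regardless of the outcome
  if (status == Status::Ok && callback && !tail.empty()) {
    callback(tail);
  }
}
```

Clearing the buffer on failure as well keeps stale bytes from leaking into a later generation that reuses the same buffer; whether the PR's buffer is reused that way is not stated in this thread.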

Copilot AI review requested due to automatic review settings December 12, 2025 10:56

ShunL12324 requested review from jackzhxng, larryliu0820 and mergennachin as code owners December 12, 2025 10:56

Copilot started reviewing on behalf of ShunL12324 December 12, 2025 10:57 View session

meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 12, 2025

Copilot AI reviewed Dec 12, 2025
