Skip to content

[YARR] Advance index by firstCharacterAdditionalReadSize in last-alt > first-alt backtrack trampoline#221

Open
robobun wants to merge 1 commit into
oven-sh:mainfrom
robobun:farm/571dd7af/yarr-non-bmp-trampoline-index-sync
Open

[YARR] Advance index by firstCharacterAdditionalReadSize in last-alt > first-alt backtrack trampoline#221
robobun wants to merge 1 commit into
oven-sh:mainfrom
robobun:farm/571dd7af/yarr-non-bmp-trampoline-index-sync

Conversation

@robobun

@robobun robobun commented May 8, 2026

Copy link
Copy Markdown
Collaborator

What

On ARM64, /\B|x{1,2}?/u.exec("a\u{10ffff}b")[0].length returns 4294967295 instead of 0.

Why

The non-BMP first-character optimization (YARR_JIT_UNICODE_CAN_INCREMENT_INDEX_FOR_NON_BMP, ARM64-only) sets firstCharacterAdditionalReadSize = 1 whenever tryReadUnicodeChar decodes a surrogate pair, so that the body-alternative retry loop can skip the mid-surrogate position.

When the body backtracks out of the last alternative to retry from the first, and lastAlt.minimumSize > firstAlt.minimumSize, the trampoline in backtrack() / BodyAlternativeEnd does this:

// store matchStart = index + firstCharacterAdditionalReadSize
// ...
if (alternative->m_minimumSize > beginOp->m_alternative->m_minimumSize) {
    unsigned delta = ...;
    if (delta != 1)
        m_jit.sub32(Imm32(delta - 1), m_regs.index);
    m_jit.jump(beginOp->m_reentry);               // <-- index not bumped by fCARS
} else {
    m_jit.add32(m_regs.firstCharacterAdditionalReadSize, m_regs.index);  // sibling does this
    m_jit.add32(Imm32(delta + 1), m_regs.index);
    checkInput().linkTo(beginOp->m_reentry, &m_jit);
}

matchStart is advanced by firstCharacterAdditionalReadSize a few lines above, but m_regs.index is not in the first branch, so on reentry index != matchStart + firstAlternative.minimumSize.

If the first alternative is a zero-width assertion (e.g. \B) that succeeds at that stale index, BodyAlternativeNext writes output[0] = matchStart, output[1] = index with output[1] < output[0]. createRegExpMatchesArray (RegExpMatchesArray.h) then calls jsSubstringOfResolved(..., result.start, result.end - result.start) with an underflowed length of (unsigned)-1 = 4294967295.

Concrete trace for "a\u{10ffff}b" (code units: [0x61, 0xDBFF, 0xDFFF, 0x62])

  1. \B at 0 fails; alt 2 reentry index=1; fixed x reads 'a', fails → trampoline matchStart=1, index=1.
  2. \B at 1 fails; alt 2 reentry index=2; fixed x reads input[1]=0xDBFF, decodes the pair, fCARS=1, U+10FFFF ≠ x → trampoline.
  3. matchStart = index(2) + fCARS(1) = 3, index left at 2, jump to alt 1 reentry.
  4. \B at 2: prev = U+10FFFF (non-word), curr = lone trail 0xDFFF (non-word) → \B succeeds. Return (start=3, end=2).

Fix

Mirror the sibling else branch: add32(firstCharacterAdditionalReadSize, index) before the (possible) sub32, and gate the jump on checkInput() since index may now be length + 1 in the delta == 1 case. No-op on non-ARM64 (the whole block is inside #if ENABLE(YARR_JIT_UNICODE_CAN_INCREMENT_INDEX_FOR_NON_BMP)).

Tests

JSTests/stress/regexp-jit-non-bmp-backtrack-trampoline-index-sync.js — asserts the match result is a well-formed string (length ≤ input.length, index + length ≤ input.length) for several zero-width-first-alternative + variable-last-alternative patterns over inputs containing surrogate pairs, plus the originally-reported case.

…> first-alt backtrack trampoline

When the non-BMP first-character optimization is active and the body
backtracks out of the last alternative to retry from the first, the
trampoline at BodyAlternativeEnd advances matchStart by
firstCharacterAdditionalReadSize (so the next attempt starts past the
surrogate pair it just decoded). In the branch taken when
lastAlternative.minimumSize > firstAlternative.minimumSize it did not
apply the same adjustment to m_regs.index before jumping to
beginOp->m_reentry, so the reentry invariant
  index == matchStart + firstAlternative.minimumSize
was broken (index was firstCharacterAdditionalReadSize behind).

If the first alternative is a zero-width assertion such as \B, it can
succeed at that stale index and the generated epilogue writes
output[0] = matchStart, output[1] = index with output[1] < output[0].
createRegExpMatchesArray then builds the whole-match substring with
length (unsigned)(end - start) = 4294967295.

Reproduction (ARM64 only; the optimization is gated on CPU(ARM64)):

    /\B|x{1,2}?/u.exec("a\u{10ffff}b")[0].length  // 4294967295

Mirror the sibling else branch: add firstCharacterAdditionalReadSize to
index, and check input before jumping to reentry since index is no
longer guaranteed to be <= length when delta == 1.

* JSTests/stress/regexp-jit-non-bmp-backtrack-trampoline-index-sync.js: Added.
* Source/JavaScriptCore/yarr/YarrJIT.cpp:
@coderabbitai

coderabbitai Bot commented May 8, 2026

Copy link
Copy Markdown

Review Change Stack
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5c87e864-2926-4a54-8b3c-ba388633adb1

📥 Commits

Reviewing files that changed from the base of the PR and between 88b2f7a and 7cb02ee.

📒 Files selected for processing (2)
  • JSTests/stress/regexp-jit-non-bmp-backtrack-trampoline-index-sync.js
  • Source/JavaScriptCore/yarr/YarrJIT.cpp

Walkthrough

The PR fixes input index synchronization in the regex JIT compiler's backtracking trampoline logic when the Unicode non-BMP first-character optimization is enabled. It updates the repeating-alternative backtracking handler to maintain consistency between the input index and matchStart, adds bounds checking for edge cases, and validates the fix with a comprehensive stress test covering surrogate pairs and alternative selection behavior.

Changes

Unicode Non-BMP Backtracking Index Synchronization

Layer / File(s) Summary
Backtracking Trampoline Core Fix
Source/JavaScriptCore/yarr/YarrJIT.cpp
Repeating-alternative backtracking trampoline conditionally adds firstCharacterAdditionalReadSize to m_regs.index when the non-BMP optimization is enabled, adds input-availability checking for delta == 1 cases, and preserves existing decrement logic for other deltas.
Test Assertion Helpers
JSTests/stress/regexp-jit-non-bmp-backtrack-trampoline-index-sync.js
Defines shouldBe for exact-value assertions and shouldBeOneOf for membership validation, both with contextual error messages.
Main Test Cases
JSTests/stress/regexp-jit-non-bmp-backtrack-trampoline-index-sync.js
10,000-iteration loop validates \B and quantifier combinations against non-BMP surrogate pairs, including delta>1 variants, earliest-position alternative selection, and surrogate-pair-at-end edge cases.
Sanity Check
JSTests/stress/regexp-jit-non-bmp-backtrack-trampoline-index-sync.js
Confirms that disabling the optimization via /m flag produces valid matches with expected behavior.
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The description provides detailed technical context (What, Why, Fix, Tests sections), but does not follow the WebKit template format requiring a Bugzilla reference and commit-message structure. Add a Bugzilla bug reference link (https://bugs.webkit.org/show_bug.cgi?id=#####) at the top and format as a commit message per the WebKit template.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and specifically describes the main change—advancing index by firstCharacterAdditionalReadSize in the last-alt to first-alt backtrack trampoline logic.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

robobun added a commit to oven-sh/bun that referenced this pull request May 8, 2026
… bump

The fix lives in JavaScriptCore (oven-sh/WebKit#221), vendored separately;
until WEBKIT_VERSION is bumped to include it the underflow is still present
on ARM64. Probe for the exact symptom once and soft-pass so CI on this
draft stays green and the expected failure does not mask unrelated
jsc-stress regressions. Every aarch64 shard on build 52790 reproduced
match[0].length === 4294967295, confirming the analysis.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant