[YARR] Advance index by firstCharacterAdditionalReadSize in last-alt > first-alt backtrack trampoline #221
…> first-alt backtrack trampoline
When the non-BMP first-character optimization is active and the body
backtracks out of the last alternative to retry from the first, the
trampoline at BodyAlternativeEnd advances matchStart by
firstCharacterAdditionalReadSize (so the next attempt starts past the
surrogate pair it just decoded). In the branch taken when
lastAlternative.minimumSize > firstAlternative.minimumSize it did not
apply the same adjustment to m_regs.index before jumping to
beginOp->m_reentry, so the reentry invariant
index == matchStart + firstAlternative.minimumSize
was broken (index was firstCharacterAdditionalReadSize behind).
If the first alternative is a zero-width assertion such as \B, it can
succeed at that stale index and the generated epilogue writes
output[0] = matchStart, output[1] = index with output[1] < output[0].
createRegExpMatchesArray then builds the whole-match substring with
length (unsigned)(end - start) = 4294967295.
Reproduction (ARM64 only; the optimization is gated on CPU(ARM64)):
/\B|x{1,2}?/u.exec("a\u{10ffff}b")[0].length // 4294967295
Mirror the sibling else branch: add firstCharacterAdditionalReadSize to
index, and check input before jumping to reentry since index is no
longer guaranteed to be <= length when delta == 1.
* JSTests/stress/regexp-jit-non-bmp-backtrack-trampoline-index-sync.js: Added.
* Source/JavaScriptCore/yarr/YarrJIT.cpp:
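The arithmetic behind the symptom can be modelled outside the JIT. The sketch below is plain JavaScript, not YARR code: `reentryInvariantHolds` and `wholeMatchLength` are hypothetical helper names, and the concrete state values (matchStart = 3, index = 2) are illustrative assumptions for a zero-width first alternative after one surrogate pair has been decoded.

```javascript
// Illustrative model only; none of these helpers exist in YarrJIT.cpp.

// Reentry invariant from the commit message:
//   index == matchStart + firstAlternative.minimumSize
function reentryInvariantHolds(state, firstAlternativeMinimumSize) {
  return state.index === state.matchStart + firstAlternativeMinimumSize;
}

// Models (unsigned)(end - start) in 32-bit unsigned arithmetic.
function wholeMatchLength(start, end) {
  return (end - start) >>> 0;
}

const firstCharacterAdditionalReadSize = 1; // a surrogate pair was decoded
const firstAlternativeMinimumSize = 0;      // zero-width assertion such as \B

// Assumed state after the trampoline: matchStart was advanced, index was not.
const stale = { matchStart: 3, index: 2 };
// Same state with the missing adjustment applied to index.
const synced = {
  matchStart: 3,
  index: stale.index + firstCharacterAdditionalReadSize,
};

console.log(reentryInvariantHolds(stale, firstAlternativeMinimumSize));  // false
console.log(reentryInvariantHolds(synced, firstAlternativeMinimumSize)); // true

// A zero-width success leaves end == index, so output = [matchStart, index].
console.log(wholeMatchLength(stale.matchStart, stale.index));   // 4294967295
console.log(wholeMatchLength(synced.matchStart, synced.index)); // 0
```

The model only covers the bookkeeping; it does not reproduce the additional input check that the fix performs before jumping to reentry.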
Walkthrough
The PR fixes input index synchronization in the regex JIT compiler's backtracking trampoline logic when the Unicode non-BMP first-character optimization is enabled. It updates the repeating-alternative backtracking handler to maintain consistency between the input index and …
Changes
Unicode Non-BMP Backtracking Index Synchronization
… bump
The fix lives in JavaScriptCore (oven-sh/WebKit#221), which is vendored separately; until WEBKIT_VERSION is bumped to include it, the underflow is still present on ARM64. Probe for the exact symptom once and soft-pass, so that CI on this draft stays green and the expected failure does not mask unrelated jsc-stress regressions. Every aarch64 shard on build 52790 reproduced match[0].length === 4294967295, confirming the analysis.
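A probe of the kind described could look like the sketch below. It is an assumption about shape only: `hasTrampolineIndexUnderflow` is a hypothetical name, and how the harness turns a positive probe into a soft pass is not shown in this PR, so that part is left out.

```javascript
// Hypothetical probe; the actual harness change is not shown here.
let cachedProbeResult;

function hasTrampolineIndexUnderflow() {
  if (cachedProbeResult === undefined) {
    const match = /\B|x{1,2}?/u.exec("a\u{10ffff}b");
    // Exact symptom only: any other outcome, including no match,
    // is treated as "not affected".
    cachedProbeResult = match !== null && match[0].length === 4294967295;
  }
  return cachedProbeResult;
}
```

Matching only the exact symptom keeps the soft pass narrow: any other wrong result from the same pattern would still surface as a failure.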
What
On ARM64,
/\B|x{1,2}?/u.exec("a\u{10ffff}b")[0].length
returns 4294967295 instead of 0.
Why
The non-BMP first-character optimization (YARR_JIT_UNICODE_CAN_INCREMENT_INDEX_FOR_NON_BMP, ARM64-only) sets firstCharacterAdditionalReadSize = 1 whenever tryReadUnicodeChar decodes a surrogate pair, so that the body-alternative retry loop can skip the mid-surrogate position.
When the body backtracks out of the last alternative to retry from the first, and lastAlt.minimumSize > firstAlt.minimumSize, the trampoline in backtrack() / BodyAlternativeEnd behaves as follows: matchStart is advanced by firstCharacterAdditionalReadSize earlier in the trampoline, but m_regs.index is not advanced in the first branch, so on reentry index != matchStart + firstAlternative.minimumSize.
If the first alternative is a zero-width assertion (e.g. \B) that succeeds at that stale index, BodyAlternativeNext writes output[0] = matchStart, output[1] = index with output[1] < output[0]. createRegExpMatchesArray (RegExpMatchesArray.h) then calls jsSubstringOfResolved(..., result.start, result.end - result.start) with an underflowed length of (unsigned)-1 = 4294967295.
Concrete trace for "a\u{10ffff}b" (code units: [0x61, 0xDBFF, 0xDFFF, 0x62]), writing fCARS for firstCharacterAdditionalReadSize:
1. \B at 0 fails; alt 2 reentry index=1; fixed x reads 'a', fails → trampoline: matchStart=1, index=1.
2. \B at 1 fails; alt 2 reentry index=2; fixed x reads input[1]=0xDBFF, decodes the pair, fCARS=1, U+10FFFF ≠ x → trampoline: matchStart = index(2) + fCARS(1) = 3, index left at 2, jump to alt 1 reentry.
3. \B at 2: prev = U+10FFFF (non-word), curr = lone trail 0xDFFF (non-word) → \B succeeds. Return (start=3, end=2).
Fix
Mirror the sibling else branch: add32(firstCharacterAdditionalReadSize, index) before the (possible) sub32, and gate the jump on checkInput(), since index may now be length + 1 in the delta == 1 case. No-op on non-ARM64 (the whole block is inside #if ENABLE(YARR_JIT_UNICODE_CAN_INCREMENT_INDEX_FOR_NON_BMP)).
Tests
JSTests/stress/regexp-jit-non-bmp-backtrack-trampoline-index-sync.js: asserts the match result is a well-formed string (length ≤ input.length, index + length ≤ input.length) for several zero-width-first-alternative + variable-last-alternative patterns over inputs containing surrogate pairs, plus the originally-reported case.
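The well-formedness property described above can be sketched as follows. This is not the contents of the added test file: `isWellFormedMatch` is a hypothetical helper, and apart from the originally reported case the patterns and inputs are assumptions chosen to fit the description (zero-width first alternative, variable-width last alternative, input containing a surrogate pair).

```javascript
// Hypothetical helper: checks the two bounds named in the test description.
function isWellFormedMatch(match, input) {
  if (match === null) return true; // no match: nothing to check
  const whole = match[0];
  return typeof whole === "string"
    && whole.length <= input.length
    && match.index + whole.length <= input.length;
}

const cases = [
  [/\B|x{1,2}?/u, "a\u{10ffff}b"],   // originally reported case
  [/\b|y{1,3}?/u, "\u{1f600}a"],     // assumed example
  [/(?=b)|z{2,4}?/u, "a\u{10000}b"], // assumed example
];

for (const [regexp, input] of cases) {
  const match = regexp.exec(input);
  if (!isWellFormedMatch(match, input))
    throw new Error(`malformed match for ${regexp}: length ${match[0].length}`);
}
```

Checking bounds rather than exact results keeps the sketch independent of which position the engine reports, while still rejecting a whole-match length of 4294967295.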