Skip to content

WTF: read interrupted SP from ucontext in signalHandlerSuspendResume#235

Open
robobun wants to merge 1 commit into
mainfrom
farm/ebd23fe3/sigpwr-ucontext-sp
Open

WTF: read interrupted SP from ucontext in signalHandlerSuspendResume#235
robobun wants to merge 1 commit into
mainfrom
farm/ebd23fe3/sigpwr-ucontext-sp

Conversation

@robobun
Copy link
Copy Markdown
Collaborator

@robobun robobun commented May 21, 2026

Fixes the SIGPWR suspend-loop deadlock reported in oven-sh/bun#31158 (and oven-sh/bun#29843 — same mechanism via Prisma's WASM-backed adapter).

The bug

Thread::suspend() sends the GC suspend-resume signal to the target thread and then spins:

while (true) {
    pthread_kill(m_handle, g_wtfConfig.sigThreadSuspendResume);
    globalSemaphoreForSuspendResume->wait();
    if (m_platformRegisters) break;
    Thread::yield();
}

The handler decides whether to publish a register snapshot by checking currentStackPointer() — the SP of the handler's own frame — against thread->m_stack:

void* approximateStackPointer = currentStackPointer();
if (!thread->m_stack.contains(approximateStackPointer)) {
    thread->m_platformRegisters = nullptr;
    globalSemaphoreForSuspendResume->post();
    return;
}

That only works so long as the handler runs on the normal stack. If SA_ONSTACK is set on the sigaction and the thread has a sigaltstack installed, the handler runs on the alt stack, the check fails on every retry, and the loop spins forever.

Who trips this

Go's cgo runtime. runtime.initsig walks every signal and, for any handler it didn't install itself, calls setsigstack to force-add SA_ONSTACK so Go's own threads' synchronous faults stay on a managed alt stack. That includes our SIGPWR handler. Any c-shared library that follows the same recipe (Mono, some JNI layouts, Rust cdylibs) hits it too.

Combined with sanitizer runtimes / libbacktrace / the host's crash handler installing an alternate signal stack on the main thread (ASAN does this unconditionally), the next GC thread-suspend delivers SIGPWR onto the alt stack → stack check fails every retry → the suspender wedges at 100% CPU.

Reproducible in a plain C program to confirm currentStackPointer vs the ucontext-saved SP:

main stack near: 0x7ffdea86460c
alt stack range: [0x74fca3657010..0x74fca3697010]
Case 1 (no SA_ONSTACK): handler_sp=0x7ffdea8638a4 interrupted_sp=0x7ffdea8645a0
Case 2 (SA_ONSTACK):    handler_sp=0x74fca36963a4 interrupted_sp=0x7ffdea8645a0

The fix

Read the SP the thread was running at when the signal arrived straight out of the ucontext (kernel-saved register state). That SP is stable regardless of whether the handler itself runs on the normal or the alt stack, and the existing retry-on-alt-stack semantics still work: if the thread genuinely was on an alt stack when the signal arrived (e.g. a nested handler), the ucontext SP reflects that and the check backs off as before.

currentStackPointer() stays as the fallback for non-HAVE(MACHINE_CONTEXT) platforms.

Test

Regression test coming from the Bun side (oven-sh/bun#31161) — spawns a child that sets SA_ONSTACK on the suspend-resume signal the same way Go's initsig does, then drives the WASM install path that triggers resetInstructionCacheOnAllThreads. Passes with this fix applied; hangs to the test-runner timeout without it.

signalHandlerSuspendResume validated the snapshot by calling
currentStackPointer() — the SP of the *handler frame itself*. That
assumes the handler always runs on the thread's normal stack. It does,
unless someone sets SA_ONSTACK on our sigaction + the thread has a
sigaltstack installed, in which case the handler runs on the alt stack
and the stack-range check fails on every invocation.

The assumption doesn't hold on Linux in practice. Go's cgo runtime
walks every signal in runtime.initsig and slaps SA_ONSTACK onto any
handler it didn't install itself (including ours). Combined with
sanitizer runtimes / libbacktrace / Bun's crash handler installing
alternate signal stacks on the main thread, the very next
Thread::suspend() delivers SIGPWR onto the alt stack, the check fails
on every retry, and the loop spins forever — the Bun event loop hangs
at the first WASM compile after the dlopen. See oven-sh/bun#31158 and
oven-sh/bun#29843 (Prisma MariaDB adapter, same mechanism).

The fix is to read the SP the thread was running at when the signal
arrived straight out of the kernel-saved register state. That SP lives
in the ucontext and is stable regardless of whether the handler itself
runs on the normal or the alt stack. The existing retry-on-alt-stack
semantics still work: if the thread genuinely was on an alt stack when
the signal arrived (e.g. a nested handler), the ucontext SP reflects
that and the check backs off as before.

Keep currentStackPointer() as the fallback for platforms without a
machine context.
@gogakoreli
Copy link
Copy Markdown

Small suggestion: now that the handler is robust to SA_ONSTACK via ucontext SP, the comment on the SIGPWR override (from ceb3e74) should be updated to explain the full picture. A reader seeing #if OS(LINUX) && USE(BUN_JSC_ADDITIONS) + SIGPWR with no context about why it's Linux-only or how it relates to the ucontext fix will be confused.

Suggested replacement:

#if OS(LINUX) && USE(BUN_JSC_ADDITIONS)
    // Thread suspension on Linux uses a signal (macOS uses Mach ports instead —
    // this entire signal-based mechanism is Linux/FreeBSD-only). We override the
    // default SIGUSR1 to SIGPWR because npm packages commonly register
    // process.on('SIGUSR1') handlers. SIGPWR ("power failure") is effectively
    // unused on modern Linux.
    //
    // Note: dlopen'd runtimes (Go cgo, Mono, JNI) may add SA_ONSTACK to this
    // handler's sigaction flags via their init routines. The handler reads the
    // interrupted SP from ucontext rather than its own frame pointer, making it
    // robust to alt-stack delivery regardless of flag mutation.
    // See oven-sh/bun#31158.
    g_wtfConfig.sigThreadSuspendResume = SIGPWR;
#endif

This ties together the "why SIGPWR", "why Linux-only", and "why it's safe" in one place.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4d3faed2-d201-48cf-944c-45088a80ed36

📥 Commits

Reviewing files that changed from the base of the PR and between 01ecc4e and a355e3e.

📒 Files selected for processing (1)
  • Source/WTF/wtf/posix/ThreadingPOSIX.cpp

Walkthrough

The signal-handler suspend/resume mechanism is improved to detect the interrupted thread's stack pointer from machine-context ucontext_t register state instead of runtime approximation. A new interruptedStackPointer() helper extracts the stack pointer on supported architectures, and signalHandlerSuspendResume uses it to validate whether the handler interrupted normal or alternate stack execution.

Changes

Signal handler stack pointer reliability

Layer / File(s) Summary
Machine-context stack pointer extraction
Source/WTF/wtf/posix/ThreadingPOSIX.cpp
New interruptedStackPointer(ucontext_t*) helper (under HAVE(MACHINE_CONTEXT)) reads the interrupted stack pointer directly from kernel-saved ucontext_t register fields for multiple Linux and FreeBSD CPU architectures, with fallback to nullptr on unsupported configurations.
Signal handler suspend/resume with reliable stack detection
Source/WTF/wtf/posix/ThreadingPOSIX.cpp
Thread::signalHandlerSuspendResume now uses interruptedStackPointer() to obtain stackPointerToCheck (with fallback to currentStackPointer()), validates it against the thread's recorded stack bounds, and either captures m_platformRegisters from ucontext or clears them and backs off for retry based on containment.
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description comprehensively explains the bug, root cause, fix mechanism, and testing approach with technical details. However, it does not follow the repository's required template which mandates a Bugzilla bug link, reviewers, and structured sections (bug title, reviewed by, explanation, changed paths). Reformat the description to match the WebKit template: include a Bugzilla link at the top, add 'Reviewed by NOBODY (OOPS!)' section, structure with clear sections, and list changed paths with functions as per the template requirements.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'WTF: read interrupted SP from ucontext in signalHandlerSuspendResume' directly and accurately summarizes the main technical change—reading the interrupted stack pointer from ucontext instead of using currentStackPointer() in the signal handler suspend/resume logic.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown

Preview Builds

Commit Release Date
a355e3e0 autobuild-preview-pr-235-a355e3e0 2026-05-21 04:48:50 UTC

robobun added a commit to oven-sh/bun that referenced this pull request May 21, 2026
Point WEBKIT_VERSION at the preview autobuild of oven-sh/WebKit#235,
which teaches signalHandlerSuspendResume to read the interrupted
thread's SP from the ucontext instead of the handler's own
currentStackPointer(). That's the root-cause fix — regardless of
what any dlopen'd library (Go cgo, Mono, JNI, Rust cdylib, …) does
to SA_ONSTACK, the stack-range check always sees the interrupted
stack rather than wherever the handler landed.

Delete the per-dlopen sigaction scan on the Bun side. With the WTF
change in place it's pure overhead, and the reporter rightly pointed
out that chaining workarounds for workarounds is the wrong direction.

Regression test (test/regression/issue/31158.test.ts) is unchanged
and now exercises the upstream fix directly.

Swap the preview-pr-235 tag for the merged autobuild hash once
oven-sh/WebKit#235 lands.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants