Fix deadlock that wedges the logs WebSocket for an entire process by hebo6 · Pull Request #495 · F1bonacc1/process-compose

hebo6 · 2026-05-23T04:11:14Z

Symptom

After running for a while (especially with TUI attach viewing log of any process), process logs <name> and the TUI log panel hang forever for that process. New WS subscribers complete the 101 Switching Protocols handshake but the server never sends a single frame. Only the REST endpoint /process/logs/:name/:endOffset/:limit continues to work. The only recovery is restarting process-compose, which kills every supervised process.

Root cause

Three interacting issues form the deadlock:

ProcessLogBuffer.Write holds mxObs across observer.WriteString. If any single observer's WriteString blocks, every subsequent Write and every new GetLogsAndSubscribe on this buffer is stuck behind the same lock forever.
HandleLogsStream's slHandler does a blocking logChan <- msg. The per-subscriber logChan has capacity 256. If handleLog's WriteJSON is blocked (e.g. peer's TCP receive buffer is full because the client stopped reading), logChan fills, slHandler blocks, and then (1) kicks in.
A process-global api.wsMtx guards every ws.WriteJSON in both handleLog and handleStateStream. gorilla/websocket already enforces a single writer per Conn internally, so this mutex is redundant — but worse, it lets a write that's stuck on one ws connection block writes on every other ws connection in the process.

A half-dead follower client (TUI attach that lost focus, network blip, etc.) is enough to trip (3) → (2) → (1) and permanently wedge logs for the affected process.

Fix

src/pclog/process_log_buffer.go — Write now snapshots b.observers under mxObs and releases the lock before fanning out to WriteString. A slow observer can no longer stall other writes or new subscriptions.
src/api/ws_api.go — slHandler switches to a non-blocking select { case logChan <- msg: default: }. Dropping a line for a back-pressured subscriber is far better than deadlocking the producer.
src/api/ws_api.go + src/api/state_ws_api.go — drop the global api.wsMtx around WriteJSON (gorilla's Conn is already single-writer-safe) and add ws.SetWriteDeadline(now + 10s) so a half-dead peer is evicted within 10 s instead of holding resources forever.

Reproduction

# repro-pc.yaml
version: "0.5"
processes:
  noisy:
    command: "i=0; while true; do printf 'line-%06d %s\n' $i $(date +%s%N); i=$((i+1)); done"
    availability:
      restart: "no"

process-compose -f repro-pc.yaml --port 8089 up -t=false &

# A follower that completes the WS handshake and then stops reading.
python3 ws_slow.py 8089 noisy true 999 &
SLOW=$!; sleep 0.5; kill -STOP $SLOW   # freeze the consumer

sleep 30   # let server-side Send-Q saturate

# Before the fix this hangs forever; REST still works.
process-compose --port 8089 process logs --tail 3 noisy
curl http://127.0.0.1:8089/process/logs/noisy/0/3

ws_slow.py is a ~30 line script that performs the WS upgrade and then sleeps without reading. Happy to add it under scripts/ or as a test if useful.

Verification

Step	Before	After
`Send-Q` to the frozen follower	rises to ~1.9 MB, then stays there indefinitely	rises to ~1.9 MB, then the follower is evicted by the write deadline (server fd count drops)
Second `process logs` after step above	hangs (timeout, 0 bytes)	succeeds immediately
`SIGCONT` on the slow client	does not recover	recovers immediately (was never broken once the deadline ticked)
REST `/process/logs/...`	works throughout	works throughout

No new tests added in this PR — happy to add a regression test (slow ws consumer + assertion that a second GetLogsAndSubscribe returns within a deadline) if you'd like.

F1bonacc1

Hey @hebo6,

Thank you for this PR and excellent diagnosis.
There might be one critical issue that slipped through the cracks.
See my comments.

F1bonacc1 · 2026-05-23T15:18:19Z

+				default:
+				}


Suggested change

default:

}

default:

dropped++

if !warnedOnce {

log.Warn().Str("process", processName).

Msg("ws subscriber backpressured; dropping log lines")

warnedOnce = true

}

}

...

// in handleLog's defer:

if dropped > 0 {

log.Warn().Str("process", procName).Uint64("dropped", dropped).

Msg("ws subscriber disconnected after dropped lines")

}

hebo6 · 2026-05-24T04:22:09Z

Thanks for the review.

I updated the implementation to use a per-WebSocket write mutex created in HandleLogsStream and shared by all handleLog goroutines for that request. This keeps writes serialized for a single ws.Conn without reintroducing the global API-level mutex.

I also added observability for dropped log lines: the first backpressure drop logs a warning, and disconnect cleanup logs the total number of dropped lines for that subscriber.

While touching this area, I also guarded logChan close/send with the same channel-state mutex to avoid a possible send-on-closed-channel panic when the client disconnects while log callbacks are still running.

sonarqubecloud · 2026-05-24T04:22:54Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

F1bonacc1

Looks good, thanks!

修复日志ws死锁

8dcb6e1

F1bonacc1 requested changes May 23, 2026

View reviewed changes

Fix log websocket backpressure

0be8ab4

F1bonacc1 approved these changes May 24, 2026

View reviewed changes

F1bonacc1 merged commit a92cb36 into F1bonacc1:main May 24, 2026
7 checks passed

hebo6 deleted the fix/logs-ws-deadlock branch May 25, 2026 08:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix deadlock that wedges the logs WebSocket for an entire process#495

Fix deadlock that wedges the logs WebSocket for an entire process#495
F1bonacc1 merged 2 commits into
F1bonacc1:mainfrom
hebo6:fix/logs-ws-deadlock

hebo6 commented May 23, 2026

Uh oh!

F1bonacc1 left a comment

Uh oh!

Uh oh!

F1bonacc1 May 23, 2026

Uh oh!

hebo6 commented May 24, 2026

Uh oh!

sonarqubecloud Bot commented May 24, 2026

Uh oh!

F1bonacc1 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

-				default:
-				}
+				default:
+    				dropped++
+                    if !warnedOnce {
+                        log.Warn().Str("process", processName).
+                            Msg("ws subscriber backpressured; dropping log lines")
+                        warnedOnce = true
+                    }
+				}
+				...
+                // in handleLog's defer:
+                if dropped > 0 {
+                    log.Warn().Str("process", procName).Uint64("dropped", dropped).
+                        Msg("ws subscriber disconnected after dropped lines")
+                }

Uh oh!

Conversation

hebo6 commented May 23, 2026

Symptom

Root cause

Fix

Reproduction

Verification

Uh oh!

F1bonacc1 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

F1bonacc1 May 23, 2026

Choose a reason for hiding this comment

Uh oh!

hebo6 commented May 24, 2026

Uh oh!

sonarqubecloud Bot commented May 24, 2026

Quality Gate passed

Uh oh!

F1bonacc1 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants