Skip to content

Fix deadlock that wedges the logs WebSocket for an entire process#495

Merged
F1bonacc1 merged 2 commits into
F1bonacc1:mainfrom
hebo6:fix/logs-ws-deadlock
May 24, 2026
Merged

Fix deadlock that wedges the logs WebSocket for an entire process#495
F1bonacc1 merged 2 commits into
F1bonacc1:mainfrom
hebo6:fix/logs-ws-deadlock

Conversation

@hebo6

@hebo6 hebo6 commented May 23, 2026

Copy link
Copy Markdown
Contributor

Symptom

After running for a while (especially with TUI attach viewing log of any process), process logs <name> and the TUI log panel hang forever for that process. New WS subscribers complete the 101 Switching Protocols handshake but the server never sends a single frame. Only the REST endpoint /process/logs/:name/:endOffset/:limit continues to work. The only recovery is restarting process-compose, which kills every supervised process.

Root cause

Three interacting issues form the deadlock:

  1. ProcessLogBuffer.Write holds mxObs across observer.WriteString. If any single observer's WriteString blocks, every subsequent Write and every new GetLogsAndSubscribe on this buffer is stuck behind the same lock forever.
  2. HandleLogsStream's slHandler does a blocking logChan <- msg. The per-subscriber logChan has capacity 256. If handleLog's WriteJSON is blocked (e.g. peer's TCP receive buffer is full because the client stopped reading), logChan fills, slHandler blocks, and then (1) kicks in.
  3. A process-global api.wsMtx guards every ws.WriteJSON in both handleLog and handleStateStream. gorilla/websocket already enforces a single writer per Conn internally, so this mutex is redundant — but worse, it lets a write that's stuck on one ws connection block writes on every other ws connection in the process.

A half-dead follower client (TUI attach that lost focus, network blip, etc.) is enough to trip (3) → (2) → (1) and permanently wedge logs for the affected process.

Fix

  • src/pclog/process_log_buffer.goWrite now snapshots b.observers under mxObs and releases the lock before fanning out to WriteString. A slow observer can no longer stall other writes or new subscriptions.
  • src/api/ws_api.goslHandler switches to a non-blocking select { case logChan <- msg: default: }. Dropping a line for a back-pressured subscriber is far better than deadlocking the producer.
  • src/api/ws_api.go + src/api/state_ws_api.go — drop the global api.wsMtx around WriteJSON (gorilla's Conn is already single-writer-safe) and add ws.SetWriteDeadline(now + 10s) so a half-dead peer is evicted within 10 s instead of holding resources forever.

Reproduction

# repro-pc.yaml
version: "0.5"
processes:
  noisy:
    command: "i=0; while true; do printf 'line-%06d %s\n' $i $(date +%s%N); i=$((i+1)); done"
    availability:
      restart: "no"
process-compose -f repro-pc.yaml --port 8089 up -t=false &

# A follower that completes the WS handshake and then stops reading.
python3 ws_slow.py 8089 noisy true 999 &
SLOW=$!; sleep 0.5; kill -STOP $SLOW   # freeze the consumer

sleep 30   # let server-side Send-Q saturate

# Before the fix this hangs forever; REST still works.
process-compose --port 8089 process logs --tail 3 noisy
curl http://127.0.0.1:8089/process/logs/noisy/0/3

ws_slow.py is a ~30 line script that performs the WS upgrade and then sleeps without reading. Happy to add it under scripts/ or as a test if useful.

Verification

Step Before After
Send-Q to the frozen follower rises to ~1.9 MB, then stays there indefinitely rises to ~1.9 MB, then the follower is evicted by the write deadline (server fd count drops)
Second process logs after step above hangs (timeout, 0 bytes) succeeds immediately
SIGCONT on the slow client does not recover recovers immediately (was never broken once the deadline ticked)
REST /process/logs/... works throughout works throughout

No new tests added in this PR — happy to add a regression test (slow ws consumer + assertion that a second GetLogsAndSubscribe returns within a deadline) if you'd like.

@F1bonacc1 F1bonacc1 left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @hebo6,

Thank you for this PR and excellent diagnosis.
There might be one critical issue that slipped through the cracks.
See my comments.

Comment thread src/api/ws_api.go
Comment thread src/api/ws_api.go Outdated
Comment on lines +91 to +92
default:
}

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
default:
}
default:
dropped++
if !warnedOnce {
log.Warn().Str("process", processName).
Msg("ws subscriber backpressured; dropping log lines")
warnedOnce = true
}
}
...
// in handleLog's defer:
if dropped > 0 {
log.Warn().Str("process", procName).Uint64("dropped", dropped).
Msg("ws subscriber disconnected after dropped lines")
}

@hebo6

hebo6 commented May 24, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the review.

I updated the implementation to use a per-WebSocket write mutex created in HandleLogsStream and shared by all handleLog goroutines for that request. This keeps writes serialized for a single ws.Conn without reintroducing the global API-level mutex.

I also added observability for dropped log lines: the first backpressure drop logs a warning, and disconnect cleanup logs the total number of dropped lines for that subscriber.

While touching this area, I also guarded logChan close/send with the same channel-state mutex to avoid a possible send-on-closed-channel panic when the client disconnects while log callbacks are still running.

@sonarqubecloud

Copy link
Copy Markdown

@F1bonacc1 F1bonacc1 left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks!

@F1bonacc1 F1bonacc1 merged commit a92cb36 into F1bonacc1:main May 24, 2026
7 checks passed
@hebo6 hebo6 deleted the fix/logs-ws-deadlock branch May 25, 2026 08:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants