Fix deadlock that wedges the logs WebSocket for an entire process#495
Conversation
| default: | ||
| } |
There was a problem hiding this comment.
| default: | |
| } | |
| default: | |
| dropped++ | |
| if !warnedOnce { | |
| log.Warn().Str("process", processName). | |
| Msg("ws subscriber backpressured; dropping log lines") | |
| warnedOnce = true | |
| } | |
| } | |
| ... | |
| // in handleLog's defer: | |
| if dropped > 0 { | |
| log.Warn().Str("process", procName).Uint64("dropped", dropped). | |
| Msg("ws subscriber disconnected after dropped lines") | |
| } | |
|
Thanks for the review. I updated the implementation to use a per-WebSocket write mutex created in I also added observability for dropped log lines: the first backpressure drop logs a warning, and disconnect cleanup logs the total number of dropped lines for that subscriber. While touching this area, I also guarded |
|



Symptom
After running for a while (especially with TUI
attachviewing log of any process),process logs <name>and the TUI log panel hang forever for that process. New WS subscribers complete the101 Switching Protocolshandshake but the server never sends a single frame. Only the REST endpoint/process/logs/:name/:endOffset/:limitcontinues to work. The only recovery is restartingprocess-compose, which kills every supervised process.Root cause
Three interacting issues form the deadlock:
ProcessLogBuffer.WriteholdsmxObsacrossobserver.WriteString. If any single observer'sWriteStringblocks, every subsequentWriteand every newGetLogsAndSubscribeon this buffer is stuck behind the same lock forever.HandleLogsStream'sslHandlerdoes a blockinglogChan <- msg. The per-subscriberlogChanhas capacity 256. IfhandleLog'sWriteJSONis blocked (e.g. peer's TCP receive buffer is full because the client stopped reading),logChanfills,slHandlerblocks, and then (1) kicks in.api.wsMtxguards everyws.WriteJSONin bothhandleLogandhandleStateStream. gorilla/websocket already enforces a single writer perConninternally, so this mutex is redundant — but worse, it lets a write that's stuck on one ws connection block writes on every other ws connection in the process.A half-dead follower client (TUI attach that lost focus, network blip, etc.) is enough to trip (3) → (2) → (1) and permanently wedge logs for the affected process.
Fix
src/pclog/process_log_buffer.go—Writenow snapshotsb.observersundermxObsand releases the lock before fanning out toWriteString. A slow observer can no longer stall other writes or new subscriptions.src/api/ws_api.go—slHandlerswitches to a non-blockingselect { case logChan <- msg: default: }. Dropping a line for a back-pressured subscriber is far better than deadlocking the producer.src/api/ws_api.go+src/api/state_ws_api.go— drop the globalapi.wsMtxaroundWriteJSON(gorilla'sConnis already single-writer-safe) and addws.SetWriteDeadline(now + 10s)so a half-dead peer is evicted within 10 s instead of holding resources forever.Reproduction
ws_slow.pyis a ~30 line script that performs the WS upgrade and then sleeps without reading. Happy to add it underscripts/or as a test if useful.Verification
Send-Qto the frozen followerprocess logsafter step aboveSIGCONTon the slow client/process/logs/...No new tests added in this PR — happy to add a regression test (slow ws consumer + assertion that a second
GetLogsAndSubscribereturns within a deadline) if you'd like.