Worker resilience, smarter load balancing, and connection stability #15

Merged: obrucheoghene merged 5 commits into softhon:main from obrucheoghene:main on Mar 25, 2026
@obrucheoghene (Collaborator)

Worker resilience, smarter load balancing, and connection stability

Summary

  • Worker auto-recovery: Implement handleWorkerDeath to automatically replace dead mediasoup workers instead of silently losing them. Dead workers are removed from the pool and a replacement is spawned immediately.
  • Weighted load scoring: Replace naive peer-count load balancing with a weighted score (40% CPU, 30% memory, 30% peer count) using real resource usage data. getLeastLoadedWorker is now async to support live usage polling.
  • Heartbeat mechanism: Implement setupHeartbeat in SignalNode — sends a ping every 30s and disconnects the client on timeout, preventing zombie connections from silently accumulating.
  • Request timeouts: Add a 30s timeout to all pending sendRequest calls so hung requests fail fast instead of leaking memory indefinitely.
  • Selective consumer creation: Cap initial consumer creation at MAX_INITIAL_CONSUMERS (default 25) when a peer joins, preventing O(n²) transport explosion in large rooms.
  • Concurrent router piping: Switch producer-to-router piping from a forEach loop to an awaited Promise.all, so that all pipes have completed before consumers are created and cross-router canConsume() checks succeed.
  • Expanded RTC port range: Widen the default range from 2000–2300 (about 300 ports) to 10000–60000 (about 50,000 ports) to support higher peer counts.
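
For reviewers, the weighted scoring described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `WorkerLoad` shape, `MAX_PEERS_PER_WORKER`, `loadScore`, and `pickLeastLoaded` are hypothetical names, and it assumes CPU and memory have already been sampled and normalised to the 0..1 range (the real `getLeastLoadedWorker` is async because it polls usage live).

```typescript
// Hypothetical snapshot of one worker's load; the PR's real types are not shown.
interface WorkerLoad {
  id: string;
  cpu: number;    // CPU utilisation, assumed normalised to 0..1
  memory: number; // memory utilisation, assumed normalised to 0..1
  peers: number;  // peers currently hosted on this worker
}

// Weights taken from the PR summary: 40% CPU, 30% memory, 30% peer count.
const WEIGHTS = { cpu: 0.4, memory: 0.3, peers: 0.3 } as const;

// Assumption: peer count is normalised against a fixed per-worker cap.
const MAX_PEERS_PER_WORKER = 100;

function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

// Lower score means less loaded.
function loadScore(w: WorkerLoad): number {
  const peerRatio = clamp01(w.peers / MAX_PEERS_PER_WORKER);
  return (
    WEIGHTS.cpu * clamp01(w.cpu) +
    WEIGHTS.memory * clamp01(w.memory) +
    WEIGHTS.peers * peerRatio
  );
}

// Returns the worker with the lowest weighted score.
function pickLeastLoaded(workers: WorkerLoad[]): WorkerLoad {
  if (workers.length === 0) {
    throw new Error("worker pool is empty");
  }
  return workers.reduce((best, w) => (loadScore(w) < loadScore(best) ? w : best));
}
```

Under these assumptions a worker with few peers but a hot CPU can score worse than a worker with many peers and idle resources, which is the case plain peer counting gets wrong.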

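The consumer cap can be pictured with a small sketch like the one below. It is illustrative only: `ProducerInfo` and `selectInitialProducers` are hypothetical names, and taking the first N producers in list order is an assumption, since the summary does not say how the PR chooses which producers to consume first.

```typescript
// Default taken from the PR summary.
const MAX_INITIAL_CONSUMERS = 25;

// Hypothetical minimal description of a producer in the room.
interface ProducerInfo {
  producerId: string;
  peerId: string;
}

// Splits the room's producers into those consumed immediately on join
// and those left for later (for example, on demand).
function selectInitialProducers(
  all: ProducerInfo[],
  joiningPeerId: string,
  limit: number = MAX_INITIAL_CONSUMERS,
): { initial: ProducerInfo[]; deferred: ProducerInfo[] } {
  // A peer never consumes its own producers.
  const others = all.filter((p) => p.peerId !== joiningPeerId);
  return {
    initial: others.slice(0, limit),
    deferred: others.slice(limit),
  };
}
```

With a cap in place, each join creates at most `limit` consumers regardless of room size, so the total grows linearly with the number of peers rather than quadratically.
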
Test plan

  • Join a room with an existing peer — verify consumers are created up to the MAX_INITIAL_CONSUMERS limit
  • Kill a mediasoup worker process mid-call — verify a replacement worker is spawned and the server remains healthy
  • Let a client go silent (no heartbeat ack) for >30s — verify it is disconnected with heartbeat_timeout
  • Trigger a slow/stuck action handler — verify the pending request rejects after 30s with a timeout error
  • Confirm existing multi-router piping still works (producers visible across routers)
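
To make the request-timeout check above easier to reason about, here is a minimal sketch of a pending-request table whose entries expire. It is not the PR's implementation: the `PendingRequests` class and its method names are hypothetical, and only the 30s default comes from the summary.

```typescript
// Bookkeeping for one in-flight request.
type PendingEntry = {
  resolve: (value: unknown) => void;
  reject: (error: Error) => void;
  timer: ReturnType<typeof setTimeout>;
};

// Hypothetical pending-request table; entries are removed on response or timeout.
class PendingRequests {
  private readonly pending = new Map<number, PendingEntry>();
  private readonly timeoutMs: number;
  private nextId = 0;

  constructor(timeoutMs: number = 30_000) {
    this.timeoutMs = timeoutMs;
  }

  // Registers a request and returns its id plus a promise that settles
  // when a response arrives or the timeout fires, whichever comes first.
  create(): { id: number; promise: Promise<unknown> } {
    const id = this.nextId++;
    const promise = new Promise<unknown>((resolve, reject) => {
      const timer = setTimeout(() => {
        // Drop the entry so a hung request cannot leak.
        this.pending.delete(id);
        reject(new Error(`request ${id} timed out after ${this.timeoutMs} ms`));
      }, this.timeoutMs);
      this.pending.set(id, { resolve, reject, timer });
    });
    return { id, promise };
  }

  // Called when a response arrives; returns false for unknown or late ids.
  resolve(id: number, value: unknown): boolean {
    const entry = this.pending.get(id);
    if (!entry) return false;
    clearTimeout(entry.timer);
    this.pending.delete(id);
    entry.resolve(value);
    return true;
  }

  get size(): number {
    return this.pending.size;
  }
}
```

A test for the timeout case can construct the table with a short timeout, never deliver a response, and assert both that the promise rejects and that `size` returns to zero.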

@obrucheoghene obrucheoghene merged commit 9244295 into softhon:main Mar 25, 2026