Summary
The gRPC bidirectional stream between hydra-builder and the queue runner silently dies. When this happens:
- The builder process keeps running but stops receiving new work: no child nix processes, idle CPU
- The queue runner still considers the builder's job slots occupied (currentJobs: 4) and never reclaims them
- sinceLastPing on the runner side grows indefinitely (observed 51,696s / ~14h and 39,156s / ~10.9h)
- The builder's --ping-interval 10 does not detect the dead stream or trigger reconnection
The only fix is to manually restart the builder service (launchctl kickstart -k), after which it immediately reconnects and starts receiving builds.
Observed pattern
From the builder logs, the sequence before going silent is typically:
INFO Finished building <drv>
INFO Start uploading paths to queue runner directly
INFO Finished uploading paths to queue runner directly. elapsed=138ms
INFO Successfully completed build process for <drv>
INFO Building <next-drv> ← receives new build
warning: file ... does not exist in binary cache ... ← fetching inputs
... (silence — no more log entries for 10+ hours)
The builder appears to get stuck during the input-fetching phase of a build. Meanwhile the gRPC stream dies — possibly because the long input-fetch blocks the ping response, or a network interruption goes undetected.
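To make the first hypothesis concrete, here is a minimal sketch of how a synchronous call can starve a heartbeat that shares its event loop. This is an analogy in Python's asyncio, not hydra-builder code: whether the builder's ping task actually shares an executor with the input fetch is unverified, and every name below (heartbeat, longest_heartbeat_gap, the durations) is hypothetical.

```python
import asyncio
import time


async def heartbeat(interval_s, stamps):
    """Record a timestamp every interval_s seconds, like a periodic ping."""
    while True:
        stamps.append(time.monotonic())
        await asyncio.sleep(interval_s)


async def longest_heartbeat_gap(blocking_s=0.3, interval_s=0.05):
    """Return the longest gap between heartbeats when the loop gets blocked."""
    stamps = []
    task = asyncio.create_task(heartbeat(interval_s, stamps))

    # Heartbeats flow normally while the loop is free.
    await asyncio.sleep(interval_s * 3)

    # Stand-in for a synchronous input fetch: time.sleep() does not yield
    # to the event loop, so the heartbeat task cannot run until it returns.
    time.sleep(blocking_s)

    # Give the heartbeat a chance to resume, then stop it.
    await asyncio.sleep(interval_s * 3)
    task.cancel()

    return max(later - earlier for earlier, later in zip(stamps, stamps[1:]))


if __name__ == "__main__":
    gap = asyncio.run(longest_heartbeat_gap())
    print(f"longest gap between heartbeats: {gap:.3f}s")
```

If something like this is happening, the longest gap tracks the duration of the blocking call rather than the ping interval. Moving the blocking work off the loop (asyncio.to_thread in this analogy) would keep pings flowing.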
From the queue runner's /status endpoint:
{
"hostname": "builder-A",
"sinceLastPing": 51696,
"currentJobs": 4,
"failedBuilds": 0,
"succeededBuilds": 161
}
Other builders on the same network show sinceLastPing: 2-6 and are healthy.
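Until this is fixed, the stuck state can at least be detected from the outside. Below is a rough sketch that flags stale builders given entries shaped like the one above. How /status wraps these entries is not shown here, so the function assumes they have already been fetched and collected into a list. The builder-B entry and the "three missed pings" threshold are assumptions for illustration.

```python
PING_INTERVAL_S = 10  # matches --ping-interval 10 on the builders
MISSED_PINGS = 3      # assumption: tolerate three missed pings before flagging


def find_stale_builders(entries, ping_interval_s=PING_INTERVAL_S,
                        missed_pings=MISSED_PINGS):
    """Return hostnames that hold job slots but have not pinged recently."""
    threshold_s = ping_interval_s * missed_pings
    return [
        entry["hostname"]
        for entry in entries
        if entry["sinceLastPing"] > threshold_s and entry["currentJobs"] > 0
    ]


SAMPLE = [
    # Entry copied from the /status output above.
    {"hostname": "builder-A", "sinceLastPing": 51696, "currentJobs": 4,
     "failedBuilds": 0, "succeededBuilds": 161},
    # Hypothetical healthy builder, within the observed 2-6 s range.
    {"hostname": "builder-B", "sinceLastPing": 4, "currentJobs": 2,
     "failedBuilds": 0, "succeededBuilds": 150},
]

if __name__ == "__main__":
    print(find_stale_builders(SAMPLE))  # → ['builder-A']
```

A cron job could feed the result into the restart workaround, though that only hides the underlying problem.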
Expected behavior
- The builder should detect the dead gRPC stream (via ping timeout or TCP keepalive) and reconnect automatically
- The queue runner should have a configurable timeout for sinceLastPing, after which it marks the builder as disconnected, reclaims the job slots, and reschedules the builds on other machines
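For the runner-side half, the behavior I have in mind looks roughly like the sketch below. It is not based on the actual hydra-queue-runner code. BuilderRegistry, record_ping, assign, reap, and ping_timeout_s are all hypothetical names, and the clock is injected only so the logic can be exercised without waiting.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Builder:
    hostname: str
    last_ping: float
    jobs: list = field(default_factory=list)


class BuilderRegistry:
    """Tracks connected builders and reclaims jobs from silent ones."""

    def __init__(self, ping_timeout_s, clock=time.monotonic):
        self.ping_timeout_s = ping_timeout_s  # the proposed configurable timeout
        self.clock = clock
        self.builders = {}

    def record_ping(self, hostname):
        """Register the builder on first ping, refresh its timestamp after."""
        builder = self.builders.get(hostname)
        if builder is None:
            builder = Builder(hostname, self.clock())
            self.builders[hostname] = builder
        else:
            builder.last_ping = self.clock()
        return builder

    def assign(self, hostname, drv):
        """Occupy one job slot on a connected builder."""
        self.builders[hostname].jobs.append(drv)

    def reap(self):
        """Drop builders past the timeout and return their jobs for requeueing."""
        now = self.clock()
        requeue = []
        for hostname in list(self.builders):
            builder = self.builders[hostname]
            if now - builder.last_ping > self.ping_timeout_s:
                requeue.extend(builder.jobs)
                del self.builders[hostname]
        return requeue
```

Calling reap() periodically would cap how long a dead builder can hold slots at roughly the timeout plus one reap interval, instead of indefinitely as observed here.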
Environment
- Queue runner: hydra-queue-runner 0.1.0-c1fe4808
- Builder: hydra-builder 0.1.0-c1fe4808
- --ping-interval 10 configured on all builders
- 10 darwin builders on a LAN, connected to queue runner via IPv6
- Observed across multiple builders, recurring after restarts
Workaround
Restart the builder service. It reconnects immediately and starts receiving builds within seconds.