gRPC builder stream silently dies, builder never reconnects #1674

@angerman

Summary

The gRPC bidirectional stream between hydra-builder and the queue runner silently dies. When this happens:

  1. The builder process keeps running but stops receiving new work — no child nix processes, idle CPU
  2. The queue runner still considers the builder's job slots as occupied (currentJobs: 4) and never reclaims them
  3. sinceLastPing on the runner side grows indefinitely (observed 51,696s / ~14h and 39,156s / ~10.9h)
  4. The builder's --ping-interval 10 does not detect the dead stream or trigger reconnection

The only known remedy is to manually restart the builder service (launchctl kickstart -k), after which the builder immediately reconnects and starts receiving builds.

Observed pattern

From the builder logs, the sequence before going silent is typically:

INFO  Finished building <drv>
INFO  Start uploading paths to queue runner directly
INFO  Finished uploading paths to queue runner directly. elapsed=138ms
INFO  Successfully completed build process for <drv>
INFO  Building <next-drv>                    ← receives new build
      warning: file ... does not exist in binary cache ...  ← fetching inputs
      ... (silence — no more log entries for 10+ hours)

The builder appears to get stuck during the input-fetching phase of a build. Meanwhile the gRPC stream dies — possibly because the long input-fetch blocks the ping response, or a network interruption goes undetected.

From the queue runner's /status endpoint:

{
  "hostname": "builder-A",
  "sinceLastPing": 51696,
  "currentJobs": 4,
  "failedBuilds": 0,
  "succeededBuilds": 161
}

Other builders on the same network report sinceLastPing values of 2-6 seconds and are healthy.
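
Until this is fixed, the stuck state can be spotted from the status data alone. Below is a minimal sketch, assuming the /status output can be reduced to a list of objects shaped like the one above (the real response layout may differ). The helper name `stale_builders`, the 60-second threshold, and the `builder-B` entry are made up for illustration.

```python
# Hypothetical helper, not part of hydra: flags builders whose last ping
# is older than a threshold, given entries shaped like the /status excerpt.

def stale_builders(entries, max_ping_age_s=60):
    """Return hostnames whose sinceLastPing exceeds max_ping_age_s seconds."""
    return [
        e["hostname"]
        for e in entries
        if e.get("sinceLastPing", 0) > max_ping_age_s
    ]


# builder-A mirrors the values observed above; builder-B is an invented
# healthy builder for comparison.
sample = [
    {"hostname": "builder-A", "sinceLastPing": 51696, "currentJobs": 4},
    {"hostname": "builder-B", "sinceLastPing": 4, "currentJobs": 2},
]
print(stale_builders(sample))  # ['builder-A']
```

A cron job or monitoring check built on something like this could at least alert on (or restart) a wedged builder while the underlying bug is open.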

Expected behavior

  • The builder should detect the dead gRPC stream (via ping timeout or TCP keepalive) and reconnect automatically
  • The queue runner should have a configurable timeout for sinceLastPing — after which it marks the builder as disconnected, reclaims the job slots, and reschedules the builds on other machines
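
To make the two expectations concrete, here is a rough sketch of the logic I have in mind. It is not based on the actual hydra code; every name in it (`PingWatchdog`, `BuilderSlot`, `reap_stale`, `missed_allowed`, `ping_timeout_s`) is hypothetical, and timestamps are passed in explicitly so the logic stays independent of any particular clock or runtime.

```python
from dataclasses import dataclass, field


@dataclass
class PingWatchdog:
    """Builder side (hypothetical): declares the stream dead when no ping
    reply has arrived for several ping intervals."""
    ping_interval_s: float
    missed_allowed: int = 3
    last_pong: float = 0.0

    def on_pong(self, now):
        # Record the time of the most recent successful ping reply.
        self.last_pong = now

    def stream_is_dead(self, now):
        # Dead once more than missed_allowed intervals passed without a reply.
        return now - self.last_pong > self.ping_interval_s * self.missed_allowed


@dataclass
class BuilderSlot:
    """Runner side (hypothetical): per-builder state the queue runner tracks."""
    hostname: str
    last_ping: float
    current_jobs: list = field(default_factory=list)


def reap_stale(builders, now, ping_timeout_s):
    """Remove builders silent for longer than ping_timeout_s.

    Returns the jobs that were assigned to them so the caller can
    reschedule those builds on other machines.
    """
    to_reschedule = []
    for hostname in list(builders):  # copy keys: we delete while iterating
        slot = builders[hostname]
        if now - slot.last_ping > ping_timeout_s:
            to_reschedule.extend(slot.current_jobs)
            del builders[hostname]
    return to_reschedule
```

On the builder, `stream_is_dead` returning true would trigger a teardown and reconnect; the check would need to run independently of build work so that a long input fetch cannot starve it. On the runner, `reap_stale` would run periodically with the configured timeout. In a real implementation the `now` values should come from a monotonic clock.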

Environment

  • Queue runner: hydra-queue-runner 0.1.0-c1fe4808
  • Builder: hydra-builder 0.1.0-c1fe4808
  • --ping-interval 10 configured on all builders
  • 10 darwin builders on a LAN, connected to queue runner via IPv6
  • Observed across multiple builders, recurring after restarts

Workaround

Restart the builder service. It reconnects immediately and starts receiving builds within seconds.
