Skip to content

fix: re-check step outputs on re-encounter to unstick cached builds#1671

Open
angerman wants to merge 2 commits into
NixOS:masterfrom
angerman:angerman/fix-cached-builds-stuck
Open

fix: re-check step outputs on re-encounter to unstick cached builds#1671
angerman wants to merge 2 commits into
NixOS:masterfrom
angerman:angerman/fix-cached-builds-stuck

Conversation

@angerman
Copy link
Copy Markdown
Contributor

Summary

Fixes an infinite re-load loop where builds whose outputs became available
between poll cycles (via concurrent builds, substitution, or external upload)
get stuck forever.

When a step is encountered that already exists (is_new = false), the code
previously just returned it as-is. If the step's outputs had become available
in the store since it was first created, it would be dispatched, the builder
would find nothing to do, and the cycle would repeat indefinitely.

Fix

Add two checks when is_new = false:

  1. If the step is already marked finished, return None (skip it)
  2. Query the store for missing outputs — if all are present, mark the step
    finished and return None

Origin

Originally developed and deployed in zw3rk/hydra-queue-runner (commit
67ef4845), adapted here to the upstream harmonia StorePath types.

Test plan

  • Verify that builds whose outputs appear in the store between poll
    cycles no longer get stuck in an infinite re-dispatch loop
  • Verify normal build dispatch is unaffected

When create_step encounters an existing step (is_new=false), it now
re-checks whether the step's outputs have appeared in the local nix
store since the step was first created. Without this check, builds
whose outputs became available between poll cycles (e.g. via
concurrent builds, substitution, or external upload) would get stuck
in an infinite re-load loop: the DB says finished=0, the step already
exists in memory so create_build inserts the build but never routes
it to handle_cached_build.

The fix adds two checks before returning Valid(step):
1. If the step is already marked finished, return None immediately.
2. Query missing outputs from the local store; if all are present,
   mark the step finished, insert into finished_drvs, and return None
   so create_build routes to handle_cached_build.

Adapted from zw3rk/hydra-queue-runner commit 67ef484.
@Ericson2314
Copy link
Copy Markdown
Member

I talked to @angerman elsewhere, and it came up this is probably happening because of IFD — regularly scheduled builds do not have this race condition.

I'm fine merging this for now then. Long term we should replace this with interposing the daemon socket so IFD builds can also be distributed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants