fix: re-check step outputs on re-encounter to unstick cached builds#1671
Open
angerman wants to merge 2 commits into
Open
fix: re-check step outputs on re-encounter to unstick cached builds#1671angerman wants to merge 2 commits into
angerman wants to merge 2 commits into
Conversation
When create_step encounters an existing step (is_new=false), it now re-checks whether the step's outputs have appeared in the local nix store since the step was first created. Without this check, builds whose outputs became available between poll cycles (e.g. via concurrent builds, substitution, or external upload) would get stuck in an infinite re-load loop: the DB says finished=0, the step already exists in memory so create_build inserts the build but never routes it to handle_cached_build. The fix adds two checks before returning Valid(step): 1. If the step is already marked finished, return None immediately. 2. Query missing outputs from the local store; if all are present, mark the step finished, insert into finished_drvs, and return None so create_build routes to handle_cached_build. Adapted from zw3rk/hydra-queue-runner commit 67ef484.
Member
|
I talked to @angerman elsewhere, and it came up this is probably happening because of IFD — regularly scheduled builds do not have this race condition. I'm fine merging this for now then. Long term we should replace this with interposing the daemon socket so IFD builds can also be distributed. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes an infinite re-load loop where builds whose outputs became available
between poll cycles (via concurrent builds, substitution, or external upload)
get stuck forever.
When a step is encountered that already exists (
is_new = false), the codepreviously just returned it as-is. If the step's outputs had become available
in the store since it was first created, it would be dispatched, the builder
would find nothing to do, and the cycle would repeat indefinitely.
Fix
Add two checks when
is_new = false:None(skip it)finished and return
NoneOrigin
Originally developed and deployed in
zw3rk/hydra-queue-runner(commit67ef4845), adapted here to the upstream harmoniaStorePathtypes.Test plan
cycles no longer get stuck in an infinite re-dispatch loop