queue-runner and builder: fail faster on infrastructure errors#1748
queue-runner and builder: fail faster on infrastructure errors#1748Ericson2314 wants to merge 1 commit into
Conversation
77dc98b to
706238a
Compare
e61875b to
9fab2e1
Compare
|
@dasJ the second commit of this I would very much like your feedback on. The general rules I am thinking are:
I think all the edits here fit that pattern but I definitely didn't finish reviewing them yet. I am also confused why we had in so many places before. |
|
I am fully on board with most of these rules. Not dying is never a bad thing ;) The only thing I am unsure is the "don't die if fail to upload". Wdym by that? I don't see any unwraps or expects so I assume that's "assume the build succeeded". I don't think that's what we want because it could mean we end up with an incomplete binary cache. |
|
I think doing #1755 first will make this more clear.
To be clear, this was something I am keeping! It already is OK with "destination stores" being down, and I while I am not obvious to me that we want that, I didn't feel certain enough that we didn't to change it. |
Also to be clear, this PR should generally be making is die more. The counter-intuitive logic of "fail fast" is that not-dying can allow simple problems to fester, only for one to go down with a more complex problem later, and so it ends up being more sysadmin/debubgging work than it would be otherwise. |
a26c728 to
e2302ac
Compare
The queue-runner should not try to recover from a down database or invalid store state — only builder failures should be totally recoverable since builders are numerous and expected to come and go. The builder should die if it cannot communicate with the queue-runner — the queue-runner will notice and retry the step elsewhere. **Note: Manual review of the queue runner error comments is not yet complete!** Queue-runner changes: - Critical task loops (queue monitor, dispatch) now return `JoinHandle<Result<()>>` and are joined in the main `select!` loop. If either fails, the queue-runner exits with the error. Non-critical tasks (config reloader, uploader, fod checker) remain abort-only. - `create_build` returns `Result`; an invalid drv path is a hard error (indicates a GC rooting bug) rather than silently aborting the build - `handle_previous_failure` and `handle_cached_build` errors propagate (DB unreachable = stop processing) - `process_new_builds` returns `Result` and propagates child build creation errors - Queue monitor loop propagates `get_queued_builds`, `process_queue_change`, and `handle_jobset_change` errors instead of logging and continuing - `fail_step` handles missing jobs in queue/machine as warnings (race with builder disconnect), but DB errors propagate. Callers (`remove_machine`, cancelled steps) now use `?` instead of log-and-continue. - `abort_unsupported` propagates `inner_fail_job` DB errors through the dispatch loop. Unsupported build step log level downgraded from `error` to `warn` (expected operational condition, not a bug). - Fire-and-forget gRPC handlers (`build_step_update`, `complete_build`) use `spawn_fatal` — DB errors panic the task, bringing down the queue-runner. Builder parse errors (e.g. bad `BuildOutput`) are logged and dropped without crashing. - Machine tunnel loop: `remove_machine` is called once after the loop exits instead of at every break point. Channel errors downgraded from `error` to `warn` (builder disconnections are expected). - gRPC error messages now include the underlying error instead of generic strings. Redundant `tracing::error` calls removed where `#[tracing::instrument(err)]` already logs the error. - `parse_uuid` helper deduplicates UUID parsing across gRPC handlers. - Log directory creation at startup is fatal on failure - mTLS misconfiguration uses `anyhow::ensure!` Builder changes: - `filter_missing` returns `Result` — daemon connection failure is a hard error, not "path is present" - `handle_request` errors propagate (kills the builder — queue-runner will retry) - Ping message construction failure breaks the ping stream (builder exits), with the error captured and returned from the gRPC function - `submit_build_result` helper deduplicates the retry-then-report pattern for `complete_build` calls - Build task returns `Result<()>` instead of `()` — failure to submit results after retries propagates rather than calling `process::exit(1)` - mTLS misconfiguration uses `anyhow::ensure!` - Unreachable error paths in log stream (`utils.rs`) now panic with explanatory comments - `substitute_output` S3 replication uses async-block-as-try-block pattern for clean early exit on daemon error
WIP. See commit for preliminary details --- it is a long run on commit message because it's a bunch of separate edits tacked together, and it will probably become multiple commits once finished up.
I am hoping this will make it easier to fix the remaining issues, especially the ones that are hard to reproduce out of real-world testing