Cluster worker: hybrid deployment (containerised default + native/bare-metal option) under one transport-agnostic contract #892

## Decision (Jay 2026-06-14)
Support BOTH a containerised cluster worker and a native/bare-metal worker, not one or the other. Containerised is the recommended default; native is an opt-in for dedicated nodes. Relates to #890 (worker auto-update).

## Why containerised as the default (esp. GPU workers)
First-hand evidence while standing up the RTX 3060 SD backend: native CUDA on bleeding-edge Fedora 43 (gcc 15 / glibc 2.42) would not build (CUDA 12.9 nvcc rejects gcc>14; gcc-14 then hit a glibc mathcalls.h incompatibility). The container path (`nvidia-container-toolkit` + `docker run --gpus all`) bundled a compatible toolchain and saw the GPU immediately.
- Makes "install CUDA if missing" tractable: ensure driver + container toolkit, run image, instead of maintaining a per-distro driver/CUDA/kernel-module matrix.
- Makes #890 auto-update clean and reversible: pull new image tag, drain, swap, health-check, rollback to the prior tag on failure.
- Universal management: one update path, reproducible; multiple backends (sd.cpp, ComfyUI, ollama) coexist as separate containers without polluting the host.
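
The pull / drain / swap / health-check / rollback sequence from the #890 bullet could look roughly like the sketch below. It is illustrative only: `update_with_rollback`, `UpdateResult`, and the four injected callables are hypothetical names, not an existing worker API, and the real steps (image pull, container restart) are left to the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class UpdateResult:
    active_tag: str                       # tag serving after the attempt
    rolled_back: bool                     # True if the new tag failed its health check
    steps: List[str] = field(default_factory=list)


def update_with_rollback(
    current_tag: str,
    new_tag: str,
    pull: Callable[[str], None],          # hypothetical: fetch the image for a tag
    drain: Callable[[], None],            # hypothetical: stop taking work, finish in-flight jobs
    swap: Callable[[str], None],          # hypothetical: restart the backend on a tag
    healthy: Callable[[], bool],          # hypothetical: post-swap health check
) -> UpdateResult:
    """Pull, drain, swap, health-check; swap back to current_tag on failure."""
    result = UpdateResult(active_tag=current_tag, rolled_back=False)

    pull(new_tag)                         # pull before draining so the drain window stays short
    result.steps.append(f"pull:{new_tag}")

    drain()
    result.steps.append("drain")

    swap(new_tag)
    result.steps.append(f"swap:{new_tag}")

    if healthy():
        result.active_tag = new_tag
        return result

    # Health check failed: return to the prior tag (assumes it has not been pruned).
    swap(current_tag)
    result.steps.append(f"rollback:{current_tag}")
    result.rolled_back = True
    return result
```

Passing the steps in as callables keeps the sequence itself independent of how a tag is actually fetched or started, which is the property the single update path relies on.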

## Why keep a native/bare-metal option
- Dedicated homelab/enthusiast nodes want max performance, minimal footprint, no container runtime.
- Locked-down or older hosts where Docker/podman is unavailable or GPU passthrough is finicky.
- Simpler mental model for a single-purpose box.

## Design: one worker, two deployment adapters
Define the worker as a transport-agnostic capability contract: register with the controller, advertise backends/capabilities, drain protocol, `apply_update()`, and health-check. The same worker-agent logic runs whether the backend process is a native binary or a container. "Container vs bare-metal" is then a deployment adapter behind a single `apply_update()` abstraction (image pull vs package/binary update), which avoids maintaining two codebases.
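
As a starting point for the contract/adapter brainstorm, the boundary might be sketched as follows. Only `apply_update()` is named in this issue; `DeploymentAdapter`, `ContainerAdapter`, `NativeAdapter`, `WorkerAgent`, and the payload fields are hypothetical, and the adapters return a description of what they would do instead of touching the host.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class DeploymentAdapter(ABC):
    """Hypothetical adapter boundary: how a backend is updated and checked."""

    @abstractmethod
    def apply_update(self, version: str) -> str:
        """Move the backend to `version`; returns a description of the action."""

    @abstractmethod
    def health_check(self) -> bool:
        """Report whether the backend is serving."""


class ContainerAdapter(DeploymentAdapter):
    def __init__(self, image: str) -> None:
        self.image = image

    def apply_update(self, version: str) -> str:
        # A real adapter would pull the image and restart the container here.
        return f"pull image {self.image}:{version}"

    def health_check(self) -> bool:
        return True  # placeholder


class NativeAdapter(DeploymentAdapter):
    def __init__(self, package: str) -> None:
        self.package = package

    def apply_update(self, version: str) -> str:
        # A real adapter would install the package/binary and restart the service.
        return f"install package {self.package}={version}"

    def health_check(self) -> bool:
        return True  # placeholder


class WorkerAgent:
    """Shared worker-agent logic; the adapter is the only deployment-specific part."""

    def __init__(self, adapter: DeploymentAdapter, backends: List[str]) -> None:
        self.adapter = adapter
        self.backends = backends

    def registration(self) -> Dict[str, object]:
        # The payload deliberately omits the adapter kind: the controller sees
        # the same contract for both deployments.
        return {"backends": sorted(self.backends), "healthy": self.adapter.health_check()}

    def apply_update(self, version: str) -> str:
        return self.adapter.apply_update(version)
```

With this shape, a containerised and a native worker advertising the same backends produce identical registration payloads, and only the `apply_update()` action differs.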

## Installer behaviour
Detect the host and recommend:
- Has, or can install, a container runtime -> containerised (default). For GPU: ensure driver + nvidia-container-toolkit, pull the backend image.
- User picks "dedicate this machine / minimal install" -> native path; the installer makes a best-effort attempt to set up the driver/CUDA per distro, and falls back to the container path when native GPU setup proves too fragile.
- Cross-platform per the worker scripts policy (bash + powershell; Linux/macOS/Windows).
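
The recommendation rules above could be expressed as a small decision function. This is a sketch under assumptions: `recommend_deployment` and its parameters are hypothetical, and checking for `docker` or `podman` on `PATH` via `shutil.which` is only one possible way to detect a container runtime. The real installer would be bash/PowerShell per the worker scripts policy; Python is used here just to pin down the logic.

```python
import shutil
from typing import Callable, Optional


def recommend_deployment(
    wants_dedicated: bool,
    can_install_runtime: bool,
    native_gpu_setup_failed: bool = False,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return "container" or "native" following the installer rules.

    `which` is injectable so the decision can be tested without a real host.
    """
    has_runtime = any(which(name) is not None for name in ("docker", "podman"))
    runtime_ok = has_runtime or can_install_runtime

    # Explicit opt-in to native wins, unless native GPU setup already failed
    # and a container runtime is available as the fallback.
    if wants_dedicated and not (native_gpu_setup_failed and runtime_ok):
        return "native"

    # Default: containerised whenever a runtime is present or installable.
    return "container" if runtime_ok else "native"
```

The last branch covers locked-down or older hosts with no usable container runtime, where native is the only option left.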

## Acceptance
A new GPU worker can be brought up either way from the installer; both variants register and serve the same capability contract, and #890's update/drain/rollback works identically across both adapters. Brainstorm the contract + adapter boundary before building; do not interrupt the current storybook/image-gen line.
