Commit ea6dcb6

Kubernetes: fail fast if job pod was not scheduled (#3874)

After a job pod is created, wait and fail with `ComputeError` if the pod has either not been scheduled or has already finished (probably failed) within the scheduling timeout (10 seconds).

A new `watch` permission for `pods` in the namespace is required.

In addition, `run_job()` and `terminate_instance()` were refactored to clean up objects on failures.

Part-of: #3871
1 parent ce0c210 commit ea6dcb6

3 files changed

Lines changed: 481 additions & 216 deletions

mkdocs/snippets/kubernetes/dstack-backend-role.yaml

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ metadata:
 rules:
 - apiGroups: [""]
   resources: ["pods"]
-  verbs: ["get", "create", "delete"]
+  verbs: ["get", "watch", "create", "delete"]
 - apiGroups: [""]
   resources: ["services"]
   verbs: ["get", "create", "delete"]
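
The fail-fast check described in the commit message might look roughly like the sketch below. It is not the code from this commit: `wait_for_scheduling`, the helper functions, and the local `ComputeError` class are hypothetical stand-ins, and the events are assumed to be plain dicts shaped like Kubernetes watch events (a `type` plus a pod `object` with `status.phase` and `status.conditions`). The event stream is also assumed to end on its own when the watch times out, which is why the role needs the `watch` verb on `pods`.

```python
import time
from typing import Any, Callable, Dict, Iterable


class ComputeError(Exception):
    """Hypothetical local stand-in for dstack's ComputeError."""


# Scheduling timeout in seconds, as stated in the commit message.
SCHEDULING_TIMEOUT = 10.0


def is_scheduled(pod: Dict[str, Any]) -> bool:
    """True if the pod carries a PodScheduled condition with status "True"."""
    conditions = pod.get("status", {}).get("conditions") or []
    return any(
        c.get("type") == "PodScheduled" and c.get("status") == "True"
        for c in conditions
    )


def is_finished(pod: Dict[str, Any]) -> bool:
    """True if the pod has reached a terminal phase."""
    return pod.get("status", {}).get("phase") in ("Succeeded", "Failed")


def wait_for_scheduling(
    events: Iterable[Dict[str, Any]],
    timeout: float = SCHEDULING_TIMEOUT,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Return once the pod is scheduled; raise ComputeError otherwise.

    `events` is assumed to be a finite stream of watch events for a single
    pod that stops yielding when the watch itself times out.
    """
    deadline = clock() + timeout
    for event in events:
        if event.get("type") == "DELETED":
            raise ComputeError("pod was deleted before it was scheduled")
        pod = event["object"]
        # A finished pod was necessarily scheduled, so check this first.
        if is_finished(pod):
            raise ComputeError("pod finished before the job could start")
        if is_scheduled(pod):
            return
        if clock() >= deadline:
            break
    raise ComputeError(f"pod was not scheduled within {timeout} seconds")
```

A caller would presumably wrap this in cleanup logic, so that the pod and any related objects are deleted when `ComputeError` is raised, in line with the refactoring of `run_job()` mentioned in the commit message.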
