Commit ea6dcb6
Kubernetes: fail fast if job pod was not scheduled (#3874)
After a job pod is created, wait and fail with `ComputeError` if the
pod has either not been scheduled or has already finished (probably
failed) within the scheduling timeout (10 seconds).
A new `watch` permission for `pods` in the namespace is required.
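
The fail-fast check can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code from this commit: `wait_for_scheduling`, the simplified event dictionaries, and the local `ComputeError` stand-in are all hypothetical. In a real cluster, the events would come from a watch on the pod (hence the new `watch` permission), and "scheduled" would correspond to the pod's `PodScheduled` condition being true.

```python
import time
from typing import Callable, Iterable, Mapping


class ComputeError(Exception):
    """Local stand-in for the error type named in the commit message."""


# 10 seconds, as stated in the commit message.
SCHEDULING_TIMEOUT = 10.0


def wait_for_scheduling(
    events: Iterable[Mapping[str, object]],
    timeout: float = SCHEDULING_TIMEOUT,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Return once the pod is scheduled; raise ComputeError otherwise.

    Each event is a simplified, hypothetical snapshot of the pod:
    {"phase": "Pending" | "Running" | "Succeeded" | "Failed", "scheduled": bool}
    """
    deadline = clock() + timeout
    for event in events:
        if clock() > deadline:
            break
        phase = event.get("phase")
        # A pod that has already finished will never run the job.
        if phase in ("Succeeded", "Failed"):
            raise ComputeError(f"Pod finished prematurely (phase={phase})")
        if event.get("scheduled"):
            return
    # Either the deadline passed or the event stream ended without scheduling.
    raise ComputeError(f"Pod was not scheduled within {timeout:g} seconds")
```

Note that the deadline here is only checked when an event arrives; a real watch blocks between events, so the timeout would also need to bound the watch request itself.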
In addition, `run_job()` and `terminate_instance()` were refactored
to clean up objects on failures.
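
One common way to get cleanup-on-failure behavior is to register a delete action after each successful create and to cancel those actions only when every step has succeeded. The sketch below shows that general pattern with `contextlib.ExitStack`; it is an assumption about the approach, not the refactored `run_job()` itself, and `create_with_cleanup` and its `(create, delete)` pairs are hypothetical names.

```python
from contextlib import ExitStack
from typing import Callable, Sequence, Tuple

Step = Tuple[Callable[[], None], Callable[[], None]]  # (create, delete)


def create_with_cleanup(steps: Sequence[Step]) -> None:
    """Run each create in order; on failure, delete what was already created.

    Deletes run in reverse creation order, and the original exception
    propagates to the caller.
    """
    with ExitStack() as stack:
        for create, delete in steps:
            create()
            # Registered only after a successful create, so a failed
            # create never triggers its own delete.
            stack.callback(delete)
        # Everything succeeded: detach the callbacks so nothing is deleted.
        stack.pop_all()
```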
Part-of: #3871
1 parent ce0c210 commit ea6dcb6
3 files changed
Lines changed: 481 additions & 216 deletions
File tree
- mkdocs/snippets/kubernetes
- src/dstack/_internal/core/backends/kubernetes