Description
Add workload-aware dynamic resource allocation to Amoro's optimizer, enabling automatic scale-up when optimization demand exceeds capacity and scale-down when optimizers are idle. Inspired by Spark's Dynamic Resource Allocation.
Use case/motivation
Currently, Amoro's optimizer only supports a static min-parallelism floor — it scales up to meet the minimum but never scales down, and it does not respond to actual workload.
1. Resource waste during low-load periods: Optimizer pods remain running even when there are no optimization tasks for hours or days, consuming cluster resources that could serve other workloads.
2. Manual intervention for load spikes: When a burst of table optimizations is triggered (e.g., after a large batch ingestion), users must manually adjust min-parallelism — which is slow, error-prone, and operationally painful.
Describe the solution
All scaling logic lives in AMS (DefaultOptimizingService.OptimizerGroupKeeper), making it container-agnostic. Container implementations (KubernetesOptimizerContainer, etc.) remain unchanged. Kubernetes is the primary target for testing and tuning.
┌────────────────────────────────────┐
│   DefaultOptimizingService (AMS)   │
│                                    │
│   OptimizerGroupKeeper             │
│    ├─ min-parallelism (existing)   │
│    ├─ demand-driven scale-up       │
│    ├─ idle-driven scale-down       │
│    └─ graceful drain               │
└──────────────────┬─────────────────┘
                   │
 requestResource() / releaseResource()
                   │
       ┌───────────┼─────────────────────┐
       │           │                     │
K8s Container ★  Flink Container  Spark Container
(primary target)  (unchanged)      (unchanged)
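To make the keeper's role concrete, here is a minimal sketch of the decision loop it might run. Everything except OptimizerGroupKeeper, requestResource()/releaseResource(), and the proposed property semantics is hypothetical scaffolding, not Amoro's actual API:

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Sketch only: not Amoro's actual OptimizerGroupKeeper. It shows the shape
 * of the proposed decision loop, with the existing min-parallelism floor
 * kept intact and the new paths gated behind dynamic-allocation.enabled.
 */
public class KeeperLoopSketch {
  private final boolean dynamicAllocationEnabled;  // dynamic-allocation.enabled
  private final int minParallelism;                // existing floor
  private final int maxParallelism;                // proposed upper bound
  private final Duration scaleUpCooldown = Duration.ofSeconds(30);
  private final Duration scaleDownCooldown = Duration.ofMinutes(1);

  private int runningThreads;   // threads offered by live optimizers
  private int plannedTasks;     // PLANNED tasks waiting for a thread
  private Instant lastScaleUp = Instant.EPOCH;
  private Instant lastScaleDown = Instant.EPOCH;

  KeeperLoopSketch(boolean enabled, int min, int max) {
    this.dynamicAllocationEnabled = enabled;
    this.minParallelism = min;
    this.maxParallelism = max;
  }

  /** One periodic evaluation of the group's demand and idleness. */
  void tick(Instant now) {
    if (runningThreads < minParallelism) {
      requestResource(minParallelism - runningThreads);  // today's behavior
      return;
    }
    if (!dynamicAllocationEnabled) {
      return;                                            // disabled: nothing else changes
    }
    boolean backlogged = plannedTasks > runningThreads;  // immediate demand signal
    if (backlogged && runningThreads < maxParallelism
        && !now.isBefore(lastScaleUp.plus(scaleUpCooldown))) {
      requestResource(1);                                // demand-driven scale-up
      lastScaleUp = now;
    } else if (plannedTasks == 0 && runningThreads > minParallelism
        && !now.isBefore(lastScaleDown.plus(scaleDownCooldown))) {
      releaseResource(1);                                // idle-driven: drain one optimizer
      lastScaleDown = now;
    }
  }

  private void requestResource(int threads) { runningThreads += threads; }
  private void releaseResource(int threads) { runningThreads -= threads; }
}
```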
New configuration properties (ResourceGroup level, opt-in)
| Property | Default | Description |
| --- | --- | --- |
| dynamic-allocation.enabled | false | Opt-in flag. Existing behavior fully preserved when disabled. |
| max-parallelism | Integer.MAX_VALUE | Upper bound on total threads |
| dynamic-allocation.executor-idle-timeout | 5m | Remove an idle optimizer after this duration |
| dynamic-allocation.scheduler-backlog-timeout | 1m | Trigger scale-up after sustained demand |
| dynamic-allocation.sustained-backlog-timeout | 30s | Interval between subsequent scale-up rounds |
| dynamic-allocation.scale-up-cooldown | 30s | Minimum interval between scale-up actions |
| dynamic-allocation.scale-down-cooldown | 1m | Minimum interval between scale-down actions |
| dynamic-allocation.drain-timeout | 15m | Force-remove a draining optimizer after this duration |
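For illustration, a resource group opting in might carry properties like the following. The values are hypothetical and the snippet only shows the key/value shape, not any real Amoro configuration API:

```java
import java.util.Map;

class DynamicAllocationConfigExample {
  // Hypothetical example values; the keys are the properties proposed above.
  static final Map<String, String> GROUP_PROPERTIES = Map.of(
      "dynamic-allocation.enabled", "true",               // opt in for this group
      "max-parallelism", "32",                            // cap total threads
      "dynamic-allocation.executor-idle-timeout", "10m",  // tolerate longer idle gaps
      "dynamic-allocation.scale-up-cooldown", "30s",
      "dynamic-allocation.drain-timeout", "15m");
}
```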
Key behaviors
- Scale-up: Two-layer demand signal: immediate (PLANNED tasks exceed available threads → exponential ramp 1, 2, 4, 8...) and future (pending tables with low thread headroom → conservative pre-warm). Tracks in-flight pod registrations to avoid duplicate creation; a minimal ramp sketch follows this list.
- Scale-down: Event-driven per-optimizer idle tracking. Removes the longest-idle optimizer one at a time, respecting the min-parallelism floor.
- Graceful drain: Draining optimizers receive no new tasks (pollTask() returns null), but in-flight tasks complete naturally. The pod is deleted only after all tasks finish (or drain-timeout expires); see the drain sketch below.
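A sketch of the exponential ramp from the scale-up bullet, assuming a round counter that resets once the backlog clears; all names are illustrative:

```java
class ScaleUpRamp {
  /**
   * Illustrative only: how many optimizer threads to request in the next
   * scale-up round. `round` counts consecutive rounds while the backlog
   * persists (0, 1, 2, ...), giving the 1, 2, 4, 8... ramp; the result is
   * capped by outstanding demand and by max-parallelism headroom.
   */
  static int nextScaleUpCount(int round, int runningThreads,
                              int plannedTasks, int maxParallelism) {
    int ramp = 1 << Math.min(round, 30);                      // 1, 2, 4, 8, ... (overflow-safe)
    int demand = Math.max(plannedTasks - runningThreads, 0);  // immediate demand signal
    int headroom = Math.max(maxParallelism - runningThreads, 0);
    return Math.min(ramp, Math.min(demand, headroom));
  }
}
```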
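And a sketch of the drain gate from the graceful-drain bullet; only the pollTask()-returns-null behavior comes from the proposal, the rest is hypothetical scaffolding:

```java
class DrainGate {
  enum OptimizerState { ACTIVE, DRAINING }  // hypothetical per-optimizer state in AMS

  static class Task {}                      // stand-in for an optimizing task

  /** Draining optimizers are starved of new work; in-flight tasks finish naturally. */
  Task pollTask(String optimizerToken, OptimizerState state) {
    if (state == OptimizerState.DRAINING) {
      return null;                          // no new tasks for a draining optimizer
    }
    return takeNextPlannedTask(optimizerToken);
  }

  private Task takeNextPlannedTask(String token) {
    return null;                            // elided: dequeue from the PLANNED queue
  }
}
```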
Subtasks
- max-parallelism enforcement

A detailed design document is available for reference during implementation and PR review.