
[AMORO-4198][optimizer] Support graceful shutdown for in-progress tasks #4197

Open

j1wonpark wants to merge 1 commit into apache:master from j1wonpark:optimizer-graceful-shutdown

Conversation

Contributor

@j1wonpark j1wonpark commented Apr 30, 2026

Why are the changes needed?

Close #4198.

When an optimizer receives SIGTERM, in-progress tasks are silently dropped and
AMS re-schedules them, doubling work and potentially causing duplicate commits.

Brief change log

  • Optimizer.stopOptimizing(): join executor threads up to --shutdown-timeout-ms
    (default 10 min); keep toucher alive during drain so AMS heartbeats continue
  • OptimizerExecutor.completeTask(): best-effort direct call after shutdown so
    results are not silently dropped
  • OptimizerToucher.stop(): interrupt runner thread to wake it from sleep immediately
  • AbstractOptimizerOperator.waitAShortTime(): preserve interrupt flag
  • OptimizerConfig: new -st / --shutdown-timeout-ms option
  • StandaloneOptimizer / SparkOptimizer: register graceful shutdown hook on
    Hadoop's ShutdownHookManager above FS_CACHE priority with explicit timeout
  • KubernetesOptimizerContainer: exec prefix in container command; derive
    terminationGracePeriodSeconds from shutdown-timeout-ms + 30s buffer
  • optimizer.sh start-foreground: exec $CMDS so Java receives SIGTERM directly
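The drain step described above (join executor threads up to the timeout, force-interrupt only on expiry) can be sketched roughly as follows. This is a minimal illustration, not Amoro's actual code: the class, method, and parameter names (`GracefulDrain`, `drain`, `executorThreads`) are hypothetical, and executor threads are assumed to be plain `Thread` objects.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the drain logic; names are illustrative, not Amoro's API.
public class GracefulDrain {

  /**
   * Joins each executor thread until a shared deadline expires, then
   * force-interrupts whatever is still running. Returns true if every
   * thread exited within the timeout.
   */
  public static boolean drain(List<Thread> executorThreads, long shutdownTimeoutMs)
      throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(shutdownTimeoutMs);
    for (Thread t : executorThreads) {
      long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
      if (remainingMs > 0) {
        t.join(remainingMs); // let the in-progress task finish within the budget
      }
      if (t.isAlive()) {
        t.interrupt(); // deadline passed: force-interrupt only on timeout
      }
    }
    boolean allStopped = true;
    for (Thread t : executorThreads) {
      allStopped &= !t.isAlive();
    }
    return allStopped;
  }
}
```

The key property is that the deadline is shared across all threads, so a slow first task cannot extend the total shutdown budget for the rest.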

How was this patch tested?

  • Add some test cases that check the changes thoroughly including negative and positive cases if possible

  • Add screenshots for manual tests if appropriate

  • Run test locally before making a pull request

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? JavaDocs / option usage string

On SIGTERM the optimizer flips its stopped flag and returns immediately,
so in-flight task results are silently dropped (completeTask is gated by
isStarted). On K8s this is compounded by `sh -c` swallowing SIGTERM, a
30s default grace period, and Hadoop's FileSystem cache cleanup racing
JVM shutdown hooks.
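The interrupt-flag fix in AbstractOptimizerOperator.waitAShortTime amounts to re-setting the flag after a swallowed InterruptedException, so callers further up the stack can still observe the shutdown request. A sketch (the method name mirrors the PR, but the body here is an assumption):

```java
// Illustrative sketch of an interrupt-preserving sleep; the body is assumed,
// only the method name comes from the PR description.
public class InterruptPreservingWait {
  public static void waitAShortTime(long ms) {
    try {
      Thread.sleep(ms);
    } catch (InterruptedException e) {
      // Re-set the interrupt flag instead of silently swallowing it,
      // so the shutdown request propagates to the caller's loop condition.
      Thread.currentThread().interrupt();
    }
  }
}
```

Without the re-interrupt, `Thread.sleep` clears the flag when it throws, and a polling loop that checks `Thread.currentThread().isInterrupted()` would never see the stop signal.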

- Optimizer.stopOptimizing: join executors with a deadline, force
  interrupt only on timeout; keep toucher alive so AMS heartbeats
  continue while tasks drain.
- OptimizerExecutor.completeTask: best-effort direct call after stop so
  the in-flight result still reaches AMS.
- SparkOptimizer / StandaloneOptimizer: register on Hadoop
  ShutdownHookManager (priority above FS_CACHE / SparkContext) with an
  explicit per-hook timeout.
- OptimizerConfig: new -st / --shutdown-timeout-ms (default 600s).
- KubernetesOptimizerContainer: `sh -c 'exec <args>'` and an explicit
  terminationGracePeriodSeconds derived from -st + 30s buffer; user
  podTemplate values are respected.
- optimizer.sh start-foreground: exec $CMDS so java gets PID 1.
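The terminationGracePeriodSeconds derivation above is a simple calculation: convert the drain timeout to seconds and add the fixed buffer so Kubernetes does not SIGKILL the pod mid-drain. A sketch (the class and constant names are illustrative):

```java
import java.util.concurrent.TimeUnit;

// Illustrative derivation of the pod grace period from --shutdown-timeout-ms;
// class and constant names are assumptions, the +30s buffer is from the PR.
public class GracePeriod {
  static final long BUFFER_SECONDS = 30;

  public static long terminationGracePeriodSeconds(long shutdownTimeoutMs) {
    return TimeUnit.MILLISECONDS.toSeconds(shutdownTimeoutMs) + BUFFER_SECONDS;
  }
}
```

With the default 600s timeout this yields a 630s grace period, giving the JVM shutdown hooks room to run after the drain deadline before the kubelet escalates to SIGKILL.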

Signed-off-by: Jiwon Park <jpark92@outlook.kr>
@j1wonpark j1wonpark changed the title [AMORO][optimizer] Support graceful shutdown for in-progress tasks [AMORO-4198][optimizer] Support graceful shutdown for in-progress tasks Apr 30, 2026


Successfully merging this pull request may close these issues.

[Improvement]: Support graceful shutdown for in-progress optimizer tasks
