
[AMORO-4198][optimizer] Support graceful shutdown for in-progress tasks #4197

Open

j1wonpark wants to merge 1 commit into apache:master from j1wonpark:optimizer-graceful-shutdown

Conversation

Contributor

@j1wonpark j1wonpark commented Apr 30, 2026

Why are the changes needed?

Close #4198.

When an optimizer receives SIGTERM, in-progress tasks are silently dropped and
AMS re-schedules them, doubling work and potentially causing duplicate commits.

Brief change log

  • Optimizer.stopOptimizing(): join executor threads up to --shutdown-timeout-ms
    (default 10 min); keep toucher alive during drain so AMS heartbeats continue
  • OptimizerExecutor.completeTask(): best-effort direct call after shutdown so
    results are not silently dropped
  • OptimizerToucher.stop(): interrupt runner thread to wake it from sleep immediately
  • AbstractOptimizerOperator.waitAShortTime(): preserve interrupt flag
  • OptimizerConfig: new -st / --shutdown-timeout-ms option
  • StandaloneOptimizer / SparkOptimizer: register graceful shutdown hook on
    Hadoop's ShutdownHookManager above FS_CACHE priority with explicit timeout
  • KubernetesOptimizerContainer: exec prefix in container command; derive
    terminationGracePeriodSeconds from shutdown-timeout-ms + 30s buffer
  • optimizer.sh start-foreground: exec $CMDS so Java receives SIGTERM directly
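The drain step described above (join executor threads up to the timeout, force-interrupt only on expiry) can be sketched roughly as follows. This is a minimal illustration, not Amoro's actual code: the class, method, and parameter names (`GracefulDrain`, `drain`, `executorThreads`) are hypothetical, and executor threads are assumed to be plain `Thread` objects.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the drain logic; names are illustrative, not Amoro's API.
public class GracefulDrain {

  /**
   * Joins each executor thread until a shared deadline expires, then
   * force-interrupts whatever is still running. Returns true if every
   * thread exited within the timeout.
   */
  public static boolean drain(List<Thread> executorThreads, long shutdownTimeoutMs)
      throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(shutdownTimeoutMs);
    for (Thread t : executorThreads) {
      long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
      if (remainingMs > 0) {
        t.join(remainingMs); // let the in-progress task finish within the budget
      }
      if (t.isAlive()) {
        t.interrupt(); // deadline passed: force-interrupt only on timeout
      }
    }
    boolean allStopped = true;
    for (Thread t : executorThreads) {
      allStopped &= !t.isAlive();
    }
    return allStopped;
  }
}
```

The key property is that the deadline is shared across all threads, so a slow first task cannot extend the total shutdown budget for the rest.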

How was this patch tested?

  • Add some test cases that check the changes thoroughly including negative and positive cases if possible

  • Add screenshots for manual tests if appropriate

  • Run test locally before making a pull request

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? JavaDocs / option usage string

On SIGTERM the optimizer flips its stopped flag and returns immediately,
so in-flight task results are silently dropped (completeTask is gated by
isStarted). On K8s this is compounded by `sh -c` swallowing SIGTERM, a
30s default grace period, and Hadoop's FileSystem cache cleanup racing
JVM shutdown hooks.
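The interrupt-flag fix in AbstractOptimizerOperator.waitAShortTime amounts to re-setting the flag after a swallowed InterruptedException, so callers further up the stack can still observe the shutdown request. A sketch (the method name mirrors the PR, but the body here is an assumption):

```java
// Illustrative sketch of an interrupt-preserving sleep; the body is assumed,
// only the method name comes from the PR description.
public class InterruptPreservingWait {
  public static void waitAShortTime(long ms) {
    try {
      Thread.sleep(ms);
    } catch (InterruptedException e) {
      // Re-set the interrupt flag instead of silently swallowing it,
      // so the shutdown request propagates to the caller's loop condition.
      Thread.currentThread().interrupt();
    }
  }
}
```

Without the re-interrupt, `Thread.sleep` clears the flag when it throws, and a polling loop that checks `Thread.currentThread().isInterrupted()` would never see the stop signal.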

- Optimizer.stopOptimizing: join executors with a deadline, force
  interrupt only on timeout; keep toucher alive so AMS heartbeats
  continue while tasks drain.
- OptimizerExecutor.completeTask: best-effort direct call after stop so
  the in-flight result still reaches AMS.
- SparkOptimizer / StandaloneOptimizer: register on Hadoop
  ShutdownHookManager (priority above FS_CACHE / SparkContext) with an
  explicit per-hook timeout.
- OptimizerConfig: new -st / --shutdown-timeout-ms (default 600s).
- KubernetesOptimizerContainer: `sh -c 'exec <args>'` and an explicit
  terminationGracePeriodSeconds derived from -st + 30s buffer; user
  podTemplate values are respected.
- optimizer.sh start-foreground: exec $CMDS so java gets PID 1.
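The terminationGracePeriodSeconds derivation above is a simple calculation: convert the drain timeout to seconds and add the fixed buffer so Kubernetes does not SIGKILL the pod mid-drain. A sketch (the class and constant names are illustrative):

```java
import java.util.concurrent.TimeUnit;

// Illustrative derivation of the pod grace period from --shutdown-timeout-ms;
// class and constant names are assumptions, the +30s buffer is from the PR.
public class GracePeriod {
  static final long BUFFER_SECONDS = 30;

  public static long terminationGracePeriodSeconds(long shutdownTimeoutMs) {
    return TimeUnit.MILLISECONDS.toSeconds(shutdownTimeoutMs) + BUFFER_SECONDS;
  }
}
```

With the default 600s timeout this yields a 630s grace period, giving the JVM shutdown hooks room to run after the drain deadline before the kubelet escalates to SIGKILL.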

Signed-off-by: Jiwon Park <jpark92@outlook.kr>
@j1wonpark j1wonpark changed the title [AMORO][optimizer] Support graceful shutdown for in-progress tasks [AMORO-4198][optimizer] Support graceful shutdown for in-progress tasks Apr 30, 2026


Successfully merging this pull request may close these issues.

[Improvement]: Support graceful shutdown for in-progress optimizer tasks
