[AMORO-4198][optimizer] Support graceful shutdown for in-progress tasks #4197
Open
j1wonpark wants to merge 1 commit into apache:master from
On SIGTERM the optimizer flips its stopped flag and returns immediately, so in-flight task results are silently dropped (completeTask is gated by isStarted). On K8s this is compounded by `sh -c` swallowing SIGTERM, a 30s default grace period, and Hadoop's FileSystem cache cleanup racing JVM shutdown hooks.

- Optimizer.stopOptimizing: join executors with a deadline, force interrupt only on timeout; keep toucher alive so AMS heartbeats continue while tasks drain.
- OptimizerExecutor.completeTask: best-effort direct call after stop so the in-flight result still reaches AMS.
- SparkOptimizer / StandaloneOptimizer: register on Hadoop ShutdownHookManager (priority above FS_CACHE / SparkContext) with an explicit per-hook timeout.
- OptimizerConfig: new -st / --shutdown-timeout-ms (default 600s).
- KubernetesOptimizerContainer: `sh -c 'exec <args>'` and an explicit terminationGracePeriodSeconds derived from -st + 30s buffer; user podTemplate values are respected.
- optimizer.sh start-foreground: exec $CMDS so java gets PID 1.

Signed-off-by: Jiwon Park <jpark92@outlook.kr>
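The "join executors with a deadline, force interrupt only on timeout" step from the commit message can be sketched roughly as below. This is an illustration only, not the code in this PR: the class `GracefulDrainSketch` and both method names are hypothetical, and rounding the millisecond timeout up to whole seconds before adding the buffer is an assumption.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; names and structure are not taken from the Amoro code base.
final class GracefulDrainSketch {

  /**
   * Waits for all workers against one shared deadline, then interrupts
   * only those still alive. Returns true if every worker finished in time.
   */
  static boolean drain(List<Thread> workers, long timeoutMs) {
    long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    try {
      for (Thread worker : workers) {
        long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadlineNanos - System.nanoTime());
        if (remainingMs <= 0) {
          break; // deadline passed; note that join(0) would wait forever
        }
        worker.join(remainingMs);
      }
    } catch (InterruptedException e) {
      // Preserve the interrupt flag and fall through to the forced stop.
      Thread.currentThread().interrupt();
    }
    boolean allDone = true;
    for (Thread worker : workers) {
      if (worker.isAlive()) {
        allDone = false;
        worker.interrupt(); // force interrupt only after the deadline
      }
    }
    return allDone;
  }

  /**
   * Pod grace period: shutdown timeout plus a buffer.
   * Assumption: the millisecond value is rounded up to whole seconds.
   */
  static long terminationGracePeriodSeconds(long shutdownTimeoutMs, long bufferSeconds) {
    return (shutdownTimeoutMs + 999) / 1000 + bufferSeconds;
  }
}
```

With the default of 600s and the 30s buffer mentioned above, `terminationGracePeriodSeconds(600_000, 30)` evaluates to 630, so the pod is given slightly longer than the optimizer's own drain deadline.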
Why are the changes needed?
Close #4198.
When an optimizer receives SIGTERM, in-progress tasks are silently dropped and
AMS re-schedules them — doubling work and potentially causing duplicate commits.
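The failure mode described above can be illustrated with a minimal, single-threaded sketch. It assumes a result reporter guarded by a "started" flag and contrasts it with a best-effort variant that still makes one direct attempt after stop. The class `CompleteTaskSketch`, its methods, and the `Consumer<String>` stand-in for the AMS client are all hypothetical and much simpler than the real `OptimizerExecutor`.

```java
import java.util.function.Consumer;

// Hypothetical sketch; not the actual OptimizerExecutor implementation.
final class CompleteTaskSketch {
  private final Consumer<String> amsClient; // stand-in for the call that reports to AMS
  private volatile boolean started = true;

  CompleteTaskSketch(Consumer<String> amsClient) {
    this.amsClient = amsClient;
  }

  /** Mimics the shutdown path flipping the flag. */
  void stop() {
    started = false;
  }

  /** Gated reporting: once stopped, the result never reaches AMS. */
  boolean completeGated(String taskResult) {
    if (!started) {
      return false; // result is silently dropped
    }
    amsClient.accept(taskResult);
    return true;
  }

  /** Best-effort reporting: one direct attempt, even after stop(). */
  boolean completeBestEffort(String taskResult) {
    try {
      amsClient.accept(taskResult);
      return true;
    } catch (RuntimeException e) {
      return false; // give up; the task would be re-scheduled as before
    }
  }
}
```

In this sketch a result produced after `stop()` is lost by `completeGated` but delivered by `completeBestEffort`, which is the behavior change the PR is after for in-flight tasks.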
Brief change log
- `Optimizer.stopOptimizing()`: join executor threads up to `--shutdown-timeout-ms` (default 10 min); keep toucher alive during drain so AMS heartbeats continue
- `OptimizerExecutor.completeTask()`: best-effort direct call after shutdown so results are not silently dropped
- `OptimizerToucher.stop()`: interrupt runner thread to wake it from sleep immediately
- `AbstractOptimizerOperator.waitAShortTime()`: preserve interrupt flag
- `OptimizerConfig`: new `-st` / `--shutdown-timeout-ms` option
- `StandaloneOptimizer` / `SparkOptimizer`: register graceful shutdown hook on Hadoop's `ShutdownHookManager` above `FS_CACHE` priority with explicit timeout
- `KubernetesOptimizerContainer`: `exec` prefix in container command; derive `terminationGracePeriodSeconds` from shutdown-timeout-ms + 30s buffer
- `optimizer.sh start-foreground`: `exec $CMDS` so Java receives SIGTERM directly

How was this patch tested?
Add some test cases that check the changes thoroughly including negative and positive cases if possible
Add screenshots for manual tests if appropriate
Run test locally before making a pull request
Documentation
usage string