[AMORO-4172] Self-heal stuck RUNNING optimizing process in TableRuntimeRefreshExecutor #4196
Conversation
[AMORO-4172] Self-heal stuck RUNNING optimizing process in TableRuntimeRefreshExecutor

When a RUNNING optimizing process has all tasks succeeded but a transient failure of beginCommitting() (e.g. DB lock wait timeout) leaves the table stuck in *_OPTIMIZING indefinitely while AMS keeps running, the status is never recovered until AMS is restarted. The existing recovery path in OptimizingQueue#initTableRuntime already handles this on AMS restart, but there is currently no periodic retry while AMS is alive.

Add a lightweight self-heal path in TableRuntimeRefreshExecutor.execute: if it detects a process in RUNNING state whose status != COMMITTING and allTasksPrepared() == true, it re-invokes beginCommitting(). On success, the normal OptimizingCommitExecutor pipeline resumes via handleTableChanged; on failure, it logs and retries on the next refresh cycle.

Changes:
- OptimizingProcess: add allTasksPrepared() to the interface
- OptimizingQueue.TableOptimizingProcess: promote allTasksPrepared() to public @Override and wrap it in the existing lock for safe cross-thread observation of taskMap
- TableRuntimeRefreshExecutor (server/scheduler/inline/): invoke tryHealStuckCommitting at the end of execute()
- Test: regression test using TableRuntimeStore.begin().updateStatusCode to simulate the stuck state without a reload, verifying that a single refresh cycle restores the table to COMMITTING

Fixes apache#4172
czy006 left a comment:
Thanks for the fix. I do not think this PR fully closes #4172 yet.
The current self-heal path only retries beginCommitting() when process.allTasksPrepared() is true. However, in the reported failure stack, beginCommitting() is invoked from the TaskRuntime.complete() callback before the outer transaction has successfully committed. If beginCommitting() fails with a DB lock wait timeout, StatedPersistentBase.invokeConsistency() can restore the current TaskRuntime state fields, so the last task may remain ACKED instead of SUCCESS.
In that real failure state, allTasksPrepared() returns false, and the new TableRuntimeRefreshExecutor recovery path will not run.
The added test covers a narrower scenario: it first lets completeTask() succeed, then manually changes only the table status back to MINOR_OPTIMIZING. That validates recovery for RUNNING + all tasks SUCCESS + table not COMMITTING, but it does not reproduce the actual lock-timeout rollback path described in #4172.
I think this needs either:
- a regression test that injects a failure from beginCommitting() during completeTask() and verifies the post-failure task/table/process state (a rough sketch follows below), or
- a fix that also handles the case where the final task was rolled back to ACKED while the optimizer has already produced the task result.
Until that real failure path is covered, I would treat this PR as a partial mitigation rather than a complete fix for #4172.
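For concreteness, one possible shape of the failure-injection test from the first option above. This is a sketch only: the fixture helpers (buildRunningTableRuntime, completeLastTask, lastTaskOf) are hypothetical, Amoro package imports are omitted because paths vary across versions, and the real test harness will differ.

```
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertNotEquals;

import org.junit.jupiter.api.Test;
import org.mockito.Mockito;

// Hypothetical sketch of the requested regression test; fixture helpers
// are illustrative, not real Amoro test APIs. Amoro imports omitted.
class StuckCommittingRegressionTest {

  @Test
  void beginCommittingFailureDuringCompleteTaskRollsTaskBackToAcked() {
    DefaultTableRuntime runtime = Mockito.spy(buildRunningTableRuntime());

    // Make the first beginCommitting() fail like the DB lock wait
    // timeout reported in apache#4172; later calls behave normally.
    Mockito.doThrow(new IllegalStateException("Lock wait timeout exceeded"))
        .doCallRealMethod()
        .when(runtime)
        .beginCommitting();

    // Completing the final task triggers beginCommitting() from the
    // TaskRuntime.complete() callback, which now fails once.
    completeLastTask(runtime);

    // Verify the post-failure state the review describes: the final task
    // rolled back to ACKED, so allTasksPrepared() is false and the
    // refresh-based self-heal in this PR will not fire.
    assertEquals(TaskRuntime.Status.ACKED, lastTaskOf(runtime).getStatus());
    assertNotEquals(OptimizingStatus.COMMITTING, runtime.getOptimizingStatus());
  }
}
```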
Fixes #4172.
Supersedes #4195 (closed — that branch was stale against the recent scheduler-framework refactor).
Problem
When a table's optimizing process reaches the post-execution phase, with all tasks `SUCCESS` and the code about to transition the table to `COMMITTING`, a transient failure of `DefaultTableRuntime#beginCommitting()` (e.g. a DB lock wait timeout in the underlying `UPDATE table_runtime`) leaves the process `RUNNING` but the table status stuck in `*_OPTIMIZING` until AMS is restarted. The issue reporter's stack trace confirms exactly this symptom.
Why existing code paths do not recover the table
`OptimizingQueue#initTableRuntime` already recovers this state, but it runs only on AMS restart; no periodic path re-attempts `beginCommitting()` while AMS is alive. Result: while AMS is alive, there is no self-healing; the operator must restart AMS.
Approach
Leverage the existing `TableRuntimeRefreshExecutor`, which already scans every table periodically. At the end of `execute()`, detect the stuck pattern and re-drive the transition:
```
process != null
&& process.getStatus() == ProcessStatus.RUNNING
&& tableRuntime.getOptimizingStatus() != OptimizingStatus.COMMITTING
&& tableRuntime.getOptimizingStatus().isProcessing()
&& process.allTasksPrepared()
```
When all conditions hold, call `tableRuntime.beginCommitting()`. On success, the normal `handleTableChanged` → `OptimizingCommitExecutor` pipeline resumes. On failure, log at `WARN` and retry on the next refresh cycle (≤ 1 minute by default).
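A minimal sketch of what that hook could look like inside `TableRuntimeRefreshExecutor`. The method name comes from the PR summary and the guard mirrors the predicate above; the accessor names on `DefaultTableRuntime` and the `LOG` field are assumptions, not confirmed signatures.

```
// Sketch only: accessor names and the logger field are assumptions.
private void tryHealStuckCommitting(DefaultTableRuntime tableRuntime) {
  OptimizingProcess process = tableRuntime.getOptimizingProcess();
  if (process != null
      && process.getStatus() == ProcessStatus.RUNNING
      && tableRuntime.getOptimizingStatus() != OptimizingStatus.COMMITTING
      && tableRuntime.getOptimizingStatus().isProcessing()
      && process.allTasksPrepared()) {
    try {
      // Re-drive the transition that originally failed; on success the
      // normal handleTableChanged -> OptimizingCommitExecutor pipeline
      // picks the table up again.
      tableRuntime.beginCommitting();
    } catch (Throwable t) {
      // A transient failure (e.g. another DB lock wait timeout) is not
      // fatal: the next refresh cycle retries within about a minute.
      LOG.warn(
          "Self-heal of stuck RUNNING optimizing process failed for {}",
          tableRuntime.getTableIdentifier(),
          t);
    }
  }
}
```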
Why this location
`TableRuntimeRefreshExecutor` already visits every table on a short, fixed interval, so the check needs no new thread or scheduler state, and a failed heal attempt is naturally retried on the next refresh cycle.
Changes
```
4 files changed, 133 insertions(+), 6 deletions(-)
```
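For orientation, a sketch of the `allTasksPrepared()` promotion described in the commit message. `taskMap` and the existing lock come from the summary above; the exact lock type and API are assumptions.

```
// OptimizingProcess gains the method on the interface (per the summary):
boolean allTasksPrepared();

// In OptimizingQueue.TableOptimizingProcess it is promoted to public and
// guarded by the existing lock so another thread (the refresh executor)
// observes taskMap consistently. Sketch only; lock API is an assumption.
@Override
public boolean allTasksPrepared() {
  lock.lock();
  try {
    return !taskMap.isEmpty()
        && taskMap.values().stream()
            .allMatch(task -> task.getStatus() == TaskRuntime.Status.SUCCESS);
  } finally {
    lock.unlock();
  }
}
```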
Verification
All targeted test suites pass locally on JDK 11:
```
mvn -pl amoro-ams -am test \
  -Dtest='TestDefaultOptimizingService#testRefreshExecutorHealsStuckRunningProcess' \
  -Dsurefire.failIfNoSpecifiedTests=false -DskipITs
→ Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, BUILD SUCCESS
mvn -pl amoro-ams -am test -Dtest='TestDefaultOptimizingService'
→ Tests run: 22, Failures: 0, Errors: 0, Skipped: 0, BUILD SUCCESS
mvn -pl amoro-ams -am test -Dtest='TestOptimizingQueue'
→ Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, BUILD SUCCESS
mvn -pl amoro-ams -am test -Dtest='TestTableRuntimeRefreshExecutor'
→ Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, BUILD SUCCESS
```
`spotless:check` is clean. The self-heal path is observable in the log output during the new test:
```
WARN TableRuntimeRefresher: test_catalog.test_db.test_table detected stuck RUNNING
optimizing process (processId=..., status=MINOR_OPTIMIZING): all tasks have
succeeded but the table never transitioned to COMMITTING. Self-healing by
re-driving beginCommitting() (issue #4172).
```
Risk & behavioural notes