
[AMORO-4172] Self-heal stuck RUNNING optimizing process in TableRuntimeRefreshExecutor #4196

Open

fightBoxing wants to merge 1 commit into apache:master from fightBoxing:fix/4172-self-heal-stuck-committing-v2

Conversation

@fightBoxing

Fixes #4172.

Supersedes #4195 (closed — that branch was stale against the recent scheduler-framework refactor).

Problem

When a table's optimizing process reaches the post-execution phase (all tasks are `SUCCESS` and the code is about to transition the table to `COMMITTING`), a transient failure of `DefaultTableRuntime#beginCommitting()` (e.g. a DB lock wait timeout in the underlying `UPDATE table_runtime`) leaves the process `RUNNING` while the table status stays stuck in `*_OPTIMIZING` until AMS is restarted. The issue reporter's stack trace confirms exactly this symptom.

Why existing code paths do not recover the table

  • `OptimizingQueue.acceptResult()` (L677–L683) calls `tableRuntime.beginCommitting()` exactly once, at the moment a task transitions to `SUCCESS`. It has no retry: a single transient DB failure is unrecoverable at runtime (see the sketch below).
  • `OptimizingQueue#initTableRuntime` (L144) does re-drive `beginCommitting()` when `allTasksPrepared()`, but only at AMS startup.
  • `OptimizingCommitExecutor` only schedules tables already in `COMMITTING`, so it cannot see the stuck state.

Result: while AMS is alive, there is no self-healing; the operator must restart AMS.
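For illustration, the unguarded call site has roughly this shape (simplified from the first bullet above; illustrative, not the actual source):

```
// Simplified shape of the OptimizingQueue.acceptResult() tail: the
// transition is attempted exactly once, with no retry, when the last
// task succeeds.
if (process.allTasksPrepared()) {
  tableRuntime.beginCommitting(); // a transient DB failure here is terminal until restart
}
```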

Approach

Leverage the existing `TableRuntimeRefreshExecutor`, which already scans every table periodically. At the end of `execute()`, detect the stuck pattern and re-drive the transition:

```
process != null
&& process.getStatus() == ProcessStatus.RUNNING
&& tableRuntime.getOptimizingStatus() != OptimizingStatus.COMMITTING
&& tableRuntime.getOptimizingStatus().isProcessing()
&& process.allTasksPrepared()
```

When all conditions hold, call `tableRuntime.beginCommitting()`. On success, the normal `handleTableChanged` → `OptimizingCommitExecutor` pipeline resumes. On failure, log at `WARN` and retry on the next refresh cycle (≤ 1 minute by default).
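In code, the hook could take roughly this shape (a sketch using the names from this description; the accessor `getOptimizingProcess()` and the logger are assumptions, not necessarily the patch verbatim):

```
// Sketch only: follows the detection condition above; accessor names assumed.
private void tryHealStuckCommitting(DefaultTableRuntime tableRuntime) {
  OptimizingProcess process = tableRuntime.getOptimizingProcess(); // assumed accessor
  OptimizingStatus status = tableRuntime.getOptimizingStatus();
  if (process != null
      && process.getStatus() == ProcessStatus.RUNNING
      && status != OptimizingStatus.COMMITTING
      && status.isProcessing()
      && process.allTasksPrepared()) {
    try {
      // Re-drive the missed transition; on success the normal
      // handleTableChanged -> OptimizingCommitExecutor pipeline resumes.
      tableRuntime.beginCommitting();
    } catch (Exception e) {
      // Leave state untouched; the next refresh cycle (<= 1 minute) retries.
      LOG.warn("Self-heal of stuck RUNNING optimizing process failed, will retry", e);
    }
  }
}
```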

Why this location

  • `TableRuntimeRefreshExecutor` already runs for every table with `enabled() == true`: no new thread pool, no new scheduling mechanism, and it is exactly the right granularity (slow poll, low cost per table).
  • `OptimizingCommitExecutor` is intentionally kept as-is; it gets notified via the existing `handleTableChanged` event once the self-heal succeeds.
  • `OptimizingQueue.acceptResult()` hot path (concurrent task completion) is not changed — we only expose an existing read method. The lock added around `allTasksPrepared()` is the same `ReentrantLock` that already serializes `acceptResult`.

Changes

  • `OptimizingProcess.java`: add `boolean allTasksPrepared()` to the interface.
  • `OptimizingQueue.java`: `TableOptimizingProcess.allTasksPrepared()` becomes `@Override public` and wraps `lock.lock()` / `unlock()` for safe cross-thread observation of `taskMap` (sketched below, after the diff stats).
  • `scheduler/inline/TableRuntimeRefreshExecutor.java`: new private `tryHealStuckCommitting()` invoked at the end of `execute()`.
  • `TestDefaultOptimizingService.java`: new regression test `testRefreshExecutorHealsStuckRunningProcess` that uses `TableRuntimeStore#begin().updateStatusCode` to simulate the stuck state without a reload, then verifies a single `execute()` call transitions the table back to `COMMITTING`.

```
4 files changed, 133 insertions(+), 6 deletions(-)
```
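The promoted method itself is small; a sketch of its intended shape, assuming the `lock` field that already serializes `acceptResult` and a `taskMap` of task runtimes (the status enum name here is illustrative):

```
@Override
public boolean allTasksPrepared() {
  lock.lock(); // the same ReentrantLock that serializes acceptResult
  try {
    // True only when tasks exist and every one has reached SUCCESS.
    return !taskMap.isEmpty()
        && taskMap.values().stream()
            .allMatch(t -> t.getStatus() == TaskRuntime.Status.SUCCESS); // enum name illustrative
  } finally {
    lock.unlock();
  }
}
```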

Verification

All targeted test suites pass locally on JDK 11:

```
mvn -pl amoro-ams -am test \
  -Dtest='TestDefaultOptimizingService#testRefreshExecutorHealsStuckRunningProcess' \
  -Dsurefire.failIfNoSpecifiedTests=false -DskipITs
→ Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, BUILD SUCCESS

mvn -pl amoro-ams -am test -Dtest='TestDefaultOptimizingService'
→ Tests run: 22, Failures: 0, Errors: 0, Skipped: 0, BUILD SUCCESS

mvn -pl amoro-ams -am test -Dtest='TestOptimizingQueue'
→ Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, BUILD SUCCESS

mvn -pl amoro-ams -am test -Dtest='TestTableRuntimeRefreshExecutor'
→ Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, BUILD SUCCESS
```

`spotless:check` clean. The self-heal path is observable during the new test:

```
WARN TableRuntimeRefresher: test_catalog.test_db.test_table detected stuck RUNNING
optimizing process (processId=..., status=MINOR_OPTIMIZING): all tasks have
succeeded but the table never transitioned to COMMITTING. Self-healing by
re-driving beginCommitting() (issue #4172).
```

Risk & behavioural notes

  • No change to concurrent hot paths. Only one additional idempotent read (`allTasksPrepared`, lock-protected) per refresh cycle, short-circuited early for tables that are not processing.
  • Worst case, if the underlying DB stays unhealthy: the self-heal keeps logging at `WARN` and retrying every refresh cycle, which is the same cost as today plus one cheap read, and strictly better than "stuck until restart".
  • This is complementary to the existing restart-recovery in `OptimizingQueue#initTableRuntime`, which remains unchanged. A companion test `testReloadAllTasksCompletedNotYetCommitting` already guards that path.

…meRefreshExecutor

When a RUNNING optimizing process has all tasks succeeded but a transient
failure of beginCommitting() (e.g. a DB lock wait timeout) leaves the
table stuck in *_OPTIMIZING, the status is never recovered while AMS
keeps running; only an AMS restart clears it. The existing recovery path
in OptimizingQueue#initTableRuntime already handles this on restart, but
there is currently no periodic retry while AMS is alive.

Add a lightweight self-heal path in TableRuntimeRefreshExecutor.execute:
if it detects a process in RUNNING state whose status != COMMITTING and
allTasksPrepared() == true, it re-invokes beginCommitting(). On success,
the normal OptimizingCommitExecutor pipeline resumes via handleTableChanged;
on failure, it logs and retries on the next refresh cycle.

Changes:
- OptimizingProcess: add allTasksPrepared() to the interface
- OptimizingQueue.TableOptimizingProcess: promote allTasksPrepared() to
  public @Override and wrap it in the existing lock for safe cross-thread
  observation of taskMap
- TableRuntimeRefreshExecutor (server/scheduler/inline/): invoke
  tryHealStuckCommitting at the end of execute()
- Test: regression test using TableRuntimeStore.begin().updateStatusCode
  to simulate the stuck state without a reload, verifying that a single
  refresh cycle restores the table to COMMITTING

Fixes apache#4172
github-actions bot added the module:ams-server (Ams server module) label on Apr 25, 2026
@czy006 (Contributor) left a comment


Thanks for the fix. I do not think this PR fully closes #4172 yet.

The current self-heal path only retries beginCommitting() when process.allTasksPrepared() is true. However, in the reported failure stack, beginCommitting() is invoked from the TaskRuntime.complete() callback before the outer transaction has successfully committed. If beginCommitting() fails with a DB lock wait timeout, StatedPersistentBase.invokeConsistency() can restore the current TaskRuntime state fields, so the last task may remain ACKED instead of SUCCESS.

In that real failure state, allTasksPrepared() returns false, and the new TableRuntimeRefreshExecutor recovery path will not run.

The added test covers a narrower scenario: it first lets completeTask() succeed, then manually changes only the table status back to MINOR_OPTIMIZING. That validates recovery for RUNNING + all tasks SUCCESS + table not COMMITTING, but it does not reproduce the actual lock-timeout rollback path described in #4172.

I think this needs either:

  1. a regression test that injects a failure from beginCommitting() during completeTask() and verifies the post-failure task/table/process state (a rough shape is sketched below), or
  2. a fix that also handles the case where the final task was rolled back to ACKED while the optimizer has already produced the task result.

Until that real failure path is covered, I would treat this PR as a partial mitigation rather than a complete fix for #4172.
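A rough shape for option 1, using a Mockito spy (every helper and type name here is hypothetical; it only illustrates the intended injection point):

```
// Hypothetical sketch of option 1: make beginCommitting() fail once while
// the final task completes, then pin down the post-failure state.
DefaultTableRuntime runtime = Mockito.spy(tableRuntime);
Mockito.doThrow(new RuntimeException("simulated DB lock wait timeout"))
    .doCallRealMethod()
    .when(runtime)
    .beginCommitting();

completeLastTask(runtime); // hypothetical helper driving TaskRuntime.complete()

// Per the analysis above, invokeConsistency() may roll the final task back
// to ACKED; the assertions should capture whichever state the fix guarantees.
Assertions.assertNotEquals(OptimizingStatus.COMMITTING, runtime.getOptimizingStatus());
```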
