
[AMORO-4172] Self-heal stuck RUNNING optimizing process in TableRuntimeRefreshExecutor #4196

Open

fightBoxing wants to merge 1 commit into apache:master from fightBoxing:fix/4172-self-heal-stuck-committing-v2

Conversation

@fightBoxing

Fixes #4172.

Supersedes #4195 (closed — that branch was stale against the recent scheduler-framework refactor).

Problem

When a table's optimizing process reaches the post-execution phase (all tasks are `SUCCESS` and the code is about to transition the table to `COMMITTING`), a transient failure of `DefaultTableRuntime#beginCommitting()` (e.g. a DB lock wait timeout in the underlying `UPDATE table_runtime`) leaves the process `RUNNING` while the table status stays stuck in `*_OPTIMIZING` until AMS is restarted. The issue reporter's stack trace confirms exactly this symptom.

Why existing code paths do not recover the table

  • `OptimizingQueue.acceptResult()` (L677–L683) calls `tableRuntime.beginCommitting()` exactly once, at the moment a task transitions to `SUCCESS`. It has no retry: a single transient DB failure is unrecoverable at runtime (see the sketch below).
  • `OptimizingQueue#initTableRuntime` (L144) does re-drive `beginCommitting()` when `allTasksPrepared()`, but only at AMS startup.
  • `OptimizingCommitExecutor` only schedules tables already in `COMMITTING`, so it cannot see the stuck state.

Result: while AMS is alive, there is no self-healing; the operator must restart AMS.
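For illustration, the unguarded call site has roughly this shape (simplified from the first bullet above; illustrative, not the actual source):

```
// Simplified shape of the OptimizingQueue.acceptResult() tail: the
// transition is attempted exactly once, with no retry, when the last
// task succeeds.
if (process.allTasksPrepared()) {
  tableRuntime.beginCommitting(); // a transient DB failure here is terminal until restart
}
```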

Approach

Leverage the existing `TableRuntimeRefreshExecutor`, which already scans every table periodically. At the end of `execute()`, detect the stuck pattern and re-drive the transition:

```
process != null
&& process.getStatus() == ProcessStatus.RUNNING
&& tableRuntime.getOptimizingStatus() != OptimizingStatus.COMMITTING
&& tableRuntime.getOptimizingStatus().isProcessing()
&& process.allTasksPrepared()
```

When all conditions hold, call `tableRuntime.beginCommitting()`. On success, the normal `handleTableChanged` → `OptimizingCommitExecutor` pipeline resumes. On failure, log at `WARN` and retry on the next refresh cycle (≤ 1 minute by default).
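In code, the hook could take roughly this shape (a sketch using the names from this description; the accessor `getOptimizingProcess()` and the logger are assumptions, not necessarily the patch verbatim):

```
// Sketch only: follows the detection condition above; accessor names assumed.
private void tryHealStuckCommitting(DefaultTableRuntime tableRuntime) {
  OptimizingProcess process = tableRuntime.getOptimizingProcess(); // assumed accessor
  OptimizingStatus status = tableRuntime.getOptimizingStatus();
  if (process != null
      && process.getStatus() == ProcessStatus.RUNNING
      && status != OptimizingStatus.COMMITTING
      && status.isProcessing()
      && process.allTasksPrepared()) {
    try {
      // Re-drive the missed transition; on success the normal
      // handleTableChanged -> OptimizingCommitExecutor pipeline resumes.
      tableRuntime.beginCommitting();
    } catch (Exception e) {
      // Leave state untouched; the next refresh cycle (<= 1 minute) retries.
      LOG.warn("Self-heal of stuck RUNNING optimizing process failed, will retry", e);
    }
  }
}
```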

Why this location

  • `TableRuntimeRefreshExecutor` already runs for every table with `enabled() == true`: no new thread pool, no new scheduling mechanism, and it is exactly the right granularity (slow poll, low cost per table).
  • `OptimizingCommitExecutor` is intentionally kept as-is; it gets notified via the existing `handleTableChanged` event once the self-heal succeeds.
  • `OptimizingQueue.acceptResult()` hot path (concurrent task completion) is not changed — we only expose an existing read method. The lock added around `allTasksPrepared()` is the same `ReentrantLock` that already serializes `acceptResult`.

Changes

  • `OptimizingProcess.java`: add `boolean allTasksPrepared()` to the interface.
  • `OptimizingQueue.java`: `TableOptimizingProcess.allTasksPrepared()` becomes `@Override public` and wraps `lock.lock()` / `unlock()` for safe cross-thread observation of `taskMap` (sketched below, after the diff stats).
  • `scheduler/inline/TableRuntimeRefreshExecutor.java`: new private `tryHealStuckCommitting()` invoked at the end of `execute()`.
  • `TestDefaultOptimizingService.java`: new regression test `testRefreshExecutorHealsStuckRunningProcess` that uses `TableRuntimeStore#begin().updateStatusCode` to simulate the stuck state without a reload, then verifies a single `execute()` call transitions the table back to `COMMITTING`.

```
4 files changed, 133 insertions(+), 6 deletions(-)
```
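The promoted method itself is small; a sketch of its intended shape, assuming the `lock` field that already serializes `acceptResult` and a `taskMap` of task runtimes (the status enum name here is illustrative):

```
@Override
public boolean allTasksPrepared() {
  lock.lock(); // the same ReentrantLock that serializes acceptResult
  try {
    // True only when tasks exist and every one has reached SUCCESS.
    return !taskMap.isEmpty()
        && taskMap.values().stream()
            .allMatch(t -> t.getStatus() == TaskRuntime.Status.SUCCESS); // enum name illustrative
  } finally {
    lock.unlock();
  }
}
```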

Verification

All targeted test suites pass locally on JDK 11:

```
mvn -pl amoro-ams -am test \
  -Dtest='TestDefaultOptimizingService#testRefreshExecutorHealsStuckRunningProcess' \
  -Dsurefire.failIfNoSpecifiedTests=false -DskipITs
→ Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, BUILD SUCCESS

mvn -pl amoro-ams -am test -Dtest='TestDefaultOptimizingService'
→ Tests run: 22, Failures: 0, Errors: 0, Skipped: 0, BUILD SUCCESS

mvn -pl amoro-ams -am test -Dtest='TestOptimizingQueue'
→ Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, BUILD SUCCESS

mvn -pl amoro-ams -am test -Dtest='TestTableRuntimeRefreshExecutor'
→ Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, BUILD SUCCESS
```

`spotless:check` clean. The self-heal path is observable during the new test:

```
WARN TableRuntimeRefresher: test_catalog.test_db.test_table detected stuck RUNNING
optimizing process (processId=..., status=MINOR_OPTIMIZING): all tasks have
succeeded but the table never transitioned to COMMITTING. Self-healing by
re-driving beginCommitting() (issue #4172).
```

Risk & behavioural notes

  • No change to concurrent hot paths. Only one additional idempotent read (`allTasksPrepared`, lock-protected) per refresh cycle, short-circuited early for tables that are not processing.
  • Worst case, if the underlying DB stays unhealthy: the self-heal keeps logging at `WARN` and retrying every refresh cycle, which is the same cost as today plus one cheap read, and strictly better than "stuck until restart".
  • This is complementary to the existing restart-recovery in `OptimizingQueue#initTableRuntime`, which remains unchanged. A companion test `testReloadAllTasksCompletedNotYetCommitting` already guards that path.

…meRefreshExecutor

When a RUNNING optimizing process has all tasks succeeded but a transient
failure of beginCommitting() (e.g. a DB lock wait timeout) leaves the
table stuck in *_OPTIMIZING, the status is never recovered while AMS
keeps running; only an AMS restart clears it. The existing recovery path
in OptimizingQueue#initTableRuntime already handles this on restart, but
there is currently no periodic retry while AMS is alive.

Add a lightweight self-heal path in TableRuntimeRefreshExecutor.execute:
if it detects a process in RUNNING state whose status != COMMITTING and
allTasksPrepared() == true, it re-invokes beginCommitting(). On success,
the normal OptimizingCommitExecutor pipeline resumes via handleTableChanged;
on failure, it logs and retries on the next refresh cycle.

Changes:
- OptimizingProcess: add allTasksPrepared() to the interface
- OptimizingQueue.TableOptimizingProcess: promote allTasksPrepared() to
  public @Override and wrap it in the existing lock for safe cross-thread
  observation of taskMap
- TableRuntimeRefreshExecutor (server/scheduler/inline/): invoke
  tryHealStuckCommitting at the end of execute()
- Test: regression test using TableRuntimeStore.begin().updateStatusCode
  to simulate the stuck state without a reload, verifying that a single
  refresh cycle restores the table to COMMITTING

Fixes apache#4172
github-actions bot added the module:ams-server (Ams server module) label on Apr 25, 2026
@czy006 (Contributor) left a comment


Thanks for the fix. I do not think this PR fully closes #4172 yet.

The current self-heal path only retries beginCommitting() when process.allTasksPrepared() is true. However, in the reported failure stack, beginCommitting() is invoked from the TaskRuntime.complete() callback before the outer transaction has successfully committed. If beginCommitting() fails with a DB lock wait timeout, StatedPersistentBase.invokeConsistency() can restore the current TaskRuntime state fields, so the last task may remain ACKED instead of SUCCESS.

In that real failure state, allTasksPrepared() returns false, and the new TableRuntimeRefreshExecutor recovery path will not run.

The added test covers a narrower scenario: it first lets completeTask() succeed, then manually changes only the table status back to MINOR_OPTIMIZING. That validates recovery for RUNNING + all tasks SUCCESS + table not COMMITTING, but it does not reproduce the actual lock-timeout rollback path described in #4172.

I think this needs either:

  1. a regression test that injects a failure from beginCommitting() during completeTask() and verifies the post-failure task/table/process state (a rough shape is sketched below), or
  2. a fix that also handles the case where the final task was rolled back to ACKED while the optimizer has already produced the task result.

Until that real failure path is covered, I would treat this PR as a partial mitigation rather than a complete fix for #4172.
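A rough shape for option 1, using a Mockito spy (every helper and type name here is hypothetical; it only illustrates the intended injection point):

```
// Hypothetical sketch of option 1: make beginCommitting() fail once while
// the final task completes, then pin down the post-failure state.
DefaultTableRuntime runtime = Mockito.spy(tableRuntime);
Mockito.doThrow(new RuntimeException("simulated DB lock wait timeout"))
    .doCallRealMethod()
    .when(runtime)
    .beginCommitting();

completeLastTask(runtime); // hypothetical helper driving TaskRuntime.complete()

// Per the analysis above, invokeConsistency() may roll the final task back
// to ACKED; the assertions should capture whichever state the fix guarantees.
Assertions.assertNotEquals(OptimizingStatus.COMMITTING, runtime.getOptimizingStatus());
```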
