
Flink: Persist flink job id in DynamicWriteResultAggregator state #16011

Open
lrsb wants to merge 2 commits into apache:main from lrsb:github

Conversation


@lrsb lrsb commented Apr 17, 2026

Fixes #16008

The DynamicCommitter deduplicates commits on the triplet (flink.job-id, flink.operator-id, max-committed-checkpoint-id) stored in each snapshot summary. The aggregator previously sampled the job id from the runtime environment on every operator open(), so a restart or rescale with a fresh Flink JobID would stamp restored write results with the new id. The committer's ancestor walk then could not find the snapshot committed under the old id, and silently re-committed the data, producing duplicate rows in the target table.

Persist the aggregator's job id as operator state and reuse it on restore, so committables keep the job id under which their data was originally produced. This restores the dedup invariant across job restarts, autoscaler savepoint+resubmit cycles, and session-cluster resubmissions. The invariant was previously broken in the reported scenario of sustained catalog-side latency, where commit responses were lost client-side.
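To illustrate the failure mode, here is a minimal, self-contained sketch of the dedup check described above. This is not the actual Iceberg code: the real DynamicCommitter walks Iceberg snapshot ancestors and reads their summary maps, and all names here (`Summary`, `alreadyCommitted`) are hypothetical stand-ins.

```java
import java.util.List;

// Illustrative model only: the real DynamicCommitter walks Iceberg snapshot
// ancestors; this sketch models each snapshot summary as a record holding
// the (jobId, operatorId, max-committed-checkpoint-id) triplet.
class DedupSketch {
  record Summary(String jobId, String operatorId, long maxCommittedCheckpointId) {}

  // Returns true if a commit for this triplet is already present in the
  // snapshot history, meaning the committable must be skipped on restore.
  static boolean alreadyCommitted(
      List<Summary> ancestors, String jobId, String operatorId, long checkpointId) {
    for (Summary s : ancestors) {
      if (s.jobId().equals(jobId)
          && s.operatorId().equals(operatorId)
          && s.maxCommittedCheckpointId() >= checkpointId) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<Summary> history = List.of(new Summary("job-A", "op-1", 42L));

    // Same job id: the walk finds the snapshot, the commit is deduplicated.
    System.out.println(alreadyCommitted(history, "job-A", "op-1", 42L)); // true

    // After a restart the runtime reports a fresh JobID; stamping restored
    // write results with it makes the walk miss the old snapshot, so the
    // same data is committed again, producing duplicate rows.
    System.out.println(alreadyCommitted(history, "job-B", "op-1", 42L)); // false
  }
}
```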

@github-actions github-actions bot added the flink label Apr 17, 2026
@lrsb lrsb marked this pull request as ready for review April 17, 2026 12:41
@lrsb lrsb marked this pull request as draft April 17, 2026 13:15
lrsb added 2 commits April 17, 2026 15:25
DynamicWriteResultAggregator read the flink job id and operator id
live from the runtime environment each time it built a DynamicCommittable.
After a restart or rescale these values change, so committables produced
from restored state were stamped with the new job id, defeating the
persistence that DynamicCommitter dedup relies on.

Use the aggregator's stored flinkJobId (restored from operator state)
and operatorId fields when constructing committables, and normalize the
operatorId source in open() to getOperatorUniqueID().
@lrsb lrsb marked this pull request as ready for review April 17, 2026 13:27
