PS-11202 [8.0]: Fix abrupt server termination when reading page tracking files fails by jakub-nowakowski-percona · Pull Request #5988 · percona/percona-server

jakub-nowakowski-percona · 2026-06-01T10:57:20Z

https://perconadev.atlassian.net/browse/PS-11202

When ib_page_* files in #ib_archive contain corrupted data or when reading them fails due to an I/O error, the server terminates with a fatal error in release builds or hits an assertion in debug builds. Replace all fatal os_file_read calls in the page archiver with os_file_read_no_error_handling, convert unsafe assertions on file data to runtime checks, and propagate errors gracefully so the server logs a warning and continues as if page tracking were unavailable.
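The pattern described above (a read that reports failure instead of aborting, runtime validation of file contents, and an error that the caller turns into "feature unavailable") can be sketched in standalone C++. This is only an illustration under stated assumptions: `ReadStatus`, `BlockHeader`, `kMagic`, `kMaxDataLen`, `read_block` and `init_page_tracking` are hypothetical names, the header layout is not the real ib_page_* format, and none of this is the actual InnoDB code from the patch.

```cpp
#include <cstdint>
#include <cstdio>
#include <memory>
#include <string>
#include <vector>

// Hypothetical status codes for this sketch (InnoDB itself uses dberr_t).
enum class ReadStatus { OK, IO_ERROR, CORRUPTED };

// Hypothetical on-disk header; NOT the real ib_page_* layout.
struct BlockHeader {
  uint32_t magic;
  uint32_t data_len;
};

constexpr uint32_t kMagic = 0x50414745;      // assumed marker value
constexpr uint32_t kMaxDataLen = 16 * 1024;  // assumed upper bound

// Closes the FILE handle on every return path.
struct FileCloser {
  void operator()(std::FILE *f) const { std::fclose(f); }
};
using FilePtr = std::unique_ptr<std::FILE, FileCloser>;

// Reads one block. Failures are returned to the caller, never fatal.
ReadStatus read_block(const std::string &path,
                      std::vector<unsigned char> &out) {
  out.clear();
  FilePtr file(std::fopen(path.c_str(), "rb"));
  if (!file) return ReadStatus::IO_ERROR;

  BlockHeader hdr{};
  if (std::fread(&hdr, sizeof(hdr), 1, file.get()) != 1) {
    return ReadStatus::IO_ERROR;  // missing or short header
  }

  // Runtime checks on untrusted file data, where an assertion used to be.
  if (hdr.magic != kMagic || hdr.data_len > kMaxDataLen) {
    return ReadStatus::CORRUPTED;
  }

  out.resize(hdr.data_len);
  if (hdr.data_len != 0 &&
      std::fread(out.data(), 1, hdr.data_len, file.get()) != hdr.data_len) {
    out.clear();
    return ReadStatus::IO_ERROR;  // short payload
  }
  return ReadStatus::OK;
}

// Caller: log the problem and continue with the feature switched off.
bool init_page_tracking(const std::string &path) {
  std::vector<unsigned char> payload;
  const ReadStatus status = read_block(path, payload);
  if (status != ReadStatus::OK) {
    std::fprintf(stderr, "page tracking unavailable: cannot use %s (%s)\n",
                 path.c_str(),
                 status == ReadStatus::CORRUPTED ? "corrupted data"
                                                 : "I/O error");
    return false;
  }
  return true;
}
```

In this sketch a caller would invoke `init_page_tracking(path)` once at startup and treat a `false` result as "page tracking is off" rather than as a reason to stop the server.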

jakub-nowakowski-percona · 2026-06-02T10:19:29Z

Tests: https://ps80.cd.percona.com/view/8.0%20+%208.4%20parallel%20MTR/job/percona-server-8.0-param-parallel-mtr/163/

satya-bodapati

As discussed on Slack, please check that an ERROR (not a warning) is thrown during recovery. It is a big enough deal that the user needs to know page tracking is not initialized.

Please grep for this error after startup, and also add queries to show that all page tracking queries respond with -1.

satya-bodapati · 2026-06-02T14:04:01Z

      arch_oper_mutex_exit();
      ut_d(ut_error);
-      ut_o(return);
+      ut_o(return );


modified by mistake?

It's due to clang-format. I can't stop it from changing that.

satya-bodapati · 2026-06-02T14:10:53Z

 Arch_File_Ctx Arch_Group::s_dblwr_file_ctx;

 Arch_Group::~Arch_Group() {
-  ut_ad(!m_is_active);


Can we instead set m_is_active to false on the error paths?

Done. m_is_active is now set to false in Arch_Page_Sys::Recovery::recover() with group->disable(LSN_MAX);.
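A minimal standalone sketch of the idea discussed here: keep the destructor assertion and have the error path deactivate the group before it is destroyed. The class `Group`, the function `recover_group`, the constant `kLsnMax` and the behaviour of `disable` are assumptions made for illustration; they only mirror the names mentioned in this thread and are not the real `Arch_Group` code.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

using lsn_t = uint64_t;
// Stand-in for LSN_MAX, meaning "no end LSN known" in this sketch.
constexpr lsn_t kLsnMax = std::numeric_limits<lsn_t>::max();

// Hypothetical group object with the invariant "inactive when destroyed".
class Group {
 public:
  ~Group() { assert(!m_is_active); }  // the invariant stays asserted

  // Marks the group inactive; records an end LSN only when one is known.
  void disable(lsn_t end_lsn) {
    m_is_active = false;
    if (end_lsn != kLsnMax) m_end_lsn = end_lsn;
  }

  bool is_active() const { return m_is_active; }
  lsn_t end_lsn() const { return m_end_lsn; }

 private:
  bool m_is_active{true};
  lsn_t m_end_lsn{kLsnMax};
};

// Hypothetical recovery step: on failure, deactivate the group so that
// destroying it later does not trip the destructor assertion.
bool recover_group(Group &group, bool files_readable) {
  if (!files_readable) {
    group.disable(kLsnMax);
    return false;
  }
  return true;
}
```

The design choice shown is that the error path restores the invariant instead of the invariant check being deleted, so debug builds still catch other code paths that destroy an active group.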

@inikep

…flow

Replaces the Cirrus `(arm64) gcc RelWithDebInfo [Noble]` task on 8.0, trunk, and 8.4 before Cirrus shuts down 2026-06-01. Centralised on the 8.0 default branch because GHA `schedule:` triggers only fire from the default branch. Three cron entries dispatch to the corresponding ref via the `pickbranch` job; the build step gates Boost cache, KEYRING_VAULT vs COMPONENT_KEYRING_VAULT, READLINE vs EDITLINE, and WITH_CURL on the target branch.

MTR suite: `binlog_nogtid` per Przemek 2026-05-18: "I would keep binlog_nogtid for RelWithDebInfo cron." Manual `workflow_dispatch` exposes a branch selector for ad-hoc runs.

PS-11078: add path-filtered pull_request self-test trigger

The nightly workflow had no pull_request trigger, so opening this PR could not exercise the new file. The other PR's CI green came from build-arm64.yml firing incidentally (because PR target is 8.0), not from validating the nightly file's RelWithDebInfo + binlog_nogtid + pickbranch logic.

Add a path-filtered pull_request trigger so the workflow self-tests when the file itself changes. pickbranch handles pull_request by building against `github.base_ref` (the PR's target branch), exercising the branch-aware cmake conditional for that target.

This stays in the merged file as a permanent safety net: future edits to build-arm64-nightly.yml will self-validate before merge.

PS-11078: harden create-runner against cloud-init bootstrap flake

Three flakes today (5974 first, 5972 first, 5972 rerun) all failed at the same step: Hetzner VM provisions fine, but the runner agent never calls home within the 10-min `runner_wait` budget. Codex diagnosis 2026-05-22: cloud-init bootstrap (apt update/install, runner release download, config.sh) hits transient slowness; default budget too tight.

- runner_version: '2.334.0' (was implicit 'latest', fetched on every boot)
- runner_wait: '120' (= 120x10s = 20min, was default 10min)
- pre_runner_script: retry apt 3x with 15/30/45s backoff
- concurrency: pull_request gets PR-scoped group + cancel-in-progress (was global, never-cancel). Cron + workflow_dispatch stay protected.

Followup (separate PR): same hardening on build-arm64.yml; ship a debug escape hatch that preserves /var/log/cloud-init-output.log on create-runner failure.

PS-11078: revert runner_version pin + add debug-keep-vm diagnostic

Tonight's 4th create-runner failure (run 26314796152) was not transient apt slowness or fixed by pinning runner_version=2.334.0; Hetzner appears degraded (cax41 fully unavailable, 429 burst on the token, repeated runner-registration timeouts). The pin was a speculative variable that shouldn't stay until we have VM-side evidence.

Reverts:
- runner_version (was '2.334.0', back to action default 'latest')

Keeps (cheap, useful regardless):
- runner_wait: '120' (20min budget)
- pre_runner_script: apt retry 3x with backoff

Adds:
- workflow_dispatch input `debug_keep_vm` (boolean, default false)
- Step `Preserve VM for manual diagnosis (on failure, debug-only)` in create-runner that prints VM IP + SSH command + cleanup snippet to the GHA step summary when debug_keep_vm is true and create-runner failed
- delete-runner skip condition: when debug_keep_vm && create-runner failure, leave the VM alive for SSH diagnosis. orphan-sweep reaps after 6h.

Tomorrow morning workflow:
1. Trigger workflow_dispatch with debug_keep_vm=true
2. Wait for create-runner failure (or success)
3. If fail: SSH in with key 107239874, grab /var/log/cloud-init-output.log and /actions-runner/_diag/, then DELETE the VM via Hetzner API

PS-11078: unify per-PR + nightly arm64 into build.yml

- New build.yml: dispatch job picks BUILD_TYPE/MTR/CCACHE_MAXSIZE per event (pull_request -> Debug+main.1st, schedule '0 1 * * *' -> RelWithDebInfo+binlog_nogtid, workflow_dispatch -> input.build_type)
- Drop pickbranch + branch-aware cmake/apt; this file is 8.0-only per Przemek 2026-05-27
- Rename step "System and compiler info" -> "Compiler and cmake info"
- x86_64 nightly intentionally not wired; sibling job can reuse dispatch outputs later
- Delete build-arm64.yml + build-arm64-nightly.yml (folded in)

PS-11219: AWS Graviton EC2 fallback when Hetzner CAX exhausted

build.yml: pick-target retries Hetzner 9 sweeps (backoff 2/5/10/15/20/30/45/60 min, ~3h7m total) before falling back to c7g.4xlarge in eu-central-1 (spot across 1a/1b/1c, then on-demand same AZs). OIDC via aws-actions/configure-aws-credentials@v4 SHA-pinned; one-shot runner registration token via gh api; userdata rendered inline as heredoc (single source of truth, no PR-controlled checkout); --client-token for launch idempotency; tag-based delete-runner-aws discovery for cancelled-mid-launch cleanup; permissions: id-token: write scoped per-job.

orphan-sweep.yml: sweep-ec2 job (hourly, 6h threshold, OIDC, tag-keyed) covers any leaked instances the per-run delete misses.

Validated end-to-end via fork PoC nogueiraanderson/percona-server@PS-11179-ec2-fallback-poc iteration 11 (build-arm64 12m46s cold ccache on c7g.4xlarge in eu-central-1).

Requires: secret AWS_ROLE_ARN set to the role ARN provisioned by the companion Percona/percona-cd-platform PR.

Reviewer: @inikep.
PS-11078 (parent: Cirrus to GHA arm64 migration on percona-server).
PS-11179 (sibling: Jenkins-fleet ARM fallback).

PS-11254: Build arm64 on fork PRs from percona org members (#5994)

* ci(build): cut Hetzner capacity sweeps 9 to 4 for faster AWS fallback

- pick-target: MAX_SWEEPS 9->4, BACKOFF_MIN (2 5 10 15 20 30 45 60)->(2 5 10)
- AWS Graviton EC2 fallback now fires after ~17m instead of ~3h7m
- pick-target timeout-minutes 240->30 to match the shorter sweep budget
- refresh stale ~3h7m and 9-sweep comments and step-summary strings

PS-11254

* ci(build): gate fork-PR arm64 builds to percona org members

- add pull_request_target trigger so fork PRs resolve secrets in base-repo context (same-repo PRs stay on pull_request, no double-run)
- dispatch gate authorizes fork PRs only when the author is a percona org member (author_association MEMBER/OWNER/COLLABORATOR); no per-run approval
- build-arm64 checks out the fork head sha and holds no secrets, preserving the trust split (workflow + job permissions stay contents: read)
- scope concurrency group by event_name and make pull_request_target cancellable

PS-11254

ci(build): gate fork PRs by repo write access, not author_association (#5995)

- author_association in the pull_request_target payload reports CONTRIBUTOR for private percona org members, so the prior gate skipped them (verified on PR #5988: the pull_request_target run was skipped)
- replace it with an authorize job that checks repo collaborator permission (write+) via GHA_RUNNER_PAT; members resolve to write, non-collaborators to read on this public repo (empirically confirmed)
- dispatch now needs authorize and runs only when ok == true

PS-11254

jakub-nowakowski-percona requested a review from satya-bodapati June 1, 2026 10:57

jakub-nowakowski-percona self-assigned this Jun 1, 2026

satya-bodapati requested changes Jun 2, 2026

jakub-nowakowski-percona force-pushed the PS-11202-8.0-corrupted-page-tracking-info-crash branch from 0c72c65 to f996b61 Compare June 3, 2026 12:25

nogueiraanderson closed this Jun 5, 2026

nogueiraanderson reopened this Jun 5, 2026

nogueiraanderson closed this Jun 5, 2026

nogueiraanderson reopened this Jun 5, 2026

jakub-nowakowski-percona wants to merge 1 commit into percona:8.0 from jakub-nowakowski-percona:PS-11202-8.0-corrupted-page-tracking-info-crash

3 participants
