PS-11202 [8.0]: Fix abrupt server termination when reading page tracking files fails#5988
Open
jakub-nowakowski-percona wants to merge 1 commit into
Contributor
Author
satya-bodapati
requested changes
Jun 2, 2026
satya-bodapati
left a comment
Contributor
As discussed on Slack, please check that an ERROR (not a warning) is thrown during recovery. It is a big enough deal that the user needs to know page tracking is not initialized.
Please grep for this error after startup, and also add queries showing that all page tracking queries respond with -1.
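To make the requested behaviour concrete, here is a minimal sketch of a query path that answers -1 when page tracking was never initialized. `TrackingState`, `get_start_lsn` and `TRACKING_UNAVAILABLE` are hypothetical names for illustration only; they are not the server's actual interface.

```cpp
#include <cstdint>

// Hypothetical sentinel: the value the review asks every page tracking
// query to return when tracking is not initialized.
constexpr std::int64_t TRACKING_UNAVAILABLE = -1;

// Hypothetical state; the real server keeps this inside the page archiver.
struct TrackingState {
  bool initialized = false;
  std::uint64_t start_lsn = 0;
};

// Returns the tracking start LSN, or -1 if recovery failed to initialize
// page tracking.
std::int64_t get_start_lsn(const TrackingState &state) {
  if (!state.initialized) {
    return TRACKING_UNAVAILABLE;
  }
  return static_cast<std::int64_t>(state.start_lsn);
}
```

A test along these lines would assert the sentinel after a failed recovery and a real LSN otherwise.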
| arch_oper_mutex_exit(); | ||
| ut_d(ut_error); | ||
| ut_o(return); | ||
| ut_o(return ); |
Contributor
modified by mistake?
Contributor
Author
It's due to clang-format. I can't stop it from changing that.
| Arch_File_Ctx Arch_Group::s_dblwr_file_ctx; | ||
|
|
||
| Arch_Group::~Arch_Group() { | ||
| ut_ad(!m_is_active); |
Contributor
Can we instead set m_is_active to false on the error paths?
Contributor
Author
Done. m_is_active is now set to false in Arch_Page_Sys::Recovery::recover() with group->disable(LSN_MAX);.
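For readers following the thread, a minimal sketch of the idea: the destructor keeps its assertion, and the error path deactivates the group before returning. `Group` and `recover` below are simplified stand-ins, not the real `Arch_Group` or `Arch_Page_Sys::Recovery::recover()`, and the `lsn_t` and `LSN_MAX` definitions are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Assumed stand-ins for InnoDB's lsn_t and LSN_MAX.
using lsn_t = std::uint64_t;
constexpr lsn_t LSN_MAX = std::numeric_limits<lsn_t>::max();

// Simplified stand-in for an archive group.
class Group {
 public:
  // The invariant from the diff: a group must be inactive when destroyed.
  ~Group() { assert(!m_is_active); }

  void enable() { m_is_active = true; }

  // Marks the group inactive and records where tracking ended.
  void disable(lsn_t end_lsn) {
    m_end_lsn = end_lsn;
    m_is_active = false;
  }

  bool is_active() const { return m_is_active; }
  lsn_t end_lsn() const { return m_end_lsn; }

 private:
  bool m_is_active = false;
  lsn_t m_end_lsn = LSN_MAX;
};

// Hypothetical recovery step: read_ok models whether the tracking files
// could be read. On failure the group is disabled before returning, so the
// destructor assertion still holds.
bool recover(Group &group, bool read_ok) {
  group.enable();
  if (!read_ok) {
    group.disable(LSN_MAX);
    return false;
  }
  return true;
}
```

Keeping the assertion and fixing the error path preserves the check for genuine lifetime bugs, which relaxing the destructor would hide.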
…ing files fails

https://perconadev.atlassian.net/browse/PS-11202

When ib_page_* files in #ib_archive contain corrupted data or when reading them fails due to an I/O error, the server terminates with a fatal error in release builds or hits an assertion in debug builds. Replace all fatal os_file_read calls in the page archiver with os_file_read_no_error_handling, convert unsafe assertions on file data to runtime checks, and propagate errors gracefully so the server logs a warning and continues as if page tracking were unavailable.
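The three steps in this commit message (non-fatal read, runtime check instead of assertion, error propagation) can be sketched roughly as follows. This is an illustration under stated assumptions, not the patch itself: it uses standard C++ file I/O rather than `os_file_read_no_error_handling`, and `Err`, `read_exact`, `parse_block_count`, `PageTracking`, `HEADER_SIZE` and `MAX_BLOCKS` are all hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical error codes; InnoDB itself uses dberr_t.
enum class Err { OK, IO_ERROR, CORRUPT };

// Hypothetical layout: an 8-byte header whose first 4 bytes hold a block count.
constexpr std::size_t HEADER_SIZE = 8;
constexpr std::uint32_t MAX_BLOCKS = 1024;

// Non-fatal read: a short or failed read is reported to the caller
// instead of terminating the process.
Err read_exact(const std::string &path, std::size_t len,
               std::vector<unsigned char> &out) {
  std::ifstream file(path, std::ios::binary);
  if (!file) return Err::IO_ERROR;
  out.resize(len);
  file.read(reinterpret_cast<char *>(out.data()),
            static_cast<std::streamsize>(len));
  if (static_cast<std::size_t>(file.gcount()) != len) return Err::IO_ERROR;
  return Err::OK;
}

// Runtime check on file data: values read from disk are validated and
// rejected, where an assertion would have aborted a debug build.
Err parse_block_count(const std::vector<unsigned char> &header,
                      std::uint32_t &count) {
  if (header.size() < HEADER_SIZE) return Err::CORRUPT;
  std::memcpy(&count, header.data(), sizeof(count));
  if (count > MAX_BLOCKS) return Err::CORRUPT;
  return Err::OK;
}

// Error propagation: recovery logs the problem, marks page tracking as
// unavailable, and lets startup continue.
struct PageTracking {
  bool available = false;

  Err recover(const std::string &path) {
    std::vector<unsigned char> header;
    std::uint32_t count = 0;
    Err err = read_exact(path, HEADER_SIZE, header);
    if (err == Err::OK) err = parse_block_count(header, count);
    if (err != Err::OK) {
      std::fprintf(stderr, "page tracking unavailable: cannot use %s\n",
                   path.c_str());
      available = false;
      return err;
    }
    available = true;
    return Err::OK;
  }
};
```

The caller decides how loudly to report the failure; the review above asks for an ERROR-level message rather than a warning.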
0c72c65 to
f996b61
Compare
nogueiraanderson
added a commit
that referenced
this pull request
Jun 5, 2026
…#5995)

- author_association in the pull_request_target payload reports CONTRIBUTOR for private percona org members, so the prior gate skipped them (verified on PR #5988: the pull_request_target run was skipped)
- replace it with an authorize job that checks repo collaborator permission (write+) via GHA_RUNNER_PAT; members resolve to write, non-collaborators to read on this public repo (empirically confirmed)
- dispatch now needs authorize and runs only when ok == true

PS-11254
inikep
pushed a commit
that referenced
this pull request
Jun 17, 2026
…flow

Replaces the Cirrus `(arm64) gcc RelWithDebInfo [Noble]` task on 8.0, trunk, and 8.4 before Cirrus shuts down 2026-06-01. Centralised on the 8.0 default branch because GHA `schedule:` triggers only fire from the default branch. Three cron entries dispatch to the corresponding ref via the `pickbranch` job; the build step gates Boost cache, KEYRING_VAULT vs COMPONENT_KEYRING_VAULT, READLINE vs EDITLINE, and WITH_CURL on the target branch.

MTR suite: `binlog_nogtid` per Przemek 2026-05-18: "I would keep binlog_nogtid for RelWithDebInfo cron."

Manual `workflow_dispatch` exposes a branch selector for ad-hoc runs.

PS-11078: add path-filtered pull_request self-test trigger

The nightly workflow had no pull_request trigger, so opening this PR could not exercise the new file. The other PR's CI green came from build-arm64.yml firing incidentally (because PR target is 8.0), not from validating the nightly file's RelWithDebInfo + binlog_nogtid + pickbranch logic.

Add a path-filtered pull_request trigger so the workflow self-tests when the file itself changes. pickbranch handles pull_request by building against `github.base_ref` (the PR's target branch), exercising the branch-aware cmake conditional for that target.

This stays in the merged file as a permanent safety net: future edits to build-arm64-nightly.yml will self-validate before merge.

PS-11078: harden create-runner against cloud-init bootstrap flake

Three flakes today (5974 first, 5972 first, 5972 rerun) all failed at the same step: Hetzner VM provisions fine, but the runner agent never calls home within the 10-min `runner_wait` budget. Codex diagnosis 2026-05-22: cloud-init bootstrap (apt update/install, runner release download, config.sh) hits transient slowness; default budget too tight.

- runner_version: '2.334.0' (was implicit 'latest', fetched on every boot)
- runner_wait: '120' (= 120x10s = 20min, was default 10min)
- pre_runner_script: retry apt 3x with 15/30/45s backoff
- concurrency: pull_request gets PR-scoped group + cancel-in-progress (was global, never-cancel). Cron + workflow_dispatch stay protected.

Followup (separate PR): same hardening on build-arm64.yml; ship a debug escape hatch that preserves /var/log/cloud-init-output.log on create-runner failure.

PS-11078: revert runner_version pin + add debug-keep-vm diagnostic

Tonight's 4th create-runner failure (run 26314796152) was not transient apt slowness or fixed by pinning runner_version=2.334.0; Hetzner appears degraded (cax41 fully unavailable, 429 burst on the token, repeated runner-registration timeouts). The pin was a speculative variable that shouldn't stay until we have VM-side evidence.

Reverts:
- runner_version (was '2.334.0', back to action default 'latest')

Keeps (cheap, useful regardless):
- runner_wait: '120' (20min budget)
- pre_runner_script: apt retry 3x with backoff

Adds:
- workflow_dispatch input `debug_keep_vm` (boolean, default false)
- Step `Preserve VM for manual diagnosis (on failure, debug-only)` in create-runner that prints VM IP + SSH command + cleanup snippet to the GHA step summary when debug_keep_vm is true and create-runner failed
- delete-runner skip condition: when debug_keep_vm && create-runner failure, leave the VM alive for SSH diagnosis. orphan-sweep reaps after 6h.

Tomorrow morning workflow:
1. Trigger workflow_dispatch with debug_keep_vm=true
2. Wait for create-runner failure (or success)
3. If fail: SSH in with key 107239874, grab /var/log/cloud-init-output.log and /actions-runner/_diag/, then DELETE the VM via Hetzner API

PS-11078: unify per-PR + nightly arm64 into build.yml

- New build.yml: dispatch job picks BUILD_TYPE/MTR/CCACHE_MAXSIZE per event (pull_request -> Debug+main.1st, schedule '0 1 * * *' -> RelWithDebInfo+binlog_nogtid, workflow_dispatch -> input.build_type)
- Drop pickbranch + branch-aware cmake/apt; this file is 8.0-only per Przemek 2026-05-27
- Rename step "System and compiler info" -> "Compiler and cmake info"
- x86_64 nightly intentionally not wired; sibling job can reuse dispatch outputs later
- Delete build-arm64.yml + build-arm64-nightly.yml (folded in)

PS-11219: AWS Graviton EC2 fallback when Hetzner CAX exhausted

build.yml: pick-target retries Hetzner 9 sweeps (backoff 2/5/10/15/20/30/45/60 min, ~3h7m total) before falling back to c7g.4xlarge in eu-central-1 (spot across 1a/1b/1c, then on-demand same AZs). OIDC via aws-actions/configure-aws-credentials@v4 SHA-pinned; one-shot runner registration token via gh api; userdata rendered inline as heredoc (single source of truth, no PR-controlled checkout); --client-token for launch idempotency; tag-based delete-runner-aws discovery for cancelled-mid-launch cleanup; permissions: id-token: write scoped per-job.

orphan-sweep.yml: sweep-ec2 job (hourly, 6h threshold, OIDC, tag-keyed) covers any leaked instances the per-run delete misses.

Validated end-to-end via fork PoC nogueiraanderson/percona-server@PS-11179-ec2-fallback-poc iteration 11 (build-arm64 12m46s cold ccache on c7g.4xlarge in eu-central-1).

Requires: secret AWS_ROLE_ARN set to the role ARN provisioned by the companion Percona/percona-cd-platform PR.

Reviewer: @inikep.

PS-11078 (parent: Cirrus to GHA arm64 migration on percona-server).
PS-11179 (sibling: Jenkins-fleet ARM fallback).

PS-11254: Build arm64 on fork PRs from percona org members (#5994)

* ci(build): cut Hetzner capacity sweeps 9 to 4 for faster AWS fallback

- pick-target: MAX_SWEEPS 9->4, BACKOFF_MIN (2 5 10 15 20 30 45 60)->(2 5 10)
- AWS Graviton EC2 fallback now fires after ~17m instead of ~3h7m
- pick-target timeout-minutes 240->30 to match the shorter sweep budget
- refresh stale ~3h7m and 9-sweep comments and step-summary strings

PS-11254

* ci(build): gate fork-PR arm64 builds to percona org members

- add pull_request_target trigger so fork PRs resolve secrets in base-repo context (same-repo PRs stay on pull_request, no double-run)
- dispatch gate authorizes fork PRs only when the author is a percona org member (author_association MEMBER/OWNER/COLLABORATOR); no per-run approval
- build-arm64 checks out the fork head sha and holds no secrets, preserving the trust split (workflow + job permissions stay contents: read)
- scope concurrency group by event_name and make pull_request_target cancellable

PS-11254

ci(build): gate fork PRs by repo write access, not author_association (#5995)

- author_association in the pull_request_target payload reports CONTRIBUTOR for private percona org members, so the prior gate skipped them (verified on PR #5988: the pull_request_target run was skipped)
- replace it with an authorize job that checks repo collaborator permission (write+) via GHA_RUNNER_PAT; members resolve to write, non-collaborators to read on this public repo (empirically confirmed)
- dispatch now needs authorize and runs only when ok == true

PS-11254