PS-11202 [8.0]: Fix abrupt server termination when reading page tracking files fails #5988

Open
jakub-nowakowski-percona wants to merge 1 commit into percona:8.0 from jakub-nowakowski-percona:PS-11202-8.0-corrupted-page-tracking-info-crash

@jakub-nowakowski-percona

Contributor

https://perconadev.atlassian.net/browse/PS-11202

When ib_page_* files in #ib_archive contain corrupted data or when reading them fails due to an I/O error, the server terminates with a fatal error in release builds or hits an assertion in debug builds. Replace all fatal os_file_read calls in the page archiver with os_file_read_no_error_handling, convert unsafe assertions on file data to runtime checks, and propagate errors gracefully so the server logs a warning and continues as if page tracking were unavailable.
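
The approach described above can be sketched as follows. This is a minimal, self-contained illustration, not the patch itself: `Err`, `read_block`, `parse_header`, `recover`, and the block layout are hypothetical stand-ins (the real code uses InnoDB's `os_file_read_no_error_handling` and its own error type). It shows a failed read and an invalid on-disk value both turning into returned errors rather than a fatal error or an assertion.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical error codes standing in for InnoDB's error type.
enum class Err { OK, IO_ERROR, CORRUPTED };

// Hypothetical block size and type bound, for this sketch only.
constexpr std::size_t BLOCK_SIZE = 16;
constexpr std::uint8_t MAX_BLOCK_TYPE = 2;

// Non-fatal read: report failure to the caller instead of aborting.
Err read_block(const std::vector<std::uint8_t> &file, std::size_t offset,
               std::uint8_t *out) {
  if (offset > file.size() || file.size() - offset < BLOCK_SIZE) {
    return Err::IO_ERROR;  // short read, treated like an I/O failure
  }
  std::memcpy(out, file.data() + offset, BLOCK_SIZE);
  return Err::OK;
}

// Runtime check on data that came from disk, where an assertion
// would crash a debug build on corrupted input.
Err parse_header(const std::uint8_t *block, std::uint8_t *type) {
  if (block[0] > MAX_BLOCK_TYPE) {
    return Err::CORRUPTED;
  }
  *type = block[0];
  return Err::OK;
}

// The caller propagates the error; the server would log it and carry
// on with page tracking unavailable.
Err recover(const std::vector<std::uint8_t> &file) {
  std::uint8_t block[BLOCK_SIZE];
  Err err = read_block(file, 0, block);
  if (err != Err::OK) return err;
  std::uint8_t type = 0;
  return parse_header(block, &type);
}
```

In this sketch a truncated file yields `Err::IO_ERROR` and an out-of-range type byte yields `Err::CORRUPTED`; neither path terminates the process.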

@satya-bodapati satya-bodapati left a comment

Contributor

As discussed on Slack, please check that during recovery an ERROR is thrown (not a warning). It is a big enough deal for the user to know that page tracking is not initialized.

Please grep for this error after startup, and also add queries to show that all page tracking queries respond with -1.

arch_oper_mutex_exit();
ut_d(ut_error);
ut_o(return);
ut_o(return );

Contributor

modified by mistake?

Contributor Author

It's due to clang-format. I can't stop it from changing that.

Arch_File_Ctx Arch_Group::s_dblwr_file_ctx;

Arch_Group::~Arch_Group() {
ut_ad(!m_is_active);

Contributor

Can we instead set m_is_active to false on the error paths?

Contributor Author

Done. m_is_active is now set to false in Arch_Page_Sys::Recovery::recover() with group->disable(LSN_MAX);.
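
A minimal sketch of the invariant discussed in this thread, under stated assumptions: `Group` and `recover_group` are hypothetical stand-ins for `Arch_Group` and the recovery code, and `disable()` here only clears the flag (the real call takes an LSN argument). The point is that the error path deactivates the group before it is destroyed, so the destructor check still holds.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for Arch_Group.
class Group {
 public:
  // Mirrors the ut_ad(!m_is_active) check in the destructor.
  ~Group() { assert(!m_is_active); }

  void disable() { m_is_active = false; }
  bool is_active() const { return m_is_active; }

 private:
  bool m_is_active = true;
};

// Hypothetical recovery step: returns nullptr when the tracking files
// cannot be used, otherwise hands the still-active group to the caller.
std::unique_ptr<Group> recover_group(bool files_ok) {
  auto group = std::make_unique<Group>();
  if (!files_ok) {
    group->disable();  // error path: deactivate before destruction
    return nullptr;    // group is destroyed here without tripping the check
  }
  return group;
}
```

A caller that receives a non-null group is still responsible for calling `disable()` before the group goes out of scope.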

…ing files fails

https://perconadev.atlassian.net/browse/PS-11202

When ib_page_* files in #ib_archive contain corrupted data or when reading them
fails due to an I/O error, the server terminates with a fatal error in release
builds or hits an assertion in debug builds. Replace all fatal os_file_read
calls in the page archiver with os_file_read_no_error_handling, convert unsafe
assertions on file data to runtime checks, and propagate errors gracefully so
the server logs a warning and continues as if page tracking were unavailable.

@jakub-nowakowski-percona force-pushed the PS-11202-8.0-corrupted-page-tracking-info-crash branch from 0c72c65 to f996b61 on June 3, 2026 12:25
nogueiraanderson added a commit that referenced this pull request Jun 5, 2026
…#5995)

- author_association in the pull_request_target payload reports CONTRIBUTOR
  for private percona org members, so the prior gate skipped them (verified
  on PR #5988: the pull_request_target run was skipped)
- replace it with an authorize job that checks repo collaborator permission
  (write+) via GHA_RUNNER_PAT; members resolve to write, non-collaborators
  to read on this public repo (empirically confirmed)
- dispatch now needs authorize and runs only when ok == true

PS-11254
inikep pushed a commit that referenced this pull request Jun 17, 2026
…flow

Replaces the Cirrus `(arm64) gcc RelWithDebInfo [Noble]` task on 8.0,
trunk, and 8.4 before Cirrus shuts down 2026-06-01.

Centralised on the 8.0 default branch because GHA `schedule:` triggers
only fire from the default branch. Three cron entries dispatch to the
corresponding ref via the `pickbranch` job; the build step gates Boost
cache, KEYRING_VAULT vs COMPONENT_KEYRING_VAULT, READLINE vs EDITLINE,
and WITH_CURL on the target branch.

MTR suite: `binlog_nogtid` per Przemek 2026-05-18:
"I would keep binlog_nogtid for RelWithDebInfo cron."

Manual `workflow_dispatch` exposes a branch selector for ad-hoc runs.

PS-11078: add path-filtered pull_request self-test trigger

The nightly workflow had no pull_request trigger, so opening this PR
could not exercise the new file. The other PR's CI green came from
build-arm64.yml firing incidentally (because PR target is 8.0), not
from validating the nightly file's RelWithDebInfo + binlog_nogtid +
pickbranch logic.

Add a path-filtered pull_request trigger so the workflow self-tests
when the file itself changes. pickbranch handles pull_request by
building against `github.base_ref` (the PR's target branch), exercising
the branch-aware cmake conditional for that target.

This stays in the merged file as a permanent safety net: future edits
to build-arm64-nightly.yml will self-validate before merge.

PS-11078: harden create-runner against cloud-init bootstrap flake

Three flakes today (5974 first, 5972 first, 5972 rerun) all failed at
the same step: Hetzner VM provisions fine, but the runner agent never
calls home within the 10-min `runner_wait` budget. Codex diagnosis
2026-05-22: cloud-init bootstrap (apt update/install, runner release
download, config.sh) hits transient slowness; default budget too tight.

- runner_version: '2.334.0' (was implicit 'latest', fetched on every boot)
- runner_wait: '120' (= 120x10s = 20min, was default 10min)
- pre_runner_script: retry apt 3x with 15/30/45s backoff
- concurrency: pull_request gets PR-scoped group + cancel-in-progress
  (was global, never-cancel). Cron + workflow_dispatch stay protected.

Followup (separate PR): same hardening on build-arm64.yml; ship a
debug escape hatch that preserves /var/log/cloud-init-output.log on
create-runner failure.

PS-11078: revert runner_version pin + add debug-keep-vm diagnostic

Tonight's 4th create-runner failure (run 26314796152) was not transient
apt slowness or fixed by pinning runner_version=2.334.0; Hetzner appears
degraded (cax41 fully unavailable, 429 burst on the token, repeated
runner-registration timeouts). The pin was a speculative variable that
shouldn't stay until we have VM-side evidence.

Reverts:
- runner_version (was '2.334.0', back to action default 'latest')

Keeps (cheap, useful regardless):
- runner_wait: '120' (20min budget)
- pre_runner_script: apt retry 3x with backoff

Adds:
- workflow_dispatch input `debug_keep_vm` (boolean, default false)
- Step `Preserve VM for manual diagnosis (on failure, debug-only)` in
  create-runner that prints VM IP + SSH command + cleanup snippet to the
  GHA step summary when debug_keep_vm is true and create-runner failed
- delete-runner skip condition: when debug_keep_vm && create-runner failure,
  leave the VM alive for SSH diagnosis. orphan-sweep reaps after 6h.

Tomorrow morning workflow:
1. Trigger workflow_dispatch with debug_keep_vm=true
2. Wait for create-runner failure (or success)
3. If fail: SSH in with key 107239874, grab /var/log/cloud-init-output.log
   and /actions-runner/_diag/, then DELETE the VM via Hetzner API

PS-11078: unify per-PR + nightly arm64 into build.yml

- New build.yml: dispatch job picks BUILD_TYPE/MTR/CCACHE_MAXSIZE per event
  (pull_request -> Debug+main.1st, schedule '0 1 * * *' -> RelWithDebInfo+binlog_nogtid,
  workflow_dispatch -> input.build_type)
- Drop pickbranch + branch-aware cmake/apt; this file is 8.0-only per Przemek 2026-05-27
- Rename step "System and compiler info" -> "Compiler and cmake info"
- x86_64 nightly intentionally not wired; sibling job can reuse dispatch outputs later
- Delete build-arm64.yml + build-arm64-nightly.yml (folded in)

PS-11219: AWS Graviton EC2 fallback when Hetzner CAX exhausted

build.yml: pick-target retries Hetzner 9 sweeps (backoff
2/5/10/15/20/30/45/60 min, ~3h7m total) before falling back to
c7g.4xlarge in eu-central-1 (spot across 1a/1b/1c, then on-demand
same AZs). OIDC via aws-actions/configure-aws-credentials@v4 SHA-
pinned; one-shot runner registration token via gh api; userdata
rendered inline as heredoc (single source of truth, no PR-controlled
checkout); --client-token for launch idempotency; tag-based
delete-runner-aws discovery for cancelled-mid-launch cleanup;
permissions: id-token: write scoped per-job.

orphan-sweep.yml: sweep-ec2 job (hourly, 6h threshold, OIDC, tag-
keyed) covers any leaked instances the per-run delete misses.

Validated end-to-end via fork PoC
nogueiraanderson/percona-server@PS-11179-ec2-fallback-poc iteration
11 (build-arm64 12m46s cold ccache on c7g.4xlarge in eu-central-1).

Requires: secret AWS_ROLE_ARN set to the role ARN provisioned by
the companion Percona/percona-cd-platform PR. Reviewer: @inikep.

PS-11078 (parent: Cirrus to GHA arm64 migration on percona-server).
PS-11179 (sibling: Jenkins-fleet ARM fallback).

PS-11254: Build arm64 on fork PRs from percona org members (#5994)

* ci(build): cut Hetzner capacity sweeps 9 to 4 for faster AWS fallback

- pick-target: MAX_SWEEPS 9->4, BACKOFF_MIN (2 5 10 15 20 30 45 60)->(2 5 10)
- AWS Graviton EC2 fallback now fires after ~17m instead of ~3h7m
- pick-target timeout-minutes 240->30 to match the shorter sweep budget
- refresh stale ~3h7m and 9-sweep comments and step-summary strings
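
The ~17m and ~3h7m figures follow from summing the backoff entries. A small sketch of that arithmetic (the helper name `total_backoff_minutes` is hypothetical; the values are the ones listed in the bullets, and the sum ignores the time each sweep itself takes):

```cpp
#include <numeric>
#include <vector>

// Total wait in minutes across the backoffs between capacity sweeps.
int total_backoff_minutes(const std::vector<int> &backoff_min) {
  return std::accumulate(backoff_min.begin(), backoff_min.end(), 0);
}
```

With the old list (2 5 10 15 20 30 45 60) the total is 187 minutes, that is 3h7m; with the new list (2 5 10) it is 17 minutes.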

PS-11254

* ci(build): gate fork-PR arm64 builds to percona org members

- add pull_request_target trigger so fork PRs resolve secrets in base-repo
  context (same-repo PRs stay on pull_request, no double-run)
- dispatch gate authorizes fork PRs only when the author is a percona org
  member (author_association MEMBER/OWNER/COLLABORATOR); no per-run approval
- build-arm64 checks out the fork head sha and holds no secrets, preserving
  the trust split (workflow + job permissions stay contents: read)
- scope concurrency group by event_name and make pull_request_target cancellable

PS-11254

ci(build): gate fork PRs by repo write access, not author_association (#5995)

- author_association in the pull_request_target payload reports CONTRIBUTOR
  for private percona org members, so the prior gate skipped them (verified
  on PR #5988: the pull_request_target run was skipped)
- replace it with an authorize job that checks repo collaborator permission
  (write+) via GHA_RUNNER_PAT; members resolve to write, non-collaborators
  to read on this public repo (empirically confirmed)
- dispatch now needs authorize and runs only when ok == true

PS-11254