Skip to content

K8SPG-777 Fix the crash in ShouldCheckStandbyLag#1652

Open
Kajot-dev wants to merge 5 commits into
percona:mainfrom
Kajot-dev:fix/crversion-crash
Open

K8SPG-777 Fix the crash in ShouldCheckStandbyLag#1652
Kajot-dev wants to merge 5 commits into
percona:mainfrom
Kajot-dev:fix/crversion-crash

Conversation

@Kajot-dev

Copy link
Copy Markdown
Contributor

CHANGE DESCRIPTION

Problem:
In K8SPG-777 a problem was fixed where a cluster resource with empty crVersion would crash the operator, by the operator setting it to version.Version() early in the reconciliation loop.

However in K8SPG-374 #1407 an independent pollAndRequeueStandbys goroutine was introduced which lists clusters and calls ShouldCheckStandbyLag() -> CompareVersion("2.9.0"), but the coroutine does not call .Default() so the operator crashes.

Solution:
Instead of making the goroutine call cr.Default(), let's simply fallback to a version.Version() when crVersion is not set since we know that operator will set it to that value and have test for it

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PG version?
  • Does the change support oldest and newest supported Kubernetes version?

Signed-off-by: Jakub Jaruszewski <jjaruszewski@man.poznan.pl>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Prevents operator crashes when PerconaPGCluster.spec.crVersion is empty by making version comparisons fall back to the operator’s build version, which is especially relevant for callers like the standby-lag polling goroutine that may not run Default().

Changes:

  • Update PerconaPGCluster.Version() to use version.Version() when spec.crVersion is unset.

Comment on lines +508 to +512
crVersion := cr.Spec.CRVersion
if crVersion == "" {
crVersion = version.Version()
}
return gover.Must(gover.NewVersion(crVersion))

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this only falls back with the assumption that the opeator will reconcile the cluster and set the version asynchronously. Falling back to version.Version() on non empty can introduce hard to find bugs. But a way to do it, would actually be a validation pattern on the CRD

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree

Comment on lines +508 to +511
crVersion := cr.Spec.CRVersion
if crVersion == "" {
crVersion = version.Version()
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case would only work on e2e tests

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need to test ShouldCheckStandbyLag regression but it'd be great to add a simple unit test that covers PerconaPGCluster.Version() behavior

@egegunes

Copy link
Copy Markdown
Contributor

@hors @gkech please take a look at this

@egegunes egegunes added this to the v3.1.0 milestone Jun 24, 2026
Signed-off-by: jjaruszewski <jjaruszewski@man.poznan.pl>
Copilot AI review requested due to automatic review settings June 24, 2026 09:32
@Kajot-dev Kajot-dev force-pushed the fix/crversion-crash branch from 2bf8219 to ed73169 Compare June 24, 2026 09:32

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread pkg/apis/pgv2.percona.com/v2/perconapgcluster_types.go Outdated
Signed-off-by: jjaruszewski <jjaruszewski@man.poznan.pl>
Copilot AI review requested due to automatic review settings June 24, 2026 12:14

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated no new comments.

Signed-off-by: jjaruszewski <jjaruszewski@man.poznan.pl>
@Kajot-dev Kajot-dev force-pushed the fix/crversion-crash branch from 3a00214 to 4901a83 Compare June 24, 2026 14:22
@JNKPercona

Copy link
Copy Markdown
Collaborator
Test Name Result Time
backup-enable-disable passed 00:11:56
builtin-extensions passed 00:06:20
cert-manager-tls passed 00:09:46
custom-envs passed 00:18:56
custom-tls passed 00:06:28
database-init-sql passed 00:02:26
demand-backup passed 00:22:45
demand-backup-offline-snapshot passed 00:13:56
dynamic-configuration passed 00:03:28
finalizers passed 00:03:26
init-deploy passed 00:03:13
huge-pages passed 00:02:56
major-upgrade-14-to-15 passed 00:10:52
major-upgrade-15-to-16 passed 00:10:56
major-upgrade-16-to-17 passed 00:10:23
major-upgrade-17-to-18 passed 00:10:45
ldap passed 00:03:42
ldap-tls passed 00:05:42
monitoring passed 00:08:12
monitoring-pmm3 passed 00:09:08
one-pod passed 00:06:13
operator-self-healing failure 01:30:15
pitr passed 00:12:05
scaling passed 00:05:45
scheduled-backup passed 00:31:48
self-healing passed 00:09:18
sidecars passed 00:03:06
standby-pgbackrest passed 00:17:37
standby-streaming passed 00:13:30
start-from-backup passed 00:13:54
tablespaces passed 00:07:21
telemetry-transfer passed 00:04:40
upgrade-consistency passed 00:06:36
upgrade-minor passed 00:07:40
users passed 00:05:13
Summary Value
Tests Run 35/35
Job Duration 02:42:42
Total Test Time 06:50:34

commit: 4199c53
image: perconalab/percona-postgresql-operator:PR-1652-4199c53f5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants