K8SPG-1057: Allow using etcd as patroni DCS by yoav-katz · Pull Request #1647 · percona/percona-postgresql-operator

yoav-katz · 2026-06-18T21:54:59Z

CHANGE DESCRIPTION

Problem:
Patroni supports multiple DCS backends, but the operator hardcodes Kubernetes Endpoints as the only option. This blocks clusters on managed Kubernetes platforms where workloads cannot reach the control plane API.

Cause:
The kubernetes: stanza was hardcoded in the generated Patroni config with no mechanism to select a different backend.
Several other pieces of the operator also assumed k8s DCS: RBAC rules unconditionally granted Endpoints permissions, the primary service routed through Patroni-managed Endpoints objects, and pod role labels/annotations were expected to be set by Patroni itself (which only happens with k8s DCS).

Solution:
Add a spec.patroni.dcs field (type: kubernetes default, type: etcd alternative). The field is immutable after cluster creation, enforced by a CEL validation rule on the CRD.
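
As a rough illustration of the immutability rule, the sketch below shows how such a field could be declared with controller-gen markers. The struct, the constant `PatroniDCSTypeKubernetes`, and the helper `typeChangeAllowed` are hypothetical names chosen for this example (only `PatroniDCSTypeEtcd` appears in the PR diff), and the exact rule text used in the PR may differ.

```go
package main

import "fmt"

// PatroniDCSType names the DCS backend. PatroniDCSTypeEtcd appears in the PR
// diff; PatroniDCSTypeKubernetes is an assumed name for the default.
type PatroniDCSType string

const (
	PatroniDCSTypeKubernetes PatroniDCSType = "kubernetes"
	PatroniDCSTypeEtcd       PatroniDCSType = "etcd"
)

// PatroniDCS is a hypothetical shape for spec.patroni.dcs.
type PatroniDCS struct {
	// The XValidation marker asks controller-gen to emit a CEL transition
	// rule into the CRD; the API server then rejects updates that change the value.
	//
	// +kubebuilder:validation:Enum=kubernetes;etcd
	// +kubebuilder:default=kubernetes
	// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="dcs.type is immutable"
	Type PatroniDCSType `json:"type,omitempty"`
}

// typeChangeAllowed mirrors the CEL rule in plain Go, for illustration only:
// an update is accepted only when the type is unchanged.
func typeChangeAllowed(oldDCS, newDCS PatroniDCS) bool {
	return oldDCS.Type == newDCS.Type
}

func main() {
	created := PatroniDCS{Type: PatroniDCSTypeEtcd}
	updated := PatroniDCS{Type: PatroniDCSTypeKubernetes}
	fmt.Println("same type accepted:", typeChangeAllowed(created, created))
	fmt.Println("changed type accepted:", typeChangeAllowed(created, updated))
}
```

The real enforcement happens in the API server at admission time; the Go helper only restates the rule so it can be read and tested locally.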

When type: etcd, the operator:

Emits an etcd3: stanza in the generated Patroni config instead of kubernetes:, with optional TLS (cacert/cert/key) and auth credentials (PATRONI_ETCD3_USERNAME/PATRONI_ETCD3_PASSWORD) sourced from referenced Secrets.
Injects on_start and on_role_change Patroni callbacks pointing to a new patroni-role-change.sh script. Since Patroni does not set pod role labels or the status annotation when using etcd DCS, this script patches the pod via the k8s API on every role transition, restoring the label (role=primary|replica) and annotation ({"role":"primary"}) that the rest of the operator depends on for Service routing and primary detection.
Creates the primary Service with a label selector (role=primary) instead of the previous headless-Endpoints-to-Patroni-leader-ClusterIP indirection, which only works with k8s DCS.
Skips creating the Patroni leader lease Service and distributed configuration Service, which are k8s DCS artifacts.
Omits the Endpoints RBAC permissions from the postgres pod ServiceAccount, since they are not needed.
Validates that referenced TLS and auth Secrets exist and contain the required keys, surfacing issues as Warning events on the cluster.
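
The config-generation part of the list above can be sketched roughly as follows. This is not the operator's code: `etcdOptions`, `dcsStanza`, the file paths, and the endpoint are assumptions for illustration. The keys `etcd3.hosts`, `protocol`, `cacert`, `cert`, `key`, and `postgresql.callbacks.on_start` / `on_role_change` are standard Patroni settings. Credentials are left out on purpose, since the PR passes them through environment variables.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// etcdOptions is a hypothetical, trimmed-down view of spec.patroni.dcs.etcd.
type etcdOptions struct {
	Hosts  []string // etcd endpoints
	CACert string   // path of the mounted CA file; empty means no TLS
	Cert   string   // optional client certificate path
	Key    string   // optional client key path
}

// dcsStanza returns the DCS-related part of a Patroni config as a map.
// For any type other than "etcd" it returns a placeholder kubernetes stanza.
func dcsStanza(dcsType string, etcd etcdOptions, callback string) map[string]any {
	if dcsType != "etcd" {
		return map[string]any{"kubernetes": map[string]any{"use_endpoints": true}}
	}
	etcd3 := map[string]any{"hosts": etcd.Hosts}
	if etcd.CACert != "" {
		etcd3["protocol"] = "https"
		etcd3["cacert"] = etcd.CACert
	}
	if etcd.Cert != "" && etcd.Key != "" {
		etcd3["cert"] = etcd.Cert
		etcd3["key"] = etcd.Key
	}
	return map[string]any{
		"etcd3": etcd3,
		"postgresql": map[string]any{
			"callbacks": map[string]any{
				"on_start":       callback,
				"on_role_change": callback,
			},
		},
	}
}

func main() {
	cfg := dcsStanza("etcd",
		etcdOptions{Hosts: []string{"etcd-0.etcd:2379"}, CACert: "/etc/patroni/etcd/ca.crt"},
		"/opt/bin/patroni-role-change.sh")
	// JSON is a subset of YAML, so this output is also valid Patroni YAML.
	out, err := json.MarshalIndent(cfg, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```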

The Kubernetes DCS path is unchanged.

CHECKLIST

Jira

Is the Jira ticket created and referenced properly?
Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

Is an E2E test/test case added for the new feature/change?
Are unit tests added where appropriate?

Config/Logging/Testability

Are all needed new/changed options added to default YAML files?
Helm Chart Merge Request
Did we add proper logging messages for operator actions?
Did we ensure compatibility with the previous version or cluster upgrade process?
Does the change support oldest and newest supported PG version?
Does the change support oldest and newest supported Kubernetes version?

it-percona-cla · 2026-06-18T21:55:04Z

All committers have signed the CLA.

yoav-katz · 2026-06-18T22:27:32Z

I will wait for the Jira ticket before opening the PR in the Helm charts repo.

yoav-katz · 2026-06-18T22:42:44Z

And another note: this is my first OSS contribution, so be gentle 😄
If there is anything you think should be changed because of style or dependency considerations, I will be happy to fix it!

egegunes

@yoav-katz The implementation looks good to me in general, but we definitely need an e2e test that deploys etcd and configures PerconaPGCluster to use it.

DanBrima · 2026-06-19T07:46:30Z

would love to see this merged!

yoav-katz · 2026-06-20T16:10:05Z

QUESTIONS FOR REVIEWERS:

e2e coverage approach: The new suite covers the etcd DCS happy path but does not run the existing full test suite (switchover, pgbackrest backup/restore, scale-up, upgrade, etc.) against etcd DCS. Should we: (a) add the most critical existing tests (switchover, backup) to this suite, or (b) parameterize the main suite to run with both DCS backends?
Read-from-replica test: There is no step testing that replica pods are correctly labeled and that reads can be served through the replica service. Should that be added before merge?
DCS immutability UX: The CEL rule prevents changing dcs.type after cluster creation. If a user needs to migrate between DCS backends, the only path is delete and recreate. Is this the right trade-off, or should we document a migration procedure?
Routing: The current implementation routes primary/replica traffic via k8s Services with label selectors (role=primary, role=replica). The longer-term goal is to replace this with HAProxy, which would discover and health-check postgres pods directly via Patroni's REST API, removing the dependency on pod labels for routing entirely. Should HAProxy integration be implemented inside the operator, or should the operator, when using etcd as the DCS, simply expose a headless Service covering all postgres pods and leave HAProxy configuration to the user?
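
To make the label-selector routing in the last question concrete, here is a small sketch of how an equality-based Service selector picks pods. The function `selects` and the pod names are made up for this example; in a real cluster the matching is done by Kubernetes, not by the operator.

```go
package main

import "fmt"

// selects reports whether a Service selector matches a pod's labels:
// every key/value pair in the selector must be present on the pod.
// An empty selector is treated as matching nothing, because a Service
// without a selector does not pick pods on its own.
func selects(selector, podLabels map[string]string) bool {
	if len(selector) == 0 {
		return false
	}
	for key, want := range selector {
		got, ok := podLabels[key]
		if !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	primarySelector := map[string]string{"role": "primary"}
	pods := map[string]map[string]string{
		"cluster1-instance1-0": {"role": "primary"},
		"cluster1-instance1-1": {"role": "replica"},
		"cluster1-instance1-2": {}, // callback has not run yet, so no role label
	}
	for name, labels := range pods {
		fmt.Printf("%s selected=%v\n", name, selects(primarySelector, labels))
	}
}
```

A pod whose callback has not yet run carries no role label and is therefore not selected by either Service, which is why the role-change callback matters for routing.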

yoav-katz · 2026-06-20T19:37:36Z

Operator-managed etcd (future consideration)

The current design requires users to supply an external etcd cluster via spec.patroni.dcs.etcd.endpoints. This is a reasonable first step, but it places a significant operational burden on users who don't already have etcd infrastructure. An alternative would be a managed sub-field on the etcd spec, e.g.:

spec:
  patroni:
    dcs:
      type: etcd
      etcd:
        managed:           # operator deploys etcd itself
          replicas: 3      # 1 for dev, 3 for production HA
          storage: 1Gi
          storageClass: standard
        # endpoints: omitted when managed: is set

The operator would create and reconcile an etcd StatefulSet (with PVCs) co-located with the PostgreSQL cluster. This raises a few design questions:
(a) Should this be scoped to this PR or tracked as a follow-up?
(b) If implemented, should it be a thin wrapper (the operator just creates a StatefulSet from a known etcd image) or should it delegate to an existing etcd operator (e.g., via an EtcdCluster CR)?

egegunes · 2026-06-22T06:00:23Z

e2e coverage approach: The new suite covers the etcd DCS happy path but does not run the existing full test suite (switchover, pgbackrest backup/restore, scale-up, upgrade, etc.) against etcd DCS. Should we: (a) add the most critical existing tests (switchover, backup) to this suite, or (b) parameterize the main suite to run with both DCS backends?

Let's start with (a).

Read-from-replica test: There is no step testing that replica pods are correctly labeled and that reads can be served through the replica service. Should that be added before merge?

I don't think it's crucial but would be a good addition.

DCS immutability UX: The CEL rule prevents changing dcs.type after cluster creation. If a user needs to migrate between DCS backends, the only path is delete and recreate. Is this the right trade-off, or should we document a migration procedure?

To start with, I think it's better not to allow live migration. We can revisit this after receiving feedback.

Routing: The current implementation routes primary/replica traffic via k8s Services with label selectors (role=primary, role=replica). The longer-term goal is to replace this with HAProxy, which would discover and health-check postgres pods directly via Patroni's REST API, removing the dependency on pod labels for routing entirely. Should HAProxy integration be implemented inside the operator, or should the operator, when using etcd as the DCS, simply expose a headless Service covering all postgres pods and leave HAProxy configuration to the user?

Why is the longer-term goal to replace label-based routing with HAProxy? Also, the operator already creates a headless Service covering all postgres pods.

Operator-managed etcd: The operator would create and reconcile an etcd StatefulSet (with PVCs) co-located with the PostgreSQL cluster. This raises a few design questions:
(a) Should this be scoped to this PR or tracked as a follow-up?
(b) If implemented, should it be a thin wrapper (the operator just creates a StatefulSet from a known etcd image) or should it delegate to an existing etcd operator (e.g., via an EtcdCluster CR)?

I don't think we should ever have etcd managed by the operator. It should be the user who configures and manages the etcd infrastructure.

yoav-katz · 2026-06-24T11:57:58Z

Why is the longer-term goal to replace label-based routing with HAProxy? Also, the operator already creates a headless Service covering all postgres pods.

The label-based routing still tightly couples failover to the Kubernetes API - Patroni's callback needs to patch pod labels on every role change. That's the same dependency we're trying to reduce by moving to external etcd. HAProxy querying Patroni's REST health endpoints directly removes that dependency entirely - failover is self-contained within Patroni and etcd, no K8s API writes in the hot path.

egegunes · 2026-06-25T06:08:18Z

 ) error {
 	// With etcd DCS, Patroni stores distributed configuration in etcd, not k8s Endpoints.
-	if dcs := cluster.Spec.Patroni.GetDCS(); dcs != nil && dcs.Type == v1beta1.PatroniDCSTypeEtcd {
+	if dcs := cluster.GetDCS(); dcs != nil && dcs.Type == v1beta1.PatroniDCSTypeEtcd {


Since we are repeating this code in many places, would it make sense to have something like cluster.IsDCSEtcd() and move these conditions into it?

yoav-katz · Jun 25, 2026

Instead of IsDCSEtcd() I went with a more general DCSType() method that normalizes a nil DCS to the default (kubernetes), so adding a new DCS type in the future doesn't require a function per type.
Do you think we should also add IsDCSEtcd() as a convenience on top of it?

egegunes · Jun 26, 2026

I think the current version is good; I mostly wanted to clean up all those nil checks.
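
A minimal sketch of the accessor discussed in this thread, assuming a simplified type layout: the struct names and `PatroniDCSTypeKubernetes` are placeholders, and the real method in the PR may differ in receiver and nil handling.

```go
package main

import "fmt"

type PatroniDCSType string

const (
	PatroniDCSTypeKubernetes PatroniDCSType = "kubernetes" // assumed name for the default
	PatroniDCSTypeEtcd       PatroniDCSType = "etcd"
)

// Simplified, hypothetical stand-ins for the operator's API types.
type PatroniDCS struct{ Type PatroniDCSType }
type PatroniSpec struct{ DCS *PatroniDCS }
type ClusterSpec struct{ Patroni *PatroniSpec }
type Cluster struct{ Spec ClusterSpec }

// DCSType returns the configured DCS type and normalizes every "unset"
// case (nil cluster, nil patroni, nil dcs, empty type) to kubernetes,
// so callers never need their own nil checks.
func (c *Cluster) DCSType() PatroniDCSType {
	if c == nil || c.Spec.Patroni == nil || c.Spec.Patroni.DCS == nil || c.Spec.Patroni.DCS.Type == "" {
		return PatroniDCSTypeKubernetes
	}
	return c.Spec.Patroni.DCS.Type
}

func main() {
	unset := &Cluster{}
	etcd := &Cluster{Spec: ClusterSpec{Patroni: &PatroniSpec{DCS: &PatroniDCS{Type: PatroniDCSTypeEtcd}}}}
	fmt.Println(unset.DCSType(), etcd.DCSType())

	// Call sites reduce to a plain comparison.
	if etcd.DCSType() == PatroniDCSTypeEtcd {
		fmt.Println("skip k8s-DCS-specific artifacts")
	}
}
```

Returning the type rather than a boolean keeps call sites to a single comparison while leaving room for further backends.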

Copilot

Pull request overview

This PR adds support for using external etcd as Patroni’s DCS backend (in addition to the existing Kubernetes Endpoints DCS), enabling deployments where workloads cannot reach the Kubernetes control plane API. It introduces a new spec.patroni.dcs API, updates Patroni config generation, RBAC/service behavior, and adds validations plus unit/E2E coverage for the etcd path.

Changes:

Add spec.patroni.dcs (default kubernetes, optional etcd) with CEL immutability validation and generated deepcopy/CRD updates.
When etcd DCS is selected: generate etcd3: Patroni config, add role-change callbacks, adjust primary Service selection logic, and validate referenced TLS/auth Secrets (with Secret watch/indexing).
Add unit tests and a new KUTTL E2E scenario (etcd-dcs) to validate config, behavior, and immutability.

Reviewed changes

Copilot reviewed 41 out of 42 changed files in this pull request and generated 2 comments.

File	Description
pkg/apis/upstream.pgv2.percona.com/v1beta1/zz_generated.deepcopy.go	Generated deepcopy updates for new Patroni DCS types.
pkg/apis/upstream.pgv2.percona.com/v1beta1/patroni_types.go	Adds `dcs` API types, helpers, and CEL validations.
pkg/apis/pgv2.percona.com/v2/perconapgcluster_types.go	Adds etcd DCS validation and Secret indexer for reconciliation triggers.
pkg/apis/pgv2.percona.com/v2/perconapgcluster_types_test.go	Unit tests for Percona CR validation of etcd DCS endpoints.
percona/controller/pgcluster/patroni_etcd.go	Reconcile-time Secret presence/key validation + Warning events for etcd DCS Secrets.
percona/controller/pgcluster/patroni_etcd_test.go	Unit tests for etcd DCS Secret validation reconciliation behavior.
percona/controller/pgcluster/controller.go	Watches Secrets via multiple field indexes (envFrom + patroni etcd secrets) with dedupe.
internal/patroni/reconcile.go	Mount etcd TLS Secret and inject Patroni etcd auth env vars into instance Pods.
internal/patroni/rbac.go	Conditionalize Endpoints/Service create RBAC rules based on DCS type.
internal/patroni/config.go	Emit `etcd3:` config + callbacks for etcd DCS; keep Kubernetes DCS behavior unchanged.
internal/patroni/config_test.go	Adds unit coverage for etcd DCS Patroni YAML generation and env behavior.
internal/controller/postgrescluster/patroni.go	Skip k8s-DCS-specific artifacts/status reads when using etcd DCS.
internal/controller/postgrescluster/cluster.go	Use selector-based primary Service for etcd DCS; avoid applying Endpoints when not used.
e2e-tests/tests/etcd-dcs/00-deploy-operator.yaml	E2E setup step: deploy operator and client.
e2e-tests/tests/etcd-dcs/00-assert.yaml	E2E assertions for operator/CRD readiness.
e2e-tests/tests/etcd-dcs/01-etcd-setup.yaml	E2E step: deploy a single-node etcd StatefulSet for testing.
e2e-tests/tests/etcd-dcs/01-assert.yaml	E2E assertion: etcd is ready.
e2e-tests/tests/etcd-dcs/02-create-cluster.yaml	E2E step: create cluster configured for etcd DCS.
e2e-tests/tests/etcd-dcs/02-assert.yaml	E2E assertions: cluster reaches ready state with etcd DCS.
e2e-tests/tests/etcd-dcs/03-write-data.yaml	E2E: write data to primary via client.
e2e-tests/tests/etcd-dcs/04-read-from-primary.yaml	E2E: read data back from primary.
e2e-tests/tests/etcd-dcs/04-assert.yaml	E2E assertion: expected read result.
e2e-tests/tests/etcd-dcs/05-assert.yaml	E2E assertions for created resources.
e2e-tests/tests/etcd-dcs/06-check-patroni-config.yaml	E2E: verify Patroni config contains `etcd3` + callbacks and omits `kubernetes:`.
e2e-tests/tests/etcd-dcs/07-check-patronictl.yaml	E2E: verify `patronictl list` shows running/leader.
e2e-tests/tests/etcd-dcs/08-check-etcd-keys.yaml	E2E: verify Patroni keys appear in etcd.
e2e-tests/tests/etcd-dcs/09-check-no-warning-events.yaml	E2E: ensure no unexpected Warning events for etcd secrets.
e2e-tests/tests/etcd-dcs/10-check-dcs-immutability.yaml	E2E: ensure DCS type immutability is enforced by admission/CEL.
e2e-tests/tests/etcd-dcs/11-check-pod-labels.yaml	E2E: verify role labels are present (callback executed).
e2e-tests/tests/etcd-dcs/99-remove-cluster-gracefully.yaml	E2E teardown: remove resources and validate operator stability.
e2e-tests/run-release.csv	Adds `etcd-dcs` to release E2E run list.
e2e-tests/run-pr.csv	Adds `etcd-dcs` to PR E2E run list.
deploy/cw-bundle.yaml	Bundle CRD updates to include new DCS schema/validations.
deploy/crd.yaml	Generated CRD updates for new DCS schema/validations.
deploy/bundle.yaml	Bundle CRD updates for new DCS schema/validations.
config/crd/bases/upstream.pgv2.percona.com_postgresclusters.yaml	Base CRD schema updates for DCS fields/validations.
config/crd/bases/pgv2.percona.com_perconapgclusters.yaml	Base CRD schema updates for DCS fields/validations.
cmd/postgres-operator/main.go	Adds field index registration for etcd DCS referenced Secrets.
build/postgres-operator/patroni-role-change.sh	New Patroni callback script to patch pod role label + status annotation.
build/postgres-operator/init-entrypoint.sh	Installs the new Patroni role-change script into runtime bindir.
build/postgres-operator/Dockerfile	Ships the new Patroni role-change script in the image.
build/crd/percona/generated/pgv2.percona.com_perconapgclusters.yaml	Generated Percona CRD output updated for new DCS schema/validations.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 41 out of 42 changed files in this pull request and generated 2 comments.

JNKPercona · 2026-06-26T06:45:52Z

Test Name	Result	Time
backup-enable-disable	passed	00:00:00
builtin-extensions	passed	00:00:00
cert-manager-tls	passed	00:00:00
custom-envs	passed	00:00:00
custom-tls	passed	00:00:00
database-init-sql	passed	00:00:00
demand-backup	passed	00:35:24
demand-backup-offline-snapshot	passed	00:16:08
dynamic-configuration	passed	00:00:00
finalizers	passed	00:00:00
init-deploy	passed	00:00:00
huge-pages	passed	00:00:00
major-upgrade-14-to-15	passed	00:00:00
major-upgrade-15-to-16	passed	00:00:00
major-upgrade-16-to-17	passed	00:00:00
major-upgrade-17-to-18	passed	00:00:00
ldap	passed	00:00:00
ldap-tls	passed	00:00:00
monitoring	passed	00:00:00
monitoring-pmm3	passed	00:00:00
one-pod	passed	00:00:00
operator-self-healing	passed	00:00:00
pitr	passed	00:00:00
scaling	passed	00:00:00
scheduled-backup	passed	00:00:00
self-healing	passed	00:00:00
sidecars	passed	00:00:00
standby-pgbackrest	passed	00:00:00
standby-streaming	passed	00:13:55
start-from-backup	passed	00:00:00
tablespaces	passed	00:00:00
telemetry-transfer	passed	00:00:00
upgrade-consistency	passed	00:00:00
upgrade-minor	passed	00:00:00
users	passed	00:00:00
etcd-dcs	passed	00:00:00

Summary	Value
Tests Run	36/36
Job Duration	00:52:15
Total Test Time	01:05:28

commit: 6cb833e
image: perconalab/percona-postgresql-operator:PR-1647-6cb833e6c

yoav-katz added 2 commits June 19, 2026 00:21

feat(etcd)

d3ec2fd

fix

a07a883

yoav-katz requested review from egegunes, gkech, hors, mayankshah1607, nmarukovich, oksana-grishchenko and pooknull as code owners June 18, 2026 21:55

yoav-katz and others added 3 commits June 19, 2026 00:56

Merge branch 'main' into etcd-dcs

3d26af1

Self Code Review

e5ecd24

Merge branch 'etcd-dcs' of github.com:yoav-katz/percona-postgresql-op…

fa46761

…erator into etcd-dcs

yoav-katz marked this pull request as draft June 18, 2026 22:26

egegunes changed the title ~~feat(etcd)~~ K8SPG-1057: Allow using etcd as patroni DCS Jun 19, 2026

egegunes added the community label Jun 19, 2026

egegunes requested changes Jun 19, 2026

yoav-katz and others added 10 commits June 19, 2026 18:06

feat(e2e_tests)

241cf08

Update run-pr.csv

d34a071

Update run-release.csv

f12b817

fix(crd)

3f1de89

Merge branch 'etcd-dcs' of github.com:yoav-katz/percona-postgresql-op…

36094e3

…erator into etcd-dcs

fix(bug)

2b0ac02

regenerate-crds

48082f5

feat(e2e_tests): added check for immutable dcs after creation

70d51b5

fix(tag)

05c0042

feat(patroni-role-change)

8f26d3c

yoav-katz added 3 commits June 20, 2026 17:36

fix(patroni callbacks)

4168133

fix(patroni-role-change callback): added status too

2466216

fix(service): using lables with role=primary instead of Endpoint

4f19784

yoav-katz marked this pull request as ready for review June 20, 2026 16:10

yoav-katz requested review from DhruthiKV, eleo007, jvpasinatto and valmiranogueira as code owners June 20, 2026 16:10

fix(cert-manager): fixed cert-manager tests

89e82d0

fix(tests)

692e3e3

yoav-katz requested a review from egegunes June 21, 2026 22:06

egegunes reviewed Jun 22, 2026

Comment thread e2e-tests/tests/etcd-dcs/05-check-password-leak.yaml Outdated

Comment thread internal/controller/postgrescluster/patroni.go Outdated

fix(MR comments)

67363a8

egegunes reviewed Jun 25, 2026

fix(pgcluster): added DCSType(): returns the type of the DCS

92d684a

Copilot AI review requested due to automatic review settings June 25, 2026 15:29

Merge branch 'main' into etcd-dcs

4bbcf67

Copilot started reviewing on behalf of yoav-katz June 25, 2026 15:29 View session

Copilot AI reviewed Jun 25, 2026

Comment thread pkg/apis/upstream.pgv2.percona.com/v1beta1/patroni_types.go

Comment thread percona/controller/pgcluster/patroni_etcd_test.go

Potential fix for pull request finding

6cb833e

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings June 25, 2026 16:15

Copilot started reviewing on behalf of yoav-katz June 25, 2026 16:16 View session

Copilot AI reviewed Jun 25, 2026

Comment thread internal/patroni/reconcile.go

Comment thread internal/controller/postgrescluster/patroni.go

egegunes added this to the v3.1.0 milestone Jun 26, 2026

egegunes approved these changes Jun 26, 2026
