
Reconcile loop: poolsync ↔ dynamicprefix writes 7+/s to apiserver indefinitely (post-Apr-12 main) #16


@jr42

Summary

A homelab Kubernetes cluster running main-amd64 (current main HEAD 4d98156, with the Apr 6-12 PR series merged: dual-stack-preservation, service-suffix-annotation, go-1-26 bump) shows two compounding problems:

  1. OOMKill crash loop on the chart's default 128Mi memory limit — pod restarts every ~5 minutes, accumulated 387 restarts on a single pod over 4 days. Bumping resources.limits.memory to 512Mi resolves the OOM (steady-state usage is ~80 MiB).
  2. Steady-state reconcile loop — even with the OOM fixed, the operator writes:
    • dynamicprefixes.dynamic-prefix.io PUT at 3.6/s
    • ciliumloadbalancerippools.cilium.io PUT at 2.3/s
    • ciliumloadbalancerippools.cilium.io PATCH at 1.2/s
    • Total ~7.1 writes/sec sustained.

This sustained churn drove a kube-apiserver heap allocation of ~1.5 GB in DeepCopyJSONValue/schemaCoercingConverter per apiserver, causing a +3 GB cluster-wide memory step that took 4 days to root-cause.

The reconcile loop appears to be self-perpetuating. The last known-good release is v0.0.2 (2026-03-14), which predates the merge of #9, #10, and #13. Pinning to v0.0.2 eliminates the write storm.

Reproduction

  • Operator running with: image.tag: main-amd64, pullPolicy: Always
  • Workload: ~100 cluster-wide Services (mix of LoadBalancer + ClusterIP), 3 DynamicPrefix CRs, 3 CiliumLoadBalancerIPPools, 2 referenced by each DynamicPrefix
  • Cluster: 3-node Talos / Kubernetes 1.34, Cilium 1.19.x

Evidence

Reconcile loop pattern (operator startup logs)

INFO  Pool synced successfully  pool=dmz-ipv6-network blockCount=3
INFO  DynamicPrefix changed, enqueuing referencing pools  dynamicPrefix=dmz-ipv6 poolCount=2
INFO  Syncing pool  pool=dmz-ipv6-dynamic
INFO  Pool synced successfully  pool=dmz-ipv6-dynamic blockCount=3
INFO  Syncing pool  pool=dmz-ipv6-subnet
INFO  Pool synced successfully  pool=dmz-ipv6-subnet blockCount=3
INFO  DynamicPrefix changed, enqueuing referencing pools  dynamicPrefix=dmz-ipv6-subnet poolCount=1
INFO  cilium.io/v2alpha1 CiliumBGPAdvertisement is deprecated; use cilium.io/v2 CiliumBGPAdvertisement
INFO  Reconciling BGP advertisements  dynamicPrefix=dmz-ipv6

Once steady state is reached, the sequence "Pool synced successfully" → "DynamicPrefix changed, enqueuing referencing pools" → "Syncing pool" keeps repeating with no observable workload change.

Apiserver write rate (apiserver_request_total)

3.600/s  PUT    dynamicprefixes.dynamic-prefix.io
2.277/s  PUT    ciliumloadbalancerippools.cilium.io
1.200/s  PATCH  ciliumloadbalancerippools.cilium.io

These are sustained for hours with no underlying change to Services, the operator's own pod, or the upstream RA stream.
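
For readers reproducing these numbers: each rate is the per-second increase of one apiserver_request_total series (grouped by verb and resource) over a window. The sketch below shows only the arithmetic. The type and function names, the counter values, and the 1000-second window are hypothetical, chosen so the output matches the rates listed above.

```go
package main

import "fmt"

// counterSample is a hypothetical snapshot of one apiserver_request_total
// series at the start and end of an observation window.
type counterSample struct {
	verb, resource string
	start, end     float64 // counter values; illustrative, not measured
}

// ratePerSecond returns the average counter increase per second.
func ratePerSecond(s counterSample, windowSeconds float64) float64 {
	return (s.end - s.start) / windowSeconds
}

// totalRate sums the per-series rates over the same window.
func totalRate(samples []counterSample, windowSeconds float64) float64 {
	total := 0.0
	for _, s := range samples {
		total += ratePerSecond(s, windowSeconds)
	}
	return total
}

func main() {
	const window = 1000.0 // seconds; hypothetical window length
	samples := []counterSample{
		{"PUT", "dynamicprefixes.dynamic-prefix.io", 0, 3600},
		{"PUT", "ciliumloadbalancerippools.cilium.io", 0, 2277},
		{"PATCH", "ciliumloadbalancerippools.cilium.io", 0, 1200},
	}
	for _, s := range samples {
		fmt.Printf("%.3f/s  %-5s  %s\n", ratePerSecond(s, window), s.verb, s.resource)
	}
	total := totalRate(samples, window)
	fmt.Printf("total %.3f/s, about %.0f writes per day\n", total, total*86400)
}
```

At roughly 7.1 writes/s this works out to on the order of 600,000 writes per day, each of which fans out to every watcher of the two CRDs.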

Apiserver heap profile

go tool pprof on a freshly-restarted apiserver after 25 minutes:

flat%   func
54.51%  k8s.io/apimachinery/pkg/runtime.DeepCopyJSONValue   ← 1.21 GB
 9.37%  DeepCopyJSONValue (different line)
 5.32%  sigs.k8s.io/json/.../unquote

Cumulative call chain:

etcd3.watchChan.serialProcessEvents (1.68 GB)
 → schemaCoercingConverter.ConvertToVersion (1.50 GB)
 → unstructured.DeepCopyObject → DeepCopyJSON → DeepCopyJSONValue (1.48 GB)

The dynamicprefixes and ciliumloadbalancerippools CRDs feed into this allocation path on each WATCH event.

Hypothesised root cause (please verify in source)

The chain looks like:

  1. poolsync controller writes to a CiliumLoadBalancerIPPool (status, blockCount, or annotation).
  2. The pool watch fires.
  3. The dynamicprefix controller (or poolsync's reconciler) sees the pool change and re-enqueues the referencing DynamicPrefix.
  4. DynamicPrefix reconciliation runs and enqueues the referencing pools (the "enqueuing referencing pools" log line), leading back to (1).
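
To make the hypothesised chain concrete, here is a toy model of it in plain Go. It is not the operator's code: the event type, the queue, and the one-pool-per-prefix simplification are all assumptions made for illustration. It only shows why a pair of controllers that each write unconditionally on the other's event never drains its work queue.

```go
package main

import "fmt"

// event names which kind of object changed. Both kinds are simplified
// stand-ins for the real watch events.
type event int

const (
	poolChanged event = iota
	prefixChanged
)

// simulate processes at most maxSteps events, starting from a single pool
// change. Every reconcile writes unconditionally, and every write produces
// a watch event for the other controller. It returns the number of writes
// issued and whether the queue drained within the step budget.
func simulate(maxSteps int) (writes int, drained bool) {
	queue := []event{poolChanged}
	for step := 0; step < maxSteps; step++ {
		if len(queue) == 0 {
			return writes, true
		}
		ev := queue[0]
		queue = queue[1:]
		switch ev {
		case poolChanged:
			// Steps 2-3: the pool watch fires and the referencing
			// DynamicPrefix is reconciled, which writes the DynamicPrefix.
			writes++
			queue = append(queue, prefixChanged)
		case prefixChanged:
			// Step 4 back to 1: the DynamicPrefix change enqueues its
			// pool, and the pool reconcile writes the pool again.
			writes++
			queue = append(queue, poolChanged)
		}
	}
	return writes, len(queue) == 0
}

func main() {
	for _, n := range []int{10, 100, 1000} {
		writes, drained := simulate(n)
		fmt.Printf("steps=%d writes=%d drained=%v\n", n, writes, drained)
	}
}
```

In this model the write count grows linearly with the step budget and the queue never empties, which matches the observed behaviour of sustained writes with no external change.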

Common fixes for this pattern:

  • equality.Semantic.DeepEqual of the desired vs actual CiliumLoadBalancerIPPool.spec (or .status if writing status) before issuing the Patch / Update. Don't write if no field actually changed.
  • controllerutil.CreateOrPatch with an idempotent mutator function so identical desired state → empty Patch → no apiserver write.
  • Filter the watch on CiliumLoadBalancerIPPool to ignore events where metadata.managedFields shows the change came from this controller's manager name (avoid self-trigger).
  • Add a predicate.GenerationChangedPredicate on the CiliumLoadBalancerIPPool source so that only updates that change metadata.generation (i.e. spec changes) drive reconciles, not status-only or metadata-only ones (assuming the pool CRD enables the status subresource).
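
A minimal sketch of the first and last ideas in the list above, written against the standard library so it runs without a cluster. Everything named here (poolSpec, fakeClient, syncPool, generationChanged) is hypothetical and not taken from the operator's source. reflect.DeepEqual stands in for equality.Semantic.DeepEqual, which additionally compares Kubernetes types such as resource.Quantity by value, and generationChanged only mimics the comparison that predicate.GenerationChangedPredicate performs on update events.

```go
package main

import (
	"fmt"
	"reflect"
)

// poolSpec is a hypothetical, simplified stand-in for
// CiliumLoadBalancerIPPool.spec.
type poolSpec struct {
	Blocks []string
}

// fakeClient stands in for the Kubernetes client and counts writes.
type fakeClient struct {
	stored poolSpec
	writes int
}

func (c *fakeClient) update(spec poolSpec) {
	c.stored = spec
	c.writes++
}

// syncPool writes the desired spec only when it differs from what is
// stored, and reports whether a write was issued.
func syncPool(c *fakeClient, desired poolSpec) bool {
	if reflect.DeepEqual(c.stored, desired) {
		return false // nothing changed: no write, so no new watch event
	}
	c.update(desired)
	return true
}

// generationChanged reports whether an update event should trigger a
// reconcile: only when metadata.generation differs between old and new.
func generationChanged(oldGen, newGen int64) bool {
	return oldGen != newGen
}

func main() {
	c := &fakeClient{}
	desired := poolSpec{Blocks: []string{
		"2001:db8:1::/64", "2001:db8:2::/64", "2001:db8:3::/64", // documentation prefixes
	}}
	for i := 0; i < 5; i++ {
		syncPool(c, desired)
	}
	fmt.Println("reconciles: 5, writes:", c.writes)
	fmt.Println("status-only update triggers reconcile:", generationChanged(7, 7))
	fmt.Println("spec update triggers reconcile:", generationChanged(7, 8))
}
```

With the guard in place, repeated reconciles of an unchanged desired state issue a single write, so the feedback path has nothing to propagate. Either measure alone may be enough to break the loop; applying both is the more defensive choice.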

Mitigations applied downstream

  1. Bumped the chart's default resources.limits.memory from 128Mi to 512Mi in our values override (the OOM fix).
  2. Pinned image.tag to v0.0.2 and pullPolicy: IfNotPresent — the last release before the post-merge build.

For others using the chart at default resource limits with a busy cluster: the 128Mi default may be insufficient for the post-merge build. Consider raising the chart default to 256Mi or 512Mi.

Suggested next steps

  • Add equality.Semantic.DeepEqual guards in poolsync reconciler before writing pool spec/status
  • Same in dynamicprefix reconciler before writing back to DynamicPrefix
  • Verify with the same workload (3 DynamicPrefix, 3 IPPools, ~100 Services): write rate should drop to <0.1/s for both CRDs
  • Bump chart's default resources.limits.memory to 256Mi or 512Mi
  • Consider releasing v0.0.5 with the loop fix once verified

Happy to help test fixes — we can roll a candidate image into our cluster as a regression check.
