Skip to content

fix(webhook): fix cache error on k8s client pagination.#1059

Open
Zenithar wants to merge 2 commits intomainfrom
zenithar/chaos-controller/fix_safetyNetCountTooLarge_error
Open

fix(webhook): fix cache error on k8s client pagination.#1059
Zenithar wants to merge 2 commits intomainfrom
zenithar/chaos-controller/fix_safetyNetCountTooLarge_error

Conversation

@Zenithar
Copy link
Copy Markdown
Contributor

@Zenithar Zenithar commented Apr 9, 2026

What does this PR do?

  • Adds new functionality
  • Alters existing functionality
  • Fixes a bug
  • Improves documentation or testing

Summary

Fixes the webhook admission error:

"chaos-controller.chaos-engineering.svc" denied the request: error checking for
countNotTooLarge safetynet: error listing target pods: continue list option is
not supported by the cache

Root Cause

safetyNetCountNotTooLarge uses paginated listing (Limit/Continue) with the
controller-runtime cached client (Manager.GetClient()). The cache has two
known issues (kubernetes-sigs/controller-runtime#3044):

  1. Continue is not supported — the cache rejects it with a hard error.
  2. When Limit is set, the cache silently returns an incomplete list
    truncated at the limit, without signaling that the list is incomplete.

This means neither paginated nor unpaginated listing works correctly with the
cached client for counting resources.

Fix

Use mgr.GetAPIReader() — a direct API server client that supports proper
pagination — for the List calls in safetyNetCountNotTooLarge.

Changes:

  • utils/utils.go: Add APIReader client.Reader field to
    SetupWebhookWithManagerConfig.
  • main.go: Pass mgr.GetAPIReader() in the webhook config
    (GetAPIReader() is already used for watchers).
  • api/v1beta1/disruption_webhook.go:
    • Add apiReader package-level variable, initialized from config.
    • Restore paginated List calls (Limit: 1000 + Continue loops)
      using apiReader instead of the cached k8sClient.
    • Fix incorrect error message "error listing target pods" → "error listing
      nodes" in the node-level branch.
    • Pass ctx from the admission handler instead of context.Background().
    • Add nil guards to initialSafetyNets and safetyNetMissingNetworkFilters.
    • Change safetyNetMissingNetworkFilters from value to pointer receiver.
  • api/v1beta1/disruption_webhook_test.go: Set apiReader in all test
    setups that exercise safety nets. Add 6 new unit tests for
    safetyNetCountNotTooLarge.

Code Quality Checklist

  • The documentation is up to date.
  • My code is sufficiently commented and passes continuous integration checks.
  • I have signed my commit (see Contributing Docs).

Testing

  • I leveraged continuous integration testing
    • by depending on existing unit tests or end-to-end tests.
    • by adding new unit tests or end-to-end tests.
  • I manually tested the following steps:
    • Applied a Disruption resource that previously triggered the error.
    • locally.
    • as a canary deployment to a cluster.

New tests added

6 unit tests for safetyNetCountNotTooLarge covering:

  • Small fraction of pods (no trigger)
  • >80% of namespace pods (triggers)
  • >66% of cluster pods (triggers)
  • disableCountTooLarge bypass
  • Node-level disruption listing
  • No matching targets (early return)

@Zenithar Zenithar self-assigned this Apr 9, 2026
@Zenithar Zenithar marked this pull request as ready for review April 9, 2026 15:41
@Zenithar Zenithar requested a review from a team as a code owner April 9, 2026 15:41
@datadog-official
Copy link
Copy Markdown

datadog-official bot commented Apr 9, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 17.39%
Overall Coverage: 38.65% (+0.15%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: b29490c | Docs | Datadog PR Page | Was this helpful? React with 👍/👎 or give us feedback!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant