fix(webhook): fix cache error on k8s client pagination. by Zenithar · Pull Request #1059 · DataDog/chaos-controller

Zenithar · 2026-04-09T15:35:18Z

What does this PR do?

Adds new functionality
Alters existing functionality
Fixes a bug
Improves documentation or testing

Summary

Fixes the webhook admission error:

"chaos-controller.chaos-engineering.svc" denied the request: error checking for
countNotTooLarge safetynet: error listing target pods: continue list option is
not supported by the cache

Root Cause

safetyNetCountNotTooLarge uses paginated listing (Limit/Continue) with the
controller-runtime cached client (Manager.GetClient()). The cache has two
known issues (kubernetes-sigs/controller-runtime#3044):

Continue is not supported — the cache rejects it with a hard error.
When Limit is set, the cache silently returns an incomplete list
truncated at the limit, without signaling that the list is incomplete.

This means neither paginated nor unpaginated listing works correctly with the
cached client for counting resources.

Fix

Use mgr.GetAPIReader() — a direct API server client that supports proper
pagination — for the List calls in safetyNetCountNotTooLarge.

Changes:

utils/utils.go: Add APIReader client.Reader field to
SetupWebhookWithManagerConfig.
main.go: Pass mgr.GetAPIReader() in the webhook config
(GetAPIReader() is already used for watchers).
api/v1beta1/disruption_webhook.go:
- Add apiReader package-level variable, initialized from config.
- Restore paginated List calls (Limit: 1000 + Continue loops)
  using apiReader instead of the cached k8sClient.
- Fix incorrect error message "error listing target pods" → "error listing
  nodes" in the node-level branch.
- Pass ctx from the admission handler instead of context.Background().
- Add nil guards to initialSafetyNets and safetyNetMissingNetworkFilters.
- Change safetyNetMissingNetworkFilters from value to pointer receiver.
api/v1beta1/disruption_webhook_test.go: Set apiReader in all test
setups that exercise safety nets. Add 6 new unit tests for
safetyNetCountNotTooLarge.

Code Quality Checklist

The documentation is up to date.
My code is sufficiently commented and passes continuous integration checks.
I have signed my commit (see Contributing Docs).

Testing

I leveraged continuous integration testing
- by depending on existing unit tests or end-to-end tests.
- by adding new unit tests or end-to-end tests.
I manually tested the following steps:
- Applied a Disruption resource that previously triggered the error.
- locally.
- as a canary deployment to a cluster.

New tests added

6 unit tests for safetyNetCountNotTooLarge covering:

Small fraction of pods (no trigger)
>80% of namespace pods (triggers)
>66% of cluster pods (triggers)
disableCountTooLarge bypass
Node-level disruption listing
No matching targets (early return)

datadog-official · 2026-04-09T15:41:41Z

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
• Patch Coverage: 17.39%
• Overall Coverage: 38.65% (+0.15%)

_{This comment will be updated automatically if new data arrives.

🔗 Commit SHA: b29490c | Docs | Datadog PR Page | Was this helpful? React with 👍/👎 or give us feedback!}

fix(webhook): fix cache error on k8s client pagination.

b60a27c

Zenithar self-assigned this Apr 9, 2026

Zenithar marked this pull request as ready for review April 9, 2026 15:41

Zenithar requested a review from a team as a code owner April 9, 2026 15:41

fix(webhook): use APIReader client without cache.

b29490c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(webhook): fix cache error on k8s client pagination.#1059

fix(webhook): fix cache error on k8s client pagination.#1059
Zenithar wants to merge 2 commits intomainfrom
zenithar/chaos-controller/fix_safetyNetCountTooLarge_error

Zenithar commented Apr 9, 2026 •

edited

Loading

Uh oh!

datadog-official bot commented Apr 9, 2026 •

edited by datadog-prod-us1-6 bot

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Zenithar commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Summary

Root Cause

Fix

Code Quality Checklist

Testing

New tests added

Uh oh!

datadog-official bot commented Apr 9, 2026 • edited by datadog-prod-us1-6 bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Zenithar commented Apr 9, 2026 •

edited

Loading

datadog-official bot commented Apr 9, 2026 •

edited by datadog-prod-us1-6 bot

Loading