[SREP-3734] Fix AlertmanagerInhibitions test race condition#3220
[SREP-3734] Fix AlertmanagerInhibitions test race condition#3220smarthall wants to merge 1 commit intoopenshift:mainfrom
Conversation
|
There are test jobs defined for this repository which are not configured to run automatically. Comment |
WalkthroughReplaces a static 3-minute sleep and single Prometheus query with an Eventually-based polling loop. The new logic retries Prometheus Alerts retrieval until detecting either Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 10 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: smarthall The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/e2e/osd/inhibitions.go (1)
162-181: Propagatectxand log transient Prometheus errors.The test function receives
ctxas a parameter; use it instead ofcontext.Background()to honour test cancellation. Additionally, log Prometheus query errors to aid flake diagnosis.
- Derive timeout from
ctxviacontext.WithTimeout(ctx, ...)and passctxtoEventually(ctx, func(ctx context.Context) bool { ... }, ...)so Gomega respects upstream cancellation (Gomega v1.39.1 supports this).- Log errors from
prometheusApiClient.Alerts()usingginkgo.GinkgoLogr.Error(err, "Unable to query prom API")before returning false, matching the established pattern inpkg/e2e/state/alerts.go.♻️ Proposed refactor
- Eventually(func() bool { - timeout, cancel := context.WithTimeout(context.Background(), 10*time.Second) + Eventually(ctx, func(ctx context.Context) bool { + queryCtx, cancel := context.WithTimeout(ctx, 10*time.Second) defer cancel() - alerts, err := prometheusApiClient.Alerts(timeout) + alerts, err := prometheusApiClient.Alerts(queryCtx) if err != nil { + ginkgo.GinkgoLogr.Error(err, "Unable to query prom API") return false } for _, alert := range alerts.Alerts { if (alert.Labels["alertname"] == "ClusterOperatorDown" || alert.Labels["alertname"] == "ClusterOperatorDegraded") && alert.Labels["name"] == "authentication" { return true } } return false - }, 5*time.Minute, 30*time.Second).Should(BeTrue()) + }, 5*time.Minute, 30*time.Second).Should(BeTrue(), "ClusterOperatorDown/Degraded alert for authentication never appeared")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/e2e/osd/inhibitions.go` around lines 162 - 181, Replace context.Background() with the test's ctx by deriving the short query timeout via context.WithTimeout(ctx, 10*time.Second) and pass ctx into Eventually so Gomega can honor upstream cancellation (Gomega v1.39.1 supports Eventually(ctx, ...)). Inside the body that calls prometheusApiClient.Alerts(timeout), log transient errors before returning false using ginkgo.GinkgoLogr.Error(err, "Unable to query prom API"); keep the alert-name checks as-is (alert.Labels["alertname"] and alert.Labels["name"] == "authentication") and ensure the cancel() defer remains.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/e2e/osd/inhibitions.go`:
- Around line 162-181: Replace context.Background() with the test's ctx by
deriving the short query timeout via context.WithTimeout(ctx, 10*time.Second)
and pass ctx into Eventually so Gomega can honor upstream cancellation (Gomega
v1.39.1 supports Eventually(ctx, ...)). Inside the body that calls
prometheusApiClient.Alerts(timeout), log transient errors before returning false
using ginkgo.GinkgoLogr.Error(err, "Unable to query prom API"); keep the
alert-name checks as-is (alert.Labels["alertname"] and alert.Labels["name"] ==
"authentication") and ensure the cancel() defer remains.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: c0e8c436-cab8-4dda-a8a3-ea11dd554bef
📒 Files selected for processing (1)
pkg/e2e/osd/inhibitions.go
…d sleep The test queries Prometheus for a ClusterOperatorDown/Degraded alert after injecting a bogus OIDC provider. The previous fixed 3-minute sleep was too tight — the alert enters pending state once the operator reports degraded and Prometheus scrapes + evaluates the rule (typically 2-3 min), but occasionally takes longer. Replace with an Eventually poll (5m timeout, 30s interval) that returns as soon as the alert appears. Signed-off-by: Daniel Hall <danhall@redhat.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
84727e0 to
148d898
Compare
|
@smarthall: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
inhibits ClusterOperatorDegradedtest was using a fixed 3-minute sleep before querying Prometheus for alerts, which was occasionally too shortEventuallypoll (5m timeout, 30s interval) that returns as soon as the alert appearsDetails
The test injects a bogus OIDC identity provider to degrade the
authenticationClusterOperator, then checks Prometheus forClusterOperatorDownorClusterOperatorDegradedalerts. The alert enters "pending" state once the operator reports degraded and Prometheus scrapes + evaluates the rule (typically 2-3 min). The previous fixed 3-minute sleep was tight enough that slight timing variation caused failures.The sleep was reduced from 10m to 3m in #2952 (OSD-29225), which was primarily fixing an unrelated deep-equality bug.
Example failures
Expected <bool>: false to be trueatinhibitions.go:181Test plan
periodic-ci-openshift-osde2e-main-nightly-4.21-osd-awspasses with this change🤖 Generated with Claude Code