[Failover] NoExecute taint is not added to the unhealthy cluster after a period of time #6951

@ryanwuer

Description

What happened:
The NoExecute taint is not added to the unhealthy cluster after a period of time.

What you expected to happen:
The NoExecute taint should be added to the unhealthy cluster after a period of time.

How to reproduce it (as minimally and precisely as possible):
The doc we find relevant to failover is here: https://karmada.io/docs/v1.14/userguide/failover/failover-analysis

We have two clusters named gy1 and gy2, and a workload is propagated to both with a 1:1 weight.

We injected a network fault to check Karmada's failover logic. The fault makes gy2 unreachable. After a short while, the NoSchedule taint is added to gy2, just as the doc describes. But after 5 min (which can be set by the --failover-eviction-timeout flag), no NoExecute taint is added to gy2.
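
For reference, the taints we observe on the Cluster object look roughly like this (a sketch assuming the standard Cluster API fields; the timeAdded value is a placeholder):

apiVersion: cluster.karmada.io/v1alpha1
kind: Cluster
metadata:
  name: gy2
spec:
  taints:
  # Observed: only the NoSchedule taint is present
  - effect: NoSchedule
    key: cluster.karmada.io/unreachable
    timeAdded: "2025-01-01T00:00:00Z"  # placeholder
  # Expected after --failover-eviction-timeout (5 min), but never added:
  # - effect: NoExecute
  #   key: cluster.karmada.io/unreachable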

From the doc we know that the default tolerations below are added to the PropagationPolicy by karmada-webhook:

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation
  namespace: default
spec:
  placement:
    clusterTolerations:
    - effect: NoExecute
      key: cluster.karmada.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: cluster.karmada.io/unreachable
      operator: Exists
      tolerationSeconds: 300
  resourceSelectors:
  - apiVersion: apps/v1
    kind: Deployment
    name: nginx
    namespace: default

These tolerations do not match the NoSchedule taint, since they only tolerate the NoExecute effect. As a result, when we roll out a new deploy, all replicas are scheduled to gy1, which doubles the resource consumption there. This is not what we expect: the new version of the workload should be deployed to gy1 with the same number of replicas as before the network fault was injected. gy2 may be unreachable, but only its control plane is; the workloads inside it may still be running healthily as usual. We cannot double the replicas in gy1, as that could exhaust its resources.
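
For illustration, an explicit toleration for the NoSchedule effect would let new rollouts keep targeting gy2 during the fault; a minimal sketch, assuming the scheduler honors clusterTolerations for NoSchedule taints:

spec:
  placement:
    clusterTolerations:
    # Tolerate the NoSchedule taint so new replicas can still be scheduled to gy2
    - effect: NoSchedule
      key: cluster.karmada.io/unreachable
      operator: Exists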

Anything else we need to know?:

Environment:

  • Karmada version: v1.14.5
  • kubectl-karmada or karmadactl version (the result of kubectl-karmada version or karmadactl version): kubectl karmada version: version.Info{GitVersion:"v1.12.2", GitCommit:"0e82ce1823fdff2859053e48eebce189d78dc9a1", GitTreeState:"clean", BuildDate:"2025-01-02T12:19:54Z", GoVersion:"go1.22.9", Compiler:"gc", Platform:"darwin/arm64"}
  • Others: K8s version: v1.19.3
