Member

@thiyyakat thiyyakat commented Dec 10, 2025

What this PR does / why we need it:

This PR introduces a feature that allows operators and end users to preserve a machine/node and the backing VM for diagnostic purposes.

The expected behaviour, use cases and usage are detailed in the proposal that can be found here

Which issue(s) this PR fixes:
Fixes #1008

Special notes for your reviewer:

Please take a look at the questions asked here.

Release note:

Introduce support for preservation of machines (both Running and Failed), and the backing node (if it exists). 

@gardener-robot gardener-robot added kind/api-change API change with impact on API users needs/second-opinion Needs second review by someone else needs/rebase Needs git rebase labels Dec 10, 2025
@gardener-robot
@thiyyakat You need to rebase this pull request onto the latest master branch. Please check.

@gardener-robot gardener-robot added needs/review Needs review size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 10, 2025
@thiyyakat thiyyakat force-pushed the feat/preserve-machine branch 2 times, most recently from 06ecf58 to 89f2900 Compare December 10, 2025 12:06
@thiyyakat
Member Author

thiyyakat commented Dec 11, 2025

Questions that remain unanswered:

  1. On recovery, a preserved machine transitions from Failed to Running. If the preserve annotation is when-failed, the node nevertheless stays preserved while Running. Is that okay? The node needs to stay preserved so that pods can be scheduled onto it before CA scales it down.
  2. The drain timeout is currently checked by measuring the time elapsed since LastUpdateTime (set when the machine moved to Failed). Is there a better way to do it?
    timeOutOccurred = utiltime.HasTimeOutOccurred(machine.Status.CurrentStatus.LastUpdateTime, timeOutDuration)
    In the normal drain, the timeout is checked with respect to DeletionTimestamp.
  3. In some parts of the code, the returned error is checked for a Conflict, and ConflictRetry is returned instead of ShortRetry. When should these checks be performed? The preservation flow has a lot of update calls. Update: addressed; ConflictRetry is used where appropriate.
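To make question 2 concrete, here is a minimal sketch of choosing the reference timestamp for the drain timeout. The helper names (hasTimedOut, drainTimedOut, preservedDrainTimedOut) are hypothetical, and hasTimedOut only approximates what utiltime.HasTimeOutOccurred is assumed to do; its real semantics are not shown in this thread.

```go
package main

import (
	"fmt"
	"time"
)

// hasTimedOut reports whether timeout has elapsed between start and now.
// Hypothetical stand-in for utiltime.HasTimeOutOccurred.
func hasTimedOut(start, now time.Time, timeout time.Duration) bool {
	return now.Sub(start) > timeout
}

// drainTimedOut picks the reference timestamp as described in the question:
// the deletion timestamp for a normal drain, or the last status update
// (the move to Failed) for a preserved machine that is not being deleted.
func drainTimedOut(lastUpdate time.Time, deletion *time.Time, now time.Time, timeout time.Duration) bool {
	ref := lastUpdate
	if deletion != nil {
		ref = *deletion
	}
	return hasTimedOut(ref, now, timeout)
}

// preservedDrainTimedOut models a preserved machine (no deletion timestamp)
// that moved to Failed failedMinutesAgo minutes before now.
func preservedDrainTimedOut(failedMinutesAgo, timeoutMinutes int) bool {
	now := time.Now()
	lastUpdate := now.Add(-time.Duration(failedMinutesAgo) * time.Minute)
	return drainTimedOut(lastUpdate, nil, now, time.Duration(timeoutMinutes)*time.Minute)
}

func main() {
	fmt.Println(preservedDrainTimedOut(5, 10))  // still within the timeout
	fmt.Println(preservedDrainTimedOut(15, 10)) // past the timeout
}
```

One thing to watch with this approach: if LastUpdateTime is refreshed by a later status update, the timeout window restarts from that update.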

Member Author

@thiyyakat thiyyakat left a comment

Note: A review meeting was held today for this PR. The comments were given during the meeting.

During the meeting, we revisited the decision to perform the drain in the Failed state for preserved machines. The reason discussed previously was that it does not make sense semantically to move the machine to Terminating and then drain it, because the machine may still recover. Since Terminating is a final state, the drain (separate from the drain in triggerDeletionFlow) will be performed in the Failed phase. No change was proposed during the meeting; the design decision was only reconfirmed.

Member

@takoverflow takoverflow left a comment

Have only gone through half of the PR, have some suggestions PTAL.

// or if it is a candidate for auto-preservation
// TODO@thiyyakat: find more suitable name for function
func (c *controller) isMachineCandidateForPreservation(ctx context.Context, machineSet *v1alpha1.MachineSet, machine *v1alpha1.Machine) (bool, error) {
if machineutils.IsPreserveExpiryTimeSet(machine) && !machineutils.HasPreservationTimedOut(machine) {
Member

IsPreserveExpiryTimeSet already checks that the time is non-zero, and HasPreservationTimedOut is called only after that.
Is there any reason to repeat the IsZero check for PreserveExpiryTime in HasPreservationTimedOut?
I don't see the function being called anywhere else either.

If the zero check is removed, it could just be simplified to

func HasPreservationTimedOut(m *v1alpha1.Machine) bool {
	return !m.Status.CurrentStatus.PreserveExpiryTime.After(time.Now())
}
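
As a quick illustration of what the simplified check returns, the sketch below uses the standard library's time.Time in place of metav1.Time. The names hasPreservationTimedOut, timedOutWithOffset and timedOutWithZeroExpiry are hypothetical, and now is passed in only to keep the example deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// hasPreservationTimedOut mirrors the simplified check: the expiry has
// passed once it is no longer after now.
func hasPreservationTimedOut(expiry, now time.Time) bool {
	return !expiry.After(now)
}

// timedOutWithOffset evaluates the check for an expiry offsetMinutes
// away from now (negative values lie in the past).
func timedOutWithOffset(offsetMinutes int) bool {
	now := time.Now()
	return hasPreservationTimedOut(now.Add(time.Duration(offsetMinutes)*time.Minute), now)
}

// timedOutWithZeroExpiry evaluates the check for an unset (zero) expiry.
func timedOutWithZeroExpiry() bool {
	return hasPreservationTimedOut(time.Time{}, time.Now())
}

func main() {
	fmt.Println(timedOutWithOffset(30))   // false: expiry is in the future
	fmt.Println(timedOutWithOffset(-30))  // true: expiry has passed
	fmt.Println(timedOutWithZeroExpiry()) // true: a zero time is never after now
}
```

Without the zero check, an unset expiry reads as "timed out", so the simplification is safe only while callers keep guarding the call with a check that the expiry is set.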

}
nodeName := machine.Labels[v1alpha1.NodeLabelKey]
if nodeName != "" {
preservedCondition := v1.NodeCondition{
Member

Consider renaming this to preservedConditionFalse?

Comment on lines 2475 to 2493
err := nodeops.AddOrUpdateConditionsOnNode(ctx, c.targetCoreClient, nodeName, preservedCondition)
if err != nil {
	return err
}
// Step 2: remove CA's scale-down disabled annotations to allow CA to scale down node if needed
CAAnnotations := make(map[string]string)
CAAnnotations[autoscaler.ClusterAutoscalerScaleDownDisabledAnnotationKey] = ""
latestNode, err := c.targetCoreClient.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
if err != nil {
	klog.Errorf("error trying to get backing node %q for machine %s. Retrying, error: %v", nodeName, machine.Name, err)
	return err
}
latestNodeCopy := latestNode.DeepCopy()
latestNodeCopy, _, _ = annotations.RemoveAnnotation(latestNodeCopy, CAAnnotations) // error can be ignored, always returns nil
_, err = c.targetCoreClient.CoreV1().Nodes().Update(ctx, latestNodeCopy, metav1.UpdateOptions{})
if err != nil {
	klog.Errorf("Node UPDATE failed for node %q of machine %q. Retrying, error: %s", nodeName, machine.Name, err)
	return err
}
Member

@takoverflow takoverflow Dec 18, 2025

Is there a reason why two get calls and two update calls are made for the node? Can these not be combined into a single atomic update of the node object?

I know this is not part of your PR, but can we update the RemoveAnnotation function? It's needlessly complicated.
All you have to do after fetching the object and checking that the annotations are non-nil is

delete(obj.Annotations, annotationKey)

Creating a dummy annotation map, passing it in, and then building a new map without the key: all of this complication can be avoided.

Member Author

@thiyyakat thiyyakat Dec 23, 2025

By 2 Get() calls are you referring to the call within AddOrUpdateConditionsOnNode and the following Get() here:
latestNode, err := c.targetCoreClient.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})?

The first one could be avoided if we didn't use the function. The second one is required because step 1 adds conditions to the node object, and the function does not return the updated node object. Fetching from the cache doesn't guarantee an up-to-date node object (I verified this empirically). I could potentially avoid fetching the object altogether if I didn't use the function. Will test it out.

The two update calls cannot be combined since step 1 requires an UpdateStatus() call, and step 2 updates the Spec, and requires an Update() call.

I will update the RemoveAnnotation function as recommended by you.

Edit: The RemoveAnnotation function returns a boolean indicating whether or not an update is needed. This value is being used in other usages of the function. The function cannot be updated. I will use your suggestion instead of using the function since the boolean value is not required in this case.
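
For reference, a minimal sketch of the delete-based removal that still reports whether anything changed. The object type and removeAnnotation helper are hypothetical stand-ins, not the existing annotations.RemoveAnnotation; the annotation key is the one quoted from the proposal elsewhere in this thread.

```go
package main

import "fmt"

// object is a hypothetical stand-in for a Kubernetes object with annotations.
type object struct {
	Annotations map[string]string
}

// removeAnnotation deletes key from obj and reports whether the object
// changed, so callers can skip the update call when nothing was removed.
// Reading from and deleting on a nil map are both safe in Go.
func removeAnnotation(obj *object, key string) bool {
	if _, ok := obj.Annotations[key]; !ok {
		return false
	}
	delete(obj.Annotations, key)
	return true
}

func main() {
	const key = "cluster-autoscaler.kubernetes.io/scale-down-disabled"
	node := &object{Annotations: map[string]string{key: "true"}}

	fmt.Println(removeAnnotation(node, key))      // true: annotation was present
	fmt.Println(removeAnnotation(node, key))      // false: already removed
	fmt.Println(removeAnnotation(&object{}, key)) // false: nil map, nothing to do
}
```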

// stopMachinePreservation stops the preservation of the machine and node
func (c *controller) stopMachinePreservation(ctx context.Context, machine *v1alpha1.Machine) error {
// check if preserveExpiryTime is set, if not, no need to do anything
if !machineutils.IsPreserveExpiryTimeSet(machine) {
Member

Can there be scenarios where the preserveExpiryTime hasn't been set but the node has preserve conditions
and scale-down disabled annotation added to it? If so, then the removal will never proceed right?

Please let me know if it's not a possible scenario.

Member Author

Setting the preserveExpiryTime is the first step in machine preservation. Node conditions and the CA annotation are added only if step 1 completes successfully. However, if a user manually adds the CA annotation and the node condition but not the preserveExpiryTime, the case you described may occur. I'm not sure we should handle that case though.

if nodeName == "" && isExpirySet {
return true, nil
}
node, err := c.nodeLister.Get(nodeName)
Member

What happens when a machine doesn't have the nodeName set, i.e. nodeName is "" and isExpirySet is false?
Wouldn't this always fail? Why even try to get the node in that case?

Why not move the check above out of this function and into preserveMachine: fetch the nodeName there and
use isExpirySet. WDYT?

	if nodeName == "" && isExpirySet {
		return true, nil
	}
	if isExpirySet {
		isComplete, err := c.isMachinePreservationComplete(machine, nodeName)
		if err != nil {
			return err
		}
	}
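
The branching being discussed can be sketched as a small decision function. This is only an illustration under the assumption stated in this thread that setting the expiry time is the first preservation step; the step type and nextStep function are hypothetical names.

```go
package main

import "fmt"

// step names the next action for a machine that should be preserved.
type step string

const (
	stepStart     step = "start preservation" // expiry not set yet
	stepComplete  step = "complete"           // expiry set, no node to update
	stepCheckNode step = "check node"         // expiry set, inspect the node
)

// nextStep decides what to do without ever looking up a node
// when nodeName is empty.
func nextStep(nodeName string, isExpirySet bool) step {
	switch {
	case !isExpirySet:
		return stepStart
	case nodeName == "":
		return stepComplete
	default:
		return stepCheckNode
	}
}

func main() {
	fmt.Println(nextStep("", false))      // start preservation
	fmt.Println(nextStep("", true))       // complete
	fmt.Println(nextStep("node-1", true)) // check node
}
```

With the check hoisted like this, the node lookup is reached only when both a node name and an expiry time exist.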

Comment on lines 996 to 1009
// if machine is preserved, stop preservation. Else, do nothing.
// this check is done in case the annotation value has changed from preserve=now to preserve=when-failed, in which case preservation needs to be stopped
preserveExpirySet := machineutils.IsPreserveExpiryTimeSet(clone)
machineFailed := machineutils.IsMachineFailed(clone)
if !preserveExpirySet && !machineFailed {
	return
} else if !preserveExpirySet {
	err = c.preserveMachine(ctx, clone, preserveValue)
	return
}
// Here, we do not stop preservation even when preserve expiry time is set but the machine is in Running.
// This is to accommodate the case where the annotation is when-failed and the machine has recovered from Failed to Running.
// In this case, we want the preservation to continue so that CA does not scale down the node before pods are assigned to it
return
Member

@takoverflow takoverflow Dec 18, 2025

Please revisit this case: the comments and the code seem to contradict each other. If you wish to compare the old machine's annotation value with the new machine's
to decide whether to stop preservation, consider using updateMachine, which has both objects available.
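
One possible shape for such a comparison, as a rough sketch only: it assumes the two cases described in the code comments above (stop when the value changes from now to when-failed on a machine that is not Failed; keep preserving a recovered when-failed machine). The function shouldStopPreservation and its parameters are hypothetical and not part of the PR.

```go
package main

import "fmt"

const (
	preserveNow        = "now"
	preserveWhenFailed = "when-failed"
)

// shouldStopPreservation compares the old and new annotation values.
// It returns true only when an active preservation was requested with
// "now" and the annotation was then changed to "when-failed" while the
// machine is not Failed.
func shouldStopPreservation(oldValue, newValue string, expirySet, failed bool) bool {
	if !expirySet {
		return false // nothing is being preserved, nothing to stop
	}
	return oldValue == preserveNow && newValue == preserveWhenFailed && !failed
}

func main() {
	// Annotation changed from now to when-failed on a Running machine.
	fmt.Println(shouldStopPreservation(preserveNow, preserveWhenFailed, true, false)) // true
	// when-failed machine recovered to Running: the annotation did not change.
	fmt.Println(shouldStopPreservation(preserveWhenFailed, preserveWhenFailed, true, false)) // false
}
```

Having both values available is what separates the two cases; the new object alone looks identical in both.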

@gardener-robot gardener-robot added the needs/changes Needs (more) changes label Dec 18, 2025
@thiyyakat thiyyakat force-pushed the feat/preserve-machine branch from 22c646e to 7c062b5 Compare December 19, 2025 08:30
@thiyyakat thiyyakat force-pushed the feat/preserve-machine branch from e2a7ea7 to 74603a4 Compare December 31, 2025 09:56
@thiyyakat thiyyakat marked this pull request as ready for review January 6, 2026 05:56
@thiyyakat thiyyakat requested a review from a team as a code owner January 6, 2026 05:56
LastUpdateTime metav1.Time `json:"lastUpdateTime,omitempty"`

// PreserveExpiryTime is the time at which MCM will stop preserving the machine
PreserveExpiryTime metav1.Time `json:"preserveExpiryTime,omitempty"`
Contributor

"After timeout, the node.machine.sapcloud.io/preserve=now and cluster-autoscaler.kubernetes.io/scale-down-disabled: "true" are deleted. The machine.CurrentStatus.PreserveExpiryTime is set to nil."

Should the PreserveExpiryTime be *metav1.Time?

Member Author

@thiyyakat thiyyakat Jan 8, 2026

Since PreserveExpiryTime is not modified in any of the functions and only assignments are made to the field, I have not declared it as a pointer. The proposal doc can be updated to say "machine.CurrentStatus.PreserveExpiryTime is set to metav1.Time{}" if the reasoning is acceptable.

Made the change suggested. 👍
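
To illustrate the value-versus-pointer difference for an omitempty field, here is a small sketch using the standard library's time.Time and encoding/json. The struct names are hypothetical, and metav1.Time has its own JSON marshalling, so the output for the value case may differ there; the sketch only shows the general Go behaviour.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// statusValue holds the expiry as a value; encoding/json never treats a
// struct value as empty, so omitempty has no effect on it.
type statusValue struct {
	PreserveExpiryTime time.Time `json:"preserveExpiryTime,omitempty"`
}

// statusPointer holds the expiry as a pointer; a nil pointer is omitted.
type statusPointer struct {
	PreserveExpiryTime *time.Time `json:"preserveExpiryTime,omitempty"`
}

// marshal returns the JSON encoding of v, or the error text on failure.
func marshal(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	fmt.Println(marshal(statusValue{}))   // {"preserveExpiryTime":"0001-01-01T00:00:00Z"}
	fmt.Println(marshal(statusPointer{})) // {}
}
```

With a pointer, "not preserved" is simply nil, which is also what the proposal text ("set to nil") describes.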

// The maximum number of machines in the machine deployment that will be auto-preserved.
// In the gardener context, this number is derived from the AutoPreserveFailedMachineMax set at the worker level, distributed amongst the worker's machine deployments
// +optional
AutoPreserveFailedMachineMax int32 `json:"autoPreserveFailedMachineMax,omitempty"`
Contributor

Should we have optional fields as pointer?

Member Author

Consulted @elankath, who agrees that optional fields must be declared as pointers. Will make the change.
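
A short sketch of why the pointer matters for an optional numeric field: with *int32, "not set" and "explicitly 0" stay distinguishable after decoding. The spec struct and describe helper are hypothetical; only the field name and JSON tag come from the diff above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// spec is a hypothetical stand-in carrying only the optional field.
type spec struct {
	AutoPreserveFailedMachineMax *int32 `json:"autoPreserveFailedMachineMax,omitempty"`
}

// describe decodes raw and reports whether the field was set.
func describe(raw string) string {
	var s spec
	if err := json.Unmarshal([]byte(raw), &s); err != nil {
		return "invalid: " + err.Error()
	}
	if s.AutoPreserveFailedMachineMax == nil {
		return "unset"
	}
	return fmt.Sprintf("set to %d", *s.AutoPreserveFailedMachineMax)
}

func main() {
	fmt.Println(describe(`{}`))                                 // unset
	fmt.Println(describe(`{"autoPreserveFailedMachineMax":0}`)) // set to 0
	fmt.Println(describe(`{"autoPreserveFailedMachineMax":3}`)) // set to 3
}
```

With a plain int32, the first two inputs would both decode to 0, so a default could not be told apart from an explicit request to auto-preserve nothing.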

}

// IsPreserveExpiryTimeSet checks if machine is preserved by MCM
func IsPreserveExpiryTimeSet(m *v1alpha1.Machine) bool {
Contributor

@r4mek r4mek Jan 8, 2026

Is this function needed anymore? We can just do a nil check.

Member Author

Yup yup. Will make the change 👍

Labels

kind/api-change API change with impact on API users needs/changes Needs (more) changes needs/rebase Needs git rebase needs/review Needs review needs/second-opinion Needs second review by someone else size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Support Preservation of Failed Machines for diagnostics

4 participants