
Add event recording and status conditions for worker deployments #203

Open
thearcticwatch wants to merge 8 commits into main from add_events

Conversation

@thearcticwatch

What changed: Added Kubernetes events and status conditions

(TemporalConnectionHealthy, RolloutReady) to the worker controller
reconciliation loop.

Why: Reconciliation failures were only visible in controller logs —
events and conditions let users diagnose issues directly via kubectl.

  1. Closes #28: Add events to the TemporalWorkerDeployment CRD when there is a problem

  2. How was this tested:
    Added unit tests

  3. Any docs updates needed?
    N/A
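For readers skimming the thread, the condition half of this change follows the usual upsert-by-Type pattern (what `meta.SetStatusCondition` does for `metav1.Condition`). A minimal stdlib-only sketch with simplified stand-in types, not the PR's actual code:

```go
package main

import "fmt"

// Condition is a simplified stand-in for metav1.Condition.
type Condition struct {
	Type    string
	Status  string // "True" or "False"
	Reason  string
	Message string
}

// setCondition upserts a condition by Type, mirroring the behavior of
// apimachinery's meta.SetStatusCondition: replace if present, else append.
func setCondition(conds *[]Condition, c Condition) {
	for i := range *conds {
		if (*conds)[i].Type == c.Type {
			(*conds)[i] = c
			return
		}
	}
	*conds = append(*conds, c)
}

func main() {
	var conds []Condition
	setCondition(&conds, Condition{Type: "TemporalConnectionHealthy", Status: "False",
		Reason: "TemporalConnectionNotFound", Message: "connection not found"})
	// A later reconcile loop flips the same condition rather than adding a new one.
	setCondition(&conds, Condition{Type: "TemporalConnectionHealthy", Status: "True",
		Reason: "Connected", Message: "connected"})
	fmt.Println(len(conds), conds[0].Status)
}
```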

@thearcticwatch thearcticwatch requested review from a team and jlegrone as code owners February 21, 2026 00:39
@CLAassistant

CLAassistant commented Feb 21, 2026

CLA assistant check
All committers have signed the CLA.

Collaborator

@carlydf carlydf left a comment

Also, `make fmt-imports` will solve some of your lint errors.

@thearcticwatch thearcticwatch enabled auto-merge (squash) February 21, 2026 14:14
Collaborator

@carlydf carlydf left a comment

Looking good! Just did an initial review; we should still add a functional test once these comments are addressed.

I found https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#events and https://book.kubebuilder.io/reference/raising-events#creating-events helpful while reviewing.

```go
}
if err != nil {
	r.Recorder.Eventf(workerDeploy, corev1.EventTypeWarning, "TestWorkflowStartFailed",
		"Failed to start gate workflow %q (buildID %s): %v", wf.workflowType, wf.buildID, err)
```
Collaborator

can you put the task queue name in this event?

Comment on lines +209 to +219
```go
if err := r.executeK8sOperations(ctx, l, workerDeploy, p); err != nil {
	return err
}

deploymentHandler := temporalClient.WorkerDeploymentClient().GetHandle(p.WorkerDeploymentName)

if err := r.startTestWorkflows(ctx, l, workerDeploy, temporalClient, p); err != nil {
	return err
}

if err := r.updateVersionConfig(ctx, l, workerDeploy, deploymentHandler, p); err != nil {
```
Collaborator

was this refactor just to make the code easier to understand, or was it necessary for the events change? (I'm all for it, just curious)

Author

Yes, the tests were failing for Cognitive Complexity and the refactor fixed them. It's also just a bonus refactor.

Comment on lines 283 to 284
```go
r.setCondition(&workerDeploy, temporaliov1alpha1.ConditionRolloutReady, metav1.ConditionTrue,
	"RolloutSucceeded", "Target version rollout complete "+workerDeploy.Status.TargetVersion.BuildID)
```
Collaborator

It feels inconsistent to me to be saying "rollout ready", "rollout succeeded", and "rollout complete" all together. Maybe just "rollout complete" for all of them?

Also, I just realized: the condition as written will cause us to emit a ConditionRolloutReady event at the end of the first reconcile loop that meets this condition (which is correct), but also at the end of every subsequent reconcile loop, which means we will emit the same event every 30s.

I think if we instead emit this event in the actual updateVersionConfig function where we successfully set the Current Version, that would ensure that we only emit it once per completed rollout
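A stdlib-only sketch of the emit-once idea suggested above: the event fires only on the reconcile loop where the current version actually transitions, so steady-state loops every 30s stay quiet. Types and names are simplified stand-ins, not the PR's code:

```go
package main

import "fmt"

// recordOnce emits the rollout event only when the current version actually
// changes, mirroring the suggestion to emit inside updateVersionConfig at the
// moment the Current Version is set.
func recordOnce(current *string, target string, emit func(string)) {
	if *current == target {
		return // already rolled out; nothing to announce
	}
	*current = target
	emit(fmt.Sprintf("Rollout complete for build %s", target))
}

func main() {
	var events []string
	cur := ""
	for i := 0; i < 3; i++ { // three reconcile loops at 30s intervals
		recordOnce(&cur, "v2", func(e string) { events = append(events, e) })
	}
	fmt.Println(len(events)) // one event despite three loops
}
```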

```go
//+kubebuilder:rbac:groups=core,resources=secrets,verbs=get;list;watch
//+kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups=apps,resources=deployments/scale,verbs=update
// +kubebuilder:rbac:groups="",resources=events,verbs=create;patch
```
Collaborator

https://book.kubebuilder.io/reference/raising-events#granting-the-required-permissions says that this would be enough:

```go
// +kubebuilder:rbac:groups=events.k8s.io,resources=events,verbs=create;patch
```

so I think `groups=""` may be overly permissive?
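For what it's worth, which group the marker needs seems to depend on which events API the recorder writes to; a sketch of the two forms, assuming the controller uses the legacy core/v1 recorder (e.g. `mgr.GetEventRecorderFor` from controller-runtime):

```go
package controller

// Legacy core/v1 events (what client-go's EventRecorder writes) live in the
// core ("") group, so that form would be required in that case:
//+kubebuilder:rbac:groups="",resources=events,verbs=create;patch

// The events.k8s.io group applies only if the controller records events
// through the newer events.k8s.io/v1 API instead:
//+kubebuilder:rbac:groups=events.k8s.io,resources=events,verbs=create;patch
```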

Comment on lines 153 to 157
```go
r.Recorder.Eventf(&workerDeploy, corev1.EventTypeWarning, "TemporalConnectionNotFound",
	"Unable to fetch TemporalConnection %q: %v", workerDeploy.Spec.WorkerOptions.TemporalConnectionRef.Name, err)
r.setCondition(&workerDeploy, temporaliov1alpha1.ConditionTemporalConnectionHealthy, metav1.ConditionFalse,
	"TemporalConnectionNotFound", fmt.Sprintf("TemporalConnection %q not found: %v", workerDeploy.Spec.WorkerOptions.TemporalConnectionRef.Name, err))
_ = r.Status().Update(ctx, &workerDeploy)
```
Member

Seems like this pattern of setting conditions and recording events is repeated throughout the codebase; I think we should clean this up with a helper, which would also make it look neater.

Wdyt?

```go
)

func (r *TemporalWorkerDeploymentReconciler) executePlan(ctx context.Context, l logr.Logger, temporalClient sdkclient.Client, p *plan) error {
func (r *TemporalWorkerDeploymentReconciler) executeK8sOperations(ctx context.Context, l logr.Logger, workerDeploy *temporaliov1alpha1.TemporalWorkerDeployment, p *plan) error {
```
Member

Any specific reason to rename this? I quite liked the approach of having these three files/steps where we first generate a status, make a plan, and then act on it.

```go
ConflictToken: vcfg.ConflictToken,
Identity:      getControllerIdentity(),
}); err != nil {
r.Recorder.Eventf(workerDeploy, corev1.EventTypeWarning, "VersionRegistrationFailed",
```
Member

We should also log the error here, in addition to recording the event, imo. We have followed this elsewhere in the same file and in the codebase, and I think it is good practice: if we do have a persistence failure, the logs would still be an option for an operator.

Comment on lines 268 to 276
```go
	r.Recorder.Eventf(&workerDeploy, corev1.EventTypeWarning, "PlanGenerationFailed",
		"Unable to generate reconciliation plan: %v", err)
	return ctrl.Result{}, err
}

// Execute the plan, handling any errors
if err := r.executePlan(ctx, l, temporalClient, plan); err != nil {
if err := r.executePlan(ctx, l, &workerDeploy, temporalClient, plan); err != nil {
	r.Recorder.Eventf(&workerDeploy, corev1.EventTypeWarning, "PlanExecutionFailed",
		"Unable to execute reconciliation plan: %v", err)
```
Member

Any specific reason you chose not to persist failures for the plan generation and execution phases, but persisted them when we generate the status from the server?
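One way to make persistence uniform across phases, sketched with stdlib-only stand-ins: run the status update in a deferred step so conditions set on error paths are written too. This illustrates the idea, not the PR's code:

```go
package main

import (
	"errors"
	"fmt"
)

// status is a simplified stand-in for the CRD's status subresource.
type status struct{ conditions []string }

// reconcile persists status in a deferred step, so conditions set during plan
// generation and execution survive error returns as well as successes.
func reconcile(fail bool, persist func(status)) error {
	var st status
	defer func() { persist(st) }() // runs on success and failure alike
	if fail {
		st.conditions = append(st.conditions, "PlanExecutionFailed")
		return errors.New("plan execution failed")
	}
	st.conditions = append(st.conditions, "RolloutComplete")
	return nil
}

func main() {
	var persisted []status
	_ = reconcile(true, func(s status) { persisted = append(persisted, s) })
	fmt.Println(persisted[0].conditions[0]) // persisted despite the error
}
```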



Development

Successfully merging this pull request may close these issues.

Add events to the TemporalWorkerDeployment CRD when there is a problem

4 participants