feat: add soft node-affinity stickiness for AgentRuntime and CodeInte…#412
feat: add soft node-affinity stickiness for AgentRuntime and CodeInte…#412Turbo-Jiaxxx789 wants to merge 1 commit into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Welcome @Turbo-Jiaxxx789! It looks like this is your first PR to volcano-sh/agentcube 🎉 |
There was a problem hiding this comment.
Code Review
This pull request introduces a node stickiness feature for AgentRuntime and CodeInterpreter workloads, allowing subsequent sessions to prefer scheduling on the same node as the previous session. The feedback suggests using a detached context with a timeout for the best-effort workload patching operation to ensure it succeeds even if the client disconnects immediately after sandbox creation.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if sandboxEntry.StickyWorkloadName != "" && nodeName != "" { | ||
| if err := s.k8sClient.PatchWorkloadLastNode(ctx, sandbox.Namespace, sandboxEntry.StickyWorkloadKind, sandboxEntry.StickyWorkloadName, nodeName); err != nil { | ||
| klog.Warningf("failed to patch %s %s/%s last-node=%s: %v", sandboxEntry.StickyWorkloadKind, sandbox.Namespace, sandboxEntry.StickyWorkloadName, nodeName, err) | ||
| } | ||
| } |
There was a problem hiding this comment.
Using the request context ctx for the best-effort PatchWorkloadLastNode call means that if the client disconnects (canceling ctx) immediately after the sandbox becomes ready, the patch operation will fail. Since the sandbox has already been successfully created and scheduled, we should ensure the scheduled node is recorded regardless of client connection status. Consider using a detached context with a reasonable timeout (e.g., 5 seconds) to make this best-effort write-back more robust.
| if sandboxEntry.StickyWorkloadName != "" && nodeName != "" { | |
| if err := s.k8sClient.PatchWorkloadLastNode(ctx, sandbox.Namespace, sandboxEntry.StickyWorkloadKind, sandboxEntry.StickyWorkloadName, nodeName); err != nil { | |
| klog.Warningf("failed to patch %s %s/%s last-node=%s: %v", sandboxEntry.StickyWorkloadKind, sandbox.Namespace, sandboxEntry.StickyWorkloadName, nodeName, err) | |
| } | |
| } | |
| if sandboxEntry.StickyWorkloadName != "" && nodeName != "" { | |
| patchCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second) | |
| defer cancel() | |
| if err := s.k8sClient.PatchWorkloadLastNode(patchCtx, sandbox.Namespace, sandboxEntry.StickyWorkloadKind, sandboxEntry.StickyWorkloadName, nodeName); err != nil { | |
| klog.Warningf("failed to patch %s %s/%s last-node=%s: %v", sandboxEntry.StickyWorkloadKind, sandbox.Namespace, sandboxEntry.StickyWorkloadName, nodeName, err) | |
| } | |
| } |
There was a problem hiding this comment.
Pull request overview
This PR adds an opt-in “soft node stickiness” feature to the Workload Manager’s sandbox provisioning flow for AgentRuntime and CodeInterpreter. When enabled via spec.nodeStickiness.enabled, new sessions prefer scheduling onto the same node as the previous session by injecting a PreferredDuringSchedulingIgnoredDuringExecution node-affinity term, and the Workload Manager records the actual scheduled node back onto the owning workload via the runtime.agentcube.io/last-node annotation.
Changes:
- Add
spec.nodeStickiness.enabled(defaultfalse) to AgentRuntime and CodeInterpreter CRDs/types and generated deepcopy code. - Inject a preferred hostname node-affinity term when stickiness is enabled and
runtime.agentcube.io/last-nodeis present. - After sandbox readiness/entrypoint verification, patch the owning workload’s
runtime.agentcube.io/last-nodeannotation (best-effort) for future sessions.
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/workloadmanager/workload_builder.go | Adds stickiness gating, affinity injection helper, and populates sticky workload identifiers for write-back. |
| pkg/workloadmanager/workload_builder_test.go | Adds unit tests for affinity injection and stickiness behavior for AgentRuntime/CodeInterpreter (incl. warm pool behavior). |
| pkg/workloadmanager/k8s_client.go | Extends pod lookup to return node name and adds workload annotation patch helper (last-node). |
| pkg/workloadmanager/k8s_client_test.go | Updates tests for new GetSandboxPodIP return signature. |
| pkg/workloadmanager/handlers.go | Records scheduled node back onto the owning workload after sandbox becomes usable (best-effort). |
| pkg/workloadmanager/handlers_test.go | Refactors patching approach for stability and updates mocks for new pod lookup signature. |
| pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go | Updates generated deepcopy implementations for the new NodeStickiness spec pointers/types. |
| pkg/apis/runtime/v1alpha1/codeinterpreter_types.go | Adds NodeStickiness to CodeInterpreterSpec and defines NodeStickinessSpec. |
| pkg/apis/runtime/v1alpha1/agent_type.go | Adds NodeStickiness to AgentRuntimeSpec. |
| manifests/charts/base/crds/runtime.agentcube.volcano.sh_codeinterpreters.yaml | CRD schema update to expose spec.nodeStickiness.enabled. |
| manifests/charts/base/crds/runtime.agentcube.volcano.sh_agentruntimes.yaml | CRD schema update to expose spec.nodeStickiness.enabled. |
Files not reviewed (1)
- pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go: Generated file
| // LastNodeAnnotationKey records the node the AgentRuntime's previous session | ||
| // landed on, so the next session can prefer the same node (node stickiness). | ||
| LastNodeAnnotationKey = "runtime.agentcube.io/last-node" |
| func injectPreferredNodeAffinity(podSpec *corev1.PodSpec, nodeName string) { | ||
| term := corev1.PreferredSchedulingTerm{ | ||
| Weight: 100, | ||
| Preference: corev1.NodeSelectorTerm{ | ||
| MatchExpressions: []corev1.NodeSelectorRequirement{ | ||
| { | ||
| Key: "kubernetes.io/hostname", | ||
| Operator: corev1.NodeSelectorOpIn, | ||
| Values: []string{nodeName}, | ||
| }, | ||
| }, | ||
| }, | ||
| } | ||
| if podSpec.Affinity == nil { | ||
| podSpec.Affinity = &corev1.Affinity{} | ||
| } | ||
| if podSpec.Affinity.NodeAffinity == nil { | ||
| podSpec.Affinity.NodeAffinity = &corev1.NodeAffinity{} | ||
| } | ||
| podSpec.Affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution = append( | ||
| podSpec.Affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution, | ||
| term, | ||
| ) | ||
| } |
| // PatchWorkloadLastNode records the node the latest session landed on as an | ||
| // annotation on the owning workload (AgentRuntime or CodeInterpreter), so the | ||
| // next session can prefer the same node. | ||
| func (c *K8sClient) PatchWorkloadLastNode(ctx context.Context, namespace, kind, name, nodeName string) error { | ||
| patch := []byte(fmt.Sprintf( | ||
| `{"metadata":{"annotations":{%q:%q}}}`, | ||
| LastNodeAnnotationKey, nodeName, | ||
| )) | ||
| var err error | ||
| switch kind { | ||
| case runtimev1alpha1.AgentRuntimeKind: | ||
| _, err = c.cubeClientset.RuntimeV1alpha1().AgentRuntimes(namespace).Patch( | ||
| ctx, name, k8stypes.MergePatchType, patch, metav1.PatchOptions{}, | ||
| ) | ||
| case runtimev1alpha1.CodeInterpreterKind: | ||
| _, err = c.cubeClientset.RuntimeV1alpha1().CodeInterpreters(namespace).Patch( | ||
| ctx, name, k8stypes.MergePatchType, patch, metav1.PatchOptions{}, | ||
| ) | ||
| default: | ||
| return fmt.Errorf("patch last-node: unsupported workload kind %q", kind) | ||
| } | ||
| if err != nil { | ||
| return fmt.Errorf("patch %s %s/%s last-node: %w", kind, namespace, name, err) | ||
| } | ||
| return nil |
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #412 +/- ##
===========================================
+ Coverage 47.57% 59.26% +11.69%
===========================================
Files 30 37 +7
Lines 2819 3584 +765
===========================================
+ Hits 1341 2124 +783
+ Misses 1338 1249 -89
- Partials 140 211 +71
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
29007a7 to
b148009
Compare
| if podSpec.Affinity != nil && podSpec.Affinity.NodeAffinity != nil { | ||
| for _, existing := range podSpec.Affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution { | ||
| for _, expr := range existing.Preference.MatchExpressions { | ||
| if expr.Key != corev1.LabelHostname || expr.Operator != corev1.NodeSelectorOpIn { | ||
| continue | ||
| } | ||
| for _, v := range expr.Values { | ||
| if v == nodeName { | ||
| return | ||
| } | ||
| } | ||
| } | ||
| } | ||
| } |
…rpreter When enabled, a new session prefers (soft / PreferredDuringScheduling) the node the previous session for the same workload landed on. The scheduled node is recorded back onto the workload via a last-node annotation after the pod is ready. Gated by a CRD-level nodeStickiness.enabled switch that defaults to off, so the feature is fully inert unless explicitly turned on. Has no effect for CodeInterpreter warm pools, since those sandboxes are pre-scheduled. Signed-off-by: Turbo-Jiaxxx789 <222392535+Turbo-Jiaxxx789@users.noreply.github.com>
b148009 to
d193a8f
Compare
…rpreter
When enabled, a new session prefers (soft / PreferredDuringScheduling) the node the previous session for the same workload landed on. The scheduled node is recorded back onto the workload via a last-node annotation after the pod is ready. Gated by a CRD-level nodeStickiness.enabled switch that defaults to off, so the feature is fully inert unless explicitly turned on. Has no effect for CodeInterpreter warm pools, since those sandboxes are pre-scheduled.
What type of PR is this?
/kind feature
What this PR does / why we need it:
Adds opt-in soft node-affinity stickiness for AgentRuntime and CodeInterpreter. When spec.nodeStickiness.enabled is set, a new session prefers to schedule onto the node where the previous session for the same workload ran, improving warm-cache / local-data reuse across sessions.
In our RL experiments, enable node-affinity stickiness for AgentRuntime reduced startup time by 70%+ across ~20k containers + ~20k images.
New optional field spec.nodeStickiness.enabled (bool, default false) on both CRDs.
When enabled and a last node is recorded, a PreferredDuringSchedulingIgnoredDuringExecution node-affinity term (weight 100, key kubernetes.io/hostname) is appended to the sandbox pod spec — user-defined affinity is preserved, never overwritten.
After the session pod becomes ready, the scheduled node is written back onto the owning workload via the runtime.agentcube.io/last-node annotation.
It is a soft preference only, so scheduling never fails if the preferred node is cordoned, deleted, or full.
Which issue(s) this PR fixes: Fixes #
Special notes for your reviewer:
Backward compatible: a nil/omitted nodeStickiness is treated as disabled, so existing CRs need no migration.
Has no effect for CodeInterpreter with warmPoolSize > 0, since warm-pool sandboxes are pre-scheduled before a session claims them.
The last-node write-back is a no-op when stickiness is disabled.
Does this PR introduce a user-facing change?:
Add opt-in soft node-affinity stickiness for AgentRuntime and CodeInterpreter via
spec.nodeStickiness.enabled(default false). When enabled, sessions prefer the node used by the previous session for the same workload.