feat: add soft node-affinity stickiness for AgentRuntime and CodeInte… by Turbo-Jiaxxx789 · Pull Request #412 · volcano-sh/agentcube

Turbo-Jiaxxx789 · 2026-06-29T02:02:49Z

…rpreter

When enabled, a new session prefers (soft / PreferredDuringScheduling) the node the previous session for the same workload landed on. The scheduled node is recorded back onto the workload via a last-node annotation after the pod is ready. Gated by a CRD-level nodeStickiness.enabled switch that defaults to off, so the feature is fully inert unless explicitly turned on. Has no effect for CodeInterpreter warm pools, since those sandboxes are pre-scheduled.

What type of PR is this?
/kind feature

What this PR does / why we need it:
Adds opt-in soft node-affinity stickiness for AgentRuntime and CodeInterpreter. When spec.nodeStickiness.enabled is set, a new session prefers to schedule onto the node where the previous session for the same workload ran, improving warm-cache / local-data reuse across sessions.
In our RL experiments, enable node-affinity stickiness for AgentRuntime reduced startup time by 70%+ across ~20k containers + ~20k images.
New optional field spec.nodeStickiness.enabled (bool, default false) on both CRDs.
When enabled and a last node is recorded, a PreferredDuringSchedulingIgnoredDuringExecution node-affinity term (weight 100, key kubernetes.io/hostname) is appended to the sandbox pod spec — user-defined affinity is preserved, never overwritten.
After the session pod becomes ready, the scheduled node is written back onto the owning workload via the runtime.agentcube.io/last-node annotation.
It is a soft preference only, so scheduling never fails if the preferred node is cordoned, deleted, or full.

Which issue(s) this PR fixes: Fixes #

Special notes for your reviewer:
Backward compatible: a nil/omitted nodeStickiness is treated as disabled, so existing CRs need no migration.
Has no effect for CodeInterpreter with warmPoolSize > 0, since warm-pool sandboxes are pre-scheduled before a session claims them.
The last-node write-back is a no-op when stickiness is disabled.
Does this PR introduce a user-facing change?:

Add opt-in soft node-affinity stickiness for AgentRuntime and CodeInterpreter via spec.nodeStickiness.enabled (default false). When enabled, sessions prefer the node used by the previous session for the same workload.

volcano-sh-bot · 2026-06-29T02:02:55Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hzxuzhonghu for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

volcano-sh-bot · 2026-06-29T02:03:00Z

Welcome @Turbo-Jiaxxx789! It looks like this is your first PR to volcano-sh/agentcube 🎉

gemini-code-assist

Code Review

This pull request introduces a node stickiness feature for AgentRuntime and CodeInterpreter workloads, allowing subsequent sessions to prefer scheduling on the same node as the previous session. The feedback suggests using a detached context with a timeout for the best-effort workload patching operation to ensure it succeeds even if the client disconnects immediately after sandbox creation.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-29T02:04:58Z

+	if sandboxEntry.StickyWorkloadName != "" && nodeName != "" {
+		if err := s.k8sClient.PatchWorkloadLastNode(ctx, sandbox.Namespace, sandboxEntry.StickyWorkloadKind, sandboxEntry.StickyWorkloadName, nodeName); err != nil {
+			klog.Warningf("failed to patch %s %s/%s last-node=%s: %v", sandboxEntry.StickyWorkloadKind, sandbox.Namespace, sandboxEntry.StickyWorkloadName, nodeName, err)
+		}
+	}


Using the request context ctx for the best-effort PatchWorkloadLastNode call means that if the client disconnects (canceling ctx) immediately after the sandbox becomes ready, the patch operation will fail. Since the sandbox has already been successfully created and scheduled, we should ensure the scheduled node is recorded regardless of client connection status. Consider using a detached context with a reasonable timeout (e.g., 5 seconds) to make this best-effort write-back more robust.

Suggested change

if sandboxEntry.StickyWorkloadName != "" && nodeName != "" {

if err := s.k8sClient.PatchWorkloadLastNode(ctx, sandbox.Namespace, sandboxEntry.StickyWorkloadKind, sandboxEntry.StickyWorkloadName, nodeName); err != nil {

klog.Warningf("failed to patch %s %s/%s last-node=%s: %v", sandboxEntry.StickyWorkloadKind, sandbox.Namespace, sandboxEntry.StickyWorkloadName, nodeName, err)

}

}

if sandboxEntry.StickyWorkloadName != "" && nodeName != "" {

patchCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)

defer cancel()

if err := s.k8sClient.PatchWorkloadLastNode(patchCtx, sandbox.Namespace, sandboxEntry.StickyWorkloadKind, sandboxEntry.StickyWorkloadName, nodeName); err != nil {

klog.Warningf("failed to patch %s %s/%s last-node=%s: %v", sandboxEntry.StickyWorkloadKind, sandbox.Namespace, sandboxEntry.StickyWorkloadName, nodeName, err)

}

}

Copilot

Pull request overview

This PR adds an opt-in “soft node stickiness” feature to the Workload Manager’s sandbox provisioning flow for AgentRuntime and CodeInterpreter. When enabled via spec.nodeStickiness.enabled, new sessions prefer scheduling onto the same node as the previous session by injecting a PreferredDuringSchedulingIgnoredDuringExecution node-affinity term, and the Workload Manager records the actual scheduled node back onto the owning workload via the runtime.agentcube.io/last-node annotation.

Changes:

Add spec.nodeStickiness.enabled (default false) to AgentRuntime and CodeInterpreter CRDs/types and generated deepcopy code.
Inject a preferred hostname node-affinity term when stickiness is enabled and runtime.agentcube.io/last-node is present.
After sandbox readiness/entrypoint verification, patch the owning workload’s runtime.agentcube.io/last-node annotation (best-effort) for future sessions.

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
pkg/workloadmanager/workload_builder.go	Adds stickiness gating, affinity injection helper, and populates sticky workload identifiers for write-back.
pkg/workloadmanager/workload_builder_test.go	Adds unit tests for affinity injection and stickiness behavior for AgentRuntime/CodeInterpreter (incl. warm pool behavior).
pkg/workloadmanager/k8s_client.go	Extends pod lookup to return node name and adds workload annotation patch helper (`last-node`).
pkg/workloadmanager/k8s_client_test.go	Updates tests for new `GetSandboxPodIP` return signature.
pkg/workloadmanager/handlers.go	Records scheduled node back onto the owning workload after sandbox becomes usable (best-effort).
pkg/workloadmanager/handlers_test.go	Refactors patching approach for stability and updates mocks for new pod lookup signature.
pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go	Updates generated deepcopy implementations for the new NodeStickiness spec pointers/types.
pkg/apis/runtime/v1alpha1/codeinterpreter_types.go	Adds `NodeStickiness` to CodeInterpreterSpec and defines `NodeStickinessSpec`.
pkg/apis/runtime/v1alpha1/agent_type.go	Adds `NodeStickiness` to AgentRuntimeSpec.
manifests/charts/base/crds/runtime.agentcube.volcano.sh_codeinterpreters.yaml	CRD schema update to expose `spec.nodeStickiness.enabled`.
manifests/charts/base/crds/runtime.agentcube.volcano.sh_agentruntimes.yaml	CRD schema update to expose `spec.nodeStickiness.enabled`.

Files not reviewed (1)

pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go: Generated file

+	// LastNodeAnnotationKey records the node the AgentRuntime's previous session
+	// landed on, so the next session can prefer the same node (node stickiness).
+	LastNodeAnnotationKey = "runtime.agentcube.io/last-node"


+func injectPreferredNodeAffinity(podSpec *corev1.PodSpec, nodeName string) {
+	term := corev1.PreferredSchedulingTerm{
+		Weight: 100,
+		Preference: corev1.NodeSelectorTerm{
+			MatchExpressions: []corev1.NodeSelectorRequirement{
+				{
+					Key:      "kubernetes.io/hostname",
+					Operator: corev1.NodeSelectorOpIn,
+					Values:   []string{nodeName},
+				},
+			},
+		},
+	}
+	if podSpec.Affinity == nil {
+		podSpec.Affinity = &corev1.Affinity{}
+	}
+	if podSpec.Affinity.NodeAffinity == nil {
+		podSpec.Affinity.NodeAffinity = &corev1.NodeAffinity{}
+	}
+	podSpec.Affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution = append(
+		podSpec.Affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution,
+		term,
+	)
+}


+// PatchWorkloadLastNode records the node the latest session landed on as an
+// annotation on the owning workload (AgentRuntime or CodeInterpreter), so the
+// next session can prefer the same node.
+func (c *K8sClient) PatchWorkloadLastNode(ctx context.Context, namespace, kind, name, nodeName string) error {
+	patch := []byte(fmt.Sprintf(
+		`{"metadata":{"annotations":{%q:%q}}}`,
+		LastNodeAnnotationKey, nodeName,
+	))
+	var err error
+	switch kind {
+	case runtimev1alpha1.AgentRuntimeKind:
+		_, err = c.cubeClientset.RuntimeV1alpha1().AgentRuntimes(namespace).Patch(
+			ctx, name, k8stypes.MergePatchType, patch, metav1.PatchOptions{},
+		)
+	case runtimev1alpha1.CodeInterpreterKind:
+		_, err = c.cubeClientset.RuntimeV1alpha1().CodeInterpreters(namespace).Patch(
+			ctx, name, k8stypes.MergePatchType, patch, metav1.PatchOptions{},
+		)
+	default:
+		return fmt.Errorf("patch last-node: unsupported workload kind %q", kind)
+	}
+	if err != nil {
+		return fmt.Errorf("patch %s %s/%s last-node: %w", kind, namespace, name, err)
+	}
+	return nil


codecov-commenter · 2026-06-29T02:07:58Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 79.80769% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.26%. Comparing base (524e55e) to head (d193a8f).
⚠️ Report is 146 commits behind head on main.

Files with missing lines	Patch %	Lines
pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go	50.00%	8 Missing ⚠️
pkg/workloadmanager/k8s_client.go	80.00%	5 Missing and 1 partial ⚠️
pkg/workloadmanager/handlers.go	55.55%	4 Missing ⚠️
pkg/workloadmanager/workload_builder.go	93.87%	2 Missing and 1 partial ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@             Coverage Diff             @@
##             main     #412       +/-   ##
===========================================
+ Coverage   47.57%   59.26%   +11.69%     
===========================================
  Files          30       37        +7     
  Lines        2819     3584      +765     
===========================================
+ Hits         1341     2124      +783     
+ Misses       1338     1249       -89     
- Partials      140      211       +71

Flag	Coverage Δ
unittests	`59.26% <79.80%> (+11.69%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot

Pull request overview

Copilot reviewed 11 out of 13 changed files in this pull request and generated 1 comment.

Files not reviewed (1)

pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go: Generated file

+	if podSpec.Affinity != nil && podSpec.Affinity.NodeAffinity != nil {
+		for _, existing := range podSpec.Affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution {
+			for _, expr := range existing.Preference.MatchExpressions {
+				if expr.Key != corev1.LabelHostname || expr.Operator != corev1.NodeSelectorOpIn {
+					continue
+				}
+				for _, v := range expr.Values {
+					if v == nodeName {
+						return
+					}
+				}
+			}
+		}
+	}


…rpreter When enabled, a new session prefers (soft / PreferredDuringScheduling) the node the previous session for the same workload landed on. The scheduled node is recorded back onto the workload via a last-node annotation after the pod is ready. Gated by a CRD-level nodeStickiness.enabled switch that defaults to off, so the feature is fully inert unless explicitly turned on. Has no effect for CodeInterpreter warm pools, since those sandboxes are pre-scheduled. Signed-off-by: Turbo-Jiaxxx789 <222392535+Turbo-Jiaxxx789@users.noreply.github.com>

Copilot AI review requested due to automatic review settings June 29, 2026 02:02

volcano-sh-bot added the kind/feature label Jun 29, 2026

volcano-sh-bot requested review from LiZhenCheng9527, VanderChen and acsoto June 29, 2026 02:02

volcano-sh-bot added the size/XL label Jun 29, 2026

Copilot started reviewing on behalf of Turbo-Jiaxxx789 June 29, 2026 02:03 View session

gemini-code-assist Bot reviewed Jun 29, 2026

View reviewed changes

Copilot AI reviewed Jun 29, 2026

View reviewed changes

Turbo-Jiaxxx789 force-pushed the feat/node-affinity-stickiness branch 2 times, most recently from 29007a7 to b148009 Compare June 29, 2026 03:14

Copilot AI review requested due to automatic review settings June 29, 2026 03:14

Copilot started reviewing on behalf of Turbo-Jiaxxx789 June 29, 2026 03:14 View session

Copilot AI reviewed Jun 29, 2026

View reviewed changes

Turbo-Jiaxxx789 force-pushed the feat/node-affinity-stickiness branch from b148009 to d193a8f Compare June 29, 2026 03:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: add soft node-affinity stickiness for AgentRuntime and CodeInte…#412

feat: add soft node-affinity stickiness for AgentRuntime and CodeInte…#412
Turbo-Jiaxxx789 wants to merge 1 commit into
volcano-sh:mainfrom
Turbo-Jiaxxx789:feat/node-affinity-stickiness

Turbo-Jiaxxx789 commented Jun 29, 2026

Uh oh!

volcano-sh-bot commented Jun 29, 2026

Uh oh!

volcano-sh-bot commented Jun 29, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 29, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

codecov-commenter commented Jun 29, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

Turbo-Jiaxxx789 commented Jun 29, 2026

Uh oh!

volcano-sh-bot commented Jun 29, 2026

Uh oh!

volcano-sh-bot commented Jun 29, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

codecov-commenter commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov-commenter commented Jun 29, 2026 •

edited

Loading