Skip to content

feat: add soft node-affinity stickiness for AgentRuntime and CodeInte…#412

Open
Turbo-Jiaxxx789 wants to merge 1 commit into
volcano-sh:mainfrom
Turbo-Jiaxxx789:feat/node-affinity-stickiness
Open

feat: add soft node-affinity stickiness for AgentRuntime and CodeInte…#412
Turbo-Jiaxxx789 wants to merge 1 commit into
volcano-sh:mainfrom
Turbo-Jiaxxx789:feat/node-affinity-stickiness

Conversation

@Turbo-Jiaxxx789

Copy link
Copy Markdown

…rpreter

When enabled, a new session prefers (soft / PreferredDuringScheduling) the node the previous session for the same workload landed on. The scheduled node is recorded back onto the workload via a last-node annotation after the pod is ready. Gated by a CRD-level nodeStickiness.enabled switch that defaults to off, so the feature is fully inert unless explicitly turned on. Has no effect for CodeInterpreter warm pools, since those sandboxes are pre-scheduled.

What type of PR is this?
/kind feature

What this PR does / why we need it:
Adds opt-in soft node-affinity stickiness for AgentRuntime and CodeInterpreter. When spec.nodeStickiness.enabled is set, a new session prefers to schedule onto the node where the previous session for the same workload ran, improving warm-cache / local-data reuse across sessions.
In our RL experiments, enable node-affinity stickiness for AgentRuntime reduced startup time by 70%+ across ~20k containers + ~20k images.
New optional field spec.nodeStickiness.enabled (bool, default false) on both CRDs.
When enabled and a last node is recorded, a PreferredDuringSchedulingIgnoredDuringExecution node-affinity term (weight 100, key kubernetes.io/hostname) is appended to the sandbox pod spec — user-defined affinity is preserved, never overwritten.
After the session pod becomes ready, the scheduled node is written back onto the owning workload via the runtime.agentcube.io/last-node annotation.
It is a soft preference only, so scheduling never fails if the preferred node is cordoned, deleted, or full.

Which issue(s) this PR fixes: Fixes #

Special notes for your reviewer:
Backward compatible: a nil/omitted nodeStickiness is treated as disabled, so existing CRs need no migration.
Has no effect for CodeInterpreter with warmPoolSize > 0, since warm-pool sandboxes are pre-scheduled before a session claims them.
The last-node write-back is a no-op when stickiness is disabled.
Does this PR introduce a user-facing change?:

Add opt-in soft node-affinity stickiness for AgentRuntime and CodeInterpreter via spec.nodeStickiness.enabled (default false). When enabled, sessions prefer the node used by the previous session for the same workload.

Copilot AI review requested due to automatic review settings June 29, 2026 02:02
@volcano-sh-bot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hzxuzhonghu for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot

Copy link
Copy Markdown
Contributor

Welcome @Turbo-Jiaxxx789! It looks like this is your first PR to volcano-sh/agentcube 🎉

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a node stickiness feature for AgentRuntime and CodeInterpreter workloads, allowing subsequent sessions to prefer scheduling on the same node as the previous session. The feedback suggests using a detached context with a timeout for the best-effort workload patching operation to ensure it succeeds even if the client disconnects immediately after sandbox creation.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread pkg/workloadmanager/handlers.go Outdated
Comment on lines +293 to +297
if sandboxEntry.StickyWorkloadName != "" && nodeName != "" {
if err := s.k8sClient.PatchWorkloadLastNode(ctx, sandbox.Namespace, sandboxEntry.StickyWorkloadKind, sandboxEntry.StickyWorkloadName, nodeName); err != nil {
klog.Warningf("failed to patch %s %s/%s last-node=%s: %v", sandboxEntry.StickyWorkloadKind, sandbox.Namespace, sandboxEntry.StickyWorkloadName, nodeName, err)
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using the request context ctx for the best-effort PatchWorkloadLastNode call means that if the client disconnects (canceling ctx) immediately after the sandbox becomes ready, the patch operation will fail. Since the sandbox has already been successfully created and scheduled, we should ensure the scheduled node is recorded regardless of client connection status. Consider using a detached context with a reasonable timeout (e.g., 5 seconds) to make this best-effort write-back more robust.

Suggested change
if sandboxEntry.StickyWorkloadName != "" && nodeName != "" {
if err := s.k8sClient.PatchWorkloadLastNode(ctx, sandbox.Namespace, sandboxEntry.StickyWorkloadKind, sandboxEntry.StickyWorkloadName, nodeName); err != nil {
klog.Warningf("failed to patch %s %s/%s last-node=%s: %v", sandboxEntry.StickyWorkloadKind, sandbox.Namespace, sandboxEntry.StickyWorkloadName, nodeName, err)
}
}
if sandboxEntry.StickyWorkloadName != "" && nodeName != "" {
patchCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
if err := s.k8sClient.PatchWorkloadLastNode(patchCtx, sandbox.Namespace, sandboxEntry.StickyWorkloadKind, sandboxEntry.StickyWorkloadName, nodeName); err != nil {
klog.Warningf("failed to patch %s %s/%s last-node=%s: %v", sandboxEntry.StickyWorkloadKind, sandbox.Namespace, sandboxEntry.StickyWorkloadName, nodeName, err)
}
}

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an opt-in “soft node stickiness” feature to the Workload Manager’s sandbox provisioning flow for AgentRuntime and CodeInterpreter. When enabled via spec.nodeStickiness.enabled, new sessions prefer scheduling onto the same node as the previous session by injecting a PreferredDuringSchedulingIgnoredDuringExecution node-affinity term, and the Workload Manager records the actual scheduled node back onto the owning workload via the runtime.agentcube.io/last-node annotation.

Changes:

  • Add spec.nodeStickiness.enabled (default false) to AgentRuntime and CodeInterpreter CRDs/types and generated deepcopy code.
  • Inject a preferred hostname node-affinity term when stickiness is enabled and runtime.agentcube.io/last-node is present.
  • After sandbox readiness/entrypoint verification, patch the owning workload’s runtime.agentcube.io/last-node annotation (best-effort) for future sessions.

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
pkg/workloadmanager/workload_builder.go Adds stickiness gating, affinity injection helper, and populates sticky workload identifiers for write-back.
pkg/workloadmanager/workload_builder_test.go Adds unit tests for affinity injection and stickiness behavior for AgentRuntime/CodeInterpreter (incl. warm pool behavior).
pkg/workloadmanager/k8s_client.go Extends pod lookup to return node name and adds workload annotation patch helper (last-node).
pkg/workloadmanager/k8s_client_test.go Updates tests for new GetSandboxPodIP return signature.
pkg/workloadmanager/handlers.go Records scheduled node back onto the owning workload after sandbox becomes usable (best-effort).
pkg/workloadmanager/handlers_test.go Refactors patching approach for stability and updates mocks for new pod lookup signature.
pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go Updates generated deepcopy implementations for the new NodeStickiness spec pointers/types.
pkg/apis/runtime/v1alpha1/codeinterpreter_types.go Adds NodeStickiness to CodeInterpreterSpec and defines NodeStickinessSpec.
pkg/apis/runtime/v1alpha1/agent_type.go Adds NodeStickiness to AgentRuntimeSpec.
manifests/charts/base/crds/runtime.agentcube.volcano.sh_codeinterpreters.yaml CRD schema update to expose spec.nodeStickiness.enabled.
manifests/charts/base/crds/runtime.agentcube.volcano.sh_agentruntimes.yaml CRD schema update to expose spec.nodeStickiness.enabled.
Files not reviewed (1)
  • pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go: Generated file

Comment thread pkg/workloadmanager/k8s_client.go Outdated
Comment on lines +64 to +66
// LastNodeAnnotationKey records the node the AgentRuntime's previous session
// landed on, so the next session can prefer the same node (node stickiness).
LastNodeAnnotationKey = "runtime.agentcube.io/last-node"
Comment on lines +332 to +355
func injectPreferredNodeAffinity(podSpec *corev1.PodSpec, nodeName string) {
term := corev1.PreferredSchedulingTerm{
Weight: 100,
Preference: corev1.NodeSelectorTerm{
MatchExpressions: []corev1.NodeSelectorRequirement{
{
Key: "kubernetes.io/hostname",
Operator: corev1.NodeSelectorOpIn,
Values: []string{nodeName},
},
},
},
}
if podSpec.Affinity == nil {
podSpec.Affinity = &corev1.Affinity{}
}
if podSpec.Affinity.NodeAffinity == nil {
podSpec.Affinity.NodeAffinity = &corev1.NodeAffinity{}
}
podSpec.Affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution = append(
podSpec.Affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution,
term,
)
}
Comment on lines +367 to +391
// PatchWorkloadLastNode records the node the latest session landed on as an
// annotation on the owning workload (AgentRuntime or CodeInterpreter), so the
// next session can prefer the same node.
func (c *K8sClient) PatchWorkloadLastNode(ctx context.Context, namespace, kind, name, nodeName string) error {
patch := []byte(fmt.Sprintf(
`{"metadata":{"annotations":{%q:%q}}}`,
LastNodeAnnotationKey, nodeName,
))
var err error
switch kind {
case runtimev1alpha1.AgentRuntimeKind:
_, err = c.cubeClientset.RuntimeV1alpha1().AgentRuntimes(namespace).Patch(
ctx, name, k8stypes.MergePatchType, patch, metav1.PatchOptions{},
)
case runtimev1alpha1.CodeInterpreterKind:
_, err = c.cubeClientset.RuntimeV1alpha1().CodeInterpreters(namespace).Patch(
ctx, name, k8stypes.MergePatchType, patch, metav1.PatchOptions{},
)
default:
return fmt.Errorf("patch last-node: unsupported workload kind %q", kind)
}
if err != nil {
return fmt.Errorf("patch %s %s/%s last-node: %w", kind, namespace, name, err)
}
return nil
@codecov-commenter

codecov-commenter commented Jun 29, 2026

Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 79.80769% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.26%. Comparing base (524e55e) to head (d193a8f).
⚠️ Report is 146 commits behind head on main.

Files with missing lines Patch % Lines
pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go 50.00% 8 Missing ⚠️
pkg/workloadmanager/k8s_client.go 80.00% 5 Missing and 1 partial ⚠️
pkg/workloadmanager/handlers.go 55.55% 4 Missing ⚠️
pkg/workloadmanager/workload_builder.go 93.87% 2 Missing and 1 partial ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@             Coverage Diff             @@
##             main     #412       +/-   ##
===========================================
+ Coverage   47.57%   59.26%   +11.69%     
===========================================
  Files          30       37        +7     
  Lines        2819     3584      +765     
===========================================
+ Hits         1341     2124      +783     
+ Misses       1338     1249       -89     
- Partials      140      211       +71     
Flag Coverage Δ
unittests 59.26% <79.80%> (+11.69%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Turbo-Jiaxxx789 Turbo-Jiaxxx789 force-pushed the feat/node-affinity-stickiness branch 2 times, most recently from 29007a7 to b148009 Compare June 29, 2026 03:14
Copilot AI review requested due to automatic review settings June 29, 2026 03:14

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 13 changed files in this pull request and generated 1 comment.

Files not reviewed (1)
  • pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go: Generated file

Comment on lines +338 to +351
if podSpec.Affinity != nil && podSpec.Affinity.NodeAffinity != nil {
for _, existing := range podSpec.Affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution {
for _, expr := range existing.Preference.MatchExpressions {
if expr.Key != corev1.LabelHostname || expr.Operator != corev1.NodeSelectorOpIn {
continue
}
for _, v := range expr.Values {
if v == nodeName {
return
}
}
}
}
}
…rpreter

When enabled, a new session prefers (soft / PreferredDuringScheduling) the
node the previous session for the same workload landed on. The scheduled node
is recorded back onto the workload via a last-node annotation after the pod is
ready. Gated by a CRD-level nodeStickiness.enabled switch that defaults to off,
so the feature is fully inert unless explicitly turned on. Has no effect for
CodeInterpreter warm pools, since those sandboxes are pre-scheduled.

Signed-off-by: Turbo-Jiaxxx789 <222392535+Turbo-Jiaxxx789@users.noreply.github.com>
@Turbo-Jiaxxx789 Turbo-Jiaxxx789 force-pushed the feat/node-affinity-stickiness branch from b148009 to d193a8f Compare June 29, 2026 03:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants