
WIP - feat: Downstream llm-d #22

Open

albertoperdomo2 wants to merge 4 commits into openshift-psap:main from albertoperdomo2:feat/downstream-llm-d

Conversation

@albertoperdomo2 (Collaborator) commented Apr 13, 2026

Summary by CodeRabbit

Release Notes

  • New Features
    • Added LLM inference service deployment and management for large language models
    • Introduced GPU operator and cluster policy configuration for GPU-accelerated inference
    • Added automated smoke testing and benchmarking capabilities for deployed models
    • Configured support for multiple LLM models (Qwen and Llama variants)
    • Enhanced platform configuration for OpenShift deployment environments

@openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Apr 13, 2026
@coderabbitai bot commented Apr 13, 2026

Warning

Rate limit exceeded

@albertoperdomo2 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 3 minutes and 55 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 3 minutes and 55 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a0fb8948-a1e1-4946-97bc-53c2081827f1

📥 Commits

Reviewing files that changed from the base of the PR and between 7d85dc5 and 9298b94.

📒 Files selected for processing (27)
  • projects/core/library/config.py
  • projects/llm_d/README.md
  • projects/llm_d/orchestration/ci.py
  • projects/llm_d/orchestration/cli.py
  • projects/llm_d/orchestration/config.d/model_cache.yaml
  • projects/llm_d/orchestration/config.d/models.yaml
  • projects/llm_d/orchestration/config.d/platform.yaml
  • projects/llm_d/orchestration/config.d/runtime.yaml
  • projects/llm_d/orchestration/config.d/workloads.yaml
  • projects/llm_d/orchestration/config.yaml
  • projects/llm_d/orchestration/llmd_runtime.py
  • projects/llm_d/orchestration/manifests/datasciencecluster.yaml
  • projects/llm_d/orchestration/manifests/epp-approximate-prefix-cache.yaml
  • projects/llm_d/orchestration/manifests/gateway.yaml
  • projects/llm_d/orchestration/manifests/gpu-clusterpolicy.yaml
  • projects/llm_d/orchestration/manifests/llminferenceservice.yaml
  • projects/llm_d/orchestration/manifests/nfd-nodefeaturediscovery.yaml
  • projects/llm_d/orchestration/prepare_llmd.py
  • projects/llm_d/orchestration/presets.d/cks.yaml
  • projects/llm_d/orchestration/presets.d/presets.yaml
  • projects/llm_d/orchestration/test_llmd.py
  • projects/llm_d/toolbox/capture_isvc_state/main.py
  • projects/llm_d/toolbox/cleanup/main.py
  • projects/llm_d/toolbox/prepare/main.py
  • projects/llm_d/toolbox/prepare_model_cache/main.py
  • projects/llm_d/toolbox/test/main.py
  • tests/llm_d/test_runtime.py
📝 Walkthrough

Walkthrough

Added comprehensive Kubernetes orchestration infrastructure for LLM inference deployment on OpenShift, including declarative manifests for cluster components, configuration files defining models and workloads, and runtime orchestration logic implementing prepare/test/cleanup phases with operator installation, resource provisioning, and endpoint validation workflows.

Changes

Cohort / File(s) — Summary

  • Kubernetes Manifests
    config/llm_d/manifests/datasciencecluster.yaml, config/llm_d/manifests/epp-approximate-prefix-cache.yaml, config/llm_d/manifests/gateway.yaml, config/llm_d/manifests/gpu-clusterpolicy.yaml, config/llm_d/manifests/llminferenceservice.yaml, config/llm_d/manifests/nfd-nodefeaturediscovery.yaml
    Declares six Kubernetes custom resources: a DataScienceCluster with component lifecycle management, an EndpointPickerConfig for scheduling plugins, a Gateway for traffic routing, an NVIDIA ClusterPolicy for the GPU operator, a KServe LLMInferenceService with probes/resources/metrics, and a NodeFeatureDiscovery instance.
  • Configuration Files
    config/llm_d/models.yaml, config/llm_d/platform.yaml, config/llm_d/presets.yaml, config/llm_d/workloads.yaml
    Defines deployment configuration: model URIs and resource requirements, platform-wide constraints and operator specifications, test presets (smoke/benchmark) with aliases, and workload templates for smoke requests and GuideLLM benchmark jobs.
  • Runtime Orchestration
    projects/llm_d/orchestration/llmd_runtime.py
    Implements a 695-line core runtime module providing configuration resolution, Kubernetes/OpenShift command wrappers, manifest rendering helpers for inference/gateway/datasciencecluster/benchmark resources, operator subscription management, and conditional polling utilities for resource readiness checks.
  • Prepare Phase
    projects/llm_d/orchestration/prepare_llmd.py
    Expands from a stub to a full 422-line orchestration flow: verifies cluster access/version, installs cert-manager/leader-worker-set/nfd/gpu-operator/rhods, applies the DataScienceCluster and gateway, waits for readiness, verifies GPU node presence, and captures cluster state artifacts.
  • Test Phase
    projects/llm_d/orchestration/test_llmd.py
    Implements 475 lines of test orchestration: deploys the inference service, waits for pod readiness, resolves the gateway endpoint URL, executes a smoke request with retry/backoff, optionally runs a GuideLLM benchmark with PVC/Job manifests and result copying, and captures final cluster state.
  • CLI Refactoring
    projects/llm_d/orchestration/ci.py, projects/llm_d/orchestration/cli.py
    Removes the context-based Click pattern, adds FORGE_HOME path setup, renames command functions with explicit names (prepare/test/pre-cleanup/post-cleanup), adds return type annotations, and changes exit handling to direct SystemExit wrapping.
  • Test Suite
    tests/llm_d/test_runtime.py
    Adds a 208-line test module covering namespace derivation, override parsing, config loading with preset resolution, manifest rendering for inference/benchmark resources, GPU operator subscription behavior, and endpoint resolution error handling.
  • Documentation and Toolbox
    projects/llm_d/README.md, projects/llm_d/toolbox/capture_isvc_state/main.py
    Replaces the generic template README with an implementation-scoped description; applies formatting standardization to toolbox imports, string literals, and function signatures without behavioral changes.
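The "conditional polling utilities for resource readiness checks" called out for llmd_runtime.py can be sketched roughly as follows. This is a hypothetical illustration, not the module's actual code: the name `wait_until` matches the review comments further down, but the signature and defaults are assumptions. The except clauses reflect the reviewer's suggestion that terminal errors (here, RuntimeError) should abort the wait instead of being retried:

```python
import time


def wait_until(predicate, *, timeout_seconds=600, interval_seconds=10,
               description="condition"):
    """Poll `predicate` until it returns truthy or the timeout elapses.

    RuntimeError is treated as terminal and re-raised immediately; any
    other exception is recorded as transient and the poll continues.
    """
    deadline = time.monotonic() + timeout_seconds
    last_error = None
    while time.monotonic() < deadline:
        try:
            if predicate():
                return
        except RuntimeError:
            raise  # terminal: the predicate decided this state cannot recover
        except Exception as exc:
            last_error = exc  # transient: remember it and keep polling
        time.sleep(interval_seconds)
    raise TimeoutError(f"timed out waiting for {description}: {last_error!r}")
```

A caller would pass a closure such as `lambda: resource_is_ready("DataScienceCluster", name)` and rely on the timeout to bound the overall wait.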

Sequence Diagram(s)

sequenceDiagram
    participant Client as Test Client
    participant Runtime as Orchestration<br/>Runtime
    participant K8s as Kubernetes<br/>API
    participant Ops as Operators<br/>(cert-manager,<br/>gpu-op, rhods)
    participant ISvc as Inference<br/>Service
    participant GW as Gateway

    Client->>Runtime: init()
    Runtime->>Runtime: load config (platform, models,<br/>workloads, presets)
    
    rect rgb(100, 150, 200, 0.5)
    Note over Client,GW: Prepare Phase
    Runtime->>K8s: verify cluster access & version
    Runtime->>Ops: subscribe to operators
    Ops->>K8s: install cert-manager, nfd, gpu-op, rhods
    K8s->>Ops: operator running
    Runtime->>K8s: apply DataScienceCluster manifest
    K8s->>K8s: reconcile DSC components
    Runtime->>K8s: wait for DSC ready
    Runtime->>K8s: apply Gateway manifest
    K8s->>GW: gateway provisioned
    Runtime->>K8s: wait for gateway ready
    Runtime->>K8s: verify GPU nodes & NFD labels
    end

    rect rgb(150, 200, 100, 0.5)
    Note over Client,GW: Test Phase
    Runtime->>K8s: render & apply<br/>LLMInferenceService
    K8s->>ISvc: create inference pod
    Runtime->>K8s: wait for ISvc pod ready
    Runtime->>GW: resolve endpoint URL
    GW->>ISvc: route traffic
    Client->>ISvc: POST smoke request<br/>(with retry/backoff)
    ISvc->>ISvc: inference
    ISvc-->>Client: response
    Client->>Runtime: capture state
    end

    rect rgb(200, 150, 100, 0.5)
    Note over Client,GW: Benchmark Phase (optional)
    Runtime->>K8s: render & apply PVC,<br/>benchmark Job
    K8s->>ISvc: Job runs GuideLLM<br/>against endpoint
    ISvc->>ISvc: serve requests
    K8s->>K8s: benchmark pod completes
    Runtime->>K8s: copy /results to artifacts
    Client->>Runtime: cleanup resources
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~70 minutes

Possibly related PRs

Poem

🐰 Hops through K8s manifests with glee,
Operators spinning, LLMs set free!
Gateways and endpoints, benchmarks run fast,
Orchestration magic—this test won't be the last! 🚀

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 12.93%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Title check — ❓ Inconclusive: the title is vague and generic, using 'WIP' (Work In Progress) and 'feat: Downstream llm-d', which lacks specific descriptive information about the actual changes. Resolution: replace it with a more descriptive title that clearly summarizes the main change, such as 'Add llm-d Forge project with configuration manifests and orchestration workflows'.

✅ Passed checks (1 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@openshift-ci bot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Apr 13, 2026
@openshift-ci bot commented Apr 13, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mml-coder for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

  • Approvers can indicate their approval by writing /approve in a comment
  • Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai bot left a comment

Actionable comments posted: 14

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
projects/llm_d/toolbox/capture_isvc_state/main.py (1)

285-293: ⚠️ Potential issue | 🔴 Critical

Replace undefined capture_isvc_state with run function on lines 285 and 292.

Lines 285 and 292 reference capture_isvc_state, which is not defined or imported in this module. The function should call run instead, which is the actual function defined in this module with the matching signature.

🔧 Proposed fix

```diff
-    parser = create_dynamic_parser(capture_isvc_state, positional_args=["llmisvc_name"])
+    parser = create_dynamic_parser(run, positional_args=["llmisvc_name"])
     args = parser.parse_args()

     # Convert args to kwargs for function call
     kwargs = vars(args)
     env.init(daily_artifact_dir=True)
     try:
-        capture_isvc_state(**kwargs)
+        run(**kwargs)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/llm_d/toolbox/capture_isvc_state/main.py` around lines 285 - 293,
The code calls an undefined symbol capture_isvc_state when building the CLI and
invoking the function; replace both references to capture_isvc_state with the
actual function name run (used by create_dynamic_parser and the invocation), so
that parser = create_dynamic_parser(run, positional_args=["llmisvc_name"]) and
the try block calls run(**kwargs) after env.init(daily_artifact_dir=True),
leaving args parsing and kwargs conversion unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@config/llm_d/manifests/gateway.yaml`:
- Around line 12-14: The Gateway's allowedRoutes.namespaces is set to "from:
All", which permits any namespace to bind routes to this listener; change the
setting in the Gateway manifest (allowedRoutes -> namespaces -> from) from "All"
to "Same" to enforce namespace isolation for this listener (unless you have a
documented, explicit need for cross-namespace routing, in which case add
justification and tighten other Gateway controls).

In `@config/llm_d/manifests/llminferenceservice.yaml`:
- Around line 78-96: The current extreme failureThresholds on livenessProbe and
readinessProbe (livenessProbe.failureThreshold and
readinessProbe.failureThreshold) will hide real outages—replace the long-tailing
behavior by adding a startupProbe for the long model cold-start and revert
livenessProbe and readinessProbe to responsive values (e.g., small
failureThreshold like 3 and shorter periodSeconds/timeouts) so restarts and
traffic removal happen quickly; specifically, introduce a startupProbe that
targets the same /health endpoint on port 8000 with long
initialDelaySeconds/periodSeconds suitable for model load, and update
livenessProbe and readinessProbe in the manifest to remove the huge
failureThresholds and long delays so they remain sensitive once the startupProbe
succeeds.

In `@config/llm_d/models.yaml`:
- Around line 16-25: The llama model entry "llama-3-1-8b-instruct-fp8" is
missing resources.limits.cpu and resources.limits.memory so llmd_runtime.py's
replacement of the entire resources block results in no CPU/memory limits being
applied; update the model stanza to include limits.cpu and limits.memory (e.g.,
set limits.cpu: "4" and limits.memory: 8Gi to match the requests) while keeping
the existing nvidia.com/gpu limits, so resources.requests and resources.limits
are fully specified for "llama-3-1-8b-instruct-fp8".

In `@projects/llm_d/orchestration/llmd_runtime.py`:
- Around line 326-341: The wait loop in wait_until is swallowing terminal
RuntimeError from predicates like wait_for_datasciencecluster_ready; change the
except block inside wait_until so that after catching Exception as exc it
immediately re-raises if isinstance(exc, RuntimeError) (or any other terminal
exception class you choose), otherwise treat it as a transient error (log and
continue as now). Update the except clause in wait_until (the try/except
surrounding predicate()) to raise exc when it's a RuntimeError, and only set
last_error/log and sleep for non-terminal exceptions.
- Around line 654-666: The init container "permission-fixer" currently requests
root via securityContext keys runAsUser: 0 and allowPrivilegeEscalation: True
which will be rejected by OpenShift's restricted SCC; update the pod spec
generation in llmd_runtime.py to avoid root: either (A) remove runAsUser: 0 and
allowPrivilegeEscalation and instead set a pod-level securityContext.fsGroup
(e.g., fsGroup: 1001) so the container can chmod/chown mounted PVCs without
running as root, ensure the init container runs as a non-root uid (e.g., 1001)
and keep the command unchanged, or (B) if root is unavoidable, set an explicit
serviceAccountName and document that the service account must be bound to an
appropriate SCC — reference the "permission-fixer" init container, the runAsUser
and allowPrivilegeEscalation fields, the pod securityContext.fsGroup option, and
serviceAccountName when making the patch.
- Around line 230-259: The run_command wrapper can hang indefinitely; add a
timeout parameter (e.g., timeout_seconds: float | None = 300) to run_command and
pass it into subprocess.run via the timeout argument, updating the function
signature and logging as needed; then update the oc(...) wrapper to accept the
same timeout_seconds parameter and forward it to run_command so callers can
override the default for long-running ops (reference functions: run_command and
oc).
- Around line 281-301: oc_get_json currently treats any non-zero oc exit as "not
found" when ignore_not_found=True; change oc_get_json (and callers like
resource_exists) to inspect result.stderr on failure and only return None when
stderr indicates a true NotFound (e.g., contains "Error from server (NotFound)"
or messages like "No resources found" / "not found"); for any other stderr
(RBAC/Forbidden, "no matches for kind", API errors, etc.) raise an exception (or
re-raise a CalledProcessError) so callers fail fast instead of assuming absence.
Ensure you reference and update oc_get_json's result handling logic and adjust
resource_exists to propagate the error rather than returning False for
non-NotFound oc errors.
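The oc_get_json prompt above amounts to classifying a failed `oc get` by its stderr content. A minimal sketch of that classification, with the marker strings taken from the comment itself (`classify_oc_failure` is a hypothetical helper name, and the real code may need additional markers):

```python
NOT_FOUND_MARKERS = (
    "Error from server (NotFound)",
    "No resources found",
    "not found",
)


def classify_oc_failure(returncode: int, stderr: str) -> str:
    """Map an `oc get` result to 'ok', 'absent', or 'error'.

    Only a genuine NotFound should let ignore_not_found return None;
    RBAC/Forbidden, unknown kinds, and other API errors must fail fast.
    """
    if returncode == 0:
        return "ok"
    if any(marker in stderr for marker in NOT_FOUND_MARKERS):
        return "absent"
    return "error"
```

oc_get_json would return None only for "absent" and raise for "error", so resource_exists stops conflating RBAC failures with missing resources.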

In `@projects/llm_d/orchestration/prepare_llmd.py`:
- Around line 146-167: The NodeFeatureDiscovery bootstrap currently always
reapplies the manifest; change prepare_llmd.py to first check for the existing
NFD CR using llmd_runtime.resource_exists with the same name/namespace
(manifest["metadata"]["name"], manifest["metadata"]["namespace"]) and, if it
exists, skip llmd_runtime.apply_manifest and only call llmd_runtime.wait_until/
wait_for_nfd_gpu_labels; otherwise proceed to load_manifest_template,
apply_manifest and then wait_until and wait_for_nfd_gpu_labels as now. Ensure
the predicate and timeout usage remain identical to the existing wait_until
call.
- Around line 57-83: The deletes currently fire and return immediately; modify
the teardown after the llmd_runtime.oc delete calls to poll until the objects
are actually gone: repeatedly call llmd_runtime.oc("get", ...) for the
LLMInferenceService (inference_service_name), the Job and PVC (benchmark_name),
and the pod f"{benchmark_name}-copy" in config.namespace, sleeping briefly
between attempts and enforcing an overall timeout; stop polling when each get
indicates the resource is not found (or returns non-zero), and log/raise an
error if the timeout elapses to avoid proceeding while resources are still
terminating.
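The delete-then-poll pattern described in the last prompt can be sketched like this. `wait_for_deletion` and the injectable `get_resource` callable are illustrative assumptions; in the real code `get_resource` would wrap `llmd_runtime.oc("get", ...)` and return True while the object still exists:

```python
import time


def wait_for_deletion(kinds_and_names, *, get_resource,
                      timeout_seconds=120, interval_seconds=2):
    """Block until every listed resource is gone, or raise on timeout.

    `get_resource(kind, name)` is a stand-in for an `oc get` probe; it
    should return True while the object still exists.
    """
    deadline = time.monotonic() + timeout_seconds
    pending = list(kinds_and_names)
    while pending:
        # keep only the resources that still exist
        pending = [(kind, name) for kind, name in pending
                   if get_resource(kind, name)]
        if not pending:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"resources still terminating: {pending}")
        time.sleep(interval_seconds)
```

The teardown would call this after the `oc delete` commands, listing the LLMInferenceService, the benchmark Job/PVC, and the copy pod.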

In `@projects/llm_d/orchestration/test_llmd.py`:
- Around line 318-333: The test currently ignores failures when running
llmd_runtime.oc to cat /results/benchmarks.json; update the block that calls
llmd_runtime.oc (the variable result from llmd_runtime.oc and the subsequent
conditional that calls llmd_runtime.write_text) to treat a non-zero returncode
or empty stdout as a test failure: if result.returncode != 0 or not
result.stdout raise an exception or use the test framework's fail/assert
mechanism with a clear message including result.stderr/output; only call
llmd_runtime.write_text to write the artifact when stdout is present and the
command succeeded.
- Around line 41-61: Only perform the benchmark resource deletes when a
benchmark was actually created: wrap the calls to llmd_runtime.oc that delete
"job,pvc" and the pod ("{benchmark_name}-copy") in a conditional that checks
config.benchmark (truthy) before computing benchmark_name and invoking
llmd_runtime.oc; reference the existing benchmark_name variable and the
llmd_runtime.oc calls, and do not run these deletes when config.benchmark is
None (optionally also skip if namespace_is_managed is False, if that flag
semantics require avoiding cross-namespace deletes).
- Around line 62-74: The test currently always captures namespace events; guard
this behavior by checking the config flag artifacts.capture_namespace_events
before invoking llmd_runtime.oc and writing the file: only run the existing
block that calls llmd_runtime.oc(...), checks events.returncode and writes to
artifacts_dir / "namespace.events.txt" when the configuration flag
artifacts.capture_namespace_events (from config/llm_d/platform.yaml) is true;
locate the block that defines events and uses llmd_runtime.write_text and wrap
it with a conditional that reads the runtime/config object’s
artifacts.capture_namespace_events flag.
- Around line 193-211: The oc exec call that runs curl (the llmd_runtime.oc
invocation producing result) must be bounded per attempt so a hung curl doesn't
stall retries: add a per-attempt timeout (either by passing a timeout argument
to llmd_runtime.oc if it supports one, or by adding curl's --max-time flag in
the command) and catch the timeout/TimeoutExpired outcome around the
llmd_runtime.oc call so the retry loop can move to the next attempt when the
timeout fires; update handling of result/timeouts in the surrounding retry loop
accordingly.
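The bounded retry/backoff smoke request from the last prompt can be sketched as follows. Everything here is a hypothetical illustration: `smoke_request`, the injectable `runner` parameter, and the defaults are not the project's actual code, which shells out through `llmd_runtime.oc("exec", ...)`. The point is the combination of curl's `--max-time` (bounding the request itself) with a subprocess timeout (bounding the exec wrapper):

```python
import subprocess
import time


def smoke_request(url, *, attempts=5, per_attempt_seconds=30,
                  backoff_seconds=2, runner=subprocess.run):
    """POST a smoke request with retry/backoff and a bounded per-attempt time."""
    cmd = ["curl", "--max-time", str(per_attempt_seconds), "-sf",
           "-X", "POST", url]
    for attempt in range(1, attempts + 1):
        try:
            # belt and braces: also bound the wrapper around curl
            result = runner(cmd, capture_output=True, text=True,
                            timeout=per_attempt_seconds + 5)
        except subprocess.TimeoutExpired:
            result = None  # treat a hung exec like a failed attempt
        if result is not None and result.returncode == 0:
            return result.stdout
        if attempt < attempts:
            time.sleep(backoff_seconds * attempt)  # linear backoff
    raise RuntimeError(f"smoke request to {url} failed after {attempts} attempts")
```

With this shape, a hung connection costs at most one attempt instead of stalling the whole retry loop.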

In `@projects/llm_d/README.md`:
- Around line 13-16: The README contains machine-local absolute links to files
(ci.py, prepare_llmd.py, test_llmd.py, llmd_runtime.py); replace each absolute
path with a repo-relative link (e.g., orchestration/ci.py or
./orchestration/ci.py) so the links work on GitHub and other environments,
updating the four entries that mention ci.py, prepare_llmd.py, test_llmd.py, and
llmd_runtime.py accordingly.

---

Outside diff comments:
In `@projects/llm_d/toolbox/capture_isvc_state/main.py`:
- Around line 285-293: The code calls an undefined symbol capture_isvc_state
when building the CLI and invoking the function; replace both references to
capture_isvc_state with the actual function name run (used by
create_dynamic_parser and the invocation), so that parser =
create_dynamic_parser(run, positional_args=["llmisvc_name"]) and the try block
calls run(**kwargs) after env.init(daily_artifact_dir=True), leaving args
parsing and kwargs conversion unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a6025d00-1aab-4bc5-9011-d6cda857033c

📥 Commits

Reviewing files that changed from the base of the PR and between 24d4584 and 7d85dc5.

📒 Files selected for processing (18)
  • config/llm_d/manifests/datasciencecluster.yaml
  • config/llm_d/manifests/epp-approximate-prefix-cache.yaml
  • config/llm_d/manifests/gateway.yaml
  • config/llm_d/manifests/gpu-clusterpolicy.yaml
  • config/llm_d/manifests/llminferenceservice.yaml
  • config/llm_d/manifests/nfd-nodefeaturediscovery.yaml
  • config/llm_d/models.yaml
  • config/llm_d/platform.yaml
  • config/llm_d/presets.yaml
  • config/llm_d/workloads.yaml
  • projects/llm_d/README.md
  • projects/llm_d/orchestration/ci.py
  • projects/llm_d/orchestration/cli.py
  • projects/llm_d/orchestration/llmd_runtime.py
  • projects/llm_d/orchestration/prepare_llmd.py
  • projects/llm_d/orchestration/test_llmd.py
  • projects/llm_d/toolbox/capture_isvc_state/main.py
  • tests/llm_d/test_runtime.py

Comment on lines +12 to +14

```yaml
      allowedRoutes:
        namespaces:
          from: All
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```shell
find . -path "*/config/llm_d/manifests/gateway.yaml" -type f
```

Repository: openshift-psap/forge
Length of output: 101

🏁 Script executed:

```shell
cat -n config/llm_d/manifests/gateway.yaml 2>/dev/null || find . -name "gateway.yaml" -type f | head -20
```

Repository: openshift-psap/forge
Length of output: 468

🌐 Web query: Kubernetes Gateway API allowedRoutes namespaces from All vs Same security

💡 Result:

In the Kubernetes Gateway API, the allowedRoutes.namespaces.from field on a Gateway listener controls which namespaces' Routes can attach to that listener. The options differ significantly in their security posture:

  • Same (the default when allowedRoutes is unspecified) — only Routes in the Gateway's own namespace can attach. This enforces namespace isolation and least privilege, and fits single-team or high-security environments.
  • All — Routes from any namespace in the cluster can attach. This is highly permissive and increases the risk of conflicts or unauthorized exposure; use it only in fully trusted clusters, paired with safeguards such as RBAC, NetworkPolicies, or admission controllers.
  • Selector — Routes from namespaces matching a label selector can attach; a balanced option recommended for shared Gateways in multi-team clusters.
  • None — blocks all Route attachment (for Gateway-only setups).

Note that this Route-to-Gateway handshake is separate from ReferenceGrant, which governs cross-namespace Route-to-Service/Secret references.

Restrict route attachment scope on this Gateway.

from: All allows any namespace in the cluster to bind routes to this listener. Since the Gateway is shared in openshift-ingress, this creates a cross-namespace exposure risk. Use from: Same (the namespace-isolation default) unless multi-namespace access is explicitly required.

🔒 Proposed safer default

```diff
       allowedRoutes:
         namespaces:
-          from: All
+          from: Same
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```yaml
      allowedRoutes:
        namespaces:
          from: Same
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/llm_d/manifests/gateway.yaml` around lines 12 - 14, The Gateway's
allowedRoutes.namespaces is set to "from: All", which permits any namespace to
bind routes to this listener; change the setting in the Gateway manifest
(allowedRoutes -> namespaces -> from) from "All" to "Same" to enforce namespace
isolation for this listener (unless you have a documented, explicit need for
cross-namespace routing, in which case add justification and tighten other
Gateway controls).

Comment on lines +78 to +96

```yaml
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
            scheme: HTTPS
          initialDelaySeconds: 900
          periodSeconds: 60
          timeoutSeconds: 60
          failureThreshold: 1000
        readinessProbe:
          failureThreshold: 10000
          httpGet:
            path: /health
            port: 8000
            scheme: HTTPS
          initialDelaySeconds: 60
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 30
```

⚠️ Potential issue | 🟠 Major

These probe thresholds will mask real outages.

readinessProbe.failureThreshold: 10000 with a 30s period means the pod can keep receiving traffic for ~83 hours after /health starts failing. livenessProbe.failureThreshold: 1000 with a 60s period defers restarts for ~17 hours. If the goal is to tolerate long model startup, use a startupProbe and keep readiness/liveness responsive.

Suggested direction
       - name: main
+        startupProbe:
+          httpGet:
+            path: /health
+            port: 8000
+            scheme: HTTPS
+          periodSeconds: 30
+          timeoutSeconds: 30
+          failureThreshold: 60
         livenessProbe:
           httpGet:
             path: /health
             port: 8000
             scheme: HTTPS
-          initialDelaySeconds: 900
-          periodSeconds: 60
-          timeoutSeconds: 60
-          failureThreshold: 1000
+          periodSeconds: 30
+          timeoutSeconds: 10
+          failureThreshold: 3
         readinessProbe:
-          failureThreshold: 10000
           httpGet:
             path: /health
             port: 8000
             scheme: HTTPS
-          initialDelaySeconds: 60
-          periodSeconds: 30
+          periodSeconds: 10
           successThreshold: 1
-          timeoutSeconds: 30
+          timeoutSeconds: 10
+          failureThreshold: 3
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/llm_d/manifests/llminferenceservice.yaml` around lines 78 - 96, The
current extreme failureThresholds on livenessProbe and readinessProbe
(livenessProbe.failureThreshold and readinessProbe.failureThreshold) will hide
real outages—replace the long-tailing behavior by adding a startupProbe for the
long model cold-start and revert livenessProbe and readinessProbe to responsive
values (e.g., small failureThreshold like 3 and shorter periodSeconds/timeouts)
so restarts and traffic removal happen quickly; specifically, introduce a
startupProbe that targets the same /health endpoint on port 8000 with long
initialDelaySeconds/periodSeconds suitable for model load, and update
livenessProbe and readinessProbe in the manifest to remove the huge
failureThresholds and long delays so they remain sensitive once the startupProbe
succeeds.

Comment thread: config/llm_d/models.yaml (Outdated)

Comment on lines +16 to +25

```yaml
  llama-3-1-8b-instruct-fp8:
    served_model_name: llama-3-1-8b-instruct-fp8
    uri: oci://registry.redhat.io/rhelai1/modelcar-llama-3-1-8b-instruct-fp8-dynamic:1.5
    resources:
      requests:
        cpu: "4"
        memory: 8Gi
        nvidia.com/gpu: "1"
      limits:
        nvidia.com/gpu: "1"
```

⚠️ Potential issue | 🟠 Major

Add CPU and memory limits for the llama model.

projects/llm_d/orchestration/llmd_runtime.py replaces the whole resources block when rendering the inference service, so omitting limits.cpu and limits.memory here does not inherit the template defaults. The llama deployment will run without CPU/memory limits while the qwen deployment is constrained.

Suggested fix
   llama-3-1-8b-instruct-fp8:
     served_model_name: llama-3-1-8b-instruct-fp8
     uri: oci://registry.redhat.io/rhelai1/modelcar-llama-3-1-8b-instruct-fp8-dynamic:1.5
     resources:
       requests:
         cpu: "4"
         memory: 8Gi
         nvidia.com/gpu: "1"
       limits:
+        cpu: "4"
+        memory: 8Gi
         nvidia.com/gpu: "1"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (lines 16-25):

```yaml
  llama-3-1-8b-instruct-fp8:
    served_model_name: llama-3-1-8b-instruct-fp8
    uri: oci://registry.redhat.io/rhelai1/modelcar-llama-3-1-8b-instruct-fp8-dynamic:1.5
    resources:
      requests:
        cpu: "4"
        memory: 8Gi
        nvidia.com/gpu: "1"
      limits:
        cpu: "4"
        memory: 8Gi
        nvidia.com/gpu: "1"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/llm_d/models.yaml` around lines 16 - 25, The llama model entry
"llama-3-1-8b-instruct-fp8" is missing resources.limits.cpu and
resources.limits.memory so llmd_runtime.py's replacement of the entire resources
block results in no CPU/memory limits being applied; update the model stanza to
include limits.cpu and limits.memory (e.g., set limits.cpu: "4" and
limits.memory: 8Gi to match the requests) while keeping the existing
nvidia.com/gpu limits, so resources.requests and resources.limits are fully
specified for "llama-3-1-8b-instruct-fp8".

Comment on lines +230 to +259
def run_command(
    args: Iterable[str],
    *,
    check: bool = True,
    capture_output: bool = True,
    input_text: str | None = None,
) -> subprocess.CompletedProcess[str]:
    cmd = [str(arg) for arg in args]
    LOGGER.info("run: %s", " ".join(shlex.quote(arg) for arg in cmd))
    result = subprocess.run(
        cmd,
        check=False,
        text=True,
        capture_output=capture_output,
        input=input_text,
    )

    if capture_output:
        if result.stdout:
            LOGGER.info("stdout:\n%s", result.stdout.rstrip())
        if result.stderr:
            LOGGER.info("stderr:\n%s", result.stderr.rstrip())

    if check and result.returncode != 0:
        raise CommandError(
            f"Command failed with exit code {result.returncode}: "
            f"{' '.join(shlex.quote(arg) for arg in cmd)}"
        )

    return result

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed (verification output omitted):

# Locate and size the target file
find . -name "llmd_runtime.py" -type f
wc -l projects/llm_d/orchestration/llmd_runtime.py

# Read the section under review and the oc() wrapper
sed -n '230,259p' projects/llm_d/orchestration/llmd_runtime.py
sed -n '260,280p' projects/llm_d/orchestration/llmd_runtime.py

# Survey run_command usage and wrapper definitions
rg "run_command\(" --no-heading -A 2 -B 2 projects/llm_d/
rg "run_command\(" --no-heading -B 1 -A 5
rg "run_command\(\[" projects/llm_d/ -A 3
rg "def oc_" projects/llm_d/orchestration/llmd_runtime.py -A 8
rg "def " projects/llm_d/orchestration/llmd_runtime.py | head -20

# Look for existing timeout and exception-handling patterns
rg "timeout" projects/llm_d/orchestration/ -i
rg "timeout_seconds" projects/llm_d/orchestration/ | grep -E "(=|:)" | head -15
rg "subprocess\.run" projects/ -A 3 -B 1
rg "TimeoutExpired" projects/llm_d/orchestration/
rg "except.*Error|except.*Exception" projects/llm_d/orchestration/prepare_llmd.py -A 2 -B 2


Add a timeout to the command wrapper to prevent indefinite hangs.

Every cluster interaction goes through run_command(), so a hung oc call can stall prepare/test/cleanup indefinitely. A default timeout would fail fast instead of hanging until the CI job times out.

The proposed fix is sound, but the oc() wrapper function also needs to be updated to accept and forward the timeout_seconds parameter, so callers can override the default when needed for longer-running operations.

Proposed changes

Update run_command() signature and the oc() wrapper:

 def run_command(
     args: Iterable[str],
     *,
     check: bool = True,
     capture_output: bool = True,
     input_text: str | None = None,
+    timeout_seconds: int = 300,
 ) -> subprocess.CompletedProcess[str]:
     cmd = [str(arg) for arg in args]
     LOGGER.info("run: %s", " ".join(shlex.quote(arg) for arg in cmd))
     result = subprocess.run(
         cmd,
         check=False,
         text=True,
         capture_output=capture_output,
         input=input_text,
+        timeout=timeout_seconds,
     )

Also update the oc() wrapper to accept and pass through timeout_seconds:

 def oc(
     *args: str,
     check: bool = True,
     capture_output: bool = True,
     input_text: str | None = None,
+    timeout_seconds: int = 300,
 ) -> subprocess.CompletedProcess[str]:
     return run_command(
         ["oc", *args],
         check=check,
         capture_output=capture_output,
         input_text=input_text,
+        timeout_seconds=timeout_seconds,
     )
🧰 Tools
🪛 Ruff (0.15.9)

[error] 239-239: subprocess call: check for execution of untrusted input

(S603)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/llm_d/orchestration/llmd_runtime.py` around lines 230 - 259, The
run_command wrapper can hang indefinitely; add a timeout parameter (e.g.,
timeout_seconds: float | None = 300) to run_command and pass it into
subprocess.run via the timeout argument, updating the function signature and
logging as needed; then update the oc(...) wrapper to accept the same
timeout_seconds parameter and forward it to run_command so callers can override
the default for long-running ops (reference functions: run_command and oc).

Comment on lines +281 to +301
def oc_get_json(
    kind: str,
    *,
    name: str | None = None,
    namespace: str | None = None,
    selector: str | None = None,
    ignore_not_found: bool = False,
) -> dict[str, Any] | None:
    args = ["get", kind]
    if name:
        args.append(name)
    if namespace:
        args.extend(["-n", namespace])
    if selector:
        args.extend(["-l", selector])
    args.extend(["-o", "json"])

    result = oc(*args, check=not ignore_not_found, capture_output=True)
    if ignore_not_found and result.returncode != 0:
        return None
    return json.loads(result.stdout)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Scripts executed (verification output omitted):

# Read run_command(), oc(), and oc_get_json()
head -n 320 projects/llm_d/orchestration/llmd_runtime.py | tail -n 50
grep -n "^def run_command(" projects/llm_d/orchestration/llmd_runtime.py
grep -n "^def oc(" projects/llm_d/orchestration/llmd_runtime.py
sed -n '230,260p' projects/llm_d/orchestration/llmd_runtime.py
sed -n '262,280p' projects/llm_d/orchestration/llmd_runtime.py

# Survey callers and error handling
rg "oc_get_json|resource_exists" projects/llm_d/orchestration/llmd_runtime.py -A 2 -B 2
rg "ignore_not_found=True" projects/llm_d/orchestration/llmd_runtime.py -B 5 -A 3
rg "stderr" projects/llm_d/orchestration/llmd_runtime.py
grep -r "NotFound\|Forbidden\|Error from server" projects/llm_d/
cat projects/llm_d/orchestration/test_llmd.py
rg "payload\s*=" projects/llm_d/orchestration/llmd_runtime.py | head -20

# Common `oc get` error patterns checked against the stderr handling:
#   NotFound:    Error from server (NotFound): <resource> "<name>" not found
#   Forbidden:   Error from server (Forbidden): <resource> is forbidden: User <user> cannot get <resource>
#   ServerError: Error from server (InternalError): Internal error occurred:
#   CRDNotReady: no matches for kind "<Kind>"


Check stderr to distinguish NotFound from other oc errors.

oc_get_json(..., ignore_not_found=True) and resource_exists() currently return None/False for any non-zero oc get exit code. OpenShift's oc command returns distinct error messages in stderr (e.g., "Error from server (NotFound):", "Error from server (Forbidden):", "no matches for kind"), but the code ignores these and treats all failures as resource absence. This masks RBAC denials, API outages, and unregistered CRDs, causing callers to attempt create/apply operations on broken clusters instead of failing fast.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/llm_d/orchestration/llmd_runtime.py` around lines 281 - 301,
oc_get_json currently treats any non-zero oc exit as "not found" when
ignore_not_found=True; change oc_get_json (and callers like resource_exists) to
inspect result.stderr on failure and only return None when stderr indicates a
true NotFound (e.g., contains "Error from server (NotFound)" or messages like
"No resources found" / "not found"); for any other stderr (RBAC/Forbidden, "no
matches for kind", API errors, etc.) raise an exception (or re-raise a
CalledProcessError) so callers fail fast instead of assuming absence. Ensure you
reference and update oc_get_json's result handling logic and adjust
resource_exists to propagate the error rather than returning False for
non-NotFound oc errors.
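
A minimal sketch of the stderr classification (the marker strings are assumptions based on common oc output; verify them against the messages your cluster actually emits):

```python
# Sketch only: classify `oc get` failures so genuine absence is distinguished
# from RBAC denials, API outages, and unregistered CRDs. Marker strings are
# assumptions, not verified against every oc version.
_NOT_FOUND_MARKERS = (
    "error from server (notfound)",
    "no resources found",
    '" not found',
)


def is_not_found(stderr: str) -> bool:
    """Return True only for genuine absence, not RBAC or API failures."""
    lowered = stderr.lower()
    return any(marker in lowered for marker in _NOT_FOUND_MARKERS)


def classify_oc_failure(returncode: int, stderr: str) -> str:
    if returncode == 0:
        return "ok"
    if is_not_found(stderr):
        return "not-found"
    # Forbidden, "no matches for kind", and server errors should surface
    # to the caller instead of being treated as resource absence.
    return "error"
```

With this in place, oc_get_json can return None only for the "not-found" case and raise for "error", so callers fail fast on broken clusters.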

Comment on lines +41 to +61
benchmark_name = (
    config.benchmark["job_name"] if config.benchmark else "guidellm-benchmark"
)
llmd_runtime.oc(
    "delete",
    "job,pvc",
    benchmark_name,
    "-n",
    namespace,
    "--ignore-not-found=true",
    check=False,
)
llmd_runtime.oc(
    "delete",
    "pod",
    f"{benchmark_name}-copy",
    "-n",
    namespace,
    "--ignore-not-found=true",
    check=False,
)

⚠️ Potential issue | 🟠 Major

Only clean up benchmark resources when a benchmark was actually created.

When config.benchmark is None, this still deletes job,pvc guidellm-benchmark and pod guidellm-benchmark-copy in the target namespace. That is risky for smoke-only runs, especially when namespace_is_managed is False and the namespace can contain unrelated resources.

Suggested fix
-        benchmark_name = (
-            config.benchmark["job_name"] if config.benchmark else "guidellm-benchmark"
-        )
-        llmd_runtime.oc(
-            "delete",
-            "job,pvc",
-            benchmark_name,
-            "-n",
-            namespace,
-            "--ignore-not-found=true",
-            check=False,
-        )
-        llmd_runtime.oc(
-            "delete",
-            "pod",
-            f"{benchmark_name}-copy",
-            "-n",
-            namespace,
-            "--ignore-not-found=true",
-            check=False,
-        )
+        if config.benchmark:
+            benchmark_name = config.benchmark["job_name"]
+            llmd_runtime.oc(
+                "delete",
+                "job,pvc",
+                benchmark_name,
+                "-n",
+                namespace,
+                "--ignore-not-found=true",
+                check=False,
+            )
+            llmd_runtime.oc(
+                "delete",
+                "pod",
+                f"{benchmark_name}-copy",
+                "-n",
+                namespace,
+                "--ignore-not-found=true",
+                check=False,
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/llm_d/orchestration/test_llmd.py` around lines 41 - 61, Only perform
the benchmark resource deletes when a benchmark was actually created: wrap the
calls to llmd_runtime.oc that delete "job,pvc" and the pod
("{benchmark_name}-copy") in a conditional that checks config.benchmark (truthy)
before computing benchmark_name and invoking llmd_runtime.oc; reference the
existing benchmark_name variable and the llmd_runtime.oc calls, and do not run
these deletes when config.benchmark is None (optionally also skip if
namespace_is_managed is False, if that flag semantics require avoiding
cross-namespace deletes).

Comment on lines +62 to +74
events = llmd_runtime.oc(
"get",
"events",
"-n",
namespace,
"--sort-by=.metadata.creationTimestamp",
check=False,
capture_output=True,
)
if events.returncode == 0 and events.stdout:
llmd_runtime.write_text(
artifacts_dir / "namespace.events.txt", events.stdout
)

⚠️ Potential issue | 🟡 Minor

Honor artifacts.capture_namespace_events.

config/llm_d/platform.yaml exposes this as a knob, but this block always captures and writes namespace events. That makes the config ineffective and forces extra artifact collection even when disabled.

Suggested fix
-        events = llmd_runtime.oc(
-            "get",
-            "events",
-            "-n",
-            namespace,
-            "--sort-by=.metadata.creationTimestamp",
-            check=False,
-            capture_output=True,
-        )
-        if events.returncode == 0 and events.stdout:
-            llmd_runtime.write_text(
-                artifacts_dir / "namespace.events.txt", events.stdout
-            )
+        if config.platform["artifacts"].get("capture_namespace_events", False):
+            events = llmd_runtime.oc(
+                "get",
+                "events",
+                "-n",
+                namespace,
+                "--sort-by=.metadata.creationTimestamp",
+                check=False,
+                capture_output=True,
+            )
+            if events.returncode == 0 and events.stdout:
+                llmd_runtime.write_text(
+                    artifacts_dir / "namespace.events.txt", events.stdout
+                )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/llm_d/orchestration/test_llmd.py` around lines 62 - 74, The test
currently always captures namespace events; guard this behavior by checking the
config flag artifacts.capture_namespace_events before invoking llmd_runtime.oc
and writing the file: only run the existing block that calls
llmd_runtime.oc(...), checks events.returncode and writes to artifacts_dir /
"namespace.events.txt" when the configuration flag
artifacts.capture_namespace_events (from config/llm_d/platform.yaml) is true;
locate the block that defines events and uses llmd_runtime.write_text and wrap
it with a conditional that reads the runtime/config object’s
artifacts.capture_namespace_events flag.

Comment on lines +193 to +211
result = llmd_runtime.oc(
    "exec",
    "-n",
    namespace,
    f"deployment/{deployment_name}",
    "-c",
    "main",
    "--",
    "curl",
    "-k",
    "-sSf",
    f"{endpoint_url}{config.platform['smoke']['endpoint_path']}",
    "-H",
    "Content-Type: application/json",
    "-d",
    json.dumps(payload),
    check=False,
    capture_output=True,
)

⚠️ Potential issue | 🟠 Major

Bound each smoke-request attempt with a timeout.

The retry loop does not help if curl blocks inside oc exec; one hung request can stall the whole CI run beyond the configured retry budget. Add per-attempt timeouts so failures actually progress to the next retry.

Suggested fix
         result = llmd_runtime.oc(
             "exec",
             "-n",
             namespace,
             f"deployment/{deployment_name}",
             "-c",
             "main",
             "--",
             "curl",
             "-k",
             "-sSf",
+            "--connect-timeout",
+            "10",
+            "--max-time",
+            str(delay),
             f"{endpoint_url}{config.platform['smoke']['endpoint_path']}",
             "-H",
             "Content-Type: application/json",
             "-d",
             json.dumps(payload),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/llm_d/orchestration/test_llmd.py` around lines 193 - 211, The oc
exec call that runs curl (the llmd_runtime.oc invocation producing result) must
be bounded per attempt so a hung curl doesn't stall retries: add a per-attempt
timeout (either by passing a timeout argument to llmd_runtime.oc if it supports
one, or by adding curl's --max-time flag in the command) and catch the
timeout/TimeoutExpired outcome around the llmd_runtime.oc call so the retry loop
can move to the next attempt when the timeout fires; update handling of
result/timeouts in the surrounding retry loop accordingly.
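
A sketch of the per-attempt bounding (helper names such as curl_args and run_with_deadline are hypothetical, not the llmd_runtime API; the timeout values are illustrative):

```python
# Sketch only: bound each smoke attempt twice — curl's own timers cap the
# request, and a subprocess timeout caps the whole `oc exec` in case oc
# itself hangs. Helper names and timeouts are assumptions.
import subprocess
import time


def curl_args(url: str, max_time_seconds: int) -> list[str]:
    # --connect-timeout caps connection setup; --max-time caps the request.
    return [
        "curl", "-k", "-sSf",
        "--connect-timeout", "10",
        "--max-time", str(max_time_seconds),
        url,
    ]


def run_with_deadline(cmd: list[str], deadline_seconds: float):
    # Returns None on timeout so the retry loop can move to the next attempt.
    try:
        return subprocess.run(
            cmd, capture_output=True, text=True, timeout=deadline_seconds
        )
    except subprocess.TimeoutExpired:
        return None


def smoke_with_retries(cmd: list[str], attempts: int = 3,
                       deadline_seconds: float = 30.0) -> bool:
    for _ in range(attempts):
        result = run_with_deadline(cmd, deadline_seconds)
        if result is not None and result.returncode == 0:
            return True
        time.sleep(1)  # brief backoff before the next attempt
    return False
```

The subprocess deadline should be slightly larger than curl's --max-time so curl gets a chance to report its own error before the outer timeout fires.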

Comment on lines +318 to +333
result = llmd_runtime.oc(
    "exec",
    "-n",
    namespace,
    f"{benchmark_name}-copy",
    "--",
    "cat",
    "/results/benchmarks.json",
    check=False,
    capture_output=True,
)
if result.returncode == 0 and result.stdout:
    llmd_runtime.write_text(
        config.artifact_dir / "artifacts" / "results" / "benchmarks.json",
        result.stdout,
    )

⚠️ Potential issue | 🟠 Major

Treat missing benchmark output as a failure.

If oc exec ... cat /results/benchmarks.json fails or returns empty output, the run still succeeds and just skips writing the artifact. That can make benchmark jobs look green while producing no usable result.

Suggested fix
     result = llmd_runtime.oc(
         "exec",
         "-n",
         namespace,
         f"{benchmark_name}-copy",
         "--",
         "cat",
         "/results/benchmarks.json",
         check=False,
         capture_output=True,
     )
-    if result.returncode == 0 and result.stdout:
-        llmd_runtime.write_text(
-            config.artifact_dir / "artifacts" / "results" / "benchmarks.json",
-            result.stdout,
-        )
+    if result.returncode != 0 or not result.stdout:
+        raise RuntimeError(
+            f"Failed to copy GuideLLM results for {benchmark_name} from /results/benchmarks.json"
+        )
+    llmd_runtime.write_text(
+        config.artifact_dir / "artifacts" / "results" / "benchmarks.json",
+        result.stdout,
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/llm_d/orchestration/test_llmd.py` around lines 318 - 333, The test
currently ignores failures when running llmd_runtime.oc to cat
/results/benchmarks.json; update the block that calls llmd_runtime.oc (the
variable result from llmd_runtime.oc and the subsequent conditional that calls
llmd_runtime.write_text) to treat a non-zero returncode or empty stdout as a
test failure: if result.returncode != 0 or not result.stdout raise an exception
or use the test framework's fail/assert mechanism with a clear message including
result.stderr/output; only call llmd_runtime.write_text to write the artifact
when stdout is present and the command succeeded.

Comment thread projects/llm_d/README.md Outdated
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 13, 2026
Comment thread config/llm_d/models.yaml Outdated
@@ -0,0 +1,25 @@
models:

as discussed offline, we'll need a mechanism that supports multiple configuration files

this will have to go to projects/llm_d/orchestration/config.d

Comment thread projects/llm_d/orchestration/ci.py Outdated
Comment on lines 6 to 9
FORGE_HOME = Path(__file__).resolve().parents[3]
if str(FORGE_HOME) not in sys.path:
sys.path.insert(0, str(FORGE_HOME))


env.FORGE_HOME

Comment on lines +20 to +33
verify_oc_access()
verify_cluster_version(config)
prepare_cert_manager(config)
prepare_leader_worker_set(config)
prepare_nfd(config)
prepare_gpu_operator(config)
prepare_rhoai_operator(config)
apply_datasciencecluster(config)
wait_for_datasciencecluster_ready(config)
ensure_required_crds(config.platform["rhoai"]["required_crds_after_dsc"], config)
ensure_gateway(config)
ensure_test_namespace(config)
verify_gpu_nodes(config)
capture_prepare_state(config)

this will have to become toolbox commands

global config is at projects.core.library.config --> config.project.get_config

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 15, 2026
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 16, 2026

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.


2 participants