WIP - feat: Downstream llm-d#22
Conversation
⚠️ Warning: Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 3 minutes and 55 seconds. We recommend that you space out your commits to avoid hitting the rate limit. CodeRabbit enforces hourly rate limits for each developer per organization; paid plans have higher rate limits than the trial, open-source, and free plans, and further reviews are re-allowed after a brief timeout.
📒 Files selected for processing (27)
📝 Walkthrough
Added comprehensive Kubernetes orchestration infrastructure for LLM inference deployment on OpenShift, including declarative manifests for cluster components, configuration files defining models and workloads, and runtime orchestration logic implementing prepare/test/cleanup phases with operator installation, resource provisioning, and endpoint validation workflows.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Test Client
    participant Runtime as Orchestration<br/>Runtime
    participant K8s as Kubernetes<br/>API
    participant Ops as Operators<br/>(cert-manager,<br/>gpu-op, rhods)
    participant ISvc as Inference<br/>Service
    participant GW as Gateway

    Client->>Runtime: init()
    Runtime->>Runtime: load config (platform, models,<br/>workloads, presets)

    rect rgb(100, 150, 200, 0.5)
        Note over Client,GW: Prepare Phase
        Runtime->>K8s: verify cluster access & version
        Runtime->>Ops: subscribe to operators
        Ops->>K8s: install cert-manager, nfd, gpu-op, rhods
        K8s->>Ops: operator running
        Runtime->>K8s: apply DataScienceCluster manifest
        K8s->>K8s: reconcile DSC components
        Runtime->>K8s: wait for DSC ready
        Runtime->>K8s: apply Gateway manifest
        K8s->>GW: gateway provisioned
        Runtime->>K8s: wait for gateway ready
        Runtime->>K8s: verify GPU nodes & NFD labels
    end

    rect rgb(150, 200, 100, 0.5)
        Note over Client,GW: Test Phase
        Runtime->>K8s: render & apply<br/>LLMInferenceService
        K8s->>ISvc: create inference pod
        Runtime->>K8s: wait for ISvc pod ready
        Runtime->>GW: resolve endpoint URL
        GW->>ISvc: route traffic
        Client->>ISvc: POST smoke request<br/>(with retry/backoff)
        ISvc->>ISvc: inference
        ISvc-->>Client: response
        Client->>Runtime: capture state
    end

    rect rgb(200, 150, 100, 0.5)
        Note over Client,GW: Benchmark Phase (optional)
        Runtime->>K8s: render & apply PVC,<br/>benchmark Job
        K8s->>ISvc: Job runs GuideLLM<br/>against endpoint
        ISvc->>ISvc: serve requests
        K8s->>K8s: benchmark pod completes
        Runtime->>K8s: copy /results to artifacts
        Client->>Runtime: cleanup resources
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~70 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (1 warning, 1 inconclusive)
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of these files; the full list of commands accepted by this bot can be found here.
Actionable comments posted: 14
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
projects/llm_d/toolbox/capture_isvc_state/main.py (1)
285-293: ⚠️ Potential issue | 🔴 Critical
Replace undefined `capture_isvc_state` with the `run` function on lines 285 and 292.
Lines 285 and 292 reference `capture_isvc_state`, which is not defined or imported in this module. The function should call `run` instead, which is the actual function defined in this module with the matching signature.
🔧 Proposed fix

```diff
- parser = create_dynamic_parser(capture_isvc_state, positional_args=["llmisvc_name"])
+ parser = create_dynamic_parser(run, positional_args=["llmisvc_name"])
  args = parser.parse_args()

  # Convert args to kwargs for function call
  kwargs = vars(args)

  env.init(daily_artifact_dir=True)

  try:
-     capture_isvc_state(**kwargs)
+     run(**kwargs)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm_d/toolbox/capture_isvc_state/main.py` around lines 285 - 293, The code calls an undefined symbol capture_isvc_state when building the CLI and invoking the function; replace both references to capture_isvc_state with the actual function name run (used by create_dynamic_parser and the invocation), so that parser = create_dynamic_parser(run, positional_args=["llmisvc_name"]) and the try block calls run(**kwargs) after env.init(daily_artifact_dir=True), leaving args parsing and kwargs conversion unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@config/llm_d/manifests/gateway.yaml`:
- Around line 12-14: The Gateway's allowedRoutes.namespaces is set to "from:
All", which permits any namespace to bind routes to this listener; change the
setting in the Gateway manifest (allowedRoutes -> namespaces -> from) from "All"
to "Same" to enforce namespace isolation for this listener (unless you have a
documented, explicit need for cross-namespace routing, in which case add
justification and tighten other Gateway controls).
In `@config/llm_d/manifests/llminferenceservice.yaml`:
- Around line 78-96: The current extreme failureThresholds on livenessProbe and
readinessProbe (livenessProbe.failureThreshold and
readinessProbe.failureThreshold) will hide real outages—replace the long-tailing
behavior by adding a startupProbe for the long model cold-start and revert
livenessProbe and readinessProbe to responsive values (e.g., small
failureThreshold like 3 and shorter periodSeconds/timeouts) so restarts and
traffic removal happen quickly; specifically, introduce a startupProbe that
targets the same /health endpoint on port 8000 with long
initialDelaySeconds/periodSeconds suitable for model load, and update
livenessProbe and readinessProbe in the manifest to remove the huge
failureThresholds and long delays so they remain sensitive once the startupProbe
succeeds.
In `@config/llm_d/models.yaml`:
- Around line 16-25: The llama model entry "llama-3-1-8b-instruct-fp8" is
missing resources.limits.cpu and resources.limits.memory so llmd_runtime.py's
replacement of the entire resources block results in no CPU/memory limits being
applied; update the model stanza to include limits.cpu and limits.memory (e.g.,
set limits.cpu: "4" and limits.memory: 8Gi to match the requests) while keeping
the existing nvidia.com/gpu limits, so resources.requests and resources.limits
are fully specified for "llama-3-1-8b-instruct-fp8".
In `@projects/llm_d/orchestration/llmd_runtime.py`:
- Around line 326-341: The wait loop in wait_until is swallowing terminal
RuntimeError from predicates like wait_for_datasciencecluster_ready; change the
except block inside wait_until so that after catching Exception as exc it
immediately re-raises if isinstance(exc, RuntimeError) (or any other terminal
exception class you choose), otherwise treat it as a transient error (log and
continue as now). Update the except clause in wait_until (the try/except
surrounding predicate()) to raise exc when it's a RuntimeError, and only set
last_error/log and sleep for non-terminal exceptions.
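A minimal sketch of the suggested control flow, assuming a `wait_until(predicate, ...)` shape similar to the runtime's; the `terminal_exceptions` parameter is an illustrative generalization, not the existing signature:

```python
import time


def wait_until(predicate, *, timeout_seconds=300, interval_seconds=5,
               terminal_exceptions=(RuntimeError,)):
    """Poll predicate() until it returns True.

    Exceptions listed in terminal_exceptions abort the wait immediately;
    anything else is treated as transient, remembered, and retried.
    """
    deadline = time.monotonic() + timeout_seconds
    last_error = None
    while time.monotonic() < deadline:
        try:
            if predicate():
                return
        except terminal_exceptions:
            raise  # terminal failure: fail fast instead of looping
        except Exception as exc:
            last_error = exc  # transient: retry after the interval
        time.sleep(interval_seconds)
    raise TimeoutError(
        f"condition not met within {timeout_seconds}s; last error: {last_error!r}"
    )
```

Ordering matters here: the `terminal_exceptions` clause must come before the generic `except Exception` clause, otherwise terminal errors are swallowed exactly as the finding describes.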
- Around line 654-666: The init container "permission-fixer" currently requests
root via securityContext keys runAsUser: 0 and allowPrivilegeEscalation: True
which will be rejected by OpenShift's restricted SCC; update the pod spec
generation in llmd_runtime.py to avoid root: either (A) remove runAsUser: 0 and
allowPrivilegeEscalation and instead set a pod-level securityContext.fsGroup
(e.g., fsGroup: 1001) so the container can chmod/chown mounted PVCs without
running as root, ensure the init container runs as a non-root uid (e.g., 1001)
and keep the command unchanged, or (B) if root is unavoidable, set an explicit
serviceAccountName and document that the service account must be bound to an
appropriate SCC — reference the "permission-fixer" init container, the runAsUser
and allowPrivilegeEscalation fields, the pod securityContext.fsGroup option, and
serviceAccountName when making the patch.
- Around line 230-259: The run_command wrapper can hang indefinitely; add a
timeout parameter (e.g., timeout_seconds: float | None = 300) to run_command and
pass it into subprocess.run via the timeout argument, updating the function
signature and logging as needed; then update the oc(...) wrapper to accept the
same timeout_seconds parameter and forward it to run_command so callers can
override the default for long-running ops (reference functions: run_command and
oc).
- Around line 281-301: oc_get_json currently treats any non-zero oc exit as "not
found" when ignore_not_found=True; change oc_get_json (and callers like
resource_exists) to inspect result.stderr on failure and only return None when
stderr indicates a true NotFound (e.g., contains "Error from server (NotFound)"
or messages like "No resources found" / "not found"); for any other stderr
(RBAC/Forbidden, "no matches for kind", API errors, etc.) raise an exception (or
re-raise a CalledProcessError) so callers fail fast instead of assuming absence.
Ensure you reference and update oc_get_json's result handling logic and adjust
resource_exists to propagate the error rather than returning False for
non-NotFound oc errors.
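A hedged sketch of the stderr classification this prompt describes; the marker strings and the `classify_oc_failure` helper name are assumptions for illustration, not existing code:

```python
import subprocess

# Substrings that indicate the object is genuinely absent, as opposed to
# RBAC denials, unknown kinds, or other API-server errors.
_NOT_FOUND_MARKERS = (
    "Error from server (NotFound)",
    "No resources found",
    "not found",
)


def classify_oc_failure(result: subprocess.CompletedProcess) -> str:
    """Return 'ok' or 'not-found'; raise for any other failure mode."""
    if result.returncode == 0:
        return "ok"
    stderr = result.stderr or ""
    if any(marker in stderr for marker in _NOT_FOUND_MARKERS):
        return "not-found"
    # Forbidden, "no matches for kind", connection errors, etc. fail fast.
    raise subprocess.CalledProcessError(
        result.returncode, result.args, output=result.stdout, stderr=stderr
    )
```

With such a helper, `oc_get_json` would return `None` only on `'not-found'`, and `resource_exists` would propagate the raised error instead of reporting `False`.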
In `@projects/llm_d/orchestration/prepare_llmd.py`:
- Around line 146-167: The NodeFeatureDiscovery bootstrap currently always
reapplies the manifest; change prepare_llmd.py to first check for the existing
NFD CR using llmd_runtime.resource_exists with the same name/namespace
(manifest["metadata"]["name"], manifest["metadata"]["namespace"]) and, if it
exists, skip llmd_runtime.apply_manifest and only call llmd_runtime.wait_until/
wait_for_nfd_gpu_labels; otherwise proceed to load_manifest_template,
apply_manifest and then wait_until and wait_for_nfd_gpu_labels as now. Ensure
the predicate and timeout usage remain identical to the existing wait_until
call.
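The check-then-apply flow can be sketched as follows; the `runtime` helper names mirror the prompt, but their exact signatures are assumptions for illustration:

```python
def ensure_nfd(runtime, manifest_path, *, timeout_seconds=600):
    """Apply the NodeFeatureDiscovery CR only if it does not already exist,
    then wait for the GPU labels either way.

    `runtime` stands in for the llmd_runtime module; the helper names follow
    the review prompt, but their exact signatures are assumed here.
    """
    manifest = runtime.load_manifest_template(manifest_path)
    name = manifest["metadata"]["name"]
    namespace = manifest["metadata"]["namespace"]
    if not runtime.resource_exists("NodeFeatureDiscovery",
                                   name=name, namespace=namespace):
        runtime.apply_manifest(manifest)
    # Identical wait logic on both paths, as in the existing code.
    runtime.wait_until(runtime.wait_for_nfd_gpu_labels,
                       timeout_seconds=timeout_seconds)
```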
- Around line 57-83: The deletes currently fire and return immediately; modify
the teardown after the llmd_runtime.oc delete calls to poll until the objects
are actually gone: repeatedly call llmd_runtime.oc("get", ...) for the
LLMInferenceService (inference_service_name), the Job and PVC (benchmark_name),
and the pod f"{benchmark_name}-copy" in config.namespace, sleeping briefly
between attempts and enforcing an overall timeout; stop polling when each get
indicates the resource is not found (or returns non-zero), and log/raise an
error if the timeout elapses to avoid proceeding while resources are still
terminating.
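A small sketch of the polling idea, with the `oc get` probe abstracted behind a callable so the example stays self-contained; `wait_for_deletion` is a hypothetical helper name:

```python
import time


def wait_for_deletion(still_exists, *, timeout_seconds=120, interval_seconds=5):
    """Block until still_exists() reports the object is gone.

    still_exists would wrap an `oc get` call that returns True while the
    resource is present and False once the get fails with NotFound.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if not still_exists():
            return
        time.sleep(interval_seconds)
    raise TimeoutError(f"resource still present after {timeout_seconds}s")
```

The teardown would call this once per deleted object (the LLMInferenceService, the Job/PVC, and the copy pod) before proceeding.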
In `@projects/llm_d/orchestration/test_llmd.py`:
- Around line 318-333: The test currently ignores failures when running
llmd_runtime.oc to cat /results/benchmarks.json; update the block that calls
llmd_runtime.oc (the variable result from llmd_runtime.oc and the subsequent
conditional that calls llmd_runtime.write_text) to treat a non-zero returncode
or empty stdout as a test failure: if result.returncode != 0 or not
result.stdout raise an exception or use the test framework's fail/assert
mechanism with a clear message including result.stderr/output; only call
llmd_runtime.write_text to write the artifact when stdout is present and the
command succeeded.
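The fail-fast handling can be sketched like this; `collect_benchmark_artifact` is a hypothetical helper and `write_text` stands in for llmd_runtime.write_text, whose exact signature is assumed:

```python
def collect_benchmark_artifact(result, write_text, artifact_path):
    """Persist /results/benchmarks.json only when the read succeeded.

    `result` is the CompletedProcess from the `oc exec ... cat` call.
    Any non-zero exit or empty output is treated as a test failure.
    """
    if result.returncode != 0 or not result.stdout:
        raise RuntimeError(
            "failed to read /results/benchmarks.json: "
            f"rc={result.returncode}, stderr={result.stderr!r}"
        )
    write_text(artifact_path, result.stdout)
```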
- Around line 41-61: Only perform the benchmark resource deletes when a
benchmark was actually created: wrap the calls to llmd_runtime.oc that delete
"job,pvc" and the pod ("{benchmark_name}-copy") in a conditional that checks
config.benchmark (truthy) before computing benchmark_name and invoking
llmd_runtime.oc; reference the existing benchmark_name variable and the
llmd_runtime.oc calls, and do not run these deletes when config.benchmark is
None (optionally also skip if namespace_is_managed is False, if that flag
semantics require avoiding cross-namespace deletes).
- Around line 62-74: The test currently always captures namespace events; guard
this behavior by checking the config flag artifacts.capture_namespace_events
before invoking llmd_runtime.oc and writing the file: only run the existing
block that calls llmd_runtime.oc(...), checks events.returncode and writes to
artifacts_dir / "namespace.events.txt" when the configuration flag
artifacts.capture_namespace_events (from config/llm_d/platform.yaml) is true;
locate the block that defines events and uses llmd_runtime.write_text and wrap
it with a conditional that reads the runtime/config object’s
artifacts.capture_namespace_events flag.
- Around line 193-211: The oc exec call that runs curl (the llmd_runtime.oc
invocation producing result) must be bounded per attempt so a hung curl doesn't
stall retries: add a per-attempt timeout (either by passing a timeout argument
to llmd_runtime.oc if it supports one, or by adding curl's --max-time flag in
the command) and catch the timeout/TimeoutExpired outcome around the
llmd_runtime.oc call so the retry loop can move to the next attempt when the
timeout fires; update handling of result/timeouts in the surrounding retry loop
accordingly.
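A self-contained sketch of a per-attempt bound around a subprocess call; in the real test the `cmd` would be the `oc exec ... curl` invocation, ideally with curl's `--max-time` added as well so the bound also applies inside the remote session. `bounded_exec` is a hypothetical helper:

```python
import subprocess
import time


def bounded_exec(cmd, *, attempts=3, per_attempt_seconds=30, backoff_seconds=1.0):
    """Run cmd with retries, bounding each attempt with a subprocess-level
    timeout so a hung process cannot stall the retry loop."""
    last_failure = None
    for _ in range(attempts):
        try:
            result = subprocess.run(cmd, capture_output=True, text=True,
                                    timeout=per_attempt_seconds)
            if result.returncode == 0:
                return result.stdout
            last_failure = result.stderr
        except subprocess.TimeoutExpired:
            # Treat a hung attempt like any other failure and move on.
            last_failure = f"attempt timed out after {per_attempt_seconds}s"
        time.sleep(backoff_seconds)
    raise RuntimeError(f"command failed after {attempts} attempts: {last_failure}")
```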
In `@projects/llm_d/README.md`:
- Around line 13-16: The README contains machine-local absolute links to files
(ci.py, prepare_llmd.py, test_llmd.py, llmd_runtime.py); replace each absolute
path with a repo-relative link (e.g., orchestration/ci.py or
./orchestration/ci.py) so the links work on GitHub and other environments,
updating the four entries that mention ci.py, prepare_llmd.py, test_llmd.py, and
llmd_runtime.py accordingly.
---
Outside diff comments:
In `@projects/llm_d/toolbox/capture_isvc_state/main.py`:
- Around line 285-293: The code calls an undefined symbol capture_isvc_state
when building the CLI and invoking the function; replace both references to
capture_isvc_state with the actual function name run (used by
create_dynamic_parser and the invocation), so that parser =
create_dynamic_parser(run, positional_args=["llmisvc_name"]) and the try block
calls run(**kwargs) after env.init(daily_artifact_dir=True), leaving args
parsing and kwargs conversion unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a6025d00-1aab-4bc5-9011-d6cda857033c
📒 Files selected for processing (18)
- config/llm_d/manifests/datasciencecluster.yaml
- config/llm_d/manifests/epp-approximate-prefix-cache.yaml
- config/llm_d/manifests/gateway.yaml
- config/llm_d/manifests/gpu-clusterpolicy.yaml
- config/llm_d/manifests/llminferenceservice.yaml
- config/llm_d/manifests/nfd-nodefeaturediscovery.yaml
- config/llm_d/models.yaml
- config/llm_d/platform.yaml
- config/llm_d/presets.yaml
- config/llm_d/workloads.yaml
- projects/llm_d/README.md
- projects/llm_d/orchestration/ci.py
- projects/llm_d/orchestration/cli.py
- projects/llm_d/orchestration/llmd_runtime.py
- projects/llm_d/orchestration/prepare_llmd.py
- projects/llm_d/orchestration/test_llmd.py
- projects/llm_d/toolbox/capture_isvc_state/main.py
- tests/llm_d/test_runtime.py
```yaml
allowedRoutes:
  namespaces:
    from: All
```
🧩 Analysis chain (Repository: openshift-psap/forge)

🏁 Scripts executed:
```shell
find . -path "*/config/llm_d/manifests/gateway.yaml" -type f
cat -n config/llm_d/manifests/gateway.yaml 2>/dev/null || find . -name "gateway.yaml" -type f | head -20
```
🌐 Web query:
Kubernetes Gateway API allowedRoutes namespaces from All vs Same security
💡 Result:
In Kubernetes Gateway API, the allowedRoutes.namespaces.from field in a Gateway listener controls which namespaces' Routes can attach to that listener. The key options are "All" and "Same", with significant security differences.

From: Same (default)
- Only allows Routes from the same namespace as the Gateway to attach.
- Provides strong namespace isolation: prevents cross-namespace attachment without explicit configuration.
- Best for single-team or high-security environments where you want to avoid unintended exposure.
- Example: Gateway in "infra-ns" only accepts Routes from "infra-ns".

From: All
- Allows Routes from all namespaces in the cluster to attach.
- Highly permissive: any Route anywhere can bind, increasing risk of conflicts or unauthorized exposure.
- Use only in fully trusted clusters; avoid in multi-tenant setups.
- Example: Gateway accepts Routes from any namespace, e.g., a shared infra Gateway.

Other options:
- Selector: Allows Routes from namespaces matching a label selector (balanced security, recommended for multi-team clusters).
- None: Blocks all Route attachments (for Gateway-only setups).

Security implications:
- "Same" enforces least privilege by default, reducing blast radius.
- "All" requires additional safeguards like RBAC, NetworkPolicies, or admission controllers to prevent abuse.
- Official docs recommend explicit constraints as best practice, especially for shared Gateways.
- Cross-namespace Route->Gateway attachment uses this handshake; it is separate from ReferenceGrant (for Route->Service/Secret).

Default behavior: If allowedRoutes is unspecified, namespaces.from defaults to "Same".

Examples:

Same (secure):
    listeners:
    - name: http
      allowedRoutes:
        namespaces:
          from: Same

All (permissive):
    listeners:
    - name: http
      allowedRoutes:
        namespaces:
          from: All

Selector (recommended for shared):
    listeners:
    - name: http
      allowedRoutes:
        namespaces:
          from: Selector
          selector:
            matchLabels:
              gateway-access: "true"
Citations:
- 1: https://gateway-api.sigs.k8s.io/guides/multiple-ns/
- 2: https://gateway-api.sigs.k8s.io/concepts/security/
- 3: https://gateway-api.f5se.io/guides/allowedroutes/
- 4: https://gateway-api.sigs.k8s.io/reference/spec/
- 5: https://gateway-api.sigs.k8s.io/guides/multiple-ns
- 6: https://kubernetes.io/docs/concepts/services-networking/gateway
Restrict route attachment scope on this Gateway.
from: All allows any namespace in the cluster to bind routes to this listener. Since the Gateway is shared in openshift-ingress, this creates a cross-namespace exposure risk. Use from: Same (the namespace-isolation default) unless multi-namespace access is explicitly required.
🔒 Proposed safer default

```diff
  allowedRoutes:
    namespaces:
-     from: All
+     from: Same
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
allowedRoutes:
  namespaces:
    from: Same
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@config/llm_d/manifests/gateway.yaml` around lines 12 - 14, The Gateway's
allowedRoutes.namespaces is set to "from: All", which permits any namespace to
bind routes to this listener; change the setting in the Gateway manifest
(allowedRoutes -> namespaces -> from) from "All" to "Same" to enforce namespace
isolation for this listener (unless you have a documented, explicit need for
cross-namespace routing, in which case add justification and tighten other
Gateway controls).
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8000
    scheme: HTTPS
  initialDelaySeconds: 900
  periodSeconds: 60
  timeoutSeconds: 60
  failureThreshold: 1000
readinessProbe:
  failureThreshold: 10000
  httpGet:
    path: /health
    port: 8000
    scheme: HTTPS
  initialDelaySeconds: 60
  periodSeconds: 30
  successThreshold: 1
  timeoutSeconds: 30
```
These probe thresholds will mask real outages.
readinessProbe.failureThreshold: 10000 with a 30s period means the pod can keep receiving traffic for ~83 hours after /health starts failing. livenessProbe.failureThreshold: 1000 with a 60s period defers restarts for ~17 hours. If the goal is to tolerate long model startup, use a startupProbe and keep readiness/liveness responsive.
Suggested direction

```diff
   - name: main
+    startupProbe:
+      httpGet:
+        path: /health
+        port: 8000
+        scheme: HTTPS
+      periodSeconds: 30
+      timeoutSeconds: 30
+      failureThreshold: 60
     livenessProbe:
       httpGet:
         path: /health
         port: 8000
         scheme: HTTPS
-      initialDelaySeconds: 900
-      periodSeconds: 60
-      timeoutSeconds: 60
-      failureThreshold: 1000
+      periodSeconds: 30
+      timeoutSeconds: 10
+      failureThreshold: 3
     readinessProbe:
-      failureThreshold: 10000
       httpGet:
         path: /health
         port: 8000
         scheme: HTTPS
-      initialDelaySeconds: 60
-      periodSeconds: 30
+      periodSeconds: 10
       successThreshold: 1
-      timeoutSeconds: 30
+      timeoutSeconds: 10
+      failureThreshold: 3
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@config/llm_d/manifests/llminferenceservice.yaml` around lines 78 - 96, The
current extreme failureThresholds on livenessProbe and readinessProbe
(livenessProbe.failureThreshold and readinessProbe.failureThreshold) will hide
real outages—replace the long-tailing behavior by adding a startupProbe for the
long model cold-start and revert livenessProbe and readinessProbe to responsive
values (e.g., small failureThreshold like 3 and shorter periodSeconds/timeouts)
so restarts and traffic removal happen quickly; specifically, introduce a
startupProbe that targets the same /health endpoint on port 8000 with long
initialDelaySeconds/periodSeconds suitable for model load, and update
livenessProbe and readinessProbe in the manifest to remove the huge
failureThresholds and long delays so they remain sensitive once the startupProbe
succeeds.
```yaml
llama-3-1-8b-instruct-fp8:
  served_model_name: llama-3-1-8b-instruct-fp8
  uri: oci://registry.redhat.io/rhelai1/modelcar-llama-3-1-8b-instruct-fp8-dynamic:1.5
  resources:
    requests:
      cpu: "4"
      memory: 8Gi
      nvidia.com/gpu: "1"
    limits:
      nvidia.com/gpu: "1"
```
Add CPU and memory limits for the llama model.
projects/llm_d/orchestration/llmd_runtime.py replaces the whole resources block when rendering the inference service, so omitting limits.cpu and limits.memory here does not inherit the template defaults. The llama deployment will run without CPU/memory limits while the qwen deployment is constrained.
Suggested fix

```diff
   llama-3-1-8b-instruct-fp8:
     served_model_name: llama-3-1-8b-instruct-fp8
     uri: oci://registry.redhat.io/rhelai1/modelcar-llama-3-1-8b-instruct-fp8-dynamic:1.5
     resources:
       requests:
         cpu: "4"
         memory: 8Gi
         nvidia.com/gpu: "1"
       limits:
+        cpu: "4"
+        memory: 8Gi
         nvidia.com/gpu: "1"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
llama-3-1-8b-instruct-fp8:
  served_model_name: llama-3-1-8b-instruct-fp8
  uri: oci://registry.redhat.io/rhelai1/modelcar-llama-3-1-8b-instruct-fp8-dynamic:1.5
  resources:
    requests:
      cpu: "4"
      memory: 8Gi
      nvidia.com/gpu: "1"
    limits:
      cpu: "4"
      memory: 8Gi
      nvidia.com/gpu: "1"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@config/llm_d/models.yaml` around lines 16 - 25, The llama model entry
"llama-3-1-8b-instruct-fp8" is missing resources.limits.cpu and
resources.limits.memory so llmd_runtime.py's replacement of the entire resources
block results in no CPU/memory limits being applied; update the model stanza to
include limits.cpu and limits.memory (e.g., set limits.cpu: "4" and
limits.memory: 8Gi to match the requests) while keeping the existing
nvidia.com/gpu limits, so resources.requests and resources.limits are fully
specified for "llama-3-1-8b-instruct-fp8".
```python
def run_command(
    args: Iterable[str],
    *,
    check: bool = True,
    capture_output: bool = True,
    input_text: str | None = None,
) -> subprocess.CompletedProcess[str]:
    cmd = [str(arg) for arg in args]
    LOGGER.info("run: %s", " ".join(shlex.quote(arg) for arg in cmd))
    result = subprocess.run(
        cmd,
        check=False,
        text=True,
        capture_output=capture_output,
        input=input_text,
    )

    if capture_output:
        if result.stdout:
            LOGGER.info("stdout:\n%s", result.stdout.rstrip())
        if result.stderr:
            LOGGER.info("stderr:\n%s", result.stderr.rstrip())

    if check and result.returncode != 0:
        raise CommandError(
            f"Command failed with exit code {result.returncode}: "
            f"{' '.join(shlex.quote(arg) for arg in cmd)}"
        )

    return result
```
🧩 Analysis chain

🏁 Scripts executed:
```shell
# First, locate and examine the file
find . -name "llmd_runtime.py" -type f
# Read the target file to verify the current implementation
wc -l projects/llm_d/orchestration/llmd_runtime.py
# Get the specific section mentioned in the review (lines 230-259)
sed -n '230,259p' projects/llm_d/orchestration/llmd_runtime.py
# Search for all calls to run_command to understand current usage patterns
rg "run_command\(" --no-heading -A 2 -B 2 projects/llm_d/
# Check if there are any existing timeout patterns in the orchestration module
rg "timeout" projects/llm_d/orchestration/ -i
# More comprehensive search for run_command calls across the entire repo
rg "run_command\(" --no-heading -B 1 -A 5
# Also search for oc_get_json and other wrapper functions mentioned in review
rg "def oc_" projects/llm_d/orchestration/llmd_runtime.py -A 8
# Check all function definitions in llmd_runtime that might call run_command
rg "def " projects/llm_d/orchestration/llmd_runtime.py | head -20
# Get the full oc() function definition to see how it wraps run_command
sed -n '260,280p' projects/llm_d/orchestration/llmd_runtime.py
# Search for direct run_command calls vs via oc() wrapper
rg "run_command" projects/llm_d/orchestration/ -B 2 -A 2 --context=2
# Check if there are any calls passing positional arguments to run_command
rg "run_command\(\[" projects/llm_d/ -A 3
# Look at the actual function signature once more
sed -n '229,260p' projects/llm_d/orchestration/llmd_runtime.py
# Check what timeout values are used elsewhere
rg "timeout_seconds" projects/llm_d/orchestration/ | grep -E "(=|:)" | head -15
# Search for subprocess.run usage elsewhere in the codebase to see if timeout is used
rg "subprocess\.run" projects/ -A 3 -B 1
# Check what exceptions are handled that might catch TimeoutExpired
rg "TimeoutExpired|timeout|Timeout" projects/llm_d/ -i
# Verify the oc() wrapper implementation
sed -n '260,275p' projects/llm_d/orchestration/llmd_runtime.py
# Check if TimeoutExpired is caught anywhere in prepare_llmd.py or test_llmd.py
rg "TimeoutExpired" projects/llm_d/orchestration/
# Look for exception handling around oc() or run_command() calls
rg "except.*Error|except.*Exception" projects/llm_d/orchestration/prepare_llmd.py -A 2 -B 2
```
Add a timeout to the command wrapper to prevent indefinite hangs.
Every cluster interaction goes through run_command(), so a hung oc call can stall prepare/test/cleanup indefinitely. A default timeout would fail fast instead of hanging until the CI job times out.
The proposed fix is sound, but the oc() wrapper function also needs to be updated to accept and forward the timeout_seconds parameter, so callers can override the default when needed for longer-running operations.
Proposed changes
Update run_command() signature and the oc() wrapper:
```diff
 def run_command(
     args: Iterable[str],
     *,
     check: bool = True,
     capture_output: bool = True,
     input_text: str | None = None,
+    timeout_seconds: int = 300,
 ) -> subprocess.CompletedProcess[str]:
     cmd = [str(arg) for arg in args]
     LOGGER.info("run: %s", " ".join(shlex.quote(arg) for arg in cmd))
     result = subprocess.run(
         cmd,
         check=False,
         text=True,
         capture_output=capture_output,
         input=input_text,
+        timeout=timeout_seconds,
     )
```

Also update the oc() wrapper to accept and pass through timeout_seconds:

```diff
 def oc(
     *args: str,
     check: bool = True,
     capture_output: bool = True,
     input_text: str | None = None,
+    timeout_seconds: int = 300,
 ) -> subprocess.CompletedProcess[str]:
     return run_command(
         ["oc", *args],
         check=check,
         capture_output=capture_output,
         input_text=input_text,
+        timeout_seconds=timeout_seconds,
     )
```

🧰 Tools
)🧰 Tools
🪛 Ruff (0.15.9)
[error] 239-239: `subprocess` call: check for execution of untrusted input (S603)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/llm_d/orchestration/llmd_runtime.py` around lines 230 - 259, The
run_command wrapper can hang indefinitely; add a timeout parameter (e.g.,
timeout_seconds: float | None = 300) to run_command and pass it into
subprocess.run via the timeout argument, updating the function signature and
logging as needed; then update the oc(...) wrapper to accept the same
timeout_seconds parameter and forward it to run_command so callers can override
the default for long-running ops (reference functions: run_command and oc).
```python
def oc_get_json(
    kind: str,
    *,
    name: str | None = None,
    namespace: str | None = None,
    selector: str | None = None,
    ignore_not_found: bool = False,
) -> dict[str, Any] | None:
    args = ["get", kind]
    if name:
        args.append(name)
    if namespace:
        args.extend(["-n", namespace])
    if selector:
        args.extend(["-l", selector])
    args.extend(["-o", "json"])

    result = oc(*args, check=not ignore_not_found, capture_output=True)
    if ignore_not_found and result.returncode != 0:
        return None
    return json.loads(result.stdout)
```
🧩 Analysis chain

🏁 Scripts executed (against the openshift-psap/forge repository):

```shell
head -n 320 projects/llm_d/orchestration/llmd_runtime.py | tail -n 50
grep -n "^def oc(" projects/llm_d/orchestration/llmd_runtime.py
sed -n '262,280p' projects/llm_d/orchestration/llmd_runtime.py
grep -n "^def run_command(" projects/llm_d/orchestration/llmd_runtime.py
sed -n '230,260p' projects/llm_d/orchestration/llmd_runtime.py
rg "oc_get_json|resource_exists" projects/llm_d/orchestration/llmd_runtime.py -A 2 -B 2
rg "ignore_not_found=True" projects/llm_d/orchestration/llmd_runtime.py -B 5 -A 3
rg "stderr" projects/llm_d/orchestration/llmd_runtime.py
rg "Error from server|NotFound" projects/llm_d/
find projects/llm_d -name "*.py" -type f | xargs grep -l "oc_get_json\|resource_exists" | grep -i test
cat projects/llm_d/orchestration/test_llmd.py
grep -r "NotFound\|Forbidden\|Error from server" projects/llm_d/
grep -r "returncode\|stderr" projects/llm_d/orchestration/llmd_runtime.py -B 2 -A 2
python3 << 'EOF'
# Common `oc get` error patterns returned in stderr for different scenarios
errors = {
    "NotFound": 'Error from server (NotFound): <resource> "<name>" not found',
    "Forbidden": "Error from server (Forbidden): <resource> is forbidden: User <user> cannot get <resource>",
    "ServerError": "Error from server (InternalError): Internal error occurred:",
    "CRDNotReady": 'no matches for kind "<Kind>"',
}
for error_type, pattern in errors.items():
    print(f"{error_type}: {pattern}")
EOF
rg "payload\s*=" projects/llm_d/orchestration/llmd_runtime.py | head -20
```
Check stderr to distinguish NotFound from other oc errors.
oc_get_json(..., ignore_not_found=True) and resource_exists() currently return None/False for any non-zero oc get exit code. OpenShift's oc command returns distinct error messages in stderr (e.g., "Error from server (NotFound):", "Error from server (Forbidden):", "no matches for kind"), but the code ignores these and treats all failures as resource absence. This masks RBAC denials, API outages, and unregistered CRDs, causing callers to attempt create/apply operations on broken clusters instead of failing fast.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/llm_d/orchestration/llmd_runtime.py` around lines 281 - 301,
oc_get_json currently treats any non-zero oc exit as "not found" when
ignore_not_found=True; change oc_get_json (and callers like resource_exists) to
inspect result.stderr on failure and only return None when stderr indicates a
true NotFound (e.g., contains "Error from server (NotFound)" or messages like
"No resources found" / "not found"); for any other stderr (RBAC/Forbidden, "no
matches for kind", API errors, etc.) raise an exception (or re-raise a
CalledProcessError) so callers fail fast instead of assuming absence. Ensure you
reference and update oc_get_json's result handling logic and adjust
resource_exists to propagate the error rather than returning False for
non-NotFound oc errors.
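The classification the prompt asks for can be sketched as a small helper over oc's stderr. The marker strings below are assumptions drawn from the error patterns enumerated in the analysis chain, and the function names are hypothetical:

```python
import subprocess

# Substrings that genuinely indicate the resource is absent (assumed patterns).
NOT_FOUND_MARKERS = (
    "Error from server (NotFound)",
    "not found",
)


def classify_oc_failure(stderr: str) -> str:
    """Return 'not_found' for true absence, 'error' for anything else
    (RBAC denials, unregistered CRDs, API outages)."""
    if any(marker in stderr for marker in NOT_FOUND_MARKERS):
        return "not_found"
    return "error"


def handle_get_failure(result: subprocess.CompletedProcess):
    """Sketch: return None only for a true NotFound; fail fast otherwise."""
    if classify_oc_failure(result.stderr) == "not_found":
        return None
    raise subprocess.CalledProcessError(
        result.returncode, result.args, result.stdout, result.stderr
    )
```

With this split, `resource_exists()` would propagate Forbidden/CRD errors instead of silently reporting the resource as absent.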
```python
benchmark_name = (
    config.benchmark["job_name"] if config.benchmark else "guidellm-benchmark"
)
llmd_runtime.oc(
    "delete",
    "job,pvc",
    benchmark_name,
    "-n",
    namespace,
    "--ignore-not-found=true",
    check=False,
)
llmd_runtime.oc(
    "delete",
    "pod",
    f"{benchmark_name}-copy",
    "-n",
    namespace,
    "--ignore-not-found=true",
    check=False,
)
```
Only clean up benchmark resources when a benchmark was actually created.
When config.benchmark is None, this still deletes job,pvc guidellm-benchmark and pod guidellm-benchmark-copy in the target namespace. That is risky for smoke-only runs, especially when namespace_is_managed is False and the namespace can contain unrelated resources.
Suggested fix
```diff
-benchmark_name = (
-    config.benchmark["job_name"] if config.benchmark else "guidellm-benchmark"
-)
-llmd_runtime.oc(
-    "delete",
-    "job,pvc",
-    benchmark_name,
-    "-n",
-    namespace,
-    "--ignore-not-found=true",
-    check=False,
-)
-llmd_runtime.oc(
-    "delete",
-    "pod",
-    f"{benchmark_name}-copy",
-    "-n",
-    namespace,
-    "--ignore-not-found=true",
-    check=False,
-)
+if config.benchmark:
+    benchmark_name = config.benchmark["job_name"]
+    llmd_runtime.oc(
+        "delete",
+        "job,pvc",
+        benchmark_name,
+        "-n",
+        namespace,
+        "--ignore-not-found=true",
+        check=False,
+    )
+    llmd_runtime.oc(
+        "delete",
+        "pod",
+        f"{benchmark_name}-copy",
+        "-n",
+        namespace,
+        "--ignore-not-found=true",
+        check=False,
+    )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/llm_d/orchestration/test_llmd.py` around lines 41 - 61, Only perform
the benchmark resource deletes when a benchmark was actually created: wrap the
calls to llmd_runtime.oc that delete "job,pvc" and the pod
("{benchmark_name}-copy") in a conditional that checks config.benchmark (truthy)
before computing benchmark_name and invoking llmd_runtime.oc; reference the
existing benchmark_name variable and the llmd_runtime.oc calls, and do not run
these deletes when config.benchmark is None (optionally also skip if
namespace_is_managed is False, if that flag semantics require avoiding
cross-namespace deletes).
```python
events = llmd_runtime.oc(
    "get",
    "events",
    "-n",
    namespace,
    "--sort-by=.metadata.creationTimestamp",
    check=False,
    capture_output=True,
)
if events.returncode == 0 and events.stdout:
    llmd_runtime.write_text(
        artifacts_dir / "namespace.events.txt", events.stdout
    )
```
Honor artifacts.capture_namespace_events.
config/llm_d/platform.yaml exposes this as a knob, but this block always captures and writes namespace events. That makes the config ineffective and forces extra artifact collection even when disabled.
Suggested fix
```diff
-events = llmd_runtime.oc(
-    "get",
-    "events",
-    "-n",
-    namespace,
-    "--sort-by=.metadata.creationTimestamp",
-    check=False,
-    capture_output=True,
-)
-if events.returncode == 0 and events.stdout:
-    llmd_runtime.write_text(
-        artifacts_dir / "namespace.events.txt", events.stdout
-    )
+if config.platform["artifacts"].get("capture_namespace_events", False):
+    events = llmd_runtime.oc(
+        "get",
+        "events",
+        "-n",
+        namespace,
+        "--sort-by=.metadata.creationTimestamp",
+        check=False,
+        capture_output=True,
+    )
+    if events.returncode == 0 and events.stdout:
+        llmd_runtime.write_text(
+            artifacts_dir / "namespace.events.txt", events.stdout
+        )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/llm_d/orchestration/test_llmd.py` around lines 62 - 74, The test
currently always captures namespace events; guard this behavior by checking the
config flag artifacts.capture_namespace_events before invoking llmd_runtime.oc
and writing the file: only run the existing block that calls
llmd_runtime.oc(...), checks events.returncode and writes to artifacts_dir /
"namespace.events.txt" when the configuration flag
artifacts.capture_namespace_events (from config/llm_d/platform.yaml) is true;
locate the block that defines events and uses llmd_runtime.write_text and wrap
it with a conditional that reads the runtime/config object’s
artifacts.capture_namespace_events flag.
```python
result = llmd_runtime.oc(
    "exec",
    "-n",
    namespace,
    f"deployment/{deployment_name}",
    "-c",
    "main",
    "--",
    "curl",
    "-k",
    "-sSf",
    f"{endpoint_url}{config.platform['smoke']['endpoint_path']}",
    "-H",
    "Content-Type: application/json",
    "-d",
    json.dumps(payload),
    check=False,
    capture_output=True,
)
```
Bound each smoke-request attempt with a timeout.
The retry loop does not help if curl blocks inside oc exec; one hung request can stall the whole CI run beyond the configured retry budget. Add per-attempt timeouts so failures actually progress to the next retry.
Suggested fix
```diff
 result = llmd_runtime.oc(
     "exec",
     "-n",
     namespace,
     f"deployment/{deployment_name}",
     "-c",
     "main",
     "--",
     "curl",
     "-k",
     "-sSf",
+    "--connect-timeout",
+    "10",
+    "--max-time",
+    str(delay),
     f"{endpoint_url}{config.platform['smoke']['endpoint_path']}",
     "-H",
     "Content-Type: application/json",
     "-d",
     json.dumps(payload),
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/llm_d/orchestration/test_llmd.py` around lines 193 - 211, The oc
exec call that runs curl (the llmd_runtime.oc invocation producing result) must
be bounded per attempt so a hung curl doesn't stall retries: add a per-attempt
timeout (either by passing a timeout argument to llmd_runtime.oc if it supports
one, or by adding curl's --max-time flag in the command) and catch the
timeout/TimeoutExpired outcome around the llmd_runtime.oc call so the retry loop
can move to the next attempt when the timeout fires; update handling of
result/timeouts in the surrounding retry loop accordingly.
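The retry shape the prompt describes can be sketched with a plain subprocess call. Function and parameter names here are hypothetical — the real code shells out through `llmd_runtime.oc` rather than calling `subprocess.run` directly:

```python
import subprocess
import time


def smoke_request_with_retries(cmd, attempts=3, per_attempt_timeout=5, pause=0.0):
    """Retry a probe command; each attempt is bounded so one hung request
    cannot consume the whole retry budget (illustrative sketch)."""
    for attempt in range(1, attempts + 1):
        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=per_attempt_timeout
            )
            if result.returncode == 0:
                return result.stdout
            print(f"attempt {attempt} failed: rc={result.returncode}")
        except subprocess.TimeoutExpired:
            # The timeout fires per attempt, so the loop progresses instead of hanging.
            print(f"attempt {attempt} timed out after {per_attempt_timeout}s")
        if attempt < attempts:
            time.sleep(pause)
    raise RuntimeError(f"smoke request failed after {attempts} attempts")
```

Pairing a per-attempt `timeout` here with curl's `--max-time` inside the pod gives two layers of bounding: the in-pod curl limit and an outer guard in case `oc exec` itself stalls.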
```python
result = llmd_runtime.oc(
    "exec",
    "-n",
    namespace,
    f"{benchmark_name}-copy",
    "--",
    "cat",
    "/results/benchmarks.json",
    check=False,
    capture_output=True,
)
if result.returncode == 0 and result.stdout:
    llmd_runtime.write_text(
        config.artifact_dir / "artifacts" / "results" / "benchmarks.json",
        result.stdout,
    )
```
Treat missing benchmark output as a failure.
If oc exec ... cat /results/benchmarks.json fails or returns empty output, the run still succeeds and just skips writing the artifact. That can make benchmark jobs look green while producing no usable result.
Suggested fix
```diff
 result = llmd_runtime.oc(
     "exec",
     "-n",
     namespace,
     f"{benchmark_name}-copy",
     "--",
     "cat",
     "/results/benchmarks.json",
     check=False,
     capture_output=True,
 )
-if result.returncode == 0 and result.stdout:
-    llmd_runtime.write_text(
-        config.artifact_dir / "artifacts" / "results" / "benchmarks.json",
-        result.stdout,
-    )
+if result.returncode != 0 or not result.stdout:
+    raise RuntimeError(
+        f"Failed to copy GuideLLM results for {benchmark_name} from /results/benchmarks.json"
+    )
+llmd_runtime.write_text(
+    config.artifact_dir / "artifacts" / "results" / "benchmarks.json",
+    result.stdout,
+)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/llm_d/orchestration/test_llmd.py` around lines 318 - 333, The test
currently ignores failures when running llmd_runtime.oc to cat
/results/benchmarks.json; update the block that calls llmd_runtime.oc (the
variable result from llmd_runtime.oc and the subsequent conditional that calls
llmd_runtime.write_text) to treat a non-zero returncode or empty stdout as a
test failure: if result.returncode != 0 or not result.stdout raise an exception
or use the test framework's fail/assert mechanism with a clear message including
result.stderr/output; only call llmd_runtime.write_text to write the artifact
when stdout is present and the command succeeded.
Force-pushed from 7d85dc5 to 107ddd8 (Compare)
```diff
@@ -0,0 +1,25 @@
+models:
```
As discussed offline, we'll need a mechanism that supports multiple configuration files; this will have to go to `projects/llm_d/orchestration/config.d`.
```python
FORGE_HOME = Path(__file__).resolve().parents[3]
if str(FORGE_HOME) not in sys.path:
    sys.path.insert(0, str(FORGE_HOME))
```

```python
verify_oc_access()
verify_cluster_version(config)
prepare_cert_manager(config)
prepare_leader_worker_set(config)
prepare_nfd(config)
prepare_gpu_operator(config)
prepare_rhoai_operator(config)
apply_datasciencecluster(config)
wait_for_datasciencecluster_ready(config)
ensure_required_crds(config.platform["rhoai"]["required_crds_after_dsc"], config)
ensure_gateway(config)
ensure_test_namespace(config)
verify_gpu_nodes(config)
capture_prepare_state(config)
```
This will have to become toolbox commands; the global config is at `projects.core.library.config` (see `config.project.get_config`).
Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
Force-pushed from 107ddd8 to 9298b94 (Compare)
Summary by CodeRabbit
Release Notes