Adding Forge rhaiis test entrypoint #23

MML-coder wants to merge 3 commits into openshift-psap:main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: (no approvals yet). The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
📝 Walkthrough

Introduces a comprehensive workflow orchestration system for RHAIIS benchmarking on OpenShift/Kubernetes, including: YAML configuration infrastructure (defaults, models, workloads), a reusable workflow engine with context/step/executor abstractions, a scenario generation system supporting matrix expansion and config inheritance, OpenShift CLI utilities with retry logic, concrete steps for vLLM deployment/monitoring and artifact collection, and RHAIIS-specific implementations combining these systems for prepare/benchmark/cleanup phases.
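The context/step/executor engine described in the walkthrough can be sketched as below. Class and method names (`WorkflowStep`, `StepResult`, finally steps that always run) follow the walkthrough's terminology, but the signatures are illustrative, not the PR's actual API.

```python
# Minimal sketch of a step/executor workflow engine: steps run in order,
# a failure skips the remaining steps, and finally-steps always run.
from dataclasses import dataclass, field


@dataclass
class StepResult:
    success: bool
    message: str = ""


@dataclass
class WorkflowContext:
    data: dict = field(default_factory=dict)


class WorkflowStep:
    name = "base"

    def execute(self, ctx: WorkflowContext) -> StepResult:
        raise NotImplementedError


def run_workflow(steps, finally_steps, ctx):
    """Run steps in order; stop on first failure, always run finally steps."""
    results = {}
    for step in steps:
        result = step.execute(ctx)
        results[step.name] = result
        if not result.success:
            break  # skip remaining regular steps
    for step in finally_steps:
        results[step.name] = step.execute(ctx)
    return results
```

The sequence diagrams that follow show this same flow with the PR's concrete step classes.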
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Executor
    participant Workflow
    participant Step1
    participant Step2
    participant Finally1
    User->>Executor: execute(workflow)
    activate Executor
    Executor->>Workflow: steps (lazy init)
    activate Workflow
    Workflow->>Workflow: define_steps()
    Workflow-->>Executor: [Step1, Step2]
    deactivate Workflow
    Executor->>Step1: execute(ctx)
    activate Step1
    Step1-->>Executor: StepResult(success=true)
    deactivate Step1
    Executor->>Step2: execute(ctx)
    activate Step2
    Step2-->>Executor: StepResult(success=true)
    deactivate Step2
    Executor->>Workflow: finally_steps
    activate Workflow
    Workflow-->>Executor: [Finally1]
    deactivate Workflow
    Executor->>Finally1: execute(ctx)
    activate Finally1
    Finally1-->>Executor: StepResult(success=true)
    deactivate Finally1
    Executor->>Executor: aggregate results
    Executor-->>User: WorkflowResult
    deactivate Executor
```
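Besides the executor loop above, the walkthrough mentions "OpenShift CLI utilities with retry logic". A generic retry-with-backoff wrapper of that kind might look like the following sketch; `run_with_retry` and its parameters are illustrative, not the PR's actual `oc` helper.

```python
# Retry a callable with exponential backoff: useful for transient
# failures when shelling out to cluster CLIs such as oc/kubectl.
import time


def run_with_retry(fn, attempts: int = 3, delay: float = 1.0, backoff: float = 2.0):
    """Call fn(); on exception, retry with an exponentially growing delay."""
    last_exc = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # real code would catch a narrower exception type
            last_exc = exc
            if attempt < attempts - 1:
                time.sleep(delay)
                delay *= backoff
    raise last_exc
```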
```mermaid
sequenceDiagram
    participant App
    participant ScenarioGenerator
    participant ConfigLoader
    participant DefaultsYAML as defaults.yaml
    participant ModelsYAML as models.yaml
    participant WorkloadsYAML as workloads.yaml
    App->>ScenarioGenerator: __init__(config_dir, accelerator)
    activate ScenarioGenerator
    ScenarioGenerator->>ConfigLoader: __init__(config_dir, accelerator)
    activate ConfigLoader
    ConfigLoader->>DefaultsYAML: load (cached)
    ConfigLoader->>ModelsYAML: load (cached)
    ConfigLoader->>WorkloadsYAML: load (cached)
    deactivate ConfigLoader
    deactivate ScenarioGenerator
    App->>ScenarioGenerator: expand()
    activate ScenarioGenerator
    ScenarioGenerator->>ConfigLoader: load_model(model_key)
    activate ConfigLoader
    ConfigLoader->>ConfigLoader: deep_merge(global → accelerator → model → model.accelerator_overrides)
    ConfigLoader-->>ScenarioGenerator: ResolvedModelConfig
    deactivate ConfigLoader
    ScenarioGenerator->>ConfigLoader: load_workload(workload_key)
    activate ConfigLoader
    ConfigLoader->>ConfigLoader: merge guidellm + vllm_args
    ConfigLoader-->>ScenarioGenerator: ResolvedWorkloadConfig
    deactivate ConfigLoader
    ScenarioGenerator->>ScenarioGenerator: matrix product(workloads × routing × tensor_parallel)
    ScenarioGenerator->>ScenarioGenerator: compute scenario_id, merge runtime_args
    ScenarioGenerator-->>App: [ExpandedScenario, ...]
    deactivate ScenarioGenerator
```
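The two mechanisms in the diagram above, deep-merging configs in precedence order and expanding the scenario matrix, can be sketched as follows. `deep_merge` and `expand_matrix` are illustrative stand-ins for the `ConfigLoader`/`ScenarioGenerator` internals, not the PR's exact code.

```python
# deep_merge: recursive dict merge where the override wins on conflicts,
# chained in precedence order (global -> accelerator -> model -> overrides).
# expand_matrix: one scenario per (workload, routing, tensor_parallel) combo.
from itertools import product


def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base, returning a new dict."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def expand_matrix(workloads, routing, tensor_parallel):
    """Yield one scenario dict per combination, with a derived scenario_id."""
    return [
        {
            "workload": wl,
            "routing": rt,
            "tensor_parallel": tp,
            "scenario_id": f"{wl}-{rt}-tp{tp}",
        }
        for wl, rt, tp in product(workloads, routing, tensor_parallel)
    ]
```

Chaining `deep_merge(deep_merge(global_cfg, accel_cfg), model_cfg)` and then the model's accelerator overrides reproduces the precedence order shown in the diagram.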
```mermaid
sequenceDiagram
    participant CLI
    participant test_rhaiis
    participant ConfigLoader
    participant BenchmarkWorkflow
    participant DeployVLLMStep
    participant WaitForReadyStep
    participant RunGuideLLMStep
    participant OC as oc (CLI)
    CLI->>test_rhaiis: run_test(model, workloads, ...)
    activate test_rhaiis
    test_rhaiis->>ConfigLoader: load_model(model_key)
    activate ConfigLoader
    ConfigLoader-->>test_rhaiis: ResolvedModelConfig
    deactivate ConfigLoader
    alt Single Workload
        test_rhaiis->>BenchmarkWorkflow: __init__(model, workload, ...)
        test_rhaiis->>BenchmarkWorkflow: execute()
    else Multi-Workload (Deploy-Once)
        loop for each workload group
            test_rhaiis->>DeployVLLMStep: execute(ctx)
            activate DeployVLLMStep
            DeployVLLMStep->>OC: apply -f <kserve.yaml>
            DeployVLLMStep-->>test_rhaiis: StepResult
            deactivate DeployVLLMStep
            test_rhaiis->>WaitForReadyStep: execute(ctx)
            activate WaitForReadyStep
            loop poll InferenceService
                WaitForReadyStep->>OC: get inferenceservice
            end
            WaitForReadyStep->>OC: exec pod curl health check
            WaitForReadyStep-->>test_rhaiis: StepResult(service_url)
            deactivate WaitForReadyStep
            loop for each workload in group
                test_rhaiis->>RunGuideLLMStep: execute(ctx)
                activate RunGuideLLMStep
                RunGuideLLMStep->>OC: apply Pod manifest
                RunGuideLLMStep->>OC: exec pod logs (await BENCHMARK_COMPLETE)
                RunGuideLLMStep-->>test_rhaiis: StepResult
                deactivate RunGuideLLMStep
            end
        end
    end
    test_rhaiis-->>CLI: exit_code
    deactivate test_rhaiis
```
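The `WaitForReadyStep` loop in the diagram (poll the InferenceService, then health-check the pod) boils down to a poll-until-ready pattern. A sketch, under the assumption that a `probe` callable wraps the actual `oc get` / curl health-check calls:

```python
# Poll a readiness probe until it succeeds or a deadline passes.
# The probe callable stands in for the real cluster queries.
import time


def wait_for_ready(probe, timeout: float = 600.0, poll_interval: float = 5.0) -> bool:
    """Return True once probe() is truthy, False if the timeout elapses first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(poll_interval)
    return False
```

Using `time.monotonic()` rather than `time.time()` keeps the deadline immune to wall-clock adjustments during long waits.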
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 20
🧹 Nitpick comments (19)
tests/rhaiis/test_workflows.py (1)
15-23: Consider consolidating duplicated fixtures.
The `temp_artifact_dir` and `context` fixtures are repeated in three classes; moving them to module-level fixtures will reduce test boilerplate.

Also applies to: 80-87, 103-110
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/rhaiis/test_workflows.py` around lines 15-23: the temp_artifact_dir and context fixtures are duplicated across test classes; move them to module level by defining pytest fixtures named temp_artifact_dir and context at the top of the test module and reuse them in tests instead of redefining them inside each class; ensure the context fixture still uses WorkflowContext.from_environment(artifact_base=str(temp_artifact_dir)) and yields Path(tmpdir) semantics for temp_artifact_dir so existing tests referencing those fixture names continue to work.

tests/core/steps/test_guidellm.py (1)
49-52: Use the `WorkflowContext` API instead of rebuilding the internal path format.

Line 49 manually mirrors the artifact directory naming. Prefer `context.get_step_artifact_dir("benchmark")` to avoid brittle coupling.

💡 Suggested change
```diff
-        step_dir = context.artifact_dir / f"{context.step_number:03d}__{context.current_step_name}"
-        step_dir.mkdir(parents=True, exist_ok=True)
+        step_dir = context.get_step_artifact_dir("benchmark")
         output_file = step_dir / "guidellm_results.json"
         output_file.write_text('{"results": []}')
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/core/steps/test_guidellm.py` around lines 49-52: the test rebuilds the artifact path manually using context.step_number and context.current_step_name (step_dir/step_dir.mkdir/output_file), which couples to internal formatting; replace that with the WorkflowContext API by calling context.get_step_artifact_dir("benchmark") to obtain the correct step_dir, then create the directory and write the results to that path (output_file) using the returned Path; update references to step_dir/output_file accordingly (remove manual formatting with f"{context.step_number:03d}__{context.current_step_name}").

tests/core/scenarios/test_generator.py (2)
82-82: Consider using ASCII `x` instead of Unicode `×` in the comment.

The comment uses the Unicode multiplication sign (`×`), which Ruff flags as potentially ambiguous. While this is purely cosmetic, using the ASCII `x` character improves consistency.

📝 Suggested fix
```diff
-    # 2 workloads × 1 routing × 2 TP = 4 scenarios
+    # 2 workloads x 1 routing x 2 TP = 4 scenarios
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/core/scenarios/test_generator.py` at line 82, Replace the Unicode multiplication sign in the comment "# 2 workloads × 1 routing × 2 TP = 4 scenarios" with the ASCII "x" to avoid Ruff warnings; update the comment in tests/core/scenarios/test_generator.py (the line containing that scenario count comment) so it reads "# 2 workloads x 1 routing x 2 TP = 4 scenarios".
60-64: Temporary file created with `delete=False` is not explicitly cleaned up.

The temp file is created with `delete=False` and yielded, but there's no explicit cleanup. While pytest typically handles this, consider using the `tmp_path` fixture or adding cleanup in a finally block for cleaner resource management.

♻️ Alternative using `tmp_path`
```diff
     @pytest.fixture
-    def sample_config_path(self):
+    def sample_config_path(self, tmp_path):
         """Create a sample scenarios.yaml file using new format."""
         config = { ... }
-        with tempfile.NamedTemporaryFile(
-            mode="w", suffix=".yaml", delete=False
-        ) as f:
-            yaml.safe_dump(config, f)
-            yield Path(f.name)
+        config_path = tmp_path / "scenarios.yaml"
+        config_path.write_text(yaml.safe_dump(config))
+        yield config_path
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/core/scenarios/test_generator.py` around lines 60-64: the temporary YAML file created with tempfile.NamedTemporaryFile(delete=False) in the test_generator (where yaml.safe_dump writes and the code yields Path(f.name)) isn't explicitly removed; either switch to pytest's tmp_path fixture to create and write the file there, or wrap the yield in a try/finally that calls Path(f.name).unlink() to ensure cleanup after the test completes; update the block that creates the NamedTemporaryFile accordingly (or replace it with tmp_path.joinpath and open/write) so the temp file is always removed.

tests/rhaiis/test_operators.py (1)
56-61: Test path construction relies on internal context state.

The test constructs the step directory path manually using `context.step_number` and `context.current_step_name`. This couples the test to internal implementation details of `WorkflowContext`. Consider using `context.get_step_artifact_dir("install_nfd")` directly to get the path, or capturing the return value from the fixture's call.

♻️ Suggested approach
```diff
     @pytest.fixture
     def context(self, temp_artifact_dir):
         ctx = WorkflowContext.from_environment(artifact_base=str(temp_artifact_dir))
-        ctx.get_step_artifact_dir("install_nfd")
-        return ctx
+        step_dir = ctx.get_step_artifact_dir("install_nfd")
+        ctx._test_step_dir = step_dir  # Store for test access
+        return ctx

 # Then in test:
-        step_dir = context.artifact_dir / f"{context.step_number:03d}__{context.current_step_name}"
+        step_dir = context._test_step_dir
```

Or simply call
`get_step_artifact_dir` again with the same name if it's idempotent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/rhaiis/test_operators.py` around lines 56-61: the test directly constructs the step artifact path using context.step_number and context.current_step_name which couples it to WorkflowContext internals; change the assertion to obtain the step directory via the public API instead (call context.get_step_artifact_dir("install_nfd") or use the fixture return value that provided the path) and then check that (e.g., set step_dir = context.get_step_artifact_dir("install_nfd") and use that to locate "nfd-subscription.yaml" and assert its contents), leaving WorkflowContext internals untouched.

projects/core/workflow/executor.py (2)
51-52: Unused variable `step_artifact_dir`.

`step_artifact_dir` is computed on lines 52 and 87 but never used. If this is intended to ensure the directory exists as a side effect of `get_step_artifact_dir()`, consider adding a comment to clarify. Otherwise, remove the assignment.

♻️ Option 1: Add clarifying comment if side effect is intentional
```diff
     # Get artifact directory for this step
-    step_artifact_dir = ctx.get_step_artifact_dir(step_name)
+    # Creates the step artifact directory (side effect)
+    _ = ctx.get_step_artifact_dir(step_name)
```

♻️ Option 2: Remove if truly unused
```diff
-    # Get artifact directory for this step
-    step_artifact_dir = ctx.get_step_artifact_dir(step_name)
+    # Ensure step artifact directory exists
+    ctx.get_step_artifact_dir(step_name)
```

Also applies to: 87
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/core/workflow/executor.py` around lines 51 - 52, The variable step_artifact_dir assigned from ctx.get_step_artifact_dir(step_name) is unused; either remove the assignment or make the intended side effect explicit—if calling get_step_artifact_dir(...) is meant to ensure the directory exists, replace the assignment with a call (ctx.get_step_artifact_dir(step_name)) and add a clarifying comment, otherwise delete the line(s) that assign step_artifact_dir (both occurrences) to eliminate the unused variable.
82-82: `finally_errors` list is populated but never used.

The `finally_errors` list accumulates exceptions from finally steps (lines 99, 112) but is never included in the `WorkflowResult` or used for any other purpose. Consider either using this data (e.g., logging a summary, including it in the result) or removing the collection.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/core/workflow/executor.py` at line 82: the finally_errors list in executor.py (variable finally_errors) is being appended with exceptions from the finally steps but never used; either remove the collection or surface it in the result/logs: add a field to WorkflowResult (e.g., finally_errors: list[Exception] or finally_error_summary) and populate it with the collected finally_errors before returning from the executor function, or if you prefer not to change the result type, log a concise summary (count and messages) of finally_errors via the existing logger right before returning; update uses in the function handling finally steps (where finally_errors is appended) accordingly.

projects/rhaiis/LLM_D_EXTENSIBILITY_REPORT.md (1)
22-34: Consider adding language identifiers to code blocks.

The directory structure code blocks on lines 22 and 37 are missing language specifiers. While markdown renders these fine, adding a language identifier (e.g., `text` or `plaintext`) improves linting compliance and editor syntax highlighting.

📝 Suggested fix
````diff
-```
+```text
 projects/core/
 ├── workflow/
````

Also applies to: 37-49
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/rhaiis/LLM_D_EXTENSIBILITY_REPORT.md` around lines 22-34: the markdown code blocks showing directory trees (the blocks starting with "projects/core/" in the two sections of LLM_D_EXTENSIBILITY_REPORT.md) lack language identifiers; update each triple-backtick fence to include a language token like ```text or ```plaintext so linters and editors recognize these as plain text (e.g., change ``` to ```text for both directory-structure blocks).

projects/rhaiis/IMPLEMENTATION.md (1)
7-34: Consider adding a `text` language identifier to ASCII diagram blocks.

The ASCII art diagrams render correctly but lack language specifiers, which triggers markdownlint warnings. Adding `text` or `plaintext` as the language identifier would satisfy the linter without affecting rendering.

Also applies to: 118-180, 212-257, 261-286
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/rhaiis/IMPLEMENTATION.md` around lines 7-34: add a language specifier to the ASCII diagram code fences to silence markdownlint: change the opening ``` to ```text (or ```plaintext) for the ASCII blocks that document the orchestration and workflow (the block showing test_rhaiis.py, ConfigLoader, BenchmarkWorkflow and the WorkflowStep implementations like DeployVLLM, WaitForReady, RunGuideLLM, CollectArtifacts); apply the same change to the other ASCII sections called out (lines ~118-180, 212-257, 261-286) so the diagrams keep rendering but now include the text language identifier.

projects/rhaiis/workflows/steps/cleanup.py (1)
41-51: Consider adding additional resource types to the cleanup list.

The cleanup list may miss `persistentvolumeclaim` and `job` resources that could be created during benchmark runs. Consider expanding the list to ensure complete cleanup.

🧹 Proposed fix
```diff
 resource_types = [
     "inferenceservice",
     "servingruntime",
     "deployment",
     "service",
     "route",
     "configmap",
     "secret",
     "pod",
+    "job",
+    "persistentvolumeclaim",
 ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/rhaiis/workflows/steps/cleanup.py` around lines 41-51: the cleanup resource_types list in cleanup.py currently omits PVCs and Jobs, so update the resource_types variable (used in the cleanup step) to include "persistentvolumeclaim" and "job" alongside the existing entries; ensure ordering still deletes KServe types first if needed and that any callers of resource_types (e.g., the cleanup loop/function in this module) handle these additional resource kinds.

projects/core/steps/guidellm.py (3)
3-7: Unused import: `json`.

The `json` module is imported but never used in this file.

🧹 Proposed fix
```diff
-import json
 import logging
 import subprocess
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/core/steps/guidellm.py` around lines 3 - 7, The import list at the top of guidellm.py includes an unused module `json`; remove the unused `json` import from the imports block (the line containing "import json") so only the used modules (logging, subprocess, time, uuid) remain and avoid linter warnings.
77-78: Inconsistent artifact directory construction.

The step manually constructs the artifact path using `ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}"` instead of using the context's `get_step_artifact_dir()` method. This duplicates logic and could diverge from the canonical implementation.

♻️ Proposed fix
```diff
     def execute(self, ctx: "WorkflowContext") -> StepResult:
         """Run GuideLLM benchmark as a pod."""
-        step_dir = ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}"
-        step_dir.mkdir(parents=True, exist_ok=True)
+        step_dir = ctx.get_step_artifact_dir(self.name)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/core/steps/guidellm.py` around lines 77 - 78, Replace the manual artifact directory construction that assigns step_dir = ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}" with the canonical context helper by calling ctx.get_step_artifact_dir() (or the appropriate context method) to obtain the step artifact directory; ensure you then call mkdir(parents=True, exist_ok=True) on the returned path so all existing callers (e.g., the variable step_dir and any subsequent uses) keep the same behavior but use the context-managed path.
273-275: Progress logging condition may not trigger reliably.

With `poll_interval=10` and integer division, `elapsed % 60 == 0` may not trigger exactly every minute due to timing drift. Consider tracking the last log time instead.

🛠️ Proposed fix
```diff
+    last_log_time = 0
+
     while time.monotonic() - start_time < timeout:
         try:
             # ... polling logic ...
             # Still running, no marker yet
             elapsed = int(time.monotonic() - start_time)
-            if elapsed % 60 == 0:  # Print every minute
+            if elapsed - last_log_time >= 60:  # Print every minute
                 print(f"  GuideLLM running... ({elapsed}s elapsed, phase={phase})")
+                last_log_time = elapsed
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/core/steps/guidellm.py` around lines 273-275: the current progress log uses "elapsed = int(time.monotonic() - start_time)" and checks "if elapsed % 60 == 0" which can miss minute boundaries due to timing drift with poll_interval; change to track the last logged time instead (e.g., initialize last_log_time = 0 before the loop) and inside the loop check "if elapsed - last_log_time >= 60" then print the same message (including phase and elapsed) and set last_log_time = elapsed; update occurrences around the elapsed/start_time/phase logging in the GuideLLM run loop to use this new last_log_time logic so logs reliably occur roughly every 60 seconds regardless of poll_interval.

tests/rhaiis/test_steps.py (1)
26-31: Context fixture pre-increments step counter but discards result.

The fixture calls `ctx.get_step_artifact_dir("deploy")`, presumably to simulate workflow execution, but the return value is unused. A comment explaining this side effect would improve clarity.

📝 Proposed fix
```diff
     @pytest.fixture
     def context(self, temp_artifact_dir):
         ctx = WorkflowContext.from_environment(artifact_base=str(temp_artifact_dir))
-        # Pre-increment step number to simulate workflow execution
-        ctx.get_step_artifact_dir("deploy")
+        # Pre-increment step counter to simulate workflow execution context
+        # (return value intentionally discarded - we only need the side effect)
+        _ = ctx.get_step_artifact_dir("deploy")
         return ctx
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/rhaiis/test_steps.py` around lines 26-31: the context fixture is calling ctx.get_step_artifact_dir("deploy") solely for its side effect of advancing the step counter but discards the return value; update the fixture to either store the returned path (e.g., assign to a variable) or add a clarifying comment explaining the intentional side effect so future readers know get_step_artifact_dir is being invoked to pre-increment the step counter; reference the context fixture, WorkflowContext.from_environment, and ctx.get_step_artifact_dir("deploy") when making the change.

projects/rhaiis/orchestration/ci.py (1)
24-24: Path construction using multiple parent traversals is fragile.
`Path(__file__).parent.parent.parent.parent` is brittle and hard to verify. Consider using a more explicit path resolution or a constant defined relative to the project root.

🛠️ Alternative approach
```diff
+# Find project root by looking for a marker file
+def _find_project_root() -> Path:
+    """Find project root by traversing up to find config directory."""
+    current = Path(__file__).resolve().parent
+    for _ in range(10):  # Limit traversal depth
+        if (current / "config" / "rhaiis").is_dir():
+            return current
+        current = current.parent
+    raise RuntimeError("Could not find project root with config/rhaiis directory")
+
 # Default config directory
-DEFAULT_CONFIG_DIR = Path(__file__).parent.parent.parent.parent / "config" / "rhaiis"
+DEFAULT_CONFIG_DIR = _find_project_root() / "config" / "rhaiis"
```

Alternatively, verify the path exists at module load time to fail fast if the structure changes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/rhaiis/orchestration/ci.py` at line 24: the DEFAULT_CONFIG_DIR uses brittle parent chaining (Path(__file__).parent.parent.parent.parent / "config" / "rhaiis"); replace it with a robust resolution strategy: derive the project root from a single authoritative location (e.g., a PROJECT_ROOT constant or using pathlib.Path(__file__).resolve().parents[n] with a named constant) or compute it via an environment variable, then build DEFAULT_CONFIG_DIR = PROJECT_ROOT / "config" / "rhaiis"; also add a runtime existence check for DEFAULT_CONFIG_DIR (raising a clear exception or logging an error) so module import fails fast if the expected path is missing; update references to DEFAULT_CONFIG_DIR accordingly.

projects/core/workflow/workflow.py (1)
64-74: Duplicate step names would overwrite results in the `step_results` dict.

If two steps have the same name, the executor will overwrite the first step's result in `step_results`. Consider validating uniqueness or using a list of tuples instead.

🛠️ Proposed fix: add uniqueness validation
```diff
     def add_step(self, step: WorkflowStep) -> None:
         """
         Add a step to the workflow.

         Steps run in order of registration. If a step fails,
         remaining steps are skipped and finally steps run.

         Args:
             step: WorkflowStep instance to add
+
+        Raises:
+            ValueError: If a step with the same name already exists
         """
+        existing_names = {s.name for s in self._steps} | {s.name for s in self._finally_steps}
+        if step.name in existing_names:
+            raise ValueError(f"Step with name '{step.name}' already exists")
         self._steps.append(step)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/core/workflow/workflow.py` around lines 64-74: the add_step method currently appends WorkflowStep instances into self._steps allowing duplicate step.name values which will later cause step_results to overwrite earlier results; update add_step to validate that the incoming step.name is unique among existing steps (inspect self._steps for matching WorkflowStep.name) and raise a clear ValueError (or custom exception) if a duplicate is found, so consumers must choose unique names before execution; reference add_step, WorkflowStep.name, self._steps and step_results when making the change.

projects/rhaiis/workflows/benchmark.py (1)
117-122: K8s name sanitization may produce invalid names with leading/trailing hyphens.

If the model name starts or ends with `/`, `.`, or `_`, the resulting name could have leading or trailing hyphens, which are invalid in Kubernetes resource names. Consider stripping these characters.

🛠️ Proposed fix
```diff
     @staticmethod
     def _sanitize_name(name: str) -> str:
         """Sanitize model name for K8s resource naming."""
         name = name.split("/")[-1].lower()
         name = name.replace(".", "-").replace("_", "-")
+        name = name.strip("-")
         return name[:42]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/rhaiis/workflows/benchmark.py` around lines 117-122: the _sanitize_name function can produce leading/trailing hyphens (invalid for K8s); update it to trim unwanted characters before/after substitution: remove leading/trailing "/" "." "_" from the original name (or perform replacements then strip leading/trailing "-" characters), collapse repeated hyphens if present, ensure the final string is trimmed to max 42 chars and not empty (fallback to a safe default if empty); modify the _sanitize_name implementation to perform these steps so Kubernetes names never begin or end with a hyphen.

projects/core/scenarios/config.py (1)
79-81: Sanitize name may produce consecutive hyphens.

When the input contains consecutive underscores or slashes (e.g., `model__name`), the result will have consecutive hyphens (`model--name`), which, while technically valid in Kubernetes, is unconventional.

🛠️ Proposed fix
```diff
+import re
+
     @staticmethod
     def sanitize_name(name: str, max_len: int = 42) -> str:
-        return (
-            name.lower().replace("/", "-").replace("_", "-").replace(".", "")
-        )[:max_len]
+        sanitized = name.lower().replace("/", "-").replace("_", "-").replace(".", "")
+        sanitized = re.sub(r"-+", "-", sanitized)  # Collapse consecutive hyphens
+        return sanitized.strip("-")[:max_len]
```

Note:
`re` is already imported at line 3.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/core/scenarios/config.py` around lines 79-81: the current sanitization expression (the return using name.lower().replace("/", "-").replace("_", "-").replace(".", "")) can produce consecutive hyphens; replace that expression with a regex-based collapse: use re.sub to replace any consecutive sequence of slashes, dots, or underscores with a single hyphen, then strip leading/trailing hyphens and apply the existing max_len slice. Locate and replace the expression shown (the name.lower().replace(...).replace(...).replace(...)) in the config sanitization function so the output never contains consecutive hyphens.

projects/core/utils/oc.py (1)
192-202: Separate “inherit default namespace” from “no namespace flag”.

The current API overloads `None` for both meanings. That makes cluster-scoped helpers like `create_namespace()` rely on ambiguous behavior, because `namespace=None` still falls back to `self.namespace` inside `_build_cmd()`. A private sentinel here would make the public API clearer and safer for future cluster-scoped commands.

Suggested refactor
```diff
+_INHERIT_NAMESPACE = object()
+
 class OC:
 @@
-    def _build_cmd(self, args: list[str], namespace: str | None = None) -> list[str]:
+    def _build_cmd(
+        self,
+        args: list[str],
+        namespace: str | None | object = _INHERIT_NAMESPACE,
+    ) -> list[str]:
         """Build full oc command with namespace."""
         cmd = ["oc"]

         # Use provided namespace, fall back to instance default
-        ns = namespace if namespace is not None else self.namespace
+        ns = self.namespace if namespace is _INHERIT_NAMESPACE else namespace
         if ns:
             cmd.extend(["-n", ns])
 @@
-    def run(self, *args: str, namespace: str | None = None, timeout: int | None = None, input: str | None = None) -> OCResult:
+    def run(
+        self,
+        *args: str,
+        namespace: str | None | object = _INHERIT_NAMESPACE,
+        timeout: int | None = None,
+        input: str | None = None,
+    ) -> OCResult:
 @@
-    def apply(self, *args: str, namespace: str | None = None, timeout: int | None = None, input: str | None = None) -> OCResult:
+    def apply(
+        self,
+        *args: str,
+        namespace: str | None | object = _INHERIT_NAMESPACE,
+        timeout: int | None = None,
+        input: str | None = None,
+    ) -> OCResult:
 @@
         return self.apply("-f", "-", input=yaml_content, namespace=None)
```

Also applies to: 300-313, 335-351, 514-529
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/core/utils/oc.py` around lines 192 - 202, Introduce a private sentinel to distinguish “inherit default namespace” from “no namespace flag”: define _NO_NAMESPACE = object() and change _build_cmd signature to accept namespace: str | None | object = _NO_NAMESPACE; implement logic so that when namespace is _NO_NAMESPACE it uses self.namespace (inherit), when namespace is None it omits the -n flag (cluster-scoped), and when namespace is a str it uses that value; then update callers (e.g., create_namespace() and the other helper methods that call _build_cmd referenced in the review) to pass None when they intend “no namespace flag” and leave the parameter unset or pass _NO_NAMESPACE when they intend “inherit default”.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@config/rhaiis/defaults.yaml`:
- Around line 16-18: The default config currently sets the YAML key
"trust-remote-code" to true which enables remote model code execution globally;
change the default value of the "trust-remote-code" setting to false in the
defaults.yaml so remote code execution is opt-in, and update any code paths or
config-loading logic that assume a true default (e.g., places reading
"trust-remote-code") to honor per-model overrides or an explicit allowlist so
only explicitly trusted models can enable remote code execution.
In `@config/rhaiis/workloads.yaml`:
- Around line 48-53: The heterogeneous workload contains a placeholder value for
guidellm.data ("ADD ME"); update the heterogeneous -> guidellm -> data field to
a valid dataset or config identifier used by the benchmark (or remove the
guidellm.data key if not applicable) so the "heterogeneous" workload no longer
contains an invalid placeholder and will not fail at runtime; locate the
heterogeneous section in workloads.yaml and replace "ADD ME" with the correct
dataset name expected by your runner.
In `@projects/core/ci_entrypoint/run_ci.py`:
- Around line 499-505: The CLI currently uses context_settings with
allow_interspersed_args=False while the command signature accepts a catch-all
args parameter typed click.UNPROCESSED, so options provided after positional
arguments (e.g., --dry-run) get captured into args and never parsed; fix this by
enabling interspersed option parsing (set allow_interspersed_args=True in the
click.command context_settings) so parent/global options are parsed regardless
of position, or alternatively stop using click.UNPROCESSED for the args
parameter and accept a parsed remainder (e.g., change handling of the args
parameter) so flags like --dry-run are recognized; update the
decorator/context_settings where allow_interspersed_args is declared and adjust
how the args parameter is defined/consumed accordingly.
In `@projects/core/scenarios/config_loader.py`:
- Around line 266-273: load_scenario() is currently building
data["_resolved_defaults"] by merging only get_global_defaults() and
scenario-local "defaults", which omits accelerator-specific defaults; modify the
merge so accelerator defaults from get_accelerator_defaults() are merged between
the global and scenario defaults (e.g., deep_merge(global_defaults,
accelerator_defaults, scenario_defaults)) so that get_accelerator_defaults(),
deep_merge, and data["_resolved_defaults"] are used in that order within
load_scenario().
- Around line 181-205: The deploy merge is missing accelerator-specific entries
so accelerator-level GPU/resource settings are lost; update the merge to read
accel_defaults.get("deploy", {}) and accel_overrides.get("deploy", {}) (e.g.,
accel_deploy and override_deploy) and include them when building final_deploy
(merge base_deploy, accel_deploy, model_deploy, then override_deploy) so
final_deploy reflects accelerator defaults and overrides (ensure the same
final_deploy variable used by ResolvedModelConfig.deploy/num_gpus).
In `@projects/core/scenarios/generator.py`:
- Around line 309-326: The inline model merging code currently always converts
model dicts into ModelConfig via ModelConfig.from_dict (using symbols model_key,
model_data, ModelConfig.from_dict), which drops legacy-only fields like "matrix"
and "deploy" and later causes expand() to skip those entries (see the block that
checks isinstance(ModelConfig) around expand logic). Change the merge so that if
the incoming model_data contains legacy keys (e.g., "matrix", "deploy" or other
runtime-only fields) you preserve the raw dict (or create a distinct
LegacyModelConfig dataclass) instead of converting it with
ModelConfig.from_dict; update the later expansion loop (the code that currently
skips entries that are ModelConfig instances) to only skip real fully-normalized
ModelConfig instances and to handle the preserved legacy dict/LegacyModelConfig
by expanding its matrix/deploy fields as before. Ensure you reference/remove use
of ModelConfig.from_dict for legacy cases and add handling for legacy types in
the expand() path.
In `@projects/core/steps/artifacts.py`:
- Around line 203-246: The cleanup hides failures because _delete_resource(kind,
name, ns_args) only returns False on non-zero exit and execute() never records
error details in errors; change _delete_resource to return (bool, str|None)
where the bool is success and the str carries the subprocess output/error
(prefer result.stderr if present falling back to result.stdout) when returncode
!= 0, and update execute() to capture that tuple: when success append to deleted
as before, when failure append a descriptive message (include kind, name and the
returned error string) into the errors list so real oc delete failures (auth/API
errors) are surfaced in the StepResult data/message.
In `@projects/core/workflow/executor.py`:
- Around line 54-59: The timing fields are incorrect because result.start_time
is assigned after step.execute() returns; move the start timestamp to before
execution by setting result.start_time = datetime.now(timezone.utc) (or
equivalent) immediately after setting step_start and before calling
step.execute(ctx), then invoke result = step.execute(ctx) and compute
result.duration_seconds = time.monotonic() - step_start afterward so
result.start_time reflects the real start; update the block around step_start,
step.execute, result.start_time, and result.duration_seconds accordingly (refer
to step_start, step.execute(ctx), and result.start_time).
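Since the `StepResult` does not exist until `step.execute()` returns, the equivalent fix is to capture the timestamp before executing and assign it afterward. A minimal sketch (the `Step`/`StepResult` shapes are assumptions based on the comment):

```python
import time
from datetime import datetime, timezone

def run_step(step, ctx):
    step_start = time.monotonic()
    start_time = datetime.now(timezone.utc)  # captured BEFORE execution
    result = step.execute(ctx)
    result.start_time = start_time           # reflects the real start
    result.duration_seconds = time.monotonic() - step_start
    return result
```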
In `@projects/rhaiis/IMPLEMENTATION.md`:
- Around line 469-472: Replace the hardcoded developer path used in the example
command (the string "cd /Users/memehta/workspace/forge") with a generic
placeholder or relative path; update the snippet to instruct users to "cd
<path-to-repo>" or "cd /path/to/forge" or simply "cd $REPO_ROOT" (or use "cd
$(pwd)" if running from repo) so it works for all devs, and keep the rest of the
commands (source ~/test_foo/python3_virt/bin/activate and PYTHONPATH=. python -m
pytest ...) unchanged.
In `@projects/rhaiis/orchestration/ci.py`:
- Around line 67-70: The test command currently instantiates WorkflowContext via
WorkflowContext.from_environment() and reads model =
os.environ.get("FORGE_MODEL") even for --dry-run and without validating the
model; modify the test command to (1) defer creating the WorkflowContext until
after the dry-run check (matching prepare/cleanup behavior) so no context is
created when --dry-run is true, and (2) validate model (the variable read from
FORGE_MODEL) before calling run_test — if model is None or empty, log/raise a
clear error and exit instead of passing None into run_test. Ensure you update
places that call run_test to only do so when model is present and
WorkflowContext has been created.
In `@projects/rhaiis/orchestration/cli.py`:
- Around line 154-160: Validate workload inputs before calling run_test(): if
both the single --workload (variable workload) and --workloads (variable
workloads) are provided, raise a clear error (when not dry_run) instead of
silently ignoring one; when using --workloads, build workload_list by splitting
on commas and normalizing entries with trimming and filtering out empty strings
(e.g., workload_list = [w.strip() for w in workloads.split(",") if w.strip()]);
if only the single workload is provided set workload_list = [workload]; keep the
existing skip-for-dry-run behavior (check dry_run first) and ensure all checks
happen prior to dispatching to run_test().
In `@projects/rhaiis/orchestration/test_rhaiis.py`:
- Around line 356-364: The code reads max_seconds from wl_config.guidellm which
is incorrect; max_seconds is a top-level field on the resolved workload object.
In the try block where wl_config = config_loader.load_workload(wl) is used,
change the lookup for wl_max_seconds to read wl_config.get("max_seconds", 300)
(or wl_config.max_seconds if the object exposes attributes) instead of
wl_config.guidellm.get("max_seconds", 300); keep wl_rates/wl_rate_str logic
as-is but ensure wl_rates still comes from wl_config.guidellm.get("rates", ...)
if rates are nested there.
In `@projects/rhaiis/workflows/benchmark.py`:
- Line 54: Change the environment variable key used when populating
self.vllm_image: currently the constructor sets self.vllm_image = vllm_image or
ctx.get_env("VLLM_IMAGE", ""), but tests/CLI use FORGE_VLLM_IMAGE; update the
ctx.get_env call to use "FORGE_VLLM_IMAGE" so the code reads the correct env var
(locate the assignment to self.vllm_image in the class constructor where
ctx.get_env is called).
In `@projects/rhaiis/workflows/cleanup.py`:
- Around line 19-40: The remove_operators flag is currently ignored: update
define_steps to respect self.remove_operators by either (a) raising a clear
NotImplementedError/ValueError when self.remove_operators is True to fail fast
until operator-removal is implemented, or (b) add the operator-removal step when
True; specifically modify the define_steps method to check self.remove_operators
and either raise an error with a descriptive message or append the
operator-removal step alongside the existing CleanupNamespaceStep (referencing
self.remove_operators and CleanupNamespaceStep to locate the change).
In `@projects/rhaiis/workflows/steps/deploy.py`:
- Around line 358-367: The subprocess.run that populates url_result (and other
oc subprocess.run invocations later in the file) lack timeouts and can hang; add
a timeout argument (use a sensible value or a shared constant/the outer wait
deadline) to each subprocess.run call fetching oc resources (e.g., the run that
sets url_result and the similar calls around lines 419-433), and wrap them in
try/except for subprocess.TimeoutExpired to log the timeout (include
deployment_name and namespace) and fail/raise so the workflow’s outer timeout
contract is preserved.
- Around line 73-78: The generated vLLM argument list currently reads
"tensor-parallel-size" from self.runtime_args instead of using the override
stored in self.tensor_parallel, causing a mismatch between GPU requests and vLLM
startup; update the argument-building logic (the _build_args_lines method and
the other args-construction site that mirrors lines ~239-245) to set the
"tensor-parallel-size" CLI arg from self.tensor_parallel when it is provided
(falling back to self.runtime_args only if self.tensor_parallel is falsy),
leaving other runtime_args unchanged so the manifest and vLLM use the same
tensor-parallel size.
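The timeout fix for the first deploy.py finding can be sketched as a small wrapper. `OC_TIMEOUT` and the `oc_bin` parameter are assumptions for illustration and testability:

```python
import subprocess

OC_TIMEOUT = 30  # shared timeout constant (seconds); value is illustrative

def run_oc(args: list[str], timeout: float = OC_TIMEOUT, oc_bin: str = "oc"):
    """Run an oc command with a hard timeout so polling loops cannot hang."""
    try:
        return subprocess.run(
            [oc_bin, *args], capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired as exc:
        # Callers would include deployment_name/namespace in the message.
        raise RuntimeError(
            f"'{oc_bin} {' '.join(args)}' timed out after {timeout}s"
        ) from exc
```

Raising (rather than swallowing) the timeout preserves the workflow's outer timeout contract.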
In `@projects/rhaiis/workflows/steps/operators.py`:
- Around line 23-38: The code computes step_dir and writes YAML with
yaml_path.write_text(...) but never creates the per-step artifact directory,
causing FileNotFoundError; update the operator steps to use the canonical
ctx.get_step_artifact_dir(self.name) (or call step_dir.mkdir(parents=True,
exist_ok=True) after computing step_dir) before writing files so the directory
exists; apply this change in the blocks that define step_dir and write
nfd-subscription.yaml (and analogous blocks at the other occurrences noted) to
ensure consistent step numbering and avoid write errors.
- Around line 40-44: The code currently runs a dry-run namespace creation and
then applies a YAML but ignores both command outcomes and always returns
StepResult.ok; change the sequence in the method using _run_oc so the namespace
is actually created (either remove the "--dry-run=client -o yaml" or run the
output through "oc apply -f -" so creation is applied), capture and check the
return/result of both _run_oc calls (the namespace creation/apply and the
subsequent _run_oc(["apply", "-f", str(yaml_path)])), and if either fails return
a failure StepResult instead of StepResult.ok; look for the _run_oc invocations
and StepResult.ok call to update behavior.
In `@tests/rhaiis/test_ci.py`:
- Around line 57-59: The test currently uses `assert "balanced" in result.output
or "heterogeneous" in result.output`, which lets the test pass if only one
workload is printed; change this to assert both workloads are present by using
`and` (i.e., require both `"balanced"` and `"heterogeneous"` in `result.output`)
in the same test (the assertion referencing `result`/`result.output`).
- Around line 34-35: The tests currently hardcode
FORGE_ARTIFACT_DIR="/tmp/artifacts" which can cause cross-test interference;
update the test(s) that return CliRunner(env={"FORGE_ARTIFACT_DIR":
"/tmp/artifacts"}) to accept a per-test temporary directory (e.g., pytest's
tmp_path or a tempfile.TemporaryDirectory) and set FORGE_ARTIFACT_DIR to that
path (str(tmp_path) or the temp dir path) when constructing CliRunner; locate
the return site that calls CliRunner and any helper functions that build the
runner and replace the hardcoded string with the injected temp fixture so each
test gets an isolated artifact directory.
---
Nitpick comments:
In `@projects/core/scenarios/config.py`:
- Around line 79-81: The current sanitization expression (the return using
name.lower().replace("/", "-").replace("_", "-").replace(".", "")) can produce
consecutive hyphens; replace that expression with a regex-based collapse: use
re.sub to replace any consecutive sequence of slashes, dots, or underscores with
a single hyphen, then strip leading/trailing hyphens and apply the existing
max_len slice. Locate and replace the expression shown (the
name.lower().replace(...).replace(...).replace(...)) in the config sanitization
function so the output never contains consecutive hyphens.
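The regex-based collapse suggested above is a one-liner; `max_len=42` mirrors the Kubernetes-name truncation used elsewhere in the PR and is otherwise an assumption:

```python
import re

def sanitize_name(name: str, max_len: int = 42) -> str:
    # Any run of slashes, dots, or underscores becomes a single hyphen.
    collapsed = re.sub(r"[/._]+", "-", name.lower())
    # Strip leading/trailing hyphens, then apply the length limit.
    return collapsed.strip("-")[:max_len]
```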
In `@projects/core/steps/guidellm.py`:
- Around line 3-7: The import list at the top of guidellm.py includes an unused
module `json`; remove the unused `json` import from the imports block (the line
containing "import json") so only the used modules (logging, subprocess, time,
uuid) remain and avoid linter warnings.
- Around line 77-78: Replace the manual artifact directory construction that
assigns step_dir = ctx.artifact_dir /
f"{ctx.step_number:03d}__{ctx.current_step_name}" with the canonical context
helper by calling ctx.get_step_artifact_dir() (or the appropriate context
method) to obtain the step artifact directory; ensure you then call
mkdir(parents=True, exist_ok=True) on the returned path so all existing callers
(e.g., the variable step_dir and any subsequent uses) keep the same behavior but
use the context-managed path.
- Around line 273-275: The current progress log uses "elapsed =
int(time.monotonic() - start_time)" and checks "if elapsed % 60 == 0" which can
miss minute boundaries due to timing drift with poll_interval; change to track
the last logged time instead (e.g., initialize last_log_time = 0 before the
loop) and inside the loop check "if elapsed - last_log_time >= 60" then print
the same message (including phase and elapsed) and set last_log_time = elapsed;
update occurrences around the elapsed/start_time/phase logging in the GuideLLM
run loop to use this new last_log_time logic so logs reliably occur roughly
every 60 seconds regardless of poll_interval.
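The last-logged-time pattern can be sketched as below; `is_done`, `poll_interval`, and `log_every` are stand-ins for the GuideLLM loop's real variables:

```python
import time

def wait_with_progress(is_done, poll_interval: float = 5.0,
                       log_every: int = 60) -> list[str]:
    start_time = time.monotonic()
    last_log_time = 0
    logs = []
    while not is_done():
        elapsed = int(time.monotonic() - start_time)
        if elapsed - last_log_time >= log_every:
            # Fires once the threshold is crossed, regardless of poll drift.
            logs.append(f"still waiting: {elapsed}s elapsed")
            last_log_time = elapsed
        time.sleep(poll_interval)
    return logs
```

Unlike `elapsed % 60 == 0`, this cannot skip a minute boundary when the poll interval does not divide 60 evenly.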
In `@projects/core/utils/oc.py`:
- Around line 192-202: Introduce a private sentinel to distinguish “inherit
default namespace” from “no namespace flag”: define _NO_NAMESPACE = object() and
change _build_cmd signature to accept namespace: str | None | object =
_NO_NAMESPACE; implement logic so that when namespace is _NO_NAMESPACE it uses
self.namespace (inherit), when namespace is None it omits the -n flag
(cluster-scoped), and when namespace is a str it uses that value; then update
callers (e.g., create_namespace() and the other helper methods that call
_build_cmd referenced in the review) to pass None when they intend “no namespace
flag” and leave the parameter unset or pass _NO_NAMESPACE when they intend
“inherit default”.
In `@projects/core/workflow/executor.py`:
- Around line 51-52: The variable step_artifact_dir assigned from
ctx.get_step_artifact_dir(step_name) is unused; either remove the assignment or
make the intended side effect explicit—if calling get_step_artifact_dir(...) is
meant to ensure the directory exists, replace the assignment with a call
(ctx.get_step_artifact_dir(step_name)) and add a clarifying comment, otherwise
delete the line(s) that assign step_artifact_dir (both occurrences) to eliminate
the unused variable.
- Line 82: The finally_errors list in executor.py (variable finally_errors) is
being appended with exceptions from the finally steps but never used; either
remove the collection or surface it in the result/logs: add a field to
WorkflowResult (e.g., finally_errors: list[Exception] or finally_error_summary)
and populate it with the collected finally_errors before returning from the
executor function, or if you prefer not to change the result type, log a concise
summary (count and messages) of finally_errors via the existing logger right
before returning; update uses in the function handling finally steps (where
finally_errors is appended) accordingly.
In `@projects/core/workflow/workflow.py`:
- Around line 64-74: The add_step method currently appends WorkflowStep
instances into self._steps allowing duplicate step.name values which will later
cause step_results to overwrite earlier results; update add_step to validate
that the incoming step.name is unique among existing steps (inspect self._steps
for matching WorkflowStep.name) and raise a clear ValueError (or custom
exception) if a duplicate is found, so consumers must choose unique names before
execution; reference add_step, WorkflowStep.name, self._steps and step_results
when making the change.
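A sketch of the uniqueness check; the stand-in classes below are far simpler than the real `WorkflowStep`/`Workflow`:

```python
class WorkflowStep:
    def __init__(self, name: str):
        self.name = name

class Workflow:
    def __init__(self):
        self._steps: list[WorkflowStep] = []

    def add_step(self, step: WorkflowStep) -> None:
        if any(existing.name == step.name for existing in self._steps):
            # Duplicate names would let step_results overwrite earlier results.
            raise ValueError(f"duplicate step name: {step.name!r}")
        self._steps.append(step)
```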
In `@projects/rhaiis/IMPLEMENTATION.md`:
- Around line 7-34: Add a language specifier to the ASCII diagram code fences to
silence markdownlint: change the opening ``` to ```text (or ```plaintext) for
the ASCII blocks that document the orchestration and workflow (the block showing
test_rhaiis.py, ConfigLoader, BenchmarkWorkflow and the WorkflowStep
implementations like DeployVLLM, WaitForReady, RunGuideLLM, CollectArtifacts);
apply the same change to the other ASCII sections called out (lines ~118-180,
212-257, 261-286) so the diagrams keep rendering but now include the text
language identifier.
In `@projects/rhaiis/LLM_D_EXTENSIBILITY_REPORT.md`:
- Around line 22-34: The markdown code blocks showing directory trees (the
blocks starting with "projects/core/" around the two sections in
LLM_D_EXTENSIBILITY_REPORT.md) lack language identifiers; update each
triple-backtick fence to include a language token like ```text or ```plaintext
so linters and editors recognize these as plain text (e.g., change ``` to
```text for both directory-structure blocks).
In `@projects/rhaiis/orchestration/ci.py`:
- Line 24: The DEFAULT_CONFIG_DIR uses brittle parent chaining
(Path(__file__).parent.parent.parent.parent / "config" / "rhaiis"); replace it
with a robust resolution strategy: derive the project root from a single
authoritative location (e.g., a PROJECT_ROOT constant or using
pathlib.Path(__file__).resolve().parents[n] with a named constant) or compute it
via an environment variable, then build DEFAULT_CONFIG_DIR = PROJECT_ROOT /
"config" / "rhaiis"; also add a runtime existence check for DEFAULT_CONFIG_DIR
(raising a clear exception or logging an error) so module import fails fast if
the expected path is missing; update references to DEFAULT_CONFIG_DIR
accordingly.
In `@projects/rhaiis/workflows/benchmark.py`:
- Around line 117-122: The _sanitize_name function can produce leading/trailing
hyphens (invalid for K8s); update it to trim unwanted characters before/after
substitution: remove leading/trailing "/" "." "_" from the original name (or
perform replacements then strip leading/trailing "-" characters), collapse
repeated hyphens if present, ensure the final string is trimmed to max 42 chars
and not empty (fallback to a safe default if empty); modify the _sanitize_name
implementation to perform these steps so Kubernetes names never begin or end
with a hyphen.
In `@projects/rhaiis/workflows/steps/cleanup.py`:
- Around line 41-51: The cleanup resource_types list in cleanup.py currently
omits PVCs and Jobs, so update the resource_types variable (used in the cleanup
step) to include "persistentvolumeclaim" and "job" alongside the existing
entries; ensure ordering still deletes KServe types first if needed and that any
callers of resource_types (e.g., the cleanup loop/function in this module)
handle these additional resource kinds.
In `@tests/core/scenarios/test_generator.py`:
- Line 82: Replace the Unicode multiplication sign in the comment "# 2 workloads
× 1 routing × 2 TP = 4 scenarios" with the ASCII "x" to avoid Ruff warnings;
update the comment in tests/core/scenarios/test_generator.py (the line
containing that scenario count comment) so it reads "# 2 workloads x 1 routing x
2 TP = 4 scenarios".
- Around line 60-64: The temporary YAML file created with
tempfile.NamedTemporaryFile(delete=False) in the test_generator (where
yaml.safe_dump writes and the code yields Path(f.name)) isn't explicitly
removed; either switch to pytest's tmp_path fixture to create and write the file
there, or wrap the yield in a try/finally that calls Path(f.name).unlink() to
ensure cleanup after the test completes; update the block that creates the
NamedTemporaryFile accordingly (or replace it with tmp_path.joinpath and
open/write) so the temp file is always removed.
In `@tests/core/steps/test_guidellm.py`:
- Around line 49-52: The test rebuilds the artifact path manually using
context.step_number and context.current_step_name
(step_dir/step_dir.mkdir/output_file), which couples to internal formatting;
replace that with the WorkflowContext API by calling
context.get_step_artifact_dir("benchmark") to obtain the correct step_dir, then
create the directory and write the results to that path (output_file) using the
returned Path; update references to step_dir/output_file accordingly (remove
manual formatting with
f"{context.step_number:03d}__{context.current_step_name}").
In `@tests/rhaiis/test_operators.py`:
- Around line 56-61: The test directly constructs the step artifact path using
context.step_number and context.current_step_name which couples it to
WorkflowContext internals; change the assertion to obtain the step directory via
the public API instead (call context.get_step_artifact_dir("install_nfd") or use
the fixture return value that provided the path) and then check that (e.g., set
step_dir = context.get_step_artifact_dir("install_nfd") and use that to locate
"nfd-subscription.yaml" and assert its contents), leaving WorkflowContext
internals untouched.
In `@tests/rhaiis/test_steps.py`:
- Around line 26-31: The context fixture is calling
ctx.get_step_artifact_dir("deploy") solely for its side-effect of advancing the
step counter but discards the return value; update the fixture to either store
the returned path (e.g., assign to a variable) or add a clarifying comment
explaining the intentional side-effect so future readers know
get_step_artifact_dir is being invoked to pre-increment the step counter;
reference the context fixture, WorkflowContext.from_environment, and
ctx.get_step_artifact_dir("deploy") when making the change.
In `@tests/rhaiis/test_workflows.py`:
- Around line 15-23: The temp_artifact_dir and context fixtures are duplicated
across test classes; move them to module-level by defining pytest fixtures named
temp_artifact_dir and context at top of the test module and reuse them in tests
instead of redefining inside each class; ensure the context fixture still uses
WorkflowContext.from_environment(artifact_base=str(temp_artifact_dir)) and
yields Path(tmpdir) semantics for temp_artifact_dir so existing tests
referencing those fixture names continue to work.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 028d9622-0f02-447a-aaa5-eaa381df5c94
📒 Files selected for processing (51)
- config/rhaiis/defaults.yaml
- config/rhaiis/models.yaml
- config/rhaiis/workloads.yaml
- projects/core/ci_entrypoint/run_ci.py
- projects/core/scenarios/__init__.py
- projects/core/scenarios/config.py
- projects/core/scenarios/config_loader.py
- projects/core/scenarios/generator.py
- projects/core/steps/__init__.py
- projects/core/steps/artifacts.py
- projects/core/steps/guidellm.py
- projects/core/utils/__init__.py
- projects/core/utils/oc.py
- projects/core/workflow/__init__.py
- projects/core/workflow/context.py
- projects/core/workflow/executor.py
- projects/core/workflow/step.py
- projects/core/workflow/workflow.py
- projects/rhaiis/IMPLEMENTATION.md
- projects/rhaiis/LLM_D_EXTENSIBILITY_REPORT.md
- projects/rhaiis/__init__.py
- projects/rhaiis/orchestration/__init__.py
- projects/rhaiis/orchestration/ci.py
- projects/rhaiis/orchestration/cli.py
- projects/rhaiis/orchestration/test_rhaiis.py
- projects/rhaiis/workflows/__init__.py
- projects/rhaiis/workflows/benchmark.py
- projects/rhaiis/workflows/cleanup.py
- projects/rhaiis/workflows/prepare.py
- projects/rhaiis/workflows/steps/__init__.py
- projects/rhaiis/workflows/steps/cleanup.py
- projects/rhaiis/workflows/steps/deploy.py
- projects/rhaiis/workflows/steps/operators.py
- tests/__init__.py
- tests/core/__init__.py
- tests/core/scenarios/__init__.py
- tests/core/scenarios/test_config_loader.py
- tests/core/scenarios/test_generator.py
- tests/core/steps/__init__.py
- tests/core/steps/test_artifacts.py
- tests/core/steps/test_guidellm.py
- tests/core/utils/__init__.py
- tests/core/utils/test_oc.py
- tests/core/workflow/__init__.py
- tests/core/workflow/test_context.py
- tests/core/workflow/test_executor.py
- tests/rhaiis/__init__.py
- tests/rhaiis/test_ci.py
- tests/rhaiis/test_operators.py
- tests/rhaiis/test_steps.py
- tests/rhaiis/test_workflows.py
```yaml
gpu-memory-utilization: 0.9
trust-remote-code: true
disable-log-requests: true
```
Defaulting trust-remote-code to true is a security risk.
Line [17] enables remote model code execution globally. This should default to false and be opt-in per trusted model.
🔐 Suggested change
```diff
 vllm_args:
   gpu-memory-utilization: 0.9
-  trust-remote-code: true
+  trust-remote-code: false
   disable-log-requests: true
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
gpu-memory-utilization: 0.9
trust-remote-code: false
disable-log-requests: true
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@config/rhaiis/defaults.yaml` around lines 16 - 18, The default config
currently sets the YAML key "trust-remote-code" to true which enables remote
model code execution globally; change the default value of the
"trust-remote-code" setting to false in the defaults.yaml so remote code
execution is opt-in, and update any code paths or config-loading logic that
assume a true default (e.g., places reading "trust-remote-code") to honor
per-model overrides or an explicit allowlist so only explicitly trusted models
can enable remote code execution.
```yaml
heterogeneous:
  name: "Heterogeneous"
  description: "Mixed token distributions simulating real traffic"
  guidellm:
    data: "ADD ME"
    rate_type: "concurrent"
```
heterogeneous workload still contains a placeholder payload.
Line [52] uses "ADD ME", which is not a valid benchmark dataset/config value and can cause runtime failure if selected.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@config/rhaiis/workloads.yaml` around lines 48 - 53, The heterogeneous
workload contains a placeholder value for guidellm.data ("ADD ME"); update the
heterogeneous -> guidellm -> data field to a valid dataset or config identifier
used by the benchmark (or remove the guidellm.data key if not applicable) so the
"heterogeneous" workload no longer contains an invalid placeholder and will not
fail at runtime; locate the heterogeneous section in workloads.yaml and replace
"ADD ME" with the correct dataset name expected by your runner.
```diff
 @click.command(context_settings=dict(
     ignore_unknown_options=True,
     allow_interspersed_args=False,
 ))
 @click.argument('project', required=False)
 @click.argument('operation', required=False)
-@click.argument('args', nargs=-1)
+@click.argument('args', nargs=-1, type=click.UNPROCESSED)
```
🧩 Analysis chain
🌐 Web query:
In Click 8.x, if a command uses context_settings={"allow_interspersed_args": False, "ignore_unknown_options": True} and click.argument("args", nargs=-1, type=click.UNPROCESSED), are options placed after positional args (for example: run project op --dry-run) parsed as top-level options or passed through as arguments?
💡 Result:
In Click 8.x, with context_settings={"allow_interspersed_args": False, "ignore_unknown_options": True} and a command using @click.argument("args", nargs=-1, type=click.UNPROCESSED), options placed after positional args (e.g., "run project op --dry-run") are passed through as arguments to the "args" parameter. The parser first processes all options until it encounters the first non-option argument ("project"). With allow_interspersed_args=False, it then stops processing options and treats all remaining tokens (including "--dry-run") as positional arguments for the nargs=-1 argument. The ignore_unknown_options=True ensures that any unknown options before the first positional are ignored and moved to largs (positionals), but since "--dry-run" appears after, it is not processed as an option and goes directly to args unprocessed.
Citations:
- 1: https://click.palletsprojects.com/en/8.0.x/advanced/
- 2: https://click.palletsprojects.com/en/stable/advanced/
- 3: https://github.com/pallets/click/blob/cdab890e/src/click/parser.py
- 4: https://github.com/pallets/click/blob/cdab890e/docs/advanced.md
🏁 Script executed:
```shell
# First, let's locate and examine the file
find . -name "run_ci.py" -type f
```
Repository: openshift-psap/forge
Length of output: 103
🏁 Script executed:
```shell
# Once we find the file, let's examine the lines around 499-505 and look for --dry-run definition
head -530 projects/core/ci_entrypoint/run_ci.py | tail -50
```
Repository: openshift-psap/forge
Length of output: 1563
🏁 Script executed:
```shell
# Let's search for where --dry-run is defined in this file
grep -n "dry.run\|dry-run" projects/core/ci_entrypoint/run_ci.py
```
Repository: openshift-psap/forge
Length of output: 544
🏁 Script executed:
```shell
# Let's also check the full Click command definition and its context
sed -n '450,550p' projects/core/ci_entrypoint/run_ci.py
```
Repository: openshift-psap/forge
Length of output: 3307
--dry-run can be silently ignored when placed after positional args.
With allow_interspersed_args=False (line 501) and UNPROCESSED argument type (line 505), a call like run project operation --dry-run passes --dry-run directly to the args parameter instead of parsing it as the parent option. The command executes with dry_run=False even though the user provided the flag, potentially running real operations when expecting a dry-run.
🔧 Proposed fix
```diff
 @click.command(context_settings=dict(
     ignore_unknown_options=True,
-    allow_interspersed_args=False,
+    allow_interspersed_args=True,
 ))
 @click.argument('project', required=False)
 @click.argument('operation', required=False)
 @click.argument('args', nargs=-1, type=click.UNPROCESSED)
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
@click.command(context_settings=dict(
    ignore_unknown_options=True,
    allow_interspersed_args=True,
))
@click.argument('project', required=False)
@click.argument('operation', required=False)
@click.argument('args', nargs=-1, type=click.UNPROCESSED)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/core/ci_entrypoint/run_ci.py` around lines 499 - 505, The CLI
currently uses context_settings with allow_interspersed_args=False while the
command signature accepts a catch-all args parameter typed click.UNPROCESSED, so
options provided after positional arguments (e.g., --dry-run) get captured into
args and never parsed; fix this by enabling interspersed option parsing (set
allow_interspersed_args=True in the click.command context_settings) so
parent/global options are parsed regardless of position, or alternatively stop
using click.UNPROCESSED for the args parameter and accept a parsed remainder
(e.g., change handling of the args parameter) so flags like --dry-run are
recognized; update the decorator/context_settings where allow_interspersed_args
is declared and adjust how the args parameter is defined/consumed accordingly.
```python
# Merge accelerator defaults
accel_defaults = self.get_accelerator_defaults()
accel_vllm_args = accel_defaults.get("vllm_args", {})
accel_env_vars = accel_defaults.get("env_vars", {})

# Merge model config
model_deploy = raw_config.get("deploy", {})
model_vllm_args = raw_config.get("vllm_args", {})
model_env_vars = raw_config.get("env_vars", {})

# Merge accelerator overrides from model
accel_overrides = raw_config.get("accelerator_overrides", {}).get(self.accelerator, {})
override_vllm_args = accel_overrides.get("vllm_args", {})
override_env_vars = accel_overrides.get("env_vars", {})

# Build final config through inheritance chain
final_deploy = deep_merge(base_deploy, model_deploy)
final_vllm_args = deep_merge(
    deep_merge(deep_merge(base_vllm_args, accel_vllm_args), model_vllm_args),
    override_vllm_args,
)
final_env_vars = deep_merge(
    deep_merge(accel_env_vars, model_env_vars),
    override_env_vars,
)
```
Merge accelerator-specific deploy settings into final_deploy.
load_model() applies accelerator inheritance for vllm_args and env_vars, but not for deploy. That drops accelerator-specific GPU/resource settings from both ResolvedModelConfig.deploy and ResolvedModelConfig.num_gpus, even though the module-level resolution order says they should win.
Suggested fix

      # Merge accelerator defaults
      accel_defaults = self.get_accelerator_defaults()
    + accel_deploy = accel_defaults.get("deploy", {})
      accel_vllm_args = accel_defaults.get("vllm_args", {})
      accel_env_vars = accel_defaults.get("env_vars", {})
    @@
      # Merge accelerator overrides from model
      accel_overrides = raw_config.get("accelerator_overrides", {}).get(self.accelerator, {})
    + override_deploy = accel_overrides.get("deploy", {})
      override_vllm_args = accel_overrides.get("vllm_args", {})
      override_env_vars = accel_overrides.get("env_vars", {})

      # Build final config through inheritance chain
    - final_deploy = deep_merge(base_deploy, model_deploy)
    + final_deploy = deep_merge(
    +     deep_merge(deep_merge(base_deploy, accel_deploy), model_deploy),
    +     override_deploy,
    + )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/core/scenarios/config_loader.py` around lines 181 - 205, The deploy
merge is missing accelerator-specific entries so accelerator-level GPU/resource
settings are lost; update the merge to read accel_defaults.get("deploy", {}) and
accel_overrides.get("deploy", {}) (e.g., accel_deploy and override_deploy) and
include them when building final_deploy (merge base_deploy, accel_deploy,
model_deploy, then override_deploy) so final_deploy reflects accelerator
defaults and overrides (ensure the same final_deploy variable used by
ResolvedModelConfig.deploy/num_gpus).
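Both config_loader findings hinge on the same recursive merge. A minimal sketch, assuming deep_merge is a last-writer-wins recursive dict merge (which is what the suggested fixes rely on; the toy keys and values are illustrative):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base; later values win, nested dicts merge key-by-key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Illustrative layers: defaults -> accelerator -> model
base = {"num_gpus": 1, "image": "base"}
accel = {"num_gpus": 4}          # accelerator defaults must participate...
model = {"image": "model"}       # ...before model-level settings are applied
final = deep_merge(deep_merge(base, accel), model)
print(final)  # {'num_gpus': 4, 'image': 'model'}
```

Skipping the accel layer, as the current code does for deploy, would leave num_gpus at the base value.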
    # Merge scenario defaults with global defaults
    global_defaults = self.get_global_defaults()
    scenario_defaults = data.get("defaults", {})
    data["_resolved_defaults"] = deep_merge(global_defaults, scenario_defaults)

    # Add accelerator info
    data["_accelerator"] = self.accelerator
    data["_accelerator_config"] = self.get_accelerator_defaults()
load_scenario() is skipping accelerator defaults.
_resolved_defaults currently merges only defaults.defaults and the scenario-local defaults. That means accelerator-specific defaults from defaults.yaml.accelerators[accelerator] never participate, which disagrees with the documented resolution order and can make scenario expansion differ from model resolution.
Suggested fix

      # Merge scenario defaults with global defaults
      global_defaults = self.get_global_defaults()
    + accelerator_defaults = self.get_accelerator_defaults()
      scenario_defaults = data.get("defaults", {})
    - data["_resolved_defaults"] = deep_merge(global_defaults, scenario_defaults)
    + data["_resolved_defaults"] = deep_merge(
    +     deep_merge(global_defaults, accelerator_defaults),
    +     scenario_defaults,
    + )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/core/scenarios/config_loader.py` around lines 266 - 273,
load_scenario() is currently building data["_resolved_defaults"] by merging only
get_global_defaults() and scenario-local "defaults", which omits
accelerator-specific defaults; modify the merge so accelerator defaults from
get_accelerator_defaults() are merged between the global and scenario defaults
(e.g., deep_merge(global_defaults, accelerator_defaults, scenario_defaults)) so
that get_accelerator_defaults(), deep_merge, and data["_resolved_defaults"] are
used in that order within load_scenario().
    url_result = subprocess.run(
        [
            "oc", "get", "inferenceservice",
            self.deployment_name,
            "-n", self.namespace,
            "-o", "jsonpath={.status.url}",
        ],
        capture_output=True,
        text=True,
    )
Add timeouts to the remaining oc subprocesses.
These calls have no timeout, so the step can hang indefinitely after readiness is reached or even after the main wait deadline expires while collecting debug info. That bypasses the outer timeout contract for the workflow.
Suggested fix

      url_result = subprocess.run(
          [
              "oc", "get", "inferenceservice",
              self.deployment_name,
              "-n", self.namespace,
              "-o", "jsonpath={.status.url}",
          ],
          capture_output=True,
          text=True,
    +     timeout=15,
      )
    @@
      result = subprocess.run(
          ["oc", "get", "inferenceservice", self.deployment_name, "-n", self.namespace, "-o", "yaml"],
          capture_output=True,
          text=True,
    +     timeout=15,
      )
    @@
      result = subprocess.run(
          ["oc", "get", "pods", "-l", f"serving.kserve.io/inferenceservice={self.deployment_name}",
           "-n", self.namespace],
          capture_output=True,
          text=True,
    +     timeout=15,
      )

Also applies to: 419-433
🧰 Tools
🪛 Ruff (0.15.9)
[error] 358-358: subprocess call: check for execution of untrusted input
(S603)
[error] 359-364: Starting a process with a partial executable path
(S607)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/rhaiis/workflows/steps/deploy.py` around lines 358 - 367, The
subprocess.run that populates url_result (and other oc subprocess.run
invocations later in the file) lack timeouts and can hang; add a timeout
argument (use a sensible value or a shared constant/the outer wait deadline) to
each subprocess.run call fetching oc resources (e.g., the run that sets
url_result and the similar calls around lines 419-433), and wrap them in
try/except for subprocess.TimeoutExpired to log the timeout (include
deployment_name and namespace) and fail/raise so the workflow’s outer timeout
contract is preserved.
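A minimal, portable sketch of the bounded-subprocess pattern this fix asks for (sleep and echo stand in for oc, which is not assumed to be installed):

```python
import subprocess

def run_cmd_with_timeout(cmd, timeout):
    """Run a command but never block longer than `timeout` seconds."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None  # caller can turn this into a failed StepResult instead of hanging

# A hung call is cut off instead of stalling the workflow...
hung = run_cmd_with_timeout(["sleep", "5"], timeout=0.2)
# ...while a fast call completes normally.
quick = run_cmd_with_timeout(["echo", "ok"], timeout=5)
print(hung is None, quick.stdout.strip())
```

In the real step the TimeoutExpired branch should also log deployment_name and namespace, as the agent prompt above suggests.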
    step_dir = ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}"

    subscription_yaml = """
    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: nfd
      namespace: openshift-nfd
    spec:
      channel: stable
      name: nfd
      source: redhat-operators
      sourceNamespace: openshift-marketplace
    """
    yaml_path = step_dir / "nfd-subscription.yaml"
    yaml_path.write_text(subscription_yaml)
Create the per-step artifact directory before writing YAML.
These methods derive step_dir from mutable context state, but they never create it. yaml_path.write_text(...) will fail with FileNotFoundError unless some caller pre-created the numbered directory first. Reusing ctx.get_step_artifact_dir(self.name) here would also keep step numbering consistent with the rest of the workflow code.
Suggested fix

    - step_dir = ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}"
    + step_dir = ctx.get_step_artifact_dir(self.name)

Also applies to: 69-84, 113-131
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/rhaiis/workflows/steps/operators.py` around lines 23 - 38, The code
computes step_dir and writes YAML with yaml_path.write_text(...) but never
creates the per-step artifact directory, causing FileNotFoundError; update the
operator steps to use the canonical ctx.get_step_artifact_dir(self.name) (or
call step_dir.mkdir(parents=True, exist_ok=True) after computing step_dir)
before writing files so the directory exists; apply this change in the blocks
that define step_dir and write nfd-subscription.yaml (and analogous blocks at
the other occurrences noted) to ensure consistent step numbering and avoid write
errors.
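If ctx.get_step_artifact_dir is not available, the fallback is a one-liner; a self-contained sketch of the mkdir-before-write pattern (directory and file names are illustrative):

```python
import tempfile
from pathlib import Path

artifact_root = Path(tempfile.mkdtemp())
step_dir = artifact_root / "001__install_nfd"  # illustrative step name

# parents=True creates intermediate dirs; exist_ok=True makes repeated calls safe
step_dir.mkdir(parents=True, exist_ok=True)
(step_dir / "nfd-subscription.yaml").write_text("kind: Subscription\n")
print((step_dir / "nfd-subscription.yaml").exists())
```

Without the mkdir, write_text raises FileNotFoundError for any step whose numbered directory was not pre-created.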
    # Create namespace first
    self._run_oc(["create", "namespace", "openshift-nfd", "--dry-run=client", "-o", "yaml"])
    self._run_oc(["apply", "-f", str(yaml_path)])

    return StepResult.ok("NFD operator subscription created")
Don’t return success when the NFD install commands were never applied.
Line 41 uses --dry-run=client -o yaml without the follow-up oc apply -f -, so the namespace creation is a no-op. Both command results are then ignored, and Line 44 still returns success. That can make the prepare phase look green even though NFD was not installed at all.
Suggested fix

    - # Create namespace first
    - self._run_oc(["create", "namespace", "openshift-nfd", "--dry-run=client", "-o", "yaml"])
    - self._run_oc(["apply", "-f", str(yaml_path)])
    -
    - return StepResult.ok("NFD operator subscription created")
    + if not self._run_oc(["create", "namespace", "openshift-nfd"]):
    +     return StepResult.fail("Failed to create openshift-nfd namespace")
    + if not self._run_oc(["apply", "-f", str(yaml_path)]):
    +     return StepResult.fail("Failed to apply NFD subscription")
    + return StepResult.ok("NFD operator subscription created")

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/rhaiis/workflows/steps/operators.py` around lines 40 - 44, The code
currently runs a dry-run namespace creation and then applies a YAML but ignores
both command outcomes and always returns StepResult.ok; change the sequence in
the method using _run_oc so the namespace is actually created (either remove the
"--dry-run=client -o yaml" or run the output through "oc apply -f -" so creation
is applied), capture and check the return/result of both _run_oc calls (the
namespace creation/apply and the subsequent _run_oc(["apply", "-f",
str(yaml_path)])), and if either fails return a failure StepResult instead of
StepResult.ok; look for the _run_oc invocations and StepResult.ok call to update
behavior.
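The check-every-result half of the fix generalizes beyond oc; a sketch using true/false as stand-ins for oc commands (apply_all_or_fail is a hypothetical helper, not the PR's _run_oc):

```python
import subprocess

def apply_all_or_fail(commands):
    """Run each command, stopping with a failure message on the first non-zero exit."""
    for cmd in commands:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return False, f"{' '.join(cmd)} failed (rc={result.returncode})"
    return True, "all commands applied"

# The failing middle command short-circuits; the echo is never attempted.
ok, msg = apply_all_or_fail([["true"], ["false"], ["echo", "never runs"]])
print(ok, msg)
good, _ = apply_all_or_fail([["true"]])
print(good)
```

Returning StepResult.ok unconditionally, as the current code does, hides exactly this class of failure.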
    return CliRunner(env={"FORGE_ARTIFACT_DIR": "/tmp/artifacts"})
Avoid shared hardcoded artifact path in tests.
Line [34], Line [40], and Line [53] use a fixed /tmp/artifacts path, which can cause cross-test interference when tests run concurrently. Prefer a per-test temp path fixture.
💡 Suggested change

      class TestCiTest:
    @@
          @pytest.fixture
    -     def runner(self):
    -         return CliRunner(env={"FORGE_ARTIFACT_DIR": "/tmp/artifacts"})
    +     def runner(self, tmp_path):
    +         return CliRunner(env={"FORGE_ARTIFACT_DIR": str(tmp_path / "artifacts")})
    @@
    -     env={"FORGE_MODEL": "test/model", "FORGE_ARTIFACT_DIR": "/tmp/artifacts"}
    +     env={"FORGE_MODEL": "test/model", "FORGE_ARTIFACT_DIR": str(tmp_path / "artifacts")}
    @@
    -     "FORGE_ARTIFACT_DIR": "/tmp/artifacts",
    +     "FORGE_ARTIFACT_DIR": str(tmp_path / "artifacts"),

Also applies to: 40-41, 50-54
🧰 Tools
🪛 Ruff (0.15.9)
[error] 34-34: Probable insecure usage of temporary file or directory: "/tmp/artifacts"
(S108)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/rhaiis/test_ci.py` around lines 34 - 35, The tests currently hardcode
FORGE_ARTIFACT_DIR="/tmp/artifacts" which can cause cross-test interference;
update the test(s) that return CliRunner(env={"FORGE_ARTIFACT_DIR":
"/tmp/artifacts"}) to accept a per-test temporary directory (e.g., pytest's
tmp_path or a tempfile.TemporaryDirectory) and set FORGE_ARTIFACT_DIR to that
path (str(tmp_path) or the temp dir path) when constructing CliRunner; locate
the return site that calls CliRunner and any helper functions that build the
runner and replace the hardcoded string with the injected temp fixture so each
test gets an isolated artifact directory.
    assert result.exit_code == 0
    assert "balanced" in result.output or "heterogeneous" in result.output
Strengthen workload assertion for full parsing coverage.
Line [58] uses or, so the test passes even if only one workload is surfaced. If the CLI is expected to reflect both inputs, assert both.
💡 Suggested change

    - assert "balanced" in result.output or "heterogeneous" in result.output
    + assert "balanced" in result.output
    + assert "heterogeneous" in result.output
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/rhaiis/test_ci.py` around lines 57 - 59, The test currently uses
`assert "balanced" in result.output or "heterogeneous" in result.output`, which
lets the test pass if only one workload is printed; change this to assert both
workloads are present by using `and` (i.e., require both `"balanced"` and
`"heterogeneous"` in `result.output`) in the same test (the assertion
referencing `result`/`result.output`).
PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
This PR only supports the Forge test workflow: it deploys vLLM, runs the benchmark, and cleans up. It does NOT include:
- prepare workflow (operator installation) - stubbed only
- cleanup workflow - basic implementation
- llm-d integration - design doc included, implementation separate
How It Works
ConfigLoader resolves the model config via an inheritance chain:
defaults.yaml → accelerator[nvidia|amd] → model → model.accelerator_overrides
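That resolution order can be sketched as follows (toy keys and a shallow merge for brevity; the real loader deep-merges nested dicts):

```python
def resolve(*layers):
    """Merge config layers left to right; later layers win key-by-key."""
    cfg = {}
    for layer in layers:
        cfg.update(layer)
    return cfg

# Illustrative values only — not the repo's actual defaults
cfg = resolve(
    {"max_model_len": 4096, "dtype": "auto"},   # defaults.yaml
    {"dtype": "bfloat16"},                      # accelerator[nvidia]
    {"max_model_len": 8192},                    # model
    {"gpu_memory_utilization": 0.95},           # model.accelerator_overrides
)
print(cfg)
```

Each later layer overrides only the keys it sets, so the model's max_model_len wins while the accelerator's dtype survives.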
BenchmarkWorkflow executes steps sequentially:
DeployVLLM → WaitForReady → RunGuideLLM → (finally) CollectArtifacts → Cleanup
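The step/finally ordering above can be sketched like this (names are illustrative; the PR's Workflow/Executor classes are richer):

```python
def run_workflow(steps, finally_steps):
    """Run steps in order, stopping on the first failure; finally-steps always run."""
    results = []
    try:
        for name, step in steps:
            ok = step()
            results.append((name, ok))
            if not ok:
                break  # stop the main sequence on first failure
    finally:
        # artifact collection and cleanup run even after a failure
        for name, step in finally_steps:
            results.append((name, step()))
    return results

results = run_workflow(
    steps=[("deploy_vllm", lambda: True),
           ("wait_for_ready", lambda: True),
           ("run_guidellm", lambda: False)],   # simulate a failed benchmark
    finally_steps=[("collect_artifacts", lambda: True),
                   ("cleanup", lambda: True)],
)
print(results)
```

Even with the simulated benchmark failure, CollectArtifacts and Cleanup still execute, which is the deploy-once pattern's safety net.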
    # Single model + workload
    KUBECONFIG=~/kubeconfigs/my-cluster
    PYTHONPATH=. python3 projects/rhaiis/orchestration/cli.py test --model qwen-0.6b --workload balanced --accelerator nvidia

    # Multiple workloads (deploy-once pattern)
    PYTHONPATH=. python3 projects/rhaiis/orchestration/cli.py test --model llama-3.3-70b-fp8 --workloads balanced,short,long-prompt
Summary
Changes
- forge rhaiis test entrypoint - Main orchestration entry point
- IMPLEMENTATION.md - Class diagrams and architecture documentation

Reopened from #13 (moved to fork)
Summary by CodeRabbit
Release Notes
New Features
Documentation
Tests