test: LoRA testing with k8s #4991
base: main
Conversation
Walkthrough

This pull request adds LoRA (Low-Rank Adaptation) support to the fault tolerance testing suite. Changes include new checker classes for LoRA validation, factory logic to detect and integrate LoRA scenarios, two Kubernetes deployment manifests for LoRA aggregated and disaggregated topologies, and corresponding test configurations with LoRA-specific load profiles.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
Pre-merge checks
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
🧹 Nitpick comments (3)
tests/fault_tolerance/deploy/templates/vllm/lora_disagg.yaml (1)
93-167: Consider reducing YAML duplication. The environment variables and extraPodSpec sections are nearly identical between VllmDecodeWorker and VllmPrefillWorker. While explicit configuration is acceptable for test manifests, you could use YAML anchors to reduce duplication if maintaining these files becomes cumbersome.
Example with anchors:
```yaml
x-common-envs: &common-envs
  - name: DYN_SYSTEM_ENABLED
    value: "true"
  # ... rest of common envs

VllmDecodeWorker:
  envs: *common-envs
  # ...
```

tests/fault_tolerance/deploy/lora_checker.py (2)
42-49: Unused parameter `lora_name`. The `lora_name` parameter is stored but never used in the `check()` method. Either remove it or add a TODO comment indicating planned usage.

```diff
     def __init__(self, lora_name: Optional[str] = None):
         """Initialize LoRA discovery checker.

         Args:
             lora_name: Optional specific LoRA name to check for
         """
         super().__init__(name="LoRADiscoveryChecker")
-        self.lora_name = lora_name
+        self.lora_name = lora_name  # TODO: Use for specific LoRA validation
```
51-111: Placeholder checker always passes. The `LoRADiscoveryChecker.check()` method currently only logs information and always passes. While this is documented as a placeholder, consider whether tests relying on this checker will provide meaningful validation. At minimum, you might want to raise an assertion or log a more prominent warning if critical conditions are missing (e.g., `context.deployment` is None).

Would you like me to help create an issue to track the full etcd-based implementation, or suggest a minimal assertion to add now?
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- tests/fault_tolerance/deploy/checker_factory.py (3 hunks)
- tests/fault_tolerance/deploy/lora_checker.py (1 hunks)
- tests/fault_tolerance/deploy/scenarios.py (4 hunks)
- tests/fault_tolerance/deploy/templates/vllm/lora_agg.yaml (1 hunks)
- tests/fault_tolerance/deploy/templates/vllm/lora_disagg.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-04T19:03:06.643Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 2872
File: examples/multimodal/deploy/agg_qwen.yaml:53-60
Timestamp: 2025-09-04T19:03:06.643Z
Learning: In the dynamo repository, Kubernetes Custom Resources use `gpu: "1"` format for GPU resource limits and requests, not the standard Kubernetes `nvidia.com/gpu: 1` format. This applies to DynamoGraphDeployment resources and other dynamo CRs.
Applied to files:
tests/fault_tolerance/deploy/templates/vllm/lora_agg.yaml
tests/fault_tolerance/deploy/templates/vllm/lora_disagg.yaml
📚 Learning: 2025-09-24T17:50:00.970Z
Learnt from: tzulingk
Repo: ai-dynamo/dynamo PR: 3194
File: tests/utils/managed_deployment.py:555-559
Timestamp: 2025-09-24T17:50:00.970Z
Learning: In tests/utils/managed_deployment.py, when handling None service_name in get_service method, prefer setting service_name to empty string "" rather than defaulting to frontend_service_name, to avoid confusion per user tzulingk's preference.
Applied to files:
tests/fault_tolerance/deploy/checker_factory.py
🧬 Code graph analysis (2)
tests/fault_tolerance/deploy/lora_checker.py (2)
tests/fault_tolerance/deploy/base_checker.py (2)
BaseChecker (57-105), ValidationContext (33-54)

tests/fault_tolerance/deploy/test_deployment.py (1)
scenario(32-57)
tests/fault_tolerance/deploy/checker_factory.py (1)
tests/fault_tolerance/deploy/lora_checker.py (2)
LoRADiscoveryChecker (32-111), LoRAInferenceChecker (114-165)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: operator (arm64)
- GitHub Check: operator (amd64)
- GitHub Check: sglang (arm64)
- GitHub Check: vllm (arm64)
- GitHub Check: trtllm (arm64)
- GitHub Check: trtllm (amd64)
- GitHub Check: sglang (amd64)
- GitHub Check: vllm (amd64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (12)
tests/fault_tolerance/deploy/checker_factory.py (3)
37-40: LGTM!New imports for LoRA-specific checkers are correctly added and align with the new
lora_checker.pymodule.
71-78: LGTM!

The LoRA scenario detection using the `-lora-` pattern match is consistent with the naming convention in `scenarios.py`. Prepending LoRA-specific checkers before the scenario and results checkers is appropriate for validating LoRA discovery early in the validation flow.
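The substring check described above can be sketched as a one-line helper; the function name is hypothetical, but the `-lora-` naming convention comes from this review:

```python
def is_lora_scenario(deployment_name: str) -> bool:
    """Detect LoRA scenarios by the '-lora-' segment in the deployment name."""
    return "-lora-" in deployment_name
```

Anchoring on the surrounded `-lora-` segment (rather than a bare `lora` substring) avoids false positives on model names that merely contain the letters.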
151-156: LGTM!

The conditional logic correctly distinguishes between LoRA aggregated deployments (using `VllmWorker`) and other deployments including LoRA disaggregated (using `VllmDecodeWorker`). This aligns with the worker naming in `lora_agg.yaml` and `lora_disagg.yaml`.

tests/fault_tolerance/deploy/templates/vllm/lora_agg.yaml (2)
17-18: Verify the image tag placeholder. The image tag `my-tag` appears to be a placeholder. Confirm this is intentionally parameterized or replaced during test execution; otherwise, update to a valid tag or use a template variable.
19-91: LGTM!The VllmWorker configuration is well-structured for LoRA testing:
- GPU resource format `gpu: "1"` follows dynamo CR conventions (based on learnings)
- LoRA-specific environment variables are properly configured
- AWS credentials are marked optional, handling missing secrets gracefully
- Debug logging settings are appropriate for test scenarios
tests/fault_tolerance/deploy/templates/vllm/lora_disagg.yaml (1)
32-92: LGTM with minor suggestion. VllmDecodeWorker configuration is well-structured with proper LoRA environment variables and resource settings.
tests/fault_tolerance/deploy/scenarios.py (5)
508-542: LGTM!

The `_create_lora_deployments_for_backend` function follows the established pattern from `_create_moe_deployments_for_backend`. The deployment naming convention `{backend}-lora-{deploy_type}-tp-{tp_size}-replicas-{replicas}` is consistent and integrates well with the LoRA detection logic in `checker_factory.py`.
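A builder mirroring the quoted naming convention might look like this (the function name is illustrative, not from the PR):

```python
def lora_deployment_name(backend: str, deploy_type: str, tp_size: int, replicas: int) -> str:
    """Build a deployment name following the {backend}-lora-{deploy_type}-tp-{tp}-replicas-{n} convention."""
    return f"{backend}-lora-{deploy_type}-tp-{tp_size}-replicas-{replicas}"
```

Keeping the name in one place means the `-lora-` detection in the checker factory and the scenario generator cannot drift apart.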
752-762: LGTM!

The `lora_load` configuration with moderate settings (5 clients, 50 requests per client) is appropriate for LoRA testing. The standard request rate of 1.0 requests/sec is reasonable given LoRA's lower computational overhead compared to MoE models.
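For concreteness, the profile described above amounts to the following shape; the field names here are assumptions for illustration, not the actual keys in `scenarios.py`:

```python
# Assumed shape of the lora_load profile; field names are illustrative only.
lora_load = {
    "clients": 5,
    "requests_per_client": 50,
    "request_rate": 1.0,  # requests per second, per the review
}

# Total requests issued across all clients for one test run.
total_requests = lora_load["clients"] * lora_load["requests_per_client"]
```

At 1.0 req/s this works out to roughly 250 requests over a few minutes of load, a modest profile for exercising LoRA discovery and routing.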
554-555: LGTM!

LoRA deployments are correctly integrated into `DEPLOYMENT_SPECS` following the established pattern. Limiting to the vLLM backend initially is a reasonable scope for the initial LoRA testing implementation.
1024-1127: LGTM!

The `add_lora_scenarios` function is well-structured and comprehensive:
- Covers both aggregated and disaggregated deployment types
- Includes process termination and pod deletion failure scenarios
- Provides "none" scenarios for baseline LoRA discovery testing without failures
- Properly integrates with the checker factory for validation
1129-1130: LGTM!

The module-level call to `add_lora_scenarios()` follows the established pattern used by `add_token_overflow_scenarios()` and `add_rolling_upgrade_scenarios()`.

tests/fault_tolerance/deploy/lora_checker.py (1)
16-22: LGTM!

Module docstring clearly documents the intended validation scope for LoRA discovery.
```python
# For LoRA tests, we expect high success rate
# (failures in LoRA loading should be minimal)
if success_rate < 80.0:
    self.logger.warning(
        f"⚠ Low success rate for LoRA inference: {success_rate:.2f}%"
    )
    self.logger.warning(
        " This may indicate LoRA loading or routing issues"
    )
else:
    self.logger.info(
        f"✓ LoRA inference success rate is acceptable: {success_rate:.2f}%"
    )
```
Low success rate should fail the check, not just warn.
Per the BaseChecker contract (from base_checker.py), the check() method should "Raise AssertionError on validation failure". Currently, a success rate below 80% only logs a warning but doesn't fail the test. This could allow tests to pass silently despite LoRA loading or routing issues.
```diff
 # For LoRA tests, we expect high success rate
 # (failures in LoRA loading should be minimal)
 if success_rate < 80.0:
     self.logger.warning(
         f"⚠ Low success rate for LoRA inference: {success_rate:.2f}%"
     )
     self.logger.warning(
         " This may indicate LoRA loading or routing issues"
     )
+    raise AssertionError(
+        f"LoRA inference success rate {success_rate:.2f}% is below threshold (80%)"
+    )
 else:
     self.logger.info(
         f"✓ LoRA inference success rate is acceptable: {success_rate:.2f}%"
     )
```

🤖 Prompt for AI Agents
In tests/fault_tolerance/deploy/lora_checker.py around lines 151 to 163, the
current logic only logs a warning when LoRA success_rate < 80.0 but per
BaseChecker.check() contract this must raise an AssertionError; change the code
so that when success_rate < 80.0 you log an error (or warning) with the same
message and then raise an AssertionError with a clear message including the
success_rate (e.g. "LoRA inference success rate below threshold: X%"), leaving
the else branch to keep the info log for acceptable rates.
Overview:
Adds comprehensive LoRA (Low-Rank Adaptation) testing infrastructure for Kubernetes-based deployments with fault tolerance validation.
This PR enables testing of LoRA discovery, registration, and inference in both aggregated and disaggregated vLLM deployment modes.
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
Release Notes
New Features
Tests