
Conversation


@biswapanda biswapanda commented Dec 17, 2025

Overview:

Adds comprehensive LoRA (Low-Rank Adaptation) testing infrastructure for Kubernetes-based deployments with fault tolerance validation.

This PR enables testing of LoRA discovery, registration, and inference in both aggregated and disaggregated vLLM deployment modes.

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

Release Notes

  • New Features

    • Added LoRA testing support to fault tolerance scenarios with discovery and inference validation checks.
    • Introduced aggregated and disaggregated deployment configurations for LoRA workloads.
  • Tests

    • Extended test suite with LoRA-specific failure injection scenarios and tailored load configurations.


@biswapanda biswapanda self-assigned this Dec 17, 2025
@biswapanda biswapanda requested review from a team as code owners December 17, 2025 01:44
@biswapanda biswapanda changed the title from "tests: LoRA testing with k8s" to "test: LoRA testing with k8s" on Dec 17, 2025
@github-actions github-actions bot added the test label Dec 17, 2025

coderabbitai bot commented Dec 17, 2025

Walkthrough

This pull request adds LoRA (Low-Rank Adaptation) support to the fault tolerance testing suite. Changes include new checker classes for LoRA validation, factory logic to detect and integrate LoRA scenarios, two Kubernetes deployment manifests for LoRA aggregated and disaggregated topologies, and corresponding test configurations with LoRA-specific load profiles.
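For orientation, here is a rough sketch of what the two new checker classes might look like (illustrative only: the real code lives in tests/fault_tolerance/deploy/lora_checker.py, the BaseChecker/ValidationContext imports follow the code-graph analysis later in this review, and the method bodies and the min_success_rate parameter are assumptions):

# Illustrative sketch, not the PR's actual implementation.
from typing import Optional

from tests.fault_tolerance.deploy.base_checker import BaseChecker, ValidationContext


class LoRADiscoveryChecker(BaseChecker):
    """Checks that LoRA adapters were discovered/registered by the workers."""

    def __init__(self, lora_name: Optional[str] = None):
        super().__init__(name="LoRADiscoveryChecker")
        self.lora_name = lora_name  # specific adapter to look for, if given

    def check(self, context: ValidationContext) -> None:
        # Per the BaseChecker contract, raise AssertionError on failure.
        assert context.deployment is not None, "no deployment in ValidationContext"


class LoRAInferenceChecker(BaseChecker):
    """Checks that inference requests served through LoRA adapters succeeded."""

    def __init__(self, min_success_rate: float = 80.0):
        super().__init__(name="LoRAInferenceChecker")
        self.min_success_rate = min_success_rate  # hypothetical threshold knob

    def check(self, context: ValidationContext) -> None:
        # Hypothetical: compute a success rate from the context's results and
        # compare it against self.min_success_rate.
        ...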

Changes

  • Checker Infrastructure (tests/fault_tolerance/deploy/checker_factory.py, tests/fault_tolerance/deploy/lora_checker.py): Added LoRA detection logic to the factory; imported and integrated two new checker classes (LoRADiscoveryChecker, LoRAInferenceChecker) that validate LoRA discovery and inference metrics; added conditional VllmWorker selection for "-lora-agg-" tests.
  • Test Scenarios Configuration (tests/fault_tolerance/deploy/scenarios.py): Introduced the _create_lora_deployments_for_backend() helper, the add_lora_scenarios() function, and a lora_load configuration; integrated LoRA deployments into DEPLOYMENT_SPECS; populated the scenarios map with LoRA agg/disagg test variants and failure injection patterns.
  • Kubernetes Manifests (tests/fault_tolerance/deploy/templates/vllm/lora_agg.yaml, tests/fault_tolerance/deploy/templates/vllm/lora_disagg.yaml): Added two DynamoGraphDeployment manifests for LoRA scenarios: aggregated (Frontend + 2× VllmWorker) and disaggregated (Frontend + 2× VllmDecodeWorker + 2× VllmPrefillWorker) with LoRA enablement, model configuration, tensor parallelization, and AWS/S3 credential placeholders.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

  • scenarios.py: Verify the structure and logic of add_lora_scenarios() function; check for the reported duplicate definitions and ensure no conflicting state.
  • lora_checker.py: Review placeholder scaffolding and confirm the logging structure aligns with validation patterns; verify integration with ValidationContext.
  • Kubernetes manifests: Validate resource requests/limits, environment variable configuration, and LoRA-specific parameters (max-loras, max-lora-rank, tensor-parallel-size).
  • checker_factory.py: Confirm LoRA detection logic (string matching on "-lora-") is robust and worker selection conditional is correct.

Poem

🐰 Hop, hop! LoRA's here to stay,
New checkers bound with vLLM today,
Templates split 'tween agg and disagg,
Fault tolerance won't ever lag!
Low-rank adapts with bunny delight!

Pre-merge checks

❌ Failed checks (1 warning)
  • Description check ⚠️ Warning: The PR description is incomplete; all required sections are present but empty (Overview, Details, and "Where should the reviewer start" contain only headers). Resolution: fill in those sections with meaningful content describing the changes, implementation details, and files requiring close review.

✅ Passed checks (2 passed)

  • Docstring Coverage ✅ Passed: Docstring coverage is 88.89%, which meets the required threshold of 80.00%.
  • Title check ✅ Passed: The title "test: LoRA testing with k8s" is clear and directly related to the main change, adding LoRA testing support with Kubernetes infrastructure.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
tests/fault_tolerance/deploy/templates/vllm/lora_disagg.yaml (1)

93-167: Consider reducing YAML duplication.

The environment variables and extraPodSpec sections are nearly identical between VllmDecodeWorker and VllmPrefillWorker. While explicit configuration is acceptable for test manifests, you could use YAML anchors to reduce duplication if maintaining these files becomes cumbersome.

Example with anchors:

x-common-envs: &common-envs
  - name: DYN_SYSTEM_ENABLED
    value: "true"
  # ... rest of common envs

VllmDecodeWorker:
  envs: *common-envs
  # ...
tests/fault_tolerance/deploy/lora_checker.py (2)

42-49: Unused parameter lora_name.

The lora_name parameter is stored but never used in the check() method. Either remove it or add a TODO comment indicating planned usage.

     def __init__(self, lora_name: Optional[str] = None):
         """Initialize LoRA discovery checker.

         Args:
             lora_name: Optional specific LoRA name to check for
         """
         super().__init__(name="LoRADiscoveryChecker")
-        self.lora_name = lora_name
+        self.lora_name = lora_name  # TODO: Use for specific LoRA validation

51-111: Placeholder checker always passes.

The LoRADiscoveryChecker.check() currently only logs information and always passes. While this is documented as a placeholder, consider whether tests relying on this checker will provide meaningful validation. At minimum, you might want to raise an assertion or log a more prominent warning if critical conditions are missing (e.g., context.deployment is None).

Would you like me to help create an issue to track the full etcd-based implementation, or suggest a minimal assertion to add now?
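For reference, a minimal guard of that kind (assuming context.deployment is populated for live deployments, as referenced above) could look like:

# Hypothetical minimal assertion for the placeholder check() -- fails loudly
# when critical state is missing instead of always passing.
if context.deployment is None:
    raise AssertionError(
        "LoRADiscoveryChecker: ValidationContext has no deployment; "
        "cannot verify LoRA discovery"
    )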

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6b8bb99 and 672f4a1.

📒 Files selected for processing (5)
  • tests/fault_tolerance/deploy/checker_factory.py (3 hunks)
  • tests/fault_tolerance/deploy/lora_checker.py (1 hunks)
  • tests/fault_tolerance/deploy/scenarios.py (4 hunks)
  • tests/fault_tolerance/deploy/templates/vllm/lora_agg.yaml (1 hunks)
  • tests/fault_tolerance/deploy/templates/vllm/lora_disagg.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-04T19:03:06.643Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 2872
File: examples/multimodal/deploy/agg_qwen.yaml:53-60
Timestamp: 2025-09-04T19:03:06.643Z
Learning: In the dynamo repository, Kubernetes Custom Resources use `gpu: "1"` format for GPU resource limits and requests, not the standard Kubernetes `nvidia.com/gpu: 1` format. This applies to DynamoGraphDeployment resources and other dynamo CRs.

Applied to files:

  • tests/fault_tolerance/deploy/templates/vllm/lora_agg.yaml
  • tests/fault_tolerance/deploy/templates/vllm/lora_disagg.yaml
📚 Learning: 2025-09-24T17:50:00.970Z
Learnt from: tzulingk
Repo: ai-dynamo/dynamo PR: 3194
File: tests/utils/managed_deployment.py:555-559
Timestamp: 2025-09-24T17:50:00.970Z
Learning: In tests/utils/managed_deployment.py, when handling None service_name in get_service method, prefer setting service_name to empty string "" rather than defaulting to frontend_service_name, to avoid confusion per user tzulingk's preference.

Applied to files:

  • tests/fault_tolerance/deploy/checker_factory.py
🧬 Code graph analysis (2)
tests/fault_tolerance/deploy/lora_checker.py (2)
tests/fault_tolerance/deploy/base_checker.py (2)
  • BaseChecker (57-105)
  • ValidationContext (33-54)
tests/fault_tolerance/deploy/test_deployment.py (1)
  • scenario (32-57)
tests/fault_tolerance/deploy/checker_factory.py (1)
tests/fault_tolerance/deploy/lora_checker.py (2)
  • LoRADiscoveryChecker (32-111)
  • LoRAInferenceChecker (114-165)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: operator (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (12)
tests/fault_tolerance/deploy/checker_factory.py (3)

37-40: LGTM!

New imports for LoRA-specific checkers are correctly added and align with the new lora_checker.py module.


71-78: LGTM!

The LoRA scenario detection using -lora- pattern matching is consistent with the naming convention in scenarios.py. Prepending LoRA-specific checkers before scenario and results checkers is appropriate for validating LoRA discovery early in the validation flow.


151-156: LGTM!

The conditional logic correctly distinguishes between LoRA aggregated deployments (using VllmWorker) and other deployments including LoRA disaggregated (using VllmDecodeWorker). This aligns with the worker naming in lora_agg.yaml and lora_disagg.yaml.
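Roughly, the two conditionals described above amount to the following (string patterns taken from the walkthrough; the surrounding factory code and the function names here are assumptions, not the actual checker_factory.py implementation):

def build_checkers(test_name: str, default_checkers: list) -> list:
    # Prepend LoRA checks so discovery is validated before the generic
    # scenario/results checkers run.
    checkers = list(default_checkers)
    if "-lora-" in test_name:
        checkers = [LoRADiscoveryChecker(), LoRAInferenceChecker()] + checkers
    return checkers


def select_worker(test_name: str) -> str:
    # Aggregated LoRA graphs use VllmWorker; everything else, including
    # LoRA disaggregated tests, targets VllmDecodeWorker.
    return "VllmWorker" if "-lora-agg-" in test_name else "VllmDecodeWorker"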

tests/fault_tolerance/deploy/templates/vllm/lora_agg.yaml (2)

17-18: Verify the image tag placeholder.

The image tag my-tag appears to be a placeholder. Confirm this is intentionally parameterized or replaced during test execution; otherwise, update to a valid tag or use a template variable.


19-91: LGTM!

The VllmWorker configuration is well-structured for LoRA testing:

  • GPU resource format gpu: "1" follows dynamo CR conventions (based on learnings)
  • LoRA-specific environment variables are properly configured
  • AWS credentials are marked optional, handling missing secrets gracefully
  • Debug logging settings are appropriate for test scenarios
tests/fault_tolerance/deploy/templates/vllm/lora_disagg.yaml (1)

32-92: LGTM with minor suggestion.

VllmDecodeWorker configuration is well-structured with proper LoRA environment variables and resource settings.

tests/fault_tolerance/deploy/scenarios.py (5)

508-542: LGTM!

The _create_lora_deployments_for_backend function follows the established pattern from _create_moe_deployments_for_backend. The deployment naming convention {backend}-lora-{deploy_type}-tp-{tp_size}-replicas-{replicas} is consistent and integrates well with the LoRA detection logic in checker_factory.py.
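For example, the aggregated vLLM variant with TP=1 and two replicas would yield a name like this (illustrative values only):

# Shows how the naming convention lines up with the "-lora-" and
# "-lora-agg-" patterns matched in checker_factory.py.
backend, deploy_type, tp_size, replicas = "vllm", "agg", 1, 2
name = f"{backend}-lora-{deploy_type}-tp-{tp_size}-replicas-{replicas}"
# name == "vllm-lora-agg-tp-1-replicas-2"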


752-762: LGTM!

The lora_load configuration with moderate settings (5 clients, 50 requests per client) is appropriate for LoRA testing. The standard request rate of 1.0 requests/sec is reasonable given LoRA's lower computational overhead compared to MoE models.


554-555: LGTM!

LoRA deployments are correctly integrated into DEPLOYMENT_SPECS following the established pattern. Limiting to vLLM backend initially is a reasonable scope for the initial LoRA testing implementation.


1024-1127: LGTM!

The add_lora_scenarios function is well-structured and comprehensive:

  • Covers both aggregated and disaggregated deployment types
  • Includes process termination and pod deletion failure scenarios
  • Provides "none" scenarios for baseline LoRA discovery testing without failures
  • Properly integrates with the checker factory for validation

1129-1130: LGTM!

The module-level call to add_lora_scenarios() follows the established pattern used by add_token_overflow_scenarios() and add_rolling_upgrade_scenarios().

tests/fault_tolerance/deploy/lora_checker.py (1)

16-22: LGTM!

Module docstring clearly documents the intended validation scope for LoRA discovery.

Comment on lines +151 to +163
# For LoRA tests, we expect high success rate
# (failures in LoRA loading should be minimal)
if success_rate < 80.0:
    self.logger.warning(
        f"⚠ Low success rate for LoRA inference: {success_rate:.2f}%"
    )
    self.logger.warning(
        "  This may indicate LoRA loading or routing issues"
    )
else:
    self.logger.info(
        f"✓ LoRA inference success rate is acceptable: {success_rate:.2f}%"
    )

⚠️ Potential issue | 🟠 Major

Low success rate should fail the check, not just warn.

Per the BaseChecker contract (from base_checker.py), the check() method should "Raise AssertionError on validation failure". Currently, a success rate below 80% only logs a warning but doesn't fail the test. This could allow tests to pass silently despite LoRA loading or routing issues.

             # For LoRA tests, we expect high success rate
             # (failures in LoRA loading should be minimal)
             if success_rate < 80.0:
                 self.logger.warning(
                     f"⚠ Low success rate for LoRA inference: {success_rate:.2f}%"
                 )
                 self.logger.warning(
                     "  This may indicate LoRA loading or routing issues"
                 )
+                raise AssertionError(
+                    f"LoRA inference success rate {success_rate:.2f}% is below threshold (80%)"
+                )
             else:
                 self.logger.info(
                     f"✓ LoRA inference success rate is acceptable: {success_rate:.2f}%"
                 )
🤖 Prompt for AI Agents
In tests/fault_tolerance/deploy/lora_checker.py around lines 151 to 163, the
current logic only logs a warning when LoRA success_rate < 80.0 but per
BaseChecker.check() contract this must raise an AssertionError; change the code
so that when success_rate < 80.0 you log an error (or warning) with the same
message and then raise an AssertionError with a clear message including the
success_rate (e.g. "LoRA inference success rate below threshold: X%"), leaving
the else branch to keep the info log for acceptable rates.

@biswapanda biswapanda marked this pull request as draft December 17, 2025 07:15
