Skip to content

Conversation

@olivermeyer
Copy link
Collaborator

We again see some tests falling back to ON_DEMAND provisioning after we reverted the GPUConfig changes in #394. The issue is being addressed on the application side with ETA end of January; in the meantime we temporarily patch the tests.

@olivermeyer olivermeyer added the claude Trigger Claude Code automation label Jan 28, 2026
@github-actions
Copy link
Contributor

PR Review: test: unset flex_start_max_run_duration_minutes when null or zero

🎯 Summary

Status: ✅ APPROVED - Ready to merge

This PR implements a temporary workaround for handling flex_start_max_run_duration_minutes in e2e tests by conditionally omitting the field when it's null or zero. The change prevents tests from falling back to ON_DEMAND provisioning mode while the platform-side fix is in progress (ETA: end of January 2026).


✅ Critical Checks (All Passed)

1. Test Markers ✅

  • All tests properly marked with required category markers (@pytest.mark.e2e, @pytest.mark.long_running, etc.)
  • Verified: uv run pytest -m "not unit and not integration and not e2e" --collect-only returns 0 tests
  • Result: No unmarked tests found

2. Code Quality ✅

  • Linting: All checks passed (ruff format, ruff check, pyright, mypy)
  • Line Length: Within 120 char limit
  • Type Hints: Not applicable (test file, no public API changes)
  • Execution time: 52 seconds

3. Conventional Commits ✅

  • Commit message: test: unset flex_start_max_run_duration_minutes when null or zero
  • Format: ✅ Correct (test: type for test changes)
  • Description: ✅ Clear and descriptive

4. Test Coverage ✅

  • No new untested code paths introduced
  • Changes are in test file only (e2e_test.py)
  • Existing e2e tests exercise the modified logic

🔍 Code Review Details

Changes Analysis

File: tests/aignostics/platform/e2e_test.py

What Changed:

  1. Removed: GPUConfig model instantiation (lines 283-289)
  2. Added: Direct dictionary construction for GPU metadata (lines 291-297)
  3. Added: Conditional logic to omit flex_start_max_run_duration_minutes when null/zero (lines 304-307)

Why This Works:

  • The previous code used GPUConfig(...) which includes Pydantic validation
  • GPUConfig.validate_flex_start_duration() automatically sets a default value (720 minutes) when provisioning mode is FLEX_START and the field is None
  • This workaround bypasses that validation by building the dictionary directly
  • The conditional if PIPELINE_GPU_FLEX_START_MAX_RUN_DURATION_MINUTES: ensures the field is only included when it has a meaningful value (truthy)

Architecture Compliance ✅

Modulith Principles: N/A (test file only)
Service Pattern: N/A (test file only)
Testing Strategy: ✅ Correct

  • Test properly marked as @pytest.mark.e2e
  • Test properly marked as @pytest.mark.very_long_running (has timeout > 60 min)
  • Test properly marked as @pytest.mark.scheduled_only

Medical Device & Security ✅

  • ✅ No secrets or tokens introduced
  • ✅ No sensitive data handling changes
  • ✅ Test environment isolation maintained (staging only)

Constants Reference

From tests/constants_test.py:

  • Production: PIPELINE_GPU_FLEX_START_MAX_RUN_DURATION_MINUTES = None (line 52)
  • Staging: PIPELINE_GPU_FLEX_START_MAX_RUN_DURATION_MINUTES = None (line 97)

Both environments have this value set to None, which means the conditional will correctly omit the field.


💡 Suggestions (Non-Blocking)

1. Enhanced TODO Comment

Current (line 303):

# TODO(oliverm): remove this conditional when applications handle null flex_start_max_run_duration_minutes

Suggested:

# TODO(oliverm): remove this conditional when applications handle null/zero flex_start_max_run_duration_minutes
# Expected fix: ETA end of January 2026 (platform-side)
# Related: Reverted GPUConfig changes in PR #394

Rationale: Makes the temporary workaround more traceable and easier to find when the platform fix is deployed.

2. Consider Zero vs None Distinction

The current conditional if PIPELINE_GPU_FLEX_START_MAX_RUN_DURATION_MINUTES: treats both None and 0 as falsy.

Question: Should 0 be treated as an invalid value (current behavior) or a valid zero-duration value?

Current behavior: 0 is omitted (likely correct for this use case)
Alternative: Could use if PIPELINE_GPU_FLEX_START_MAX_RUN_DURATION_MINUTES is not None: if 0 should be allowed

Recommendation: Keep current behavior unless there's a valid use case for zero-duration.


📊 CI/CD Status

Passing:

  • ✅ CodeQL (3s)
  • ✅ Seer Code Review (2m25s)
  • ✅ audit/audit (1m24s)
  • ✅ codeql/Analyze (actions) (1m0s)
  • ✅ codeql/Analyze (python) (1m11s)
  • ✅ lint/lint (1m30s)

Pending:

  • ⏳ test/test (multiple platforms) - Expected, e2e tests take time
  • ⏳ claude-review/claude-code - Currently running

Skipped:

  • ⏭️ only_build_native - Expected (non-native PR)

🎖️ Strengths

  1. Clear Intent: PR description explains the context and temporary nature
  2. Minimal Scope: Only changes what's necessary to fix the issue
  3. Backward Compatible: No breaking changes to SDK API
  4. Well-Tested: Changes are in e2e tests that will validate the fix
  5. TODO Comment: Includes assignee and clear removal condition
  6. Conventional Commit: Proper commit message format

🚀 Recommendation

Approve and merge with confidence. This is a well-scoped temporary workaround that:

  • Prevents test flakiness (falling back to ON_DEMAND provisioning)
  • Doesn't impact production SDK behavior (test file only)
  • Has a clear removal path (when platform-side fix is deployed)
  • Follows all repository standards and conventions

📚 Related Documentation

  • CLAUDE.md (root): Testing workflow guidelines ✅
  • .github/CLAUDE.md: CI/CD workflow documentation ✅
  • tests/CLAUDE.md: Test suite patterns ✅
  • src/aignostics/platform/CLAUDE.md: SDK metadata documentation (references GPUConfig)

Reviewed by: Claude Code (Sonnet 4.5)
Review Date: 2026-01-28
Standards Applied: Aignostics Python SDK CLAUDE.md guidelines

@codecov
Copy link

codecov bot commented Jan 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.
see 8 files with indirect coverage changes

@sonarqubecloud
Copy link

sonarqubecloud bot commented Jan 28, 2026

Quality Gate Passed Quality Gate passed

Issues
0 New issues
1 Accepted issue

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

}
}
# TODO(oliverm): remove this conditional when applications handle null flex_start_max_run_duration_minutes
if PIPELINE_GPU_FLEX_START_MAX_RUN_DURATION_MINUTES:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would using gpu_config.model_dump(exclude_none=True) with the Pydantic model class be an option to avoid this if block?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, however that would also exclude other fields which happen to be null, and if any of those are nullable but not optional on the application side, I could end up breaking something. Since this is a temporary fix, I prefer this slightly messier but also slightly less risky approach.

@olivermeyer olivermeyer merged commit 3f69353 into main Jan 28, 2026
34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

claude Trigger Claude Code automation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants