Skip to content

Suspend destination folder on systemic errors (disk full, stale NFS)#331

Open
NumericalAdvantage wants to merge 1 commit intomainfrom
fix/suspend-destination-on-disk-full
Open

Suspend destination folder on systemic errors (disk full, stale NFS)#331
NumericalAdvantage wants to merge 1 commit intomainfrom
fix/suspend-destination-on-disk-full

Conversation

@NumericalAdvantage
Copy link
Copy Markdown
Collaborator

@NumericalAdvantage NumericalAdvantage commented Apr 20, 2026

Summary

  • When the NFS destination disk filled up, the mass transfer system failed silently — each task downloaded series from PACS and failed on write, then moved to the next series and the next task. After disk was freed and tasks retried, all evidence was erased (volumes deleted on retry).
  • Adds DicomFolder.suspended field. On systemic errors (ENOSPC, stale NFS, read-only FS, I/O error), the processor suspends the destination folder, stops immediately, and emails admins.
  • Subsequent tasks for the same destination return WARNING without touching the PACS.
  • Non-systemic errors (bad DICOM data, conversion failures) continue to be handled per-series as before.

Root cause

_handle_fetched_image (dicom_operator.py:757) catches ENOSPC and raises DicomError. In _transfer_single_series, the generic except Exception marks the volume ERROR and continues to the next series — downloading from PACS again, failing on write again, for every remaining series in the task and every subsequent task.

Test plan

  • _is_systemic_error returns True for ENOSPC/ESTALE/EROFS/EIO and "Out of disk space" DicomError
  • _is_systemic_error returns False for regular exceptions
  • process() suspends destination and returns FAILURE on systemic error
  • process() skips processing when destination is already suspended
  • process() does NOT suspend on regular errors
  • Regular errors still continue to next series (existing behavior preserved)
  • 92 processor tests pass

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Destinations can now be suspended, automatically skipping them from batch processing operations.
    • System automatically suspends destinations when critical transfer errors occur (e.g., disk full scenarios).
    • Administrators receive email notifications when destinations are automatically suspended.

When a systemic infrastructure error occurs during mass transfer (e.g.
ENOSPC, stale NFS handle, read-only filesystem), the processor now:

1. Stops processing remaining series immediately instead of grinding
   through them all
2. Sets DicomFolder.suspended = True on the destination
3. Sends an alert email to admins
4. Returns FAILURE for the current task

Subsequent tasks for the same destination check the suspended flag at
the start of process() and return WARNING without touching the PACS,
preventing thousands of wasted queries.

Non-systemic errors (bad DICOM data, conversion failures) continue to
be handled per-series as before.
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 20, 2026

📝 Walkthrough

Walkthrough

A new suspended boolean field is added to the DICOMFolder model via migration to mark destinations that should skip processing. The mass transfer processor gains systemic error detection logic that automatically suspends folders encountering disk-space or filesystem errors, and includes admin notification. Comprehensive test coverage validates the suspension behavior and error classification.

Changes

Cohort / File(s) Summary
Model and Migration
adit/core/migrations/0018_dicomfolder_suspended.py, adit/core/models.py
Added suspended BooleanField to DICOMFolder model with default False and descriptive help text. New migration registers the schema change with dependency on prior migration.
Mass Transfer Processor
adit/mass_transfer/processors.py
Introduced _is_systemic_error() helper to detect systemic failures via OSError errno values (ENOSPC, ESTALE, EROFS, EIO) and disk-space DicomError messages. Processor checks destination suspension status before processing and skips if suspended. On systemic errors during grouped series transfer, automatically suspends destination folder via _suspend_destination(), logs critically, sends admin notification, and returns FAILURE. Per-series exception handling enhanced to re-raise systemic errors instead of silently marking as ERROR.
Processor Tests
adit/mass_transfer/tests/test_processor.py
Added TestIsSystemicError suite validating systemic error classification. Added integration tests covering: automatic suspension on systemic export errors, regular error handling without suspension, early exit for pre-suspended destinations, and destination suspension state persistence to database.

Sequence Diagram

sequenceDiagram
    participant Processor as MassTransferTaskProcessor
    participant DB as DICOMFolder DB
    participant Transfer as Transfer Process
    participant Admin as Admin Notification
    
    Processor->>DB: Check if destination suspended
    alt Destination Already Suspended
        Processor->>Processor: Return WARNING + skip
    else Destination Active
        Processor->>Transfer: Begin grouped series transfer
        Transfer-->>Transfer: OSError (ENOSPC) or disk-space error
        Processor->>Processor: _is_systemic_error() → True
        Processor->>DB: suspended = True (save)
        Processor->>Admin: Send suspension notification
        Processor->>Processor: Return FAILURE + log
    end
Loading

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A folder suspended, when disks grow too tight,
Auto-detection catches the systemic blight,
With error-wise logic, admins take flight—
No more silent failures in the digital night! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 45.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: suspending destination folders when systemic errors occur, with concrete examples (disk full, stale NFS).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/suspend-destination-on-disk-full

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a mechanism to automatically suspend DICOM folder destinations when systemic infrastructure errors, such as disk full (ENOSPC) or I/O errors, are detected during mass transfers. It adds a 'suspended' field to the DicomFolder model, implements systemic error detection logic, and ensures that admins are notified via email when a destination is automatically disabled. The review feedback highlights opportunities to improve the robustness of error string matching, refine status messages for non-folder destinations, and ensure email notifications are correctly formatted for HTML-capable clients.

"""
if isinstance(err, OSError) and err.errno in _SYSTEMIC_ERRNOS:
return True
if isinstance(err, DicomError) and "Out of disk space" in str(err):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The string comparison for "Out of disk space" is case-sensitive. It is safer to use .lower() to ensure it catches variations in the error message from different DICOM implementations or system locales.

Suggested change
if isinstance(err, DicomError) and "Out of disk space" in str(err):
if isinstance(err, DicomError) and "out of disk space" in str(err).lower():

self._suspend_destination(destination_node, job, err)
return {
"status": MassTransferTask.Status.FAILURE,
"message": f"Destination suspended: {err}",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The error message "Destination suspended" is misleading when the destination is a SERVER. In that case, _suspend_destination returns early without taking any action (as intended for server nodes), but the task summary still claims the destination was suspended. The message should accurately reflect whether suspension occurred.

Suggested change
"message": f"Destination suspended: {err}",
"message": f"Destination suspended: {err}" if destination_node.node_type == DicomNode.NodeType.FOLDER else f"Systemic error: {err}",

Comment on lines +389 to +395
send_mail_to_admins(
f"Mass transfer destination '{destination_node.name}' suspended",
f"Destination folder '{destination_node.name}' was automatically suspended "
f"due to a systemic error during mass transfer job {job.pk}.\n\n"
f"Error: {err}\n\n"
f"Please fix the underlying issue and unsuspend the folder in the admin panel.",
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The send_mail_to_admins helper (imported from adit.core.utils.mail) appears to be designed for HTML content, as seen in its other usages where it is called with the html_content keyword argument. Passing a plain text string with \n as a positional argument will likely result in an email where line breaks are not rendered in most email clients. Consider using the html_content keyword argument and replacing newlines with <br> tags, or verify if the helper supports a plain text message argument.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
adit/mass_transfer/processors.py (2)

341-361: Partial progress is discarded in the systemic‑error FAILURE summary.

When a systemic error fires after N series have already been transferred successfully, those counts aren't surfaced in the returned message/log. The DB rows are saved per-volume, but operators looking at the task result won't see how far it got before the abort. Consider capturing per-volume progress inside _transfer_grouped_series (or letting it run _build_task_summary for the partial state) and including that in the FAILURE log alongside the error. Non-blocking.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adit/mass_transfer/processors.py` around lines 341 - 361, The FAILURE branch
currently drops any per-volume progress when a systemic error occurs in
_transfer_grouped_series; modify _transfer_grouped_series to surface partial
progress (e.g., return a partial_summary or update job state) or ensure it calls
_build_task_summary for the current state before raising, then in this except
block call that summary (or read it from the job) and include it in the returned
dict's "message" and "log" alongside the systemic error; keep the existing
suspension via _suspend_destination(destination_node, job, err) and ensure the
summary contains per-volume counts so operators can see how many series/volumes
completed before the failure.

387-395: Use consistent email rendering pattern with template.

The existing call to send_job_failed_mail (even though marked unused) demonstrates the intended pattern: render the message body via render_to_string() with an HTML template, then pass via html_content= keyword. While plain-text positional arguments are used elsewhere (adit/core/tasks.py), matching the structured template approach here improves maintainability and consistency. Consider:

+        from adit_radis_shared.common.utils.mail import send_mail_to_admins
+        from django.template.loader import render_to_string
+
+        subject = f"Mass transfer destination '{destination_node.name}' suspended"
+        html_content = render_to_string(
+            "core/mail/destination_suspended.html",
+            {"destination": destination_node, "job": job, "error": str(err)},
+        )
+        send_mail_to_admins(subject, html_content=html_content)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adit/mass_transfer/processors.py` around lines 387 - 395, Replace the current
positional plain-text call to send_mail_to_admins with the project’s templated
rendering pattern: use render_to_string() to render the HTML body from the
appropriate template (same pattern used by send_job_failed_mail), then call
send_mail_to_admins(subject, html_content=rendered_html) and include a short
plain-text fallback if supported; update the call site referencing
destination_node, job.pk and err so those context vars are passed into the
template rendering instead of interpolating them into the positional message
string.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@adit/mass_transfer/processors.py`:
- Around line 341-361: The FAILURE branch currently drops any per-volume
progress when a systemic error occurs in _transfer_grouped_series; modify
_transfer_grouped_series to surface partial progress (e.g., return a
partial_summary or update job state) or ensure it calls _build_task_summary for
the current state before raising, then in this except block call that summary
(or read it from the job) and include it in the returned dict's "message" and
"log" alongside the systemic error; keep the existing suspension via
_suspend_destination(destination_node, job, err) and ensure the summary contains
per-volume counts so operators can see how many series/volumes completed before
the failure.
- Around line 387-395: Replace the current positional plain-text call to
send_mail_to_admins with the project’s templated rendering pattern: use
render_to_string() to render the HTML body from the appropriate template (same
pattern used by send_job_failed_mail), then call send_mail_to_admins(subject,
html_content=rendered_html) and include a short plain-text fallback if
supported; update the call site referencing destination_node, job.pk and err so
those context vars are passed into the template rendering instead of
interpolating them into the positional message string.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 349ebe5c-5223-4bdf-8f41-eafe88011154

📥 Commits

Reviewing files that changed from the base of the PR and between 0f67ddd and f7c4b26.

📒 Files selected for processing (4)
  • adit/core/migrations/0018_dicomfolder_suspended.py
  • adit/core/models.py
  • adit/mass_transfer/processors.py
  • adit/mass_transfer/tests/test_processor.py

Comment on lines +276 to +286
destination_node = self.mass_task.destination
if (
destination_node.node_type == DicomNode.NodeType.FOLDER
and destination_node.dicomfolder.suspended
):
return {
"status": MassTransferTask.Status.WARNING,
"message": f"Destination '{destination_node.name}' is suspended.",
"log": "Task skipped because the destination folder is suspended.",
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the first task that encounters the disk full and suspends the destination DicomFolder is marked with status FAILURE. Then the remaining tasks in the job are still processed, hit the suspended check and are marked with status WARNING. I think they should also be marked as FAILURE since they are essentially in the same state as the first task. Also would make it easier to retry all the failed tasks since we have a UI button that does this for a job.

I also wonder if it makes more sense to pause the job instead when a task encounters disk full?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants