Suspend destination folder on systemic errors (disk full, stale NFS)#331
Suspend destination folder on systemic errors (disk full, stale NFS)#331NumericalAdvantage wants to merge 1 commit intomainfrom
Conversation
When a systemic infrastructure error occurs during mass transfer (e.g. ENOSPC, stale NFS handle, read-only filesystem), the processor now: 1. Stops processing remaining series immediately instead of grinding through them all 2. Sets DicomFolder.suspended = True on the destination 3. Sends an alert email to admins 4. Returns FAILURE for the current task Subsequent tasks for the same destination check the suspended flag at the start of process() and return WARNING without touching the PACS, preventing thousands of wasted queries. Non-systemic errors (bad DICOM data, conversion failures) continue to be handled per-series as before.
📝 WalkthroughWalkthroughA new Changes
Sequence DiagramsequenceDiagram
participant Processor as MassTransferTaskProcessor
participant DB as DICOMFolder DB
participant Transfer as Transfer Process
participant Admin as Admin Notification
Processor->>DB: Check if destination suspended
alt Destination Already Suspended
Processor->>Processor: Return WARNING + skip
else Destination Active
Processor->>Transfer: Begin grouped series transfer
Transfer-->>Transfer: OSError (ENOSPC) or disk-space error
Processor->>Processor: _is_systemic_error() → True
Processor->>DB: suspended = True (save)
Processor->>Admin: Send suspension notification
Processor->>Processor: Return FAILURE + log
end
Estimated Code Review Effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a mechanism to automatically suspend DICOM folder destinations when systemic infrastructure errors, such as disk full (ENOSPC) or I/O errors, are detected during mass transfers. It adds a 'suspended' field to the DicomFolder model, implements systemic error detection logic, and ensures that admins are notified via email when a destination is automatically disabled. The review feedback highlights opportunities to improve the robustness of error string matching, refine status messages for non-folder destinations, and ensure email notifications are correctly formatted for HTML-capable clients.
| """ | ||
| if isinstance(err, OSError) and err.errno in _SYSTEMIC_ERRNOS: | ||
| return True | ||
| if isinstance(err, DicomError) and "Out of disk space" in str(err): |
There was a problem hiding this comment.
The string comparison for "Out of disk space" is case-sensitive. It is safer to use .lower() to ensure it catches variations in the error message from different DICOM implementations or system locales.
| if isinstance(err, DicomError) and "Out of disk space" in str(err): | |
| if isinstance(err, DicomError) and "out of disk space" in str(err).lower(): |
| self._suspend_destination(destination_node, job, err) | ||
| return { | ||
| "status": MassTransferTask.Status.FAILURE, | ||
| "message": f"Destination suspended: {err}", |
There was a problem hiding this comment.
The error message "Destination suspended" is misleading when the destination is a SERVER. In that case, _suspend_destination returns early without taking any action (as intended for server nodes), but the task summary still claims the destination was suspended. The message should accurately reflect whether suspension occurred.
| "message": f"Destination suspended: {err}", | |
| "message": f"Destination suspended: {err}" if destination_node.node_type == DicomNode.NodeType.FOLDER else f"Systemic error: {err}", |
| send_mail_to_admins( | ||
| f"Mass transfer destination '{destination_node.name}' suspended", | ||
| f"Destination folder '{destination_node.name}' was automatically suspended " | ||
| f"due to a systemic error during mass transfer job {job.pk}.\n\n" | ||
| f"Error: {err}\n\n" | ||
| f"Please fix the underlying issue and unsuspend the folder in the admin panel.", | ||
| ) |
There was a problem hiding this comment.
The send_mail_to_admins helper (imported from adit.core.utils.mail) appears to be designed for HTML content, as seen in its other usages where it is called with the html_content keyword argument. Passing a plain text string with \n as a positional argument will likely result in an email where line breaks are not rendered in most email clients. Consider using the html_content keyword argument and replacing newlines with <br> tags, or verify if the helper supports a plain text message argument.
There was a problem hiding this comment.
🧹 Nitpick comments (2)
adit/mass_transfer/processors.py (2)
341-361: Partial progress is discarded in the systemic‑error FAILURE summary.When a systemic error fires after N series have already been transferred successfully, those counts aren't surfaced in the returned
message/log. The DB rows are saved per-volume, but operators looking at the task result won't see how far it got before the abort. Consider capturing per-volume progress inside_transfer_grouped_series(or letting it run_build_task_summaryfor the partial state) and including that in the FAILURE log alongside the error. Non-blocking.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@adit/mass_transfer/processors.py` around lines 341 - 361, The FAILURE branch currently drops any per-volume progress when a systemic error occurs in _transfer_grouped_series; modify _transfer_grouped_series to surface partial progress (e.g., return a partial_summary or update job state) or ensure it calls _build_task_summary for the current state before raising, then in this except block call that summary (or read it from the job) and include it in the returned dict's "message" and "log" alongside the systemic error; keep the existing suspension via _suspend_destination(destination_node, job, err) and ensure the summary contains per-volume counts so operators can see how many series/volumes completed before the failure.
387-395: Use consistent email rendering pattern with template.The existing call to
send_job_failed_mail(even though marked unused) demonstrates the intended pattern: render the message body viarender_to_string()with an HTML template, then pass viahtml_content=keyword. While plain-text positional arguments are used elsewhere (adit/core/tasks.py), matching the structured template approach here improves maintainability and consistency. Consider:+ from adit_radis_shared.common.utils.mail import send_mail_to_admins + from django.template.loader import render_to_string + + subject = f"Mass transfer destination '{destination_node.name}' suspended" + html_content = render_to_string( + "core/mail/destination_suspended.html", + {"destination": destination_node, "job": job, "error": str(err)}, + ) + send_mail_to_admins(subject, html_content=html_content)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@adit/mass_transfer/processors.py` around lines 387 - 395, Replace the current positional plain-text call to send_mail_to_admins with the project’s templated rendering pattern: use render_to_string() to render the HTML body from the appropriate template (same pattern used by send_job_failed_mail), then call send_mail_to_admins(subject, html_content=rendered_html) and include a short plain-text fallback if supported; update the call site referencing destination_node, job.pk and err so those context vars are passed into the template rendering instead of interpolating them into the positional message string.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@adit/mass_transfer/processors.py`:
- Around line 341-361: The FAILURE branch currently drops any per-volume
progress when a systemic error occurs in _transfer_grouped_series; modify
_transfer_grouped_series to surface partial progress (e.g., return a
partial_summary or update job state) or ensure it calls _build_task_summary for
the current state before raising, then in this except block call that summary
(or read it from the job) and include it in the returned dict's "message" and
"log" alongside the systemic error; keep the existing suspension via
_suspend_destination(destination_node, job, err) and ensure the summary contains
per-volume counts so operators can see how many series/volumes completed before
the failure.
- Around line 387-395: Replace the current positional plain-text call to
send_mail_to_admins with the project’s templated rendering pattern: use
render_to_string() to render the HTML body from the appropriate template (same
pattern used by send_job_failed_mail), then call send_mail_to_admins(subject,
html_content=rendered_html) and include a short plain-text fallback if
supported; update the call site referencing destination_node, job.pk and err so
those context vars are passed into the template rendering instead of
interpolating them into the positional message string.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 349ebe5c-5223-4bdf-8f41-eafe88011154
📒 Files selected for processing (4)
adit/core/migrations/0018_dicomfolder_suspended.pyadit/core/models.pyadit/mass_transfer/processors.pyadit/mass_transfer/tests/test_processor.py
| destination_node = self.mass_task.destination | ||
| if ( | ||
| destination_node.node_type == DicomNode.NodeType.FOLDER | ||
| and destination_node.dicomfolder.suspended | ||
| ): | ||
| return { | ||
| "status": MassTransferTask.Status.WARNING, | ||
| "message": f"Destination '{destination_node.name}' is suspended.", | ||
| "log": "Task skipped because the destination folder is suspended.", | ||
| } | ||
|
|
There was a problem hiding this comment.
So the first task that encounters the disk full and suspends the destination DicomFolder is marked with status FAILURE. Then the remaining tasks in the job are still processed, hit the suspended check and are marked with status WARNING. I think they should also be marked as FAILURE since they are essentially in the same state as the first task. Also would make it easier to retry all the failed tasks since we have a UI button that does this for a job.
I also wonder if it makes more sense to pause the job instead when a task encounters disk full?
Summary
DicomFolder.suspendedfield. On systemic errors (ENOSPC, stale NFS, read-only FS, I/O error), the processor suspends the destination folder, stops immediately, and emails admins.Root cause
_handle_fetched_image(dicom_operator.py:757) catches ENOSPC and raisesDicomError. In_transfer_single_series, the genericexcept Exceptionmarks the volume ERROR and continues to the next series — downloading from PACS again, failing on write again, for every remaining series in the task and every subsequent task.Test plan
_is_systemic_errorreturns True for ENOSPC/ESTALE/EROFS/EIO and "Out of disk space" DicomError_is_systemic_errorreturns False for regular exceptionsprocess()suspends destination and returns FAILURE on systemic errorprocess()skips processing when destination is already suspendedprocess()does NOT suspend on regular errors🤖 Generated with Claude Code
Summary by CodeRabbit