fix: harden Docker scan timeout and cancellation in LinuxContainerDetector #1798
Merged
Pull request overview
Hardens Linux container image scanning to avoid indefinite hangs by improving cancellation propagation in Docker interactions and adding additional diagnostics around long-running operations.
Changes:
- Ensure `DockerService` methods propagate CTS cancellation by re-throwing when the provided cancellation token is canceled.
- Add a periodic heartbeat log and move `CanRunLinuxContainersAsync` into the detector try/catch to avoid process crashes on cancellation.
- Add telemetry “checkpoint” flushing and additional best-effort time-bounded container cleanup safeguards.
Show a summary per file
| File | Description |
|---|---|
| src/Microsoft.ComponentDetection.Detectors/linux/LinuxContainerDetector.cs | Adds heartbeat logging and adjusts cancellation handling scope around Docker availability checks. |
| src/Microsoft.ComponentDetection.Common/Telemetry/TelemetryRelay.cs | Introduces FlushCurrentTelemetry() to flush buffered telemetry without shutting down. |
| src/Microsoft.ComponentDetection.Common/DockerService.cs | Improves cancellation propagation, adds telemetry flush call before reading container output, and hardens cleanup/stream disposal. |
Copilot's findings
- Files reviewed: 3/3 changed files
- Comments generated: 4
grvillic
reviewed
Apr 24, 2026
…ector

- Fix OperationCanceledException swallowing in 5 DockerService methods (CanPingDockerAsync, CanRunLinuxContainersAsync, ImageExistsLocallyAsync, TryPullImageAsync, InspectImageAsync) by adding cancellationToken.ThrowIfCancellationRequested() in catch blocks
- Move CanRunLinuxContainersAsync call inside try/catch in LinuxContainerDetector to prevent unhandled OCE crash
- Add belt-and-suspenders Task.WhenAny with 30s race timer for RemoveContainerAsync in finally block
- Make stream.Dispose() fire-and-forget via Task.Run to prevent blocking if underlying socket close() hangs
- Add TelemetryRelay.FlushCurrentTelemetry() before ReadContainerOutputAsync to ensure telemetry is durable before risky long-running operation
- Add 60s heartbeat timer in LinuxContainerDetector for process liveness monitoring
- Add ILogger breadcrumbs for container start, removal, and cleanup

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
c74e704 to
d365d19
grvillic
approved these changes
Apr 24, 2026
Codecov Report
✅ All modified and coverable lines are covered by tests.
Problem
The CG task hangs indefinitely during Docker container scanning in CI pipelines (observed 31+ hours on Store.DevX.Nova.Prod). Despite existing timeout mechanisms (480s CLI timeout, per-detector CTS, 10-min detector timeout), the process gets stuck and never exits.
Root cause: Five `DockerService` methods swallow `OperationCanceledException` (OCE) in their catch blocks, preventing CTS-based cancellation from propagating. When the timeout fires, each method catches the OCE, returns false/null, and execution continues to the next Docker call, which also swallows the OCE. As a result, no caller ever observes the cancellation and the timeout chain never stops the scan.
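The swallowing pattern and the fix can be illustrated with a minimal sketch. This is not the actual `DockerService` code: `CancellationSketch`, `PingAsync`, and both `CanPing...` methods are hypothetical names, and `PingAsync` merely stands in for a Docker client call that blocks until it is canceled. Only the use of `cancellationToken.ThrowIfCancellationRequested()` inside the catch block is taken from the commit message.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class CancellationSketch
{
    // Hypothetical stand-in for a Docker client call: blocks until the token is canceled.
    private static async Task PingAsync(CancellationToken cancellationToken)
    {
        await Task.Delay(Timeout.Infinite, cancellationToken);
    }

    // Before: the broad catch also catches OperationCanceledException,
    // so the caller only sees "false" and moves on to the next call.
    public static async Task<bool> CanPingSwallowingAsync(CancellationToken cancellationToken)
    {
        try
        {
            await PingAsync(cancellationToken);
            return true;
        }
        catch (Exception)
        {
            return false;
        }
    }

    // After: re-throw when the caller's token is canceled;
    // any other failure is still reported as "false".
    public static async Task<bool> CanPingPropagatingAsync(CancellationToken cancellationToken)
    {
        try
        {
            await PingAsync(cancellationToken);
            return true;
        }
        catch (Exception)
        {
            cancellationToken.ThrowIfCancellationRequested();
            return false;
        }
    }
}
```

With an already-canceled token, the first method returns `false` while the second throws `OperationCanceledException`, which is what lets a timeout unwind through a chain of such calls.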
Changes
Fix: Stop swallowing cancellation (the actual bug)
Hardening: Belt-and-suspenders for Docker operations
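The commit message describes racing `RemoveContainerAsync` against a 30s timer with `Task.WhenAny`. A rough sketch of that kind of bounded wait follows; `BoundedCleanupSketch` and `TryCleanupWithinAsync` are hypothetical names, the cleanup delegate is a placeholder for the real removal call, and the actual change in the PR may be structured differently.

```csharp
using System;
using System.Threading.Tasks;

public static class BoundedCleanupSketch
{
    // Returns true if the cleanup task finished before the limit,
    // false if the timer won the race and the cleanup was abandoned.
    public static async Task<bool> TryCleanupWithinAsync(Func<Task> cleanup, TimeSpan limit)
    {
        Task cleanupTask = cleanup();
        Task timerTask = Task.Delay(limit);

        Task winner = await Task.WhenAny(cleanupTask, timerTask);
        if (winner == cleanupTask)
        {
            // Cleanup finished in time; awaiting it surfaces any failure to the caller.
            await cleanupTask;
            return true;
        }

        // Timer won: stop waiting, but observe a late fault so it does not
        // surface later as an unobserved task exception.
        _ = cleanupTask.ContinueWith(
            t => { _ = t.Exception; },
            TaskContinuationOptions.OnlyOnFaulted);
        return false;
    }
}
```

A caller in a finally block might use `await BoundedCleanupSketch.TryCleanupWithinAsync(() => RemoveAsync(id), TimeSpan.FromSeconds(30))`, where `RemoveAsync` is again a placeholder. Note that the abandoned task keeps running in the background; the race only bounds how long the caller waits.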
Diagnostics: Instrumentation for future investigation
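The 60s heartbeat mentioned in the commit message could look roughly like the sketch below. `HeartbeatSketch`, its `log` callback, and the message text are assumptions made for illustration; the PR itself logs through `ILogger` inside `LinuxContainerDetector`, and its timer wiring is not shown in this page.

```csharp
using System;
using System.Threading;

public sealed class HeartbeatSketch : IDisposable
{
    private readonly Timer timer;
    private int beats;

    // Starts a timer that calls the (hypothetical) log callback once per interval.
    public HeartbeatSketch(TimeSpan interval, Action<string> log)
    {
        DateTime started = DateTime.UtcNow;
        this.timer = new Timer(
            _ =>
            {
                Interlocked.Increment(ref this.beats);
                double elapsed = (DateTime.UtcNow - started).TotalSeconds;
                log($"Container scan still running after {elapsed:F0}s");
            },
            null,
            interval,
            interval);
    }

    // Number of heartbeats emitted so far.
    public int Beats => Volatile.Read(ref this.beats);

    // Stops the heartbeat; intended to be called when the scan ends.
    public void Dispose() => this.timer.Dispose();
}
```

Wrapping the scan in `using var heartbeat = new HeartbeatSketch(TimeSpan.FromSeconds(60), Console.WriteLine);` would emit one line per minute until the scan returns, which makes a hung process distinguishable from a dead one in pipeline logs.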
Scope
All changes are scoped to `LinuxContainerDetector`, `LinuxScanner`, and `DockerService`. No changes to `DetectorProcessingService` or other detectors.
Testing