fix: harden Docker scan timeout and cancellation in LinuxContainerDetector #1798

Merged
AMaini503 merged 1 commit into main from fix/docker-scan-timeout-hardening on Apr 24, 2026

@AMaini503
Contributor

Problem

The CG task hangs indefinitely during Docker container scanning in CI pipelines (observed 31+ hours on Store.DevX.Nova.Prod). Despite existing timeout mechanisms (480s CLI timeout, per-detector CTS, 10-min detector timeout), the process gets stuck and never exits.

Root cause: Five `DockerService` methods swallow `OperationCanceledException` (OCE) in their catch blocks, preventing CTS-based cancellation from propagating. When the timeout fires, each method catches the OCE, returns false/null, and execution continues to the next Docker call, which also swallows the OCE. The timeout chain is completely broken.

Changes

Fix: Stop swallowing cancellation (the actual bug)

  • Add `cancellationToken.ThrowIfCancellationRequested()` in catch blocks of 5 `DockerService` methods: `CanPingDockerAsync`, `CanRunLinuxContainersAsync`, `ImageExistsLocallyAsync`, `TryPullImageAsync`, `InspectImageAsync`
  • Move `CanRunLinuxContainersAsync` call inside the try/catch in `LinuxContainerDetector.ExecuteDetectorAsync` (it was outside, so an OCE would escape to `DetectorProcessingService` and crash the process)
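
As a rough illustration of the pattern described above (not the actual `DockerService` code), the sketch below contrasts a catch block that swallows cancellation with one that rethrows when the caller's token is canceled. `CancellationSketch`, `FakeDockerCallAsync`, `SwallowingAsync`, and `PropagatingAsync` are hypothetical names invented for this example; only `CancellationToken.ThrowIfCancellationRequested()` comes from the PR description.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class CancellationSketch
{
    // Hypothetical stand-in for a Docker client call; it only waits on the token.
    private static Task FakeDockerCallAsync(CancellationToken token) =>
        Task.Delay(Timeout.Infinite, token);

    // Before: the broad catch also catches OperationCanceledException,
    // so the caller gets "false" and carries on after the timeout fired.
    public static async Task<bool> SwallowingAsync(CancellationToken token)
    {
        try
        {
            await FakeDockerCallAsync(token);
            return true;
        }
        catch (Exception)
        {
            return false;
        }
    }

    // After: rethrow when the caller's token is canceled; any other
    // failure still maps to "false" as before.
    public static async Task<bool> PropagatingAsync(CancellationToken token)
    {
        try
        {
            await FakeDockerCallAsync(token);
            return true;
        }
        catch (Exception)
        {
            token.ThrowIfCancellationRequested();
            return false;
        }
    }

    public static async Task Main()
    {
        using var cts = new CancellationTokenSource(TimeSpan.FromMilliseconds(100));

        // The timeout fires, but the caller only sees a boolean result.
        Console.WriteLine(await SwallowingAsync(cts.Token));

        try
        {
            await PropagatingAsync(cts.Token);
        }
        catch (OperationCanceledException)
        {
            Console.WriteLine("cancellation propagated");
        }
    }
}
```

Checking the token inside the existing catch keeps the methods' "return false/null on failure" contract for ordinary Docker errors while letting a fired timeout unwind the call chain.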

Hardening: Belt-and-suspenders for Docker operations

  • RemoveContainerAsync: `Task.WhenAny` with 30s race timer in the finally block; if remove hangs, we abandon and move on
  • stream.Dispose(): Fire-and-forget via `Task.Run`; if the underlying `socket.Close()` blocks (Docker daemon in D-state), we don't hang
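
The two hardening measures above could look roughly like the following sketch. It is written under assumptions: `CleanupSketch`, `TryWithinAsync`, and `DisposeInBackground` are hypothetical helpers, and a never-completing `Task.Delay` stands in for a hung container removal. The PR's real code lives in `DockerService` and may be structured differently.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class CleanupSketch
{
    // Returns true if `operation` finished within `limit`, false if we gave up.
    // An abandoned operation is not canceled; it keeps running in the background.
    public static async Task<bool> TryWithinAsync(Task operation, TimeSpan limit)
    {
        var finished = await Task.WhenAny(operation, Task.Delay(limit));
        if (finished != operation)
        {
            // Observe a late failure so it is not left as an unobserved exception.
            _ = operation.ContinueWith(
                t => { _ = t.Exception; },
                TaskContinuationOptions.OnlyOnFaulted);
            return false;
        }

        await operation; // rethrow if the operation itself failed
        return true;
    }

    // Dispose on a thread-pool thread so a blocking close cannot stall the caller.
    public static void DisposeInBackground(IDisposable resource)
    {
        _ = Task.Run(() =>
        {
            try { resource.Dispose(); }
            catch (Exception) { /* best effort: nothing useful to do if close fails */ }
        });
    }

    public static async Task Main()
    {
        // Hypothetical stand-in for a container removal that never returns.
        var hung = Task.Delay(Timeout.Infinite);
        var completed = await TryWithinAsync(hung, TimeSpan.FromMilliseconds(200));
        Console.WriteLine(completed ? "removed" : "abandoned after timeout");

        DisposeInBackground(new MemoryStream());
    }
}
```

The trade-off in both helpers is the same: the caller is unblocked, but the abandoned work (and any thread it occupies) may linger until the process exits.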

Diagnostics: Instrumentation for future investigation

  • Telemetry flush (`TelemetryRelay.FlushCurrentTelemetry()`) before `ReadContainerOutputAsync`: ensures telemetry is durable before the riskiest operation
  • Heartbeat timer (60s interval) in `LinuxContainerDetector`: confirms process liveness; if heartbeats stop, the process died
  • ILogger breadcrumbs for container start, removal, and cleanup completion
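
A heartbeat of the kind described above can be sketched with `System.Threading.Timer`. This is only an assumption about the shape of the code: `HeartbeatSketch` and `RunWithHeartbeatAsync` are hypothetical names, the log sink is a plain `Action<string>` rather than the `ILogger` the PR uses, and the interval is shortened from 60s so the demo finishes quickly.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class HeartbeatSketch
{
    // Runs `work` and calls `log` every `interval` until the work finishes,
    // so the logs show whether the process is still alive.
    public static async Task RunWithHeartbeatAsync(
        Func<Task> work, TimeSpan interval, Action<string> log)
    {
        var started = DateTime.UtcNow;

        // The timer is disposed when this method exits, whether normally or by exception.
        using var timer = new Timer(
            _ => log($"heartbeat: still running after {(DateTime.UtcNow - started).TotalSeconds:F1}s"),
            null,
            interval,
            interval);

        await work();
    }

    public static async Task Main()
    {
        await RunWithHeartbeatAsync(
            () => Task.Delay(TimeSpan.FromMilliseconds(350)), // stand-in for the scan
            TimeSpan.FromMilliseconds(100),
            Console.WriteLine);
    }
}
```

Because the callback runs on a thread-pool thread, heartbeats continue even when the awaited work is stuck, which is what makes their absence a meaningful signal.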

Scope

All changes are scoped to `LinuxContainerDetector`, `LinuxScanner`, and `DockerService`. No changes to `DetectorProcessingService` or other detectors.

Testing

  • `Microsoft.ComponentDetection.Common.Tests`: 214 passed, 0 failed
  • Local Docker scan with timeout: exits cleanly

@AMaini503 requested a review from a team as a code owner on April 24, 2026, 17:47

github-actions Bot commented Apr 24, 2026

👋 Hi! It looks like you modified some files in the Detectors folder.
You may need to bump the detector versions if any of the following scenarios apply:

  • The detector detects more or fewer components than before
  • The detector generates different parent/child graph relationships than before
  • The detector generates different devDependencies values than before

If none of the above scenarios apply, feel free to ignore this comment 🙂

Copilot AI left a comment

Pull request overview

Hardens Linux container image scanning to avoid indefinite hangs by improving cancellation propagation in Docker interactions and adding additional diagnostics around long-running operations.

Changes:

  • Ensure DockerService methods propagate CTS cancellation by re-throwing when the provided cancellation token is canceled.
  • Add a periodic heartbeat log and move CanRunLinuxContainersAsync into the detector try/catch to avoid process crashes on cancellation.
  • Add telemetry “checkpoint” flushing and additional best-effort time-bounded container cleanup safeguards.

| File | Description |
| --- | --- |
| src/Microsoft.ComponentDetection.Detectors/linux/LinuxContainerDetector.cs | Adds heartbeat logging and adjusts cancellation handling scope around Docker availability checks. |
| src/Microsoft.ComponentDetection.Common/Telemetry/TelemetryRelay.cs | Introduces FlushCurrentTelemetry() to flush buffered telemetry without shutting down. |
| src/Microsoft.ComponentDetection.Common/DockerService.cs | Improves cancellation propagation, adds telemetry flush call before reading container output, and hardens cleanup/stream disposal. |

Copilot's findings

  • Files reviewed: 3/3 changed files
  • Comments generated: 4

…ector

- Fix OperationCanceledException swallowing in 5 DockerService methods
  (CanPingDockerAsync, CanRunLinuxContainersAsync, ImageExistsLocallyAsync,
  TryPullImageAsync, InspectImageAsync) by adding
  cancellationToken.ThrowIfCancellationRequested() in catch blocks
- Move CanRunLinuxContainersAsync call inside try/catch in
  LinuxContainerDetector to prevent unhandled OCE crash
- Add belt-and-suspenders Task.WhenAny with 30s race timer for
  RemoveContainerAsync in finally block
- Make stream.Dispose() fire-and-forget via Task.Run to prevent blocking
  if underlying socket close() hangs
- Add TelemetryRelay.FlushCurrentTelemetry() before ReadContainerOutputAsync
  to ensure telemetry is durable before risky long-running operation
- Add 60s heartbeat timer in LinuxContainerDetector for process liveness
  monitoring
- Add ILogger breadcrumbs for container start, removal, and cleanup

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@AMaini503 force-pushed the fix/docker-scan-timeout-hardening branch from c74e704 to d365d19 on April 24, 2026, 19:08
@AMaini503 merged commit cdbf903 into main on Apr 24, 2026
15 checks passed
@AMaini503 deleted the fix/docker-scan-timeout-hardening branch on April 24, 2026, 20:15

codecov Bot commented Apr 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 0.0%. Comparing base (7ce401b) to head (d365d19).
⚠️ Report is 1 commit behind head on main.
