
Conversation

@agners (Member) commented Dec 2, 2025

Proposed change

Refactor Docker image pull progress tracking for more accurate and reliable progress updates.

Problem:
The previous progress tracking had several issues:

  • Progress would get stuck when large layers were discovered late (Docker limits concurrent downloads to ~3 layers, so a large layer discovered late bumps the total size late in the pull)
  • Progress would jump to 100% prematurely when cached layers were reported before downloads started
  • Size-weighted progress was unreliable because layer sizes weren't known until each layer started downloading

Solution:
This PR introduces a registry manifest fetcher that retrieves layer sizes directly from container registries before pulling. This enables accurate size-based progress tracking (a sketch of the calculation follows the list below):

  1. Registry Manifest Fetcher (supervisor/docker/manifest.py):

    • Fetches manifests directly from registries (ghcr.io, Docker Hub, private registries)
    • Discovers auth endpoints via WWW-Authenticate headers
    • Handles multi-arch manifest lists by selecting the correct platform
    • Uses credentials from Docker config for private registries
  2. Size-weighted progress:

    • Each layer contributes to progress proportionally to its byte size
    • Large layers (e.g., 500MB) have more weight than small layers (e.g., 100 bytes)
    • Falls back to count-based progress (equal weight per layer) if manifest fetch fails
  3. Exclude cached layers:

    • Layers reported as "Already exists" locally are excluded from the progress calculation
    • Progress only tracks bytes that actually need to be downloaded
  4. Wait for downloads to start:

    • Progress returns 0% until first "Downloading" event
    • This prevents premature progress jumps from cached layer events
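
A minimal sketch of the size-weighted calculation described above. The names (`LayerProgress`, `overall_progress`) are illustrative only; the actual implementation lives in `supervisor/docker/pull_progress.py` and differs in detail, and the 70/30 download/extract split is the one mentioned later in this thread:

```python
# Illustrative sketch only -- not the actual ImagePullProgress implementation.
# Assumes per-layer byte counts from Docker's pull status events, with layer
# totals pre-populated from the registry manifest where available.


class LayerProgress:
    """Tracks one layer's download/extract progress in bytes."""

    def __init__(self, total: int | None = None) -> None:
        self.total = total  # bytes, from the registry manifest if available
        self.downloaded = 0
        self.extracted = 0
        self.cached = False  # "Already exists" layers are excluded entirely


def overall_progress(layers: list[LayerProgress], seen_downloading: bool) -> float:
    """Size-weighted progress across all layers that need pulling."""
    # Point 4 above: stay at 0% until the first "Downloading" event.
    if not seen_downloading:
        return 0.0

    # Point 3 above: cached layers contribute nothing.
    active = [layer for layer in layers if not layer.cached and layer.total]
    total_bytes = sum(layer.total for layer in active)
    if not total_bytes:
        return 0.0

    done = 0.0
    for layer in active:
        # 70% of a layer's weight for downloading, 30% for extracting.
        done += 0.7 * min(layer.downloaded, layer.total)
        done += 0.3 * min(layer.extracted, layer.total)

    # Clamp to guard against floating point drift above 100.
    return min(100.0, done / total_bytes * 100)
```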

Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New feature (which adds functionality to the supervisor)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests

Additional information

  • This PR fixes or closes issue: fixes #
  • This PR is related to issue:
  • Link to documentation pull request:
  • Link to cli pull request:
  • Link to client library pull request:

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • I have followed the development checklist
  • The code has been formatted using Ruff (ruff format supervisor tests)
  • Tests have been added to verify that the new code works.

If API endpoints or add-on configuration are added/changed:

@agners agners requested a review from mdegat01 December 2, 2025 14:53
@agners agners added the "refactor" label (a code change that neither fixes a bug nor adds a feature) Dec 2, 2025
@agners agners marked this pull request as draft December 2, 2025 15:05
"""Docker registry manifest fetcher.
Fetches image manifests directly from container registries to get layer sizes
before pulling an image. This enables accurate size-based progress tracking.
Member Author

Rather than implementing the registry manifest fetcher as part of Supervisor, we should either call the Docker CLI (we'd need to create a config to get authentication right, but we have the necessary code already) or move the logic into aiodocker (I would prefer this approach).

@agners agners force-pushed the refactor-docker-pull-progress-new branch from ae04e3e to f2596fe on December 2, 2025 20:18
@agners agners marked this pull request as ready for review December 2, 2025 20:19
@agners agners force-pushed the refactor-docker-pull-progress-new branch 3 times, most recently from e9204fc to 620c906 on December 3, 2025 17:49
@mdegat01 (Contributor) left a comment

A few things to note in here. The session-reuse one in particular seems like something we need to fix, as we go to a lot of effort to avoid creating new sessions elsewhere.

Comment on lines 127 to 133
try:
    async with session.get(manifest_url) as resp:
        if resp.status == 200:
            # No auth required
            return None

        if resp.status != 401:
            _LOGGER.warning(
                "Unexpected status %d from registry %s", resp.status, registry
            )
            return None

        www_auth = resp.headers.get("WWW-Authenticate", "")
except aiohttp.ClientError as err:
    _LOGGER.warning("Failed to connect to registry %s: %s", registry, err)
    return None
Contributor

There's no way around trying and failing to figure out if we need auth and how to provide it? We could just call self._get_credentials(registry) up front, but it seems like we can't figure out the realm and service without making the call and failing, which is unfortunate for performance...

Member Author

Yeah... but given that auth is really not all that common and more of a corner case, I don't think it is a big deal 🤷.

Contributor

Fair enough. Yeah, reading how the code works, I don't see an obvious way around it if the realm and service can change. So this is what we have to do, I suppose.
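
For readers following along, the discovery flow being discussed looks roughly like this. This is a naive sketch with a hypothetical `discover_token` helper, assuming an existing aiohttp session; the challenge parsing is deliberately simplistic (quoted commas would break it) and is not Supervisor's actual code:

```python
import aiohttp


async def discover_token(
    session: aiohttp.ClientSession, manifest_url: str
) -> str | None:
    """Probe the registry once, then follow the WWW-Authenticate challenge."""
    async with session.get(manifest_url) as resp:
        if resp.status == 200:
            return None  # no auth needed
        challenge = resp.headers.get("WWW-Authenticate", "")

    # Example challenge:
    #   Bearer realm="https://ghcr.io/token",service="ghcr.io"
    if not challenge.startswith("Bearer "):
        return None
    params = dict(
        part.split("=", 1)
        for part in challenge[len("Bearer "):].split(",")
        if "=" in part
    )
    realm = params.get("realm", "").strip('"')
    service = params.get("service", "").strip('"')
    if not realm:
        return None

    # The realm and service can only be learned from this failed request,
    # which is the extra round-trip discussed above.
    async with session.get(realm, params={"service": service}) as token_resp:
        data = await token_resp.json()
        return data.get("token")
```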

    and evt.args[0]["data"]["event"] == WSEvent.JOB
    and evt.args[0]["data"]["data"]["name"] == "supervisor_update"
]
# Count-based progress: 2 layers need pulling (each worth 50%)
Contributor

None of this is count-based progress anymore, right?

Comment on lines +157 to +159
# Mock manifest fetcher to return None (falls back to count-based progress)
docker_obj._manifest_fetcher.get_manifest = AsyncMock(return_value=None)

Contributor

This doesn't seem accurate? Removing the manifest doesn't force us to count-based progress; it just means the progress reporter creates and tracks layers as it sees them. More things would have to go wrong to force us to use count-based progress, I think?

Member Author

No, this is still accurate.

This is the whole reason I've moved to the manifest approach: if a couple of small layers are downloaded first and we calculate the overall progress based on that, we quickly arrive at a high progress percentage (like 60% or so). Then, if a huge layer follows and we "rescale" progress, the progress can stay stuck for minutes! This is very confusing for the user; more confusing than a progress bar that moves a bit faster at times and slower at others. But with 10 layers, the count-based fallback will still move the progress bar during a large download.

So either we can get the layer sizes ahead of time (99.9% of cases, thanks to the manifest fetcher) and have a progress bar that tracks overall image download progress against a known total size, or we fall back to the layer-count method.
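
To make the rescaling problem concrete, with invented numbers:

```python
# Made-up numbers: three 10 MB layers finish while a 500 MB layer is still
# undiscovered (Docker only downloads ~3 layers concurrently).
downloaded = 3 * 10   # MB completed so far
known_total = 3 * 10  # MB known so far
print(f"{downloaded / known_total:.0%}")   # 100% -> the bar races ahead

known_total += 500  # the large layer is finally reported
print(f"{downloaded / known_total:.2%}")   # ~5.66% -> the bar drops and crawls
```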

Contributor

Ah right, since it's designed to generally know all sizes up front, we start calculating as soon as we see some progress, not once we see downloading for all layers. So without a manifest it's just going to shoot right up. OK, that's fine then.

    for evt in ha_ws_client.async_send_command.call_args_list
    if "data" in evt.args[0]
    and evt.args[0]["data"]["event"] == WSEvent.JOB
    and evt.args[0]["data"]["data"]["name"] == "home_assistant_core_update"
Contributor

I see we added a lot of unit tests for the new stuff, but wouldn't it make sense to have at least one of these higher-level tests pulling and using a manifest to report progress more accurately? Most of our existing tests would stumble initially because they create layers on demand as they see them, so progress may go backwards, but with a manifest it would be accurate from the beginning.

Member Author

I've checked my logs and Docker event captures locally: I do have a couple of event captures, but I think I only have a log captured here. Another problem is that on that (virtual) machine the 70/30 split for downloading/extracting is probably off:
layer-progress-non-linear-containerd-snapshotter.txt

But anyhow, an end-to-end test would be nice indeed. I mostly encountered this with Home Assistant Core bumps, but it depends a bit on whether base layers got updated or not. So 2025.11.3 -> 2025.12.0 might behave quite a bit differently from 2025.11.2 -> 2025.11.3.

@home-assistant home-assistant bot marked this pull request as draft December 3, 2025 22:41

home-assistant bot commented Dec 3, 2025

Please take a look at the requested changes, and use the Ready for review button when you are done, thanks 👍

Learn more about our pull request process.

Comment on lines +247 to +249
if should_update:
    stage = pull_progress.get_stage()
    current_job.update(progress=progress, stage=stage)
Contributor

Also, it might be good to have some info-level logging around here for #6390. Removing the timeout fixes part of the issue, but they're right that no logs for 15 minutes will still look like a crash. Some logs here would add visibility into what's going on until onboarding can add a progress bar of some kind.

Member Author

Yeah, agreed. But I'd rather have that in a follow-up PR; then we can also discuss the exact rate of logging, etc.

@agners agners force-pushed the refactor-docker-pull-progress-new branch from 1dc5fd0 to 9698085 on December 5, 2025 15:11
agners and others added 9 commits December 5, 2025 16:14
Refactor Docker image pull progress to use a simpler count-based approach
where each layer contributes equally (100% / total_layers) regardless of
size. This replaces the previous size-weighted calculation that was
susceptible to progress regression.

The core issue was that Docker rate-limits concurrent downloads (~3 at a
time) and reports layer sizes only when downloading starts. With size-
weighted progress, large layers appearing late would cause progress to
drop dramatically (e.g., 59% -> 29%) as the total size increased.

The new approach:
- Each layer contributes equally to overall progress
- Per-layer progress: 70% download weight, 30% extraction weight
- Progress only starts after first "Downloading" event (when layer
  count is known)
- Always caps at 99% - job completion handles final 100%

This simplifies the code by moving progress tracking to a dedicated
module (pull_progress.py) and removing complex size-based scaling logic
that tried to account for unknown layer sizes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Layers that already exist locally should not count towards download
progress since there's nothing to download for them. Only layers that
need pulling are included in the progress calculation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Fetch image manifests directly from container registries before pulling
to get accurate layer sizes upfront. This enables size-weighted progress
tracking where each layer contributes proportionally to its byte size,
rather than equal weight per layer.

Key changes:
- Add RegistryManifestFetcher that handles auth discovery via
  WWW-Authenticate headers, token fetching with optional credentials,
  and multi-arch manifest list resolution
- Update ImagePullProgress to accept manifest layer sizes via
  set_manifest() and calculate size-weighted progress
- Fall back to count-based progress when manifest fetch fails
- Pre-populate layer sizes from manifest when creating layer trackers

The manifest fetcher supports ghcr.io, Docker Hub, and private
registries by using credentials from Docker config when available.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
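
A sketch of the multi-arch resolution step this commit describes. The media types below are the standard Docker/OCI ones, but `select_platform_digest` and its matching logic are illustrative, not the PR's code:

```python
# Sketch: pick the per-platform manifest digest out of a manifest list.
MANIFEST_LIST_TYPES = {
    "application/vnd.docker.distribution.manifest.list.v2+json",
    "application/vnd.oci.image.index.v1+json",
}


def select_platform_digest(index: dict, os: str, arch: str) -> str | None:
    """Return the digest of the manifest matching the requested platform."""
    if index.get("mediaType") not in MANIFEST_LIST_TYPES:
        return None  # already a single-platform manifest
    for entry in index.get("manifests", []):
        platform = entry.get("platform", {})
        if platform.get("os") == os and platform.get("architecture") == arch:
            return entry["digest"]
    # Requested platform missing: return None rather than guessing,
    # matching the behavior described in the commit above.
    return None
```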
Floating point arithmetic in weighted progress calculations can produce
values slightly above 100 (e.g., 100.00000000000001). This causes
validation errors when the progress value is checked.

Add min(100, ...) clamping to both size-weighted and count-based
progress calculations to ensure the result never exceeds 100.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Reuse the existing CoreSys websession for registry manifest requests
instead of creating a new aiohttp session. This improves performance
and follows the established pattern used throughout the codebase.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

- Make platform a required parameter in get_manifest() and _fetch_manifest()
  since it's always provided by the calling code
- Return None and log warning when requested platform is not found in
  multi-arch manifest list, instead of falling back to first manifest
  which could be the wrong architecture

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Users will notice degraded progress tracking when manifest fetch fails,
so log at warning level to help diagnose issues.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Update download_current and total_size independently in the DOWNLOADING
handler. This ensures download_current is updated even when total is
not yet available.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@agners agners force-pushed the refactor-docker-pull-progress-new branch 3 times, most recently from 9441498 to e11a62e on December 5, 2025 15:27