
Conversation

@agners (Member) commented Dec 2, 2025

Proposed change

Refactor Docker image pull progress tracking for more accurate and reliable progress updates.

Problem:
The previous progress tracking had several issues:

  • Progress would get stuck when large layers were discovered late (Docker limits concurrent downloads to ~3 layers, so a large layer discovered late bumps the total size late in the pull)
  • Progress would jump to 100% prematurely when cached layers were reported before downloads started
  • Size-weighted progress was unreliable because layer sizes weren't known until each layer started downloading

Solution:
This PR introduces a registry manifest fetcher that retrieves layer sizes directly from container registries before pulling. This enables accurate size-based progress tracking (a sketch of the calculation follows the list below):

  1. Registry Manifest Fetcher (supervisor/docker/manifest.py):

    • Fetches manifests directly from registries (ghcr.io, Docker Hub, private registries)
    • Discovers auth endpoints via WWW-Authenticate headers
    • Handles multi-arch manifest lists by selecting the correct platform
    • Uses credentials from Docker config for private registries
  2. Size-weighted progress:

    • Each layer contributes to progress proportionally to its byte size
    • Large layers (e.g., 500MB) have more weight than small layers (e.g., 100 bytes)
    • Falls back to count-based progress (equal weight per layer) if manifest fetch fails
  3. Exclude cached layers:

    • Layers reported as "Already exists" locally are excluded from the progress calculation
    • Progress only tracks bytes that actually need to be downloaded
  4. Wait for downloads to start:

    • Progress returns 0% until first "Downloading" event
    • This prevents premature progress jumps from cached layer events
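
A minimal sketch of the size-weighted calculation described above. The names (`LayerProgress`, `overall_progress`) are illustrative only; the actual implementation lives in `supervisor/docker/pull_progress.py` and differs in detail, and the 70/30 download/extract split is the one mentioned later in this thread:

```python
# Illustrative sketch only -- not the actual ImagePullProgress implementation.
# Assumes per-layer byte counts from Docker's pull status events, with layer
# totals pre-populated from the registry manifest where available.


class LayerProgress:
    """Tracks one layer's download/extract progress in bytes."""

    def __init__(self, total: int | None = None) -> None:
        self.total = total  # bytes, from the registry manifest if available
        self.downloaded = 0
        self.extracted = 0
        self.cached = False  # "Already exists" layers are excluded entirely


def overall_progress(layers: list[LayerProgress], seen_downloading: bool) -> float:
    """Size-weighted progress across all layers that need pulling."""
    # Point 4 above: stay at 0% until the first "Downloading" event.
    if not seen_downloading:
        return 0.0

    # Point 3 above: cached layers contribute nothing.
    active = [layer for layer in layers if not layer.cached and layer.total]
    total_bytes = sum(layer.total for layer in active)
    if not total_bytes:
        return 0.0

    done = 0.0
    for layer in active:
        # 70% of a layer's weight for downloading, 30% for extracting.
        done += 0.7 * min(layer.downloaded, layer.total)
        done += 0.3 * min(layer.extracted, layer.total)

    # Clamp to guard against floating point drift above 100.
    return min(100.0, done / total_bytes * 100)
```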

Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New feature (which adds functionality to the supervisor)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests

Additional information

  • This PR fixes or closes issue: fixes #
  • This PR is related to issue:
  • Link to documentation pull request:
  • Link to cli pull request:
  • Link to client library pull request:

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • I have followed the development checklist
  • The code has been formatted using Ruff (ruff format supervisor tests)
  • Tests have been added to verify that the new code works.

If API endpoints or add-on configuration are added/changed:

@agners agners requested a review from mdegat01 December 2, 2025 14:53
@agners agners added the "refactor" label (a code change that neither fixes a bug nor adds a feature) Dec 2, 2025
@agners agners marked this pull request as draft December 2, 2025 15:05
"""Docker registry manifest fetcher.
Fetches image manifests directly from container registries to get layer sizes
before pulling an image. This enables accurate size-based progress tracking.
Member Author

Rather than implementing the registry manifest fetcher as part of Supervisor, we should either call the Docker CLI (we'd need to create a config to get authentication right, but we have the necessary code already) or move the logic into aiodocker (I would prefer this approach).

@agners agners force-pushed the refactor-docker-pull-progress-new branch from ae04e3e to f2596fe on December 2, 2025 20:18
@agners agners marked this pull request as ready for review December 2, 2025 20:19
@agners agners force-pushed the refactor-docker-pull-progress-new branch 3 times, most recently from e9204fc to 620c906 on December 3, 2025 17:49
@mdegat01 (Contributor) left a comment

A few things to note in here. The session-reuse one in particular seems like something we need to fix, as we go to a lot of effort to avoid creating new sessions elsewhere.

Comment on lines 127 to 133
try:
    async with session.get(manifest_url) as resp:
        if resp.status == 200:
            # No auth required
            return None

        if resp.status != 401:
            _LOGGER.warning(
                "Unexpected status %d from registry %s", resp.status, registry
            )
            return None

        www_auth = resp.headers.get("WWW-Authenticate", "")
except aiohttp.ClientError as err:
    _LOGGER.warning("Failed to connect to registry %s: %s", registry, err)
    return None
Contributor

There's no way around trying and failing to figure out if we need auth and how to provide it? We could just call self._get_credentials(registry) up front, but it seems like we can't figure out the realm and service without making the call and failing, which is unfortunate for performance...

Member Author

Yeah... but given that auth is really not all that common and more of a corner case, I don't think it is a big deal 🤷.

Contributor

Fair enough. Yeah, reading how the code works, I don't see an obvious way around it if the realm and service can change. So this is what we have to do, I suppose.
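
For readers following along, the discovery flow being discussed looks roughly like this. This is a naive sketch with a hypothetical `discover_token` helper, assuming an existing aiohttp session; the challenge parsing is deliberately simplistic (quoted commas would break it) and is not Supervisor's actual code:

```python
import aiohttp


async def discover_token(
    session: aiohttp.ClientSession, manifest_url: str
) -> str | None:
    """Probe the registry once, then follow the WWW-Authenticate challenge."""
    async with session.get(manifest_url) as resp:
        if resp.status == 200:
            return None  # no auth needed
        challenge = resp.headers.get("WWW-Authenticate", "")

    # Example challenge:
    #   Bearer realm="https://ghcr.io/token",service="ghcr.io"
    if not challenge.startswith("Bearer "):
        return None
    params = dict(
        part.split("=", 1)
        for part in challenge[len("Bearer "):].split(",")
        if "=" in part
    )
    realm = params.get("realm", "").strip('"')
    service = params.get("service", "").strip('"')
    if not realm:
        return None

    # The realm and service can only be learned from this failed request,
    # which is the extra round-trip discussed above.
    async with session.get(realm, params={"service": service}) as token_resp:
        data = await token_resp.json()
        return data.get("token")
```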

    and evt.args[0]["data"]["event"] == WSEvent.JOB
    and evt.args[0]["data"]["data"]["name"] == "supervisor_update"
]
# Count-based progress: 2 layers need pulling (each worth 50%)
Contributor

None of this is count-based progress anymore, right?

Comment on lines +157 to +159
# Mock manifest fetcher to return None (falls back to count-based progress)
docker_obj._manifest_fetcher.get_manifest = AsyncMock(return_value=None)

Contributor

This doesn't seem accurate? Removing the manifest doesn't force us to count-based progress; it just means the progress reporter creates and tracks layers as it sees them. More things would have to go wrong to force us to use count-based progress, I think?

Member Author

No, this is still accurate.

This is the whole reason I've moved to the manifest approach: if a couple of small layers are downloaded first and we calculate the overall progress based on that, we quickly arrive at a high progress percentage (like 60% or so). Then, if a huge layer follows and we "rescale" progress, the progress can stay stuck for minutes! This is very confusing for the user; more confusing than a progress bar that moves a bit faster at times and slower at others. But with 10 layers, the count-based fallback will still move the progress bar during a large download.

So either we can get the layer sizes ahead of time (99.9% of cases, thanks to the manifest fetcher) and have a progress bar that tracks overall image download progress against a known total size, or we fall back to the layer-count method.
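
To make the rescaling problem concrete, with invented numbers:

```python
# Made-up numbers: three 10 MB layers finish while a 500 MB layer is still
# undiscovered (Docker only downloads ~3 layers concurrently).
downloaded = 3 * 10   # MB completed so far
known_total = 3 * 10  # MB known so far
print(f"{downloaded / known_total:.0%}")   # 100% -> the bar races ahead

known_total += 500  # the large layer is finally reported
print(f"{downloaded / known_total:.2%}")   # ~5.66% -> the bar drops and crawls
```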

Contributor

Ah right, since it's designed to generally know all sizes up front, we start calculating as soon as we see some progress, not once we see downloading for all layers. So without a manifest it's just going to shoot right up. OK, that's fine then.

    for evt in ha_ws_client.async_send_command.call_args_list
    if "data" in evt.args[0]
    and evt.args[0]["data"]["event"] == WSEvent.JOB
    and evt.args[0]["data"]["data"]["name"] == "home_assistant_core_update"
Contributor

I see we added a lot of unit tests for the new stuff, but wouldn't it make sense to have at least one of these higher-level tests pulling and using a manifest to report progress more accurately? Most of our existing tests would stumble initially because they create layers on demand as they see them, so progress may go backwards, but with a manifest it would be accurate from the beginning.

Member Author

I've checked my logs and Docker event captures locally: I do have a couple of event captures, but I think I only have a log captured here. Another problem is that on that (virtual) machine the 70/30 split for downloading/extracting is probably off:
layer-progress-non-linear-containerd-snapshotter.txt

But anyhow, an end-to-end test would be nice indeed. I mostly encountered this with Home Assistant Core bumps, but it depends a bit on whether base layers got updated or not. So 2025.11.3 -> 2025.12.0 might behave quite a bit differently from 2025.11.2 -> 2025.11.3.

@home-assistant home-assistant bot marked this pull request as draft December 3, 2025 22:41

home-assistant bot commented Dec 3, 2025

Please take a look at the requested changes, and use the Ready for review button when you are done, thanks 👍

Learn more about our pull request process.

Comment on lines +247 to +249
if should_update:
    stage = pull_progress.get_stage()
    current_job.update(progress=progress, stage=stage)
Contributor

Also, it might be good to have some info-level logging around here for #6390. Removing the timeout fixes part of the issue, but they're right that no logs for 15 minutes will still look like a crash. Some logs here would add visibility into what's going on until onboarding can add a progress bar of some kind.

Member Author

Yeah, agreed. But I'd rather have that in a follow-up PR; then we can also discuss the exact rate of logging, etc.

@agners agners force-pushed the refactor-docker-pull-progress-new branch from 1dc5fd0 to 9698085 on December 5, 2025 15:11
agners and others added 9 commits December 5, 2025 16:14
Refactor Docker image pull progress to use a simpler count-based approach
where each layer contributes equally (100% / total_layers) regardless of
size. This replaces the previous size-weighted calculation that was
susceptible to progress regression.

The core issue was that Docker rate-limits concurrent downloads (~3 at a
time) and reports layer sizes only when downloading starts. With size-
weighted progress, large layers appearing late would cause progress to
drop dramatically (e.g., 59% -> 29%) as the total size increased.

The new approach:
- Each layer contributes equally to overall progress
- Per-layer progress: 70% download weight, 30% extraction weight
- Progress only starts after first "Downloading" event (when layer
  count is known)
- Always caps at 99% - job completion handles final 100%

This simplifies the code by moving progress tracking to a dedicated
module (pull_progress.py) and removing complex size-based scaling logic
that tried to account for unknown layer sizes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Layers that already exist locally should not count towards download
progress since there's nothing to download for them. Only layers that
need pulling are included in the progress calculation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Fetch image manifests directly from container registries before pulling
to get accurate layer sizes upfront. This enables size-weighted progress
tracking where each layer contributes proportionally to its byte size,
rather than equal weight per layer.

Key changes:
- Add RegistryManifestFetcher that handles auth discovery via
  WWW-Authenticate headers, token fetching with optional credentials,
  and multi-arch manifest list resolution
- Update ImagePullProgress to accept manifest layer sizes via
  set_manifest() and calculate size-weighted progress
- Fall back to count-based progress when manifest fetch fails
- Pre-populate layer sizes from manifest when creating layer trackers

The manifest fetcher supports ghcr.io, Docker Hub, and private
registries by using credentials from Docker config when available.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
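
A sketch of the multi-arch resolution step this commit describes. The media types below are the standard Docker/OCI ones, but `select_platform_digest` and its matching logic are illustrative, not the PR's code:

```python
# Sketch: pick the per-platform manifest digest out of a manifest list.
MANIFEST_LIST_TYPES = {
    "application/vnd.docker.distribution.manifest.list.v2+json",
    "application/vnd.oci.image.index.v1+json",
}


def select_platform_digest(index: dict, os: str, arch: str) -> str | None:
    """Return the digest of the manifest matching the requested platform."""
    if index.get("mediaType") not in MANIFEST_LIST_TYPES:
        return None  # already a single-platform manifest
    for entry in index.get("manifests", []):
        platform = entry.get("platform", {})
        if platform.get("os") == os and platform.get("architecture") == arch:
            return entry["digest"]
    # Requested platform missing: return None rather than guessing,
    # matching the behavior described in the commit above.
    return None
```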
Floating point arithmetic in weighted progress calculations can produce
values slightly above 100 (e.g., 100.00000000000001). This causes
validation errors when the progress value is checked.

Add min(100, ...) clamping to both size-weighted and count-based
progress calculations to ensure the result never exceeds 100.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Reuse the existing CoreSys websession for registry manifest requests
instead of creating a new aiohttp session. This improves performance
and follows the established pattern used throughout the codebase.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

- Make platform a required parameter in get_manifest() and _fetch_manifest()
  since it's always provided by the calling code
- Return None and log warning when requested platform is not found in
  multi-arch manifest list, instead of falling back to first manifest
  which could be the wrong architecture

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Users will notice degraded progress tracking when manifest fetch fails,
so log at warning level to help diagnose issues.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Update download_current and total_size independently in the DOWNLOADING
handler. This ensures download_current is updated even when total is
not yet available.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@agners agners force-pushed the refactor-docker-pull-progress-new branch 3 times, most recently from 9441498 to e11a62e on December 5, 2025 15:27