-
Notifications
You must be signed in to change notification settings - Fork 753
Refactor Docker pull progress with registry manifest fetcher #6379
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| """Docker registry manifest fetcher. | ||
| Fetches image manifests directly from container registries to get layer sizes | ||
| before pulling an image. This enables accurate size-based progress tracking. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than implementing registry manifest fetcher as part of Supervisor we should either call the Docker CLI (we'd need to create a config to get authentication right, but we have the necessary code already) or move the logic into aiodocker (I would prefer this approach).
ae04e3e to
f2596fe
Compare
e9204fc to
620c906
Compare
mdegat01
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few things to note in here. The session reuse one in particular seems like something we need to fix as we go to a lot of effort to avoid making new sessions elsewhere.
| try: | ||
| async with session.get(manifest_url) as resp: | ||
| if resp.status == 200: | ||
| # No auth required | ||
| return None | ||
|
|
||
| if resp.status != 401: | ||
| _LOGGER.warning( | ||
| "Unexpected status %d from registry %s", resp.status, registry | ||
| ) | ||
| return None | ||
|
|
||
| www_auth = resp.headers.get("WWW-Authenticate", "") | ||
| except aiohttp.ClientError as err: | ||
| _LOGGER.warning("Failed to connect to registry %s: %s", registry, err) | ||
| return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no way around trying and failing to figure out if we need auth and how to provide it? We could just call self._get_credentials(registry) up front but it seems like we can't figure out the realm and service without making the call and failing which is unfortunate for performance...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah... But given that auth is really not all that common and more of a corner case I don't think it is a big deal 🤷 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. Yea I mean reading how the code works I don't see an obvious way around it if the realm and service can change. So this is what we have to do I suppose.
| and evt.args[0]["data"]["event"] == WSEvent.JOB | ||
| and evt.args[0]["data"]["data"]["name"] == "supervisor_update" | ||
| ] | ||
| # Count-based progress: 2 layers need pulling (each worth 50%) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None of this is count-based progress anymore right?
| # Mock manifest fetcher to return None (falls back to count-based progress) | ||
| docker_obj._manifest_fetcher.get_manifest = AsyncMock(return_value=None) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem accurate? Removing the manifest doesn't force us to count-based progress, it just means the progress reporter creates and tracks layers as it sees them. More things would have to go wrong to force us to used count-based progress I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No this is still accurate.
This is the whole reason I've moved to the manifest approach: If a couple small layers are downloaded first, and we calculate the overall progress based on that, we quickly arrive at a high progress procentage (like 60% or so), then if a huge layer follows, and we "rescale" progress, the progress stays stuck potentially for minutes! This is very confusing for the user. More confusing than a progress bar which moves a bit faster at times, and then slower at other times. But if we have 10 layers, it will move the progress bar during a large download still.
So, either we can get the layer sizes ahead of time (99.9% of the cases with the manifest fetcher), and have a progress bar which tracks overall image download progress by known overall size ahead of time, or we fallback with the layer count method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right, since its designed to generally know all sizes up front we start calculating as soon as we see some progress, not when we see downloading for all layers. So without a manifest its just going to shoot right up. Ok that's fine then.
| for evt in ha_ws_client.async_send_command.call_args_list | ||
| if "data" in evt.args[0] | ||
| and evt.args[0]["data"]["event"] == WSEvent.JOB | ||
| and evt.args[0]["data"]["data"]["name"] == "home_assistant_core_update" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see we added a lot of unit tests for the new stuff but wouldn't it make sense to have at least one of these higher level tests pulling and using a manifest to help more accurately report progress? Like how most of our existing tests would stumble initially because they make layers on demand as they see them and progress may go backwards but with a manifest it would be accurate from the beginning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've checked my log and Docker events captures locally, I do have a couple of events captures, but I think I only have a log captured. Another problem is that maybe on that (virtual) machine the 70/30 split for downloading/extracting is probably off:
layer-progress-non-linear-containerd-snapshotter.txt
But anyhow, an end-to-end test would be nice indeed. I mostly encountered this with Home Assistant Core bumps. But it depends a bit if base layers got updated or not. So 2025.11.3->2025.12.0 might behave quite a bit different from 2025.11.2->2025.11.3.
|
Please take a look at the requested changes, and use the Ready for review button when you are done, thanks 👍 |
| if should_update: | ||
| stage = pull_progress.get_stage() | ||
| current_job.update(progress=progress, stage=stage) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also it might be good to have some info level logging around here for #6390 . Removing the timeout fixes part of the issue but they're right that no logs for 15 minutes will still look like a crash. Some logs here would add visibility to what's going on until onboarding can add a progress bar of some kind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah agreed. But I rather have that in a follow up PR, then we can also discuss the exact rate of logging etc.
1dc5fd0 to
9698085
Compare
Refactor Docker image pull progress to use a simpler count-based approach where each layer contributes equally (100% / total_layers) regardless of size. This replaces the previous size-weighted calculation that was susceptible to progress regression. The core issue was that Docker rate-limits concurrent downloads (~3 at a time) and reports layer sizes only when downloading starts. With size- weighted progress, large layers appearing late would cause progress to drop dramatically (e.g., 59% -> 29%) as the total size increased. The new approach: - Each layer contributes equally to overall progress - Per-layer progress: 70% download weight, 30% extraction weight - Progress only starts after first "Downloading" event (when layer count is known) - Always caps at 99% - job completion handles final 100% This simplifies the code by moving progress tracking to a dedicated module (pull_progress.py) and removing complex size-based scaling logic that tried to account for unknown layer sizes. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Layers that already exist locally should not count towards download progress since there's nothing to download for them. Only layers that need pulling are included in the progress calculation. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Fetch image manifests directly from container registries before pulling to get accurate layer sizes upfront. This enables size-weighted progress tracking where each layer contributes proportionally to its byte size, rather than equal weight per layer. Key changes: - Add RegistryManifestFetcher that handles auth discovery via WWW-Authenticate headers, token fetching with optional credentials, and multi-arch manifest list resolution - Update ImagePullProgress to accept manifest layer sizes via set_manifest() and calculate size-weighted progress - Fall back to count-based progress when manifest fetch fails - Pre-populate layer sizes from manifest when creating layer trackers The manifest fetcher supports ghcr.io, Docker Hub, and private registries by using credentials from Docker config when available. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Floating point arithmetic in weighted progress calculations can produce values slightly above 100 (e.g., 100.00000000000001). This causes validation errors when the progress value is checked. Add min(100, ...) clamping to both size-weighted and count-based progress calculations to ensure the result never exceeds 100. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Reuse the existing CoreSys websession for registry manifest requests instead of creating a new aiohttp session. This improves performance and follows the established pattern used throughout the codebase. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
- Make platform a required parameter in get_manifest() and _fetch_manifest() since it's always provided by the calling code - Return None and log warning when requested platform is not found in multi-arch manifest list, instead of falling back to first manifest which could be the wrong architecture 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Users will notice degraded progress tracking when manifest fetch fails, so log at warning level to help diagnose issues. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Update download_current and total_size independently in the DOWNLOADING handler. This ensures download_current is updated even when total is not yet available. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
9441498 to
e11a62e
Compare
Proposed change
Refactor Docker image pull progress tracking for more accurate and reliable progress updates.
Problem:
The previous progress tracking had issues where:
Solution:
This PR introduces a registry manifest fetcher that retrieves layer sizes directly from container registries before pulling. This enables accurate size-based progress tracking:
Registry Manifest Fetcher (
supervisor/docker/manifest.py):Size-weighted progress:
Exclude cached layers:
Wait for downloads to start:
Type of change
Additional information
Checklist
ruff format supervisor tests)If API endpoints or add-on configuration are added/changed: