[BUGFIX] Fall back to in-memory model when best checkpoint can't be loaded#671
Open
rhaist wants to merge 2 commits into
Open
[BUGFIX] Fall back to in-memory model when best checkpoint can't be loaded#671rhaist wants to merge 2 commits into
rhaist wants to merge 2 commits into
Conversation
…oaded Lightning's ModelCheckpoint updates best_model_path before the underlying file write completes. If training is interrupted between those steps the finally block in nam.train.full.main raises when it tries to load the partial or missing file, so no .nam is exported. Wrap the load and fall back to the in-memory model on failure. Adds a deterministic regression test alongside the existing flaky integration test. Fixes sdatkinson#645.
sdatkinson
previously approved these changes
Jun 8, 2026
sdatkinson
left a comment
Owner
There was a problem hiding this comment.
Fantastic sleuthing--thank you!
Nit on the Exception catching; feel free to address or else I'll merge this after perhaps a day or two if I don't hear from you.
| best_checkpoint, | ||
| **lightning_cls.parse_config(model_config), | ||
| ) | ||
| except Exception as e: |
Owner
There was a problem hiding this comment.
Is there a more specific exception we can catch?
Author
There was a problem hiding this comment.
I did not figure out a better Exception handling here. Maybe you have an idea?
Author
There was a problem hiding this comment.
Looked into it over the weekend and found a better exception handling :)
Address PR review: replace bare 'except Exception' with the concrete failure modes torch.load raises on a missing or partially-written checkpoint (FileNotFoundError, RuntimeError, EOFError, pickle.UnpicklingError). Verified empirically against torch 2.12.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #645.
What's going on
Lightning's
ModelCheckpointsetsbest_model_pathbefore the checkpoint file write actually finishes. If Ctrl+C lands in that small window,best_model_pathends up pointing at a missing or partial file. Thefinallyblock innam/train/full.pythen raises when it tries to load that path, the export never runs, and no.namis produced.On my local machine the flake is rare; the CI runner is slow enough to hit it ~25% of the time.
Fix
Wrap the
load_from_checkpointcall and, on any failure, log a warning and fall back to the in-memory model. The.namalways gets written.Test
Adds
TestExportFallbackOnBadCheckpointwhich monkeypatches Lightning's_atomic_saveto write garbage so every checkpoint is unreadable. With the fix the test passes (warning logged,.namexported); reverting the fix makes it fail with the originalUnpicklingError. ExistingTestGracefulShutdownstill passes.