Skip to content

[BUGFIX] Fall back to in-memory model when best checkpoint can't be loaded#671

Open
rhaist wants to merge 2 commits into
sdatkinson:mainfrom
rhaist:fix-graceful-shutdown-flake
Open

[BUGFIX] Fall back to in-memory model when best checkpoint can't be loaded#671
rhaist wants to merge 2 commits into
sdatkinson:mainfrom
rhaist:fix-graceful-shutdown-flake

Conversation

@rhaist

@rhaist rhaist commented May 24, 2026

Copy link
Copy Markdown

Fixes #645.

What's going on

Lightning's ModelCheckpoint sets best_model_path before the checkpoint file write actually finishes. If Ctrl+C lands in that small window, best_model_path ends up pointing at a missing or partial file. The finally block in nam/train/full.py then raises when it tries to load that path, the export never runs, and no .nam is produced.

On my local machine the flake is rare; the CI runner is slow enough to hit it ~25% of the time.

Fix

Wrap the load_from_checkpoint call and, on any failure, log a warning and fall back to the in-memory model. The .nam always gets written.

Test

Adds TestExportFallbackOnBadCheckpoint which monkeypatches Lightning's _atomic_save to write garbage so every checkpoint is unreadable. With the fix the test passes (warning logged, .nam exported); reverting the fix makes it fail with the original UnpicklingError. Existing TestGracefulShutdown still passes.

…oaded

Lightning's ModelCheckpoint updates best_model_path before the underlying
file write completes. If training is interrupted between those steps the
finally block in nam.train.full.main raises when it tries to load the
partial or missing file, so no .nam is exported. Wrap the load and fall
back to the in-memory model on failure.

Adds a deterministic regression test alongside the existing flaky
integration test. Fixes sdatkinson#645.
sdatkinson
sdatkinson previously approved these changes Jun 8, 2026

@sdatkinson sdatkinson left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic sleuthing--thank you!

Nit on the Exception catching; feel free to address or else I'll merge this after perhaps a day or two if I don't hear from you.

Comment thread nam/train/full.py Outdated
best_checkpoint,
**lightning_cls.parse_config(model_config),
)
except Exception as e:

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a more specific exception we can catch?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not figure out a better Exception handling here. Maybe you have an idea?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked into it over the weekend and found a better exception handling :)

Address PR review: replace bare 'except Exception' with the concrete
failure modes torch.load raises on a missing or partially-written
checkpoint (FileNotFoundError, RuntimeError, EOFError,
pickle.UnpicklingError). Verified empirically against torch 2.12.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Graceful shutdown tests are flaky on GitHub runner

2 participants