It's epoch 90 of 100. The session drops.
Press Start, then hit Simulate disconnect mid-run. Toggle the pattern on and off to feel the difference.
Training run
Checkpoint the whole state — not just the weights
Most people save only model.state_dict(). Watch what a "resume" actually does when pieces are missing. Toggle each one and read the outcome.
What's in your checkpoint?
Write atomically — or risk a corrupted checkpoint
If the machine dies while writing, a naive save leaves a half-written file — and destroys the good one it was overwriting. Flip "crash mid-write" on and try both saves.
checkpoint.pt
def save_atomic(path, state): tmp = path + ".tmp" torch.save(state, tmp) os.replace(tmp, path) # atomic on the same filesystem
A "done marker" makes a whole sweep idempotent
Run a 3×3 sweep. Disconnect partway. Then re-launch the entire batch — finished runs skip instantly, the interrupted one resumes, nothing is ever duplicated.
Put the state where it outlives the machine
The #1 mistake: checkpointing to the node's local scratch disk, which is wiped the instant the runtime recycles. Choose where you save, then recycle the machine.
Local scratch disk
Durable storage
Resume means continue, not restart
The tell-tale smoke test is the learning rate. Restore the scheduler and the LR picks up smoothly; forget it and the LR snaps back to its starting value. Toggle it and re-run.
Learning-rate schedule
Your bulletproofing checklist
Tick them off — these are the six moves that make any training job survive an ephemeral machine.
Found this useful?
I write about the unglamorous engineering that makes ML actually ship.
Read the full write-up with all the code →