Never lose a training run again — an interactive walkthrough

The problem

It's epoch 90 of 100. The session drops.

Press Start, then hit Simulate disconnect mid-run. Toggle the pattern on and off to feel the difference.

Training run

Resumable pattern

epoch 0 / 100

0 epochs lost to disconnects

0 disconnects survived
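
The resumable pattern the demo toggles can be sketched roughly as follows. This is a minimal, framework-free illustration, not the author's code: `run`, `train_one_epoch`, and the JSON state layout are hypothetical stand-ins for a real training loop and checkpoint format.

```python
import json
import os

def run(ckpt_path, total_epochs, train_one_epoch):
    """Train up to total_epochs, continuing from ckpt_path if it exists."""
    state = {"epoch": 0, "history": []}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)          # pick up where the last session stopped

    for epoch in range(state["epoch"], total_epochs):
        state["history"].append(train_one_epoch(epoch))
        state["epoch"] = epoch + 1        # the next epoch to run after a resume

        tmp = ckpt_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, ckpt_path)        # publish the finished epoch in one step
    return state
```

Because the checkpoint is written after every epoch, a disconnect costs at most the epoch that was in flight; calling `run` again with the same `ckpt_path` continues from the next one.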

Idea 1

Checkpoint the whole state — not just the weights

A common shortcut is to save only model.state_dict(), which leaves out the optimizer, the scheduler, and the epoch counter. Watch what a "resume" actually does when pieces are missing. Toggle each one and read the outcome.

What's in your checkpoint?
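
As a rough sketch of a fuller checkpoint, the helpers below bundle several pieces into one dict. They assume only that each object exposes `state_dict()` and `load_state_dict()`, as PyTorch modules, optimizers, and LR schedulers do. The names `make_checkpoint` and `load_checkpoint` and the dict keys are hypothetical. In a real PyTorch job you would likely also capture `torch.get_rng_state()` and, on GPU, `torch.cuda.get_rng_state_all()`; only Python's own RNG is shown here to keep the example dependency-free.

```python
import random

def make_checkpoint(model, optimizer, scheduler, epoch):
    """Bundle everything a faithful resume needs into one dict."""
    return {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),   # e.g. momentum or Adam moments
        "scheduler": scheduler.state_dict(),   # position in the LR schedule
        "epoch": epoch,                        # last completed epoch
        "py_rng": random.getstate(),           # Python RNG (shuffling, augmentation)
    }

def load_checkpoint(ckpt, model, optimizer, scheduler):
    """Restore every piece in place and return the epoch to run next."""
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    random.setstate(ckpt["py_rng"])
    return ckpt["epoch"] + 1
```

The resulting dict can be handed to whatever save routine you use, so the whole state lands on disk as a single file.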

Idea 2

Write atomically — or risk a corrupted checkpoint

If the machine dies while writing, a naive save leaves a half-written file — and destroys the good one it was overwriting. Flip "crash mid-write" on and try both saves.

checkpoint.pt

Crash mid-write

checkpoint.pt

✓ valid — epoch 42

A rename on the same filesystem is atomic: you get the complete old file or the complete new one — never a torn mix.

import os
import torch

def save_atomic(path, state):
    tmp = path + ".tmp"
    torch.save(state, tmp)  # write the complete checkpoint under a temp name
    os.replace(tmp, path)   # atomic on the same filesystem
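
One possible hardening of the same idea: the rename is atomic, but it does not by itself force the bytes onto the disk, so a power loss shortly after the save could still lose data. The sketch below adds `os.fsync` calls. The name `save_atomic_durable` and the `dump` parameter are hypothetical; `pickle.dump` is the default so the example runs without PyTorch, and passing `dump=torch.save` should work since `torch.save` also accepts an open file object. The directory fsync is POSIX-only.

```python
import os
import pickle

def save_atomic_durable(path, state, dump=pickle.dump):
    """Atomic save that also flushes data to disk before and after the rename."""
    tmp = f"{path}.tmp"
    with open(tmp, "wb") as f:
        dump(state, f)
        f.flush()
        os.fsync(f.fileno())          # file contents reach the disk first
    os.replace(tmp, path)             # then the name is swapped atomically

    # Flush the directory entry so the rename itself survives a crash (POSIX).
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The extra syncs cost some time per save, so whether they are worth it depends on how often you checkpoint and how much you trust the storage.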

Idea 3

A "done marker" makes a whole sweep idempotent

Run a 3×3 sweep. Disconnect partway. Then re-launch the entire batch — finished runs skip instantly, the interrupted one resumes, nothing is ever duplicated.

0 runs complete

0 runs duplicated

0 skipped on re-launch
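
A done marker can be as simple as an empty file written after a run finishes. The sketch below is one way to wire that into a sweep launcher; `run_sweep`, the `DONE` file name, the directory naming scheme, and the `train_run(config, run_dir)` callback are all assumptions for illustration. The callback is expected to resume from any checkpoint it finds in `run_dir`.

```python
import itertools
import os

def run_sweep(grid, out_dir, train_run):
    """Launch every config in grid; safe to re-launch after a disconnect."""
    stats = {"ran": 0, "skipped": 0}
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        config = dict(zip(keys, values))
        name = "_".join(f"{k}={v}" for k, v in config.items())
        run_dir = os.path.join(out_dir, name)
        done = os.path.join(run_dir, "DONE")

        if os.path.exists(done):
            stats["skipped"] += 1      # finished in an earlier launch
            continue

        os.makedirs(run_dir, exist_ok=True)
        train_run(config, run_dir)     # resumes from run_dir's checkpoint, if any
        # Written only after train_run returns, so DONE implies a complete run.
        with open(done, "w") as f:
            f.write("ok\n")
        stats["ran"] += 1
    return stats
```

For a 3×3 grid such as `{"lr": [0.1, 0.01, 0.001], "bs": [16, 32, 64]}`, re-launching the same call after a disconnect skips every directory that already contains `DONE` and runs only the rest.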

Idea 4

Put the state where it outlives the machine

The #1 mistake: checkpointing to the node's local scratch disk, which is wiped the instant the runtime recycles. Choose where you save, then recycle the machine.

Save checkpoints to:

Local scratch disk

on the compute node — ephemeral

Durable storage

cloud bucket / mounted drive / your SSD
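
If writing straight to durable storage is too slow, one common arrangement is to save on local scratch and then copy each finished checkpoint out. The sketch below assumes the durable location is reachable as an ordinary mounted path; `persist`, `restore`, and both directory arguments are hypothetical names. A cloud bucket without a mount would need the provider's own client instead, and the rename may not be strictly atomic on every network filesystem.

```python
import os
import shutil

def persist(local_path, durable_dir):
    """Copy a finished local checkpoint into durable storage."""
    os.makedirs(durable_dir, exist_ok=True)
    final = os.path.join(durable_dir, os.path.basename(local_path))
    tmp = final + ".tmp"
    shutil.copyfile(local_path, tmp)   # the slow copy happens under a temp name
    os.replace(tmp, final)             # then it is swapped into place
    return final

def restore(durable_dir, local_path):
    """On a fresh machine, pull the durable checkpoint back; False if none exists."""
    src = os.path.join(durable_dir, os.path.basename(local_path))
    if not os.path.exists(src):
        return False
    os.makedirs(os.path.dirname(os.path.abspath(local_path)), exist_ok=True)
    shutil.copyfile(src, local_path)
    return True
```

Calling `restore` at startup and `persist` after each save means a recycled machine loses only its scratch copy.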

Idea 5

Resume means continue, not restart

The tell-tale smoke test is the learning rate. Restore the scheduler and the LR picks up smoothly; forget it and the LR snaps back to its starting value. Toggle it and re-run.

Learning-rate schedule

Restore scheduler state
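
To make the snap-back concrete, here is a toy cosine schedule with a PyTorch-style `state_dict()`. The class `CosineSchedule` is a hand-rolled stand-in written for this illustration, not a library API, and the base rate of 0.1 over 100 epochs is an arbitrary choice.

```python
import math

class CosineSchedule:
    """Toy cosine-annealing schedule that decays from base_lr toward 0."""
    def __init__(self, base_lr, total_epochs):
        self.base_lr = base_lr
        self.total_epochs = total_epochs
        self.epoch = 0

    def lr(self):
        progress = self.epoch / self.total_epochs
        return 0.5 * self.base_lr * (1 + math.cos(math.pi * progress))

    def step(self):
        self.epoch += 1

    def state_dict(self):
        return {"epoch": self.epoch}

    def load_state_dict(self, state):
        self.epoch = state["epoch"]

# Train for 90 epochs, then "disconnect" and keep only the saved state.
sched = CosineSchedule(base_lr=0.1, total_epochs=100)
for _ in range(90):
    sched.step()
saved = sched.state_dict()

resumed = CosineSchedule(0.1, 100)
resumed.load_state_dict(saved)          # continues from epoch 90

forgotten = CosineSchedule(0.1, 100)    # scheduler state never restored

print(f"restored:  {resumed.lr():.5f}")
print(f"forgotten: {forgotten.lr():.5f}")
```

With these numbers the restored schedule reports roughly 0.00245, while the forgotten one is back at 0.1, about forty times larger, which is the jump the demo shows.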

The payoff

Your bulletproofing checklist

Tick them off — these are the six moves that make any training job survive an ephemeral machine.
