Interactive walkthrough

Never lose a training run again

Free GPUs disconnect without warning. Here's the engineering pattern that turns a dropped session from hours lost into a shrug — explained by letting you break things yourself.

The problem

It's epoch 90 of 100. The session drops.

Press Start, then hit Simulate disconnect mid-run. Toggle the pattern on and off to feel the difference.

[Interactive demo: training run at epoch 0 / 100, with counters for epochs lost to disconnects and disconnects survived]
Idea 1

Checkpoint the whole state — not just the weights

A common shortcut is to save only model.state_dict(). Watch what a "resume" actually does when the other pieces are missing. Toggle each one and read the outcome.

What's in your checkpoint?
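As a rough illustration, a fuller checkpoint might look like the sketch below. The helper names `build_checkpoint` and `restore_checkpoint` are hypothetical, and the exact set of fields (optimizer state, epoch counter, Python RNG state) is an assumption rather than a list taken from the demo. The code relies only on the `state_dict()` / `load_state_dict()` convention that PyTorch modules, optimizers, and schedulers share.

```python
import random


def build_checkpoint(model, optimizer, scheduler, epoch):
    """Collect everything a resume needs into one dict (field names are illustrative)."""
    return {
        "model": model.state_dict(),          # weights
        "optimizer": optimizer.state_dict(),  # e.g. momentum or Adam moment estimates
        "scheduler": scheduler.state_dict(),  # position in the LR schedule
        "epoch": epoch,                       # last completed epoch
        "py_rng": random.getstate(),          # Python RNG only; framework RNGs would be added similarly
    }


def restore_checkpoint(ckpt, model, optimizer, scheduler):
    """Load every piece back and return the epoch to continue from."""
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    random.setstate(ckpt["py_rng"])
    return ckpt["epoch"] + 1
```

In a PyTorch job the returned dict would typically be handed to `torch.save` and read back with `torch.load`; dropping any one key reproduces the "missing piece" behaviour the toggles demonstrate.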

Idea 2

Write atomically — or risk a corrupted checkpoint

If the machine dies while writing, a naive save leaves a half-written file — and destroys the good one it was overwriting. Flip "crash mid-write" on and try both saves.

[Interactive demo: checkpoint.pt, shown as valid at epoch 42]
A rename on the same filesystem is atomic: you get the complete old file or the complete new one — never a torn mix.
import os
import torch

def save_atomic(path, state):
    tmp = path + ".tmp"
    torch.save(state, tmp)   # write the full checkpoint to a temp file first
    os.replace(tmp, path)    # atomic swap when tmp and path share a filesystem
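
To see why the rename matters, here is a small stdlib-only sketch that simulates a crash mid-write. It uses `pickle` as a stand-in for `torch.save` so it runs without PyTorch; `save_naive`, `save_atomic_pickle`, `SimulatedCrash`, and the `crash` flag are hypothetical names for illustration. The `os.fsync` call is an extra precaution that is not part of the snippet above.

```python
import os
import pickle


class SimulatedCrash(Exception):
    """Stands in for the machine dying partway through a write."""


def _write(f, state, crash):
    data = pickle.dumps(state)
    if crash:
        f.write(data[: len(data) // 2])  # only half the bytes reach the file
        f.flush()
        raise SimulatedCrash
    f.write(data)
    f.flush()
    os.fsync(f.fileno())  # ask the OS to push the bytes to disk before any rename


def save_naive(path, state, crash=False):
    # Opening with "wb" truncates the existing good checkpoint immediately.
    with open(path, "wb") as f:
        _write(f, state, crash)


def save_atomic_pickle(path, state, crash=False):
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        _write(f, state, crash)
    os.replace(tmp, path)  # reached only if the temp file was written in full


def load(path):
    with open(path, "rb") as f:
        return pickle.load(f)
```

With `crash=True`, the naive save leaves a truncated `checkpoint.pt` that no longer loads, while the atomic save leaves the previous checkpoint untouched and only a stray `.tmp` file behind.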
Idea 3

A "done marker" makes a whole sweep idempotent

Run a 3×3 sweep. Disconnect partway. Then re-launch the entire batch — finished runs skip instantly, the interrupted one resumes, nothing is ever duplicated.

[Interactive demo: 3×3 sweep, with counters for runs complete, runs duplicated, and runs skipped on re-launch]
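
One way the done-marker idea could be sketched is shown below. The function `run_sweep`, the `train_one` callback, the `DONE` filename, and the directory naming scheme are all hypothetical choices for illustration, not the demo's actual implementation.

```python
import itertools
import os


def run_sweep(root, learning_rates, batch_sizes, train_one):
    """Launch every config; skip any whose run directory already holds a DONE marker."""
    stats = {"ran": 0, "skipped": 0}
    for lr, bs in itertools.product(learning_rates, batch_sizes):
        run_dir = os.path.join(root, f"lr{lr}_bs{bs}")
        marker = os.path.join(run_dir, "DONE")
        if os.path.exists(marker):
            stats["skipped"] += 1      # finished on an earlier launch: nothing to do
            continue
        os.makedirs(run_dir, exist_ok=True)
        train_one(run_dir, lr, bs)     # expected to resume from a checkpoint in run_dir if one exists
        # Write the marker last, and atomically, so it only ever means "fully finished".
        tmp = marker + ".tmp"
        with open(tmp, "w") as f:
            f.write("ok\n")
        os.replace(tmp, marker)
        stats["ran"] += 1
    return stats
```

Because the marker is written only after `train_one` returns, re-launching the same call after a disconnect is safe: completed configs are skipped and the interrupted one is simply entered again.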
Idea 4

Put the state where it outlives the machine

The #1 mistake: checkpointing to the node's local scratch disk, which is wiped the instant the runtime recycles. Choose where you save, then recycle the machine.

Save checkpoints to:

Local scratch disk: on the compute node, ephemeral

Durable storage: cloud bucket, mounted drive, or your SSD
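
A minimal sketch of enforcing that choice in code, under stated assumptions: `CKPT_ROOT` is a hypothetical environment variable that would point at a mounted drive or synced bucket path, `checkpoint_dir` is a hypothetical helper, and the temp-directory check is only a rough heuristic for "ephemeral", not a guarantee.

```python
import os
import tempfile


def checkpoint_dir(run_name, env_var="CKPT_ROOT", scratch=None):
    """Return a per-run checkpoint directory under a durable root.

    Failing loudly when the variable is unset avoids silently falling back
    to node-local scratch.
    """
    root = os.environ.get(env_var)
    if not root:
        raise RuntimeError(f"{env_var} is not set; refusing to checkpoint to local scratch")
    root = os.path.abspath(root)
    scratch = os.path.abspath(scratch or tempfile.gettempdir())
    # Heuristic only: a root inside the scratch/temp directory is likely ephemeral.
    if os.path.commonpath([root, scratch]) == scratch:
        raise RuntimeError(f"{root} is inside {scratch}, which is likely wiped on recycle")
    path = os.path.join(root, run_name)
    os.makedirs(path, exist_ok=True)
    return path
```

The design choice here is to make the unsafe default impossible rather than merely discouraged: a job that cannot find durable storage stops before training starts instead of losing its state later.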
Idea 5

Resume means continue, not restart

The tell-tale smoke test is the learning rate. Restore the scheduler and the LR picks up smoothly; forget it and the LR snaps back to its starting value. Toggle it and re-run.

Learning-rate schedule
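
The smoke test can be sketched without any framework. `CosineSchedule` below is a hypothetical stand-in for a real scheduler (such as those in `torch.optim.lr_scheduler`), and the cosine shape, base LR of 0.1, and 100-epoch horizon are assumptions chosen for illustration.

```python
import math


class CosineSchedule:
    """Tiny stand-in for a real LR scheduler, exposing the same state_dict convention."""

    def __init__(self, base_lr, total_epochs):
        self.base_lr = base_lr
        self.total_epochs = total_epochs
        self.epoch = 0

    def lr(self):
        # Cosine decay from base_lr at epoch 0 down to 0 at total_epochs.
        return 0.5 * self.base_lr * (1 + math.cos(math.pi * self.epoch / self.total_epochs))

    def step(self):
        self.epoch += 1

    def state_dict(self):
        return {"epoch": self.epoch}

    def load_state_dict(self, state):
        self.epoch = state["epoch"]


def lr_after_resume(saved_state, restore_scheduler):
    """Build a fresh scheduler, as a restarted process would, and report its LR."""
    sched = CosineSchedule(base_lr=0.1, total_epochs=100)
    if restore_scheduler:
        sched.load_state_dict(saved_state)
    return sched.lr()
```

Comparing `lr_after_resume(saved, True)` with `lr_after_resume(saved, False)` for a state saved late in training shows the gap directly: the restored scheduler continues from the decayed value, while the fresh one reports the base LR again.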

The payoff

Your bulletproofing checklist

Tick them off — these are the six moves that make any training job survive an ephemeral machine.

Found this useful?

I write about the unglamorous engineering that makes ML actually ship.

Read the full write-up with all the code →