Generative models · 2023

Diffusion Model

Building DDPM from the noise up, until the math stopped feeling like magic.

Solo · learning build
PyTorch · U-Net · DDPM · NumPy · Matplotlib

Why I built it

Diffusion models had taken over generative imaging, and I could use them — but I couldn't have derived one on a whiteboard. That gap bothered me. Reading the DDPM paper a third time wasn't closing it, so I did the thing that always works for me: I rebuilt it from scratch, with no reference implementation open in another tab, and refused to move on until each piece earned its place.

What I actually built

The project is a small, readable PyTorch codebase with three parts, deliberately implemented separately so I could poke at each in isolation:

  • The forward process — the closed-form noising schedule that lets you jump to any timestep t in one step, rather than looping. Getting the reparameterisation right is the whole trick.
  • The reverse denoiser — a compact U-Net that predicts the noise added at step t, conditioned on t via sinusoidal time embeddings.
  • The sampling loop — the iterative denoising that turns pure Gaussian noise into a sample, one small step at a time.
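As a rough illustration of the first two pieces (a NumPy sketch, not the project's PyTorch code), the closed-form jump is x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise. The linear beta schedule from 1e-4 to 0.02 over T = 1000 steps is the DDPM paper's default and is assumed here. The names `q_sample` and `time_embedding` and the embedding width of 32 are hypothetical choices for the example.

```python
import numpy as np

T = 1000                                 # assumed number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)       # assumed linear variance schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)           # alpha_bar[t] = product of alphas[0..t]


def q_sample(x0, t, noise):
    """Jump straight to step t: x_t = sqrt(ab_t) * x0 + sqrt(1 - ab_t) * noise."""
    # Reshape per-sample coefficients to (B, 1, 1, 1) so they broadcast over C, H, W.
    shape = (-1,) + (1,) * (x0.ndim - 1)
    sqrt_ab = np.sqrt(alpha_bar[t]).reshape(shape)
    sqrt_one_minus_ab = np.sqrt(1.0 - alpha_bar[t]).reshape(shape)
    return sqrt_ab * x0 + sqrt_one_minus_ab * noise


def time_embedding(t, dim):
    """Sinusoidal embedding of integer timesteps, shape (len(t), dim); dim must be even."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
    # First half of each row holds sines, second half holds cosines.
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)


rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 3, 8, 8))   # stand-in batch of "images"
t = np.array([0, 10, 500, 999])          # a different timestep per sample
x_t = q_sample(x0, t, rng.standard_normal(x0.shape))
emb = time_embedding(t, 32)
```

The reshape to `(B, 1, 1, 1)` is the kind of broadcast detail that is easy to get wrong: without it, a length-B coefficient vector would try to broadcast against the last image axis instead of the batch axis.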

I logged intermediate samples at fixed timesteps so I could watch structure emerge from noise across training, which turned out to be the most useful debugging tool I had.
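A sampling loop with that kind of snapshot logging might look roughly like the sketch below. It is written in NumPy rather than PyTorch, and `dummy_eps_model` is a hypothetical placeholder for the trained U-Net. Three things are assumptions made for the example: the shortened T = 50 schedule (so it runs quickly), the choice sigma_t^2 = beta_t (one of the two variance options in the DDPM paper), and the snapshot timesteps.

```python
import numpy as np

T = 50                                   # deliberately small; an assumption for the demo
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)


def dummy_eps_model(x, t):
    """Hypothetical stand-in for the trained denoiser: always predicts zero noise."""
    return np.zeros_like(x)


def sample(eps_model, shape, snapshot_steps, rng):
    """Ancestral DDPM sampling; returns the final sample and snapshots keyed by t."""
    x = rng.standard_normal(shape)       # start from pure Gaussian noise
    snapshots = {}
    for t in reversed(range(T)):
        eps = eps_model(x, t)
        # Posterior mean written in terms of the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(shape)
        else:
            x = mean                     # no noise is added on the final step
        if t in snapshot_steps:
            snapshots[t] = x.copy()      # state right after processing step t
    return x, snapshots


rng = np.random.default_rng(0)
final, snaps = sample(dummy_eps_model, (2, 1, 8, 8), {40, 20, 0}, rng)
```

With a real model, plotting `snaps` in descending order of t (for example with Matplotlib) gives the noise-to-structure progression described above.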

What surprised me

Two things. First, how much of the difficulty is bookkeeping: the variance schedules, the sqrt(alpha_bar) terms, keeping shapes and broadcasts honest. The conceptual leap is small; the place you actually lose hours is a sign error in the posterior. Second, how forgiving the objective is: predicting the noise (rather than the image) reduces the training loss to a plain MSE between the true and predicted noise, and that simplicity, with no adversarial game to balance, is most of why diffusion trained so much more stably for me than the GANs I'd fought with earlier.
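To make the "plain MSE" point concrete, here is a minimal NumPy sketch of the simplified noise-prediction loss: pick a random timestep per sample, noise the clean input in closed form, and compare the true noise with the model's guess. `zero_model` is a hypothetical placeholder for the denoiser, and the linear schedule is assumed.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)


def noise_prediction_loss(eps_model, x0, rng):
    """Simplified DDPM objective: MSE between the true and predicted noise."""
    batch = x0.shape[0]
    t = rng.integers(0, T, size=batch)   # one uniformly random timestep per sample
    eps = rng.standard_normal(x0.shape)  # the noise the model has to recover
    coef = (-1,) + (1,) * (x0.ndim - 1)  # broadcast coefficients over C, H, W
    x_t = (np.sqrt(alpha_bar[t]).reshape(coef) * x0
           + np.sqrt(1.0 - alpha_bar[t]).reshape(coef) * eps)
    return np.mean((eps - eps_model(x_t, t)) ** 2)


def zero_model(x_t, t):
    """Hypothetical untrained stand-in: always predicts zero noise."""
    return np.zeros_like(x_t)


rng = np.random.default_rng(0)
x0 = rng.standard_normal((16, 3, 8, 8))
loss = noise_prediction_loss(zero_model, x0, rng)
```

A predictor that always outputs zero scores close to 1.0 here, because the target is unit-variance Gaussian noise. That makes 1.0 a handy reference point when reading an early training curve.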

The moment it clicked: once you predict noise instead of pixels, the loss is just MSE — and the whole intimidating apparatus collapses into something you can train without adversarial drama.

What I'd do next

Three concrete extensions I scoped but haven't shipped: a DDIM sampler for far fewer denoising steps at inference, classifier-free guidance for conditional generation, and a move to latent-space diffusion to make higher resolutions tractable on a single GPU. They're the natural next rungs, and each maps to a paper I now feel equipped to implement directly.
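For a sense of how small the first of those extensions is, a deterministic DDIM update (the eta = 0 case) can be sketched in NumPy as below. This is an illustration under assumptions, not shipped code: the linear schedule, the evenly spaced step selection, and the names `ddim_step` and `ddim_sample` are all choices made for the example, and `eps_model` stands for whatever trained denoiser is plugged in.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)


def ddim_step(x_t, eps, t, t_prev):
    """One deterministic (eta = 0) DDIM update from step t to an earlier step t_prev.

    t_prev = -1 means "go all the way to the clean sample", where alpha_bar is taken as 1.
    """
    ab_t = alpha_bar[t]
    ab_prev = alpha_bar[t_prev] if t_prev >= 0 else 1.0
    # Estimate the clean sample implied by the current noise prediction...
    x0_pred = (x_t - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)
    # ...then re-noise it to the (lower) noise level of t_prev using the same eps.
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps


def ddim_sample(eps_model, x_T, num_steps):
    """Run num_steps DDIM updates over an evenly spaced subset of the T timesteps."""
    steps = np.linspace(T - 1, 0, num_steps).round().astype(int)
    x = x_T
    for i, t in enumerate(steps):
        t_prev = steps[i + 1] if i + 1 < len(steps) else -1
        x = ddim_step(x, eps_model(x, t), t, t_prev)
    return x
```

Because each update only needs alpha_bar at the two endpoints, the sampler can skip timesteps freely, which is where the reduction in denoising steps comes from.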

What I took away

  • Re-deriving the reverse process by hand made the noise-prediction objective intuitive in a way no amount of reading did.
  • Most of the engineering pain is in the variance-schedule bookkeeping, not the concepts.
  • Visualising intermediate denoising steps was the single best debugging tool.