NLP · Architectures · 2022

Transformer from Scratch

Attention, positional encodings, the whole machine — re-derived by hand so it stuck.

Solo · learning build

PyTorch · Self-attention · Positional encoding · Jupyter

Why rebuild a solved thing

"Attention Is All You Need" is the most-read paper in modern ML, and I'd read it plenty. But there's a difference between recognising scaled dot-product attention and being able to write it — with the masking, the multi-head reshapes, and the positional encodings — without looking. So I rebuilt the whole machine and ran it on small sequence tasks.

What re-deriving it taught me

The shapes are the curriculum. Multi-head attention is conceptually simple, but getting the reshape/transpose dance right, into (batch, heads, seq, d_k) and back, is where understanding actually lives. Implementing causal masking by hand made it obvious, in a way the diagram never did, why decoder-only models can't peek at the future: every score above the diagonal is set to -inf before the softmax, so position t puts zero weight on anything after t. This build is the foundation for everything else I do with LLMs, including the RAG systems I now work on professionally.
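
The reshape and the mask described above can be sketched as follows. This is a minimal illustration, not the code from this build: the class name `MultiHeadSelfAttention`, the fused `qkv` projection, and the `split_heads` / `merge_heads` helpers are assumptions made for the example.

```python
import math

import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    """Illustrative multi-head self-attention (hypothetical names throughout)."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        if d_model % num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # one projection for Q, K and V
        self.out = nn.Linear(d_model, d_model)

    def split_heads(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, seq, d_model) -> (batch, heads, seq, d_k)
        batch, seq, _ = x.shape
        return x.reshape(batch, seq, self.num_heads, self.d_k).transpose(1, 2)

    def merge_heads(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, heads, seq, d_k) -> (batch, seq, d_model)
        batch, _, seq, _ = x.shape
        return x.transpose(1, 2).reshape(batch, seq, self.num_heads * self.d_k)

    def forward(self, x: torch.Tensor, causal: bool = True) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = self.split_heads(q), self.split_heads(k), self.split_heads(v)
        # Scaled dot-product scores: (batch, heads, seq, seq)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        if causal:
            seq = x.size(1)
            # True strictly above the diagonal, i.e. at every "future" position.
            future = torch.triu(
                torch.ones(seq, seq, dtype=torch.bool, device=x.device), diagonal=1
            )
            scores = scores.masked_fill(future, float("-inf"))
        weights = scores.softmax(dim=-1)
        return self.out(self.merge_heads(weights @ v))
```

A quick way to check the shapes is to confirm that `merge_heads(split_heads(x))` returns `x` unchanged, and that with `causal=True` the outputs at early positions do not move when later tokens are altered.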

What I took away

You don't understand attention until you've debugged the multi-head reshapes yourself.
Hand-implementing causal masking made autoregressive decoding intuitive.
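
One way to see the link between the mask and autoregressive decoding is to compare a single masked pass over the whole sequence with attention computed one position at a time over a growing prefix. The sketch below is single-head and unbatched for brevity; the function names `attention`, `masked_full_pass`, and `step_by_step` are hypothetical and not taken from the project code.

```python
import math

import torch


def attention(q, k, v, mask=None):
    """Scaled dot-product attention for (seq_q, d_k), (seq_k, d_k), (seq_k, d_v)."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        # Masked positions get -inf, so they receive zero weight after softmax.
        scores = scores.masked_fill(mask, float("-inf"))
    return scores.softmax(dim=-1) @ v


def masked_full_pass(q, k, v):
    """Training-style pass: all positions at once, future positions masked out."""
    seq = q.size(0)
    future = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    return attention(q, k, v, mask=future)


def step_by_step(q, k, v):
    """Decoding-style pass: at step t only positions 0..t exist yet."""
    rows = [
        attention(q[t : t + 1], k[: t + 1], v[: t + 1])
        for t in range(q.size(0))
    ]
    return torch.cat(rows, dim=0)
```

Up to floating-point tolerance the two functions agree row for row, which is why a model trained with the mask can be run one token at a time at inference.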
