NLP · Architectures · 2022
Transformer from Scratch
Attention, positional encodings, the whole machine: re-derived by hand so it would stick.
Why rebuild a solved thing
"Attention Is All You Need" is the most-read paper in modern ML, and I'd read it plenty. But there's a difference between recognising scaled dot-product attention and being able to write it — with the masking, the multi-head reshapes, and the positional encodings — without looking. So I rebuilt the whole machine and ran it on small sequence tasks.
What re-deriving it taught me
The shapes are the curriculum. Multi-head attention is conceptually simple, but getting the reshape/transpose dance right, from (batch, seq, d_model) to (batch, heads, seq, d_k) and back, is where understanding actually lives. Implementing causal masking by hand made it obvious, in a way the diagram never did, why decoder-only models can't peek at the future: each position's attention scores for later positions are masked out before the softmax. This build is the foundation for everything else I do with LLMs, including the RAG systems I now work on professionally.
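To make the reshape dance and the causal mask concrete, here is a minimal NumPy sketch. It is not the code from the original build: the names split_heads, merge_heads, and causal_self_attention are made up for illustration, the weights are random, and biases, dropout, and layer norm are left out. It assumes d_model is divisible by the number of heads.

```python
import numpy as np


def split_heads(x, num_heads):
    """(batch, seq, d_model) -> (batch, heads, seq, d_k)."""
    batch, seq, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_k = d_model // num_heads
    # Reshape first, then swap the seq and heads axes.
    return x.reshape(batch, seq, num_heads, d_k).transpose(0, 2, 1, 3)


def merge_heads(x):
    """(batch, heads, seq, d_k) -> (batch, seq, d_model)."""
    batch, num_heads, seq, d_k = x.shape
    # Undo the axis swap, then fold heads back into the feature axis.
    return x.transpose(0, 2, 1, 3).reshape(batch, seq, num_heads * d_k)


def causal_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Masked multi-head self-attention (illustrative, no biases or dropout)."""
    seq = x.shape[1]
    q = split_heads(x @ w_q, num_heads)
    k = split_heads(x @ w_k, num_heads)
    v = split_heads(x @ w_v, num_heads)
    d_k = q.shape[-1]

    # Scaled dot-product scores: (batch, heads, seq, seq).
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_k)

    # Causal mask: position i may attend only to positions j <= i.
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return merge_heads(weights @ v) @ w_o, weights


# Toy sizes chosen only for the demonstration.
rng = np.random.default_rng(0)
batch, seq, d_model, num_heads = 2, 5, 8, 2
x = rng.normal(size=(batch, seq, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = causal_self_attention(x, w_q, w_k, w_v, w_o, num_heads)

# Change only the last token and recompute.
x_changed = x.copy()
x_changed[:, -1, :] += 1.0
out_changed, _ = causal_self_attention(x_changed, w_q, w_k, w_v, w_o, num_heads)

# Earlier positions never saw the last token, so their outputs match.
print(np.allclose(out[:, :-1], out_changed[:, :-1]))
```

The final check is the "can't peek" property in miniature: editing the last token leaves every earlier position's output unchanged, because the mask zeroes those attention weights. Reshaping before transposing matters in split_heads; reshaping straight to (batch, heads, seq, d_k) would produce the right shape but mix features across positions.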
What I took away
- You don't understand attention until you've debugged the multi-head reshapes yourself.
- Hand-implementing causal masking made autoregressive decoding intuitive.