Computer vision · 2022
Vision Transformer, from Scratch
A clean ViT for image classification — and a front-row seat to how data-hungry attention really is.
The idea
After implementing the original Transformer for text, the obvious next question was: what changes when the tokens are image patches? The ViT recipe — split an image into patches, embed each as a token, add positional information, and run a standard Transformer encoder — is elegant, and building it cleared up exactly which pieces are vision-specific and which are inherited wholesale from NLP.
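To make the vision-specific front end concrete, here is a minimal NumPy sketch of the patch-to-token step described above. It is an illustration under stated assumptions, not the code from this project: the names `patchify` and `embed_patches`, the 32x32 image, the patch size of 8, the embedding width of 64, and the random weights are all hypothetical, and the prepended class token follows the usual ViT setup rather than anything stated here.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) so each patch is contiguous
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def embed_patches(image, patch_size, w_proj, cls_token, pos_embed):
    """Turn an image into the token sequence a Transformer encoder consumes."""
    tokens = patchify(image, patch_size) @ w_proj          # (N, D) linear projection
    tokens = np.concatenate([cls_token, tokens], axis=0)   # (N + 1, D) with class token
    return tokens + pos_embed                              # add position information

# Hypothetical sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))
patch_size, dim = 8, 64
num_patches = (32 // patch_size) ** 2                      # 4 x 4 grid = 16 patches

w_proj = rng.standard_normal((patch_size * patch_size * 3, dim)) * 0.02
cls_token = np.zeros((1, dim))                             # learned in a real model
pos_embed = rng.standard_normal((num_patches + 1, dim)) * 0.02  # learned in a real model

tokens = embed_patches(image, patch_size, w_proj, cls_token, pos_embed)
print(tokens.shape)  # (17, 64)
```

Everything after this point is a token sequence of shape (N + 1, D), which is why the encoder itself can be reused from the text model unchanged.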
What stood out
The most instructive part wasn't the architecture; it was the data appetite. A ViT has far weaker built-in inductive biases than a CNN (no locality and no translation equivariance baked in), so it has to learn them from data. On small datasets a solid convolutional baseline wins comfortably; attention only overtakes it once you give it enough examples (or strong augmentation or pre-training). Watching that crossover happen, rather than reading the claim in a paper, is what made the lesson stick.
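The missing translation equivariance can be seen in a toy 1-D setting. The sketch below is an assumption-laden illustration, not the experiment described here: `circular_conv1d` and `patch_tokens` are hypothetical helpers, the signal is random, and the "patch embedding" is a single random linear projection plus a position table. It checks whether shifting the input simply shifts the output.

```python
import numpy as np

def circular_conv1d(x, kernel):
    """Circular 1-D cross-correlation: one kernel reused at every position."""
    n, k = len(x), len(kernel)
    return np.array([
        sum(kernel[j] * x[(i + j) % n] for j in range(k))
        for i in range(n)
    ])

def patch_tokens(x, patch_size, w_proj, pos_embed):
    """Cut the signal into patches, project each one, add position embeddings."""
    patches = x.reshape(-1, patch_size)      # (num_patches, patch_size)
    return patches @ w_proj + pos_embed      # (num_patches, dim)

# Hypothetical toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
x = rng.standard_normal(16)
x_shifted = np.roll(x, 1)                    # shift the input by one sample

# Convolution: shifting the input shifts the output by the same amount.
kernel = rng.standard_normal(3)
conv_equivariant = np.allclose(
    circular_conv1d(x_shifted, kernel),
    np.roll(circular_conv1d(x, kernel), 1),
)

# Patch tokens: the shifted input does not produce any shifted copy of the tokens.
patch_size, dim = 4, 8
num_patches = len(x) // patch_size
w_proj = rng.standard_normal((patch_size, dim))
pos_embed = rng.standard_normal((num_patches, dim))
tokens = patch_tokens(x, patch_size, w_proj, pos_embed)
tokens_shifted = patch_tokens(x_shifted, patch_size, w_proj, pos_embed)
patch_equivariant = any(
    np.allclose(tokens_shifted, np.roll(tokens, s, axis=0))
    for s in range(num_patches)
)

print("conv equivariant: ", conv_equivariant)
print("patch equivariant:", patch_equivariant)
```

The convolution gets the shift behaviour for free from weight sharing, while the patch-based front end would have to learn it from examples, which is one way to read the data appetite noted above.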
What I took away
- ViTs trade CNN inductive biases for flexibility — and pay for it in data.
- Patch embedding and positional encoding are the only genuinely vision-specific parts; the rest is the standard Transformer encoder from text.