Computer vision · 2022

Vision Transformer, from Scratch

A clean ViT for image classification — and a front-row seat to how data-hungry attention really is.

Solo · learning build
PyTorch · Vision Transformer · Self-attention · Jupyter

The idea

After implementing the original Transformer for text, the obvious next question was: what changes when the tokens are image patches? The ViT recipe — split an image into patches, embed each as a token, add positional information, and run a standard Transformer encoder — is elegant, and building it cleared up exactly which pieces are vision-specific and which are inherited wholesale from NLP.
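The recipe is short enough to sketch in PyTorch. What follows is a minimal illustration, not the code from this project: the names `PatchEmbedding` and `TinyViT`, the hyperparameters (32×32 images, 4×4 patches, width 64, depth 2), the class token, and the learned positional embedding are all assumptions chosen to keep the example small. The encoder is PyTorch's stock `nn.TransformerEncoder`, which is the point: only the first few lines know they are looking at an image.

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each to a token."""

    def __init__(self, image_size=32, patch_size=4, in_channels=3, dim=64):
        super().__init__()
        assert image_size % patch_size == 0, "image must divide evenly into patches"
        self.num_patches = (image_size // patch_size) ** 2
        # A conv with kernel == stride == patch_size is equivalent to flattening
        # each patch and applying one shared linear projection.
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, dim)


class TinyViT(nn.Module):
    """Hypothetical small ViT: patch tokens + class token + positions -> encoder."""

    def __init__(self, image_size=32, patch_size=4, in_channels=3,
                 dim=64, depth=2, heads=4, num_classes=10):
        super().__init__()
        # Vision-specific part: patches and their positions.
        self.patch_embed = PatchEmbedding(image_size, patch_size, in_channels, dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.patch_embed.num_patches + 1, dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)

        # Inherited part: a standard pre-norm Transformer encoder.
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            activation="gelu", batch_first=True, norm_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        tokens = self.patch_embed(x)                          # (B, N, dim)
        cls = self.cls_token.expand(tokens.shape[0], -1, -1)  # (B, 1, dim)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(self.norm(tokens[:, 0]))             # classify from the class token


model = TinyViT()
logits = model(torch.randn(2, 3, 32, 32))  # two random "images"
print(logits.shape)
```

With these sizes, a 32×32 image becomes 64 patch tokens plus one class token, and the output has one row of 10 logits per image.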

What stood out

The most instructive part wasn't the architecture; it was the data appetite. A ViT has far weaker built-in inductive biases than a CNN (no locality or translation equivariance baked in), so it has to learn them from data. On small datasets a solid convolutional baseline wins comfortably; attention only overtakes once you give it enough examples (or strong augmentation / pre-training). Watching that crossover happen, rather than reading the claim in a paper, is what made the lesson stick.
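The "baked in" part can be checked directly. The sketch below is an illustration under idealised assumptions, not an experiment from this project: it uses a single stride-1 convolution with circular padding, for which translation equivariance holds exactly, and compares it with a ViT-style patch projection (4×4 patches, an arbitrary choice). Real CNNs with zero padding, striding and pooling are only approximately equivariant.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
image = torch.randn(1, 3, 32, 32)
shifted = torch.roll(image, shifts=1, dims=-1)  # circular shift, one pixel to the right

# Stride-1 convolution with circular padding: shifting the input shifts the output.
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, padding_mode="circular")
with torch.no_grad():
    conv_equivariant = torch.allclose(
        conv(shifted), torch.roll(conv(image), shifts=1, dims=-1), atol=1e-5
    )

# Patch projection (kernel == stride == patch size): a one-pixel shift moves
# content across patch boundaries, so the tokens are not a shifted copy.
patchify = nn.Conv2d(3, 8, kernel_size=4, stride=4)
with torch.no_grad():
    tokens = patchify(image)
    tokens_shifted = patchify(shifted)
    patch_matches_any_shift = any(
        torch.allclose(tokens_shifted, torch.roll(tokens, k, dims=-1), atol=1e-5)
        for k in range(tokens.shape[-1])
    )

print("conv output shifts with the input:", conv_equivariant)
print("patch tokens match some shift of the originals:", patch_matches_any_shift)
```

The convolution gets the "same feature, different place" relationship for free; the patch tokenizer has to learn it from examples, which is one way to read the data appetite described above.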

What I took away

  • ViTs trade CNN inductive biases for flexibility — and pay for it in data.
  • Patch embedding + positional encoding is the only genuinely vision-specific part; the rest is the text Transformer.