Attend

The attention mechanism, from scratch.

A patient, step-by-step walk through the one idea behind every modern language model — from a single dot product all the way to the full Transformer block, with live math on every page.

Part I · Foundations

§1
A vector, a dot product
Similarity is just an inner product. We'll build the intuition by hand on two tiny vectors.
live · 10 min
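If you want to try §1's idea before opening the live page, here is a minimal NumPy sketch; the two vectors are made up for illustration, not the lesson's actual numbers.

```python
import numpy as np

# Two tiny illustrative vectors (not the lesson's numbers).
a = np.array([1.0, 2.0])
b = np.array([3.0, 1.0])

# Similarity as an inner product: sum of elementwise products.
score = np.dot(a, b)            # 1*3 + 2*1 = 5.0
print(score)

# Dividing by the lengths gives cosine similarity,
# a scale-free version of the same idea.
cosine = score / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine)                   # ≈ 0.707
```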
§2
From similarity to attention
Turn scores into weights with softmax. Feel how temperature bends the distribution.
live · 12 min
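A small sketch of §2's move from scores to weights, again with made-up numbers; `temperature` is our own name for the knob the lesson lets you drag.

```python
import numpy as np

def softmax(scores, temperature=1.0):
    """Turn raw similarity scores into weights that sum to 1.

    Higher temperature flattens the distribution;
    lower temperature sharpens it toward the argmax.
    """
    z = scores / temperature
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5])   # illustrative scores
for t in (0.5, 1.0, 2.0):
    print(t, softmax(scores, t))
```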

Part II · Scaled dot-product attention

§3
Why we scale by √dₖ
Watch variance explode as dₖ grows, and softmax collapse to one-hot. Then fix it.
live · 8 min
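You can reproduce §3's observation in a few lines. Assumed setup: i.i.d. standard-normal q and k, for which Var(q·k) = dₖ, so raw dot products grow with dimension and softmax saturates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample many (q, k) pairs and measure the dot-product variance.
for d_k in (4, 64, 1024):
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    raw = (q * k).sum(axis=1)
    print(d_k, raw.var(), (raw / np.sqrt(d_k)).var())  # scaled var ≈ 1
```

The scaled column stays near 1 however large dₖ gets, which is exactly why dividing by √dₖ keeps softmax from collapsing to one-hot.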
§4
Splitting Q, K, V
A worked example on six tokens. Every number computed live; hover to trace the flow.
live · flagship · 20 min
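A compact NumPy version of §4's pipeline on six tokens; the random embeddings and projection weights are placeholders, since the lesson computes its own numbers live.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 6, 16, 8            # six tokens, as in the lesson

x = rng.standard_normal((n_tokens, d_model))  # token embeddings
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_k))

Q, K, V = x @ W_q, x @ W_k, x @ W_v           # queries, keys, values

scores = Q @ K.T / np.sqrt(d_k)               # (6, 6) similarity grid
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
out = weights @ V                             # each row: weighted mix of values
print(out.shape)                              # (6, 8)
```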

Part III · Multi-head and beyond

§5
Per-head attention
Eight heads, each with its own view of the sentence. Compare patterns side-by-side.
live · 15 min
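A sketch of §5's head-splitting, assuming the usual reshape-and-transpose layout; the shapes, not the exact patterns, are the point here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 6, 64, 8
d_head = d_model // n_heads                   # 8 dims per head

x = rng.standard_normal((n_tokens, d_model))
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

def split_heads(t):
    # Project once, then reshape so each head sees its own d_head slice.
    return t.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (8, 6, 6)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
print(weights.shape)          # eight 6×6 attention patterns to compare
```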
§6
Positional encodings
Sinusoids of many wavelengths. Drag through dimensions to see the pattern emerge.
live · 10 min
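The sinusoids of §6 follow the standard formulation from "Attention Is All You Need"; `positional_encoding` below is our own helper name for it.

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoids with geometrically spaced wavelengths, one pair per dim."""
    pos = np.arange(n_positions)[:, None]         # (n, 1) positions
    i = np.arange(0, d_model, 2)[None, :]         # even dimension indices
    angles = pos / (10_000 ** (i / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                  # sin on even dims
    pe[:, 1::2] = np.cos(angles)                  # cos on odd dims
    return pe

pe = positional_encoding(50, 16)
print(pe[0, :4], pe[10, :4])   # each position gets a unique fingerprint
```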
§7
The full Transformer block
Attention + residual + layer-norm + FFN. Click each wire to see what flows through.
live · 18 min
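And a minimal single-head sketch of §7's full block, assuming post-norm ordering and a ReLU feed-forward with random placeholder weights; the live page wires in the multi-head attention from §5 instead.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_ff = 6, 16, 64

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(x, W_q, W_k, W_v):
    # Single head for brevity: d_k equals the model width here.
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    return softmax(Q @ K.T / np.sqrt(d)) @ V

W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
W_1 = rng.standard_normal((d, d_ff)) * 0.1
W_2 = rng.standard_normal((d_ff, d)) * 0.1

x = rng.standard_normal((n, d))
x = layer_norm(x + attention(x, W_q, W_k, W_v))   # attention + residual + norm
x = layer_norm(x + np.maximum(0, x @ W_1) @ W_2)  # FFN (ReLU) + residual + norm
print(x.shape)                                    # (6, 16)
```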