Attend

The attention mechanism, from scratch.

A patient, step-by-step walk through the one idea behind every modern language model — from a single dot product all the way to the full Transformer block, with live math on every page.

Part I · Foundations

§1
A vector, a dot product
Similarity is just an inner product. We'll build the intuition by hand on two tiny vectors.
live · 10 min
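If you want to try §1's idea before opening the live page, here is a minimal NumPy sketch; the two vectors are made up for illustration, not the lesson's actual numbers.

```python
import numpy as np

# Two tiny illustrative vectors (not the lesson's numbers).
a = np.array([1.0, 2.0])
b = np.array([3.0, 1.0])

# Similarity as an inner product: sum of elementwise products.
score = np.dot(a, b)            # 1*3 + 2*1 = 5.0
print(score)

# Dividing by the lengths gives cosine similarity,
# a scale-free version of the same idea.
cosine = score / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine)                   # ≈ 0.707
```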
§2
From similarity to attention
Turn scores into weights with softmax. Feel how temperature bends the distribution.
live · 12 min
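A small sketch of §2's move from scores to weights, again with made-up numbers; `temperature` is our own name for the knob the lesson lets you drag.

```python
import numpy as np

def softmax(scores, temperature=1.0):
    """Turn raw similarity scores into weights that sum to 1.

    Higher temperature flattens the distribution;
    lower temperature sharpens it toward the argmax.
    """
    z = scores / temperature
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5])   # illustrative scores
for t in (0.5, 1.0, 2.0):
    print(t, softmax(scores, t))
```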

Part II · Scaled dot-product attention

§3
Why we scale by √dₖ
Watch variance explode as dₖ grows, and softmax collapse to one-hot. Then fix it.
live · 8 min
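You can reproduce §3's observation in a few lines. Assumed setup: i.i.d. standard-normal q and k, for which Var(q·k) = dₖ, so raw dot products grow with dimension and softmax saturates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample many (q, k) pairs and measure the dot-product variance.
for d_k in (4, 64, 1024):
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    raw = (q * k).sum(axis=1)
    print(d_k, raw.var(), (raw / np.sqrt(d_k)).var())  # scaled var ≈ 1
```

The scaled column stays near 1 however large dₖ gets, which is exactly why dividing by √dₖ keeps softmax from collapsing to one-hot.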
§4
Splitting Q, K, V
A worked example on six tokens. Every number computed live; hover to trace the flow.
live · flagship · 20 min
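A compact NumPy version of §4's pipeline on six tokens; the random embeddings and projection weights are placeholders, since the lesson computes its own numbers live.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 6, 16, 8            # six tokens, as in the lesson

x = rng.standard_normal((n_tokens, d_model))  # token embeddings
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_k))

Q, K, V = x @ W_q, x @ W_k, x @ W_v           # queries, keys, values

scores = Q @ K.T / np.sqrt(d_k)               # (6, 6) similarity grid
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
out = weights @ V                             # each row: weighted mix of values
print(out.shape)                              # (6, 8)
```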

Part III · Multi-head and beyond

§5
Per-head attention
Eight heads, each with its own view of the sentence. Compare patterns side-by-side.
live · 15 min
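A sketch of §5's head-splitting, assuming the usual reshape-and-transpose layout; the shapes, not the exact patterns, are the point here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 6, 64, 8
d_head = d_model // n_heads                   # 8 dims per head

x = rng.standard_normal((n_tokens, d_model))
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

def split_heads(t):
    # Project once, then reshape so each head sees its own d_head slice.
    return t.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (8, 6, 6)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
print(weights.shape)          # eight 6×6 attention patterns to compare
```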
§6
Positional encodings
Sinusoids of many wavelengths. Drag through dimensions to see the pattern emerge.
live · 10 min
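The sinusoids of §6 follow the standard formulation from "Attention Is All You Need"; `positional_encoding` below is our own helper name for it.

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoids with geometrically spaced wavelengths, one pair per dim."""
    pos = np.arange(n_positions)[:, None]         # (n, 1) positions
    i = np.arange(0, d_model, 2)[None, :]         # even dimension indices
    angles = pos / (10_000 ** (i / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                  # sin on even dims
    pe[:, 1::2] = np.cos(angles)                  # cos on odd dims
    return pe

pe = positional_encoding(50, 16)
print(pe[0, :4], pe[10, :4])   # each position gets a unique fingerprint
```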
§7
The full Transformer block
Attention + residual + layer-norm + FFN. Click each wire to see what flows through.
live · 18 min
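And a minimal single-head sketch of §7's full block, assuming post-norm ordering and a ReLU feed-forward with random placeholder weights; the live page wires in the multi-head attention from §5 instead.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_ff = 6, 16, 64

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(x, W_q, W_k, W_v):
    # Single head for brevity: d_k equals the model width here.
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    return softmax(Q @ K.T / np.sqrt(d)) @ V

W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
W_1 = rng.standard_normal((d, d_ff)) * 0.1
W_2 = rng.standard_normal((d_ff, d)) * 0.1

x = rng.standard_normal((n, d))
x = layer_norm(x + attention(x, W_q, W_k, W_v))   # attention + residual + norm
x = layer_norm(x + np.maximum(0, x @ W_1) @ W_2)  # FFN (ReLU) + residual + norm
print(x.shape)                                    # (6, 16)
```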