Part III · Multi-head and beyond

Positional encodings, a choir of sinusoids.

Attention is permutation-invariant. Shuffle the tokens and the math is unchanged — which is a disaster for language, where order is meaning. The Transformer fixes this by adding a pattern to every token embedding that depends only on its position.
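To see that claim concretely, here is a minimal NumPy sketch (an illustration, not the paper's code): a single attention head with no learned projections, where shuffling the input rows only shuffles the output rows in the same way.

```python
import numpy as np

def self_attention(X):
    # Single head, queries = keys = values = X, no learned projections.
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))        # 5 tokens, 8-dim embeddings
perm = rng.permutation(5)

out = self_attention(X)
out_shuffled = self_attention(X[perm])
print(np.allclose(out_shuffled, out[perm]))   # True: only the row order moved
```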

Each position pos gets a d_model-dimensional vector whose even entries are sines and odd entries are cosines, each (sin, cos) pair at its own frequency:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
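A short NumPy sketch of that formula, assuming d_model is even (the function name is illustrative, not from any library):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Rows are positions, columns are dimensions; each (sin, cos) pair
    # shares the frequency 1 / 10000^(2i / d_model).
    positions = np.arange(seq_len)[:, None]                   # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                      # (1, d_model/2)
    angles = positions / np.power(10000.0, 2 * i / d_model)   # (seq_len, d_model/2)
    pe = np.empty((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                              # even dimensions
    pe[:, 1::2] = np.cos(angles)                              # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=32, d_model=64)
print(pe.shape)   # (32, 64), the grid shown in Figure 6
```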

The pattern

Below, rows are positions (0…31) and columns are dimensions (0…63). Low-index columns oscillate fast; high-index columns drift slowly. Together they give every position a unique fingerprint, and, crucially, one that encodes relative distance in a linearly recoverable way.

[Figure 6 · sinusoidal positional encoding: interactive heatmap with d_model = 64 and sequence length 32; columns run from dim 0 (fast) through dim d/2 (medium) to dim d-1 (slow), with a control to trace a single dimension i.]
Fig 6 — the encoding is added to the token embedding before attention. The network can learn to attend by position ("look two tokens back") because relative offsets are linear in this basis.
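A sketch of that last claim (the encoding helper is repeated so the snippet stands alone; offset_matrix is an illustrative name, not from the paper): for a fixed offset k there is a single matrix, built from one 2x2 rotation per (sin, cos) pair, that maps PE(pos) to PE(pos + k) at every position.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = positions / np.power(10000.0, 2 * i / d_model)
    pe = np.empty((seq_len, d_model))
    pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)
    return pe

def offset_matrix(k, d_model):
    # One 2x2 rotation per (sin, cos) pair, block-diagonal over all pairs.
    i = np.arange(d_model // 2)
    omega = 1.0 / np.power(10000.0, 2 * i / d_model)
    M = np.zeros((d_model, d_model))
    for j, w in enumerate(omega):
        c, s = np.cos(w * k), np.sin(w * k)
        M[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]
    return M

pe = sinusoidal_positional_encoding(seq_len=32, d_model=64)
M = offset_matrix(k=2, d_model=64)
print(np.allclose(pe[2:], pe[:-2] @ M.T))   # True: "look two tokens back" is one fixed linear map
```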
One sentence to remember — the Transformer doesn't know what "first" means, so we tell it — in a basis of sinusoids.