Part II · Scaled dot-product attention

Why we scale by √dk.

In the paper, one tiny divisor prevents softmax from collapsing. Raise the dimension, watch the attention pattern sharpen into a single spike, and the fix becomes obvious.

If q and k are vectors of length dk with independent, zero-mean, unit-variance entries, the dot product q · k has variance dk. Grow dk from 4 to 64 to 512, and the raw scores spread wider and wider — feeding softmax a distribution it can only respond to with a near-one-hot spike.
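A quick numerical check of that claim, as a minimal NumPy sketch (the Gaussian entries and sample count are illustrative choices, not from the article):

```python
import numpy as np

# Illustrative check: the std of a raw dot-product score grows like sqrt(dk)
# when q and k have i.i.d. zero-mean, unit-variance entries.
rng = np.random.default_rng(0)
for dk in (4, 64, 512):
    q = rng.standard_normal((10_000, dk))
    k = rng.standard_normal((10_000, dk))
    scores = (q * k).sum(axis=1)       # 10,000 independent dot products
    print(dk, round(scores.std(), 1))  # ≈ sqrt(dk): about 2, 8, 22.6
```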

Live experiment

Below we sample 6 queries × 6 keys with i.i.d. entries and compute the attention weights of the first query. Slide dk up and toggle the √dk divisor on and off. Watch what happens without it.
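If you are reading a static copy of this page, the sketch below reproduces the same experiment; the standard-normal entries and the fixed seed are assumptions, since the demo does not pin down its distribution:

```python
import numpy as np

# Minimal sketch of the experiment above: 6 queries x 6 keys with i.i.d.
# standard-normal entries; report the largest attention weight of query 0,
# with and without the sqrt(dk) divisor.
rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=-1, keepdims=True)

for dk in (8, 64, 512):
    Q = rng.standard_normal((6, dk))         # 6 queries
    K = rng.standard_normal((6, dk))         # 6 keys
    raw = Q @ K.T                            # raw dot-product scores
    unscaled = softmax(raw)[0]               # weights of query 0, no divisor
    scaled = softmax(raw / np.sqrt(dk))[0]   # same scores with the sqrt(dk) divisor
    print(f"dk={dk:4d}  max w/o divisor={unscaled.max():.2f}  "
          f"max with divisor={scaled.max():.2f}")
# Without the divisor the max weight heads toward 1.0 as dk grows;
# with it the weights stay comparably spread at every dk.
```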

Figure 4 · variance vs scaling. Interactive panel: a dk slider (shown at 8), the raw scores of row 0 with their std, and the resulting softmax weights with their max.
Fig 4 — without the divisor, softmax collapses to a near-one-hot as dk grows. With it, the spread is held constant regardless of dimension.

The math

For independent, zero-mean, unit-variance entries qᵢ, kᵢ:

Var(q · k) = Σᵢ Var(qᵢkᵢ) = dk

Dividing by √dk rescales that variance back to 1. Softmax inputs stay in a regime where gradients flow — which is exactly why the paper includes the divisor even though nothing else in the expression demands it.
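Spelling the step out under those assumptions (zero mean is what makes each product term contribute exactly 1):

```latex
\operatorname{Var}(q \cdot k)
  = \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i)
  = \sum_{i=1}^{d_k} \bigl( \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2]
        - \mathbb{E}[q_i]^2\,\mathbb{E}[k_i]^2 \bigr)
  = d_k,
\qquad
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = \frac{d_k}{d_k} = 1.
```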

One sentence to remember — √dk keeps softmax's temperature constant as the model gets wider.