Why we scale by √dk.
In Attention Is All You Need, one tiny divisor prevents softmax from collapsing. Raise the dimension, watch the attention pattern sharpen into a single spike, and the fix becomes obvious.
If q and k are vectors of length dk with independent, unit-variance entries, the dot product q · k has variance dk. Grow dk from 4 to 64 to 512 and the raw scores spread wider and wider, handing softmax inputs so far apart that its output collapses into a near-one-hot spike.
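If you want to check that variance claim numerically before touching the widget, here is a minimal numpy sketch; the sample count and the standard-normal entries are illustrative assumptions, not part of the experiment below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Empirical variance of q . k for i.i.d. unit-variance entries,
# measured over 100k independent draws at each dimension.
for d_k in (4, 64, 512):
    q = rng.standard_normal((100_000, d_k))
    k = rng.standard_normal((100_000, d_k))
    dots = np.einsum("ij,ij->i", q, k)   # row-wise dot products
    print(d_k, round(float(dots.var()), 1))  # roughly 4, 64, 512
```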
Live experiment
Below we sample 6 queries × 6 keys with i.i.d. entries and compute the attention weights of the first query. Slide dk up and toggle the √dk divisor on and off. Watch what happens without it.
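If you are reading without the interactive widget, here is a rough static stand-in in numpy. The function name, the random seed, and the standard-normal entries are my own choices for illustration; the 6 × 6 shape and the toggleable divisor mirror the experiment described above.

```python
import numpy as np

def first_query_attention(d_k, scale=True, seed=0):
    """Attention weights of the first of 6 queries over 6 keys."""
    rng = np.random.default_rng(seed)
    Q = rng.standard_normal((6, d_k))     # 6 queries
    K = rng.standard_normal((6, d_k))     # 6 keys
    scores = Q[0] @ K.T                   # raw scores for query 0
    if scale:
        scores = scores / np.sqrt(d_k)    # the divisor under discussion
    scores -= scores.max()                # stabilize the exponentials
    weights = np.exp(scores)
    return weights / weights.sum()

for d_k in (4, 64, 512):
    print(d_k, "with /sqrt(dk):   ", np.round(first_query_attention(d_k, True), 3))
    print(d_k, "without /sqrt(dk):", np.round(first_query_attention(d_k, False), 3))
```

Without the divisor, the printed weights drift toward a single entry near 1 as dk grows; with it, they stay spread out.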
The math
For independent, zero-mean, unit-variance qᵢ, kᵢ, each product qᵢkᵢ has variance 1, and the dk products are independent of each other, so their variances add:
Var(q · k) = Σᵢ Var(qᵢkᵢ) = dk
Dividing by √dk rescales that variance back to 1. Softmax inputs stay in a regime where gradients flow — which is exactly why the paper includes the divisor even though nothing else in the expression demands it.
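To see the gradient side of the argument directly: the softmax Jacobian is diag(p) - ppᵀ, and every entry shrinks toward zero once p is nearly one-hot. A small sketch, where the choice of dk = 512, standard-normal scores, and the max-entry summary are assumptions made for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax_jacobian(p):
    # d softmax_i / d x_j = p_i * (delta_ij - p_j)
    return np.diag(p) - np.outer(p, p)

rng = np.random.default_rng(0)
d_k = 512
q = rng.standard_normal(d_k)
K = rng.standard_normal((6, d_k))
scores = K @ q                            # raw scores for one query over 6 keys

for label, s in (("unscaled", scores), ("scaled  ", scores / np.sqrt(d_k))):
    J = softmax_jacobian(softmax(s))
    print(label, "largest |d softmax / d score| =", float(np.abs(J).max()))
```

The unscaled Jacobian entries are vanishingly small at this dimension; the scaled ones stay at a usable magnitude, which is the "gradients flow" regime the divisor buys.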