Part II · Scaled dot-product attention

Why we scale by √dk.

In the paper, one tiny divisor prevents softmax from collapsing. Raise the dimension, watch the attention pattern sharpen into a single spike, and the fix becomes obvious.

If q and k are vectors of length dk with independent, zero-mean, unit-variance entries, the dot product q · k has variance dk. Grow dk from 4 to 64 to 512, and the raw scores spread wider and wider — feeding softmax a distribution it can only respond to with a near-one-hot spike.
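A quick numerical check of that claim, as a minimal NumPy sketch (the Gaussian entries and sample count are illustrative choices, not from the article):

```python
import numpy as np

# Illustrative check: the std of a raw dot-product score grows like sqrt(dk)
# when q and k have i.i.d. zero-mean, unit-variance entries.
rng = np.random.default_rng(0)
for dk in (4, 64, 512):
    q = rng.standard_normal((10_000, dk))
    k = rng.standard_normal((10_000, dk))
    scores = (q * k).sum(axis=1)       # 10,000 independent dot products
    print(dk, round(scores.std(), 1))  # ≈ sqrt(dk): about 2, 8, 22.6
```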

Live experiment

Below we sample 6 queries × 6 keys with i.i.d. entries and compute the attention weights of the first query. Slide dk up and toggle the √dk divisor on and off. Watch what happens without it.
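If you are reading a static copy of this page, the sketch below reproduces the same experiment; the standard-normal entries and the fixed seed are assumptions, since the demo does not pin down its distribution:

```python
import numpy as np

# Minimal sketch of the experiment above: 6 queries x 6 keys with i.i.d.
# standard-normal entries; report the largest attention weight of query 0,
# with and without the sqrt(dk) divisor.
rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=-1, keepdims=True)

for dk in (8, 64, 512):
    Q = rng.standard_normal((6, dk))         # 6 queries
    K = rng.standard_normal((6, dk))         # 6 keys
    raw = Q @ K.T                            # raw dot-product scores
    unscaled = softmax(raw)[0]               # weights of query 0, no divisor
    scaled = softmax(raw / np.sqrt(dk))[0]   # same scores with the sqrt(dk) divisor
    print(f"dk={dk:4d}  max w/o divisor={unscaled.max():.2f}  "
          f"max with divisor={scaled.max():.2f}")
# Without the divisor the max weight heads toward 1.0 as dk grows;
# with it the weights stay comparably spread at every dk.
```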

Figure 4 · variance vs scaling. Interactive panel: a dk slider (shown at 8), the raw scores of row 0 with their std, and the resulting softmax weights with their max.
Fig 4 — without the divisor, softmax collapses to a near-one-hot as dk grows. With it, the spread is held constant regardless of dimension.

The math

For independent, zero-mean, unit-variance entries qᵢ, kᵢ:

Var(q · k) = Σᵢ Var(qᵢkᵢ) = dk

Dividing by √dk rescales that variance back to 1. Softmax inputs stay in a regime where gradients flow — which is exactly why the paper includes the divisor even though nothing else in the expression demands it.
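Spelling the step out under those assumptions (zero mean is what makes each product term contribute exactly 1):

```latex
\operatorname{Var}(q \cdot k)
  = \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i)
  = \sum_{i=1}^{d_k} \bigl( \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2]
        - \mathbb{E}[q_i]^2\,\mathbb{E}[k_i]^2 \bigr)
  = d_k,
\qquad
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = \frac{d_k}{d_k} = 1.
```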

One sentence to remember — √dk keeps softmax's temperature constant as the model gets wider.