Part I · Foundations

From similarity to attention.

A list of dot-product scores is not yet an attention pattern. We need to turn raw numbers into weights that sum to one — a probability distribution over what to look at. That is what softmax does.

Start with six scores: one per token. These might be dot products of a query against six keys. They can be anything — positive, negative, tiny, huge.
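As a concrete sketch of that step, here is softmax applied to six made-up scores (the numbers are arbitrary, and the `softmax` helper is just an illustration using NumPy, not code from any particular library):

```python
import numpy as np

# Hypothetical scores: dot products of one query against six keys.
scores = np.array([2.1, -0.3, 0.8, 4.0, -1.5, 0.0])

def softmax(x):
    # Subtract the max before exponentiating for numerical stability;
    # the shift cancels out when we divide by the sum.
    e = np.exp(x - np.max(x))
    return e / e.sum()

weights = softmax(scores)
print(np.round(weights, 3))  # roughly [0.122 0.011 0.033 0.815 0.003 0.015]
print(weights.sum())         # 1.0
```

Whatever the scores are, the output is a valid probability distribution: every weight is positive and the six of them sum to one.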

Softmax, with a dial

Drag the temperature below. At T = 1 you get the textbook softmax. Push T toward zero and the distribution sharpens into a one-hot argmax. Push it toward infinity and it flattens into a uniform average.

Figure 3 · softmax temperature (interactive: a temperature slider T, starting at 1.00, and a "randomize scores" button)
Fig 3 — the height of each orange bar is the final attention weight. All six weights always sum to 1.
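If you want to reproduce the dial without the slider, here is a minimal sketch. The `softmax_with_temperature` name and the example scores are my own choices for illustration; the only dependency assumed is NumPy.

```python
import numpy as np

def softmax_with_temperature(x, T=1.0):
    # Divide the scores by T before exponentiating.
    # T -> 0 sharpens toward a one-hot argmax; large T flattens toward uniform.
    z = np.asarray(x, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

scores = [2.1, -0.3, 0.8, 4.0, -1.5, 0.0]
for T in (0.1, 1.0, 10.0):
    print(T, np.round(softmax_with_temperature(scores, T), 3))

# T = 0.1  -> nearly all weight on the largest score (index 3)
# T = 1.0  -> the textbook softmax
# T = 10.0 -> weights drift toward uniform (each near 1/6 ≈ 0.167)
```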

Why exponentiate?

Softmax is softmax(xᵢ) = exp(xᵢ) / Σⱼ exp(xⱼ). The exponential is what makes small differences in scores blow up into big differences in weights — a score that is 2 units higher gets about e² ≈ 7× the attention. That sharpness is why the model can make decisions instead of averaging everything into mush.
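A quick check of that 2-unit claim, using two arbitrary scores chosen only to differ by 2:

```python
import math

# With T = 1, a score 2 units higher gets e**2 ≈ 7.4 times the weight.
a, b = 3.0, 1.0
wa, wb = math.exp(a), math.exp(b)
print(wa / wb)                         # 7.389... = e**2
print(wa / (wa + wb), wb / (wa + wb))  # ≈ 0.881 vs 0.119
```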

One sentence to remember — softmax turns scores into weights that sum to 1, and its exponential gives the network the power to commit.