From similarity to attention.
A list of dot-product scores is not yet an attention pattern. We need to turn raw numbers into weights that sum to one — a probability distribution over what to look at. That is what softmax does.
Start with six scores: one per token. These might be dot products of a query against six keys. They can be anything — positive, negative, tiny, huge.
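In code, the step from six raw scores to an attention pattern is a few lines. A minimal sketch (the score values here are made up for illustration; real ones would come from query–key dot products):

```python
import numpy as np

# Six raw scores, one per token. Hypothetical values for illustration;
# in a transformer these would be dot products of a query with six keys.
scores = np.array([2.0, -1.0, 0.5, 3.0, -0.2, 1.0])

def softmax(x):
    # Subtract the max before exponentiating for numerical stability;
    # the shift cancels in the ratio, so the result is unchanged.
    e = np.exp(x - x.max())
    return e / e.sum()

weights = softmax(scores)
print(weights)        # six positive numbers
print(weights.sum())  # 1, up to float error
```

Note the max-subtraction trick: softmax is shift-invariant, so subtracting the largest score changes nothing mathematically but prevents `exp` from overflowing on large inputs.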
Softmax, with a dial
Drag the temperature below. The dial divides every score by T before exponentiating, so the weights become exp(xᵢ/T) normalized to sum to one. At T = 1 you get the textbook softmax. Push T toward zero and the distribution sharpens into a one-hot argmax. Push it toward infinity and it flattens into a uniform average.
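The dial's behavior is easy to reproduce directly. A sketch of temperature-scaled softmax on the same six illustrative scores, swept across three temperatures:

```python
import numpy as np

scores = np.array([2.0, -1.0, 0.5, 3.0, -0.2, 1.0])

def softmax_T(x, T):
    # Divide by temperature before exponentiating.
    # Small T exaggerates score differences; large T washes them out.
    z = (x - x.max()) / T
    e = np.exp(z)
    return e / e.sum()

for T in (0.1, 1.0, 100.0):
    print(T, np.round(softmax_T(scores, T), 3))
# Low T: nearly all weight on the largest score (index 3).
# High T: weights approach the uniform 1/6 each.
```

At T = 1 this reduces exactly to the textbook softmax, since dividing by 1 changes nothing.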
Why exponentiate?
Softmax is exp(xᵢ) / Σⱼ exp(xⱼ). The exponential is what makes small differences in scores blow up into big differences in weights — a score that is 2 units higher gets about e² ≈ 7.4× the attention. That sharpness is why the model can make decisions instead of averaging everything into mush.
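The e² claim follows directly from the formula: the weight ratio between two tokens depends only on the *difference* of their scores, because the shared normalizer cancels. A two-line check:

```python
import math

# Two scores 2 units apart. The softmax weight ratio is
# exp(a) / exp(b) = exp(a - b), independent of the absolute values.
a, b = 5.0, 3.0
ratio = math.exp(a) / math.exp(b)
print(ratio)  # exp(2) ≈ 7.39
```

Shift both scores by any constant and the ratio is unchanged — which is the same shift-invariance that makes the max-subtraction stability trick legal.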