What just happened
Each row of the output is a convex combination of the value rows, weighted by how much this query "asked" each key. A row that attends uniformly just averages all values; a row that attends sharply copies a single value. Multi-head attention runs h copies of this in parallel, each with its own learned Q, K, V projections — giving the model h different ways to relate tokens.
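The uniform-vs-sharp contrast can be checked directly. This is a minimal sketch with made-up value rows (the numbers here are illustrative, not the article's): a uniform attention row averages the value rows, while a one-hot row copies a single one.

```python
import numpy as np

# Three illustrative value rows (chosen for this sketch, not from the article).
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0]])

uniform = np.full(3, 1 / 3)            # attends uniformly -> plain average of rows
sharp = np.array([0.0, 1.0, 0.0])      # attends to one key -> copies that value row

print(uniform @ V)   # the mean of V's rows
print(sharp @ V)     # exactly V[1]
```

Both rows are valid attention weights (non-negative, summing to 1), so both outputs lie inside the convex hull of the value rows.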
Try it yourself
Reproduce the numbers above with NumPy. The shapes are small enough that you can print every intermediate.
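A starting point for that exercise might look like the sketch below. The shapes and random seed here are placeholder assumptions (they won't reproduce the article's specific numbers, which you should substitute in), but the computation is the standard scaled dot-product attention, with every intermediate printed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder shapes: 4 tokens, key/query/value dimension 8.
# Swap in the article's Q, K, V to reproduce its numbers.
n, d_k = 4, 8
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_k))

scores = Q @ K.T / np.sqrt(d_k)        # [n, n]: each query scored against each key

# Row-wise softmax (shift by the row max for numerical stability).
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

out = weights @ V                      # each output row: convex combination of V's rows

print(scores)    # small enough to read in full
print(weights)   # every row sums to 1
print(out)       # shape [n, d_k]
```

Printing `weights` is the most instructive step: near-uniform rows signal averaging, peaked rows signal copying.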
Next up: we'll stack h = 8 heads, each with d_k = 64, and see how concatenation followed by the output projection W_O gives us back an [n, d_model] tensor — the same shape we started with, now enriched by eight distinct attention views.
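As a shape-only preview of that step (the per-head outputs and W_O below are random placeholders, not trained projections), concatenating h = 8 heads of width d_k = 64 and applying the output projection lands back at [n, d_model]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder sizes matching the text: 8 heads of width 64, so d_model = 512.
n, h, d_k = 4, 8, 64
d_model = h * d_k

# Stand-ins for the per-head attention outputs, one [n, d_k] matrix per head.
heads = [rng.normal(size=(n, d_k)) for _ in range(h)]

concat = np.concatenate(heads, axis=-1)        # [n, h * d_k] = [n, 512]
W_O = rng.normal(size=(d_model, d_model))      # random stand-in for the learned W_O
out = concat @ W_O                             # back to [n, d_model]

print(concat.shape, out.shape)
```

Only the shapes matter here; the learned projections are what the next section covers.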