The Transformer block, wired up.
We have every ingredient. Now we assemble. A Transformer block is a piece of circuitry with one attention sub-layer, one feed-forward sub-layer, two residuals, and two layer-norms — no more, no less.
Click any wire in the diagram. The right-hand panel explains what flows through it, what shape it has, and what the math does at that point.
The block
The equation
Written out, one encoder layer is:
z = LayerNorm(x + MultiHead(x, x, x))
y = LayerNorm(z + FFN(z))
The residuals let gradients flow unimpeded past every sub-layer; layer-norm keeps activations in a well-conditioned range; multi-head attention mixes information across tokens; the FFN (two linear layers with a ReLU between them) does the position-wise computation. Everything else in a full model (decoder, cross-attention, masking) is a variation on these four parts.
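The two equations translate almost line-for-line into code. Here is a minimal sketch in PyTorch; the class name EncoderBlock and the sizes d_model=512, n_heads=8, d_ff=2048 are illustrative defaults (they happen to match the original paper), and nn.MultiheadAttention stands in for the multi-head attention assembled earlier.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder layer: attention + FFN, each wrapped in residual + layer-norm."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(          # two linears with a ReLU between them
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                  # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)   # MultiHead(x, x, x)
        z = self.norm1(x + attn_out)       # z = LayerNorm(x + MultiHead(x, x, x))
        y = self.norm2(z + self.ffn(z))    # y = LayerNorm(z + FFN(z))
        return y
```

Note that the input and output shapes are identical, which is exactly what lets these blocks be stacked.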
What you've built
From a single dot product you now have: similarity → softmax → scaled dot-product attention → multi-head attention → positional encoding → the full block. Every Transformer in use today — the ones writing code, translating languages, folding proteins — is a stack of these blocks with variations on what's masked and what's cross-attended. The math is no harder than what you just read.
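As a rough illustration of that stacking, continuing the sketch above (the depth of 6 is just an example):

```python
# N identical blocks in a row; shape is preserved from block to block.
blocks = nn.Sequential(*[EncoderBlock() for _ in range(6)])

x = torch.randn(2, 10, 512)   # (batch=2, seq_len=10, d_model=512)
y = blocks(x)
print(y.shape)                # torch.Size([2, 10, 512])
```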