Transformer Forward Pass

Layer l, timestep t=4 ("cat") · d_model=512, d_head=64, n_heads=8

Legend: 1D Activation · 2D Weight · Residual · Nonlinear · Per-Head
Layer l-1 outputs
t=1 "I" [512]
t=2 "am" [512]
t=3 "a" [512]
t=4 "cat" [512]
Multi-Head Attention
8× softmax matmul
I am a cat ← K,V use all t≤4
Query (t=4): x₄ [512] × Wq [512×512] → Q [512]
Keys: X [4×512] × Wk [512×512] → K [4×512]
Values: X [4×512] × Wv [512×512] → V [4×512]
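The three projections above can be sketched in NumPy. This is a minimal illustration, not the diagram's own code: it assumes random weights, no bias terms, and the row-vector convention (activation × weight) that the shapes above suggest. `project_qkv` is a hypothetical helper name.

```python
import numpy as np

D_MODEL = 512  # model width from the diagram

def project_qkv(X, Wq, Wk, Wv):
    """Project the newest token to a query and all tokens to keys/values.

    X has shape [t, d_model]; its last row is the current token (t=4, "cat").
    Biases are omitted (assumption: the diagram shows none).
    """
    x_t = X[-1]      # [512]    current token only
    Q = x_t @ Wq     # [512]    one query for the current position
    K = X @ Wk       # [4, 512] keys for every position t <= 4
    V = X @ Wv       # [4, 512] values for every position t <= 4
    return Q, K, V

rng = np.random.default_rng(0)
X = rng.standard_normal((4, D_MODEL))  # stand-ins for "I", "am", "a", "cat"
Wq, Wk, Wv = (rng.standard_normal((D_MODEL, D_MODEL)) * D_MODEL ** -0.5
              for _ in range(3))
Q, K, V = project_qkv(X, Wq, Wk, Wv)
print(Q.shape, K.shape, V.shape)  # (512,) (4, 512) (4, 512)
```

Only the current token needs a query, while keys and values cover the whole history, which is why Q is a vector and K, V are matrices.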
split → 8 × 64
8 Independent Heads
concat → 512
Output Projection (linear): H [512] × Wo [512×512] → out [512]
current token queries history
8 independent summaries
Heads: 1 · 2 · 3 · 4 · 5 · 6 · 7 · 8
each: softmax → weighted avg
mix heads linearly → residual
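The per-head softmax, weighted average, concatenation, and output projection might look like the sketch below. It assumes the standard 1/sqrt(d_head) scaling of the scores, which the diagram does not state explicitly, and uses random inputs; `attend` is a hypothetical name.

```python
import numpy as np

N_HEADS, D_HEAD = 8, 64  # from the diagram: 8 x 64 = 512

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def attend(Q, K, V, Wo):
    """Single-query multi-head attention for the current token.

    Q: [512]; K, V: [t, 512]; Wo: [512, 512].
    Returns (out [512], weights [8, t]).
    """
    t = K.shape[0]
    q = Q.reshape(N_HEADS, D_HEAD)                          # split -> [8, 64]
    k = K.reshape(t, N_HEADS, D_HEAD).transpose(1, 0, 2)    # [8, t, 64]
    v = V.reshape(t, N_HEADS, D_HEAD).transpose(1, 0, 2)    # [8, t, 64]
    scores = np.einsum("hd,htd->ht", q, k) / np.sqrt(D_HEAD)  # assumed scaling
    weights = softmax(scores)                               # each head's row sums to 1
    heads = np.einsum("ht,htd->hd", weights, v)             # 8 weighted averages, [8, 64]
    H = heads.reshape(N_HEADS * D_HEAD)                     # concat -> [512]
    return H @ Wo, weights                                  # mix heads linearly

rng = np.random.default_rng(0)
Q = rng.standard_normal(512)
K = rng.standard_normal((4, 512))
V = rng.standard_normal((4, 512))
Wo = rng.standard_normal((512, 512)) * 512 ** -0.5
out, weights = attend(Q, K, V, Wo)
print(out.shape, weights.shape)  # (512,) (8, 4)
```

No causal mask appears here because the single query at t=4 only ever sees keys and values for t ≤ 4.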
Loss Computation
softmax −log
Unembedding: x_L [512] × W_Eᵀ [512×50257] → z [50257]
Cross-Entropy: z [50257] → softmax → −log p[y] → scalar
512 → unembed → 50257 → CE
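A hedged sketch of the unembedding and cross-entropy step follows. The weight matrix is a random stand-in for the transposed embedding matrix (an assumption), float32 is used only to keep the 512×50257 matrix small in memory, and `cross_entropy_from_hidden` is a hypothetical name.

```python
import numpy as np

def cross_entropy_from_hidden(x_L, W_unembed, y):
    """Return (loss, logits) for one position.

    x_L: [d_model] final hidden state; W_unembed: [d_model, vocab]; y: label id.
    """
    z = x_L @ W_unembed                                   # logits, [vocab]
    z_shift = z - z.max()                                 # stabilise the exponentials
    log_probs = z_shift - np.log(np.exp(z_shift).sum())   # log softmax
    return -log_probs[y], z                               # loss = -log p[y]

rng = np.random.default_rng(0)
d_model, vocab = 512, 50257
x_L = rng.standard_normal(d_model, dtype=np.float32)
# Random stand-in for the unembedding matrix (assumption, not trained weights).
W_unembed = rng.standard_normal((d_model, vocab), dtype=np.float32)
W_unembed *= d_model ** -0.5
loss, z = cross_entropy_from_hidden(x_L, W_unembed, y=9338)
print(z.shape, float(loss) > 0)  # (50257,) True
```

Computing the log-softmax directly, rather than taking the log of a softmax, avoids underflow when p[y] is very small.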
Input: x^(l-1)₄ [512] · gradient ∂ℒ/∂x^(l-1)
LN₁ → to attn → drop → +attn
LN₂ → to mlp → drop → +mlp
Output: x^(l)₄ [512] · gradient ∂ℒ/∂x^(l)
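The residual path (LN₁ → attention → dropout → add, then LN₂ → MLP → dropout → add) can be sketched as below. The sublayers are passed in as callables so the block stays self-contained. Layer-norm gain and bias are omitted and inverted dropout is assumed; neither detail is given in the diagram. All function names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise the last axis to zero mean, unit variance (gain/bias omitted)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def dropout(x, p, rng):
    """Inverted dropout (assumed variant); p=0.0 disables it, as at inference."""
    if p == 0.0:
        return x
    keep = rng.random(x.shape) >= p
    return x * keep / (1.0 - p)

def residual_block(x, attn, mlp, p_drop=0.0, rng=None):
    """x -> x + drop(attn(LN1(x))) -> x + drop(mlp(LN2(x)))."""
    if rng is None:
        rng = np.random.default_rng()
    x = x + dropout(attn(layer_norm(x)), p_drop, rng)  # "+attn"
    x = x + dropout(mlp(layer_norm(x)), p_drop, rng)   # "+mlp"
    return x

rng = np.random.default_rng(0)
x_in = rng.standard_normal(512)
# Stand-in sublayers (hypothetical); real ones would be attention and the MLP.
zero = lambda h: np.zeros_like(h)
x_out = residual_block(x_in, attn=zero, mlp=zero)
print(np.allclose(x_out, x_in))  # True
```

With silent sublayers the stream passes through unchanged, which is the same identity path that lets ∂ℒ/∂x flow directly from the layer output back to its input.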
predicted: "that" [argmax(z)]
label: "sat" [y=9338]
MLP
GELU matmul
Up (4× expand): x [512] × W↑ [512×2048] → h [2048]
Activation: h [2048] → GELU → h' [2048]
Down (contract): h' [2048] × W↓ [2048×512] → out [512]
512 → expand → 2048 → GELU → 2048 → contract → 512
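The expand, GELU, contract sequence could be written as the following sketch. It assumes the tanh approximation of GELU (the diagram does not say which variant is used), no bias terms, and random weights. The function names are hypothetical.

```python
import numpy as np

def gelu(h):
    """tanh approximation of GELU (assumed variant)."""
    return 0.5 * h * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (h + 0.044715 * h**3)))

def mlp(x, W_up, W_down):
    """Position-wise feed-forward block: 512 -> 2048 -> 512."""
    h = x @ W_up            # up-projection (4x expand), [2048]
    h_act = gelu(h)         # elementwise nonlinearity,  [2048]
    return h_act @ W_down   # down-projection (contract), [512]

rng = np.random.default_rng(0)
x = rng.standard_normal(512)
W_up = rng.standard_normal((512, 2048)) * 512 ** -0.5
W_down = rng.standard_normal((2048, 512)) * 2048 ** -0.5
out = mlp(x, W_up, W_down)
print(out.shape)  # (512,)
```

The nonlinearity acts elementwise on the 2048-wide hidden vector, so all interaction between features happens in the two matrix multiplies.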