The paper that changed everything: “Attention Is All You Need” (2017). No RNNs, no convolutions. Just attention. Now powers GPT, BERT, and nearly every language model.

High-level structure

Original transformer: encoder-decoder

Encoder: Process input, create representations
Decoder: Generate output, attending to encoder

BERT uses encoder only. GPT uses decoder only.

Transformer Architecture

Full architecture tour: Transformer Animation

Encoder block

Each encoder layer:

  1. Multi-head self-attention
  2. Add & normalize (residual + layer norm)
  3. Feed-forward network
  4. Add & normalize
import torch.nn as nn

# MultiHeadAttention is the module built in the attention section
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Pre-norm style: normalize before each sublayer, then add the residual.
        # (The original paper used post-norm: layer norm after the residual add.)
        x = x + self.attention(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x

Stack N of these (usually 6-96 depending on model size).
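Stacking can be sketched with `nn.ModuleList`. This version is self-contained: PyTorch's built-in `nn.MultiheadAttention` stands in for the custom attention module (an assumption for runnability), and the hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    # Same structure as above; nn.MultiheadAttention replaces the
    # custom MultiHeadAttention so the sketch runs on its own.
    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm1(x)                                   # pre-norm
        x = x + self.attention(h, h, h, need_weights=False)[0]
        x = x + self.ffn(self.norm2(x))
        return x

class Encoder(nn.Module):
    def __init__(self, num_layers, d_model, num_heads, d_ff):
        super().__init__()
        self.layers = nn.ModuleList(
            EncoderLayer(d_model, num_heads, d_ff) for _ in range(num_layers))

    def forward(self, x):
        for layer in self.layers:   # each layer refines the previous representation
            x = layer(x)
        return x

enc = Encoder(num_layers=6, d_model=64, num_heads=4, d_ff=256)
out = enc(torch.randn(2, 10, 64))   # (batch, seq_len, d_model) in, same shape out
```

Note that the shape never changes between layers; that is what makes arbitrary stacking possible.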

Decoder block

Same as encoder, plus:

  • Masked self-attention: Can only attend to previous positions
  • Cross-attention: Attend to encoder outputs
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.self_attention = MultiHeadAttention(d_model, num_heads)
        self.cross_attention = MultiHeadAttention(d_model, num_heads)
        self.ffn = nn.Sequential(          # same FFN as the encoder layer
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, encoder_output, mask):
        # Masked self-attention: the causal mask blocks future positions.
        x = x + self.self_attention(self.norm1(x), mask=mask)
        # Cross-attention: queries from the decoder, keys/values from the encoder.
        x = x + self.cross_attention(self.norm2(x), encoder_output)
        x = x + self.ffn(self.norm3(x))
        return x
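The causal mask passed into the decoder's self-attention simply marks every future position as off-limits. A minimal sketch of how such a mask is typically built:

```python
import torch

def causal_mask(seq_len):
    # True above the diagonal marks positions to block (future tokens).
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

mask = causal_mask(4)
# Inside attention this is applied as:
#   scores.masked_fill(mask, float('-inf'))
# so softmax assigns zero weight to future positions.
```

Row i of the mask is False up through column i and True after it: position i may attend to itself and everything before it, nothing after.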

Feed-forward network

Simple but crucial. Processes each position independently:

$$\text{FFN}(x) = \text{GELU}(xW_1 + b_1)W_2 + b_2$$

Typically d_ff = 4 × d_model. This is where most of the model's parameters live.
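A quick per-layer count makes this concrete (bias terms ignored; the numbers use the original paper's base configuration, d_model = 512, d_ff = 2048):

```python
d_model, d_ff = 512, 2048   # "Attention Is All You Need" base config

ffn_params = d_model * d_ff + d_ff * d_model   # W1 + W2
attn_params = 4 * d_model * d_model            # Q, K, V, output projections

# FFN: 2,097,152 weights; attention: 1,048,576 per layer
print(ffn_params, attn_params)
```

With d_ff = 4 × d_model, the two FFN matrices hold 8·d_model² weights against attention's 4·d_model², i.e. roughly two-thirds of each layer's parameters.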

Recent theory: FFN acts as key-value memory. First layer selects patterns, second retrieves information.

Attention patterns

Multi-head attention lets the model learn different relationship types in parallel:

  • Head 1: syntactic dependencies
  • Head 2: coreference
  • Head 3: positional patterns
  • etc.

Different heads specialize automatically during training.

Residual connections

Every sublayer has a skip connection:

$$\text{output} = \text{sublayer}(x) + x$$

Critical for:

  • Training deep networks (gradient flow)
  • Preserving information
  • Allowing layers to learn “refinements”
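A toy illustration of the gradient-flow point: even when a sublayer contributes nothing at all, the skip path still passes gradients through unchanged, so deep stacks remain trainable.

```python
import torch

x = torch.randn(5, requires_grad=True)
f = lambda t: torch.zeros_like(t)   # a (pathological) sublayer that outputs zero
y = (f(x) + x).sum()                # output = sublayer(x) + x
y.backward()
print(x.grad)                       # all ones: the skip path alone carries a full gradient
```

Without the `+ x`, the gradient here would be zero everywhere; with it, each layer only needs to learn a residual "refinement" on top of the identity.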

The full picture

Input Embeddings + Positional Encoding
           ↓
┌──────────────────────────┐
│     Encoder Block ×N     │
│  - Self-attention        │
│  - FFN                   │
└──────────────────────────┘
           ↓
    Encoder Output
           ↓
┌──────────────────────────┐
│     Decoder Block ×N     │
│  - Masked self-attention │
│  - Cross-attention       │
│  - FFN                   │
└──────────────────────────┘
           ↓
      Linear + Softmax
           ↓
      Output Tokens
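PyTorch's built-in `nn.Transformer` wires this whole picture together. A minimal end-to-end sketch; the vocabulary size, learned positional embeddings, and layer counts here are illustrative choices, not the paper's:

```python
import torch
import torch.nn as nn

d_model, vocab, max_len = 64, 100, 32
embed = nn.Embedding(vocab, d_model)
pos = nn.Embedding(max_len, d_model)   # learned positions (a simplification)
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       dim_feedforward=256, batch_first=True)
out_proj = nn.Linear(d_model, vocab)   # final Linear (softmax lives in the loss)

src = torch.randint(0, vocab, (2, 10))   # (batch, src_len)
tgt = torch.randint(0, vocab, (2, 7))    # (batch, tgt_len)

def add_pos(tokens):
    # Input embeddings + positional encoding, as in the diagram above.
    return embed(tokens) + pos(torch.arange(tokens.shape[1]))

causal = nn.Transformer.generate_square_subsequent_mask(tgt.shape[1])
hidden = model(add_pos(src), add_pos(tgt), tgt_mask=causal)
logits = out_proj(hidden)                # (batch, tgt_len, vocab)
```

Each output position gets a distribution over the vocabulary; during training these logits go into a cross-entropy loss against the shifted target tokens.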

Variants

Encoder-only (BERT): Classification, understanding tasks
Decoder-only (GPT): Generation, language modeling
Encoder-decoder (T5): Translation, summarization

Modern trend: decoder-only models scale best and are the most versatile.