Building Diffusion Models (2/7): Neural Network Basics

In a diffusion model, a neural network learns to denoise images. Before building the full model, it helps to understand the building blocks: convolutional layers, activations, normalization, residual connections, and down/upsampling.

Part 2 of 7 in the Diffusion Models series.

The goal

Input: noisy image tensor (3, 256, 256)
Output: predicted noise tensor (3, 256, 256)

Network learns: “What noise was added?”
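As a rough sketch of that input/output contract, the snippet below adds Gaussian noise to an image and scores a prediction against the noise with mean squared error. The single-conv `model` is a hypothetical stand-in for the real denoiser, and the fixed `noise_level` is an assumption that replaces a proper noise schedule, which this part does not cover.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-in for the denoiser: any module mapping (B, 3, H, W) -> (B, 3, H, W)
model = nn.Conv2d(3, 3, kernel_size=3, padding=1)

clean = torch.rand(1, 3, 64, 64)   # smaller than 256x256 to keep the demo fast
noise = torch.randn_like(clean)    # the noise the network should recover
noise_level = 0.5                  # assumed fixed value, not a real schedule
noisy = clean + noise_level * noise

pred = model(noisy)                # predicted noise, same shape as the input
loss = F.mse_loss(pred, noise)     # scores the answer to "what noise was added?"
```

The only hard requirement at this stage is that the prediction has the same shape as the noisy input.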

Interactive demo: DiT Animation

Convolutional layers

Process images while preserving spatial structure:

import torch
import torch.nn as nn

conv = nn.Conv2d(
    in_channels=3,
    out_channels=64,
    kernel_size=3,
    padding=1  # same spatial size
)

x = torch.rand(1, 3, 256, 256)
out = conv(x)  # (1, 64, 256, 256)

Learn local patterns: edges, textures, shapes.
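To see why `padding=1` keeps the spatial size with a 3x3 kernel, it can help to compute the output size by hand and count the parameters. The helper `conv_out_size` below is a hypothetical name introduced for illustration; it implements the standard output-size formula for dilation 1.

```python
import torch
import torch.nn as nn

def conv_out_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution (assumes dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
x = torch.rand(1, 3, 256, 256)
out = conv(x)

expected_hw = conv_out_size(256, kernel_size=3, padding=1)  # 256

# Weights: out_channels * in_channels * k * k, plus one bias per output channel
n_params = sum(p.numel() for p in conv.parameters())        # 64*3*3*3 + 64 = 1792
```

The parameter count does not depend on the image size, because the same kernel weights are reused at every spatial position.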

Activation functions

Add nonlinearity:

import torch.nn.functional as F

# ReLU - simple, effective
x = F.relu(x)

# SiLU (Swish) - used in modern diffusion
x = F.silu(x)  # x * sigmoid(x)

# GELU - smooth, used in transformers
x = F.gelu(x)

SiLU is the most common choice in U-Net-based diffusion models.
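A small comparison on a handful of values shows how the three activations differ for negative inputs. This is only an illustrative sketch; the sample values are arbitrary.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

relu = F.relu(x)   # exactly zero for all negative inputs
silu = F.silu(x)   # small negative outputs for negative inputs
gelu = F.gelu(x)   # x * Phi(x), where Phi is the standard normal CDF

# SiLU matches its definition x * sigmoid(x)
silu_manual = x * torch.sigmoid(x)

print(relu, silu, gelu, sep="\n")
```

Unlike ReLU, SiLU and GELU are smooth and let a small signal through for negative inputs, so their gradient is not exactly zero there.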

Normalization

Keep activations stable:

# Group normalization - works with any batch size
norm = nn.GroupNorm(num_groups=8, num_channels=64)
x = norm(x)

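Group norm is generally preferred over batch norm for diffusion: it computes statistics per sample rather than across the batch, so it behaves the same at batch size 1. Note that num_channels must be divisible by num_groups.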
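The sketch below checks the per-sample behaviour on a batch of one: after `GroupNorm`, each group of channels has roughly zero mean and unit standard deviation. It also shows the divisibility constraint between channels and groups. The input scale and shift are arbitrary values chosen for the demo.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
norm = nn.GroupNorm(num_groups=8, num_channels=64)

# Batch of one, with a deliberately shifted and scaled distribution
x = torch.randn(1, 64, 32, 32) * 5.0 + 3.0
y = norm(x)

# View as (batch, groups, everything in the group): 64 / 8 = 8 channels per group
grouped = y.detach().view(1, 8, -1)
group_means = grouped.mean(dim=-1)                # close to 0 for each group
group_stds = grouped.std(dim=-1, unbiased=False)  # close to 1 for each group

# The channel count must be divisible by the number of groups
try:
    bad = nn.GroupNorm(num_groups=8, num_channels=3)
    bad(torch.randn(1, 3, 8, 8))
    raised = False
except (ValueError, RuntimeError):
    raised = True
```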

Basic block

Conv + Norm + Activation:

class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.norm = nn.GroupNorm(8, out_ch)
        self.act = nn.SiLU()
    
    def forward(self, x):
        return self.act(self.norm(self.conv(x)))
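A quick usage sketch, with `ConvBlock` restated so the snippet runs on its own. It stacks two blocks and checks that only the channel count changes. The input size and batch size are arbitrary choices for the demo.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv -> GroupNorm -> SiLU, restated so this snippet is self-contained."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.norm = nn.GroupNorm(8, out_ch)  # out_ch must be a multiple of 8
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.norm(self.conv(x)))

blocks = nn.Sequential(ConvBlock(3, 64), ConvBlock(64, 128))
x = torch.rand(2, 3, 64, 64)   # small input to keep the demo fast
out = blocks(x)                # channels change, spatial size does not
```

Because the block ends in GroupNorm with 8 groups, it only works when `out_ch` is a multiple of 8.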

Residual connections

Add input to output. Enables very deep networks:

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.norm1 = nn.GroupNorm(8, channels)
        self.norm2 = nn.GroupNorm(8, channels)
    
    def forward(self, x):
        residual = x
        x = F.silu(self.norm1(self.conv1(x)))
        x = self.norm2(self.conv2(x))
        return x + residual  # skip connection

Gradients flow directly through the skip connection.
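One way to make the gradient claim concrete is to measure what the skip path contributes. The sketch below restates `ResidualBlock`, then compares the input gradient with and without the `+ x` term. The weighting tensor `w` is a hypothetical helper that just turns the output into a scalar.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Restated so this snippet is self-contained."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.norm1 = nn.GroupNorm(8, channels)
        self.norm2 = nn.GroupNorm(8, channels)

    def forward(self, x):
        residual = x
        x = F.silu(self.norm1(self.conv1(x)))
        x = self.norm2(self.conv2(x))
        return x + residual

torch.manual_seed(0)
block = ResidualBlock(16)
x = torch.randn(1, 16, 8, 8, requires_grad=True)
w = torch.randn(1, 16, 8, 8)  # arbitrary weights to form a scalar objective

# Gradient through the full block (conv branch + skip)
(grad_with_skip,) = torch.autograd.grad((block(x) * w).sum(), x)

# Same objective with the skip path subtracted back out (conv branch only)
(grad_branch_only,) = torch.autograd.grad(((block(x) - x) * w).sum(), x)

# The skip path passes the upstream gradient w through unchanged
skip_contribution = grad_with_skip - grad_branch_only
```

However small the gradient through the convolutional branch becomes, the skip path still delivers the upstream gradient to earlier layers.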

Downsampling and upsampling

Downsample: Reduce spatial size, increase channels

# Strided conv
down = nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1)
# (1, 64, 256, 256) → (1, 128, 128, 128)

Upsample: Increase spatial size, decrease channels

# Transposed conv
up = nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1)
# (1, 128, 128, 128) → (1, 64, 256, 256)

# Or resize + conv
up = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='nearest'),
    nn.Conv2d(128, 64, 3, padding=1)
)
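The shape comments above can be checked directly. This sketch runs the strided conv and both upsampling options on a smaller input (64x64 instead of 256x256, purely to keep it fast) and confirms that each upsampling path restores the original size.

```python
import torch
import torch.nn as nn

down = nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1)
up_transposed = nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1)
up_resize = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='nearest'),
    nn.Conv2d(128, 64, 3, padding=1),
)

x = torch.rand(1, 64, 64, 64)
h = down(x)                      # (1, 128, 32, 32)
y_transposed = up_transposed(h)  # (1, 64, 64, 64)
y_resize = up_resize(h)          # (1, 64, 64, 64)
```

Both options give the same shape. Resize + conv is often preferred because transposed convolutions can produce checkerboard artifacts.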

Simple encoder-decoder

class SimpleDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder (downsample)
        self.enc1 = ConvBlock(3, 64)
        self.enc2 = ConvBlock(64, 128)
        self.down = nn.MaxPool2d(2)
        
        # Decoder (upsample)
        self.up = nn.Upsample(scale_factor=2)
        self.dec1 = ConvBlock(128, 64)
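        # Final layer is a plain conv: GroupNorm(8, 3) would fail because
        # 3 channels are not divisible by 8 groups, and the predicted noise
        # should not be normalized or passed through an activation.
        self.dec2 = nn.Conv2d(64, 3, kernel_size=3, padding=1)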
    
    def forward(self, x):
        # Encode
        x = self.enc1(x)
        x = self.down(self.enc2(x))
        # Decode
        x = self.up(x)
        x = self.dec1(x)
        x = self.dec2(x)
        return x
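A minimal smoke test for the encoder-decoder, restated compactly so it runs on its own. One assumption to flag: this sketch uses a plain `nn.Conv2d` as the last layer, since a `GroupNorm` with 8 groups cannot be applied to 3 output channels. It also shows that the output only matches the input size when height and width are even.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.norm = nn.GroupNorm(8, out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.norm(self.conv(x)))

class SimpleDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = ConvBlock(3, 64)
        self.enc2 = ConvBlock(64, 128)
        self.down = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2)
        self.dec1 = ConvBlock(128, 64)
        # Assumption: plain conv output layer (no norm, no activation)
        self.dec2 = nn.Conv2d(64, 3, kernel_size=3, padding=1)

    def forward(self, x):
        x = self.enc1(x)
        x = self.down(self.enc2(x))
        x = self.up(x)
        x = self.dec1(x)
        return self.dec2(x)

model = SimpleDenoiser().eval()
with torch.no_grad():
    out_even = model(torch.rand(1, 3, 64, 64))  # (1, 3, 64, 64)
    out_odd = model(torch.rand(1, 3, 65, 65))   # (1, 3, 64, 64): pooling floors 65 -> 32
```

For noise prediction the output must match the input shape, so inputs to this network should have even height and width.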

What’s missing

This simple network works but is limited. Real diffusion models need:

Timestep conditioning (how noisy is input?)
Skip connections between encoder and decoder (U-Net)
Attention layers
More depth

These come in Part 4 (U-Net architecture).
