In a diffusion model, a neural network learns to denoise images. Before building the full model, understand the building blocks: convolutional layers, activations, normalization, residual connections.
Part 2 of 7 in the Diffusion Models series.
The goal
Input: noisy image tensor (3, 256, 256)
Output: predicted noise tensor (3, 256, 256)
Network learns: “What noise was added?”
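To make the input/output contract concrete, here is a minimal sketch of how one training pair could be built. It assumes a simplified corruption, noisy = clean + sigma * noise, with a hypothetical fixed `sigma`; real diffusion models use a timestep-dependent noise schedule instead. The one-layer `model` is only a stand-in for the denoiser, and `clean` is random placeholder data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for the denoiser: any module mapping (N, 3, H, W) -> (N, 3, H, W).
model = nn.Conv2d(3, 3, kernel_size=3, padding=1)

clean = torch.rand(2, 3, 256, 256)    # placeholder batch of "clean" images
noise = torch.randn_like(clean)       # Gaussian noise, same shape as the images
sigma = 0.5                           # hypothetical fixed noise level (assumption)
noisy = clean + sigma * noise         # network input

pred_noise = model(noisy)             # network output: a guess at `noise`
loss = F.mse_loss(pred_noise, noise)  # distance between the guess and the true noise
```

The target is the noise itself, not the clean image, which is why input and output share the same shape.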
See network components: Neural Network Animation
Convolutional layers
Process images while preserving spatial structure:
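import torch
import torch.nn as nn

conv = nn.Conv2d(
    in_channels=3,
    out_channels=64,
    kernel_size=3,
    padding=1,  # keeps the spatial size unchanged for a 3x3 kernel
)
x = torch.rand(1, 3, 256, 256)  # (batch, channels, height, width)
out = conv(x)  # (1, 64, 256, 256)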
Learn local patterns: edges, textures, shapes.
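The "same spatial size" comment follows from the standard output-size rule for a convolution, floor((size + 2 * padding - kernel_size) / stride) + 1. The sketch below checks it against the layer above; `conv_out_size` is a hypothetical helper written for this illustration (dilation 1 assumed), not a PyTorch function.

```python
import torch
import torch.nn as nn

def conv_out_size(size, kernel_size, stride=1, padding=0):
    """Output length along one spatial axis (dilation 1 assumed)."""
    return (size + 2 * padding - kernel_size) // stride + 1

same = conv_out_size(256, kernel_size=3, padding=1)    # padding=1 keeps 256
shrunk = conv_out_size(256, kernel_size=3, padding=0)  # no padding loses 2 pixels

conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
out = conv(torch.rand(1, 3, 256, 256))

# 3 * 64 * 3 * 3 weights plus 64 biases
n_params = sum(p.numel() for p in conv.parameters())
```

The parameter count does not depend on image size: the same 3x3 kernels slide over every position.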
Activation functions
Add nonlinearity:
import torch.nn.functional as F

# ReLU - simple, effective
x = F.relu(x)
# SiLU (Swish) - used in modern diffusion
x = F.silu(x)  # x * sigmoid(x)
# GELU - smooth, used in transformers
x = F.gelu(x)
SiLU is the usual choice in diffusion U-Nets.
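A small side-by-side comparison can make the differences visible. This sketch evaluates the three activations on a handful of sample points (the range -3 to 3 is an arbitrary choice) and checks SiLU against its definition.

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3.0, 3.0, steps=7)  # -3, -2, ..., 3

relu = F.relu(x)
silu = F.silu(x)
gelu = F.gelu(x)

# SiLU matches its definition x * sigmoid(x).
silu_manual = x * torch.sigmoid(x)

# Unlike ReLU, SiLU and GELU let small negative values through.
for name, y in [("relu", relu), ("silu", silu), ("gelu", gelu)]:
    print(f"{name}: {[round(v, 3) for v in y.tolist()]}")
```

ReLU clamps every negative input to zero, while SiLU and GELU dip slightly below zero and stay smooth around the origin.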
Normalization
Keep activations stable:
# Group normalization - works with any batch size
norm = nn.GroupNorm(num_groups=8, num_channels=64)
x = norm(x)
Group norm is preferred over batch norm for diffusion: its statistics are computed per sample, so it behaves the same at any batch size, including 1.
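The batch-size claim can be checked directly. This sketch normalizes one sample on its own and again as part of a larger batch, then inspects the per-group statistics. The tensor sizes and the scale and shift applied to the random input are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
norm = nn.GroupNorm(num_groups=8, num_channels=64)

batch = torch.randn(4, 64, 32, 32) * 5 + 2  # deliberately off-center activations
alone = norm(batch[:1])      # first sample processed by itself
together = norm(batch)[:1]   # same sample processed inside a batch of 4

# Each group (64 / 8 = 8 consecutive channels) ends up with roughly
# zero mean and unit standard deviation.
groups = alone.reshape(1, 8, -1)  # (sample, group, values in that group)
group_mean = groups.mean(dim=-1)
group_std = groups.std(dim=-1, unbiased=False)
```

Because the statistics never mix samples, the result for a given image does not depend on what else is in the batch.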
Basic block
Conv + Norm + Activation:
class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.norm = nn.GroupNorm(8, out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.norm(self.conv(x)))
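A quick usage check, with the block definition repeated so the snippet runs on its own. It also shows one constraint worth keeping in mind: with 8 groups hard-coded, the output channel count has to be a multiple of 8. Depending on the PyTorch version the error may surface at construction or at the first forward pass, so the sketch catches both.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.norm = nn.GroupNorm(8, out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.norm(self.conv(x)))

block = ConvBlock(3, 64)
out = block(torch.rand(2, 3, 64, 64))  # channels change, height and width do not

# GroupNorm(8, out_ch) needs out_ch to be a multiple of 8, so 3 is rejected.
try:
    ConvBlock(64, 3)(torch.rand(1, 64, 16, 16))
    three_channels_ok = True
except (ValueError, RuntimeError):
    three_channels_ok = False
```

This means a block like this cannot be used as-is for a final layer that maps back to 3 image channels.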
Residual connections
Add input to output. Enables very deep networks:
class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.norm1 = nn.GroupNorm(8, channels)
        self.norm2 = nn.GroupNorm(8, channels)

    def forward(self, x):
        residual = x
        x = F.silu(self.norm1(self.conv1(x)))
        x = self.norm2(self.conv2(x))
        return x + residual  # skip connection
Gradients flow directly through the skip connection.
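One way to see the direct gradient path is to silence the learned branch and look at what remains. In this sketch the scale of the last norm is set to zero, which is an assumption made purely for the demonstration. The block then reduces to the identity, and the gradient reaching the input is exactly the gradient arriving at the output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.norm1 = nn.GroupNorm(8, channels)
        self.norm2 = nn.GroupNorm(8, channels)

    def forward(self, x):
        residual = x
        x = F.silu(self.norm1(self.conv1(x)))
        x = self.norm2(self.conv2(x))
        return x + residual  # skip connection

torch.manual_seed(0)
block = ResidualBlock(64)
# Demo-only assumption: zero the last norm's scale so the learned branch outputs 0.
nn.init.zeros_(block.norm2.weight)

x = torch.randn(1, 64, 32, 32, requires_grad=True)
out = block(x)
out.sum().backward()  # upstream gradient is 1 for every output element
```

With the branch silenced, `out` equals `x` and `x.grad` is all ones. In a trained block the branch adds its own contribution on top of this path rather than replacing it.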
Downsampling and upsampling
Downsample: Reduce spatial size, increase channels
# Strided conv
down = nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1)
# (1, 64, 256, 256) → (1, 128, 128, 128)
Upsample: Increase spatial size, decrease channels
# Transposed conv
up = nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1)
# (1, 128, 128, 128) → (1, 64, 256, 256)
# Or resize + conv
up = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='nearest'),
    nn.Conv2d(128, 64, 3, padding=1)
)
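The shapes in the comments above can be verified by running the three layers together. For reference, the strided conv follows floor((H + 2p - k) / s) + 1 and the transposed conv follows (H - 1) * s - 2p + k (no output padding, dilation 1). The variable names `up_transposed` and `up_resize` are chosen here only to keep the two upsampling options apart.

```python
import torch
import torch.nn as nn

down = nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1)
up_transposed = nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1)
up_resize = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="nearest"),
    nn.Conv2d(128, 64, kernel_size=3, padding=1),
)

x = torch.rand(1, 64, 256, 256)
with torch.no_grad():
    h = down(x)           # (256 + 2 - 4) // 2 + 1 = 128
    a = up_transposed(h)  # (128 - 1) * 2 - 2 + 4 = 256
    b = up_resize(h)      # 128 * 2 = 256, then a size-preserving conv
```

Both upsampling options restore the original shape; they differ in how the new pixels are produced, not in the output size.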
Simple encoder-decoder
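class SimpleDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder (downsample)
        self.enc1 = ConvBlock(3, 64)
        self.enc2 = ConvBlock(64, 128)
        self.down = nn.MaxPool2d(2)
        # Decoder (upsample)
        self.up = nn.Upsample(scale_factor=2)
        self.dec1 = ConvBlock(128, 64)
        # Plain conv for the output: ConvBlock(64, 3) would fail because
        # GroupNorm(8, 3) cannot split 3 channels into 8 groups, and the
        # predicted noise should not be normalized or passed through SiLU.
        self.dec2 = nn.Conv2d(64, 3, kernel_size=3, padding=1)

    def forward(self, x):
        # Encode
        x = self.enc1(x)             # (N, 64, H, W)
        x = self.down(self.enc2(x))  # (N, 128, H/2, W/2)
        # Decode
        x = self.up(x)               # (N, 128, H, W)
        x = self.dec1(x)             # (N, 64, H, W)
        x = self.dec2(x)             # (N, 3, H, W)
        return x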
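A pool-then-upsample design only returns the input shape when height and width are even. The sketch below checks this with a hypothetical helper, `output_matches_input`, and a small stand-in network that follows the same pattern (it is not the full denoiser, just enough layers to show the effect).

```python
import torch
import torch.nn as nn

def output_matches_input(model, height, width):
    """Return True if `model` maps (1, 3, height, width) to the same shape."""
    x = torch.rand(1, 3, height, width)
    with torch.no_grad():
        y = model(x)
    return y.shape == x.shape

# Stand-in with the same pool-then-upsample pattern (assumption: 8 hidden channels).
stand_in = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.MaxPool2d(2),
    nn.Upsample(scale_factor=2),
    nn.Conv2d(8, 3, kernel_size=3, padding=1),
)

even_ok = output_matches_input(stand_in, 64, 64)  # 64 -> 32 -> 64
odd_ok = output_matches_input(stand_in, 65, 65)   # 65 -> 32 -> 64, no longer 65
```

Any module with the same (N, 3, H, W) interface can be passed in place of `stand_in`. With one 2x pooling step, sizes such as 256 are safe; deeper networks need sizes divisible by 2 for every pooling level.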
What’s missing
This simple network works but is limited. Real diffusion models need:
- Timestep conditioning (how noisy is the input?)
- Skip connections between encoder and decoder (U-Net)
- Attention layers
- More depth
These come in Part 4 (U-Net architecture).