Building Diffusion Models (1/7): Understanding Tensors

Before diving into diffusion models, master the foundation: tensors. Every image is a tensor. Every operation is tensor manipulation.

Part 1 of 7 in the Diffusion Models series.

What is a tensor?

A tensor is a multi-dimensional array. It generalizes scalars, vectors, and matrices to any number of dimensions.

Scalar: 0D tensor (single number)
Vector: 1D tensor (list of numbers)
Matrix: 2D tensor (table of numbers)
3D+ tensor: higher dimensions

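Images are 3D tensors, laid out as either (channels, height, width) or (height, width, channels). PyTorch layers expect the channels-first layout, so the examples here use (channels, height, width).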
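As a quick check of the dimension counts above, here is a minimal sketch that inspects `ndim` for each kind of tensor and converts an image between the two layouts. The variable names (`image_hwc`, `image_chw`) are illustrative, not from any library.

```python
import torch

scalar = torch.tensor(3.14)             # 0D: a single number
vector = torch.tensor([1.0, 2.0, 3.0])  # 1D: a list of numbers
matrix = torch.zeros(2, 3)              # 2D: a table of numbers
print(scalar.ndim, vector.ndim, matrix.ndim)

# Assumed input: an image stored as (height, width, channels),
# the layout many image-loading libraries produce.
image_hwc = torch.rand(256, 256, 3)

# Reorder the axes to (channels, height, width).
image_chw = image_hwc.permute(2, 0, 1)
print(image_chw.shape)
```

The permute call only reorders axes; the pixel values themselves are unchanged, so `image_chw[0]` is the same red channel as `image_hwc[..., 0]`.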


Image tensors

RGB image 256×256:

import torch

# (channels, height, width)
image = torch.rand(3, 256, 256)
# 3 channels (R, G, B)
# 256 pixels high
# 256 pixels wide
# Total: 3 * 256 * 256 = 196,608 values

Batch of images:

# (batch, channels, height, width)
batch = torch.rand(32, 3, 256, 256)
# 32 images
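In practice a batch is usually assembled from individual images rather than sampled in one call. A minimal sketch, assuming a small list of same-sized image tensors, uses `torch.stack` to add the batch dimension:

```python
import torch

# Assumed input: four separate (channels, height, width) images.
images = [torch.rand(3, 256, 256) for _ in range(4)]

# Stack along a new leading dimension: (batch, channels, height, width).
batch = torch.stack(images, dim=0)
print(batch.shape)

# Indexing the batch dimension recovers a single image.
first = batch[0]
print(first.shape)
```

All images must share the same shape for `torch.stack` to work, which is one reason datasets are typically resized or cropped to a fixed resolution.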

Basic operations

Element-wise:

a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])

a + b  # [5, 7, 9]
a * b  # [4, 10, 18]
a ** 2 # [1, 4, 9]

Broadcasting:

image = torch.rand(3, 256, 256)
brightness = torch.tensor([1.0, 0.8, 1.2])[:, None, None]
# brightness has shape (3, 1, 1), broadcasts to (3, 256, 256)
adjusted = image * brightness
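Broadcasting aligns shapes from the trailing dimension, which is why the `[:, None, None]` above matters. The sketch below shows the failure without it, and a second common use: subtracting a per-channel mean computed with `keepdim=True`. The variable names are illustrative.

```python
import torch

image = torch.rand(3, 256, 256)
gains = torch.tensor([1.0, 0.8, 1.2])  # shape (3,)

# Shape (3,) lines up with the last axis (width 256), so this fails.
try:
    image * gains
    broadcast_failed = False
except RuntimeError:
    broadcast_failed = True
print(broadcast_failed)

# keepdim=True leaves the mean with shape (3, 1, 1), ready to broadcast.
channel_mean = image.mean(dim=(1, 2), keepdim=True)
centered = image - channel_mean
print(channel_mean.shape, centered.shape)
```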

Reshaping:

x = torch.rand(2, 3, 4)
x.view(6, 4)        # reshape to (6, 4); shares memory with x
x.permute(2, 0, 1)  # reorder dimensions: (4, 2, 3)
x.flatten()         # 1D: (24,)
# None of these modify x; each returns a new tensor.
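One pitfall worth knowing when combining these calls: `permute` returns a non-contiguous tensor, and `view` refuses to reshape it in cases like the one below. A minimal sketch of the problem and two ways around it:

```python
import torch

x = torch.rand(2, 3, 4)
p = x.permute(2, 0, 1)  # shape (4, 2, 3), same memory, new strides
print(p.is_contiguous())

# view needs a compatible memory layout, so this raises RuntimeError.
try:
    p.view(24)
    view_failed = False
except RuntimeError:
    view_failed = True
print(view_failed)

# Option 1: reshape copies the data when it has to.
flat = p.reshape(24)

# Option 2: make the layout contiguous first, then view.
flat2 = p.contiguous().view(24)
```

Using `reshape` by default is the safer habit; `view` is useful when you want a guarantee that no copy is made.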

Indexing

image = torch.rand(3, 256, 256)

red_channel = image[0]       # (256, 256)
top_half = image[:, :128, :] # (3, 128, 256)
center = image[:, 64:192, 64:192]  # (3, 128, 128)
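Basic slices like the ones above are views, not copies: they share memory with the original tensor. The sketch below uses a small all-zeros image so the effect is easy to see, and shows `clone()` as the way to get an independent copy.

```python
import torch

image = torch.zeros(3, 4, 4)

# red is a view: writing to it writes into image.
red = image[0]
red[0, 0] = 1.0
print(image[0, 0, 0])

# clone() makes an independent copy, so image is not affected here.
top_copy = image[:, :2, :].clone()
top_copy[:] = 5.0
print(image.max())
```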

Key operations for diffusion

Random noise:

noise = torch.randn(3, 256, 256)  # standard normal
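For reproducible experiments it helps to control the random seed. A minimal sketch, assuming a dedicated `torch.Generator` per run, draws the same noise twice and checks that its statistics are close to a standard normal:

```python
import torch

# Two generators with the same seed produce identical samples.
gen = torch.Generator().manual_seed(0)
noise = torch.randn(3, 256, 256, generator=gen)

gen_again = torch.Generator().manual_seed(0)
noise_again = torch.randn(3, 256, 256, generator=gen_again)
print(torch.equal(noise, noise_again))

# With ~200k samples, mean and std land close to 0 and 1.
print(noise.mean().item(), noise.std().item())
```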

Interpolation:

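# Linear interpolation between two tensors
tensor_a = torch.rand(3, 256, 256)
tensor_b = torch.rand(3, 256, 256)
t = 0.3  # weight on tensor_a; tensor_b gets 1 - t
result = t * tensor_a + (1 - t) * tensor_b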
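PyTorch also ships a built-in for this. The sketch below checks that `torch.lerp` matches the manual formula. Note the argument order: `torch.lerp(start, end, weight)` computes `start + weight * (end - start)`, so with `t` weighting `tensor_a`, the start is `tensor_b`. The small constant tensors are chosen only to make the result easy to read.

```python
import torch

tensor_a = torch.zeros(3)
tensor_b = torch.ones(3)
t = 0.3

# Manual form: 30% of tensor_a, 70% of tensor_b.
manual = t * tensor_a + (1 - t) * tensor_b

# Built-in form: start at tensor_b and move a fraction t toward tensor_a.
builtin = torch.lerp(tensor_b, tensor_a, t)

print(manual, builtin)
```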

Normalization:

# Scale from [0, 1] to [-1, 1]
image_normalized = (image - 0.5) / 0.5

# Scale back from [-1, 1] to [0, 1]
image_01 = (image_normalized + 1) / 2
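The two scalings are inverses of each other, which is easy to verify with a round trip. In this sketch the helper names `to_model_range` and `to_display_range` are hypothetical, and the `clamp` is an added safeguard for values that drift slightly outside the range.

```python
import torch

def to_model_range(x: torch.Tensor) -> torch.Tensor:
    """Map values from [0, 1] to [-1, 1]; equivalent to (x - 0.5) / 0.5."""
    return x * 2 - 1

def to_display_range(x: torch.Tensor) -> torch.Tensor:
    """Map values from [-1, 1] back to [0, 1], clamping any overshoot."""
    return ((x + 1) / 2).clamp(0, 1)

image = torch.rand(3, 256, 256)
normalized = to_model_range(image)
restored = to_display_range(normalized)

print(normalized.min().item(), normalized.max().item())
print(torch.allclose(image, restored, atol=1e-6))
```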

Device management

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move an existing tensor to the selected device (GPU if available)
tensor = torch.rand(3, 256, 256)
tensor_gpu = tensor.to(device)

# Create directly on device
noise = torch.randn(3, 256, 256, device=device)
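Tensors must live on the same device before they can be combined. A minimal sketch that runs on either CPU or GPU, moving one operand over explicitly and bringing the result back to the CPU afterwards:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.rand(3, 8, 8)                  # created on the CPU
b = torch.randn(3, 8, 8, device=device)  # created on the target device

# Move a to b's device before adding; mixing devices raises an error.
total = a.to(device) + b
print(total.device)

# Bring the result back to the CPU, e.g. before converting to NumPy.
back = total.cpu()
print(back.device)
```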

Gradients

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()
y.backward()   # computes dy/dx = 2x
print(x.grad)  # tensor([2., 4., 6.])

For diffusion, we'll compute gradients of a loss, the error between predicted and actual noise, with respect to the model parameters.
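To make that concrete, here is a rough sketch of one backward pass. It is not the training setup used later in the series: a single `Conv2d` layer stands in for the denoiser, the inputs are random, and mean squared error is assumed as the loss.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in for a denoising network.
model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)

noisy_images = torch.randn(2, 3, 16, 16)  # placeholder batch
true_noise = torch.randn(2, 3, 16, 16)    # noise the model should predict

predicted_noise = model(noisy_images)
loss = torch.nn.functional.mse_loss(predicted_noise, true_noise)

# backward() fills .grad on every parameter that requires gradients.
loss.backward()
for name, param in model.named_parameters():
    print(name, param.grad.shape)
```

Each parameter's `.grad` has the same shape as the parameter itself, which is what an optimizer uses to update the weights.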

Coming up

This tensor foundation enables everything in diffusion:

Part 2: Neural networks that process image tensors
Part 3: Adding and removing noise
Part 4: U-Net architecture
Part 5: Training the denoiser
Part 6: Sampling (generating images)
Part 7: Conditioning (text-to-image)
