Before diving into diffusion models, master the foundation: tensors. Every image is a tensor. Every operation is tensor manipulation.
Part 1 of 7 in the Diffusion Models series.
What is a tensor?
A tensor is a multi-dimensional array. It generalizes scalars, vectors, and matrices to any number of dimensions.
- Scalar: 0D tensor (single number)
- Vector: 1D tensor (list of numbers)
- Matrix: 2D tensor (table of numbers)
- 3D+ tensor: higher dimensions
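Images are 3D tensors, laid out as either (channels, height, width) or (height, width, channels). PyTorch uses the channels-first layout by default, and so does this series.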
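A quick way to check the ranks listed above is to print `ndim` and `shape`. This is a minimal sketch; the variable names and the small sizes are only illustrative.

```python
import torch

scalar = torch.tensor(3.14)              # 0D: a single number
vector = torch.tensor([1.0, 2.0, 3.0])   # 1D: a list of numbers
matrix = torch.zeros(2, 3)               # 2D: a table with 2 rows, 3 columns
image_chw = torch.rand(3, 4, 5)          # 3D: (channels, height, width)

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("image_chw", image_chw)]:
    print(name, t.ndim, tuple(t.shape))

# Convert channels-first (C, H, W) to channels-last (H, W, C)
image_hwc = image_chw.permute(1, 2, 0)
print(tuple(image_hwc.shape))  # (4, 5, 3)
```

Plotting libraries usually want the channels-last layout, so the `permute` call at the end is the conversion you will likely reach for when displaying a PyTorch image.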
Image tensors
RGB image 256×256:
import torch
# (channels, height, width)
image = torch.rand(3, 256, 256)
# 3 channels (R, G, B)
# 256 pixels high
# 256 pixels wide
# Total: 3 * 256 * 256 = 196,608 values
Batch of images:
# (batch, channels, height, width)
batch = torch.rand(32, 3, 256, 256)
# 32 images
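In practice a batch is usually built from individual images rather than sampled in one call. A small sketch, assuming eight made-up 64×64 images (the names `images` and `small_batch` are hypothetical):

```python
import torch

# Eight hypothetical images, each (channels, height, width)
images = [torch.rand(3, 64, 64) for _ in range(8)]

# Stack along a new leading dimension: (batch, channels, height, width)
small_batch = torch.stack(images, dim=0)
print(tuple(small_batch.shape))   # (8, 3, 64, 64)

# Indexing the batch dimension recovers a single image
first = small_batch[0]
print(tuple(first.shape))         # (3, 64, 64)

# A single image can be given a batch dimension of size 1
single = images[0].unsqueeze(0)
print(tuple(single.shape))        # (1, 3, 64, 64)
```

Most PyTorch layers expect the batch dimension to be present, so `unsqueeze(0)` is the usual fix when feeding one image to a model.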
Basic operations
Element-wise:
a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])
a + b # [5, 7, 9]
a * b # [4, 10, 18]
a ** 2 # [1, 4, 9]
Broadcasting:
image = torch.rand(3, 256, 256)
brightness = torch.tensor([1.0, 0.8, 1.2])[:, None, None]
# brightness has shape (3, 1, 1), broadcasts to (3, 256, 256)
adjusted = image * brightness
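Why the `[:, None, None]` matters: broadcasting aligns shapes from the last dimension backwards. The sketch below shows the failure without it and checks the broadcast shape with it. The flag `mismatch` and the name `per_channel` are illustrative only.

```python
import torch

image = torch.rand(3, 256, 256)
brightness = torch.tensor([1.0, 0.8, 1.2])

# Without trailing singleton dimensions the shapes do not line up:
# (3,) is matched against the last dimension of (3, 256, 256), which is 256.
try:
    image * brightness
    mismatch = False
except RuntimeError:
    mismatch = True
print(mismatch)

# Two singleton dimensions give (3, 1, 1), which broadcasts per channel
per_channel = brightness[:, None, None]
print(torch.broadcast_shapes(image.shape, per_channel.shape))

adjusted = image * per_channel
# Channel 1 is scaled by 0.8 everywhere
print(torch.allclose(adjusted[1], image[1] * 0.8))
```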
Reshaping:
x = torch.rand(2, 3, 4)
x.view(6, 4) # reshape to (6, 4)
x.permute(2, 0, 1) # reorder dimensions: (2, 3, 4) -> (4, 2, 3)
x.flatten() # 1D: (24,)
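One catch worth knowing: `view` only works when the memory layout allows it, and `permute` changes that layout. A minimal sketch of the failure and two ways around it (the flag `view_failed` is just for illustration):

```python
import torch

x = torch.arange(24).reshape(2, 3, 4)

# permute returns a view with reordered strides; no data is copied
p = x.permute(2, 0, 1)
print(tuple(p.shape), p.is_contiguous())

# view needs a compatible memory layout, so it fails here
try:
    p.view(24)
    view_failed = False
except RuntimeError:
    view_failed = True

# reshape copies when it has to; contiguous() followed by view is the explicit version
flat_a = p.reshape(24)
flat_b = p.contiguous().view(24)
print(view_failed, torch.equal(flat_a, flat_b))

# -1 asks PyTorch to infer one dimension from the total element count
print(tuple(x.view(-1, 4).shape))   # (6, 4)
```

If in doubt, `reshape` is the safer default; `view` is useful when you want a guarantee that no copy is made.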
Indexing
image = torch.rand(3, 256, 256)
red_channel = image[0] # (256, 256)
top_half = image[:, :128, :] # (3, 128, 256)
center = image[:, 64:192, 64:192] # (3, 128, 128)
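Slices like the ones above are views, not copies: they share memory with the original tensor. The sketch below shows the consequence and how `clone()` avoids it.

```python
import torch

image = torch.rand(3, 256, 256)

# Basic slicing returns a view that shares storage with the original
center = image[:, 64:192, 64:192]
center.zero_()   # in-place edit through the view

# The original image is now zero in the center region
print(image[:, 64:192, 64:192].abs().sum().item())   # 0.0

# clone() makes an independent copy, so edits do not propagate back
top_half = image[:, :128, :].clone()
top_half.fill_(1.0)
print(image[0, 0, 0].item() < 1.0)   # True: torch.rand samples from [0, 1)
```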
Key operations for diffusion
Random noise:
noise = torch.randn(3, 256, 256) # standard normal
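Two things tend to come up with noise: reproducibility and matching an existing tensor's shape. A small sketch, assuming a seed of 0 chosen arbitrarily:

```python
import torch

# A dedicated generator keeps the seed local instead of changing global state
gen = torch.Generator().manual_seed(0)
noise_a = torch.randn(3, 256, 256, generator=gen)

gen.manual_seed(0)   # reset to the same seed
noise_b = torch.randn(3, 256, 256, generator=gen)
print(torch.equal(noise_a, noise_b))   # True

# Sample statistics should be close to mean 0 and standard deviation 1
print(noise_a.mean().item(), noise_a.std().item())

# randn_like matches the shape, dtype, and device of an existing tensor
image = torch.rand(3, 256, 256)
noise_for_image = torch.randn_like(image)
print(noise_for_image.shape == image.shape)
```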
Interpolation:
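# Linear interpolation between two tensors of the same shape
tensor_a = torch.rand(3, 256, 256)
tensor_b = torch.rand(3, 256, 256)
t = 0.3 # t = 1 gives tensor_a, t = 0 gives tensor_b
result = t * tensor_a + (1 - t) * tensor_b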
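PyTorch also ships a built-in for this. `torch.lerp(start, end, weight)` computes `start + weight * (end - start)`, so the argument order matters. The sketch below checks it against the manual formula on small made-up tensors.

```python
import torch

tensor_a = torch.rand(3, 8, 8)
tensor_b = torch.rand(3, 8, 8)
t = 0.3

manual = t * tensor_a + (1 - t) * tensor_b

# Starting from tensor_b and moving toward tensor_a matches the manual formula
builtin = torch.lerp(tensor_b, tensor_a, t)
print(torch.allclose(manual, builtin, atol=1e-6))

# Endpoints: t = 0 returns tensor_b, t = 1 returns tensor_a
print(torch.allclose(torch.lerp(tensor_b, tensor_a, 0.0), tensor_b))
print(torch.allclose(torch.lerp(tensor_b, tensor_a, 1.0), tensor_a))
```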
Normalization:
# Scale from [0, 1] to [-1, 1]
image_normalized = (image - 0.5) / 0.5
# Scale back from [-1, 1] to [0, 1]
image_01 = (image_normalized + 1) / 2
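Real images usually arrive as 8-bit integers, so there is one more conversion on each side. A sketch of the full round trip, using a random stand-in for a loaded image (`image_u8` is hypothetical):

```python
import torch

# Hypothetical 8-bit image as it might come from a file: values 0..255
image_u8 = torch.randint(0, 256, (3, 64, 64), dtype=torch.uint8)

# uint8 [0, 255] -> float [0, 1] -> float [-1, 1]
image = image_u8.float() / 255.0
image_normalized = (image - 0.5) / 0.5

# Back again: [-1, 1] -> [0, 1] -> uint8, clamping in case values drifted out of range
image_01 = ((image_normalized + 1) / 2).clamp(0, 1)
restored_u8 = (image_01 * 255).round().to(torch.uint8)

print(image_normalized.min().item() >= -1.0, image_normalized.max().item() <= 1.0)
print(torch.equal(restored_u8, image_u8))
```

The `clamp` is a no-op here, but model outputs often overshoot the [-1, 1] range slightly, and casting out-of-range values to `uint8` would wrap around.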
Device management
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Move an existing tensor to the selected device (GPU if available)
tensor = torch.rand(3, 256, 256)
tensor_gpu = tensor.to(device)
# Create directly on device
noise = torch.randn(3, 256, 256, device=device)
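A common source of errors is mixing tensors that live on different devices. A minimal sketch of keeping operands together and bringing results back; it runs on CPU-only machines too, where the moves are no-ops.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

image = torch.rand(3, 256, 256)                   # created on the CPU
noise = torch.randn(3, 256, 256, device=device)   # created on the target device

# Operands must be on the same device, so move the image before combining
noisy = image.to(device) + noise
print(noisy.device.type == device.type)

# Bring results back to the CPU before saving or plotting
noisy_cpu = noisy.detach().cpu()
print(noisy_cpu.device.type)   # cpu
```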
Gradients
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()
y.backward()
print(x.grad) # tensor([2., 4., 6.]), since dy/dx = 2x
For diffusion, we’ll compute gradients of a loss on the model's noise predictions with respect to the model parameters.
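To make that concrete, here is a toy sketch. The single convolution is only a stand-in for a denoiser, and the noising step is deliberately simplified; neither is the method developed later in the series.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for a denoiser: one 3x3 convolution (not a real diffusion model)
model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)

images = torch.rand(4, 3, 32, 32) * 2 - 1   # hypothetical batch scaled to [-1, 1]
noise = torch.randn_like(images)
noisy = images + noise                      # simplified noising, for illustration only

# Compare the predicted noise with the noise that was actually added
predicted_noise = model(noisy)
loss = torch.nn.functional.mse_loss(predicted_noise, noise)
loss.backward()

# Each parameter now holds a gradient of the loss with the same shape as itself
for name, param in model.named_parameters():
    print(name, tuple(param.shape), tuple(param.grad.shape))
```

An optimizer would use these `.grad` values to update the weights.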
Coming up
This tensor foundation enables everything in diffusion:
- Part 2: Neural networks that process image tensors
- Part 3: Adding and removing noise
- Part 4: U-Net architecture
- Part 5: Training the denoiser
- Part 6: Sampling (generating images)
- Part 7: Conditioning (text-to-image)