Autoencoders compress data to a bottleneck, then reconstruct. VAEs make the bottleneck probabilistic - encode to a distribution, sample from it, decode. This enables generation of new data.

Regular autoencoder

Encoder: x → z (deterministic) Decoder: z → x̂ (deterministic)

Problem: latent space is sparse, unstructured. Can’t sample meaningful z.

VAE idea

Encoder: x → (μ, σ) (parameters of distribution) Sample: z ~ N(μ, σ²) Decoder: z → x̂

Latent space becomes continuous. Similar inputs → similar latent distributions.

VAE Architecture

See encoding and sampling: VAE Animation

The loss function

$$\mathcal{L} = \mathbb{E}{z \sim q(z|x)}[\log p(x|z)] - D{KL}(q(z|x) || p(z))$$

Two parts:

  1. Reconstruction: Decoded output matches input
  2. KL divergence: Latent distribution close to prior (standard normal)
def vae_loss(x, x_recon, mu, logvar):
    # Reconstruction loss
    recon_loss = F.binary_cross_entropy(x_recon, x, reduction='sum')
    
    # KL divergence: -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    
    return recon_loss + kl_loss

Reparameterization trick

Problem: can’t backprop through sampling.

Solution: separate randomness from parameters.

$$z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim N(0, 1)$$

Now gradient flows through μ and σ.

def reparameterize(mu, logvar):
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + eps * std

Architecture

class VAE(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        # Encoder
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU()
        )
        self.fc_mu = nn.Linear(256, latent_dim)
        self.fc_var = nn.Linear(256, latent_dim)
        
        # Decoder
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 512),
            nn.ReLU(),
            nn.Linear(512, input_dim),
            nn.Sigmoid()
        )
    
    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_var(h)
    
    def decode(self, z):
        return self.decoder(z)
    
    def forward(self, x):
        mu, logvar = self.encode(x)
        z = reparameterize(mu, logvar)
        return self.decode(z), mu, logvar

Generating new samples

# Sample from prior
z = torch.randn(num_samples, latent_dim)
generated = model.decode(z)

Interpolation

VAE latent space is smooth. Interpolate between two images:

z1 = model.encode(img1)[0]  # get mu
z2 = model.encode(img2)[0]

for alpha in np.linspace(0, 1, 10):
    z_interp = alpha * z1 + (1 - alpha) * z2
    img_interp = model.decode(z_interp)

β-VAE

Weight the KL term:

$$\mathcal{L} = \text{recon} + \beta \cdot \text{KL}$$

Higher β → more disentangled latent dimensions (each captures different factor).

Limitations

  • Blurry outputs (mean of possibilities)
  • KL collapse (posterior = prior, ignores input)
  • Harder to train than GANs

Improvements: VQ-VAE (discrete latent), hierarchical VAE, diffusion models…