# Variational Autoencoder --- ## Mathematical Foundations
**Calculus & Linear Algebra:** basis for optimization algorithms and machine learning model operations

- 1676: Chain Rule (Leibniz, G. W.)
- 1805: Least Squares (Legendre, A. M.)
- 1809: Normal Equations (Gauss, C. F.)
- 1847: Gradient Descent (Cauchy, A. L.)
- 1858: Eigenvalue Theory (Cayley & Hamilton)
- 1901: PCA (Pearson, K.)
- 1951: Stochastic Gradient Descent (Robbins & Monro)
**Probability & Statistics:** basis for Bayesian methods, statistical inference, and generative models

- 1763: Bayes' Theorem (Bayes, T.)
- 1812: Bayesian Probability (Laplace, P. S.)
- 1815: Gaussian Distribution (Gauss, C. F.)
- 1830: Central Limit Theorem (various)
- 1922: Maximum Likelihood (Fisher, R.)
**Information & Computation:** foundations of algorithmic thinking and information theory

- 1843: First Computer Algorithm (Lovelace, A.)
- 1936: Turing Machine (Turing, A.)
- 1947: Linear Programming (Dantzig, G.)
- 1948: Information Theory (Shannon, C.)
--- ## Early History of Neural Networks
**Architectures & Layers:** evolution of network architectures and layer innovations

- 1943: Artificial Neurons (McCulloch & Pitts)
- 1957: Perceptron (Rosenblatt, F.)
- 1965: Deep Networks (Ivakhnenko & Lapa)
- 1979: Convolutional Networks (Fukushima, K.)
- 1982: Recurrent Networks (Hopfield, J.)
- 1997: LSTM (Hochreiter & Schmidhuber)
- 2006: Deep Belief Networks (Hinton, G. et al.)
- 2012: AlexNet (Krizhevsky et al.)
**Training & Optimization:** methods for efficient learning and gradient-based optimization

- 1967: Stochastic Gradient Descent for NN (Amari, S.)
- 1970: Automatic Differentiation (Linnainmaa, S.)
- 1986: Backpropagation for NN (Rumelhart, Hinton & Williams)
- 1992: Weight Decay (Krogh & Hertz)
- 2009: Convolutional DBNs & Prob. Max Pooling (Lee, H. et al.)
- 2010: ReLU & Xavier Init (Nair & Hinton; Glorot & Bengio)
- 2012: Dropout (Hinton, G. et al.)
**Software & Datasets:** tools, platforms, and milestones that enabled practical deep learning

- 1997: Deep Blue (IBM)
- 1998: MNIST Dataset & LeNet-5 (LeCun, Y. et al.)
- 2002: Torch Framework (Torch Team)
- 2007: CUDA Platform (NVIDIA)
- 2009: ImageNet Dataset (Deng, J. et al.)
- 2011: Siri (Apple Inc.)
--- ## The Deep Learning Era
**Deep Architectures:** deep architectures and generative models transforming AI capabilities

- 2013: Variational Autoencoders (Kingma & Welling)
- 2014: Generative Adversarial Nets (Goodfellow et al.)
- 2015: ResNet & Diffusion (He et al.; Sohl-Dickstein et al.)
- 2016: Style Transfer & WaveNet (Gatys et al.; van den Oord et al.)
- 2017: Transformers (Vaswani et al.)
- 2021: ViT & CLIP (Dosovitskiy et al.; Radford et al.)
- 2022: Diffusion Transformer (Peebles & Xie)
- 2023: Mamba (Gu & Dao)
**Training & Optimization:** advanced learning techniques and representation learning breakthroughs

- 2013: Word2Vec (Mikolov, T. et al.)
- 2014: Attention Mechanism (Bahdanau, D. et al.)
- 2015: BatchNorm & Adam (Ioffe & Szegedy; Kingma & Ba)
- 2016: Layer Normalization (Ba, J. L. et al.)
- 2020: DDPM (Ho, J. et al.)
**Software & Applications:** practical deployment and mainstream adoption of deep learning systems

- 2016: AlphaGo (Silver, D. et al.)
- 2017: PyTorch (Paszke, A. et al.)
- 2018: GPT-1 & BERT (Radford et al.; Devlin et al.)
- 2020: GPT-3 (Brown, T. B. et al.)
- 2022: ChatGPT & Stable Diffusion (OpenAI; Stability AI)
- 2023: LLaMA (Touvron, H. et al.)
--- ## Recap: Latent Models
**Latent Variable Models:** Introduce hidden $\mathbf{z}$ to model complex distributions; marginal likelihood: $p(\mathbf{x}|\boldsymbol{\theta}) = \int p(\mathbf{x}, \mathbf{z}|\boldsymbol{\theta}) \, d\mathbf{z}$ **GMM (Discrete Latent):** $p(\mathbf{x}|\boldsymbol{\theta}) = \sum_{k=1}^K \pi_k \cdot \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ — tractable sum, but **log-of-sum** prevents closed-form MLE **EM Algorithm:** Iteratively optimize when direct MLE is intractable
- **E-Step:** compute responsibilities $\gamma_{ik} = p(z_i=k|\mathbf{x}_i, \boldsymbol{\theta}^{(t)})$ (soft cluster assignments)
- **M-Step:** update $\boldsymbol{\theta}$ via weighted MLE: $\boldsymbol{\mu}_k = \frac{\sum_i \gamma_{ik} \mathbf{x}_i}{\sum_i \gamma_{ik}}$, etc.
**Variational View:** - **ELBO:** $\log p(\mathbf{x}|\boldsymbol{\theta}) = \text{ELBO}(q, \boldsymbol{\theta}) + D_{\text{KL}}(q(z|\mathbf{x}) \,\|\, p(z|\mathbf{x}, \boldsymbol{\theta}))$ - **E-Step** = minimize KL → set $q = p(z|\mathbf{x}, \boldsymbol{\theta})$ (tighten bound) - **M-Step** = maximize Q-function $\mathbb{E}_q[\log p(\mathbf{x}, z|\boldsymbol{\theta})]$ (raise bound) **Key:** EM converges because log-likelihood is monotonically non-decreasing; K-means is EM with hard assignments
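The E-step/M-step loop above can be sketched end to end for a 1-D, two-component GMM. This is a minimal NumPy illustration, not an optimized implementation; the data, initialization, and iteration count are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 1-D data from two well-separated clusters (illustrative values).
x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 0.8, 200)])

# Initial parameters: mixing weights, means, variances.
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def log_gauss(x, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

ll_history = []
for _ in range(50):
    # E-step: responsibilities gamma_ik = p(z_i = k | x_i, theta).
    log_joint = np.log(pi) + log_gauss(x[:, None], mu, var)   # shape (n, K)
    log_marg = np.logaddexp.reduce(log_joint, axis=1)          # log p(x_i | theta)
    gamma = np.exp(log_joint - log_marg[:, None])
    ll_history.append(log_marg.sum())
    # M-step: weighted MLE updates.
    Nk = gamma.sum(axis=0)
    pi = Nk / len(x)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

# EM guarantees the log-likelihood never decreases across iterations.
assert all(b >= a - 1e-9 for a, b in zip(ll_history, ll_history[1:]))
```

Note the monotonicity check at the end mirrors the convergence guarantee stated above; the recovered means land near the true cluster centers.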
--- ## From GMM to Deep Latent Models
**GMM worked because:** | Component | GMM Choice | Why it's tractable | |:----------|:-----------|:-------------------| | Latent $z$ | Discrete: $z \in \{1, ..., K\}$ | Sum over $K$ values instead of integral | | Prior $p(z)$ | Categorical: $\pi_k$ | Simple mixing weights | | Decoder $p(\mathbf{x}\|z)$ | Gaussian: $\mathcal{N}(\mathbf{x}\|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ | Closed-form posterior |
**What if we want more expressive models?**
- **Continuous latent space:** $\mathbf{z} \in \mathbb{R}^d$ can represent smooth, continuous factors of variation
- **Neural network decoder:** $p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{z}), \sigma^2 \mathbf{I})$, where the mean is a neural network output and the variance is fixed
This is the **deep latent variable model** — but what breaks?
--- ## The Intractable Posterior Problem
**Recall the E-step goal:** Compute the posterior $p(\mathbf{z}|\mathbf{x}, \boldsymbol{\theta})$ Using Bayes' theorem:
$ p(\mathbf{z}|\mathbf{x}, \boldsymbol{\theta}) = \frac{p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) \cdot p(\mathbf{z})}{p(\mathbf{x}|\boldsymbol{\theta})} = \frac{p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) \cdot p(\mathbf{z})}{\int p(\mathbf{x}|\mathbf{z}', \boldsymbol{\theta}) \cdot p(\mathbf{z}') \, d\mathbf{z}'} $
**The denominator is the problem!** | Model | Decoder $p(\mathbf{x}\|\mathbf{z})$ | Marginal $p(\mathbf{x})$ | Posterior $p(\mathbf{z}\|\mathbf{x})$ | |:------|:-----------------------------------|:------------------------|:------------------------------------| | GMM | Gaussian | Finite sum | **Tractable** | | Deep LVM | Neural Network | Intractable integral | **Intractable** |
**Why neural networks break tractability:**
GMM has fixed $\boldsymbol{\mu}_k$ and discrete $z$ (finite sum). The deep latent variable model has $\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{z}) = \text{NeuralNet}(\mathbf{z})$ — a complex function over continuous latent space. The marginal $\int p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) \cdot p(\mathbf{z}) \, d\mathbf{z}$ has no closed form!
--- ## EM Breaks Down
**Recall the EM framework:**
$ \log p(\mathbf{x}|\boldsymbol{\theta}) = \text{ELBO}(q, \boldsymbol{\theta}) + D_{\text{KL}}\left( q(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}|\mathbf{x}, \boldsymbol{\theta}) \right) $
| Step | GMM | Deep Latent Model | |:-----|:----|:------------------| | **E-step** | Set $q = p(\mathbf{z}\|\mathbf{x}, \boldsymbol{\theta})$ exactly | **Cannot compute** $p(\mathbf{z}\|\mathbf{x}, \boldsymbol{\theta})$ | | **M-step** | Closed-form weighted MLE | Gradient descent on NN parameters | | **Bound** | Tight (KL = 0 after E-step) | **Always a gap** |
**The fundamental problem:** In GMM, we could set $q(\mathbf{z}|\mathbf{x}) = p(\mathbf{z}|\mathbf{x}, \boldsymbol{\theta})$ exactly, making the ELBO tight. With neural network decoders, we **cannot compute the true posterior** — so we cannot perform the E-step!
**We need an approximation strategy...**
--- ## Learn the Posterior Approximation
**Key insight:** If we can't compute $p(\mathbf{z}|\mathbf{x}, \boldsymbol{\theta})$, let's **learn to approximate it!**
**Introduce an encoder network** $q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})$ that approximates the intractable posterior:
$ q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi}) = \mathcal{N}\left(\mathbf{z} \,|\, \boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x}), \text{diag}(\boldsymbol{\sigma}^2_{\boldsymbol{\phi}}(\mathbf{x}))\right) $
- $\boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x})$: neural network outputting the mean - $\boldsymbol{\sigma}_{\boldsymbol{\phi}}(\mathbf{x})$: neural network outputting the standard deviation
**Why this works:** - A single encoder handles **all datapoints** — one forward pass per $\mathbf{x}$ - The encoder learns to map $\mathbf{x} \mapsto (\boldsymbol{\mu}, \boldsymbol{\sigma})$ that approximate the true posterior - Generalizes to unseen data (unlike per-datapoint optimization)
**This is called "amortized inference"** — the cost of learning the posterior is amortized across the entire dataset by sharing encoder parameters $\boldsymbol{\phi}$.
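A minimal NumPy sketch of such an encoder, with hypothetical sizes (4-D data, 2-D latent, one tanh hidden layer) and randomly initialized weights standing in for learned parameters $\boldsymbol{\phi}$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes for illustration: 4-D data, 2-D latent, 8 hidden units.
D, d, H = 4, 2, 8
# Randomly initialized encoder weights (phi); training would learn these.
W1, b1 = rng.normal(size=(H, D)) * 0.1, np.zeros(H)
W_mu, b_mu = rng.normal(size=(d, H)) * 0.1, np.zeros(d)
W_ls, b_ls = rng.normal(size=(d, H)) * 0.1, np.zeros(d)  # "log sigma" head

def encoder(x):
    """Amortized inference: one forward pass maps x to (mu, sigma)."""
    h = np.tanh(W1 @ x + b1)
    mu = W_mu @ h + b_mu
    sigma = np.exp(W_ls @ h + b_ls)  # exp of the log-sigma head keeps sigma > 0
    return mu, sigma

# The same parameters phi serve every datapoint -- no per-x optimization.
x1, x2 = rng.normal(size=D), rng.normal(size=D)
mu1, sigma1 = encoder(x1)
mu2, sigma2 = encoder(x2)
```

Outputting $\log \boldsymbol{\sigma}$ and exponentiating is a common way to guarantee positive standard deviations without constraining the network.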
--- ## VAE vs GMM: The Setup
**Recall the ELBO decomposition** (same as GMM!):
$ \log p(\mathbf{x}|\boldsymbol{\theta}) = \text{ELBO}(q, \boldsymbol{\theta}) + D_{\text{KL}}\left( q(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}|\mathbf{x}, \boldsymbol{\theta}) \right) $
| Component | GMM | VAE | |:----------|:----|:----| | **Latent** $\mathbf{z}$ | Discrete: $z \in \{1, ..., K\}$ | Continuous: $\mathbf{z} \in \mathbb{R}^d$ | | **Prior** $p(\mathbf{z})$ | Categorical: $\pi_k$ | Standard Gaussian: $\mathcal{N}(\mathbf{0}, \mathbf{I})$ | | **Decoder** $p(\mathbf{x}\|\mathbf{z}, \boldsymbol{\theta})$ | Gaussian: $\mathcal{N}(\mathbf{x}\|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ | Neural network with Gaussian output | | **Posterior approx.** $q$ | Exact: $q = p(z\|\mathbf{x}, \boldsymbol{\theta})$ | Learned encoder: $q(\mathbf{z}\|\mathbf{x}, \boldsymbol{\phi})$ | | **Parameters** | $\boldsymbol{\theta} = \{\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \pi_k\}$ | $\boldsymbol{\theta}$ (decoder NN), $\boldsymbol{\phi}$ (encoder NN) |
**Key difference:** In VAE, we optimize **both** $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$ jointly, since we cannot compute the true posterior!
--- ## Recap: Deriving the ELBO
**The fundamental challenge:** We want to maximize $\log p(\mathbf{x}|\boldsymbol{\theta})$, but the log-of-sum is intractable. We introduce a variational distribution $q(z|\mathbf{x})$ and use Jensen's inequality:
$ \begin{aligned} \log p_{X|\Theta}(\mathbf{x}|\boldsymbol{\theta}) &= \log \left( \sum_{k=1}^K p(\mathbf{x}, z=k|\boldsymbol{\theta}) \right) \\ &= \log \left( \sum_{k=1}^K q(z=k|\mathbf{x}) \cdot \frac{p(\mathbf{x}, z=k|\boldsymbol{\theta})}{q(z=k|\mathbf{x})} \right) \\ &= \log \left( \mathbb{E}_{z \sim q(z|\mathbf{x})} \left[ \frac{p(\mathbf{x}, z|\boldsymbol{\theta})}{q(z|\mathbf{x})} \right] \right) \\ &\geq \mathbb{E}_{z \sim q(z|\mathbf{x})} \left[ \log \frac{p(\mathbf{x}, z|\boldsymbol{\theta})}{q(z|\mathbf{x})} \right] = \text{ELBO}(q, \boldsymbol{\theta}) \end{aligned} $
--- ## The VAE ELBO
Starting from the general ELBO definition:
$ \text{ELBO}(\boldsymbol{\phi}, \boldsymbol{\theta}; \mathbf{x}) = \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})} \left[ \log \frac{p(\mathbf{x}, \mathbf{z} | \boldsymbol{\theta})}{q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})} \right] $
Using the chain rule $p(\mathbf{x}, \mathbf{z} | \boldsymbol{\theta}) = p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) \cdot p(\mathbf{z})$:
$ \begin{aligned} \text{ELBO} &= \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})} \left[ \log p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) + \log p(\mathbf{z}) - \log q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi}) \right] \\ &= \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})} \left[ \log p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) \right] + \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})} \left[ \log \frac{p(\mathbf{z})}{q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})} \right] \end{aligned} $
Recognizing the KL divergence, we get the **VAE objective**:
$ \text{ELBO}(\boldsymbol{\phi}, \boldsymbol{\theta}; \mathbf{x}) = \underbrace{\mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})} \left[ \log p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) \right]}_{\text{Reconstruction term}} - \underbrace{D_{\text{KL}}\left( q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi}) \,\|\, p(\mathbf{z}) \right)}_{\text{Regularization term}} $
--- ## Understanding the VAE Objective
$ \text{ELBO}(\boldsymbol{\phi}, \boldsymbol{\theta}; \mathbf{x}) = \underbrace{\mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})} \left[ \log p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) \right]}_{\text{Reconstruction}} - \underbrace{D_{\text{KL}}\left( q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi}) \,\|\, p(\mathbf{z}) \right)}_{\text{Regularization}} $
**Reconstruction term:** How well can the decoder reconstruct $\mathbf{x}$ from samples $\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})$? - Encourages the latent code to **preserve information** about $\mathbf{x}$ - Like the expected complete-data log-likelihood in EM's M-step
**Regularization term:** How close is the encoder's output to the prior? - Encourages the latent space to be **well-structured** (match $\mathcal{N}(\mathbf{0}, \mathbf{I})$) - Prevents the encoder from "cheating" by encoding each $\mathbf{x}$ as a delta function - No direct analogue in GMM — posterior is exact there!
**Trade-off:** Reconstruction wants $q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})$ to be specific to each $\mathbf{x}$; regularization wants $q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})$ close to the prior. The VAE balances these!
--- ## Comparison: GMM Q-Function vs VAE ELBO
**GMM (E-step sets $q = p(z|\mathbf{x}, \boldsymbol{\theta})$ exactly):**
$ Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma_{ik} \log p(\mathbf{x}_i, z_i=k | \boldsymbol{\theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma_{ik} \left[ \log \pi_k + \log \mathcal{N}(\mathbf{x}_i | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right] $
**VAE (optimize $\boldsymbol{\phi}$ and $\boldsymbol{\theta}$ jointly):**
$ \text{ELBO}(\boldsymbol{\phi}, \boldsymbol{\theta}) = \sum_{i=1}^{n} \left[ \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}_i, \boldsymbol{\phi})} \left[ \log p(\mathbf{x}_i|\mathbf{z}, \boldsymbol{\theta}) \right] - D_{\text{KL}}\left( q(\mathbf{z}|\mathbf{x}_i, \boldsymbol{\phi}) \,\|\, p(\mathbf{z}) \right) \right] $
--- ## Monte Carlo Estimation
**Problem:** How do we compute expectations when integrals have no closed form?
$ \mathbb{E}_{p(\mathbf{x})}[f(\mathbf{x})] = \int p(\mathbf{x}) f(\mathbf{x}) \, d\mathbf{x} \quad \text{(often intractable)} $
**Monte Carlo estimation:** Approximate the expectation using samples!
$ \mathbb{E}_{p(\mathbf{x})}[f(\mathbf{x})] \approx \frac{1}{L} \sum_{l=1}^{L} f(\mathbf{x}^{(l)}), \quad \text{where } \mathbf{x}^{(l)} \sim p(\mathbf{x}) $
**Why this works:** By the Law of Large Numbers, the sample mean converges to the true expectation:
$ \frac{1}{L} \sum_{l=1}^{L} f(\mathbf{x}^{(l)}) \xrightarrow{L \to \infty} \mathbb{E}_{p(\mathbf{x})}[f(\mathbf{x})] $
**Key properties:** - **Unbiased:** $\mathbb{E}\left[\frac{1}{L}\sum_l f(\mathbf{x}^{(l)})\right] = \mathbb{E}_{p}[f(\mathbf{x})]$ - **Variance:** $\text{Var} \propto \frac{1}{L}$ — more samples = lower variance - **Works for any** $f$ as long as we can sample from $p(\mathbf{x})$
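A quick numerical check of the Law of Large Numbers at work, estimating $\mathbb{E}[x^2] = 1$ for $x \sim \mathcal{N}(0, 1)$ (sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo estimate of E_p[f(x)] using L samples from p.
def mc_estimate(f, sampler, L):
    return np.mean(f(sampler(L)))

f = lambda x: x ** 2                   # true E[x^2] = 1 under N(0, 1)
sampler = lambda L: rng.normal(size=L)

for L in (10, 1000, 100_000):
    est = mc_estimate(f, sampler, L)
    print(L, est)  # estimates concentrate around 1 as L grows
```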
--- ## The Optimization Challenge
**Goal:** Maximize the ELBO with respect to both $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$
$ \boldsymbol{\phi}^*, \boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\phi}, \boldsymbol{\theta}} \sum_{i=1}^{n} \text{ELBO}(\boldsymbol{\phi}, \boldsymbol{\theta}; \mathbf{x}_i) $
**Problem: The reconstruction term involves an expectation**
$ \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})} \left[ \log p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) \right] = \int q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi}) \log p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) \, d\mathbf{z} $
This integral has no closed form when $p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta})$ is a neural network!
**Solution:** Monte Carlo estimation — sample $\mathbf{z}^{(l)} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})$:
$ \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})} \left[ \log p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) \right] \approx \frac{1}{L} \sum_{l=1}^{L} \log p(\mathbf{x}|\mathbf{z}^{(l)}, \boldsymbol{\theta}) $
In practice, $L = 1$ works well during training!
--- ## Gradient w.r.t. Decoder Parameters $\boldsymbol{\theta}$
**Good news:** The gradient w.r.t. $\boldsymbol{\theta}$ is straightforward!
$ \nabla_{\boldsymbol{\theta}} \frac{1}{L} \sum_{l=1}^{L} \log p(\mathbf{x}|\mathbf{z}^{(l)}, \boldsymbol{\theta}) = \frac{1}{L} \sum_{l=1}^{L} \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}|\mathbf{z}^{(l)}, \boldsymbol{\theta}) $
**Why is this easy?** - The samples $\mathbf{z}^{(l)}$ come from the **encoder** (parameters $\boldsymbol{\phi}$) - From the decoder's perspective, $\mathbf{z}^{(l)}$ is just a **fixed input** — like any other input to a neural network - No sampling w.r.t. $\boldsymbol{\theta}$ means standard backpropagation works!
**This is just like training any neural network:** $\mathbf{z}^{(l)} \xrightarrow{\text{Decoder}_{\boldsymbol{\theta}}} \hat{\mathbf{x}} \xrightarrow{\text{loss}} \log p(\mathbf{x}|\mathbf{z}^{(l)}, \boldsymbol{\theta})$ Backprop through the decoder as usual!
--- ## Gradient w.r.t. Encoder Parameters $\boldsymbol{\phi}$
**Problem:** We need gradients w.r.t. $\boldsymbol{\phi}$, but we sample from $q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})$!
$ \nabla_{\boldsymbol{\phi}} \frac{1}{L} \sum_{l=1}^{L} \log p(\mathbf{x}|\mathbf{z}^{(l)}, \boldsymbol{\theta}), \quad \text{where } \mathbf{z}^{(l)} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi}) $
**The issue:** The samples $\mathbf{z}^{(l)}$ depend on $\boldsymbol{\phi}$ through stochastic sampling! - Sampling $\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})$ is a **stochastic operation** - Gradients don't flow through random sampling! - We cannot backpropagate through the sampling step
**Compare the two gradients:** | Parameter | Gradient | Difficulty | |:----------|:---------|:-----------| | $\boldsymbol{\theta}$ (decoder) | $\nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}\|\mathbf{z}, \boldsymbol{\theta})$ | Standard backprop — $\mathbf{z}$ is just an input | | $\boldsymbol{\phi}$ (encoder) | $\nabla_{\boldsymbol{\phi}} \log p(\mathbf{x}\|\mathbf{z}, \boldsymbol{\theta})$ | **Problematic** — $\mathbf{z}$ depends on $\boldsymbol{\phi}$ via sampling |
--- ## The Reparameterization Trick
**Key insight:** Rewrite the sampling process to separate stochasticity from parameters!
**Before (non-differentiable):**
$\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi}) = \mathcal{N}(\boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x}), \text{diag}(\boldsymbol{\sigma}^2_{\boldsymbol{\phi}}(\mathbf{x})))$
**After (differentiable):**
$ \mathbf{z} = \boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x}) + \boldsymbol{\sigma}_{\boldsymbol{\phi}}(\mathbf{x}) \odot \boldsymbol{\epsilon}, \quad \text{where } \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) $
- $\boldsymbol{\epsilon}$ is sampled from a **fixed** distribution (independent of $\boldsymbol{\phi}$)
- $\mathbf{z}$ is now a **deterministic function** of $\boldsymbol{\phi}$ (given $\boldsymbol{\epsilon}$)
- Gradients flow through $\boldsymbol{\mu}_{\boldsymbol{\phi}}$ and $\boldsymbol{\sigma}_{\boldsymbol{\phi}}$ via standard backpropagation!
**The expectation becomes:**
$ \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})} \left[ f(\mathbf{z}) \right] = \mathbb{E}_{\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \left[ f(\boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x}) + \boldsymbol{\sigma}_{\boldsymbol{\phi}}(\mathbf{x}) \odot \boldsymbol{\epsilon}) \right] $
Now $\nabla_{\boldsymbol{\phi}}$ can go inside the expectation!
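A sanity check that the reparameterized samples really follow $\mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2))$; the $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ values are illustrative stand-ins for encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([0.5, -1.0])     # illustrative encoder mean
sigma = np.array([0.3, 0.7])   # illustrative encoder std

# Reparameterization: z is a deterministic function of (mu, sigma) given eps,
# so gradients can flow through mu and sigma; only eps is random.
eps = rng.standard_normal((100_000, 2))
z = mu + sigma * eps

print(z.mean(axis=0))  # close to mu
print(z.std(axis=0))   # close to sigma
```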
--- ## Reparameterization: The Math
**With reparameterization, we can compute gradients of the MC estimate:**
$ \nabla_{\boldsymbol{\phi}} \frac{1}{L} \sum_{l=1}^{L} f(\mathbf{z}^{(l)}) = \frac{1}{L} \sum_{l=1}^{L} \nabla_{\boldsymbol{\phi}} f(\boldsymbol{\mu}_{\boldsymbol{\phi}} + \boldsymbol{\sigma}_{\boldsymbol{\phi}} \odot \boldsymbol{\epsilon}^{(l)}) $
**Applying the chain rule:**
$ \nabla_{\boldsymbol{\phi}} f(\mathbf{z}) = \nabla_\mathbf{z} f(\mathbf{z}) \cdot \nabla_{\boldsymbol{\phi}} \mathbf{z} = \nabla_\mathbf{z} f(\mathbf{z}) \cdot \left( \nabla_{\boldsymbol{\phi}} \boldsymbol{\mu}_{\boldsymbol{\phi}} + \boldsymbol{\epsilon} \odot \nabla_{\boldsymbol{\phi}} \boldsymbol{\sigma}_{\boldsymbol{\phi}} \right) $
**In practice (with $L$ samples):**
$ \nabla_{\boldsymbol{\phi}} \frac{1}{L} \sum_{l=1}^{L} f(\mathbf{z}^{(l)}) = \frac{1}{L} \sum_{l=1}^{L} \nabla_\mathbf{z} f(\mathbf{z}^{(l)}) \cdot \left( \nabla_{\boldsymbol{\phi}} \boldsymbol{\mu}_{\boldsymbol{\phi}} + \boldsymbol{\epsilon}^{(l)} \odot \nabla_{\boldsymbol{\phi}} \boldsymbol{\sigma}_{\boldsymbol{\phi}} \right) $
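The chain-rule formula can be verified numerically in a scalar toy case with $f(z) = z^2$, comparing it against finite differences computed with the same fixed noise draws (all values here are arbitrary for the check):

```python
import numpy as np

rng = np.random.default_rng(0)

# Scalar toy case: f(z) = z^2, q = N(mu, sigma^2), parameters phi = (mu, sigma).
f = lambda z: z ** 2
df = lambda z: 2 * z

mu, sigma = 0.7, 0.4
eps = rng.standard_normal(1000)  # fixed noise, shared by both gradient estimates
z = mu + sigma * eps

# Chain rule: grad_mu = f'(z) * d z/d mu = f'(z) * 1
#             grad_sigma = f'(z) * d z/d sigma = f'(z) * eps
g_mu = np.mean(df(z) * 1.0)
g_sigma = np.mean(df(z) * eps)

# Finite-difference check using the SAME eps (common random numbers).
h = 1e-5
fd_mu = (np.mean(f(mu + h + sigma * eps)) - np.mean(f(mu - h + sigma * eps))) / (2 * h)
fd_sigma = (np.mean(f(mu + (sigma + h) * eps)) - np.mean(f(mu + (sigma - h) * eps))) / (2 * h)

assert abs(g_mu - fd_mu) < 1e-4 and abs(g_sigma - fd_sigma) < 1e-4
```

Holding $\boldsymbol{\epsilon}$ fixed across both estimates is essential; with fresh noise the finite-difference estimate would be swamped by sampling variance.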
--- ## Applying to the VAE Reconstruction Term
**Now let's substitute** $f(\mathbf{z}) = \log p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta})$ — the decoder log-likelihood:
$ \nabla_{\boldsymbol{\phi}} \frac{1}{L} \sum_{l=1}^{L} \log p(\mathbf{x}|\mathbf{z}^{(l)}, \boldsymbol{\theta}) = \frac{1}{L} \sum_{l=1}^{L} \nabla_{\boldsymbol{\phi}} \log p(\mathbf{x}|\boldsymbol{\mu}_{\boldsymbol{\phi}} + \boldsymbol{\sigma}_{\boldsymbol{\phi}} \odot \boldsymbol{\epsilon}^{(l)}, \boldsymbol{\theta}) $
**Expanding with the chain rule:**
$ = \frac{1}{L} \sum_{l=1}^{L} \underbrace{\nabla_\mathbf{z} \log p(\mathbf{x}|\mathbf{z}^{(l)}, \boldsymbol{\theta})}_{\text{decoder gradient}} \cdot \left( \nabla_{\boldsymbol{\phi}} \boldsymbol{\mu}_{\boldsymbol{\phi}} + \boldsymbol{\epsilon}^{(l)} \odot \nabla_{\boldsymbol{\phi}} \boldsymbol{\sigma}_{\boldsymbol{\phi}} \right) $
**Key insight:** The gradient flows from decoder → through $\mathbf{z}$ → to encoder parameters $\boldsymbol{\phi}$ - $\nabla_\mathbf{z} \log p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta})$: how changing $\mathbf{z}$ affects reconstruction - $\nabla_{\boldsymbol{\phi}} \boldsymbol{\mu}_{\boldsymbol{\phi}}$: how encoder parameters affect the mean - $\boldsymbol{\epsilon}^{(l)} \odot \nabla_{\boldsymbol{\phi}} \boldsymbol{\sigma}_{\boldsymbol{\phi}}$: how encoder parameters affect variance (scaled by noise)
**In practice ($L=1$):** Sample one $\boldsymbol{\epsilon}$, compute $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$, backprop through decoder and encoder!
--- ## Recap: The Full VAE Objective
**We're optimizing the ELBO:**
$ \text{ELBO}(\boldsymbol{\phi}, \boldsymbol{\theta}; \mathbf{x}) = \underbrace{\mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})} \left[ \log p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) \right]}_{\text{Reconstruction term}} - \underbrace{D_{\text{KL}}\left( q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi}) \,\|\, p(\mathbf{z}) \right)}_{\text{Regularization term}} $
**What we've solved — the reconstruction term:** | Challenge | Solution | |:----------|:---------| | Intractable expectation | Monte Carlo: $\frac{1}{L} \sum_{l=1}^{L} \log p(\mathbf{x}\|\mathbf{z}^{(l)}, \boldsymbol{\theta})$ | | Gradient w.r.t. $\boldsymbol{\theta}$ | Standard backprop (z is just an input) | | Gradient w.r.t. $\boldsymbol{\phi}$ | Reparameterization trick |
**What's left — the KL term:** How do we compute $D_{\text{KL}}\left( q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi}) \,\|\, p(\mathbf{z}) \right)$?
--- ## The KL Term: Closed Form
**Good news:** The KL divergence between two Gaussians has a closed form! For $q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi}) = \mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2))$ and $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$:
$ \begin{aligned} D_{\text{KL}}(q \| p) &= \mathbb{E}_q[\log q(\mathbf{z})] - \mathbb{E}_q[\log p(\mathbf{z})] \\[0.5em] &= \mathbb{E}_q\left[-\frac{1}{2}\sum_{j=1}^d \left(\log(2\pi\sigma_j^2) + \frac{(z_j - \mu_j)^2}{\sigma_j^2}\right)\right] - \mathbb{E}_q\left[-\frac{1}{2}\sum_{j=1}^d \left(\log(2\pi) + z_j^2\right)\right] \\[0.5em] &= -\frac{1}{2}\sum_j \left(\log \sigma_j^2 + 1\right) + \frac{1}{2}\sum_j \mathbb{E}_q[z_j^2] \\[0.5em] &= -\frac{1}{2}\sum_j \left(\log \sigma_j^2 + 1\right) + \frac{1}{2}\sum_j \left(\mu_j^2 + \sigma_j^2\right) \\[0.5em] &= \frac{1}{2} \sum_{j=1}^{d} \left( \sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2 \right) \end{aligned} $
where $j \in \{1, \ldots, d\}$ indexes each dimension of the latent vector $\mathbf{z} \in \mathbb{R}^d$.
**No Monte Carlo needed for this term!** Gradients w.r.t. $\boldsymbol{\phi}$ are straightforward.
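The closed form can be cross-checked against a Monte Carlo estimate of $\mathbb{E}_q[\log q(\mathbf{z}) - \log p(\mathbf{z})]$; the $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -0.2])
sigma = np.array([0.8, 1.3])

# Closed form: 1/2 * sum_j (sigma_j^2 + mu_j^2 - 1 - log sigma_j^2)
kl_closed = 0.5 * np.sum(sigma**2 + mu**2 - 1 - np.log(sigma**2))

# Monte Carlo check: E_q[log q(z) - log p(z)] with z ~ q (reparameterized).
z = mu + sigma * rng.standard_normal((200_000, 2))
log_q = -0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (z - mu)**2 / sigma**2, axis=1)
log_p = -0.5 * np.sum(np.log(2 * np.pi) + z**2, axis=1)
kl_mc = np.mean(log_q - log_p)

print(kl_closed, kl_mc)  # the two estimates agree
```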
--- ## The Complete VAE Loss
**Putting it all together:** For a single datapoint $\mathbf{x}$:
$ \text{ELBO}(\boldsymbol{\phi}, \boldsymbol{\theta}; \mathbf{x}) = \underbrace{\frac{1}{L} \sum_{l=1}^{L} \log p(\mathbf{x}|\mathbf{z}^{(l)}, \boldsymbol{\theta})}_{\text{Monte Carlo estimate}} - \underbrace{\frac{1}{2} \sum_{j=1}^{d} \left( \sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2 \right)}_{\text{Closed-form KL}} $
where $\mathbf{z}^{(l)} = \boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x}) + \boldsymbol{\sigma}_{\boldsymbol{\phi}}(\mathbf{x}) \odot \boldsymbol{\epsilon}^{(l)}$, $\boldsymbol{\epsilon}^{(l)} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
**But what is** $\log p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta})$**?** We need to specify the decoder's output distribution! **Common choice:** Gaussian with fixed variance $\sigma^2$
$ p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) = \mathcal{N}(\mathbf{x} \,|\, \boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{z}), \sigma^2 \mathbf{I}) $
The neural network $\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{z})$ outputs the **mean** of this Gaussian — the reconstructed $\hat{\mathbf{x}}$.
--- ## Decoder Likelihood: From Gaussian to MSE
**Decoder output distribution:** $p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) = \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{z}), \sigma^2 \mathbf{I})$
**Taking the log of the Gaussian PDF:**
$ \begin{aligned} \log p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) &= \log \left( \frac{1}{(2\pi\sigma^2)^{D/2}} \exp\left( -\frac{1}{2\sigma^2} \|\mathbf{x} - \boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{z})\|^2 \right) \right) \\[0.5em] &= -\frac{D}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \|\mathbf{x} - \boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{z})\|^2 \\[0.5em] &= -\frac{1}{2\sigma^2} \|\mathbf{x} - \boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{z})\|^2 + \text{const} \end{aligned} $
where $D$ is the data dimensionality (e.g., number of pixels).
**Key insight:** Since $\sigma^2$ is a fixed constant:
$\max_{\boldsymbol{\theta}} \log p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) \quad \Longleftrightarrow \quad \min_{\boldsymbol{\theta}} \|\mathbf{x} - \boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{z})\|^2$
Maximizing Gaussian log-likelihood is equivalent to minimizing **mean squared error (MSE)**!
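The equivalence is easy to verify numerically: across candidate reconstructions, the log-likelihood ranking is exactly the reverse of the MSE ranking (data and candidates below are arbitrary):

```python
import numpy as np

sigma2 = 0.25  # fixed decoder variance (illustrative)
x = np.array([1.0, 2.0, 3.0])
D = len(x)

def log_lik(x_hat):
    # Gaussian log-likelihood with fixed isotropic variance sigma2.
    return -0.5 * D * np.log(2 * np.pi * sigma2) - 0.5 / sigma2 * np.sum((x - x_hat) ** 2)

def sq_err(x_hat):
    return np.sum((x - x_hat) ** 2)

# Higher likelihood <=> lower squared error, candidate by candidate.
cands = [x + 0.5, x - 0.1, x.copy()]
lls = [log_lik(c) for c in cands]
errs = [sq_err(c) for c in cands]
assert np.argmax(lls) == np.argmin(errs)
```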
--- ## The Practical VAE Loss
**Substituting the Gaussian decoder into the ELBO** (up to an additive constant):
$ \text{ELBO} = -\frac{1}{2\sigma^2} \|\mathbf{x} - \hat{\mathbf{x}}\|^2 - D_{\text{KL}}(q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi}) \| p(\mathbf{z})) + \text{const} $
**Converting to a loss (negate and drop constants):**
$ \mathcal{L}_{\text{VAE}} = \underbrace{\|\mathbf{x} - \hat{\mathbf{x}}\|^2}_{\text{Reconstruction loss (MSE)}} + \underbrace{\beta \cdot D_{\text{KL}}(q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi}) \| p(\mathbf{z}))}_{\text{KL regularization}} $
where $\hat{\mathbf{x}} = \boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{z})$ is the decoder output.
| $\beta$ | Effect | | :--------- | :-------- | | $\beta = 1$ | Standard VAE (original formulation) | | $\beta < 1$ | Better reconstructions, less regularized latent space | | $\beta > 1$ | **$\beta$-VAE**: stronger regularization, more disentangled latents |
**Note:** The relationship $\beta = 2\sigma^2$ shows that $\beta$ implicitly controls the assumed decoder variance — larger $\beta$ corresponds to assuming a noisier decoder!
--- ## VAE Training Algorithm
```
Initialize: encoder parameters φ, decoder parameters θ

For each epoch:
  For each minibatch {x₁, ..., xₘ}:

    # Forward pass (encoder)
    For each xᵢ:
      (μᵢ, σᵢ) = Encoder_φ(xᵢ)

    # Reparameterization (sample latent codes)
    For each xᵢ:
      εᵢ ~ N(0, I)
      zᵢ = μᵢ + σᵢ ⊙ εᵢ

    # Forward pass (decoder)
    For each zᵢ:
      x̂ᵢ = Decoder_θ(zᵢ)

    # Compute loss
    L_recon = (1/m) Σᵢ ||xᵢ - x̂ᵢ||²
    L_KL    = (1/m) Σᵢ Σⱼ (σᵢⱼ² + μᵢⱼ² - 1 - log σᵢⱼ²) / 2
    L       = L_recon + β · L_KL

    # Backward pass & update
    Compute ∇_θ L, ∇_φ L via backpropagation
    Update θ, φ using optimizer (e.g., Adam)
```
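The forward pass and loss of one minibatch step can be sketched in NumPy. The encoder and decoder here are hypothetical linear maps with random weights, standing in for the neural networks, just to exercise the loss computation (no backward pass is shown):

```python
import numpy as np

rng = np.random.default_rng(0)
m, D, d = 8, 6, 2  # minibatch size, data dim, latent dim (illustrative)
beta = 1.0

# Stand-ins for Encoder_phi and Decoder_theta: simple linear maps with
# hypothetical randomly initialized weights.
W_mu, W_ls = rng.normal(size=(d, D)) * 0.1, rng.normal(size=(d, D)) * 0.1
W_dec = rng.normal(size=(D, d)) * 0.1

X = rng.normal(size=(m, D))  # one minibatch of toy data

# Forward pass (encoder): (mu_i, sigma_i) = Encoder_phi(x_i)
mu = X @ W_mu.T
sigma = np.exp(X @ W_ls.T)

# Reparameterization: z_i = mu_i + sigma_i * eps_i
eps = rng.standard_normal((m, d))
Z = mu + sigma * eps

# Forward pass (decoder): x_hat_i = Decoder_theta(z_i)
X_hat = Z @ W_dec.T

# Loss terms exactly as in the pseudocode.
L_recon = np.mean(np.sum((X - X_hat) ** 2, axis=1))
L_KL = np.mean(0.5 * np.sum(sigma**2 + mu**2 - 1 - np.log(sigma**2), axis=1))
L = L_recon + beta * L_KL
```

In a real implementation the gradient step would be handled by an autodiff framework (e.g. PyTorch), with the reparameterized `Z` keeping the whole graph differentiable.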
--- ## GMM vs VAE: Optimization Comparison
| Aspect | GMM (EM) | VAE |
|:-------|:---------|:----|
| E-step / Encoder | Compute $\gamma_{ik} = p(z_i=k\|\mathbf{x}_i, \boldsymbol{\theta})$ exactly | Forward pass: $(\boldsymbol{\mu}, \boldsymbol{\sigma}) = \text{Encoder}_{\boldsymbol{\phi}}(\mathbf{x})$ |
| Posterior | Exact (tractable) | Approximate (learned) |
| Sampling | Weighted sum over $K$ components | Monte Carlo: $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$ |
| M-step / Decoder | Closed-form: $\boldsymbol{\mu}_k = \frac{\sum_i \gamma_{ik} \mathbf{x}_i}{\sum_i \gamma_{ik}}$ | Gradient descent on NN |
| Joint optimization | Alternating (E then M) | Simultaneous (SGD on $\boldsymbol{\theta}, \boldsymbol{\phi}$) |
| Convergence | Monotonic increase in likelihood | ELBO increases (with noise from SGD) |
| KL gap | Zero (ELBO is tight) | Non-zero (approximation gap) |
**Key insight:** VAE trades exactness for expressiveness: - GMM: Exact inference, limited model (Gaussian components) - VAE: Approximate inference, powerful model (neural networks)
--- ## Summary: Optimizing the VAE
**The VAE objective** (maximize ELBO):
$ \text{ELBO}(\boldsymbol{\phi}, \boldsymbol{\theta}; \mathbf{x}) = \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})} \left[ \log p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) \right] - D_{\text{KL}}\left( q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi}) \,\|\, p(\mathbf{z}) \right) $
**Three key ingredients:** | Challenge | Solution | |:----------|:---------| | Intractable posterior $p(\mathbf{z}\|\mathbf{x})$ | Learn encoder $q(\mathbf{z}\|\mathbf{x}, \boldsymbol{\phi})$ | | Intractable expectation | Monte Carlo sampling ($L=1$ suffices) | | Non-differentiable sampling | Reparameterization trick |
--- ## VAE Architecture Overview
--- # Questions?