**Key insight:** Let's rewrite the ELBO using the chain rule $p(\mathbf{x}, z|\boldsymbol{\theta}) = p(z|\mathbf{x}, \boldsymbol{\theta}) \cdot p(\mathbf{x}|\boldsymbol{\theta})$:
$
\begin{aligned}
\text{ELBO}(q, \boldsymbol{\theta}) &= \mathbb{E}_{z \sim q(z|\mathbf{x})} \left[ \log p(\mathbf{x}, z|\boldsymbol{\theta}) \right] - \mathbb{E}_{z \sim q(z|\mathbf{x})} \left[ \log q(z|\mathbf{x}) \right] \\
&= \mathbb{E}_{z \sim q(z|\mathbf{x})} \left[ \log p(z|\mathbf{x}, \boldsymbol{\theta}) \right] + \mathbb{E}_{z \sim q(z|\mathbf{x})} \left[ \log p(\mathbf{x}|\boldsymbol{\theta}) \right] - \mathbb{E}_{z \sim q(z|\mathbf{x})} \left[ \log q(z|\mathbf{x}) \right] \\
&= \log p(\mathbf{x}|\boldsymbol{\theta}) - \underbrace{\mathbb{E}_{z \sim q(z|\mathbf{x})} \left[ \log \frac{q(z|\mathbf{x})}{p(z|\mathbf{x}, \boldsymbol{\theta})} \right]}_{D_{\text{KL}}\left( q(z|\mathbf{x}) \,\|\, p(z|\mathbf{x}, \boldsymbol{\theta}) \right)}
\end{aligned}
$
The **KL divergence** measures how different one probability distribution is from another:
$
D_{\text{KL}}(q \,\|\, p) = \mathbb{E}_{z \sim q(z)} \left[ \log \frac{q(z)}{p(z)} \right] = \sum_z q(z) \log \frac{q(z)}{p(z)}
$
**Key properties:**
| Property | Meaning |
|:---------|:--------|
| $D_{\text{KL}}(q \,\|\, p) \geq 0$ | Always non-negative |
| $D_{\text{KL}}(q \,\|\, p) = 0 \iff q = p$ | Zero only when distributions are identical |
| $D_{\text{KL}}(q \,\|\, p) \neq D_{\text{KL}}(p \,\|\, q)$ | **Not symmetric** — order matters! |
**Intuition:** KL divergence measures the "extra bits" needed to encode samples from $q$ using a code optimized for $p$.
- Large $D_{\text{KL}}$: $q$ and $p$ are very different
- Small $D_{\text{KL}}$: $q$ approximates $p$ well
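These properties are easy to check numerically (a minimal numpy sketch; the distributions `q` and `p` are made-up examples):

```python
import numpy as np

def kl_divergence(q, p):
    """D_KL(q || p) for discrete distributions over the same support, in nats."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    mask = q > 0                               # terms with q(z) = 0 contribute nothing
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])

kl_divergence(q, p)                            # non-negative, small since q ≈ p
kl_divergence(q, q)                            # exactly 0 for identical distributions
kl_divergence(q, p), kl_divergence(p, q)       # the two orders differ: not symmetric
```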
Rearranging gives the **fundamental decomposition**:
$
\log p(\mathbf{x}|\boldsymbol{\theta}) = \text{ELBO}(q, \boldsymbol{\theta}; \mathbf{x}) + D_{\text{KL}}\left( q(z|\mathbf{x}) \,\|\, p(z|\mathbf{x}, \boldsymbol{\theta}) \right)
$
**This tells us:**
| Property | Explanation |
|:---------|:------------|
| ELBO is a **lower bound** | Since $D_{\text{KL}} \geq 0$ always |
| The **gap** is the KL divergence | From $q$ to the true posterior $p(z\|\mathbf{x}, \boldsymbol{\theta})$ |
| **Tight bound** when $q = p(z\|\mathbf{x}, \boldsymbol{\theta})$ | Setting $q$ to the true posterior makes $D_{\text{KL}} = 0$ |
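The decomposition can be verified numerically on a toy discrete model (a sketch with made-up numbers; $z$ takes two values and $\mathbf{x}$ is a single fixed observation):

```python
import numpy as np

# Toy model: z ∈ {0, 1}, one fixed observation x, prior p(z), likelihood p(x|z).
prior = np.array([0.6, 0.4])            # p(z)
lik   = np.array([0.2, 0.7])            # p(x|z) evaluated at the observed x
joint = prior * lik                     # p(x, z) by the chain rule
log_px = np.log(joint.sum())            # log p(x), marginalizing out z

q = np.array([0.5, 0.5])                # an arbitrary valid q(z|x)
elbo = np.sum(q * (np.log(joint) - np.log(q)))
posterior = joint / joint.sum()         # true posterior p(z|x)
kl = np.sum(q * np.log(q / posterior))

# The fundamental decomposition: log p(x) = ELBO + KL
assert np.isclose(log_px, elbo + kl)

# With q set to the true posterior, KL = 0 and the bound is tight:
elbo_tight = np.sum(posterior * (np.log(joint) - np.log(posterior)))
assert np.isclose(elbo_tight, log_px)
```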
At iteration $t$, we have current parameters $\boldsymbol{\theta}^{(t)}$. From the decomposition:
$
\log p(\mathbf{x}|\boldsymbol{\theta}^{(t)}) = \text{ELBO}(q, \boldsymbol{\theta}^{(t)}; \mathbf{x}) + D_{\text{KL}}\left( q(z|\mathbf{x}) \,\|\, p(z|\mathbf{x}, \boldsymbol{\theta}^{(t)}) \right)
$
First, we **maximize the ELBO w.r.t. $q$** to make the bound as tight as possible:
$
\begin{aligned}
q^{(t+1)}(z|\mathbf{x}) &= \arg\max_{q} \text{ELBO}(q, \boldsymbol{\theta}^{(t)}; \mathbf{x})\\
&= \arg\max_{q} \left( \log p(\mathbf{x}|\boldsymbol{\theta}^{(t)}) - D_{\text{KL}}\left( q(z|\mathbf{x}) \,\|\, p(z|\mathbf{x}, \boldsymbol{\theta}^{(t)}) \right) \right) \\
\end{aligned}
$
---
## The E-Step: Making the Bound Tight
Since $\log p(\mathbf{x}|\boldsymbol{\theta}^{(t)})$ is constant w.r.t. $q$, this is equivalent to minimizing the KL divergence:
$
q^{(t+1)}(z|\mathbf{x}) = \arg\min_{q} D_{\text{KL}}\left( q(z|\mathbf{x}) \,\|\, p(z|\mathbf{x}, \boldsymbol{\theta}^{(t)}) \right)
$
The minimum is achieved when the two distributions are equal. In the case of GMMs, this gives the familiar E-step update:
$
\begin{aligned}
q^{(t+1)}(z=k|\mathbf{x}_i) &= p(z=k|\mathbf{x}_i, \boldsymbol{\theta}^{(t)})\\
&= \frac{p(\mathbf{x}_i|z=k, \boldsymbol{\theta}^{(t)}) \cdot p(z=k|\boldsymbol{\theta}^{(t)})}{p(\mathbf{x}_i|\boldsymbol{\theta}^{(t)})} \\
&= \frac{\mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)})\cdot \pi_k^{(t)} }{\sum_{j=1}^K \pi_j^{(t)} \cdot \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_j^{(t)}, \boldsymbol{\Sigma}_j^{(t)})}
\end{aligned}
$
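In code, the E-step is one normalized Bayes-rule computation per point (a numpy sketch; `e_step` is a name chosen here, and the normalization is done in log space for numerical stability):

```python
import numpy as np

def e_step(X, pis, mus, covs):
    """Responsibilities gamma[i, k] = p(z_i = k | x_i, theta) for a GMM."""
    n, d = X.shape
    K = len(pis)
    log_w = np.empty((n, K))
    for k in range(K):
        diff = X - mus[k]
        cov_inv = np.linalg.inv(covs[k])
        _, logdet = np.linalg.slogdet(covs[k])
        quad = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)   # Mahalanobis terms
        log_w[:, k] = (np.log(pis[k])
                       - 0.5 * (d * np.log(2 * np.pi) + logdet + quad))
    # normalize in log space: subtract the row max before exponentiating
    log_w -= log_w.max(axis=1, keepdims=True)
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)
```

Each row of the returned matrix is a distribution over the $K$ components, i.e. it sums to 1.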
After the E-step, the bound is tight:
$
\log p(\mathbf{x}|\boldsymbol{\theta}^{(t)}) = \text{ELBO}(q^{(t+1)}, \boldsymbol{\theta}^{(t)}; \mathbf{x})
$
---
## The M-Step: Raising the Bound
Now we fix $q(z|\mathbf{x}) = q^{(t+1)}(z|\mathbf{x})$ and maximize the ELBO with respect to $\boldsymbol{\theta}$:
$
\boldsymbol{\theta}^{(t+1)} = \arg\max_{\boldsymbol{\theta}} \text{ELBO}(q^{(t+1)}, \boldsymbol{\theta}; \mathbf{x})
$
Recall the ELBO definition:
$
\text{ELBO}(q, \boldsymbol{\theta}; \mathbf{x}) = \mathbb{E}_{z \sim q(z|\mathbf{x})} \left[ \log p(\mathbf{x}, z|\boldsymbol{\theta}) \right] - \mathbb{E}_{z \sim q(z|\mathbf{x})} \left[ \log q(z|\mathbf{x}) \right]
$
Since $q$ is **fixed**, the entropy term $-\mathbb{E}_{z \sim q(z|\mathbf{x})}[\log q(z|\mathbf{x})]$ is constant w.r.t. $\boldsymbol{\theta}$:
$
\boldsymbol{\theta}^{(t+1)} = \arg\max_{\boldsymbol{\theta}} \mathbb{E}_{z \sim q^{(t+1)}(z|\mathbf{x})} \left[ \log p(\mathbf{x}, z|\boldsymbol{\theta}) \right]
$
---
## The M-Step: The Q-Function
The quantity we maximize is called the **Q-function** (expected complete-data log-likelihood):
$
Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{(t)}) = \sum_{i=1}^{n} \mathbb{E}_{z_i \sim q^{(t+1)}(z_i|\mathbf{x}_i)} \left[ \log p(\mathbf{x}_i, z_i|\boldsymbol{\theta}) \right] = \sum_{i=1}^{n} \sum_{z_i} q^{(t+1)}(z_i|\mathbf{x}_i) \log p(\mathbf{x}_i, z_i|\boldsymbol{\theta})
$
*Notation: $Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{(t)})$ means "Q as a function of $\boldsymbol{\theta}$, where $\boldsymbol{\theta}^{(t)}$ was used to compute $q^{(t+1)}$" — **not** a conditional probability!*
For GMMs, the joint factors as $p(\mathbf{x}_i, z_i=k|\boldsymbol{\theta}) = p(\mathbf{x}_i|z_i=k, \boldsymbol{\theta}) \cdot p(z_i=k|\boldsymbol{\theta}) = \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \cdot \pi_k$
Taking the log: $\log p(\mathbf{x}_i, z_i=k|\boldsymbol{\theta}) = \log \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) + \log \pi_k$
Substituting into the Q-function with $q^{(t+1)}(z_i = k|\mathbf{x}_i) = \gamma_{ik}$:
$
Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma_{ik} \left[ \log \pi_k + \log \mathcal{N}(\mathbf{x}_i | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]
$
**Key insight:** The log is now **inside** the sum over $k$, not outside!
- Each term involves only **one** component's parameters
- This is a **weighted log-likelihood** — we can maximize in closed form!
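Because the log sits inside the sum over $k$, $Q$ splits into one weighted term per component. A 1-D sketch (function names are illustrative) that also previews the M-step: the weighted-mean update maximizes $Q$ in $\boldsymbol{\mu}_k$:

```python
import numpy as np

def log_normal_1d(x, mu, var):
    """log N(x | mu, var), elementwise."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def q_function(x, gamma, pis, mus, vars_):
    """Q(theta; theta_t): the gamma-weighted complete-data log-likelihood."""
    Q = 0.0
    for k in range(gamma.shape[1]):   # each k contributes an independent term
        Q += np.sum(gamma[:, k] * (np.log(pis[k]) + log_normal_1d(x, mus[k], vars_[k])))
    return Q

x = np.array([0.0, 1.0, 2.0, 10.0])
gamma = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.0, 1.0]])
pis = gamma.mean(axis=0)                                    # N_k / n
mus = (gamma * x[:, None]).sum(axis=0) / gamma.sum(axis=0)  # weighted means
q_function(x, gamma, pis, mus, np.array([1.0, 1.0]))        # maximal over mus
```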
---
## The M-Step: Closed-Form Updates
To maximize $Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{(t)})$, we take derivatives and set to zero.
**For $\boldsymbol{\mu}_k$:** Using $\log \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = -\frac{1}{2}(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1}(\mathbf{x}_i - \boldsymbol{\mu}_k) + \text{const}$
$
\frac{\partial Q}{\partial \boldsymbol{\mu}_k} = \sum_{i=1}^{n} \gamma_{ik} \boldsymbol{\Sigma}_k^{-1}(\mathbf{x}_i - \boldsymbol{\mu}_k) = 0 \quad \Rightarrow \quad \boldsymbol{\mu}_k = \frac{\sum_{i=1}^{n} \gamma_{ik} \mathbf{x}_i}{\sum_{i=1}^{n} \gamma_{ik}}
$
**For $\pi_k$:** Maximize $\sum_k N_k \log \pi_k$ subject to $\sum_k \pi_k = 1$ using Lagrange multipliers:
$
\frac{\partial}{\partial \pi_k}\left[\sum_{j} N_j \log \pi_j + \lambda\left(\sum_{j} \pi_j - 1\right)\right] = \frac{N_k}{\pi_k} + \lambda = 0 \quad \Rightarrow \quad \pi_k = \frac{N_k}{n}
$
where $N_k = \sum_{i=1}^n \gamma_{ik}$ and $\lambda = -n$ from the constraint.
**For $\boldsymbol{\Sigma}_k$:** Similar derivation using matrix calculus gives:
$
\boldsymbol{\Sigma}_k = \frac{\sum_{i=1}^{n} \gamma_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top}{\sum_{i=1}^{n} \gamma_{ik}}
$
---
## The M-Step: Summary
Maximizing $Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{(t)})$ gives the familiar M-step updates:
$
\begin{aligned}
\pi_k^{(t+1)} &= \frac{1}{n} \sum_{i=1}^{n} \gamma_{ik} \\[0.8em]
\boldsymbol{\mu}_k^{(t+1)} &= \frac{\sum_{i=1}^{n} \gamma_{ik} \mathbf{x}_i}{\sum_{i=1}^{n} \gamma_{ik}} \\[0.8em]
\boldsymbol{\Sigma}_k^{(t+1)} &= \frac{\sum_{i=1}^{n} \gamma_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k^{(t+1)})(\mathbf{x}_i - \boldsymbol{\mu}_k^{(t+1)})^\top}{\sum_{i=1}^{n} \gamma_{ik}}
\end{aligned}
$
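The updates above translate directly into a few lines of numpy (a sketch; `m_step` is a name chosen here, taking the responsibility matrix `gamma` of shape `(n, K)`):

```python
import numpy as np

def m_step(X, gamma):
    """Closed-form M-step for a GMM given responsibilities gamma[i, k]."""
    n, d = X.shape
    Nk = gamma.sum(axis=0)                       # effective counts N_k
    pis = Nk / n                                 # mixing weights
    mus = (gamma.T @ X) / Nk[:, None]            # weighted means
    covs = np.empty((len(Nk), d, d))
    for k in range(len(Nk)):
        diff = X - mus[k]
        covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]  # weighted covariance
    return pis, mus, covs
```

With uniform responsibilities, each component's update collapses to the ordinary sample mean and covariance, as expected for a weighted MLE.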
**After the M-step:**
$
\text{ELBO}(q^{(t+1)}, \boldsymbol{\theta}^{(t+1)}; \mathbf{x}) \geq \text{ELBO}(q^{(t+1)}, \boldsymbol{\theta}^{(t)}; \mathbf{x})
$
We've **raised the ELBO** by finding better parameters!
---
## Connecting the Dots
**From theory to practice:** The variational derivation justifies exactly what we presented intuitively!
| Intuitive View | Variational Derivation |
|:---------------|:-----------------------|
| E-step: Compute soft assignments $\gamma_{ik}$ | Minimize $D_{\text{KL}}(q \,\|\, p)$ → set $q = p(z\|x, \boldsymbol{\theta})$ |
| M-step: Weighted MLE for each cluster | Maximize $Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{(t)}) = \mathbb{E}_{z \sim q(z\|x)}[\log p(x,z\|\boldsymbol{\theta})]$ |
| Log-likelihood increases | ELBO ↑ (E-step tightens, M-step raises) |
**Why does EM converge?**
---
## Why the Log-Likelihood Increases
**After E-step** (before M-step): We set $q = p(z|x, \boldsymbol{\theta}^{(t)})$, so $D_{\text{KL}} = 0$
$
\log p(\mathbf{x}|\boldsymbol{\theta}^{(t)}) = \text{ELBO}(q, \boldsymbol{\theta}^{(t)}; \mathbf{x}) + \underbrace{D_{\text{KL}}(q \,\|\, p(z|\mathbf{x}, \boldsymbol{\theta}^{(t)}))}_{= 0}
$
Therefore: $\log p(\mathbf{x}|\boldsymbol{\theta}^{(t)}) = \text{ELBO}(q, \boldsymbol{\theta}^{(t)}; \mathbf{x})$
**After M-step**: We found $\boldsymbol{\theta}^{(t+1)}$ that maximizes ELBO, but $q$ is now stale (computed with old $\boldsymbol{\theta}^{(t)}$):
$
\log p(\mathbf{x}|\boldsymbol{\theta}^{(t+1)}) = \text{ELBO}(q, \boldsymbol{\theta}^{(t+1)}; \mathbf{x}) + \underbrace{D_{\text{KL}}(q \,\|\, p(z|\mathbf{x}, \boldsymbol{\theta}^{(t+1)}))}_{\geq 0}
$
**Combining the inequalities:**
1. M-step maximized ELBO: $\text{ELBO}(q, \boldsymbol{\theta}^{(t+1)}; \mathbf{x}) \geq \text{ELBO}(q, \boldsymbol{\theta}^{(t)}; \mathbf{x})$
2. KL is non-negative: $\log p(\mathbf{x}|\boldsymbol{\theta}^{(t+1)}) \geq \text{ELBO}(q, \boldsymbol{\theta}^{(t+1)}; \mathbf{x})$
$
\boxed{\log p(\mathbf{x}|\boldsymbol{\theta}^{(t+1)}) \geq \text{ELBO}(q, \boldsymbol{\theta}^{(t+1)}; \mathbf{x}) \geq \text{ELBO}(q, \boldsymbol{\theta}^{(t)}; \mathbf{x}) = \log p(\mathbf{x}|\boldsymbol{\theta}^{(t)})}
$
**The log-likelihood is monotonically non-decreasing!** EM is guaranteed to converge to a local maximum.
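This guarantee can be observed directly. A compact 1-D GMM EM loop (a sketch with synthetic data; names are illustrative) records the log-likelihood at every iteration, and the recorded sequence never decreases:

```python
import numpy as np

def em_1d(x, pis, mus, vars_, iters=25):
    """EM for a 1-D GMM; returns the log-likelihood at each iteration."""
    lls = []
    for _ in range(iters):
        # log of pi_k * N(x_i | mu_k, var_k), shape (n, K)
        lw = (np.log(pis)
              - 0.5 * (np.log(2 * np.pi * vars_) + (x[:, None] - mus) ** 2 / vars_))
        m = lw.max(axis=1, keepdims=True)
        lls.append(np.sum(m[:, 0] + np.log(np.exp(lw - m).sum(axis=1))))  # log p(x|theta)
        gamma = np.exp(lw - m)
        gamma /= gamma.sum(axis=1, keepdims=True)            # E-step: responsibilities
        Nk = gamma.sum(axis=0)                               # M-step: weighted MLE
        pis = Nk / len(x)
        mus = (gamma * x[:, None]).sum(axis=0) / Nk
        vars_ = (gamma * (x[:, None] - mus) ** 2).sum(axis=0) / Nk
    return np.array(lls)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])
lls = em_1d(x, np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0]))
# np.diff(lls) is non-negative: each iteration raises (or keeps) the log-likelihood
```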
---
## Summary: Latent Variable Models & EM
**Latent Variable Models:**
- Introduce hidden variables $\mathbf{z}$ to model complex data distributions
- Marginal likelihood $p(\mathbf{x}|\boldsymbol{\theta}) = \int p(\mathbf{x}, z|\boldsymbol{\theta}) \, dz$ — maximizing its log is intractable because the log sits outside the sum/integral
**The EM Algorithm:**
| Step | What it does | Why it works |
|:-----|:-------------|:-------------|
| E-Step | Compute $\gamma_{ik} = p(z_i=k\|\mathbf{x}_i, \boldsymbol{\theta}^{(t)})$ | Minimizes KL divergence → makes ELBO tight |
| M-Step | Update $\boldsymbol{\theta}$ via weighted MLE | Maximizes Q-function → raises ELBO |
**Key Theoretical Insights:**
- **ELBO**: $\log p(\mathbf{x}|\boldsymbol{\theta}) = \text{ELBO}(q, \boldsymbol{\theta}; \mathbf{x}) + D_{\text{KL}}(q \,\|\, p(z|\mathbf{x}, \boldsymbol{\theta}))$
- **Convergence guarantee**: Log-likelihood is monotonically non-decreasing
- **Connection to Lecture 08**: E-step = full posterior, M-step = weighted MLE
**GMM as a Concrete Example:**
- Discrete latent $z \in \{1, \ldots, K\}$ = cluster assignment
- K-means = EM with hard assignments ($\gamma_{ik} \in \{0,1\}$)
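The last point can be made concrete: replacing soft responsibilities with one-hot ("hard") assignments turns the EM updates into the K-means updates (a minimal sketch; function names are illustrative):

```python
import numpy as np

def hard_assign(X, mus):
    """Hard E-step: gamma_ik in {0, 1}, one-hot at the nearest centroid."""
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)  # squared distances
    z = d2.argmin(axis=1)
    gamma = np.zeros((len(X), len(mus)))
    gamma[np.arange(len(X)), z] = 1.0
    return gamma

def kmeans_m_step(X, gamma):
    """With one-hot gamma, the weighted-mean update reduces to cluster means."""
    return (gamma.T @ X) / gamma.sum(axis=0)[:, None]
```

Alternating these two functions is exactly Lloyd's algorithm for K-means, run with the same machinery as soft EM.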