$
\underbrace{p_{Y|X}(y|x)}_{\color{#BF85FC}{\textbf{Posterior}}} = \frac{\overbrace{p_{X|Y}(x|y)}^{\color{#55A49F}{\textbf{Likelihood}}} \cdot \overbrace{p_Y(y)}^{\color{#FE938C}{\textbf{Prior}}}}{\underbrace{p_X(x)}_{\color{#77af49}{\textbf{Evidence}}}}
$
---
## Example: Audio Event Detection
**Scenario:** A smart speaker detects an audio event and needs to classify it.
**Classes:** $y \in \{\text{doorbell}, \text{dog bark}, \text{glass breaking}\}$
---
## Example: Audio Event Detection
**Observed:** Audio contains high-frequency burst pattern
**Applying Bayes' theorem:**
| Event | Prior × Likelihood | Posterior |
|-------|-------------------|-----------|
| Doorbell | $0.6 \times 0.3 = 0.18$ | $0.18/0.26 = 0.69$ |
| Dog bark | $0.35 \times 0.1 = 0.035$ | $0.035/0.26 = 0.14$ |
| Glass breaking | $0.05 \times 0.9 = 0.045$ | $0.045/0.26 = 0.17$ |
Evidence: $p_X(x) = 0.18 + 0.035 + 0.045 = 0.26$
**Result:** Despite glass breaking having the highest likelihood (0.9), the **doorbell** has the highest posterior (0.69) because it's much more common!
**The prior matters!**
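The table above can be reproduced in a few lines. This is a minimal sketch using the priors and likelihoods assumed in this example:

```python
# Priors and likelihoods from the audio-event example above
priors = {"doorbell": 0.6, "dog bark": 0.35, "glass breaking": 0.05}
likelihoods = {"doorbell": 0.3, "dog bark": 0.1, "glass breaking": 0.9}

# Evidence p_X(x) = sum over classes of prior * likelihood
evidence = sum(priors[c] * likelihoods[c] for c in priors)  # 0.26

# Posterior for each class via Bayes' theorem
posteriors = {c: priors[c] * likelihoods[c] / evidence for c in priors}
# doorbell ~ 0.69, dog bark ~ 0.13, glass breaking ~ 0.17
```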
---
## From Probabilities to Predictions
**From our audio example, we now have:**
- $p_{Y|X}(\text{doorbell}|x) = 0.69$
- $p_{Y|X}(\text{dog bark}|x) = 0.14$
- $p_{Y|X}(\text{glass breaking}|x) = 0.17$
**Question:** Which class should we actually predict?
**We need a decision rule!**
The **Bayesian Decision Rule** provides a principled way to make predictions based on posterior probabilities and a specified loss function.
---
## Bayesian Decision Rule
**Goal:** Choose the class $\hat{y}$ that minimizes the **expected loss** under the posterior distribution:
$\hat{y} = \arg\min_{\hat{y}} \sum_{y} \mathcal{L}(y, \hat{y}) \cdot p_{Y|X}(y|x)$
The sum runs over all possible true classes $y$: if the true class is $y$ but we predict $\hat{y}$, we incur loss $\mathcal{L}(y, \hat{y})$, weighted by how probable $y$ is under the posterior.
**Where:**
$\mathcal{L}(y, \hat{y})$ = loss incurred for predicting $\hat{y}$ when the true class is $y$, e.g. the 0-1 loss:
$\mathcal{L}(y, \hat{y}) = \begin{cases}
0 & \text{if } y = \hat{y} \\
1 & \text{if } y \neq \hat{y}
\end{cases}$
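The rule can be made concrete with the posteriors from the audio example. A small sketch (posterior values taken from the slides above):

```python
posteriors = {"doorbell": 0.69, "dog bark": 0.14, "glass breaking": 0.17}

def expected_loss(y_hat, posteriors):
    # 0-1 loss: incur loss 1 for every possible true class y != y_hat
    return sum(p for y, p in posteriors.items() if y != y_hat)

losses = {y_hat: expected_loss(y_hat, posteriors) for y_hat in posteriors}
best = min(losses, key=losses.get)
# doorbell has the smallest expected loss (0.14 + 0.17 = 0.31)
```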
---
## From Bayesian Decision Rule to MAP
**Recall:** Bayesian Decision Rule with 0-1 loss
$\hat{y} = \arg\min_{\hat{y}} \sum_{y} \mathcal{L}(y, \hat{y}) \cdot p_{Y|X}(y|x)$
**Step 1:** With 0-1 loss, $\mathcal{L}(y, \hat{y}) = 0$ when $y = \hat{y}$, and $1$ otherwise
$\sum_{y} \mathcal{L}(y, \hat{y}) \cdot p_{Y|X}(y|x) = \sum_{y \neq \hat{y}} 1 \cdot p_{Y|X}(y|x) = \sum_{y \neq \hat{y}} p_{Y|X}(y|x)$
**Step 2:** Since probabilities sum to 1: $\displaystyle\sum_{y} p_{Y|X}(y|x) = p_{Y|X}(\hat{y}|x) + \sum_{y \neq \hat{y}} p_{Y|X}(y|x) = 1$
$\sum_{y \neq \hat{y}} p_{Y|X}(y|x) = 1 - p_{Y|X}(\hat{y}|x)$
---
## Maximum A Posteriori (MAP)
**Step 3:** The expected loss simplifies to:
$\text{Expected Loss} = 1 - p_{Y|X}(\hat{y}|x)$
**Minimizing** $1 - p_{Y|X}(\hat{y}|x)$ $\Leftrightarrow$ **Maximizing** $p_{Y|X}(\hat{y}|x)$
---
## MAP Components
**Uses both:**
1. **Likelihood** $p_{X|Y}(x|y)$: "How well does this class explain the audio features?"
2. **Prior** $p_Y(y)$: "How common is this sound event?"
**MAP prediction:** choose the class that maximizes the product of likelihood and prior.
**Audio Event Detection with MAP:**
- $p_{X|Y}(x|\text{doorbell}) \times p_Y(\text{doorbell}) = 0.3 \times 0.6 = 0.18$
- $p_{X|Y}(x|\text{dog bark}) \times p_Y(\text{dog bark}) = 0.1 \times 0.35 = 0.035$
- $p_{X|Y}(x|\text{glass breaking}) \times p_Y(\text{glass breaking}) = 0.9 \times 0.05 = 0.045$
**MAP prediction: doorbell** ✓ (despite glass breaking having highest likelihood!)
**After normalization:** Divide by the evidence $p_X(x) = 0.26$ to get the posteriors, e.g. $p_{Y|X}(\text{doorbell}|x) = 0.18/0.26 = 0.69$
**MAP requires knowing the prior** $p_Y(y)$
- What if we don't know the prior?
- What if we want to ignore it?
- What if all classes are equally likely?
---
## Maximum Likelihood (ML) Classification
**Idea:** Drop the prior and pick the class under which the observation is most likely:
$\hat{y}_{\text{ML}} = \arg\max_{y} p_{X|Y}(x|y)$
**Properties:**
- Choose class that makes the data **most likely**
- **Only** uses the likelihood
- **Ignores** prior probabilities
**Audio Event Detection with ML:**
- Glass breaking: $p_{X|Y}(\text{high freq burst}|\text{glass}) = 0.9$ ← highest!
- Doorbell: $0.3$, Dog bark: $0.1$
**ML predicts: glass breaking** (ignores that glass breaking is rare!)
**Key insight:** ML is just MAP with an assumption (uniform prior)
$\boxed{\text{ML} \subset \text{MAP}}$
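The difference between the two rules is literally one factor in the code. A sketch, reusing the example's assumed priors and likelihoods:

```python
priors = {"doorbell": 0.6, "dog bark": 0.35, "glass breaking": 0.05}
likelihoods = {"doorbell": 0.3, "dog bark": 0.1, "glass breaking": 0.9}

# ML: likelihood only
ml_pred = max(likelihoods, key=likelihoods.get)

# MAP: prior * likelihood
map_pred = max(priors, key=lambda c: priors[c] * likelihoods[c])

# ml_pred is "glass breaking", map_pred is "doorbell":
# the two rules disagree precisely because the prior is non-uniform.
```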
$
\begin{aligned}
&\textbf{Bayesian Decision Rule} \text{ (most general)}\\\\
&\quad\downarrow\text{ Special case: 0-1 loss}\\\\
&\textbf{MAP} \text{ (uses prior + likelihood)}\\\\
&\quad\downarrow \text{ Special case: uniform prior}\\\\
&\textbf{ML} \text{ (likelihood only)}
\end{aligned}
$
**Each level is a simplification of the one above!**
---
## ML/MAP for Parameter Estimation
**Two applications of ML/MAP:**
1. **Classification** (inference): predict the class $y$ for a new input $x$
2. **Parameter learning** (training): estimate the model parameters $\boldsymbol{\theta}$ from data $\mathcal{D}$
**Same principles, different purposes!**
**About the notation:**
- $p_{Y,X}(y,x)$ ← joint distribution **of** $Y$ and $X$ (no conditioning)
- $p_{Y|X}(y|x)$ ← distribution of $Y$ **given** $X$ (conditioning on $X$)
- $p_{Y|X,\Theta}(y|x,\theta)$ ← distribution of $Y$ **given** $X$ and $\Theta$ (conditioning on both)
The comma after the bar means "AND" — we condition on both simultaneously.
---
## Which Parameters?
**Parameters** $\boldsymbol{\theta}$ define your model (recall from Lecture 2):
| Model | Parameters $\boldsymbol{\theta}$ |
|-------|-------------------|
| Simple linear regression | $\theta_0, \theta_1$ in $f_{\boldsymbol{\theta}}(x) = \theta_0 + \theta_1 x$ |
| Binary classifier | $\theta_0, \theta_1, \theta_2$ in $f_{\boldsymbol{\theta}}(\mathbf{x}) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$ |
| Perceptron | Weights $\mathbf{w}$ and bias $b$ in $\phi(\mathbf{w}^\top \mathbf{x} + b)$ |
| MLP / Neural network | All weights and biases across layers |
| Other NN architectures | Various weights, biases, convolution filters, etc. |
**Goal:** Find the best $\boldsymbol{\theta}^*$ that explains the training data!
---
## From MAP to MLE for Parameters
**Starting point: MAP estimate**
$\boldsymbol{\theta}_{\text{MAP}} = \arg\max_{\boldsymbol{\theta}} p_{\Theta|\mathcal{D}}(\boldsymbol{\theta}|\mathcal{D})$
**Apply Bayes' theorem to the parameters:**
$p_{\Theta|\mathcal{D}}(\boldsymbol{\theta}|\mathcal{D}) = \frac{p_{\mathcal{D}|\Theta}(\mathcal{D}|\boldsymbol{\theta}) \cdot p_\Theta(\boldsymbol{\theta})}{p_\mathcal{D}(\mathcal{D})}$
**Simplify the MAP objective:**
$\boldsymbol{\theta}_{\text{MAP}} = \arg\max_{\boldsymbol{\theta}} \frac{p_{\mathcal{D}|\Theta}(\mathcal{D}|\boldsymbol{\theta}) \cdot p_\Theta(\boldsymbol{\theta})}{p_\mathcal{D}(\mathcal{D})}$
---
## From MAP to MLE (continued)
Since $p_\mathcal{D}(\mathcal{D})$ doesn't depend on $\boldsymbol{\theta}$, we can drop it:
$\boldsymbol{\theta}_{\text{MAP}} = \arg\max_{\boldsymbol{\theta}} p_{\mathcal{D}|\Theta}(\mathcal{D}|\boldsymbol{\theta}) \cdot p_\Theta(\boldsymbol{\theta})$
**Two components:**
- $p_{\mathcal{D}|\Theta}(\mathcal{D}|\boldsymbol{\theta})$ = **Likelihood** of data given parameters
- $p_\Theta(\boldsymbol{\theta})$ = **Prior** on parameters
**What if we have no prior knowledge?** Or assume a uniform prior over $\boldsymbol{\theta}$?
$p_\Theta(\boldsymbol{\theta}) = \text{constant}$
Then MAP reduces to:
$\boldsymbol{\theta}_{\text{MLE}} = \arg\max_{\boldsymbol{\theta}} p_{\mathcal{D}|\Theta}(\mathcal{D}|\boldsymbol{\theta})$
**This is Maximum Likelihood Estimation (MLE)!**
---
## Maximum Likelihood Estimation (MLE)
**Idea:** Choose parameters that make the observed data most likely
Given training data $\mathcal{D} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$:
$\boldsymbol{\theta}_{\text{MLE}} = \arg\max_{\boldsymbol{\theta}} p_{\mathcal{D}|\Theta}(\mathcal{D}|\boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}} \prod_{i=1}^n p_{Y|X,\Theta}(y_i|\mathbf{x}_i, \boldsymbol{\theta})$
**But why does this equality hold?** Let's derive it step by step...
---
## Deriving the MLE Formula
**Step 1: Define the dataset**
$\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$
We can separate into inputs and outputs: $\mathcal{D} = (X, Y)$ where:
- $X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$ (all inputs)
- $Y = \{y_1, y_2, \ldots, y_n\}$ (all outputs)
**Step 2: What does $p_{\mathcal{D}|\Theta}(\mathcal{D}|\boldsymbol{\theta})$ mean?**
$p_{\mathcal{D}|\Theta}(\mathcal{D}|\boldsymbol{\theta}) = p_{X,Y|\Theta}(X, Y|\boldsymbol{\theta})$
This is the joint probability of observing all inputs **and** outputs given parameters $\boldsymbol{\theta}$.
---
## Deriving the MLE Formula (continued)
**Step 3: Apply conditional probability**
$p_{X,Y|\Theta}(X, Y|\boldsymbol{\theta}) = p_{Y|X,\Theta}(Y|X, \boldsymbol{\theta}) \cdot p_{X|\Theta}(X|\boldsymbol{\theta})$
**Key assumption in supervised learning:** We assume inputs $X$ are given/fixed (not modeled by $\boldsymbol{\theta}$), so we focus only on modeling $Y$ given $X$:
$p_{\mathcal{D}|\Theta}(\mathcal{D}|\boldsymbol{\theta}) \propto p_{Y|X,\Theta}(Y|X, \boldsymbol{\theta})$
**Step 4: Expand the joint probability of all outputs**
$p_{Y|X,\Theta}(Y|X, \boldsymbol{\theta}) = p_{Y|X,\Theta}(y_1, y_2, \ldots, y_n|\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n, \boldsymbol{\theta})$
---
## Deriving the MLE Formula (continued)
**Step 5: Apply i.i.d. assumption**
**Critical assumption:** Data points are **independent and identically distributed** (i.i.d.)
**Independence means:** Given $\mathbf{x}_i$ and $\boldsymbol{\theta}$, the output $y_i$ doesn't depend on other data points:
$p_{Y|X,\Theta}(y_i|\mathbf{x}_1, \ldots, \mathbf{x}_n, y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n, \boldsymbol{\theta}) = p_{Y|X,\Theta}(y_i|\mathbf{x}_i, \boldsymbol{\theta})$
**Because of independence, the joint probability factorizes:**
$p_{Y|X,\Theta}(y_1, \ldots, y_n|\mathbf{x}_1, \ldots, \mathbf{x}_n, \boldsymbol{\theta}) = \prod_{i=1}^n p_{Y|X,\Theta}(y_i|\mathbf{x}_i, \boldsymbol{\theta})$
**This is the product form in MLE!** The comma in $p_{Y|X,\Theta}$ indicates conditioning on both $X$ and parameters $\Theta$, not a joint distribution.
---
## MLE in Practice
In practice, use **log-likelihood** (easier to optimize):
$\boldsymbol{\theta}_{\text{MLE}} = \arg\max_{\boldsymbol{\theta}} \sum_{i=1}^n \log p_{Y|X,\Theta}(y_i|\mathbf{x}_i, \boldsymbol{\theta})$
**Why logarithm?**
- Products become sums (easier to compute gradients)
- Prevents numerical underflow with small probabilities
- Doesn't change the argmax (log is monotonic)
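The underflow point is easy to demonstrate. A toy sketch with per-example probabilities around $10^{-4}$ over $n = 1000$ i.i.d. examples (made-up numbers, chosen only to trigger underflow):

```python
import math

probs = [1e-4] * 1000

# The raw product is 1e-4000, far below the smallest float: underflows to 0.0
naive_product = math.prod(probs)

# The log-likelihood stays perfectly representable: 1000 * log(1e-4)
log_likelihood = sum(math.log(p) for p in probs)
```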
---
## Example: Audio Event Detection
**Setup:**
- Data: $\mathcal{D} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$
- $\mathbf{x}_i \in \mathbb{R}^d$: audio features (MFCCs, spectral centroid, zero-crossing rate, etc.)
- $y_i \in \{0, 1\}$: label (1 = glass breaking, 0 = not glass breaking)
- $\boldsymbol{\theta} = \mathbf{w} \in \mathbb{R}^d$: weight vector
**Model:**
$p_{Y|X,\Theta}(y=1|\mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^\top \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^\top \mathbf{x}}}$
where $\sigma$ is the sigmoid function.
---
## Applying MLE
**Likelihood for one example** (handles both $y_i = 0$ and $y_i = 1$):
$p_{Y|X,\Theta}(y_i|\mathbf{x}_i, \mathbf{w}) = \sigma(\mathbf{w}^\top \mathbf{x}_i)^{y_i} \cdot (1-\sigma(\mathbf{w}^\top \mathbf{x}_i))^{1-y_i}$
**Why this form?** The exponents act as "switches" (Bernoulli trick):
| $y_i$ | Formula becomes | Result |
|-------|-----------------|--------|
| 1 | $\sigma(\cdot)^1 \cdot (1-\sigma(\cdot))^0$ | $\sigma(\mathbf{w}^\top \mathbf{x}_i)$ |
| 0 | $\sigma(\cdot)^0 \cdot (1-\sigma(\cdot))^1$ | $1 - \sigma(\mathbf{w}^\top \mathbf{x}_i)$ |
**Full log-likelihood** (assuming i.i.d. data):
$L(\mathbf{w}) = \sum_{i=1}^n \left[ y_i \log \sigma(\mathbf{w}^\top \mathbf{x}_i) + (1-y_i) \log(1-\sigma(\mathbf{w}^\top \mathbf{x}_i)) \right]$
**This is the binary cross-entropy loss!**
**MLE solution** (no closed form → use gradient descent):
$\mathbf{w}_{\text{MLE}} = \arg\max_{\mathbf{w}} \sum_{i=1}^n \left[ y_i \log \sigma(\mathbf{w}^\top \mathbf{x}_i) + (1-y_i) \log(1-\sigma(\mathbf{w}^\top \mathbf{x}_i)) \right]$
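Gradient descent on this objective fits in a few lines. A hedged sketch on synthetic data (the data, learning rate `lr`, and `true_w` are illustrative assumptions, not from the slides); it uses the standard log-likelihood gradient $\sum_i (y_i - \sigma(\mathbf{w}^\top\mathbf{x}_i))\,\mathbf{x}_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # synthetic audio features
true_w = np.array([2.0, -1.0, 0.5])      # assumed ground-truth weights
y = (rng.random(200) < 1 / (1 + np.exp(-X @ true_w))).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.zeros(3)
lr = 0.1
for _ in range(500):
    grad = X.T @ (y - sigmoid(X @ w))    # gradient of the log-likelihood
    w += lr * grad / len(y)              # ascent: maximize L(w)
# w now approximates the MLE (close to true_w up to sampling noise)
```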
---
## Example: Audio Event Detection
**Data:**
| Audio Sample | Features $\mathbf{x}_i$ | Label $y_i$ |
|--------------|------------------------|-------------|
| 1 | [high freq, short burst, sharp onset] | 1 (glass) |
| 2 | [mid freq, sustained, smooth onset] | 0 (not glass) |
| 3 | [high freq, impulse, broadband] | 1 (glass) |
**Goal:** Find weights $\mathbf{w}$ that maximize $\displaystyle\prod_{i=1}^3 p_{Y|X,\Theta}(y_i|\mathbf{x}_i, \mathbf{w})$, such that:
- $\sigma(\mathbf{w}^\top \mathbf{x}_1) \approx 1$ (high probability of glass breaking for sample 1)
- $\sigma(\mathbf{w}^\top \mathbf{x}_2) \approx 0$ (low probability for sample 2)
- $\sigma(\mathbf{w}^\top \mathbf{x}_3) \approx 1$ (high probability of glass breaking for sample 3)
---
## MAP for Parameters
**Idea:** Choose parameters with highest posterior probability given data
**Apply Bayes' theorem to parameters:**
$p_{\Theta|\mathcal{D}}(\boldsymbol{\theta}|\mathcal{D}) = \frac{p_{\mathcal{D}|\Theta}(\mathcal{D}|\boldsymbol{\theta}) \cdot p_\Theta(\boldsymbol{\theta})}{p_\mathcal{D}(\mathcal{D})}$
**MAP estimate:**
$\boldsymbol{\theta}_{\text{MAP}} = \arg\max_{\boldsymbol{\theta}} p_{\Theta|\mathcal{D}}(\boldsymbol{\theta}|\mathcal{D}) = \arg\max_{\boldsymbol{\theta}} p_{\mathcal{D}|\Theta}(\mathcal{D}|\boldsymbol{\theta}) \cdot p_\Theta(\boldsymbol{\theta})$
**Relationship:** $\boxed{\text{MAP} = \text{MLE} + \text{Prior on parameters}}$
---
## Example: Audio Event Detection
**Recall MLE objective:**
$\mathbf{w}_{\text{MLE}} = \arg\max_{\mathbf{w}} \sum_{i=1}^n \left[ y_i \log \sigma(\mathbf{w}^\top \mathbf{x}_i) + (1-y_i) \log(1-\sigma(\mathbf{w}^\top \mathbf{x}_i)) \right]$
**Add a prior:** Assume weights are Gaussian $p_W(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \lambda^{-1}\mathbf{I})$
$\log p_W(\mathbf{w}) = -\frac{\lambda}{2}\|\mathbf{w}\|^2 + \text{const}$
**MAP objective** (log-posterior = log-likelihood + log-prior):
$\mathbf{w}_{\text{MAP}} = \arg\max_{\mathbf{w}} \left[\underbrace{\sum_{i=1}^n \left[ y_i \log \sigma(\mathbf{w}^\top \mathbf{x}_i) + (1-y_i) \log(1-\sigma(\mathbf{w}^\top \mathbf{x}_i)) \right]}_{\text{cross-entropy (from likelihood)}} - \underbrace{\frac{\lambda}{2}\|\mathbf{w}\|^2}_{\text{L2 regularization (from prior)}} \right]$
**This is regularized logistic regression for audio classification!**
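In gradient form, moving from MLE to MAP is a single extra term from the log-prior. A minimal sketch (the name `lam` for the prior precision $\lambda$ is illustrative):

```python
import numpy as np

def map_gradient(w, X, y, lam):
    """Gradient of the log-posterior for logistic regression
    with a Gaussian prior N(0, (1/lam) I) on the weights."""
    sigma = 1 / (1 + np.exp(-X @ w))
    # log-likelihood gradient + log-prior gradient (-lam * w)
    return X.T @ (y - sigma) - lam * w
```

Setting `lam = 0` recovers the plain MLE gradient; increasing `lam` pulls the weights toward zero, which is exactly the L2 penalty.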
---
## Regularization = Bayesian Prior
Regularization is just MAP estimation!
**MLE** (no regularization):
$\mathbf{w}_{\text{MLE}} = \arg\max_{\mathbf{w}} \sum_{i=1}^n \left[ y_i \log \sigma(\mathbf{w}^\top \mathbf{x}_i) + (1-y_i) \log(1-\sigma(\mathbf{w}^\top \mathbf{x}_i)) \right]$
This is **logistic regression with cross-entropy loss**!
**MAP** with Gaussian prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \lambda^{-1}\mathbf{I})$:
$\mathbf{w}_{\text{MAP}} = \arg\max_{\mathbf{w}} \left[ \sum_{i=1}^n \left[ y_i \log \sigma(\mathbf{w}^\top \mathbf{x}_i) + (1-y_i) \log(1-\sigma(\mathbf{w}^\top \mathbf{x}_i)) \right] - \frac{\lambda}{2}\|\mathbf{w}\|^2 \right]$
This is **L2-regularized logistic regression**!
---
## Regularization Types and Priors
| Regularization | Prior Distribution | Effect on Audio Classifier |
|----------------|-------------------|---------------------------|
| None | Uniform (no prior) | MLE: weights can grow large |
| L2 (Ridge) | Gaussian $\mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$ | Prefer small weights for all audio features |
| L1 (Lasso) | Laplace | Prefer sparse weights (few features matter) |
| Elastic Net | Gaussian + Laplace | Combination of both |
**Key insight:**
$\boxed{\text{Regularization} = \text{Prior belief about parameters}}$
**Regularization strength $\lambda$** = How strongly you believe weights should be small
**For audio:** L1 might select only the most discriminative spectral features!
---
## Comparison: Both Applications
| Aspect | Parameter Learning | Classification |
|--------|--------------------|----------------|
| Phase | Training | Testing/Inference |
| Unknown | Parameters $\theta$ | Class $y$ |
| Given | Training data $\mathcal{D}$ | Features $x$, learned $\theta$ |
| MLE/ML | $\arg\max_\theta p_{\mathcal{D}|\Theta}(\mathcal{D}|\theta)$ | $\arg\max_{\hat{y}} p_{X|Y}(x|\hat{y},\theta)$ |
| MAP | $\arg\max_\theta p_{\Theta|\mathcal{D}}(\theta|\mathcal{D})$ | $\arg\max_{\hat{y}} p_{Y|X}(\hat{y}|x,\theta)$ |
---
## Key Takeaways
1. **Probabilistic ML** provides a principled framework for handling uncertainty
2. **Bayesian classification** uses Bayes' theorem to compute $p_{Y|X}(y|x)$
3. **Decision rules** (ML, MAP, Bayesian) tell us how to make predictions
4. **Same principles apply** to both classification (predict $y$) and parameter learning (estimate $\theta$)
5. **MAP is generally preferable** to ML when a reliable prior is available, because it uses more information
6. **Regularization is Bayesian:** It's just MAP with a prior on parameters!
7. **Hierarchy:** Bayesian Decision → MAP → ML (each is a special case)