**What's the reason for the mean square residual error mathematically?**
Assuming an expected prediction squared error over different training sets $D$ and noise realizations $\epsilon$:
$
\mathbb{E}_{D,\epsilon} \left[ (y - \hat{f}_{\boldsymbol{\theta}}(\mathbf{x}))^2 \right]
$
, where $y = f^*(\mathbf{x}) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$ and $\hat{f}_{\boldsymbol{\theta}}(\mathbf{x})$ model prediction with parameters $\boldsymbol{\theta}$ trained on dataset $D$.
This can be decomposed into three components:
$
\underbrace{\left( \mathbb{E}_D \left[ \hat{f}_{\boldsymbol{\theta}}(\mathbf{x}) \right] - f^*(\mathbf{x}) \right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_D \left[ \left( \hat{f}_{\boldsymbol{\theta}}(\mathbf{x}) - \mathbb{E}_D \left[ \hat{f}_{\boldsymbol{\theta}}(\mathbf{x}) \right] \right)^2 \right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible noise}}
$
Full derivation in the [extras section](https://faressc.github.io/dl4ad/extras/mathematical_derivations/02_machine_learning_fundamentals.md).
---
## Bias-Variance Tradeoff Visualization
---
## Bias-Variance Tradeoff Traditional Plot
https://www.ibm.com/think/topics/bias-variance-tradeoff/
---
## Regularization
To prevent overfitting, we can add a regularization term to the empirical risk minimization objective:
$
\boldsymbol{\theta}^* = \argmin\limits_{\boldsymbol{\theta} \in \Theta} \left( \hat{R}(\boldsymbol{\theta}) + \lambda \mathcal{R}(\boldsymbol{\theta})\right)
$
where $\mathcal{R}(\boldsymbol{\theta})$ is the regularization term (e.g., L1 or L2 norm) and $\lambda$ is a hyperparameter that controls the strength of the regularization.
**Common Regularization Terms**:
- L2 Regularization (Ridge): $\mathcal{R}(\boldsymbol{\theta}) = \lVert \boldsymbol{\theta} \rVert_2^2 = \sum_{j=1}^{p} \theta_j^2$, with $p$ denoting the number of parameters, encourages smaller parameter values, leading to smoother functions.
- L1 Regularization (Lasso): $\mathcal{R}(\boldsymbol{\theta}) = \lVert \boldsymbol{\theta} \rVert_1 = \sum_{j=1}^{p} |\theta_j|$, encourages sparsity in the parameters, leading to simpler models.
---
## Modern Risk Curves
Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32), 15849–15854. https://doi.org/10.1073/pnas.1903070116
---
## Modern Risk Curves: Double Descent
**Phenomenon**: As model capacity increases, test risk first decreases, then spikes at the interpolation threshold, and finally decreases again in the overparameterized regime
| Regime | Model Cap. | Generalization Behavior |
|--------|---------|------------------------|
| **Underparameterized** | < Data complexity | Classical U-curve: bias-variance tradeoff |
| **Interpolation Threshold** | ≈ Training samples | **Peak risk!** Only one way to fit data → worst generalization |
| **Overparameterized** | ≫ Training samples | Second descent: Many solutions → inductive bias picks the ones that generalize well |
**Bottom Line**: Modern deep learning challenges classical understanding — overparameterized models can generalize well despite fitting training data perfectly — e.g. optimization algorithms implicitly regularize by finding simple solutions among many possible interpolations
---
## Gradient Descent
**Goal**: Find parameters that minimize the empirical risk:
$
\boldsymbol{\theta}^* = \argmin\limits_{\boldsymbol{\theta} \in \Theta} \hat{R}(\boldsymbol{\theta})
$
**Gradient descent iteratively updates**:
$
\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \nabla_{\boldsymbol{\theta}} \hat{R}(\boldsymbol{\theta}_t), \text{ for } t = 0, 1, 2, \ldots, T
$
where $\eta > 0$ is the learning rate and $T$ is the total number of iterations.
- **Convergence**: Steepest descent → critical point where $\nabla_{\boldsymbol{\theta}} \hat{R}(\boldsymbol{\theta}^*) = 0$
- **Critical points**: Local minima (stable, target) / Maxima (unstable, avoided) / Saddle points (unstable, avoided)
- **Note**: GD with appropriate $\eta$ converges to local minima or saddle points, not maxima (SGD adds noise that helps escape saddle points)
---
## Gradient Descent Algorithm
```python
def gradient_descent(theta_initial, learning_rate, max_iterations,
compute_gradient, tolerance=1e-6):
"""
Gradient Descent optimization algorithm.
Args:
theta_initial: Initial parameters (numpy array), typically sampled
from N(0, 0.1) for each parameter
learning_rate: Learning rate η > 0
max_iterations: Maximum number of iterations T
compute_gradient: Function that computes ∇_θ R̂(θ)
tolerance: Convergence tolerance δ (optional)
Returns:
theta_star: Optimized parameters θ*
"""
theta = theta_initial.copy()
for t in range(max_iterations):
# Compute gradient at current parameters
g = compute_gradient(theta)
# Update parameters
theta = theta - learning_rate * g
# Check convergence criterion
if np.linalg.norm(g) < tolerance:
break
return theta
```
**Key considerations**: Learning rate $\eta$ controls step size — too large may overshoot, too small may converge slowly
---
## Gradient Descent Visualization
---
## Variants of Optimization Algorithms
**Gradient Descent (GD)**: Full-batch updates using entire dataset to compute gradients
$
\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \nabla_{\boldsymbol{\theta}} \hat{R}(\boldsymbol{\theta}_t) = \boldsymbol{\theta}_t - \frac{\eta}{N} \sum_{i=1}^{N} \nabla_{\boldsymbol{\theta}} \ell(f_{\boldsymbol{\theta}_t}(\mathbf{x}_i), \mathbf{y}_i)
$
**Stochastic Gradient Descent (SGD)**: Updates parameters using gradient from a single randomly selected sample per iteration
$
\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \nabla_{\boldsymbol{\theta}} \ell(f_{\boldsymbol{\theta}_t}(\mathbf{x}_{i_t}), \mathbf{y}_{i_t}), \quad \text{where } i_t \sim \text{Uniform}(\{1, \ldots, N\})
$
**Mini-batch Gradient Descent**: Updates parameters using gradients from a small random subset (mini-batch)
$
\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \frac{\eta}{|B_t|} \sum_{i \in B_t} \nabla_{\boldsymbol{\theta}} \ell(f_{\boldsymbol{\theta}_t}(\mathbf{x}_i), \mathbf{y}_i), \quad \text{where } B_t \subset \{1, \ldots, N\}, |B_t| = b
$
**Other Variants**: Momentum, AdaGrad, RMSProp, Adam — adapt learning rates and incorporate momentum
---
## The Key Components
0
Dataset ($D$): A collection of input-output pairs $D = \lbrace(\mathbf{x}_i, \mathbf{y}_i)\rbrace_{i=1}^{N}$. The dataset must be representative of the true underlying function $f^*: \mathcal{X} \to \mathcal{Y}$.
1
Function or Model ($f_{\boldsymbol{\theta}}$): A parameterized function that maps inputs $\mathbf{x}_i$ to predicted outputs $\hat{\mathbf{y}}_i = f_{\boldsymbol{\theta}}(\mathbf{x}_i)$. The choice of function defines the function space $\mathcal{F}_{\Theta}$.
2
Parameters ($\boldsymbol{\theta}$): The set of parameters that define the specific function within the function space. These parameters are adjusted during training to minimize the empirical risk.
3
Loss Function ($\mathcal{L}$): A function that quantifies the difference between the predicted outputs $\hat{\mathbf{y}}_i$ and the true labels $\mathbf{y}_i$. The choice of loss function depends on the task (e.g., regression vs. classification).
4
Optimization Algorithm: A method for adjusting the parameters $\boldsymbol{\theta}$ to minimize the empirical risk $\hat{R}(\boldsymbol{\theta})$. Common algorithms include Gradient Descent and its variants.
---
## Example: Simple Linear Regression (Formulation)
- **Function**: $f_{\boldsymbol{\theta}}(x): \mathbb{R} \to \mathbb{R}$ defined as:
$
f_{\boldsymbol{\theta}}(x) = \theta_0 + \theta_1 x
$
- **Parameter space**: $\Theta = \mathbb{R}^2$ with parameters $\boldsymbol{\theta} = (\theta_0, \theta_1)$
- **Dataset**: $D = \lbrace(x_i, y_i)\rbrace$ for $i = 1, \ldots, N$
- **Input space**: $\mathcal{X} = \mathbb{R}$
- **Output space**: $\mathcal{Y} = \mathbb{R}$
- **Loss function**: Mean Squared Error (MSE):
$
\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} (y_i - f_{\boldsymbol{\theta}}(x_i))^2
$
---
## Example: Simple Linear Regression
---
## Example: Binary Classification (Formulation)
- **Function**: $f_{\boldsymbol{\theta}}(x): \mathbb{R}^2 \to \mathbb{R}$ defined as:
$
f_{\boldsymbol{\theta}}(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \text{, with } \hat{y} = \text{sign}(f_{\boldsymbol{\theta}}(x))
$
- **Parameter space**: $\Theta = \mathbb{R}^3$ with parameters $\boldsymbol{\theta} = (\theta_0, \theta_1, \theta_2)$
- **Dataset**: $D = \lbrace(x_i, y_i)\rbrace$ for $i = 1, \ldots, N$
- **Input space**: $\mathcal{X} = \mathbb{R}^2$
- **Output space**: $\mathcal{Y} = \lbrace -1, +1 \rbrace$ (binary labels)
- **Loss function**: Mean hinge loss:
$
\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \max(0, 1 - y_i f_{\boldsymbol{\theta}}(x_i))
$
---
## Binary Classification
---
# Python Implementation