$
\hat{y} =
\begin{cases}
1 & \text{if } \sum_{i=1}^{N} w_i x_i \geq T \\
0 & \text{otherwise}
\end{cases}
\quad\to\quad
\hat{y} = \phi\left( \sum_{i=1}^{N} w_i x_i - T \right)\text{ where } \phi(z) =
\begin{cases}
1 & \text{if } z \geq 0 \\
0 & \text{otherwise}
\end{cases}
$
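As a minimal sketch of the threshold unit above, the following computes the weighted sum and compares it with $T$. The function names and the AND example (weights and threshold) are illustrative choices, not taken from the slides.

```python
def step(z):
    """Heaviside step activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0


def threshold_unit(x, w, T):
    """Fire (output 1) when the weighted sum of the inputs reaches the threshold T."""
    weighted_sum = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return step(weighted_sum - T)


# Hypothetical example: weights (1, 1) with threshold 1.5 implement logical AND.
and_outputs = [threshold_unit([a, b], [1.0, 1.0], 1.5) for a in (0, 1) for b in (0, 1)]
print(and_outputs)  # [0, 0, 0, 1]
```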
---
## Frank Rosenblatt's Perceptron (1958)
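Writing the threshold as a bias $b = -T$ gives $\hat{y} = \phi\left( \mathbf{w}^\top \mathbf{x} + b \right)$. If $\mathbf{x}$ also includes a constant bias input $x_{0} = 1$, we can fold $b$ into the weights: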
$
\hat{y} = \phi\left( \mathbf{w}^\top \mathbf{x} \right) \text{ where } \mathbf{x} = [1, x_1, x_2, \ldots, x_N]^\top \text{ and } \mathbf{w} = [b, w_1, w_2, \ldots, w_N]^\top
$
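The bias trick can be checked numerically. This sketch assumes $b = -T$ and uses hypothetical helper names; it compares the explicit-bias form with the folded form on the four binary inputs.

```python
def step(z):
    return 1 if z >= 0 else 0


def predict_explicit_bias(x, w, b):
    """phi(w . x + b) with the bias kept as a separate parameter."""
    return step(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)


def predict_folded(x, w_aug):
    """phi(w_aug . x_aug) with x_aug = [1, x_1, ..., x_N] and w_aug = [b, w_1, ..., w_N]."""
    x_aug = [1.0] + list(x)
    return step(sum(w_i * x_i for w_i, x_i in zip(w_aug, x_aug)))


w, b = [1.0, 1.0], -1.5  # illustrative values: b = -T with T = 1.5
inputs = [[a, c] for a in (0.0, 1.0) for c in (0.0, 1.0)]
same = all(predict_explicit_bias(x, w, b) == predict_folded(x, [b] + w) for x in inputs)
print(same)  # True
```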
---
## Recall: Simple Linear Regression
- **Function**: $f_{\boldsymbol{\theta}}(x): \mathbb{R} \to \mathbb{R}$ defined as:
$
f_{\boldsymbol{\theta}}(x) = \theta_0 + \theta_1 x
$
- **Parameter space**: $\Theta = \mathbb{R}^2$ with parameters $\boldsymbol{\theta} = (\theta_0, \theta_1)$
- **Dataset**: $D = \lbrace(x_i, y_i)\rbrace$ for $i = 1, \ldots, N$
- **Input space**: $\mathcal{X} = \mathbb{R}$
- **Output space**: $\mathcal{Y} = \mathbb{R}$
- **Loss function**: Mean Squared Error (MSE):
$
\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} (y_i - f_{\boldsymbol{\theta}}(x_i))^2
$
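A short sketch of the model and loss above, using a made-up three-point dataset that lies exactly on $y = 1 + 2x$:

```python
def f(theta, x):
    """Simple linear regression: f_theta(x) = theta_0 + theta_1 * x."""
    theta0, theta1 = theta
    return theta0 + theta1 * x


def mse(theta, data):
    """Mean squared error over a dataset of (x_i, y_i) pairs."""
    return sum((y - f(theta, x)) ** 2 for x, y in data) / len(data)


# Illustrative dataset generated by y = 1 + 2x (not from the slides).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
print(mse((1.0, 2.0), data))  # 0.0 (the generating parameters)
print(mse((0.0, 2.0), data))  # 1.0 (every prediction is off by 1)
```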
---
## Example: Simple Linear Regression
---
## Comparison to Linear Regression
$
\begin{aligned}
\text{Perceptron: } & \quad f_{\mathbf{w}}(\mathbf{x}) = \phi\left( \mathbf{w}^\top \mathbf{x} \right), \quad \phi(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{otherwise} \end{cases}\\
\text{Linear Regression: } & \quad f_{\boldsymbol{\theta}}(\mathbf{x}) = \boldsymbol{\theta}^\top \mathbf{x}
\end{aligned}
$
**Key Differences**:
- **Output space**: Perceptron outputs binary labels $\mathcal{Y} = \lbrace 0, 1\rbrace$; Linear regression outputs continuous values $\mathcal{Y} = \mathbb{R}$
- **Function space**: Both belong to $\mathcal{F}_1^{(n)}$ **before** activation; the perceptron then adds non-linearity via $\phi$
---
## Example: Binary Classification
- **Function**: $f_{\boldsymbol{\theta}}(x): \mathbb{R}^2 \to \mathbb{R}$ defined as:
$
f_{\boldsymbol{\theta}}(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 \text{, with } \hat{y} = \text{sign}(f_{\boldsymbol{\theta}}(x))
$
- **Parameter space**: $\Theta = \mathbb{R}^3$ with parameters $\boldsymbol{\theta} = (\theta_0, \theta_1, \theta_2)$
- **Dataset**: $D = \lbrace(x_i, y_i)\rbrace$ for $i = 1, \ldots, N$
- **Input space**: $\mathcal{X} = \mathbb{R}^2$
- **Output space**: $\mathcal{Y} = \lbrace -1, +1 \rbrace$ (binary labels)
- **Loss function**: Mean hinge loss:
$
\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \max(0, 1 - y_i f_{\boldsymbol{\theta}}(x_i))
$
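The classifier and its hinge loss can be sketched as follows. The dataset and parameter values are invented for illustration, and `sign` is assumed to return $+1$ at zero.

```python
def f(theta, x):
    """Linear score: theta_0 + theta_1 * x_1 + theta_2 * x_2."""
    theta0, theta1, theta2 = theta
    return theta0 + theta1 * x[0] + theta2 * x[1]


def sign(z):
    """Assumed convention: sign(0) = +1."""
    return 1 if z >= 0 else -1


def mean_hinge_loss(theta, data):
    """Average of max(0, 1 - y_i * f(x_i)) with labels y_i in {-1, +1}."""
    return sum(max(0.0, 1.0 - y * f(theta, x)) for x, y in data) / len(data)


theta = (0.0, 1.0, 1.0)  # illustrative parameters
data = [((2.0, 1.0), +1), ((-1.0, -2.0), -1), ((0.5, 0.0), +1)]
predictions = [sign(f(theta, x)) for x, _ in data]
loss = mean_hinge_loss(theta, data)
print(predictions, loss)
```

The third point is classified correctly but sits inside the margin ($y f = 0.5 < 1$), so it is the only one that contributes to the loss.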
---
## Binary Classification
---
## Comparison to Binary Classification
Both the perceptron and linear binary classifiers perform **binary classification** using linear decision boundaries:
$
\begin{aligned}
\text{Perceptron: } & \quad f_{\mathbf{w}}(\mathbf{x}) = \phi\left( \mathbf{w}^\top \mathbf{x} \right), \quad \phi(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{otherwise} \end{cases}\\
\text{Linear Classifier: } & \quad f_{\boldsymbol{\theta}}(\mathbf{x}) = \boldsymbol{\theta}^\top \mathbf{x} \text{, with } \hat{y} = \text{sign}(f_{\boldsymbol{\theta}}(\mathbf{x})) = \begin{cases} +1 & \text{if } f_{\boldsymbol{\theta}}(\mathbf{x}) \geq 0 \\ -1 & \text{otherwise} \end{cases}
\end{aligned}
$
**Key Similarities**:
- **Decision boundary**: Both use a linear hyperplane to separate classes
- **Function space**: Both belong to $\mathcal{F}_1^{(n)}$ before applying the output function
If we change the perceptron activation to a sign function, both models become equivalent!
→ This means we can use the same training algorithm for both models!
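One way to see the equivalence: relabeling the step output with $y \mapsto 2y - 1$ reproduces the sign output for every pre-activation. A small check, with hypothetical helper names:

```python
def step(z):
    """Perceptron activation with outputs in {0, 1}."""
    return 1 if z >= 0 else 0


def sign(z):
    """Sign activation with outputs in {-1, +1}; sign(0) = +1 as defined above."""
    return 1 if z >= 0 else -1


def to_sign_label(y01):
    """Map a {0, 1} label to {-1, +1} via y -> 2y - 1."""
    return 2 * y01 - 1


zs = [-2.0, -0.1, 0.0, 0.3, 5.0]
agree = all(to_sign_label(step(z)) == sign(z) for z in zs)
print(agree)  # True
```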
---
## Is an Activation Function Really Necessary?
Consider a 2-layer network **without** activation functions:
$
\begin{aligned}
\mathbf{h}^{(1)} &= \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)} \\
\hat{\mathbf{y}} &= \mathbf{W}^{(2)} \mathbf{h}^{(1)} + \mathbf{b}^{(2)}
\end{aligned}
$
Substituting the first equation into the second:
$
\begin{aligned}
\hat{\mathbf{y}} &= \mathbf{W}^{(2)} (\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}) + \mathbf{b}^{(2)} \\
&= \mathbf{W}^{(2)} \mathbf{W}^{(1)} \mathbf{x} + \mathbf{W}^{(2)} \mathbf{b}^{(1)} + \mathbf{b}^{(2)}
\end{aligned}
$
This is equivalent to a **single linear layer**:
$
\hat{\mathbf{y}} = \mathbf{W} \mathbf{x} + \mathbf{b}
$
where $\mathbf{W} = \mathbf{W}^{(2)} \mathbf{W}^{(1)}$ and $\mathbf{b} = \mathbf{W}^{(2)} \mathbf{b}^{(1)} + \mathbf{b}^{(2)}$
**Key Insight**: Without non-linear activation functions, stacking multiple layers is equivalent to a single linear transformation!
→ The network cannot learn non-linear decision boundaries
→ Activation functions are **essential** for deep learning
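The collapse can be verified numerically. This sketch uses small hand-picked matrices (illustrative values) and plain Python lists; it compares the two-layer output with the single merged layer.

```python
def matvec(W, x):
    """Matrix-vector product W x for a list-of-rows matrix."""
    return [sum(w * x_j for w, x_j in zip(row, x)) for row in W]


def matmul(A, B):
    """Matrix-matrix product A B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # layer 1: 2 -> 3
b1 = [0.5, -0.5, 1.0]
W2 = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.0]]    # layer 2: 3 -> 2
b2 = [0.0, 1.0]

# Merged parameters: W = W2 W1 and b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

x = [1.0, -2.0]
two_layer = add(matvec(W2, add(matvec(W1, x), b1)), b2)
one_layer = add(matvec(W, x), b)
print(two_layer, one_layer)  # [-4.5, -2.5] [-4.5, -2.5]
```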
---
## Differentiable Activation Functions
To enable gradient flow through the activation as well, we can use differentiable alternatives such as:
| Activation | Function | Derivative |
| --- | --- | --- |
| Sigmoid | $\sigma(z) = \frac{1}{1 + e^{-z}}$ | $\frac{d\sigma}{dz} = \sigma(z)(1 - \sigma(z))$ |
| Tanh | $\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$ | $\frac{d\tanh}{dz} = 1 - \tanh^2(z)$ |
| ReLU | $\text{ReLU}(z) = \max(0, z)$ | $\frac{d\text{ReLU}}{dz} = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}$ |
| Leaky ReLU | $\text{LeakyReLU}(z) = \max(\alpha z, z)$ | $\frac{d\text{LeakyReLU}}{dz} = \begin{cases} 1 & \text{if } z > 0 \\ \alpha & \text{otherwise} \end{cases}$ |
Our gradients can now flow through the activation function!
→ We can use loss functions that depend on the final output of the perceptron!
→ We can chain multiple perceptrons together to form multi-layer perceptrons (MLPs)!
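The derivative formulas for these activations can be sanity-checked against central finite differences. In this sketch, $\alpha = 0.01$ and the test points are arbitrary choices; $z = 0$ is avoided because ReLU and Leaky ReLU have a kink there.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2


def relu(z):
    return max(0.0, z)


def d_relu(z):
    return 1.0 if z > 0 else 0.0


def leaky_relu(z, alpha=0.01):
    return max(alpha * z, z)


def d_leaky_relu(z, alpha=0.01):
    return 1.0 if z > 0 else alpha


def numerical_derivative(f, z, eps=1e-6):
    """Central finite difference approximation of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)


pairs = [(sigmoid, d_sigmoid), (math.tanh, d_tanh), (relu, d_relu), (leaky_relu, d_leaky_relu)]
points = [-2.0, -0.5, 0.7, 3.0]
max_error = max(abs(numerical_derivative(f, z) - df(z)) for f, df in pairs for z in points)
print(max_error < 1e-6)  # True
```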
---
## Example: Binary Classification with MSE
- **Function**: $f_{\boldsymbol{\theta}}(x): \mathbb{R}^2 \to (-1, 1)$ defined as:
$
f_{\boldsymbol{\theta}}(x) = \tanh(w_0 + w_1 x_1 + w_2 x_2)
$
- **Parameter space**: $\Theta = \mathbb{R}^3$ with parameters $\boldsymbol{\theta} = (w_0, w_1, w_2)$
- **Dataset**: $D = \lbrace(x_i, y_i)\rbrace$ for $i = 1, \ldots, N$
- **Input space**: $\mathcal{X} = \mathbb{R}^2$
- **Output space**: $\mathcal{Y} = (-1, 1)$
- **Loss function**: Mean squared error (MSE):
$
\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{y}_i \right)^2
$
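Because $\tanh$ is differentiable, the MSE gradient can be pushed through the activation with the chain rule. Below is a rough sketch of gradient descent on this model; the dataset, learning rate, and step count are illustrative assumptions.

```python
import math


def predict(theta, x):
    """f_theta(x) = tanh(w_0 + w_1 * x_1 + w_2 * x_2)."""
    w0, w1, w2 = theta
    return math.tanh(w0 + w1 * x[0] + w2 * x[1])


def mse(theta, data):
    return sum((y - predict(theta, x)) ** 2 for x, y in data) / len(data)


def mse_gradient(theta, data):
    """Gradient of the MSE with respect to (w_0, w_1, w_2)."""
    grad = [0.0, 0.0, 0.0]
    for x, y in data:
        y_hat = predict(theta, x)
        # Chain rule: d(y - y_hat)^2 / d y_hat = -2 (y - y_hat), and tanh'(z) = 1 - y_hat^2.
        dz = -2.0 * (y - y_hat) * (1.0 - y_hat ** 2) / len(data)
        for k, x_k in enumerate((1.0, x[0], x[1])):  # the input paired with w_0 is the constant 1
            grad[k] += dz * x_k
    return grad


# Illustrative, linearly separable data with labels in {-1, +1}.
data = [((2.0, 1.0), 1.0), ((1.0, 2.0), 1.0), ((-1.0, -2.0), -1.0), ((-2.0, -1.0), -1.0)]
theta = [0.0, 0.0, 0.0]
eta = 0.1  # assumed learning rate
losses = [mse(theta, data)]
for _ in range(50):
    grad = mse_gradient(theta, data)
    theta = [t - eta * g for t, g in zip(theta, grad)]
    losses.append(mse(theta, data))
print(losses[0], losses[-1])
```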
---
## Example: Binary Classification with Sigmoid
- **Function**: $f_{\boldsymbol{\theta}}(x): \mathbb{R}^2 \to (0, 1)$ defined as:
$
f_{\boldsymbol{\theta}}(x) = \sigma(w_0 + w_1 x_1 + w_2 x_2) \text{, with } \sigma(z) = \frac{1}{1 + e^{-z}}
$
- **Parameter space**: $\Theta = \mathbb{R}^3$ with parameters $\boldsymbol{\theta} = (w_0, w_1, w_2)$
- **Dataset**: $D = \lbrace(x_i, y_i)\rbrace$ for $i = 1, \ldots, N$
- **Input space**: $\mathcal{X} = \mathbb{R}^2$
- **Output space**: $\mathcal{Y} = (0, 1)$ (probabilistic outputs)
- **Loss function**: Binary cross-entropy loss:
$
\mathcal{L}(\boldsymbol{\theta}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
$
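A minimal sketch of the sigmoid model and the binary cross-entropy loss, with labels in $\lbrace 0, 1 \rbrace$ and invented data. With all parameters at zero every prediction is $0.5$, so the loss equals $\log 2$.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(theta, x):
    """f_theta(x) = sigmoid(w_0 + w_1 * x_1 + w_2 * x_2), read as P(y = 1 | x)."""
    w0, w1, w2 = theta
    return sigmoid(w0 + w1 * x[0] + w2 * x[1])


def binary_cross_entropy(theta, data):
    """Mean BCE over (x_i, y_i) pairs with y_i in {0, 1}."""
    total = 0.0
    for x, y in data:
        y_hat = predict(theta, x)
        total += y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat)
    return -total / len(data)


data = [((2.0, 1.0), 1), ((-1.0, -2.0), 0)]  # illustrative dataset
loss_zero = binary_cross_entropy((0.0, 0.0, 0.0), data)
loss_good = binary_cross_entropy((0.0, 1.0, 1.0), data)
print(loss_zero, loss_good)
```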
---
## Multilayer Perceptrons
Input Values: $\mathbf{x} = [x_1, x_2, \ldots, x_N]^\top$ represent the features fed into the perceptron.
Bias Term: $1$ is added to the input vector to allow shifting the activation threshold.
Output Values: $\hat{\mathbf{y}} = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_K]^\top$ represent the predicted outputs of the perceptron.
Hidden Units: $\mathbf{h}^{(l)} = [h_1^{(l)}, h_2^{(l)}, \ldots, h_{M^{(l)}}^{(l)}]^\top$ represent intermediate computations within the $l$-th layer.
Generated with https://alexlenail.me/NN-SVG/
---
## Forward Propagation
**Hidden Layer Computation:**
$
\begin{aligned}
\mathbf{z}^{(l)} & = \mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\text{ or} \\
\mathbf{z}^{(l)} & = \mathbf{W}^{(l)} \mathbf{h}^{(l-1)}\\
\mathbf{h}^{(l)} & = \sigma(\mathbf{z}^{(l)})
\end{aligned}
$
where:
- $\mathbf{W}^{(l)} \in \mathbb{R}^{M \times M'}$ is the weight matrix
- $\mathbf{b}^{(l)} \in \mathbb{R}^{M}$ is the bias vector
- $\sigma(\cdot)$ is the activation function
- $\mathbf{h}^{(0)} = \mathbf{x}$ (input layer)
**Output Layer Computation:**
$
\begin{aligned}
\mathbf{z}^{(L)} & = \mathbf{W}^{(L)} \mathbf{h}^{(L-1)} + \mathbf{b}^{(L)} \text{ or} \\
\mathbf{z}^{(L)} & = \mathbf{W}^{(L)} \mathbf{h}^{(L-1)}\\
\hat{\mathbf{y}} & = \sigma_{L}(\mathbf{z}^{(L)})
\end{aligned}
$
where:
- $\mathbf{W}^{(L)} \in \mathbb{R}^{K \times M}$ is the output weight matrix
- $\mathbf{b}^{(L)} \in \mathbb{R}^{K}$ is the output bias vector
- $\sigma_{L}(\cdot)$ is the output activation
- $L$ is the index of the output layer (the last layer), so $\mathbf{h}^{(L-1)}$ is the last hidden layer
In element-wise form, each neuron computes:
$
\begin{aligned}
z_j^{(l)} & = \sum_{i=1}^{M'} W_{ji}^{(l)} h_i^{(l-1)} + b_j^{(l)} \\
h_j^{(l)} & = \sigma(z_j^{(l)})
\end{aligned}
$
where:
- $i$ indexes neurons in the previous layer
- $j$ indexes neurons in the current layer
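The forward pass can be sketched with plain lists, one `(W, b, activation)` triple per layer. The layer sizes, weights, and the choice of $\tanh$ for the hidden layer and sigmoid for the output are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(W, b, h_prev, activation):
    """One layer: z_j = sum_i W_ji * h_i + b_j, then h_j = activation(z_j)."""
    z = [sum(w_ji * h_i for w_ji, h_i in zip(row, h_prev)) + b_j for row, b_j in zip(W, b)]
    return z, [activation(z_j) for z_j in z]


def forward(layers, x):
    """Propagate x through every layer, starting from h^(0) = x."""
    h = x
    for W, b, activation in layers:
        _, h = dense(W, b, h, activation)
    return h


layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]], [0.0, 0.0, 0.0], math.tanh),  # hidden: 2 -> 3
    ([[1.0, 1.0, 1.0]], [0.0], sigmoid),                                   # output: 3 -> 1
]
y_hat = forward(layers, [1.0, 1.0])
print(y_hat)
```

For the input $[1, 1]$ the hidden pre-activations are $[0, 1, 1]$, so the output is $\sigma(2\tanh(1))$.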
---
## Backpropagation
How do we train a multilayer perceptron with many layers?
Compute the gradients of the loss $\mathcal{L}$ with respect to the parameters of each layer $l$, then update those parameters using gradient descent or one of its variants:
$
\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} \quad \text{and} \quad \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}}
$
---
## Backpropagation
The loss $\mathcal{L}$ has a **deep dependency chain**:
$\mathcal{L}$ depends on $\hat{\mathbf{y}}$
↓ which depends on $\mathbf{W}^{(L)}$, $\mathbf{b}^{(L)}$, and $\mathbf{h}^{(L-1)}$
↓ where $\mathbf{h}^{(L-1)}$ depends on $\mathbf{W}^{(L-1)}$, $\mathbf{b}^{(L-1)}$, and $\mathbf{h}^{(L-2)}$
↓ and so on...
Backpropagation is an efficient algorithm to compute these gradients using the chain rule!
---
## Backpropagation: Output Layer
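MSE Loss: $\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\Vert\mathbf{y}_i - \hat{\mathbf{y}}_i\Vert^2 = \frac{1}{N}\sum_{i=1}^{N}\sum_{j}(y_{ij} - \hat{y}_{ij})^2$, with per-sample term $\mathcal{L}_i = \frac{1}{N}\Vert\mathbf{y}_i - \hat{\mathbf{y}}_i\Vert^2$ so that $\mathcal{L} = \sum_{i=1}^{N} \mathcal{L}_i$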
**Step 1**: Compute gradient w.r.t. output layer pre-activation $\mathbf{z}_i^{(L)}$ for each sample $i$
Apply the **chain rule**: $\mathcal{L}_i$ depends on $\mathbf{z}_i^{(L)}$ through $\hat{\mathbf{y}}_i$. For each sample $i$ and output neuron $j$:
$
\frac{\partial \mathcal{L}_i}{\partial z_{ij}^{(L)}} = \color{#FF6B6B}{\frac{\partial \mathcal{L}_i}{\partial \hat{y}_{ij}}} \color{black}{\cdot} \color{#4ECDC4}{\frac{\partial \hat{y}_{ij}}{\partial z_{ij}^{(L)}}} \color{black}{=} \color{#FF6B6B}{\frac{2}{N}(\hat{y}_{ij} - y_{ij})} \color{black}{\cdot} \color{#4ECDC4}{\sigma'(z_{ij}^{(L)})}
$
In vector form for sample $i$, this gives us the **error term**:
$
\boldsymbol{\delta}_i^{(L)} = \frac{2}{N}(\hat{\mathbf{y}}_i - \mathbf{y}_i) \odot \sigma'(\mathbf{z}_i^{(L)})
$
where $\odot$ is element-wise multiplication.
---
## Backpropagation: Output Layer
**Step 2**: Compute gradients w.r.t. weights and biases for sample $i$
Given $\boldsymbol{\delta}_i^{(L)} = \frac{\partial \mathcal{L}_i}{\partial \mathbf{z}_i^{(L)}}$ for sample $i$ and forward pass $z_{ij}^{(L)} = \sum_{k=1}^{M^{(L-1)}} W_{jk}^{(L)} h_{ik}^{(L-1)} + b_j^{(L)}$:
**Weight gradients**: Apply chain rule to $W_{jk}^{(L)}$ (weight connecting neuron $k$ in layer $L-1$ to neuron $j$ in layer $L$)
$
\frac{\partial \mathcal{L}_i}{\partial W_{jk}^{(L)}} = \color{#FF6B6B}{\frac{\partial \mathcal{L}_i}{\partial z_{ij}^{(L)}}} \color{black}{\cdot} \color{#4ECDC4}{\frac{\partial z_{ij}^{(L)}}{\partial W_{jk}^{(L)}}} \color{black}{=} \color{#FF6B6B}{\delta_{ij}^{(L)}} \color{black}{\cdot} \color{#4ECDC4}{h_{ik}^{(L-1)}}
$
In matrix form: $\frac{\partial \mathcal{L}_i}{\partial \mathbf{W}^{(L)}} = \boldsymbol{\delta}_i^{(L)} (\mathbf{h}_i^{(L-1)})^\top$
**Bias gradients**: Apply chain rule to $b_j^{(L)}$ (bias for neuron $j$ in layer $L$)
$
\frac{\partial \mathcal{L}_i}{\partial b_j^{(L)}} = \color{#FF6B6B}{\frac{\partial \mathcal{L}_i}{\partial z_{ij}^{(L)}}} \color{black}{\cdot} \color{#4ECDC4}{\frac{\partial z_{ij}^{(L)}}{\partial b_j^{(L)}}} \color{black}{=} \color{#FF6B6B}{\delta_{ij}^{(L)}} \color{black}{\cdot} \color{#4ECDC4}{1} \color{black}{=} \delta_{ij}^{(L)}
$
---
## Backpropagation: Hidden Layers
**Step 3**: Propagate error backwards to hidden layer $l$ for sample $i$
To compute $\frac{\partial \mathcal{L}_i}{\partial z_{ij}^{(l)}}$, we use the chain rule through layer $l+1$, as $\mathcal{L}_i$ depends on $z_{ij}^{(l)}$ via all neurons in layer $l+1$:
$
\frac{\partial \mathcal{L}_i}{\partial z_{ij}^{(l)}} = \sum_{m=1}^{M^{(l+1)}} \color{#FF6B6B}{\frac{\partial \mathcal{L}_i}{\partial z_{im}^{(l+1)}}} \color{black}{\cdot} \color{#95E1D3}{\frac{\partial z_{im}^{(l+1)}}{\partial h_{ij}^{(l)}}} \color{black}{\cdot} \color{#4ECDC4}{\frac{\partial h_{ij}^{(l)}}{\partial z_{ij}^{(l)}}}
$
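where $\color{#FF6B6B}{\frac{\partial \mathcal{L}_i}{\partial z_{im}^{(l+1)}} = \delta_{im}^{(l+1)}}$ is the error term of neuron $m$ in the next layer, $\color{#95E1D3}{\frac{\partial z_{im}^{(l+1)}}{\partial h_{ij}^{(l)}} = W_{mj}^{(l+1)}}$ is the weight connecting neuron $j$ in layer $l$ to neuron $m$ in layer $l+1$, and $\color{#4ECDC4}{\frac{\partial h_{ij}^{(l)}}{\partial z_{ij}^{(l)}} = \sigma'(z_{ij}^{(l)})}$ is the derivative of the activation function.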
This gives us the **error term** for hidden layer $l$:
$
\delta_{ij}^{(l)} = \left(\sum_{m=1}^{M^{(l+1)}} W_{mj}^{(l+1)} \delta_{im}^{(l+1)}\right) \sigma'(z_{ij}^{(l)})
$
In vector form: $\boldsymbol{\delta}_i^{(l)} = \left[(\mathbf{W}^{(l+1)})^\top \boldsymbol{\delta}_i^{(l+1)}\right] \odot \sigma'(\mathbf{z}_i^{(l)})$
---
## Backpropagation: Hidden Layers
**Step 4**: Compute gradients w.r.t. weights and biases for sample $i$ (same as output layer!)
Given $\boldsymbol{\delta}_i^{(l)} = \frac{\partial \mathcal{L}_i}{\partial \mathbf{z}_i^{(l)}}$ for sample $i$ and forward pass $z_{ij}^{(l)} = \sum_{k=1}^{M^{(l-1)}} W_{jk}^{(l)} h_{ik}^{(l-1)} + b_j^{(l)}$:
**Weight gradients**: Apply chain rule to $W_{jk}^{(l)}$ (weight connecting neuron $k$ in layer $l-1$ to neuron $j$ in layer $l$)
$
\frac{\partial \mathcal{L}_i}{\partial W_{jk}^{(l)}} = \color{#FF6B6B}{\frac{\partial \mathcal{L}_i}{\partial z_{ij}^{(l)}}} \color{black}{\cdot} \color{#4ECDC4}{\frac{\partial z_{ij}^{(l)}}{\partial W_{jk}^{(l)}}} \color{black}{=} \color{#FF6B6B}{\delta_{ij}^{(l)}} \color{black}{\cdot} \color{#4ECDC4}{h_{ik}^{(l-1)}}
$
In matrix form: $\frac{\partial \mathcal{L}_i}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}_i^{(l)} (\mathbf{h}_i^{(l-1)})^\top$
**Bias gradients**: Apply chain rule to $b_j^{(l)}$ (bias for neuron $j$ in layer $l$)
$
\frac{\partial \mathcal{L}_i}{\partial b_j^{(l)}} = \color{#FF6B6B}{\frac{\partial \mathcal{L}_i}{\partial z_{ij}^{(l)}}} \color{black}{\cdot} \color{#4ECDC4}{\frac{\partial z_{ij}^{(l)}}{\partial b_j^{(l)}}} \color{black}{=} \color{#FF6B6B}{\delta_{ij}^{(l)}} \color{black}{\cdot} \color{#4ECDC4}{1} \color{black}{=} \delta_{ij}^{(l)}
$
---
## Backpropagation: Algorithm Summary
**Forward Pass**:
1. Input: $\mathbf{h}^{(0)} = \mathbf{x}$
2. For $l = 1, \ldots, L$:
- $\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}$
- $\mathbf{h}^{(l)} = \sigma(\mathbf{z}^{(l)})$
3. Output: $\hat{\mathbf{y}} = \mathbf{h}^{(L)}$
4. Loss: $\mathcal{L}(\mathbf{y}, \hat{\mathbf{y}})$
**Backward Pass**:
1. Output layer: $\boldsymbol{\delta}^{(L)} = \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}} \odot \sigma'(\mathbf{z}^{(L)})$
2. For $l = L-1, \ldots, 1$:
- $\boldsymbol{\delta}^{(l)} = [(\mathbf{W}^{(l+1)})^\top \boldsymbol{\delta}^{(l+1)}] \odot \sigma'(\mathbf{z}^{(l)})$
3. Gradients for all layers:
- $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} (\mathbf{h}^{(l-1)})^\top$
- $\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}$
**Weight Update** (Gradient Descent):
$
\begin{aligned}
\mathbf{W}^{(l)} & \leftarrow \mathbf{W}^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} \\
\mathbf{b}^{(l)} & \leftarrow \mathbf{b}^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}}
\end{aligned}
$
where $\eta$ is the learning rate.
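The summary can be turned into a small sketch and checked against a finite-difference gradient. It assumes a single sample ($N = 1$), squared-error loss, and a sigmoid at every layer (so $\sigma'(z) = h(1 - h)$); the network sizes, weights, and learning rate are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Return the activations h^(0), ..., h^(L) for params = [(W, b), ...]."""
    hs = [x]
    for W, b in params:
        z = [sum(w * h for w, h in zip(row, hs[-1])) + b_j for row, b_j in zip(W, b)]
        hs.append([sigmoid(z_j) for z_j in z])
    return hs


def loss(params, x, y):
    y_hat = forward(params, x)[-1]
    return sum((t - p) ** 2 for t, p in zip(y, y_hat))


def backward(params, x, y):
    """Return [(dL/dW, dL/db), ...] per layer using the delta recursion."""
    hs = forward(params, x)
    # Output layer: delta = dL/dy_hat * sigma'(z), with sigma'(z) = y_hat * (1 - y_hat).
    delta = [2.0 * (p - t) * p * (1.0 - p) for p, t in zip(hs[-1], y)]
    grads = []
    for l in range(len(params) - 1, -1, -1):
        h_prev = hs[l]
        grads.append(([[d_j * h_k for h_k in h_prev] for d_j in delta], list(delta)))
        if l > 0:
            W = params[l][0]
            # Previous layer: (W^T delta) times sigma'(z), element-wise.
            delta = [
                sum(W[m][j] * delta[m] for m in range(len(delta))) * h_prev[j] * (1.0 - h_prev[j])
                for j in range(len(h_prev))
            ]
    return grads[::-1]


params = [
    ([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [0.0, 0.1, -0.1]),  # layer 1: 2 -> 3
    ([[0.3, -0.1, 0.2]], [0.05]),                                # layer 2: 3 -> 1
]
x, y = [1.0, 2.0], [1.0]
grads = backward(params, x, y)

# Finite-difference check on one hidden-layer weight.
eps = 1e-6
W1 = params[0][0]
W1[1][0] += eps
loss_plus = loss(params, x, y)
W1[1][0] -= 2 * eps
loss_minus = loss(params, x, y)
W1[1][0] += eps  # restore the original value
numeric = (loss_plus - loss_minus) / (2 * eps)
analytic = grads[0][0][1][0]

# One gradient descent step with an assumed learning rate.
eta = 0.5
updated = [
    ([[w - eta * g for w, g in zip(w_row, g_row)] for w_row, g_row in zip(W, gW)],
     [b_j - eta * g_j for b_j, g_j in zip(b, gb)])
    for (W, b), (gW, gb) in zip(params, grads)
]
print(abs(numeric - analytic) < 1e-7, loss(updated, x, y) < loss(params, x, y))
```

For a mini-batch, the per-sample gradients would be summed (or averaged, depending on how the loss is normalized) before the update.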
---
## Backpropagation: Element-wise View
For a single neuron $j$ in layer $l$, the gradient with respect to its weight $W_{ji}^{(l)}$ is:
$
\frac{\partial \mathcal{L}}{\partial W_{ji}^{(l)}} = \delta_j^{(l)} h_i^{(l-1)}
$
where the error term $\delta_j^{(l)}$ is computed as:
$
\delta_j^{(l)} = \begin{cases}
\left(\frac{\partial \mathcal{L}}{\partial \hat{y}_j}\right) \sigma'(z_j^{(L)}) & \text{if } l = L \text{ (output layer)} \\
\\
\left(\sum_{k=1}^{M^{(l+1)}} W_{kj}^{(l+1)} \delta_k^{(l+1)}\right) \sigma'(z_j^{(l)}) & \text{if } l < L \text{ (hidden layer)}
\end{cases}
$
**Key Insight**: Each neuron's error $\delta_j^{(l)}$ depends on:
1. The weighted sum of errors from neurons in the next layer
2. The derivative of its own activation function
This recursive structure enables efficient gradient computation through the chain rule!
---
## Multilayer Perceptrons
Source: https://github.com/acids-ircam/creative_ml
---
## Neural Network as Space Transformer
---
## Regularization Techniques
To prevent overfitting in multilayer perceptrons, we can use various regularization techniques:
- **L1 or L2 Regularization** (L2 corresponds to weight decay under plain SGD): Adds a penalty term to the loss function that grows with the magnitude of the weights.
$
\begin{aligned}
\mathcal{L}_{reg} & = \mathcal{L} + \mathcal{R} \\
\mathcal{R}_1 & = \lambda \sum_{l} \sum_{i,j} |W_{ij}^{(l)}| \quad \text{(L1 Regularization)} \\
\mathcal{R}_2 & = \lambda \sum_{l} \sum_{i,j} (W_{ij}^{(l)})^2 \quad \text{(L2 Regularization)}
\end{aligned}
$
- **Batch Normalization**: Normalizes the inputs of each layer over a mini-batch of $m$ samples to have zero mean and unit variance, improving training stability.
$
\begin{aligned}
\mu_B & = \frac{1}{m} \sum_{i=1}^{m} z_i \\
\sigma_B^2 & = \frac{1}{m} \sum_{i=1}^{m} (z_i - \mu_B)^2\\
\hat{z}_i & = \frac{z_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
\end{aligned}
$
- **Dropout**: Randomly sets a fraction of the neurons to zero during training to prevent co-adaptation.
Source: https://medium.com/@amarbudhiraja/https-medium-com-amarbudhiraja-learning-less-to-learn-better-dropout-in-deep-machine-learning-74334da4bfc5
- **Early Stopping**: Monitors validation loss during training and stops when it starts to increase.
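Three of the techniques above can be sketched in a few lines. The function names are hypothetical, the batch normalization shown is only the normalization step from the equations above, and the dropout is the plain version described here without any rescaling of the surviving units.

```python
import math
import random


def l1_penalty(weights, lam):
    """R_1 = lambda * sum of |W_ij| over all layers."""
    return lam * sum(abs(w) for W in weights for row in W for w in row)


def l2_penalty(weights, lam):
    """R_2 = lambda * sum of W_ij^2 over all layers."""
    return lam * sum(w ** 2 for W in weights for row in W for w in row)


def batch_normalize(z_batch, eps=1e-5):
    """Normalize one unit's values over a mini-batch to zero mean, unit variance."""
    m = len(z_batch)
    mu = sum(z_batch) / m
    var = sum((z - mu) ** 2 for z in z_batch) / m
    return [(z - mu) / math.sqrt(var + eps) for z in z_batch]


def dropout(h, p_drop, rng):
    """Training-time dropout: zero each unit independently with probability p_drop."""
    return [0.0 if rng.random() < p_drop else h_j for h_j in h]


weights = [[[1.0, -2.0]], [[3.0]]]  # two tiny illustrative layers
print(l1_penalty(weights, 0.1), l2_penalty(weights, 0.1))

z_hat = batch_normalize([1.0, 2.0, 3.0, 4.0])
mean_hat = sum(z_hat) / len(z_hat)
var_hat = sum((z - mean_hat) ** 2 for z in z_hat) / len(z_hat)

dropped = dropout([1.0] * 1000, 0.5, random.Random(0))
num_zeroed = dropped.count(0.0)
print(mean_hat, var_hat, num_zeroed)
```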
---
## Weight Initialization Strategies
**Why is proper initialization important?**
- **Worst-case**: Initializing all weights to zero gives every neuron in a layer the same output and the same gradient, so the neurons never differentiate and the network does not learn.
- **Random initialization**: Helps break symmetry, but naive methods can lead to vanishing/exploding gradients.
**Xavier/Glorot Initialization** (for sigmoid/tanh):
$
W_{ij} \sim \mathcal{N}\left(0, \frac{2}{n_{in} + n_{out}}\right) \quad \text{or} \quad W_{ij} \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{in} + n_{out}}}, \sqrt{\frac{6}{n_{in} + n_{out}}}\right)
$
→ Sigmoid/tanh derivatives near $z = 0$: $\sigma'(0) = 0.25$ and $\tanh'(0) = 1$ (gradients are scaled roughly uniformly)
→ Must balance variance for **both** forward ($n_{in}$) **and** backward ($n_{out}$) passes equally
**He Initialization** (for ReLU):
$
W_{ij} \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right) \quad \text{or} \quad W_{ij} \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{in}}}, \sqrt{\frac{6}{n_{in}}}\right)
$
→ ReLU derivative: $\frac{d\text{ReLU}}{dz} = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}$ (either passes gradient or blocks it)
→ Backward pass **inherits** the sparsity pattern from forward pass (same neurons are dead)
→ Only need to preserve variance in forward pass; backward naturally follows with variance $\frac{2}{n_{in}}$
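As a rough check of the formulas above, this sketch samples weight matrices and compares the empirical variance with the targets $\frac{2}{n_{in} + n_{out}}$ and $\frac{2}{n_{in}}$. The layer sizes, seed, and function names are arbitrary choices.

```python
import math
import random


def xavier_normal(n_in, n_out, rng):
    std = math.sqrt(2.0 / (n_in + n_out))
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


def xavier_uniform(n_in, n_out, rng):
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]


def he_normal(n_in, n_out, rng):
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


def variance(W):
    """Empirical variance of all entries of W."""
    values = [w for row in W for w in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


rng = random.Random(0)
n_in, n_out = 200, 100
var_xavier_normal = variance(xavier_normal(n_in, n_out, rng))
var_xavier_uniform = variance(xavier_uniform(n_in, n_out, rng))
var_he = variance(he_normal(n_in, n_out, rng))
print(var_xavier_normal, var_xavier_uniform, 2.0 / (n_in + n_out))
print(var_he, 2.0 / n_in)
```

The uniform variant matches the normal one because a uniform distribution on $[-a, a]$ has variance $a^2 / 3$.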
---
# Python Implementation