**When does it explode?**
- When $\|\mathbf{W}_{hh}\| > 1$ and the activation derivatives are not small enough to compensate, the product of per-step Jacobians grows without bound
- Gradient magnitude: $\|\text{gradient}\| \approx (\|\mathbf{W}_{hh}\| \cdot \max_k \sigma'(\mathbf{z}_k))^{T-t}$
- Even with bounded activation derivatives, large $\|\mathbf{W}_{hh}\|$ can cause explosion
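A back-of-the-envelope sketch of this growth, using made-up illustrative numbers for the terms in the formula above:

```python
# Illustrative (made-up) values for the terms in the gradient-magnitude formula
w_norm = 1.5      # ||W_hh||, larger than 1
max_deriv = 0.9   # max_k sigma'(z_k): bounded, but not small enough to compensate
horizon = 50      # T - t, number of steps the gradient flows back through

grad_scale = (w_norm * max_deriv) ** horizon
print(f"(1.5 * 0.9)^50 = {grad_scale:.2e}")  # ~3.3e+06: the gradient explodes
```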
**Consequences**:
- Parameter updates become extremely large
- Network weights oscillate wildly
- Training becomes unstable, leading to NaN values
- The model fails to converge

**Remedy: gradient clipping.** Rescale the gradient whenever its norm exceeds a threshold:

$$
\mathbf{g} \leftarrow
\begin{cases}
\dfrac{\theta}{\|\mathbf{g}\|}\,\mathbf{g} & \text{if } \|\mathbf{g}\| > \theta \\
\mathbf{g} & \text{otherwise}
\end{cases}
$$

where $\mathbf{g}$ is the gradient vector and $\theta$ is the clipping threshold.
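A minimal NumPy sketch of this clipping rule (the gradient and threshold values are arbitrary; deep learning frameworks ship equivalent utilities, e.g. PyTorch's `torch.nn.utils.clip_grad_norm_`):

```python
import numpy as np

def clip_gradient(g: np.ndarray, theta: float) -> np.ndarray:
    """Rescale g so that its L2 norm never exceeds the threshold theta."""
    norm = np.linalg.norm(g)
    return (theta / norm) * g if norm > theta else g

# Example: an exploded gradient is rescaled down to the threshold norm
g = np.array([300.0, -400.0])            # ||g|| = 500
clipped = clip_gradient(g, theta=5.0)    # ||clipped|| = 5
print(clipped, np.linalg.norm(clipped))
```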
---
## Long Short-Term Memory (LSTM) Layers
- LSTMs are a type of recurrent neural network designed to mitigate the vanishing gradient problem and capture long-term dependencies in sequential data
- They use a memory cell with gating mechanisms (input, forget, output gates) to control information flow and maintain long-term dependencies
Input Gate: Controls how much new information from the current input $\mathbf{x}_t$ and previous hidden state $\mathbf{h}_{t-1}$ is added to the cell state $\mathbf{c}_t$.
Cell State Update: Combines the previous cell state $\mathbf{c}_{t-1}$ (modulated by forget gate) and the candidate cell state $\tilde{\mathbf{c}}_t$ (modulated by input gate) to form the new cell state $\mathbf{c}_t$.
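For reference, the standard LSTM update that these descriptions correspond to, with $\sigma$ the sigmoid, $\odot$ element-wise multiplication, and $\mathbf{W}_\ast$, $\mathbf{U}_\ast$, $\mathbf{b}_\ast$ the learned parameters (naming follows the common convention):

$$
\begin{aligned}
\mathbf{f}_t &= \sigma(\mathbf{W}_f \mathbf{x}_t + \mathbf{U}_f \mathbf{h}_{t-1} + \mathbf{b}_f) && \text{(forget gate)} \\
\mathbf{i}_t &= \sigma(\mathbf{W}_i \mathbf{x}_t + \mathbf{U}_i \mathbf{h}_{t-1} + \mathbf{b}_i) && \text{(input gate)} \\
\mathbf{o}_t &= \sigma(\mathbf{W}_o \mathbf{x}_t + \mathbf{U}_o \mathbf{h}_{t-1} + \mathbf{b}_o) && \text{(output gate)} \\
\tilde{\mathbf{c}}_t &= \tanh(\mathbf{W}_c \mathbf{x}_t + \mathbf{U}_c \mathbf{h}_{t-1} + \mathbf{b}_c) && \text{(candidate cell state)} \\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t && \text{(cell state update)} \\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t) && \text{(hidden state)}
\end{aligned}
$$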
---
## Gated Recurrent Unit (GRU) Layers
- GRUs are a simplified variant of LSTMs that combine the forget and input gates into a single "update gate"
- They use fewer parameters than LSTMs while maintaining comparable performance
- GRUs have two gates: reset gate and update gate, making them computationally more efficient
Update Gate: Controls how much of the previous hidden state $\mathbf{h}_{t-1}$ to keep and how much of the candidate hidden state $\tilde{\mathbf{h}}_t$ to add.
Candidate Hidden State: Computes new information that could be added to the hidden state, using the reset gate to selectively forget parts of $\mathbf{h}_{t-1}$:

$$\tilde{\mathbf{h}}_t = \tanh\!\left(\mathbf{W}_h \mathbf{x}_t + \mathbf{U}_h (\mathbf{r}_t \odot \mathbf{h}_{t-1}) + \mathbf{b}_h\right)$$

where $\tilde{\mathbf{h}}_t$ is the candidate hidden state and $\mathbf{r}_t \odot \mathbf{h}_{t-1}$ applies the reset gate to the previous hidden state.
Hidden State Update: Combines the previous hidden state $\mathbf{h}_{t-1}$ and candidate hidden state $\tilde{\mathbf{h}}_t$ using the update gate $\mathbf{z}_t$:

$$\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t$$

where $(1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1}$ keeps parts of the old hidden state and $\mathbf{z}_t \odot \tilde{\mathbf{h}}_t$ adds parts of the new candidate hidden state.
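Putting these pieces together, a minimal NumPy sketch of a single GRU step (weight shapes and names are illustrative, not tied to any particular library):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh):
    """One GRU time step: returns the new hidden state h_t."""
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev + bz)               # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev + br)               # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r_t * h_prev) + bh)   # candidate hidden state
    return (1.0 - z_t) * h_prev + z_t * h_tilde              # hidden state update

# Toy example: input size 3, hidden size 4, small random weights (illustrative only)
rng = np.random.default_rng(0)
shapes = [(4, 3), (4, 4), (4,), (4, 3), (4, 4), (4,), (4, 3), (4, 4), (4,)]
params = [rng.normal(scale=0.1, size=s) for s in shapes]
h_t = gru_step(rng.normal(size=3), np.zeros(4), *params)
print(h_t.shape)  # (4,)
```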
**Key differences from LSTM**:
- No separate cell state — hidden state serves both roles
- Fewer parameters: three sets of gate/candidate weights (update gate, reset gate, candidate state) vs. four in an LSTM (input, forget, output gates, candidate cell)
- Update gate implicitly combines forget and input gates: $(1 - \mathbf{z}_t)$ forgets, $\mathbf{z}_t$ adds new info
---
## Gates Beyond LSTM and GRU
- Gating mechanisms can be integrated into other architectures, such as convolutional neural networks (CNNs) and transformer models
- Gates help control information flow, improve gradient propagation, and enhance model performance across various tasks
- Examples include attention gates in transformers and gated convolutional layers in CNNs
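As one concrete illustration (a sketch of the usual gated-linear-unit formulation used in gated convolutional networks, not code from this document; all names are illustrative), a gated layer multiplies one linear projection of its input by a sigmoid gate computed from a second projection:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_linear_unit(x, W, b, V, c):
    """Content path (x @ W + b) modulated element-wise by a learned sigmoid gate."""
    return (x @ W + b) * sigmoid(x @ V + c)

# Toy example: a batch of 8 sixteen-dimensional features projected to 32 gated units
rng = np.random.default_rng(1)
x = rng.normal(size=(8, 16))
W, V = rng.normal(scale=0.1, size=(16, 32)), rng.normal(scale=0.1, size=(16, 32))
b, c = np.zeros(32), np.zeros(32)
print(gated_linear_unit(x, W, b, V, c).shape)  # (8, 32)
```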
---
# Python Implementation