# Convolutional Layers

--- ## Mathematical Foundations

**Calculus & Linear Algebra**: basis for optimization algorithms and machine learning model operations

| Year | Milestone | Contributor(s) |
|------|-----------|----------------|
| 1676 | Chain Rule | Leibniz, G. W. |
| 1805 | Least Squares | Legendre, A. M. |
| 1809 | Normal Equations | Gauss, C. F. |
| 1847 | Gradient Descent | Cauchy, A. L. |
| 1858 | Eigenvalue Theory | Cayley & Hamilton |
| 1901 | PCA | Pearson, K. |
| 1951 | Stochastic Gradient Descent | Robbins & Monro |
**Probability & Statistics**: basis for Bayesian methods, statistical inference, and generative models

| Year | Milestone | Contributor(s) |
|------|-----------|----------------|
| 1763 | Bayes' Theorem | Bayes, T. |
| 1812 | Bayesian Probability | Laplace, P. S. |
| 1815 | Gaussian Distribution | Gauss, C. F. |
| 1830 | Central Limit Theorem | Various |
| 1922 | Maximum Likelihood | Fisher, R. |
**Information & Computation**: foundations of algorithmic thinking and information theory

| Year | Milestone | Contributor(s) |
|------|-----------|----------------|
| 1843 | First Computer Algorithm | Lovelace, A. |
| 1936 | Turing Machine | Turing, A. |
| 1947 | Linear Programming | Dantzig, G. |
| 1948 | Information Theory | Shannon, C. |
--- ## Early History of Neural Networks

**Architectures & Layers**: evolution of network architectures and layer innovations

| Year | Milestone | Contributor(s) |
|------|-----------|----------------|
| 1943 | Artificial Neurons | McCulloch & Pitts |
| 1957 | Perceptron | Rosenblatt, F. |
| 1965 | Deep Networks | Ivakhnenko & Lapa |
| 1979 | Convolutional Networks | Fukushima, K. |
| 1982 | Recurrent Networks | Hopfield, J. |
| 1997 | LSTM | Hochreiter & Schmidhuber |
| 2006 | Deep Belief Networks | Hinton, G. et al. |
| 2012 | AlexNet | Krizhevsky et al. |
**Training & Optimization**: methods for efficient learning and gradient-based optimization

| Year | Milestone | Contributor(s) |
|------|-----------|----------------|
| 1967 | Stochastic Gradient Descent for NN | Amari, S. |
| 1970 | Automatic Differentiation | Linnainmaa, S. |
| 1986 | Backpropagation for NN | Rumelhart, Hinton & Williams |
| 1992 | Weight Decay | Krogh & Hertz |
| 2009 | Convolutional DBNs & Prob. Max Pooling | Lee, H. et al. |
| 2010 | ReLU & Xavier Init | Nair, Hinton & Glorot |
| 2012 | Dropout | Hinton, G. et al. |
**Software & Datasets**: tools, platforms, and milestones that enabled practical deep learning

| Year | Milestone | Contributor(s) |
|------|-----------|----------------|
| 1997 | Deep Blue | IBM |
| 1998 | MNIST Dataset & LeNet-5 | LeCun, Y. et al. |
| 2002 | Torch Framework | Torch Team |
| 2007 | CUDA Platform | NVIDIA |
| 2009 | ImageNet Dataset | Deng, J. et al. |
| 2011 | Siri | Apple Inc. |
--- ## The Deep Learning Era

**Deep Architectures**: deep architectures and generative models transforming AI capabilities

| Year | Milestone | Contributor(s) |
|------|-----------|----------------|
| 2013 | Variational Autoencoders | Kingma et al. |
| 2014 | Generative Adversarial Nets | Goodfellow et al. |
| 2015 | ResNet & Diffusion | He et al. & Sohl-Dickstein et al. |
| 2016 | Style Transfer & WaveNet | Gatys & van den Oord |
| 2017 | Transformers | Vaswani et al. |
| 2021 | ViT & CLIP | Dosovitskiy & Radford |
| 2022 | Diffusion Transformer | Peebles & Xie |
| 2023 | Mamba | Gu & Dao |
**Training & Optimization**: advanced learning techniques and representation learning breakthroughs

| Year | Milestone | Contributor(s) |
|------|-----------|----------------|
| 2013 | Word2Vec | Mikolov, T. et al. |
| 2014 | Attention Mechanism | Bahdanau, D. et al. |
| 2015 | BatchNorm & Adam | Ioffe & Kingma |
| 2016 | Layer Normalization | Ba, J. L. et al. |
| 2020 | DDPM | Ho, J. et al. |
**Software & Applications**: practical deployment and mainstream adoption of deep learning systems

| Year | Milestone | Contributor(s) |
|------|-----------|----------------|
| 2016 | AlphaGo | Silver, D. et al. |
| 2017 | PyTorch | Paszke, A. et al. |
| 2018 | GPT-1 & BERT | Radford & Devlin |
| 2020 | GPT-3 | Brown, T. B. et al. |
| 2022 | ChatGPT & Stable Diffusion | OpenAI & Stability AI |
| 2023 | LLaMA | Touvron, H. et al. |
--- ## Recap: Linear Layers
- Linear layers are the building blocks of neural networks, consisting of weights and biases
- A linear layer with one output and a step activation function is called a perceptron
- To stack multiple linear layers and learn complex patterns, we need a differentiable non-linear activation function
- Common activation functions include ReLU, sigmoid, and tanh
- The forward pass of a multi-layer perceptron (MLP) consists of multiple linear transformations, each followed by a non-linear activation
- In the backward pass, we "go back" through the network and update the weights using gradient descent
**Forward Pass**:

1. Input: $\mathbf{h}^{(0)} = \mathbf{x}$
2. For $l = 1, \ldots, L$:
   - $\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}$
   - $\mathbf{h}^{(l)} = \sigma(\mathbf{z}^{(l)})$
3. Output: $\hat{\mathbf{y}} = \mathbf{h}^{(L)}$
4. Loss: $\mathcal{L}(\mathbf{y}, \hat{\mathbf{y}})$

**Backward Pass**:

1. Output layer: $\boldsymbol{\delta}^{(L)} = \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}} \odot \sigma'(\mathbf{z}^{(L)})$
2. For $l = L-1, \ldots, 1$:
   - $\boldsymbol{\delta}^{(l)} = [(\mathbf{W}^{(l+1)})^\top \boldsymbol{\delta}^{(l+1)}] \odot \sigma'(\mathbf{z}^{(l)})$
3. Gradients for all layers:
   - $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} (\mathbf{h}^{(l-1)})^\top$
   - $\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}$
Can linear layers retain temporal information?

- Yes, but in a limited way: linear layers learn position-specific patterns (e.g., "the value at t=5 relates to the value at t=8")
- No translation invariance: the same pattern at different positions must be learned separately
- Fixed input length, and parameters scale poorly with sequence length
--- ## Recap: Finite Impulse Response (FIR) Filters

- Digital filters implemented by convolving the input signal with a finite impulse response
**Time domain:**
$ y[n] = \sum_{k=0}^{M-1} h[k] x[n-k] $
where $y[n]$ is the output, $x[n]$ is the input, $h[k]$ is the impulse response, and $M$ is the number of filter taps (filter order $M-1$).

**Frequency domain (via the discrete-time Fourier transform):**
$ \begin{aligned} H(e^{j\omega}) & = \sum_{k=0}^{M-1} h[k] e^{-j\omega k} \\ Y(e^{j\omega}) & = H(e^{j\omega}) X(e^{j\omega}) \end{aligned} $
FIR filters are inherently stable, meaning they do not produce unbounded output for a bounded input.
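A minimal NumPy sketch (not part of the original slides; the signal and filter are arbitrary examples) checking that the two views agree: direct time-domain convolution versus multiplying the spectra on a sufficiently long FFT grid.

```python
import numpy as np

# A short FIR filter (moving average, M = 4 taps) and a test signal.
h = np.ones(4) / 4
x = np.random.default_rng(0).standard_normal(32)

# Time domain: y[n] = sum_k h[k] x[n-k]  (full convolution).
y_time = np.convolve(x, h)

# Frequency domain: Y = H * X, evaluated on an FFT grid long enough
# to hold the full convolution (len(x) + len(h) - 1 samples).
N = len(x) + len(h) - 1
y_freq = np.fft.ifft(np.fft.fft(h, N) * np.fft.fft(x, N)).real

print(np.allclose(y_time, y_freq))  # True: both views agree
```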
--- ## FIR Filters Animation
--- ## Convolutional Layers
- Convolutional layers apply learnable FIR filters to the input data, enabling the model to capture local patterns
- They are translation equivariant: the same filter is applied across the entire input, so a shifted input yields a correspondingly shifted output
- Convolutional layers can handle variable-length inputs and have far fewer parameters than fully connected layers
- Commonly used in many audio tasks

**1D Convolution:**
$ y[n] = \sum_{k=0}^{M-1} w[k] x[n-k] + b $
with learnable weights $w[k]$ of size $M$ and bias $b$. The output length is given by:
$ L_{out} = L_{in} - M + 1 $
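A minimal PyTorch sketch (illustrative, with arbitrary sizes) verifying the output-length formula; note that `torch.nn.Conv1d` actually computes a cross-correlation (the kernel is not flipped), which is the usual deep-learning convention.

```python
import torch
import torch.nn as nn

L_in, M = 100, 5
conv = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=M, bias=True)

x = torch.randn(1, 1, L_in)          # (batch, channels, length)
y = conv(x)

print(y.shape)                        # torch.Size([1, 1, 96])
print(y.shape[-1] == L_in - M + 1)    # True: L_out = L_in - M + 1
```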
--- ## 2D Convolutional Layers
- 2D convolutional layers extend the concept of 1D convolutions to two-dimensional data, such as images
- They apply learnable 2D filters (kernels) to capture spatial patterns in the input
- Commonly used in computer vision tasks for feature extraction and pattern recognition

**2D Convolution:**
$ y[m,n] = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} w[i,j] x[m-i,n-j] + b $
where $w[i,j]$ is the 2D filter of size $M \times N$ and $b$ is the bias. The output has dimensions:
$ H_{out} = H_{in} - M + 1, \quad W_{out} = W_{in} - N + 1 $
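The same shape check in 2D, as a small sketch with arbitrary sizes (assuming `torch.nn.Conv2d` with no padding and stride 1):

```python
import torch
import torch.nn as nn

H_in, W_in, M, N = 32, 48, 3, 5
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=(M, N))

x = torch.randn(1, 1, H_in, W_in)      # (batch, channels, height, width)
y = conv(x)

print(y.shape)                          # torch.Size([1, 1, 30, 44])
print(y.shape[-2:] == (H_in - M + 1, W_in - N + 1))   # True
```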
Source:
https://github.com/vdumoulin/conv_arithmetic/tree/master
--- ## Transposed Convolutional Layers
- Transposed convolutional layers, also known as deconvolutional layers, are used for upsampling feature maps
- They reverse the spatial transformation of convolutional layers, increasing the spatial dimensions of the input
- Commonly used in generative models and image segmentation tasks

**1D Transposed Convolution:**
$ y[n] = \sum_{k=0}^{M-1} w[k] \cdot x'\left[n+k\right] + b $
where $x'$ is the input zero-padded with $p = M - 1$ zeros at each edge. The output length is given by:
$ L_{out} = L_{in} + M - 1 $
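A quick check of the output length with `torch.nn.ConvTranspose1d` at its defaults (stride 1, no padding); the sizes are arbitrary:

```python
import torch
import torch.nn as nn

L_in, M = 10, 4
tconv = nn.ConvTranspose1d(in_channels=1, out_channels=1, kernel_size=M)

x = torch.randn(1, 1, L_in)
y = tconv(x)

print(y.shape)                         # torch.Size([1, 1, 13])
print(y.shape[-1] == L_in + M - 1)     # True: L_out = L_in + M - 1
```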
Source:
https://github.com/vdumoulin/conv_arithmetic/tree/master
--- ## Options in Convolutional Layers
Padding
: Adding $p$ extra points at the edges of the input to control the spatial dimensions of the output
Strides
: The step size $s$ with which the filter moves across the input, affecting the output size
Dilations
: Spacing out the filter elements by a factor of $d$ to increase the receptive field without increasing parameters
Source:
https://github.com/vdumoulin/conv_arithmetic/tree/master
```python
class torch.nn.Conv1d(in_channels, out_channels, kernel_size, stride=1,
                      padding=0, dilation=1, groups=1, bias=True,
                      padding_mode='zeros', device=None, dtype=None)
```
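With padding, stride, and dilation combined, the output length matches the formula documented for `torch.nn.Conv1d`, $L_{out} = \lfloor (L_{in} + 2p - d(M-1) - 1)/s \rfloor + 1$. A small sketch with arbitrary settings:

```python
import torch
import torch.nn as nn

L_in, M, p, s, d = 50, 3, 2, 2, 3
conv = nn.Conv1d(1, 1, kernel_size=M, stride=s, padding=p, dilation=d)

y = conv(torch.randn(1, 1, L_in))

# General output-length formula with padding, stride, and dilation:
L_out = (L_in + 2 * p - d * (M - 1) - 1) // s + 1
print(y.shape[-1], L_out)   # both 24
```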
--- ## Multi-Head Convolutional Layers

- Multi-head convolutional layers consist of multiple parallel convolutional kernels (heads) that process the input simultaneously
- Each head learns different features from the input, allowing the model to capture a diverse set of patterns
- Each kernel has its own set of weights and biases, and gets convolved with all input channels

**Multi-Head 1D Convolution:**
$ y_{j}[n] = \sum_{i=0}^{C_{in}-1} \sum_{k=0}^{M-1} w_{i,j}[k] x_{i}[n-k] + b_{j} $
where $C_{in}$ is the number of input channels, $y_{j}[n]$ is the output of head $j$, $x_{i}[n]$ is the input from channel $i$, $w_{i,j}[k]$ are the weights for head $j$ and channel $i$, and $b_{j}$ is the bias for head $j$.
Grouped convolutions are a special case of multi-head convolutions where each head processes a distinct subset of input channels.
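In PyTorch terms, the heads are the output channels of `torch.nn.Conv1d`. A small sketch (arbitrary channel counts) showing how the parameter shapes match the formula above, and how `groups` restricts each head to a subset of input channels:

```python
import torch
import torch.nn as nn

C_in, C_out, M = 4, 8, 5

# Standard multi-head convolution: every output channel (head) sees all inputs.
conv = nn.Conv1d(C_in, C_out, kernel_size=M)
print(conv.weight.shape)   # torch.Size([8, 4, 5]) -> one M-tap filter per (j, i) pair
print(conv.bias.shape)     # torch.Size([8])       -> one bias per head

# Grouped convolution: each group of heads sees only C_in / groups input channels.
gconv = nn.Conv1d(C_in, C_out, kernel_size=M, groups=2)
print(gconv.weight.shape)  # torch.Size([8, 2, 5])
```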
--- ## Options in Transposed Convolutional Layers
Padding
: Reduces the inherent padding of transposed convolution to $p' = M - 1 - p$
Strides
: Inserts $(s-1)$ zeros between input elements
Dilations
: Spacing out the filter elements by a factor of $d$ to increase the receptive field without increasing parameters
Source:
https://github.com/vdumoulin/conv_arithmetic/tree/master
```python
class torch.nn.ConvTranspose1d(in_channels, out_channels, kernel_size, stride=1,
                               padding=0, output_padding=0, groups=1, bias=True,
                               dilation=1, padding_mode='zeros', device=None,
                               dtype=None)
```
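A small sketch (arbitrary sizes) checking the transposed-convolution output length used by PyTorch, $L_{out} = (L_{in}-1)s - 2p + d(M-1) + 1$ plus `output_padding`:

```python
import torch
import torch.nn as nn

L_in, M, s, p, d = 10, 3, 2, 1, 1
tconv = nn.ConvTranspose1d(1, 1, kernel_size=M, stride=s, padding=p, dilation=d)

y = tconv(torch.randn(1, 1, L_in))

# Output-length formula (output_padding = 0 here):
L_out = (L_in - 1) * s - 2 * p + d * (M - 1) + 1
print(y.shape[-1], L_out)   # both 19
```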
--- ## Ambiguities in Transposed Convolutions

- Transposed convolutions can lead to ambiguities in output size due to the interplay of padding, stride, and kernel size
- The `output_padding` parameter is used to resolve these ambiguities by specifying additional size to add to the output shape
| Standard Convolution | Transposed Convolution |
|----------------------|------------------------|
| Input size: 6×6, Kernel: 3×3, Stride: 2, Padding: 1 | Input size: 3×3, Kernel: 3×3, Stride: 2, Padding: 1 |
| Output size: 3×3 | Output size: 5×5 or 6×6? |
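The example above reproduced with `torch.nn.ConvTranspose2d` (a sketch, weights random): the default output is 5×5, and `output_padding=1` selects the 6×6 alternative.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 3, 3)    # the 3x3 input from the example above

# Without output_padding both 5x5 and 6x6 are consistent "inverses" of a
# stride-2 convolution; PyTorch picks the smaller one by default.
t0 = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2, padding=1)
t1 = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2, padding=1, output_padding=1)

print(t0(x).shape)   # torch.Size([1, 1, 5, 5])
print(t1(x).shape)   # torch.Size([1, 1, 6, 6])
```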
Source:
https://github.com/vdumoulin/conv_arithmetic/tree/master
--- ## AlexNet Example

- AlexNet is a pioneering convolutional neural network architecture that achieved significant success in image classification tasks
- Consists of convolutional layers followed by fully connected layers, utilizing ReLU activations and max pooling for downsampling
- AlexNet marked a breakthrough in deep learning
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25.
--- ## Pooling Layers

- Pooling layers reduce the spatial dimensions of the input, helping to decrease computational load and control overfitting
- Common types of pooling include max pooling and average pooling

**1D Max Pooling:**
$ y[n] = \max\limits_{0 \leq k < M} x[n \cdot s + k] $
where $M$ is the pool size and $s$ is the stride. The output length is given by:
$ L_{out} = \left\lfloor \frac{L_{in} - M}{s} \right\rfloor + 1 $
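A one-line check of the pooled output length with `torch.nn.MaxPool1d` (arbitrary sizes):

```python
import torch
import torch.nn as nn

L_in, M, s = 20, 3, 2
pool = nn.MaxPool1d(kernel_size=M, stride=s)

y = pool(torch.randn(1, 1, L_in))
print(y.shape[-1], (L_in - M) // s + 1)   # both 9
```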
2D Max Pooling with pool size 2x2 and stride 2.
--- ## WaveNet Example

- WaveNet is a deep generative model for raw audio waveforms that utilizes dilated causal convolutions to capture long-range temporal dependencies
- It employs multiple layers of dilated convolutions with increasing dilation factors, allowing the receptive field to grow exponentially with depth (see the sketch below)
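An illustrative sketch (not the original WaveNet code) of the receptive-field growth: stacked `Conv1d` layers with kernel size 2 and doubling dilations, made causal by left-padding, where the measured receptive field matches $1 + (M-1)\sum_l d_l$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stack dilated causal 1D convolutions with dilations 1, 2, 4, 8.
kernel_size, dilations = 2, [1, 2, 4, 8]
convs = [nn.Conv1d(1, 1, kernel_size, dilation=d, bias=False) for d in dilations]

x = torch.zeros(1, 1, 64, requires_grad=True)
h = x
for conv, d in zip(convs, dilations):
    h = F.pad(h, (d * (kernel_size - 1), 0))   # left-pad -> causal convolution
    h = conv(h)

h[0, 0, -1].backward()                          # gradient of the last output sample
receptive_field = int((x.grad[0, 0] != 0).sum())

# Theoretical receptive field: 1 + (kernel_size - 1) * sum(dilations)
print(receptive_field, 1 + (kernel_size - 1) * sum(dilations))   # both 16
```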
Oord, A. van den, Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., & Kavukcuoglu, K. (2016). WaveNet: A Generative Model for Raw Audio (No. arXiv:1609.03499). https://doi.org/10.48550/arXiv.1609.03499
--- ## Recall: Backpropagation in MLPs
**Forward Pass**:

1. Input: $\mathbf{h}^{(0)} = \mathbf{x}$
2. For $l = 1, \ldots, L$:
   - $\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}$
   - $\mathbf{h}^{(l)} = \sigma(\mathbf{z}^{(l)})$
3. Output: $\hat{\mathbf{y}} = \mathbf{h}^{(L)}$
4. Loss: $\mathcal{L}(\mathbf{y}, \hat{\mathbf{y}})$

**Backward Pass**:

1. Output layer: $\boldsymbol{\delta}^{(L)} = \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}} \odot \sigma'(\mathbf{z}^{(L)})$
2. For $l = L-1, \ldots, 1$:
   - $\boldsymbol{\delta}^{(l)} = [(\mathbf{W}^{(l+1)})^\top \boldsymbol{\delta}^{(l+1)}] \odot \sigma'(\mathbf{z}^{(l)})$
3. Gradients for all layers:
   - $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} (\mathbf{h}^{(l-1)})^\top$
   - $\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}$
--- ## Backpropagation in Convolutional Layers
How do we compute gradients for convolutional layers?
Like MLPs, we need to compute the gradients of the loss $\mathcal{L}$ for each layer $l$ to update the parameters:
$ \frac{\partial \mathcal{L}}{\partial \mathbf{w}^{(l)}} \quad \text{and} \quad \frac{\partial \mathcal{L}}{\partial b^{(l)}} $
where $\mathbf{w}^{(l)}$ is the convolutional kernel and $b^{(l)}$ is the bias for layer $l$.
--- ## Forward Propagation in Conv Layers
For layer $l$ in a **single-channel 1D convolution** the forward pass computes:
$ \begin{aligned} z^{(l)}[n] & = \sum_{k=0}^{M-1} w^{(l)}[k] \cdot h^{(l-1)}[n-k] + b^{(l)} \\ h^{(l)}[n] & = \sigma(z^{(l)}[n]) \end{aligned} $
where:

- $h^{(l-1)}[n]$ is the input from the previous layer
- $w^{(l)}[k]$ is the learnable kernel of size $M$
- $b^{(l)}$ is the learnable bias
- $\sigma(\cdot)$ is the activation function
- Output length: $L_{out} = L_{in} - M + 1$
**In vector notation** (treating convolution as matrix multiplication):
$ \mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + b^{(l)} \mathbf{1} $
where $\mathbf{W}^{(l)}$ is a Toeplitz matrix representing the convolution operation.
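A small NumPy sketch (arbitrary sizes) building the Toeplitz matrix row by row and checking that $\mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + b^{(l)}\mathbf{1}$ reproduces the valid convolution:

```python
import numpy as np

# Build the Toeplitz ("convolution") matrix W so that W @ h equals a
# valid 1D convolution with kernel w.
rng = np.random.default_rng(0)
L_in, M = 10, 3
h = rng.standard_normal(L_in)
w = rng.standard_normal(M)
b = 0.5

L_out = L_in - M + 1
W = np.zeros((L_out, L_in))
for n in range(L_out):
    W[n, n:n + M] = w[::-1]      # flipped kernel on each shifted row

z_matrix = W @ h + b             # matrix-vector form: z = W h + b*1
z_conv = np.convolve(h, w, mode='valid') + b

print(np.allclose(z_matrix, z_conv))   # True
```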
Source:
https://stackoverflow.com/questions/34536264/how-can-i-generate-a-toeplitz-matrix-in-the-correct-form-for-performing-discrete
--- ## Backpropagation: Output Layer
MSE Loss
: $\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\sum_{n}(y_i[n] - \hat{y}_i[n])^2$

**Step 1**: Compute the gradient w.r.t. the output-layer pre-activation $z_i^{(L)}[n]$ for each sample $i$
Apply the **chain rule**: $\mathcal{L}_i$ depends on $z_i^{(L)}[n]$ through $\hat{y}_i[n]$. For each sample $i$ and time step $n$:
$ \frac{\partial \mathcal{L}_i}{\partial z_i^{(L)}[n]} = \color{#FF6B6B}{\frac{\partial \mathcal{L}_i}{\partial \hat{y}_i[n]}} \color{black}{\cdot} \color{#4ECDC4}{\frac{\partial \hat{y}_i[n]}{\partial z_i^{(L)}[n]}} \color{black}{=} \color{#FF6B6B}{\frac{2}{N}(\hat{y}_i[n] - y_i[n])} \color{black}{\cdot} \color{#4ECDC4}{\sigma'(z_i^{(L)}[n])} $
This gives us the **error term** for each time step:
$ \delta_i^{(L)}[n] = \frac{2}{N}(\hat{y}_i[n] - y_i[n]) \cdot \sigma'(z_i^{(L)}[n]) $
In vector form: $\boldsymbol{\delta}_i^{(L)} = \frac{2}{N}(\hat{\mathbf{y}}_i - \mathbf{y}_i) \odot \sigma'(\mathbf{z}_i^{(L)})$
--- ## Backpropagation: Output Layer Gradients
**Step 2**: Compute gradients w.r.t. kernel weights and bias for sample $i$
Given $\delta_i^{(L)}[n]$ and forward pass $z_i^{(L)}[n] = \sum_{k=0}^{M-1} w^{(L)}[k] \cdot h_i^{(L-1)}[n-k] + b^{(L)}$:
**Kernel weight gradients**: Apply chain rule to $w^{(L)}[k]$ (weight at position $k$ in the kernel)
$ \frac{\partial \mathcal{L}_i}{\partial w^{(L)}[k]} = \sum_{n} \color{#FF6B6B}{\frac{\partial \mathcal{L}_i}{\partial z_i^{(L)}[n]}} \color{black}{\cdot} \color{#4ECDC4}{\frac{\partial z_i^{(L)}[n]}{\partial w^{(L)}[k]}} \color{black}{=} \sum_{n} \color{#FF6B6B}{\delta_i^{(L)}[n]} \color{black}{\cdot} \color{#4ECDC4}{h_i^{(L-1)}[n-k]} $
This is a **convolution** of the error signal with the input!
**Bias gradient**: Apply chain rule to $b^{(L)}$
$ \frac{\partial \mathcal{L}_i}{\partial b^{(L)}} = \sum_{n} \color{#FF6B6B}{\delta_i^{(L)}[n]} \color{black}{\cdot} \color{#4ECDC4}{\frac{\partial z_i^{(L)}[n]}{\partial b^{(L)}}} \color{black}{=} \sum_{n} \delta_i^{(L)}[n] $
The bias gradient is simply the **sum** of all error terms!
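A small sketch (single sample, single channel, sigmoid activation, MSE averaged over the $N$ output samples) that builds the convolution explicitly under the slide's convention and checks the two gradient formulas against PyTorch autograd:

```python
import torch

torch.manual_seed(0)

# Single-channel 1D "valid" convolution following the slide's convention:
#   z[n] = sum_k w[k] * h[n-k] + b,  n = M-1, ..., L_in-1
L_in, M = 12, 3
h = torch.randn(L_in)
w = torch.randn(M, requires_grad=True)
b = torch.randn((), requires_grad=True)
y_target = torch.randn(L_in - M + 1)

z = torch.stack([sum(w[k] * h[n - k] for k in range(M)) + b
                 for n in range(M - 1, L_in)])
y_hat = torch.sigmoid(z)                      # sigma = logistic activation
loss = ((y_target - y_hat) ** 2).mean()       # MSE over the N output samples
loss.backward()

# Manual gradients from the slide:
#   delta[n] = 2/N (y_hat - y) sigma'(z),  dL/dw[k] = sum_n delta[n] h[n-k],
#   dL/db   = sum_n delta[n]
N = len(z)
delta = 2.0 / N * (y_hat - y_target) * y_hat * (1 - y_hat)   # sigma' = s(1-s)
dw = torch.stack([sum(delta[i] * h[(i + M - 1) - k] for i in range(N))
                  for k in range(M)])
db = delta.sum()

print(torch.allclose(dw, w.grad, atol=1e-6),
      torch.allclose(db, b.grad, atol=1e-6))   # True True
```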
--- ## Backpropagation: Hidden Layers
**Step 3**: Propagate error backwards to hidden layer $l$ for sample $i$
To compute $\frac{\partial \mathcal{L}_i}{\partial h_i^{(l)}[n]}$, note that the loss depends on $h_i^{(l)}[n]$ through all positions in layer $l+1$ where this activation is used:
$ \frac{\partial \mathcal{L}_i}{\partial h_i^{(l)}[n]} = \sum_{m} \color{#FF6B6B}{\frac{\partial \mathcal{L}_i}{\partial z_i^{(l+1)}[m]}} \color{black}{\cdot} \color{#4ECDC4}{\frac{\partial z_i^{(l+1)}[m]}{\partial h_i^{(l)}[n]}} $
Since $z_i^{(l+1)}[m] = \sum_{k=0}^{M-1} w^{(l+1)}[k] \cdot h_i^{(l)}[m-k] + b^{(l+1)}$, we have:
$ \begin{aligned} \frac{\partial z_i^{(l+1)}[m]}{\partial h_i^{(l)}[n]} &= \begin{cases} w^{(l+1)}[m-n] & \text{if } 0 \leq m-n < M \\ 0 & \text{otherwise} \end{cases}\\ \rightarrow \frac{\partial \mathcal{L}_i}{\partial h_i^{(l)}[n]} &= \sum_{k=0}^{M-1} \delta_i^{(l+1)}[n+k] \cdot w^{(l+1)}[k] \end{aligned} $
This is a **full convolution** (with flipped kernel) of the error with the weights!
--- ## Backpropagation: Hidden Layer Gradients
**Step 4**: Compute error term and gradients for hidden layer $l$
First, compute the error term by applying the activation derivative:
$ \delta_i^{(l)}[n] = \color{#FF6B6B}{\frac{\partial \mathcal{L}_i}{\partial h_i^{(l)}[n]}} \color{black}{\cdot} \color{#4ECDC4}{\frac{\partial h_i^{(l)}[n]}{\partial z_i^{(l)}[n]}} \color{black}{=} \left(\sum_{k=0}^{M-1} \delta_i^{(l+1)}[n+k] \cdot w^{(l+1)}[k]\right) \cdot \sigma'(z_i^{(l)}[n]) $
**Kernel weight gradients**: Same as output layer!
$ \frac{\partial \mathcal{L}_i}{\partial w^{(l)}[k]} = \sum_{n} \delta_i^{(l)}[n] \cdot h_i^{(l-1)}[n-k] $
**Bias gradient**: Same as output layer!
$ \frac{\partial \mathcal{L}_i}{\partial b^{(l)}} = \sum_{n} \delta_i^{(l)}[n] $
--- ## Backpropagation: Algorithm Summary
**Forward Pass**:

1. Input: $\mathbf{h}^{(0)} = \mathbf{x}$
2. For $l = 1, \ldots, L$:
   - $z^{(l)}[n] = \sum_{k=0}^{M-1} w^{(l)}[k] h^{(l-1)}[n-k] + b^{(l)}$
   - $h^{(l)}[n] = \sigma(z^{(l)}[n])$
3. Output: $\hat{\mathbf{y}} = \mathbf{h}^{(L)}$
4. Loss: $\mathcal{L}(\mathbf{y}, \hat{\mathbf{y}})$

**Backward Pass**:

1. Output layer: $\delta^{(L)}[n] = \frac{\partial \mathcal{L}}{\partial \hat{y}[n]} \cdot \sigma'(z^{(L)}[n])$
2. For $l = L-1, \ldots, 1$:
   - $\delta^{(l)}[n] = \left[\sum_{k=0}^{M-1} \delta^{(l+1)}[n+k] w^{(l+1)}[k]\right] \sigma'(z^{(l)}[n])$
3. Gradients for all layers:
   - $\frac{\partial \mathcal{L}}{\partial w^{(l)}[k]} = \sum_{n} \delta^{(l)}[n] h^{(l-1)}[n-k]$
   - $\frac{\partial \mathcal{L}}{\partial b^{(l)}} = \sum_{n} \delta^{(l)}[n]$
**Weight Update** (Gradient Descent):
$ \begin{aligned} w^{(l)}[k] & \leftarrow w^{(l)}[k] - \eta \frac{\partial \mathcal{L}}{\partial w^{(l)}[k]} \\ b^{(l)} & \leftarrow b^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial b^{(l)}} \end{aligned} $
where $\eta$ is the learning rate.
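A NumPy sketch of the full algorithm for a two-layer, single-channel 1D CNN (tanh activation, unnormalized squared-error loss, arbitrary sizes); the manual gradients are checked against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-layer, single-channel 1D CNN following the summary above.
M, L_in = 3, 16
w1, w2 = rng.standard_normal(M), rng.standard_normal(M)
b1, b2 = rng.standard_normal(), rng.standard_normal()
x = rng.standard_normal(L_in)
sigma, dsigma = np.tanh, lambda z: 1.0 - np.tanh(z) ** 2

def forward(w1, b1, w2, b2, x):
    # numpy 'valid' convolution matches z[n] = sum_k w[k] h[n-k] (re-indexed from 0)
    z1 = np.convolve(x, w1, mode='valid') + b1
    h1 = sigma(z1)
    z2 = np.convolve(h1, w2, mode='valid') + b2
    return z1, h1, z2, sigma(z2)

z1, h1, z2, y_hat = forward(w1, b1, w2, b2, x)
y = rng.standard_normal(y_hat.shape[0])
loss = lambda y_hat: np.sum((y - y_hat) ** 2)

# Backward pass.
delta2 = 2 * (y_hat - y) * dsigma(z2)
# Error propagation: 'full' convolution of the error with the flipped kernel.
delta1 = np.convolve(delta2, w2[::-1], mode='full') * dsigma(z1)
# Weight gradient: convolve the error with the layer input; bias: sum the error.
grad_w1 = np.array([np.sum(delta1 * x[np.arange(len(delta1)) + M - 1 - k])
                    for k in range(M)])
grad_b1 = delta1.sum()

# Check dL/dw1[0] and dL/db1 against central finite differences.
eps = 1e-6
e0 = np.zeros(M); e0[0] = eps
num_w = (loss(forward(w1 + e0, b1, w2, b2, x)[-1])
         - loss(forward(w1 - e0, b1, w2, b2, x)[-1])) / (2 * eps)
num_b = (loss(forward(w1, b1 + eps, w2, b2, x)[-1])
         - loss(forward(w1, b1 - eps, w2, b2, x)[-1])) / (2 * eps)
print(np.allclose(grad_w1[0], num_w), np.allclose(grad_b1, num_b))  # True True
```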
--- ## Key Differences: Conv vs MLP Backprop
| Aspect | MLP | Convolutional Layer |
|--------|-----|---------------------|
| Weight Gradients | Outer product: $\boldsymbol{\delta}^{(l)} (\mathbf{h}^{(l-1)})^\top$ | Convolution: $\sum_{n} \delta^{(l)}[n] h^{(l-1)}[n-k]$ |
| Error Propagation | Matrix-vector: $(\mathbf{W}^{(l+1)})^\top \boldsymbol{\delta}^{(l+1)}$ | Full convolution: $\sum_{k} \delta^{(l+1)}[n+k] w^{(l+1)}[k]$ |
| Bias Gradients | Direct error: $\boldsymbol{\delta}^{(l)}$ | Sum of errors: $\sum_{n} \delta^{(l)}[n]$ |
| Parameter Sharing | Each weight unique to one connection | Same kernel weights reused across the input |
| Computational Cost | $\sim O(M \times M')$ per layer | $\sim O(L \times M)$ per layer (much smaller for long inputs) |
**Key Insight**: Convolution in the forward pass becomes convolution in the backward pass!
→ Weight gradients: convolve error with input
→ Error propagation: full convolution of error with flipped kernel
→ Parameter sharing means fewer gradients to compute for the same number of connections
--- ## Multi-Channel Backpropagation
**Forward pass** for multi-channel convolution:
$ z_j^{(l)}[n] = \sum_{i=0}^{C_{in}-1} \sum_{k=0}^{M-1} w_{i,j}^{(l)}[k] h_i^{(l-1)}[n-k] + b_j^{(l)} $
where $i$ indexes input channels and $j$ indexes output channels (heads).
**Gradients w.r.t. kernel weights**:
$ \frac{\partial \mathcal{L}}{\partial w_{i,j}^{(l)}[k]} = \sum_{n} \delta_j^{(l)}[n] \cdot h_i^{(l-1)}[n-k] $
Compute separately for each input channel $i$ and output channel $j$ combination.
**Error propagation to previous layer**:
$ \frac{\partial \mathcal{L}}{\partial h_i^{(l-1)}[n]} = \sum_{j=0}^{C_{out}-1} \sum_{k=0}^{M-1} \delta_j^{(l)}[n+k] \cdot w_{i,j}^{(l)}[k] $
Sum contributions from all output channels $j$ that use input channel $i$.
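A sketch (arbitrary channel counts, quadratic loss so that $\delta = 2z$) that evaluates both multi-channel formulas with explicit loops and compares them to PyTorch autograd; remember that `torch.nn.Conv1d` is a cross-correlation, so the input index is $n+k$ rather than $n-k$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
C_in, C_out, M, L = 3, 4, 5, 20

conv = nn.Conv1d(C_in, C_out, kernel_size=M)
h = torch.randn(1, C_in, L, requires_grad=True)
z = conv(h)
loss = z.pow(2).sum()
loss.backward()

delta = 2 * z.detach()        # dL/dz for this simple quadratic loss
L_out = L - M + 1

# Kernel gradient: one correlation of delta_j with h_i per (i, j) pair.
grad_w = torch.zeros(C_out, C_in, M)
for j in range(C_out):
    for i in range(C_in):
        for k in range(M):
            grad_w[j, i, k] = (delta[0, j] * h.detach()[0, i, k:k + L_out]).sum()

# Input gradient: sum over all output channels j that use input channel i.
grad_h = torch.zeros(C_in, L)
for i in range(C_in):
    for j in range(C_out):
        for n in range(L_out):
            grad_h[i, n:n + M] += delta[0, j, n] * conv.weight.detach()[j, i]

print(torch.allclose(grad_w, conv.weight.grad, atol=1e-5),
      torch.allclose(grad_h, h.grad[0], atol=1e-5))     # True True
```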
--- ## Backpropagation with Stride
**Stride $s > 1$** complicates the gradient computation. Forward pass with stride:
$ z^{(l)}[n] = \sum_{k=0}^{M-1} w^{(l)}[k] \cdot h^{(l-1)}[n \cdot s - k] + b^{(l)} $
The gradient w.r.t. the input is easiest to see by first **upsampling** the error: insert $s-1$ zeros between the elements of $\delta^{(l)}$ (so only every $s$-th position of the upsampled error is nonzero), then apply the usual flipped-kernel error propagation. Each input position $m$ collects contributions only from the output positions whose strided window covers it:

$ \frac{\partial \mathcal{L}}{\partial h^{(l-1)}[m]} = \sum_{n \,:\, 0 \le n \cdot s - m < M} \delta^{(l)}[n] \cdot w^{(l)}[n \cdot s - m] $
--- ## Backpropagation Through Pooling Layers
**Max Pooling** forward pass:
$ y[n] = \max\limits_{0 \leq k < M} x[n \cdot s + k] $
**Backward pass**: Gradient flows only to the **maximum element**
$ \frac{\partial \mathcal{L}}{\partial x[m]} = \begin{cases} \frac{\partial \mathcal{L}}{\partial y[n]} & \text{if } x[m] = y[n] \text{ and } m = n \cdot s + k^* \\ 0 & \text{otherwise} \end{cases} $
where $k^* = \arg\max\limits_{k} x[n \cdot s + k]$.

**Routing gradients**: We must track which positions were selected during the forward pass!
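A tiny PyTorch sketch showing the routing: with a sum loss, every input position that was its window's maximum receives gradient 1, all others receive 0.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1, 1, 8, requires_grad=True)
pool = nn.MaxPool1d(kernel_size=2, stride=2)

y = pool(x)
y.sum().backward()     # dL/dy[n] = 1 for every pooled output

# Gradient is routed only to the argmax position inside each window.
print(x.detach()[0, 0])
print(x.grad[0, 0])    # 1.0 at each window's maximum, 0.0 elsewhere
```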
--- ## Transposed Convolution Backpropagation
**Forward pass** (transposed convolution):
$ y[n] = \sum_{k=0}^{M-1} w[k] \cdot x'[n+k] + b $
where $x'$ is input with padding $p = M - 1$ at edges.
**Backward pass**: The gradient w.r.t. input becomes a **standard convolution**!
$ \frac{\partial \mathcal{L}}{\partial x[m]} = \sum_{k=0}^{M-1} \delta[m-k] \cdot w[k] $
**Weight gradients**: Convolution of error with padded input
$ \frac{\partial \mathcal{L}}{\partial w[k]} = \sum_{n} \delta[n] \cdot x'[n+k] $
**Key Insight**: Transposed convolution and standard convolution are **transpose operations** of each other!
→ Forward transposed conv = Backward standard conv
→ Backward transposed conv = Forward standard conv
--- ## Computational Graph Perspective
Each output node $z[n]$ depends on:

- **Local input window**: $h[n-k]$ for $k = 0, \ldots, M-1$
- **Shared weights**: $w[k]$ for $k = 0, \ldots, M-1$
- **Shared bias**: $b$
**Gradient accumulation**: Since each weight is used multiple times (weight sharing), we must **sum** gradients from all positions:
$ \frac{\partial \mathcal{L}}{\partial w[k]} = \sum_{n} \frac{\partial \mathcal{L}}{\partial z[n]} \cdot \frac{\partial z[n]}{\partial w[k]} = \sum_{n} \delta[n] \cdot h[n-k] $
This is why weight gradients are computed via convolution!
--- # Python Implementation