# Tricks of the Trade

---

## Mathematical Foundations

**Calculus & Linear Algebra**: Basis for optimization algorithms and machine learning model operations

| Year | Milestone | Author(s) |
|------|-----------|-----------|
| 1676 | Chain Rule | Leibniz, G. W. |
| 1805 | Least Squares | Legendre, A. M. |
| 1809 | Normal Equations | Gauss, C. F. |
| 1847 | Gradient Descent | Cauchy, A. L. |
| 1858 | Eigenvalue Theory | Cayley & Hamilton |
| 1901 | PCA | Pearson, K. |
| 1951 | Stochastic Gradient Descent | Robbins & Monro |

**Probability & Statistics**: Basis for Bayesian methods, statistical inference, and generative models

| Year | Milestone | Author(s) |
|------|-----------|-----------|
| 1763 | Bayes' Theorem | Bayes, T. |
| 1812 | Bayesian Probability | Laplace, P. S. |
| 1815 | Gaussian Distribution | Gauss, C. F. |
| 1830 | Central Limit Theorem | Various |
| 1922 | Maximum Likelihood | Fisher, R. |

**Information & Computation**: Foundations of algorithmic thinking and information theory

| Year | Milestone | Author(s) |
|------|-----------|-----------|
| 1843 | First Computer Algorithm | Lovelace, A. |
| 1936 | Turing Machine | Turing, A. |
| 1947 | Linear Programming | Dantzig, G. |
| 1948 | Information Theory | Shannon, C. |
---

## Early History of Neural Networks

**Architectures & Layers**: Evolution of network architectures and layer innovations

| Year | Milestone | Author(s) |
|------|-----------|-----------|
| 1943 | Artificial Neurons | McCulloch & Pitts |
| 1957 | Perceptron | Rosenblatt, F. |
| 1965 | Deep Networks | Ivakhnenko & Lapa |
| 1979 | Convolutional Networks | Fukushima, K. |
| 1982 | Recurrent Networks | Hopfield |
| 1997 | LSTM | Hochreiter & Schmidhuber |
| 2006 | Deep Belief Networks | Hinton, G. et al. |
| 2012 | AlexNet | Krizhevsky et al. |
**Training & Optimization**: Methods for efficient learning and gradient-based optimization

| Year | Milestone | Author(s) |
|------|-----------|-----------|
| 1967 | Stochastic Gradient Descent for NN | Amari, S. |
| 1970 | Automatic Differentiation | Linnainmaa, S. |
| 1986 | Backpropagation for NN | Rumelhart, Hinton & Williams |
| 1992 | Weight Decay | Krogh & Hertz |
| 2009 | Convolutional DBNs & Prob. Max Pooling | Lee, H. et al. |
| 2010 | ReLU & Xavier Init | Nair & Hinton; Glorot & Bengio |
| 2012 | Dropout | Hinton, G. et al. |
**Software & Datasets**: Tools, platforms, and milestones that enabled practical deep learning

| Year | Milestone | Author(s) |
|------|-----------|-----------|
| 1997 | Deep Blue | IBM |
| 1998 | MNIST Dataset & LeNet-5 | LeCun, Y. et al. |
| 2002 | Torch Framework | Torch Team |
| 2007 | CUDA Platform | NVIDIA |
| 2009 | ImageNet Dataset | Deng, J. et al. |
| 2011 | Siri | Apple Inc. |
---

## The Deep Learning Era

**Deep Architectures**: Deep architectures and generative models transforming AI capabilities

| Year | Milestone | Author(s) |
|------|-----------|-----------|
| 2013 | Variational Autoencoders | Kingma et al. |
| 2014 | Generative Adversarial Nets | Goodfellow et al. |
| 2015 | ResNet & Diffusion | He et al. & Sohl-Dickstein et al. |
| 2016 | Style Transfer & WaveNet | Gatys & van den Oord |
| 2017 | Transformers | Vaswani et al. |
| 2021 | ViT & CLIP | Dosovitskiy & Radford |
| 2022 | Diffusion Transformer | Peebles & Xie |
| 2023 | Mamba | Gu & Dao |
**Training & Optimization**: Advanced learning techniques and representation learning breakthroughs

| Year | Milestone | Author(s) |
|------|-----------|-----------|
| 2013 | Word2Vec | Mikolov, T. et al. |
| 2014 | Attention Mechanism | Bahdanau, D. et al. |
| 2015 | BatchNorm & Adam | Ioffe & Kingma |
| 2016 | Layer Normalization | Ba, J. L. et al. |
| 2020 | DDPM | Ho, J. et al. |
**Software & Applications**: Practical deployment and mainstream adoption of deep learning systems

| Year | Milestone | Author(s) |
|------|-----------|-----------|
| 2016 | AlphaGo | Silver, D. et al. |
| 2017 | PyTorch | Paszke, A. et al. |
| 2018 | GPT-1 & BERT | Radford & Devlin |
| 2020 | GPT-3 | Brown, T. B. et al. |
| 2022 | ChatGPT & Stable Diffusion | OpenAI & Stability AI |
| 2023 | LLaMA | Touvron, H. et al. |
---

## Motivation for this Lecture

- Many fancy frameworks give the illusion that neural network training can magically solve data science problems with a few lines of code
- Just like other libraries or modules that abstract away complexity

```python
>>> your_data = # plug your awesome dataset here
>>> model = SuperCrossValidator(SuperDuper.fit, your_data, ResNet50, SGDOptimizer)
# conquer world here
```

```python
>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
```
Source: "A Recipe for Training Neural Networks" by Andrej Karpathy

Unfortunately, there is no magic network, normalization, or optimizer that fits all problems! It all depends on the data and the task at hand.
---

## Motivation for this Lecture

- Neural network training fails silently most of the time
- In code, if you plug an integer where a string is expected, you get an error
- You can easily unit test small parts of your code
- But how do you know whether your neural network is learning correctly?
- Your model can be syntactically correct and still contain logical bugs
- Often, even with such bugs, the model trains surprisingly well, but the performance is suboptimal
- This lecture covers practical tips to debug and optimize neural network training
- Don't rush: understand the mechanics and apply the tricks systematically
- Start with a simple baseline and add complexity incrementally
---

# How do we start?

---

## Become one with the Data

- Use a feature representation that makes sense for your data (use the knowledge from the MIRMLA course)
- Understand the data you are working with
- Visualize samples from the dataset
- Check for class imbalance
- Visualize feature distributions and pay special attention to outliers
- Normalize or standardize features if necessary
- Check for data leakage between the train and validation sets (see the sketch below)
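For example, a minimal sketch of such checks, assuming NumPy arrays `X` and `y`; the helpers `inspect_dataset` and `standardize` are illustrative names, not part of any library:

```python
import numpy as np

def inspect_dataset(X: np.ndarray, y: np.ndarray) -> None:
    # Class balance: large imbalances call for resampling or class weights
    classes, counts = np.unique(y, return_counts=True)
    print("class counts:", dict(zip(classes.tolist(), counts.tolist())))
    # Per-feature statistics: look for outliers and wildly different scales
    print("feature means:", X.mean(axis=0))
    print("feature stds: ", X.std(axis=0))

def standardize(X_train: np.ndarray, X_val: np.ndarray):
    # Fit statistics on the training set only to avoid leaking validation data
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0) + 1e-8
    return (X_train - mu) / sigma, (X_val - mu) / sigma
```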
---

## Set up a Simple Baseline Model

- Fix a random seed for reproducibility
- Start with a very simple "toy" model architecture
- Compute simple, human-understandable baseline metrics (e.g., accuracy, confusion matrix) on the train and validation sets (use k-fold cross-validation for small datasets)
- Verify the loss function and metrics at initialization (e.g., random predictions should yield the expected loss)
- Initialize weights sensibly (e.g., if you are regressing values with mean 100, initialize the last-layer bias to 100)
- Use a small subset (as little as 2 samples) of the train set to verify that the model can overfit it, i.e., that the loss goes to zero (see the sketch below)
- Analyze and visualize model outputs at different stages (e.g., attention maps, embeddings, feature maps)
- Increase the complexity of the model gradually and monitor the performance on the train and validation sets
- Visualize and analyze predictions on a fixed (unshuffled) set of validation samples after every epoch
- Check the weights and activations, as well as their gradients: compute statistics for the different layers (e.g., make sure they are not vanishing or exploding)
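As an illustration, a minimal PyTorch sketch of the seed and 2-sample overfitting checks; the dimensions and toy model are placeholders for your own setup:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fix the seed for reproducibility

x = torch.randn(2, 16)    # 2 samples with 16 features (stand-in for your data)
y = torch.tensor([0, 1])  # 2 labels

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Sanity check at initialization: for 2 balanced classes the loss should be
# roughly -log(1/2) ~= 0.693
print("initial loss:", loss_fn(model(x), y).item())

for _ in range(500):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
print("final loss:", loss.item())  # should be close to zero
```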
---

## Overfit
- Look into the related literature for similar problems and datasets and find an architecture that works well
- Do not use data augmentation or regularization at this stage
- The Adam optimizer is a good default choice for most problems, with a learning rate of 1e-3
- Make sure your model can overfit a small subset of the training data (e.g., 100 samples)
- Gradually increase the model complexity, one step at a time, until you can overfit the full training set
- Be careful not to overcomplicate the model too early
- Beware of learning rate schedules that depend on the number of epochs
- When training deep models, check for vanishing or exploding gradients (see the sketch below) and add residual connections if necessary
- When activation scales are unstable, consider using normalization layers
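One way to monitor gradient health is to log per-parameter gradient norms after `loss.backward()`; a minimal sketch, with `log_grad_norms` being an illustrative helper name:

```python
def log_grad_norms(model):
    # Call after loss.backward(): tiny norms in early layers hint at vanishing
    # gradients, huge norms at exploding gradients
    for name, p in model.named_parameters():
        if p.grad is not None:
            print(f"{name:40s} grad norm = {p.grad.norm().item():.3e}")
```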
---

## Regularize
- Once you can overfit the training set, try to improve the generalization performance
- The best regularization method is to get more data
- If that is not possible, try data augmentation techniques suitable for your data modality (applied only to the training set)
- Decrease the model complexity if possible
- Watch out for spuriously correlated features in the data and remove features that do not generalize well
- Add dropout, but be careful when combining dropout with batch normalization (see the sketch below)
- Try weight decay (L2 regularization) on the weights of the model
- Introduce early stopping based on the validation performance
- Transfer learning from a pretrained model can also help with regularization, as it acts as an inductive bias towards solutions that generalize well
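A minimal PyTorch sketch combining some of these regularizers (dropout, decoupled weight decay via AdamW, and an early-stopping pattern); the model, thresholds, and checkpoint path are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(p=0.3), nn.Linear(64, 2))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)  # decoupled L2

best_val, patience, bad_epochs = float("inf"), 10, 0
# Inside the training loop, after computing `val_loss` for the epoch:
#     if val_loss < best_val:
#         best_val, bad_epochs = val_loss, 0
#         torch.save(model.state_dict(), "best.pt")  # checkpoint the best model
#     else:
#         bad_epochs += 1
#         if bad_epochs >= patience:
#             break  # early stopping
```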
---

## Tune
- Once you have a working model with good generalization performance, tune the hyperparameters
- Have a good version control system in place to track experiments, e.g., [dvc](https://dvc.org/)
- Have a systematic way to log and visualize training and validation metrics, e.g., [tensorboard](https://www.tensorflow.org/tensorboard) or [wandb](https://wandb.ai/) (commercial)
- Optimize computational efficiency, e.g., use mixed precision training
- Use random search or Bayesian optimization instead of grid search, e.g., with [optuna](https://optuna.org/) (see the sketch below)
- Focus on tuning the learning rate first, as it has the largest impact on performance; try a learning rate finder and consider warmup strategies
- Then tune the batch size, model architecture, and regularization parameters
- Consider learning rate schedules, adaptive optimizers, or different input representations
- Monitor the training and validation performance closely to avoid overfitting during hyperparameter tuning
- Use ensembles of models or mixtures of experts to boost performance further
- Finally, let the model train for longer to see if the performance improves further, and use model checkpointing to save the best-performing model
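A minimal sketch of such a search with Optuna; `train_and_validate` is a hypothetical placeholder for your own training routine that returns a validation loss:

```python
import optuna

def train_and_validate(lr, weight_decay, batch_size):
    # Placeholder: run your training loop and return the validation loss
    return (lr - 1e-3) ** 2 + weight_decay + 1.0 / batch_size

def objective(trial):
    # Sample on a log scale where values span several orders of magnitude
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    weight_decay = trial.suggest_float("weight_decay", 1e-6, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128])
    return train_and_validate(lr, weight_decay, batch_size)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```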
---

# Tricks of the Trade

---

## Choice of Activation Functions
| Activation | Function | Typical Use Case | Network Type |
|------------|----------|------------------|--------------|
| ReLU | $\text{ReLU}(z) = \max(0, z)$ | Hidden layers (default choice) | CNNs, MLPs, ResNets |
| Leaky ReLU / PReLU | $\text{LeakyReLU}(z) = \max(\alpha z, z)$ | Hidden layers (when dying ReLU is an issue) | Deep CNNs, GANs |
| GELU | $\text{GELU}(z) = z \cdot \Phi(z)$ | Hidden layers in modern architectures | Transformers, BERT, GPT |
| Swish / SiLU | $\text{Swish}(z) = \frac{z}{1 + e^{-z}}$ | Hidden layers in deep networks | EfficientNet, modern CNNs |
| Tanh | $\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$ | Hidden layers, gates | RNNs, LSTMs, GRUs |
| Sigmoid | $\sigma(z) = \frac{1}{1 + e^{-z}}$ | Output layer (binary classification), gates | Binary classifiers, LSTM gates |
| Softmax | $\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$ | Output layer (multi-class classification) | Multi-class classifiers |
| Linear | $f(z) = z$ | Output layer (regression) | Regression models |
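A minimal PyTorch sketch of matching activations to their typical roles; the layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

hidden = nn.Sequential(nn.Linear(16, 32), nn.GELU())         # hidden layer (transformer-style)
binary_head = nn.Sequential(nn.Linear(32, 1), nn.Sigmoid())  # binary classification output
multi_head = nn.Linear(32, 10)   # multi-class: output raw logits, because
                                 # nn.CrossEntropyLoss applies log-softmax internally
regression_head = nn.Linear(32, 1)  # regression: linear output

x = torch.randn(4, 16)
print(multi_head(hidden(x)).shape)  # torch.Size([4, 10])
```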
---

## Choice of Initialization Schemes
| Initialization | Method | Typical Use Case | Network Type |
|----------------|--------|------------------|--------------|
| Xavier / Glorot | $\mathbf{W} \sim \mathcal{U}\left[-\sqrt{\frac{6}{n_{in} + n_{out}}}, \sqrt{\frac{6}{n_{in} + n_{out}}}\right]$ | Hidden layers with tanh/sigmoid activations | MLPs, shallow networks |
| He (Kaiming) | $\mathbf{W} \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right)$ | Hidden layers with ReLU activations | CNNs, ResNets, deep networks |
| LeCun | $\mathbf{W} \sim \mathcal{N}\left(0, \frac{1}{n_{in}}\right)$ | Hidden layers with SELU activations | Self-normalizing networks (designed to maintain mean and variance without normalization layers) |
| Orthogonal | $\mathbf{W}$ = orthogonal matrix | Recurrent connections | RNNs, LSTMs, GRUs |
| Zero | $\mathbf{W} = 0$ | Bias terms only | All networks (biases) |
| Constant | $\mathbf{W} = c$ | Specific layer requirements | Output layers (regression) |
**Key Principle:** Match the initialization to the activation function to maintain stable gradient flow

- Use He for ReLU and its variants
- Use Xavier for tanh/sigmoid
- Use Orthogonal for recurrent connections
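A minimal PyTorch sketch of applying this principle; the `init_weights` helper is an illustrative name:

```python
import torch.nn as nn

def init_weights(module):
    if isinstance(module, nn.Linear):
        # He/Kaiming for ReLU-family activations;
        # use nn.init.xavier_uniform_ instead for tanh/sigmoid networks
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
model.apply(init_weights)  # applies init_weights recursively to every submodule
```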
---

## Choice of Optimizers
| Optimizer | Update Rule | Typical Use Case | Network Type |
|-----------|-------------|------------------|--------------|
| Mini-batch SGD + Momentum | $\mathbf{m}_{t} = \beta \mathbf{m}_{t-1} + \nabla \mathcal{L}$ <br> $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \mathbf{m}_{t}$ | Computer vision, training from scratch (noisier updates can help escape poor local minima) | CNNs, ResNets, image classification |
| Mini-batch SGD + RMSprop | $\mathbf{v}_t = \beta \mathbf{v}_{t-1} + (1-\beta)(\nabla \mathcal{L})^2$ <br> $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\mathbf{v}_t + \epsilon}} \nabla \mathcal{L}$ | Recurrent networks, non-stationary objectives | RNNs, online learning |
| Adam (RMSprop + Momentum) | $\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\nabla \mathcal{L}$ <br> $\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2)(\nabla \mathcal{L})^2$ <br> $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\mathbf{v}_t + \epsilon}} \mathbf{m}_t$ | Default choice for most problems | Transformers, GANs, general purpose |
| AdamW | Adam + decoupled weight decay | Modern deep learning, large models | BERT, GPT, ViT, large-scale models |
**Key Principle:** Match the optimizer to your problem characteristics

- **Adam/AdamW**: Default choice for most modern architectures (LR ~ 1e-3 to 1e-4)
- **SGD + Momentum**: Best for CNNs when training from scratch (LR ~ 0.1 with a schedule)
- **RMSprop**: Good for RNNs and non-stationary problems
- **AdamW**: Preferred over Adam for large models with weight decay
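A minimal PyTorch sketch of the two most common setups; the model is a placeholder and the hyperparameter values are just typical starting points:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 2)  # placeholder model

# SGD + momentum: common for CNNs trained from scratch, usually with an LR schedule
sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

# AdamW: decoupled weight decay, a good default for transformers and large models
adamw = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-2)
```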
---

## Learning Rate Schedules
- LR schedules can significantly impact convergence and final performance
- Use step-based schedules (not epoch-based) for flexibility across batch sizes
| Schedule | Formula | Description |
|----------|---------|-------------|
| Step Decay | $\eta_t = \eta_0 \times \gamma^{\lfloor t / T \rfloor}$ | Simple baseline, works well for CNNs |
| Linear Decay | $\eta_t = \eta_0 - \frac{(\eta_0 - \eta_{min}) \cdot t}{T}$ | Linear decay from initial to minimum LR |
| Exponential Decay | $\eta_t = \eta_0 \times \gamma^t$ | Smooth continuous decay |
| Cosine Annealing | $\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)$ | Transformers, modern architectures, smoother than step decay |
| One Cycle Policy | Warmup, then cosine annealing | Fast convergence, good generalization, allows large learning rates, suits a limited training budget |
| Warm Restarts (SGDR) | $\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{\pi T_{cur}}{T_i}\right)\right)$ | Snapshot ensembling, escaping local minima, exploration |
Attention: Learning rate schedules interact differently with different optimizers; e.g., consider the momentum term when designing a schedule.
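A minimal PyTorch sketch of a step-based cosine schedule; the dummy loop only illustrates that the scheduler is stepped once per optimization step, not per epoch:

```python
import torch

model = torch.nn.Linear(16, 2)  # placeholder model
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
total_steps = 10_000
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=total_steps, eta_min=1e-6)

for step in range(total_steps):
    # ... forward pass, loss.backward(), gradient clipping, etc. would go here ...
    opt.step()    # placeholder update so this snippet runs standalone
    sched.step()  # advance the schedule once per step (not per epoch)
```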
---

## Residual Connections

- Researchers found that when stacking many layers, the training error first decreases and then increases, indicating a fundamental optimization issue
- Even with proper initialization, the gradient updates in early layers are very unpredictable and unstable
Source: Understanding Deep Learning (Prince)
- Residual connections (skip connections) help mitigate this issue by allowing gradients to flow directly through the network - essentially bypassing some layers
Where to place Residual Connections? (Source: Understanding Deep Learning, Prince)
- Residual connections have become a standard component in deep architectures (e.g., ResNets, Transformers) to facilitate training of very deep networks
- They help keep the loss landscape smooth and improve convergence
Source: Understanding Deep Learning (Prince)
- If the input and output dimensions differ, use a linear projection (1x1 convolution) to match dimensions before addition
Attention: When using residual connections, the variance of the outputs can increase, so consider using normalization layers to stabilize training.
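A minimal PyTorch sketch of a residual block with an optional 1×1 projection for mismatched dimensions:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 convolution matches the dimensions when in_ch != out_ch
        self.proj = nn.Identity() if in_ch == out_ch else nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return torch.relu(self.body(x) + self.proj(x))  # skip connection around the body

x = torch.randn(2, 32, 8, 8)
print(ResidualBlock(32, 64)(x).shape)  # torch.Size([2, 64, 8, 8])
```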
---

## Normalization Layers
- Normalization layers stabilize training by controlling the distribution of activations across layers, which mitigates internal covariate shift
- **General form:** $\hat{x} = \gamma \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$ (where $\gamma, \beta$ are learnable; RMS Norm omits $\mu$ and $\beta$)
| Normalization | Normalized Over | When to Use | Network Type |
|---------------|-----------------|-------------|--------------|
| Batch Norm | Across the batch: mean/variance over all samples for each feature | Large batch sizes, CNNs | ResNets, VGG |
| Layer Norm | Across features: mean/variance over all features in each sample | Small batches, sequences | Transformers, RNNs, NLP |
| Instance Norm | Across spatial dimensions (H×W per channel, per sample): normalizes each channel independently | Style transfer, GANs | Image generation, artistic style |
| Group Norm | Across channel groups + spatial dimensions: divides channels into groups | Small batches, alternative to BN | Object detection, segmentation |
| RMS Norm | Across features (like Layer Norm but without mean centering): normalizes by the RMS only | Transformers, efficiency | LLMs, modern transformers |
Source: Understanding Deep Learning (Prince)
**Key Principles:**

- Batch Norm for CNNs with large batches
- Layer Norm for transformers and RNNs
- Avoid Batch Norm + Dropout together (variance issues)
- Layer Norm + Dropout works well (common in Transformers)
- Place normalization after the activation in residual blocks (post-activation)
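A minimal PyTorch sketch showing which tensor layouts these layers expect and which axes they normalize over:

```python
import torch
import torch.nn as nn

images = torch.randn(8, 16, 32, 32)  # (batch, channels, H, W)
tokens = torch.randn(8, 128, 512)    # (batch, sequence, features)

bn = nn.BatchNorm2d(16)                           # stats over (batch, H, W) per channel
gn = nn.GroupNorm(num_groups=4, num_channels=16)  # stats per group of channels, batch-independent
ln = nn.LayerNorm(512)                            # stats over the feature dimension per token

print(bn(images).shape, gn(images).shape, ln(tokens).shape)
```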
---

## Regularization Techniques

- Regularization prevents overfitting by constraining model complexity or adding controlled noise during training
| Technique | Method | Typical Values | When to Use |
|-----------|--------|----------------|-------------|
| Dropout | Randomly zero neurons with probability $p$ | $p = 0.2$ to $0.5$ | MLPs; avoid with Batch Norm |
| Weight Decay (L2) | Add $\lambda \lVert \mathbf{W} \rVert^2$ to the loss | $\lambda = 10^{-4}$ to $10^{-5}$ | All networks; use with AdamW |
| Data Augmentation | Transform inputs (crop, flip, noise, etc.) | Task-specific | Limited data, computer vision, audio |
| Early Stopping | Stop when the validation loss stops improving | Patience: 5-20 epochs | All tasks, prevents overfitting |
| Label Smoothing | Soften one-hot labels: $\tilde{y} = (1-\alpha)y + \alpha/K$ | $\alpha = 0.1$ | Classification, improves calibration |
**Best Practices:**

- Start without regularization and overfit first
- Add data augmentation before other techniques
- Use weight decay with all optimizers
- Combine multiple techniques carefully, as their interactions can degrade performance
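A minimal sketch of two of these techniques in PyTorch/torchvision (label smoothing requires a reasonably recent PyTorch version; the transform values are typical for 32×32 images and are only an assumption):

```python
import torch.nn as nn
from torchvision import transforms

# Label smoothing: softens one-hot targets inside the cross-entropy loss
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)

# Data augmentation: apply to the training set only
train_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
])
```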
---

## Transfer Learning & Pretrained Models

- Transfer learning leverages pretrained models to improve performance on new tasks with less data and computation
- **Key Insight:** Features learned on large datasets transfer well to related tasks, especially lower-level features
| Approach | Method | When to Use |
|----------|--------|-------------|
| Feature Extraction | Freeze pretrained layers, train only the new head | Small dataset, similar domain |
| Fine-tuning | Unfreeze layers, train with a small LR (1e-5 to 1e-4) | Medium/large dataset, related domain |
| Discriminative LR | Lower LR for early layers, higher for the head | Avoids catastrophic forgetting |
**Popular Sources:** Vision (ImageNet, CLIP) • Audio (AudioSet, Wav2Vec 2.0, Whisper) • Text (BERT, GPT) • Multi-modal (CLIP, DALL-E)
**Best Practices:**

- Match the input preprocessing to the pretrained model's requirements
- Consider domain similarity when choosing which layers to transfer
- Use a lower LR to preserve pretrained features
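A minimal sketch with a torchvision backbone, assuming an ImageNet-pretrained ResNet-50 and a hypothetical 10-class target task:

```python
import torch
from torchvision import models

# Downloads ImageNet weights on first use
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Feature extraction: freeze the pretrained backbone, train only a new head
for p in model.parameters():
    p.requires_grad = False
model.fc = torch.nn.Linear(model.fc.in_features, 10)  # new task-specific head (trainable)

# Discriminative LRs: unfreeze the last stage with a much smaller LR than the head
for p in model.layer4.parameters():
    p.requires_grad = True
opt = torch.optim.AdamW([
    {"params": model.layer4.parameters(), "lr": 1e-5},
    {"params": model.fc.parameters(),     "lr": 1e-3},
])
```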
---

# Python Implementation