# Tricks of the Trade

---

## Mathematical Foundations

**Calculus & Linear Algebra**: Basis for optimization algorithms and machine learning model operations

| Year | Milestone | Author(s) |
|------|-----------|-----------|
| 1676 | Chain Rule | Leibniz, G. W. |
| 1805 | Least Squares | Legendre, A. M. |
| 1809 | Normal Equations | Gauss, C. F. |
| 1847 | Gradient Descent | Cauchy, A. L. |
| 1858 | Eigenvalue Theory | Cayley & Hamilton |
| 1901 | PCA | Pearson, K. |
| 1951 | Stochastic Gradient Descent | Robbins & Monro |

**Probability & Statistics**: Basis for Bayesian methods, statistical inference, and generative models

| Year | Milestone | Author(s) |
|------|-----------|-----------|
| 1763 | Bayes' Theorem | Bayes, T. |
| 1812 | Bayesian Probability | Laplace, P. S. |
| 1815 | Gaussian Distribution | Gauss, C. F. |
| 1830 | Central Limit Theorem | Various |
| 1922 | Maximum Likelihood | Fisher, R. |

**Information & Computation**: Foundations of algorithmic thinking and information theory

| Year | Milestone | Author(s) |
|------|-----------|-----------|
| 1843 | First Computer Algorithm | Lovelace, A. |
| 1936 | Turing Machine | Turing, A. |
| 1947 | Linear Programming | Dantzig, G. |
| 1948 | Information Theory | Shannon, C. |
---

## Early History of Neural Networks

**Architectures & Layers**: Evolution of network architectures and layer innovations

| Year | Milestone | Author(s) |
|------|-----------|-----------|
| 1943 | Artificial Neurons | McCulloch & Pitts |
| 1957 | Perceptron | Rosenblatt, F. |
| 1965 | Deep Networks | Ivakhnenko & Lapa |
| 1979 | Convolutional Networks | Fukushima, K. |
| 1982 | Recurrent Networks | Hopfield |
| 1997 | LSTM | Hochreiter & Schmidhuber |
| 2006 | Deep Belief Networks | Hinton, G. et al. |
| 2012 | AlexNet | Krizhevsky et al. |
**Training & Optimization**: Methods for efficient learning and gradient-based optimization

| Year | Milestone | Author(s) |
|------|-----------|-----------|
| 1967 | Stochastic Gradient Descent for NN | Amari, S. |
| 1970 | Automatic Differentiation | Linnainmaa, S. |
| 1986 | Backpropagation for NN | Rumelhart, Hinton & Williams |
| 1992 | Weight Decay | Krogh & Hertz |
| 2009 | Convolutional DBNs & Prob. Max Pooling | Lee, H. et al. |
| 2010 | ReLU & Xavier Init | Nair & Hinton; Glorot & Bengio |
| 2012 | Dropout | Hinton, G. et al. |
**Software & Datasets**: Tools, platforms, and milestones that enabled practical deep learning

| Year | Milestone | Author(s) |
|------|-----------|-----------|
| 1997 | Deep Blue | IBM |
| 1998 | MNIST Dataset & LeNet-5 | LeCun, Y. et al. |
| 2002 | Torch Framework | Torch Team |
| 2007 | CUDA Platform | NVIDIA |
| 2009 | ImageNet Dataset | Deng, J. et al. |
| 2011 | Siri | Apple Inc. |
---

## The Deep Learning Era

**Deep Architectures**: Deep architectures and generative models transforming AI capabilities

| Year | Milestone | Author(s) |
|------|-----------|-----------|
| 2013 | Variational Autoencoders | Kingma et al. |
| 2014 | Generative Adversarial Nets | Goodfellow et al. |
| 2015 | ResNet & Diffusion | He et al. & Sohl-Dickstein et al. |
| 2016 | Style Transfer & WaveNet | Gatys & van den Oord |
| 2017 | Transformers | Vaswani et al. |
| 2021 | ViT & CLIP | Dosovitskiy & Radford |
| 2022 | Diffusion Transformer | Peebles & Xie |
| 2023 | Mamba | Gu & Dao |
**Training & Optimization**: Advanced learning techniques and representation learning breakthroughs

| Year | Milestone | Author(s) |
|------|-----------|-----------|
| 2013 | Word2Vec | Mikolov, T. et al. |
| 2014 | Attention Mechanism | Bahdanau, D. et al. |
| 2015 | BatchNorm & Adam | Ioffe & Kingma |
| 2016 | Layer Normalization | Ba, J. L. et al. |
| 2020 | DDPM | Ho, J. et al. |
**Software & Applications**: Practical deployment and mainstream adoption of deep learning systems

| Year | Milestone | Author(s) |
|------|-----------|-----------|
| 2016 | AlphaGo | Silver, D. et al. |
| 2017 | PyTorch | Paszke, A. et al. |
| 2018 | GPT-1 & BERT | Radford & Devlin |
| 2020 | GPT-3 | Brown, T. B. et al. |
| 2022 | ChatGPT & Stable Diffusion | OpenAI & Stability AI |
| 2023 | LLaMA | Touvron, H. et al. |
---

## Motivation for this Lecture

- Many fancy frameworks give the illusion that neural network training can magically solve data science problems with a few lines of code
- Just like other libraries or modules that abstract away complexity

```python
>>> your_data = # plug your awesome dataset here
>>> model = SuperCrossValidator(SuperDuper.fit, your_data, ResNet50, SGDOptimizer)
# conquer world here
```

```python
>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
```
Source: "A Recipe for Training Neural Networks" by Andrej Karpathy

Unfortunately, there is no magic network, normalization, or optimizer that fits all problems! It all depends on the data and the task at hand.
---

## Motivation for this Lecture

- Neural network training fails silently most of the time
- In code, if you plug an integer where a string is expected, you get an error
- You can easily unit test small parts of your code
- But how do you know whether your neural network is learning correctly?
- Your model can be syntactically correct and still contain logical bugs
- Often, even with such bugs, the model trains surprisingly well, but the performance is suboptimal
- This lecture covers practical tips to debug and optimize neural network training
- Don't rush: understand the mechanics and apply the tricks systematically
- Start with a simple baseline and add complexity incrementally
---

# How do we start?

---

## Become one with the Data

- Use a feature representation that makes sense for your data (use the knowledge from the MIRMLA course)
- Understand the data you are working with
- Visualize samples from the dataset
- Check for class imbalance
- Visualize feature distributions and pay special attention to outliers
- Normalize or standardize features if necessary
- Check for data leakage between the train and validation sets (see the sketch below)
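For example, a minimal sketch of such checks, assuming NumPy arrays `X` and `y`; the helpers `inspect_dataset` and `standardize` are illustrative names, not part of any library:

```python
import numpy as np

def inspect_dataset(X: np.ndarray, y: np.ndarray) -> None:
    # Class balance: large imbalances call for resampling or class weights
    classes, counts = np.unique(y, return_counts=True)
    print("class counts:", dict(zip(classes.tolist(), counts.tolist())))
    # Per-feature statistics: look for outliers and wildly different scales
    print("feature means:", X.mean(axis=0))
    print("feature stds: ", X.std(axis=0))

def standardize(X_train: np.ndarray, X_val: np.ndarray):
    # Fit statistics on the training set only to avoid leaking validation data
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0) + 1e-8
    return (X_train - mu) / sigma, (X_val - mu) / sigma
```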
---

## Set up a Simple Baseline Model

- Fix a random seed for reproducibility
- Start with a very simple "toy" model architecture
- Compute simple, human-understandable baseline metrics (e.g., accuracy, confusion matrix) on the train and validation sets (use k-fold cross-validation for small datasets)
- Verify the loss function and metrics at initialization (e.g., random predictions should yield the expected loss)
- Initialize weights sensibly (e.g., if you are regressing values with mean 100, initialize the last-layer bias to 100)
- Use a small subset (as little as 2 samples) of the train set to verify that the model can overfit it, i.e., that the loss goes to zero (see the sketch below)
- Analyze and visualize model outputs at different stages (e.g., attention maps, embeddings, feature maps)
- Increase the complexity of the model gradually and monitor the performance on the train and validation sets
- Visualize and analyze predictions on a fixed (unshuffled) set of validation samples after every epoch
- Check the weights and activations, as well as their gradients: compute statistics for the different layers (e.g., make sure they are not vanishing or exploding)
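As an illustration, a minimal PyTorch sketch of the seed and 2-sample overfitting checks; the dimensions and toy model are placeholders for your own setup:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fix the seed for reproducibility

x = torch.randn(2, 16)    # 2 samples with 16 features (stand-in for your data)
y = torch.tensor([0, 1])  # 2 labels

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Sanity check at initialization: for 2 balanced classes the loss should be
# roughly -log(1/2) ~= 0.693
print("initial loss:", loss_fn(model(x), y).item())

for _ in range(500):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
print("final loss:", loss.item())  # should be close to zero
```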
---

## Overfit
- Look into the related literature for similar problems and datasets and find an architecture that works well
- Do not use data augmentation or regularization at this stage
- The Adam optimizer is a good default choice for most problems, with a learning rate of 1e-3
- Make sure your model can overfit a small subset of the training data (e.g., 100 samples)
- Gradually increase the model complexity, one step at a time, until you can overfit the full training set
- Be careful not to overcomplicate the model too early
- Beware of learning rate schedules that depend on the number of epochs
- When training deep models, check for vanishing or exploding gradients (see the sketch below) and add residual connections if necessary
- When activation scales are unstable, consider using normalization layers
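One way to monitor gradient health is to log per-parameter gradient norms after `loss.backward()`; a minimal sketch, with `log_grad_norms` being an illustrative helper name:

```python
def log_grad_norms(model):
    # Call after loss.backward(): tiny norms in early layers hint at vanishing
    # gradients, huge norms at exploding gradients
    for name, p in model.named_parameters():
        if p.grad is not None:
            print(f"{name:40s} grad norm = {p.grad.norm().item():.3e}")
```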
---

## Regularize
- Once you can overfit the training set, try to improve the generalization performance
- The best regularization method is to get more data
- If that is not possible, try data augmentation techniques suitable for your data modality (applied only to the training set)
- Decrease the model complexity if possible
- Watch out for spuriously correlated features in the data and remove features that do not generalize well
- Add dropout, but be careful when combining dropout with batch normalization (see the sketch below)
- Try weight decay (L2 regularization) on the weights of the model
- Introduce early stopping based on the validation performance
- Transfer learning from a pretrained model can also help with regularization, as it acts as an inductive bias towards solutions that generalize well
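A minimal PyTorch sketch combining some of these regularizers (dropout, decoupled weight decay via AdamW, and an early-stopping pattern); the model, thresholds, and checkpoint path are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(p=0.3), nn.Linear(64, 2))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)  # decoupled L2

best_val, patience, bad_epochs = float("inf"), 10, 0
# Inside the training loop, after computing `val_loss` for the epoch:
#     if val_loss < best_val:
#         best_val, bad_epochs = val_loss, 0
#         torch.save(model.state_dict(), "best.pt")  # checkpoint the best model
#     else:
#         bad_epochs += 1
#         if bad_epochs >= patience:
#             break  # early stopping
```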
---

## Tune
- Once you have a working model with good generalization performance, tune the hyperparameters
- Have a good version control system in place to track experiments, e.g., [dvc](https://dvc.org/)
- Have a systematic way to log and visualize training and validation metrics, e.g., [tensorboard](https://www.tensorflow.org/tensorboard) or [wandb](https://wandb.ai/) (commercial)
- Optimize computational efficiency, e.g., use mixed precision training
- Use random search or Bayesian optimization instead of grid search, e.g., with [optuna](https://optuna.org/) (see the sketch below)
- Focus on tuning the learning rate first, as it has the largest impact on performance; try a learning rate finder and consider warmup strategies
- Then tune the batch size, model architecture, and regularization parameters
- Consider learning rate schedules, adaptive optimizers, or different input representations
- Monitor the training and validation performance closely to avoid overfitting during hyperparameter tuning
- Use ensembles of models or mixtures of experts to boost performance further
- Finally, let the model train for longer to see if the performance improves further, and use model checkpointing to save the best-performing model
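A minimal sketch of such a search with Optuna; `train_and_validate` is a hypothetical placeholder for your own training routine that returns a validation loss:

```python
import optuna

def train_and_validate(lr, weight_decay, batch_size):
    # Placeholder: run your training loop and return the validation loss
    return (lr - 1e-3) ** 2 + weight_decay + 1.0 / batch_size

def objective(trial):
    # Sample on a log scale where values span several orders of magnitude
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    weight_decay = trial.suggest_float("weight_decay", 1e-6, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128])
    return train_and_validate(lr, weight_decay, batch_size)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```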
---

# Tricks of the Trade

---

## Choice of Activation Functions
| Activation | Function | Typical Use Case | Network Type |
|------------|----------|------------------|--------------|
| ReLU | $\text{ReLU}(z) = \max(0, z)$ | Hidden layers (default choice) | CNNs, MLPs, ResNets |
| Leaky ReLU / PReLU | $\text{LeakyReLU}(z) = \max(\alpha z, z)$ | Hidden layers (when dying ReLU is an issue) | Deep CNNs, GANs |
| GELU | $\text{GELU}(z) = z \cdot \Phi(z)$ | Hidden layers in modern architectures | Transformers, BERT, GPT |
| Swish / SiLU | $\text{Swish}(z) = \frac{z}{1 + e^{-z}}$ | Hidden layers in deep networks | EfficientNet, modern CNNs |
| Tanh | $\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$ | Hidden layers, gates | RNNs, LSTMs, GRUs |
| Sigmoid | $\sigma(z) = \frac{1}{1 + e^{-z}}$ | Output layer (binary classification), gates | Binary classifiers, LSTM gates |
| Softmax | $\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$ | Output layer (multi-class classification) | Multi-class classifiers |
| Linear | $f(z) = z$ | Output layer (regression) | Regression models |
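A minimal PyTorch sketch of matching activations to their typical roles; the layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

hidden = nn.Sequential(nn.Linear(16, 32), nn.GELU())         # hidden layer (transformer-style)
binary_head = nn.Sequential(nn.Linear(32, 1), nn.Sigmoid())  # binary classification output
multi_head = nn.Linear(32, 10)   # multi-class: output raw logits, because
                                 # nn.CrossEntropyLoss applies log-softmax internally
regression_head = nn.Linear(32, 1)  # regression: linear output

x = torch.randn(4, 16)
print(multi_head(hidden(x)).shape)  # torch.Size([4, 10])
```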
---

## Choice of Initialization Schemes
| Initialization | Method | Typical Use Case | Network Type |
|----------------|--------|------------------|--------------|
| Xavier / Glorot | $\mathbf{W} \sim \mathcal{U}\left[-\sqrt{\frac{6}{n_{in} + n_{out}}}, \sqrt{\frac{6}{n_{in} + n_{out}}}\right]$ | Hidden layers with tanh/sigmoid activations | MLPs, shallow networks |
| He (Kaiming) | $\mathbf{W} \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right)$ | Hidden layers with ReLU activations | CNNs, ResNets, deep networks |
| LeCun | $\mathbf{W} \sim \mathcal{N}\left(0, \frac{1}{n_{in}}\right)$ | Hidden layers with SELU activations | Self-normalizing networks (designed to maintain mean and variance without normalization layers) |
| Orthogonal | $\mathbf{W}$ = orthogonal matrix | Recurrent connections | RNNs, LSTMs, GRUs |
| Zero | $\mathbf{W} = 0$ | Bias terms only | All networks (biases) |
| Constant | $\mathbf{W} = c$ | Specific layer requirements | Output layers (regression) |
**Key Principle:** Match the initialization to the activation function to maintain stable gradient flow

- Use He for ReLU and its variants
- Use Xavier for tanh/sigmoid
- Use Orthogonal for recurrent connections
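A minimal PyTorch sketch of applying this principle; the `init_weights` helper is an illustrative name:

```python
import torch.nn as nn

def init_weights(module):
    if isinstance(module, nn.Linear):
        # He/Kaiming for ReLU-family activations;
        # use nn.init.xavier_uniform_ instead for tanh/sigmoid networks
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
model.apply(init_weights)  # applies init_weights recursively to every submodule
```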
---

## Choice of Optimizers
| Optimizer | Update Rule | Typical Use Case | Network Type |
|-----------|-------------|------------------|--------------|
| Mini-batch SGD + Momentum | $\mathbf{m}_{t} = \beta \mathbf{m}_{t-1} + \nabla \mathcal{L}$ <br> $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \mathbf{m}_{t}$ | Computer vision, training from scratch (noisier updates can help escape poor local minima) | CNNs, ResNets, image classification |
| Mini-batch SGD + RMSprop | $\mathbf{v}_t = \beta \mathbf{v}_{t-1} + (1-\beta)(\nabla \mathcal{L})^2$ <br> $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\mathbf{v}_t + \epsilon}} \nabla \mathcal{L}$ | Recurrent networks, non-stationary objectives | RNNs, online learning |
| Adam (RMSprop + Momentum) | $\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\nabla \mathcal{L}$ <br> $\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2)(\nabla \mathcal{L})^2$ <br> $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\mathbf{v}_t + \epsilon}} \mathbf{m}_t$ | Default choice for most problems | Transformers, GANs, general purpose |
| AdamW | Adam + decoupled weight decay | Modern deep learning, large models | BERT, GPT, ViT, large-scale models |
**Key Principle:** Match the optimizer to your problem characteristics

- **Adam/AdamW**: Default choice for most modern architectures (LR ~ 1e-3 to 1e-4)
- **SGD + Momentum**: Best for CNNs when training from scratch (LR ~ 0.1 with a schedule)
- **RMSprop**: Good for RNNs and non-stationary problems
- **AdamW**: Preferred over Adam for large models with weight decay
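A minimal PyTorch sketch of the two most common setups; the model is a placeholder and the hyperparameter values are just typical starting points:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 2)  # placeholder model

# SGD + momentum: common for CNNs trained from scratch, usually with an LR schedule
sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

# AdamW: decoupled weight decay, a good default for transformers and large models
adamw = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-2)
```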
---

## Learning Rate Schedules
- LR schedules can significantly impact convergence and final performance
- Use step-based schedules (not epoch-based) for flexibility across batch sizes
| Schedule | Formula | Description |
|----------|---------|-------------|
| Step Decay | $\eta_t = \eta_0 \times \gamma^{\lfloor t / T \rfloor}$ | Simple baseline, works well for CNNs |
| Linear Decay | $\eta_t = \eta_0 - \frac{(\eta_0 - \eta_{min}) \cdot t}{T}$ | Linear decay from initial to minimum LR |
| Exponential Decay | $\eta_t = \eta_0 \times \gamma^t$ | Smooth continuous decay |
| Cosine Annealing | $\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)$ | Transformers, modern architectures, smoother than step decay |
| One Cycle Policy | Warmup, then cosine annealing | Fast convergence, good generalization, allows large learning rates, suits a limited training budget |
| Warm Restarts (SGDR) | $\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{\pi T_{cur}}{T_i}\right)\right)$ | Snapshot ensembling, escaping local minima, exploration |
Attention: Learning rate schedules interact differently with different optimizers; e.g., consider the momentum term when designing a schedule.
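A minimal PyTorch sketch of a step-based cosine schedule; the dummy loop only illustrates that the scheduler is stepped once per optimization step, not per epoch:

```python
import torch

model = torch.nn.Linear(16, 2)  # placeholder model
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
total_steps = 10_000
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=total_steps, eta_min=1e-6)

for step in range(total_steps):
    # ... forward pass, loss.backward(), gradient clipping, etc. would go here ...
    opt.step()    # placeholder update so this snippet runs standalone
    sched.step()  # advance the schedule once per step (not per epoch)
```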
---

## Residual Connections

- Researchers found that when stacking many layers, the training error first decreases and then increases, indicating a fundamental optimization issue
- Even with proper initialization, the gradient updates in early layers are very unpredictable and unstable
Source: Understanding Deep Learning (Prince)
- Residual connections (skip connections) help mitigate this issue by allowing gradients to flow directly through the network - essentially bypassing some layers
Where to place Residual Connections? (Source: Understanding Deep Learning, Prince)
- Residual connections have become a standard component in deep architectures (e.g., ResNets, Transformers) to facilitate training of very deep networks
- They help keep the loss landscape smooth and improve convergence
Source: Understanding Deep Learning (Prince)
- If the input and output dimensions differ, use a linear projection (1x1 convolution) to match dimensions before addition
Attention: When using residual connections, the variance of the outputs can increase, so consider using normalization layers to stabilize training.
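A minimal PyTorch sketch of a residual block with an optional 1×1 projection for mismatched dimensions:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 convolution matches the dimensions when in_ch != out_ch
        self.proj = nn.Identity() if in_ch == out_ch else nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return torch.relu(self.body(x) + self.proj(x))  # skip connection around the body

x = torch.randn(2, 32, 8, 8)
print(ResidualBlock(32, 64)(x).shape)  # torch.Size([2, 64, 8, 8])
```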
---

## Normalization Layers
- Normalization layers stabilize training by controlling the distribution of activations across layers, which mitigates internal covariate shift
- **General form:** $\hat{x} = \gamma \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$ (where $\gamma, \beta$ are learnable; RMS Norm omits $\mu$ and $\beta$)
| Normalization | Normalized Over | When to Use | Network Type |
|---------------|-----------------|-------------|--------------|
| Batch Norm | Across the batch: mean/variance over all samples for each feature | Large batch sizes, CNNs | ResNets, VGG |
| Layer Norm | Across features: mean/variance over all features in each sample | Small batches, sequences | Transformers, RNNs, NLP |
| Instance Norm | Across spatial dimensions (H×W per channel, per sample): normalizes each channel independently | Style transfer, GANs | Image generation, artistic style |
| Group Norm | Across channel groups + spatial dimensions: divides channels into groups | Small batches, alternative to BN | Object detection, segmentation |
| RMS Norm | Across features (like Layer Norm but without mean centering): normalizes by the RMS only | Transformers, efficiency | LLMs, modern transformers |
Source: Understanding Deep Learning (Prince)
**Key Principles:**

- Batch Norm for CNNs with large batches
- Layer Norm for transformers and RNNs
- Avoid Batch Norm + Dropout together (variance issues)
- Layer Norm + Dropout works well (common in Transformers)
- Place normalization after the activation in residual blocks (post-activation)
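A minimal PyTorch sketch showing which tensor layouts these layers expect and which axes they normalize over:

```python
import torch
import torch.nn as nn

images = torch.randn(8, 16, 32, 32)  # (batch, channels, H, W)
tokens = torch.randn(8, 128, 512)    # (batch, sequence, features)

bn = nn.BatchNorm2d(16)                           # stats over (batch, H, W) per channel
gn = nn.GroupNorm(num_groups=4, num_channels=16)  # stats per group of channels, batch-independent
ln = nn.LayerNorm(512)                            # stats over the feature dimension per token

print(bn(images).shape, gn(images).shape, ln(tokens).shape)
```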
---

## Regularization Techniques

- Regularization prevents overfitting by constraining model complexity or adding controlled noise during training
| Technique | Method | Typical Values | When to Use |
|-----------|--------|----------------|-------------|
| Dropout | Randomly zero neurons with probability $p$ | $p = 0.2$ to $0.5$ | MLPs; avoid with Batch Norm |
| Weight Decay (L2) | Add $\lambda \lVert \mathbf{W} \rVert^2$ to the loss | $\lambda = 10^{-4}$ to $10^{-5}$ | All networks; use with AdamW |
| Data Augmentation | Transform inputs (crop, flip, noise, etc.) | Task-specific | Limited data, computer vision, audio |
| Early Stopping | Stop when the validation loss stops improving | Patience: 5-20 epochs | All tasks, prevents overfitting |
| Label Smoothing | Soften one-hot labels: $\tilde{y} = (1-\alpha)y + \alpha/K$ | $\alpha = 0.1$ | Classification, improves calibration |
**Best Practices:**

- Start without regularization and overfit first
- Add data augmentation before other techniques
- Use weight decay with all optimizers
- Combine multiple techniques carefully, as their interactions can degrade performance
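A minimal sketch of two of these techniques in PyTorch/torchvision (label smoothing requires a reasonably recent PyTorch version; the transform values are typical for 32×32 images and are only an assumption):

```python
import torch.nn as nn
from torchvision import transforms

# Label smoothing: softens one-hot targets inside the cross-entropy loss
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)

# Data augmentation: apply to the training set only
train_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
])
```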
---

## Transfer Learning & Pretrained Models

- Transfer learning leverages pretrained models to improve performance on new tasks with less data and computation
- **Key Insight:** Features learned on large datasets transfer well to related tasks, especially lower-level features
| Approach | Method | When to Use |
|----------|--------|-------------|
| Feature Extraction | Freeze pretrained layers, train only the new head | Small dataset, similar domain |
| Fine-tuning | Unfreeze layers, train with a small LR (1e-5 to 1e-4) | Medium/large dataset, related domain |
| Discriminative LR | Lower LR for early layers, higher for the head | Avoids catastrophic forgetting |
**Popular Sources:** Vision (ImageNet, CLIP) • Audio (AudioSet, Wav2Vec 2.0, Whisper) • Text (BERT, GPT) • Multi-modal (CLIP, DALL-E)
**Best Practices:**

- Match the input preprocessing to the pretrained model's requirements
- Consider domain similarity when choosing which layers to transfer
- Use a lower LR to preserve pretrained features
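A minimal sketch with a torchvision backbone, assuming an ImageNet-pretrained ResNet-50 and a hypothetical 10-class target task:

```python
import torch
from torchvision import models

# Downloads ImageNet weights on first use
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Feature extraction: freeze the pretrained backbone, train only a new head
for p in model.parameters():
    p.requires_grad = False
model.fc = torch.nn.Linear(model.fc.in_features, 10)  # new task-specific head (trainable)

# Discriminative LRs: unfreeze the last stage with a much smaller LR than the head
for p in model.layer4.parameters():
    p.requires_grad = True
opt = torch.optim.AdamW([
    {"params": model.layer4.parameters(), "lr": 1e-5},
    {"params": model.fc.parameters(),     "lr": 1e-3},
])
```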
---

# Python Implementation