# Variational Autoencoder --- ## Mathematical Foundations
**Calculus & Linear Algebra:** basis for optimization algorithms and machine learning model operations

- 1676: Chain Rule (Leibniz, G. W.)
- 1805: Least Squares (Legendre, A. M.)
- 1809: Normal Equations (Gauss, C. F.)
- 1847: Gradient Descent (Cauchy, A. L.)
- 1858: Eigenvalue Theory (Cayley & Hamilton)
- 1901: PCA (Pearson, K.)
- 1951: Stochastic Gradient Descent (Robbins & Monro)
**Probability & Statistics:** basis for Bayesian methods, statistical inference, and generative models

- 1763: Bayes' Theorem (Bayes, T.)
- 1812: Bayesian Probability (Laplace, P. S.)
- 1815: Gaussian Distribution (Gauss, C. F.)
- 1830: Central Limit Theorem (various)
- 1922: Maximum Likelihood (Fisher, R.)
**Information & Computation:** foundations of algorithmic thinking and information theory

- 1843: First Computer Algorithm (Lovelace, A.)
- 1936: Turing Machine (Turing, A.)
- 1947: Linear Programming (Dantzig, G.)
- 1948: Information Theory (Shannon, C.)
--- ## Early History of Neural Networks
**Architectures & Layers:** evolution of network architectures and layer innovations

- 1943: Artificial Neurons (McCulloch & Pitts)
- 1957: Perceptron (Rosenblatt, F.)
- 1965: Deep Networks (Ivakhnenko & Lapa)
- 1979: Convolutional Networks (Fukushima, K.)
- 1982: Recurrent Networks (Hopfield, J.)
- 1997: LSTM (Hochreiter & Schmidhuber)
- 2006: Deep Belief Networks (Hinton, G. et al.)
- 2012: AlexNet (Krizhevsky et al.)
**Training & Optimization:** methods for efficient learning and gradient-based optimization

- 1967: Stochastic Gradient Descent for NN (Amari, S.)
- 1970: Automatic Differentiation (Linnainmaa, S.)
- 1986: Backpropagation for NN (Rumelhart, Hinton & Williams)
- 1992: Weight Decay (Krogh & Hertz)
- 2009: Convolutional DBNs & Prob. Max Pooling (Lee, H. et al.)
- 2010: ReLU & Xavier Init (Nair & Hinton; Glorot & Bengio)
- 2012: Dropout (Hinton, G. et al.)
**Software & Datasets:** tools, platforms, and milestones that enabled practical deep learning

- 1997: Deep Blue (IBM)
- 1998: MNIST Dataset & LeNet-5 (LeCun, Y. et al.)
- 2002: Torch Framework (Torch Team)
- 2007: CUDA Platform (NVIDIA)
- 2009: ImageNet Dataset (Deng, J. et al.)
- 2011: Siri (Apple Inc.)
--- ## The Deep Learning Era
**Deep Architectures:** deep architectures and generative models transforming AI capabilities

- 2013: Variational Autoencoders (Kingma & Welling)
- 2014: Generative Adversarial Nets (Goodfellow et al.)
- 2015: ResNet & Diffusion (He et al.; Sohl-Dickstein et al.)
- 2016: Style Transfer & WaveNet (Gatys et al.; van den Oord et al.)
- 2017: Transformers (Vaswani et al.)
- 2021: ViT & CLIP (Dosovitskiy et al.; Radford et al.)
- 2022: Diffusion Transformer (Peebles & Xie)
- 2023: Mamba (Gu & Dao)
**Training & Optimization:** advanced learning techniques and representation learning breakthroughs

- 2013: Word2Vec (Mikolov, T. et al.)
- 2014: Attention Mechanism (Bahdanau, D. et al.)
- 2015: BatchNorm & Adam (Ioffe & Szegedy; Kingma & Ba)
- 2016: Layer Normalization (Ba, J. L. et al.)
- 2020: DDPM (Ho, J. et al.)
**Software & Applications:** practical deployment and mainstream adoption of deep learning systems

- 2016: AlphaGo (Silver, D. et al.)
- 2017: PyTorch (Paszke, A. et al.)
- 2018: GPT-1 & BERT (Radford et al.; Devlin et al.)
- 2020: GPT-3 (Brown, T. B. et al.)
- 2022: ChatGPT & Stable Diffusion (OpenAI; Stability AI)
- 2023: LLaMA (Touvron, H. et al.)
--- ## Recap: Latent Models
**Latent Variable Models:** Introduce hidden $\mathbf{z}$ to model complex distributions; marginal likelihood: $p(\mathbf{x}|\boldsymbol{\theta}) = \int p(\mathbf{x}, \mathbf{z}|\boldsymbol{\theta}) \, d\mathbf{z}$ **GMM (Discrete Latent):** $p(\mathbf{x}|\boldsymbol{\theta}) = \sum_{k=1}^K \pi_k \cdot \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ — tractable sum, but **log-of-sum** prevents closed-form MLE **EM Algorithm:** Iteratively optimize when direct MLE is intractable
- **E-Step:** compute responsibilities $\gamma_{ik} = p(z_i=k|\mathbf{x}_i, \boldsymbol{\theta}^{(t)})$ (soft cluster assignments)
- **M-Step:** update $\boldsymbol{\theta}$ via weighted MLE: $\boldsymbol{\mu}_k = \frac{\sum_i \gamma_{ik} \mathbf{x}_i}{\sum_i \gamma_{ik}}$, etc.
**Variational View:** - **ELBO:** $\log p(\mathbf{x}|\boldsymbol{\theta}) = \text{ELBO}(q, \boldsymbol{\theta}) + D_{\text{KL}}(q(z|\mathbf{x}) \,\|\, p(z|\mathbf{x}, \boldsymbol{\theta}))$ - **E-Step** = minimize KL → set $q = p(z|\mathbf{x}, \boldsymbol{\theta})$ (tighten bound) - **M-Step** = maximize Q-function $\mathbb{E}_q[\log p(\mathbf{x}, z|\boldsymbol{\theta})]$ (raise bound) **Key:** EM converges because log-likelihood is monotonically non-decreasing; K-means is EM with hard assignments
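The E-step/M-step loop above can be sketched end to end for a 1-D, two-component GMM. This is a minimal NumPy illustration, not an optimized implementation; the data, initialization, and iteration count are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 1-D data from two well-separated clusters (illustrative values).
x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 0.8, 200)])

# Initial parameters: mixing weights, means, variances.
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def log_gauss(x, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

ll_history = []
for _ in range(50):
    # E-step: responsibilities gamma_ik = p(z_i = k | x_i, theta).
    log_joint = np.log(pi) + log_gauss(x[:, None], mu, var)   # shape (n, K)
    log_marg = np.logaddexp.reduce(log_joint, axis=1)          # log p(x_i | theta)
    gamma = np.exp(log_joint - log_marg[:, None])
    ll_history.append(log_marg.sum())
    # M-step: weighted MLE updates.
    Nk = gamma.sum(axis=0)
    pi = Nk / len(x)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

# EM guarantees the log-likelihood never decreases across iterations.
assert all(b >= a - 1e-9 for a, b in zip(ll_history, ll_history[1:]))
```

Note the monotonicity check at the end mirrors the convergence guarantee stated above; the recovered means land near the true cluster centers.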
--- ## From GMM to Deep Latent Models
**GMM worked because:** | Component | GMM Choice | Why it's tractable | |:----------|:-----------|:-------------------| | Latent $z$ | Discrete: $z \in \{1, ..., K\}$ | Sum over $K$ values instead of integral | | Prior $p(z)$ | Categorical: $\pi_k$ | Simple mixing weights | | Decoder $p(\mathbf{x}\|z)$ | Gaussian: $\mathcal{N}(\mathbf{x}\|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ | Closed-form posterior |
**What if we want more expressive models?**
- **Continuous latent space:** $\mathbf{z} \in \mathbb{R}^d$ can represent smooth, continuous factors of variation
- **Neural network decoder:** $p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{z}), \sigma^2 \mathbf{I})$, where the mean is a neural network output and the variance is fixed
This is the **deep latent variable model** — but what breaks?
--- ## The Intractable Posterior Problem
**Recall the E-step goal:** Compute the posterior $p(\mathbf{z}|\mathbf{x}, \boldsymbol{\theta})$ Using Bayes' theorem:
$ p(\mathbf{z}|\mathbf{x}, \boldsymbol{\theta}) = \frac{p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) \cdot p(\mathbf{z})}{p(\mathbf{x}|\boldsymbol{\theta})} = \frac{p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) \cdot p(\mathbf{z})}{\int p(\mathbf{x}|\mathbf{z}', \boldsymbol{\theta}) \cdot p(\mathbf{z}') \, d\mathbf{z}'} $
**The denominator is the problem!** | Model | Decoder $p(\mathbf{x}\|\mathbf{z})$ | Marginal $p(\mathbf{x})$ | Posterior $p(\mathbf{z}\|\mathbf{x})$ | |:------|:-----------------------------------|:------------------------|:------------------------------------| | GMM | Gaussian | Finite sum | **Tractable** | | Deep LVM | Neural Network | Intractable integral | **Intractable** |
**Why neural networks break tractability:**
GMM has fixed $\boldsymbol{\mu}_k$ and discrete $z$ (finite sum). The deep latent variable model has $\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{z}) = \text{NeuralNet}(\mathbf{z})$ — a complex function over continuous latent space. The marginal $\int p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) \cdot p(\mathbf{z}) \, d\mathbf{z}$ has no closed form!
--- ## EM Breaks Down
**Recall the EM framework:**
$ \log p(\mathbf{x}|\boldsymbol{\theta}) = \text{ELBO}(q, \boldsymbol{\theta}) + D_{\text{KL}}\left( q(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}|\mathbf{x}, \boldsymbol{\theta}) \right) $
| Step | GMM | Deep Latent Model | |:-----|:----|:------------------| | **E-step** | Set $q = p(\mathbf{z}\|\mathbf{x}, \boldsymbol{\theta})$ exactly | **Cannot compute** $p(\mathbf{z}\|\mathbf{x}, \boldsymbol{\theta})$ | | **M-step** | Closed-form weighted MLE | Gradient descent on NN parameters | | **Bound** | Tight (KL = 0 after E-step) | **Always a gap** |
**The fundamental problem:** In GMM, we could set $q(\mathbf{z}|\mathbf{x}) = p(\mathbf{z}|\mathbf{x}, \boldsymbol{\theta})$ exactly, making the ELBO tight. With neural network decoders, we **cannot compute the true posterior** — so we cannot perform the E-step!
**We need an approximation strategy...**
--- ## Learn the Posterior Approximation
**Key insight:** If we can't compute $p(\mathbf{z}|\mathbf{x}, \boldsymbol{\theta})$, let's **learn to approximate it!**
**Introduce an encoder network** $q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})$ that approximates the intractable posterior:
$ q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi}) = \mathcal{N}\left(\mathbf{z} \,|\, \boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x}), \text{diag}(\boldsymbol{\sigma}^2_{\boldsymbol{\phi}}(\mathbf{x}))\right) $
- $\boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x})$: neural network outputting the mean - $\boldsymbol{\sigma}_{\boldsymbol{\phi}}(\mathbf{x})$: neural network outputting the standard deviation
**Why this works:** - A single encoder handles **all datapoints** — one forward pass per $\mathbf{x}$ - The encoder learns to map $\mathbf{x} \mapsto (\boldsymbol{\mu}, \boldsymbol{\sigma})$ that approximate the true posterior - Generalizes to unseen data (unlike per-datapoint optimization)
**This is called "amortized inference"** — the cost of learning the posterior is amortized across the entire dataset by sharing encoder parameters $\boldsymbol{\phi}$.
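A minimal NumPy sketch of such an encoder, with hypothetical sizes (4-D data, 2-D latent, one tanh hidden layer) and randomly initialized weights standing in for learned parameters $\boldsymbol{\phi}$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes for illustration: 4-D data, 2-D latent, 8 hidden units.
D, d, H = 4, 2, 8
# Randomly initialized encoder weights (phi); training would learn these.
W1, b1 = rng.normal(size=(H, D)) * 0.1, np.zeros(H)
W_mu, b_mu = rng.normal(size=(d, H)) * 0.1, np.zeros(d)
W_ls, b_ls = rng.normal(size=(d, H)) * 0.1, np.zeros(d)  # "log sigma" head

def encoder(x):
    """Amortized inference: one forward pass maps x to (mu, sigma)."""
    h = np.tanh(W1 @ x + b1)
    mu = W_mu @ h + b_mu
    sigma = np.exp(W_ls @ h + b_ls)  # exp of the log-sigma head keeps sigma > 0
    return mu, sigma

# The same parameters phi serve every datapoint -- no per-x optimization.
x1, x2 = rng.normal(size=D), rng.normal(size=D)
mu1, sigma1 = encoder(x1)
mu2, sigma2 = encoder(x2)
```

Outputting $\log \boldsymbol{\sigma}$ and exponentiating is a common way to guarantee positive standard deviations without constraining the network.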
--- ## VAE vs GMM: The Setup
**Recall the ELBO decomposition** (same as GMM!):
$ \log p(\mathbf{x}|\boldsymbol{\theta}) = \text{ELBO}(q, \boldsymbol{\theta}) + D_{\text{KL}}\left( q(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}|\mathbf{x}, \boldsymbol{\theta}) \right) $
| Component | GMM | VAE | |:----------|:----|:----| | **Latent** $\mathbf{z}$ | Discrete: $z \in \{1, ..., K\}$ | Continuous: $\mathbf{z} \in \mathbb{R}^d$ | | **Prior** $p(\mathbf{z})$ | Categorical: $\pi_k$ | Standard Gaussian: $\mathcal{N}(\mathbf{0}, \mathbf{I})$ | | **Decoder** $p(\mathbf{x}\|\mathbf{z}, \boldsymbol{\theta})$ | Gaussian: $\mathcal{N}(\mathbf{x}\|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ | Neural network with Gaussian output | | **Posterior approx.** $q$ | Exact: $q = p(z\|\mathbf{x}, \boldsymbol{\theta})$ | Learned encoder: $q(\mathbf{z}\|\mathbf{x}, \boldsymbol{\phi})$ | | **Parameters** | $\boldsymbol{\theta} = \{\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \pi_k\}$ | $\boldsymbol{\theta}$ (decoder NN), $\boldsymbol{\phi}$ (encoder NN) |
**Key difference:** In VAE, we optimize **both** $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$ jointly, since we cannot compute the true posterior!
--- ## Recap: Deriving the ELBO
**The fundamental challenge:** We want to maximize $\log p(\mathbf{x}|\boldsymbol{\theta})$, but the log-of-sum is intractable. We introduce a variational distribution $q(z|\mathbf{x})$ and use Jensen's inequality:
$ \begin{aligned} \log p_{X|\Theta}(\mathbf{x}|\boldsymbol{\theta}) &= \log \left( \sum_{k=1}^K p(\mathbf{x}, z=k|\boldsymbol{\theta}) \right) \\ &= \log \left( \sum_{k=1}^K q(z=k|\mathbf{x}) \cdot \frac{p(\mathbf{x}, z=k|\boldsymbol{\theta})}{q(z=k|\mathbf{x})} \right) \\ &= \log \left( \mathbb{E}_{z \sim q(z|\mathbf{x})} \left[ \frac{p(\mathbf{x}, z|\boldsymbol{\theta})}{q(z|\mathbf{x})} \right] \right) \\ &\geq \mathbb{E}_{z \sim q(z|\mathbf{x})} \left[ \log \frac{p(\mathbf{x}, z|\boldsymbol{\theta})}{q(z|\mathbf{x})} \right] = \text{ELBO}(q, \boldsymbol{\theta}) \end{aligned} $
--- ## The VAE ELBO
Starting from the general ELBO definition:
$ \text{ELBO}(\boldsymbol{\phi}, \boldsymbol{\theta}; \mathbf{x}) = \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})} \left[ \log \frac{p(\mathbf{x}, \mathbf{z} | \boldsymbol{\theta})}{q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})} \right] $
Using the chain rule $p(\mathbf{x}, \mathbf{z} | \boldsymbol{\theta}) = p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) \cdot p(\mathbf{z})$:
$ \begin{aligned} \text{ELBO} &= \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})} \left[ \log p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) + \log p(\mathbf{z}) - \log q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi}) \right] \\ &= \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})} \left[ \log p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) \right] + \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})} \left[ \log \frac{p(\mathbf{z})}{q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})} \right] \end{aligned} $
Recognizing the KL divergence, we get the **VAE objective**:
$ \text{ELBO}(\boldsymbol{\phi}, \boldsymbol{\theta}; \mathbf{x}) = \underbrace{\mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})} \left[ \log p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) \right]}_{\text{Reconstruction term}} - \underbrace{D_{\text{KL}}\left( q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi}) \,\|\, p(\mathbf{z}) \right)}_{\text{Regularization term}} $
--- ## Understanding the VAE Objective
$ \text{ELBO}(\boldsymbol{\phi}, \boldsymbol{\theta}; \mathbf{x}) = \underbrace{\mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})} \left[ \log p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) \right]}_{\text{Reconstruction}} - \underbrace{D_{\text{KL}}\left( q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi}) \,\|\, p(\mathbf{z}) \right)}_{\text{Regularization}} $
**Reconstruction term:** How well can the decoder reconstruct $\mathbf{x}$ from samples $\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})$? - Encourages the latent code to **preserve information** about $\mathbf{x}$ - Like the expected complete-data log-likelihood in EM's M-step
**Regularization term:** How close is the encoder's output to the prior? - Encourages the latent space to be **well-structured** (match $\mathcal{N}(\mathbf{0}, \mathbf{I})$) - Prevents the encoder from "cheating" by encoding each $\mathbf{x}$ as a delta function - No direct analogue in GMM — posterior is exact there!
**Trade-off:** Reconstruction wants $q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})$ to be specific to each $\mathbf{x}$; regularization wants $q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})$ close to the prior. The VAE balances these!
--- ## Comparison: GMM Q-Function vs VAE ELBO
**GMM (E-step sets $q = p(z|\mathbf{x}, \boldsymbol{\theta})$ exactly):**
$ Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma_{ik} \log p(\mathbf{x}_i, z_i=k | \boldsymbol{\theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma_{ik} \left[ \log \pi_k + \log \mathcal{N}(\mathbf{x}_i | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right] $
**VAE (optimize $\boldsymbol{\phi}$ and $\boldsymbol{\theta}$ jointly):**
$ \text{ELBO}(\boldsymbol{\phi}, \boldsymbol{\theta}) = \sum_{i=1}^{n} \left[ \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}_i, \boldsymbol{\phi})} \left[ \log p(\mathbf{x}_i|\mathbf{z}, \boldsymbol{\theta}) \right] - D_{\text{KL}}\left( q(\mathbf{z}|\mathbf{x}_i, \boldsymbol{\phi}) \,\|\, p(\mathbf{z}) \right) \right] $
--- ## Monte Carlo Estimation
**Problem:** How do we compute expectations when integrals have no closed form?
$ \mathbb{E}_{p(\mathbf{x})}[f(\mathbf{x})] = \int p(\mathbf{x}) f(\mathbf{x}) \, d\mathbf{x} \quad \text{(often intractable)} $
**Monte Carlo estimation:** Approximate the expectation using samples!
$ \mathbb{E}_{p(\mathbf{x})}[f(\mathbf{x})] \approx \frac{1}{L} \sum_{l=1}^{L} f(\mathbf{x}^{(l)}), \quad \text{where } \mathbf{x}^{(l)} \sim p(\mathbf{x}) $
**Why this works:** By the Law of Large Numbers, the sample mean converges to the true expectation:
$ \frac{1}{L} \sum_{l=1}^{L} f(\mathbf{x}^{(l)}) \xrightarrow{L \to \infty} \mathbb{E}_{p(\mathbf{x})}[f(\mathbf{x})] $
**Key properties:** - **Unbiased:** $\mathbb{E}\left[\frac{1}{L}\sum_l f(\mathbf{x}^{(l)})\right] = \mathbb{E}_{p}[f(\mathbf{x})]$ - **Variance:** $\text{Var} \propto \frac{1}{L}$ — more samples = lower variance - **Works for any** $f$ as long as we can sample from $p(\mathbf{x})$
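A quick numerical check of the Law of Large Numbers at work, estimating $\mathbb{E}[x^2] = 1$ for $x \sim \mathcal{N}(0, 1)$ (sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo estimate of E_p[f(x)] using L samples from p.
def mc_estimate(f, sampler, L):
    return np.mean(f(sampler(L)))

f = lambda x: x ** 2                   # true E[x^2] = 1 under N(0, 1)
sampler = lambda L: rng.normal(size=L)

for L in (10, 1000, 100_000):
    est = mc_estimate(f, sampler, L)
    print(L, est)  # estimates concentrate around 1 as L grows
```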
--- ## The Optimization Challenge
**Goal:** Maximize the ELBO with respect to both $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$
$ \boldsymbol{\phi}^*, \boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\phi}, \boldsymbol{\theta}} \sum_{i=1}^{n} \text{ELBO}(\boldsymbol{\phi}, \boldsymbol{\theta}; \mathbf{x}_i) $
**Problem: The reconstruction term involves an expectation**
$ \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})} \left[ \log p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) \right] = \int q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi}) \log p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) \, d\mathbf{z} $
This integral has no closed form when $p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta})$ is a neural network!
**Solution:** Monte Carlo estimation — sample $\mathbf{z}^{(l)} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})$:
$ \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})} \left[ \log p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) \right] \approx \frac{1}{L} \sum_{l=1}^{L} \log p(\mathbf{x}|\mathbf{z}^{(l)}, \boldsymbol{\theta}) $
In practice, $L = 1$ works well during training!
--- ## Gradient w.r.t. Decoder Parameters $\boldsymbol{\theta}$
**Good news:** The gradient w.r.t. $\boldsymbol{\theta}$ is straightforward!
$ \nabla_{\boldsymbol{\theta}} \frac{1}{L} \sum_{l=1}^{L} \log p(\mathbf{x}|\mathbf{z}^{(l)}, \boldsymbol{\theta}) = \frac{1}{L} \sum_{l=1}^{L} \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}|\mathbf{z}^{(l)}, \boldsymbol{\theta}) $
**Why is this easy?** - The samples $\mathbf{z}^{(l)}$ come from the **encoder** (parameters $\boldsymbol{\phi}$) - From the decoder's perspective, $\mathbf{z}^{(l)}$ is just a **fixed input** — like any other input to a neural network - No sampling w.r.t. $\boldsymbol{\theta}$ means standard backpropagation works!
**This is just like training any neural network:** $\mathbf{z}^{(l)} \xrightarrow{\text{Decoder}_{\boldsymbol{\theta}}} \hat{\mathbf{x}} \xrightarrow{\text{loss}} \log p(\mathbf{x}|\mathbf{z}^{(l)}, \boldsymbol{\theta})$ Backprop through the decoder as usual!
--- ## Gradient w.r.t. Encoder Parameters $\boldsymbol{\phi}$
**Problem:** We need gradients w.r.t. $\boldsymbol{\phi}$, but we sample from $q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})$!
$ \nabla_{\boldsymbol{\phi}} \frac{1}{L} \sum_{l=1}^{L} \log p(\mathbf{x}|\mathbf{z}^{(l)}, \boldsymbol{\theta}), \quad \text{where } \mathbf{z}^{(l)} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi}) $
**The issue:** The samples $\mathbf{z}^{(l)}$ depend on $\boldsymbol{\phi}$ through stochastic sampling! - Sampling $\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})$ is a **stochastic operation** - Gradients don't flow through random sampling! - We cannot backpropagate through the sampling step
**Compare the two gradients:** | Parameter | Gradient | Difficulty | |:----------|:---------|:-----------| | $\boldsymbol{\theta}$ (decoder) | $\nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}\|\mathbf{z}, \boldsymbol{\theta})$ | Standard backprop — $\mathbf{z}$ is just an input | | $\boldsymbol{\phi}$ (encoder) | $\nabla_{\boldsymbol{\phi}} \log p(\mathbf{x}\|\mathbf{z}, \boldsymbol{\theta})$ | **Problematic** — $\mathbf{z}$ depends on $\boldsymbol{\phi}$ via sampling |
--- ## The Reparameterization Trick
**Key insight:** Rewrite the sampling process to separate stochasticity from parameters!
**Before (non-differentiable):**
$\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi}) = \mathcal{N}(\boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x}), \text{diag}(\boldsymbol{\sigma}^2_{\boldsymbol{\phi}}(\mathbf{x})))$
**After (differentiable):**
$ \mathbf{z} = \boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x}) + \boldsymbol{\sigma}_{\boldsymbol{\phi}}(\mathbf{x}) \odot \boldsymbol{\epsilon}, \quad \text{where } \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) $
- $\boldsymbol{\epsilon}$ is sampled from a **fixed** distribution (independent of $\boldsymbol{\phi}$)
- $\mathbf{z}$ is now a **deterministic function** of $\boldsymbol{\phi}$ (given $\boldsymbol{\epsilon}$)
- Gradients flow through $\boldsymbol{\mu}_{\boldsymbol{\phi}}$ and $\boldsymbol{\sigma}_{\boldsymbol{\phi}}$ via standard backpropagation!
**The expectation becomes:**
$ \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})} \left[ f(\mathbf{z}) \right] = \mathbb{E}_{\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \left[ f(\boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x}) + \boldsymbol{\sigma}_{\boldsymbol{\phi}}(\mathbf{x}) \odot \boldsymbol{\epsilon}) \right] $
Now $\nabla_{\boldsymbol{\phi}}$ can go inside the expectation!
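A sanity check that the reparameterized samples really follow $\mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2))$; the $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ values are illustrative stand-ins for encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([0.5, -1.0])     # illustrative encoder mean
sigma = np.array([0.3, 0.7])   # illustrative encoder std

# Reparameterization: z is a deterministic function of (mu, sigma) given eps,
# so gradients can flow through mu and sigma; only eps is random.
eps = rng.standard_normal((100_000, 2))
z = mu + sigma * eps

print(z.mean(axis=0))  # close to mu
print(z.std(axis=0))   # close to sigma
```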
--- ## Reparameterization: The Math
**With reparameterization, we can compute gradients of the MC estimate:**
$ \nabla_{\boldsymbol{\phi}} \frac{1}{L} \sum_{l=1}^{L} f(\mathbf{z}^{(l)}) = \frac{1}{L} \sum_{l=1}^{L} \nabla_{\boldsymbol{\phi}} f(\boldsymbol{\mu}_{\boldsymbol{\phi}} + \boldsymbol{\sigma}_{\boldsymbol{\phi}} \odot \boldsymbol{\epsilon}^{(l)}) $
**Applying the chain rule:**
$ \nabla_{\boldsymbol{\phi}} f(\mathbf{z}) = \nabla_\mathbf{z} f(\mathbf{z}) \cdot \nabla_{\boldsymbol{\phi}} \mathbf{z} = \nabla_\mathbf{z} f(\mathbf{z}) \cdot \left( \nabla_{\boldsymbol{\phi}} \boldsymbol{\mu}_{\boldsymbol{\phi}} + \boldsymbol{\epsilon} \odot \nabla_{\boldsymbol{\phi}} \boldsymbol{\sigma}_{\boldsymbol{\phi}} \right) $
**In practice (with $L$ samples):**
$ \nabla_{\boldsymbol{\phi}} \frac{1}{L} \sum_{l=1}^{L} f(\mathbf{z}^{(l)}) = \frac{1}{L} \sum_{l=1}^{L} \nabla_\mathbf{z} f(\mathbf{z}^{(l)}) \cdot \left( \nabla_{\boldsymbol{\phi}} \boldsymbol{\mu}_{\boldsymbol{\phi}} + \boldsymbol{\epsilon}^{(l)} \odot \nabla_{\boldsymbol{\phi}} \boldsymbol{\sigma}_{\boldsymbol{\phi}} \right) $
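The chain-rule formula can be verified numerically in a scalar toy case with $f(z) = z^2$, comparing it against finite differences computed with the same fixed noise draws (all values here are arbitrary for the check):

```python
import numpy as np

rng = np.random.default_rng(0)

# Scalar toy case: f(z) = z^2, q = N(mu, sigma^2), parameters phi = (mu, sigma).
f = lambda z: z ** 2
df = lambda z: 2 * z

mu, sigma = 0.7, 0.4
eps = rng.standard_normal(1000)  # fixed noise, shared by both gradient estimates
z = mu + sigma * eps

# Chain rule: grad_mu = f'(z) * d z/d mu = f'(z) * 1
#             grad_sigma = f'(z) * d z/d sigma = f'(z) * eps
g_mu = np.mean(df(z) * 1.0)
g_sigma = np.mean(df(z) * eps)

# Finite-difference check using the SAME eps (common random numbers).
h = 1e-5
fd_mu = (np.mean(f(mu + h + sigma * eps)) - np.mean(f(mu - h + sigma * eps))) / (2 * h)
fd_sigma = (np.mean(f(mu + (sigma + h) * eps)) - np.mean(f(mu + (sigma - h) * eps))) / (2 * h)

assert abs(g_mu - fd_mu) < 1e-4 and abs(g_sigma - fd_sigma) < 1e-4
```

Holding $\boldsymbol{\epsilon}$ fixed across both estimates is essential; with fresh noise the finite-difference estimate would be swamped by sampling variance.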
--- ## Applying to the VAE Reconstruction Term
**Now let's substitute** $f(\mathbf{z}) = \log p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta})$ — the decoder log-likelihood:
$ \nabla_{\boldsymbol{\phi}} \frac{1}{L} \sum_{l=1}^{L} \log p(\mathbf{x}|\mathbf{z}^{(l)}, \boldsymbol{\theta}) = \frac{1}{L} \sum_{l=1}^{L} \nabla_{\boldsymbol{\phi}} \log p(\mathbf{x}|\boldsymbol{\mu}_{\boldsymbol{\phi}} + \boldsymbol{\sigma}_{\boldsymbol{\phi}} \odot \boldsymbol{\epsilon}^{(l)}, \boldsymbol{\theta}) $
**Expanding with the chain rule:**
$ = \frac{1}{L} \sum_{l=1}^{L} \underbrace{\nabla_\mathbf{z} \log p(\mathbf{x}|\mathbf{z}^{(l)}, \boldsymbol{\theta})}_{\text{decoder gradient}} \cdot \left( \nabla_{\boldsymbol{\phi}} \boldsymbol{\mu}_{\boldsymbol{\phi}} + \boldsymbol{\epsilon}^{(l)} \odot \nabla_{\boldsymbol{\phi}} \boldsymbol{\sigma}_{\boldsymbol{\phi}} \right) $
**Key insight:** The gradient flows from decoder → through $\mathbf{z}$ → to encoder parameters $\boldsymbol{\phi}$ - $\nabla_\mathbf{z} \log p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta})$: how changing $\mathbf{z}$ affects reconstruction - $\nabla_{\boldsymbol{\phi}} \boldsymbol{\mu}_{\boldsymbol{\phi}}$: how encoder parameters affect the mean - $\boldsymbol{\epsilon}^{(l)} \odot \nabla_{\boldsymbol{\phi}} \boldsymbol{\sigma}_{\boldsymbol{\phi}}$: how encoder parameters affect variance (scaled by noise)
**In practice ($L=1$):** Sample one $\boldsymbol{\epsilon}$, compute $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$, backprop through decoder and encoder!
--- ## Recap: The Full VAE Objective
**We're optimizing the ELBO:**
$ \text{ELBO}(\boldsymbol{\phi}, \boldsymbol{\theta}; \mathbf{x}) = \underbrace{\mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})} \left[ \log p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) \right]}_{\text{Reconstruction term}} - \underbrace{D_{\text{KL}}\left( q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi}) \,\|\, p(\mathbf{z}) \right)}_{\text{Regularization term}} $
**What we've solved — the reconstruction term:** | Challenge | Solution | |:----------|:---------| | Intractable expectation | Monte Carlo: $\frac{1}{L} \sum_{l=1}^{L} \log p(\mathbf{x}\|\mathbf{z}^{(l)}, \boldsymbol{\theta})$ | | Gradient w.r.t. $\boldsymbol{\theta}$ | Standard backprop (z is just an input) | | Gradient w.r.t. $\boldsymbol{\phi}$ | Reparameterization trick |
**What's left — the KL term:** How do we compute $D_{\text{KL}}\left( q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi}) \,\|\, p(\mathbf{z}) \right)$?
--- ## The KL Term: Closed Form
**Good news:** The KL divergence between two Gaussians has a closed form! For $q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi}) = \mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2))$ and $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$:
$ \begin{aligned} D_{\text{KL}}(q \| p) &= \mathbb{E}_q[\log q(\mathbf{z})] - \mathbb{E}_q[\log p(\mathbf{z})] \\[0.5em] &= \mathbb{E}_q\left[-\frac{1}{2}\sum_{j=1}^d \left(\log(2\pi\sigma_j^2) + \frac{(z_j - \mu_j)^2}{\sigma_j^2}\right)\right] - \mathbb{E}_q\left[-\frac{1}{2}\sum_{j=1}^d \left(\log(2\pi) + z_j^2\right)\right] \\[0.5em] &= -\frac{1}{2}\sum_j \left(\log \sigma_j^2 + 1\right) + \frac{1}{2}\sum_j \mathbb{E}_q[z_j^2] \\[0.5em] &= -\frac{1}{2}\sum_j \left(\log \sigma_j^2 + 1\right) + \frac{1}{2}\sum_j \left(\mu_j^2 + \sigma_j^2\right) \\[0.5em] &= \frac{1}{2} \sum_{j=1}^{d} \left( \sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2 \right) \end{aligned} $
where $j \in \{1, \ldots, d\}$ indexes each dimension of the latent vector $\mathbf{z} \in \mathbb{R}^d$.
**No Monte Carlo needed for this term!** Gradients w.r.t. $\boldsymbol{\phi}$ are straightforward.
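The closed form can be cross-checked against a Monte Carlo estimate of $\mathbb{E}_q[\log q(\mathbf{z}) - \log p(\mathbf{z})]$; the $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -0.2])
sigma = np.array([0.8, 1.3])

# Closed form: 1/2 * sum_j (sigma_j^2 + mu_j^2 - 1 - log sigma_j^2)
kl_closed = 0.5 * np.sum(sigma**2 + mu**2 - 1 - np.log(sigma**2))

# Monte Carlo check: E_q[log q(z) - log p(z)] with z ~ q (reparameterized).
z = mu + sigma * rng.standard_normal((200_000, 2))
log_q = -0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (z - mu)**2 / sigma**2, axis=1)
log_p = -0.5 * np.sum(np.log(2 * np.pi) + z**2, axis=1)
kl_mc = np.mean(log_q - log_p)

print(kl_closed, kl_mc)  # the two estimates agree
```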
--- ## The Complete VAE Loss
**Putting it all together:** For a single datapoint $\mathbf{x}$:
$ \text{ELBO}(\boldsymbol{\phi}, \boldsymbol{\theta}; \mathbf{x}) = \underbrace{\frac{1}{L} \sum_{l=1}^{L} \log p(\mathbf{x}|\mathbf{z}^{(l)}, \boldsymbol{\theta})}_{\text{Monte Carlo estimate}} - \underbrace{\frac{1}{2} \sum_{j=1}^{d} \left( \sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2 \right)}_{\text{Closed-form KL}} $
where $\mathbf{z}^{(l)} = \boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x}) + \boldsymbol{\sigma}_{\boldsymbol{\phi}}(\mathbf{x}) \odot \boldsymbol{\epsilon}^{(l)}$, $\boldsymbol{\epsilon}^{(l)} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
**But what is** $\log p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta})$**?** We need to specify the decoder's output distribution! **Common choice:** Gaussian with fixed variance $\sigma^2$
$ p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) = \mathcal{N}(\mathbf{x} \,|\, \boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{z}), \sigma^2 \mathbf{I}) $
The neural network $\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{z})$ outputs the **mean** of this Gaussian — the reconstructed $\hat{\mathbf{x}}$.
--- ## Decoder Likelihood: From Gaussian to MSE
**Decoder output distribution:** $p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) = \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{z}), \sigma^2 \mathbf{I})$
**Taking the log of the Gaussian PDF:**
$ \begin{aligned} \log p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) &= \log \left( \frac{1}{(2\pi\sigma^2)^{D/2}} \exp\left( -\frac{1}{2\sigma^2} \|\mathbf{x} - \boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{z})\|^2 \right) \right) \\[0.5em] &= -\frac{D}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \|\mathbf{x} - \boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{z})\|^2 \\[0.5em] &= -\frac{1}{2\sigma^2} \|\mathbf{x} - \boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{z})\|^2 + \text{const} \end{aligned} $
where $D$ is the data dimensionality (e.g., number of pixels).
**Key insight:** Since $\sigma^2$ is a fixed constant:
$\max_{\boldsymbol{\theta}} \log p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) \quad \Longleftrightarrow \quad \min_{\boldsymbol{\theta}} \|\mathbf{x} - \boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{z})\|^2$
Maximizing Gaussian log-likelihood is equivalent to minimizing **mean squared error (MSE)**!
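The equivalence is easy to verify numerically: across candidate reconstructions, the log-likelihood ranking is exactly the reverse of the MSE ranking (data and candidates below are arbitrary):

```python
import numpy as np

sigma2 = 0.25  # fixed decoder variance (illustrative)
x = np.array([1.0, 2.0, 3.0])
D = len(x)

def log_lik(x_hat):
    # Gaussian log-likelihood with fixed isotropic variance sigma2.
    return -0.5 * D * np.log(2 * np.pi * sigma2) - 0.5 / sigma2 * np.sum((x - x_hat) ** 2)

def sq_err(x_hat):
    return np.sum((x - x_hat) ** 2)

# Higher likelihood <=> lower squared error, candidate by candidate.
cands = [x + 0.5, x - 0.1, x.copy()]
lls = [log_lik(c) for c in cands]
errs = [sq_err(c) for c in cands]
assert np.argmax(lls) == np.argmin(errs)
```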
--- ## The Practical VAE Loss
**Substituting the Gaussian decoder into the ELBO** (up to an additive constant):
$ \text{ELBO} = -\frac{1}{2\sigma^2} \|\mathbf{x} - \hat{\mathbf{x}}\|^2 - D_{\text{KL}}(q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi}) \| p(\mathbf{z})) + \text{const} $
**Converting to a loss (negate and drop constants):**
$ \mathcal{L}_{\text{VAE}} = \underbrace{\|\mathbf{x} - \hat{\mathbf{x}}\|^2}_{\text{Reconstruction loss (MSE)}} + \underbrace{\beta \cdot D_{\text{KL}}(q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi}) \| p(\mathbf{z}))}_{\text{KL regularization}} $
where $\hat{\mathbf{x}} = \boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{z})$ is the decoder output.
| $\beta$ | Effect | | :--------- | :-------- | | $\beta = 1$ | Standard VAE (original formulation) | | $\beta < 1$ | Better reconstructions, less regularized latent space | | $\beta > 1$ | **$\beta$-VAE**: stronger regularization, more disentangled latents |
**Note:** The relationship $\beta = 2\sigma^2$ shows that $\beta$ implicitly controls the assumed decoder variance — larger $\beta$ corresponds to assuming a noisier decoder!
--- ## VAE Training Algorithm
```
Initialize: encoder parameters φ, decoder parameters θ

For each epoch:
  For each minibatch {x₁, ..., xₘ}:

    # Forward pass (encoder)
    For each xᵢ:
      (μᵢ, σᵢ) = Encoder_φ(xᵢ)

    # Reparameterization (sample latent codes)
    For each xᵢ:
      εᵢ ~ N(0, I)
      zᵢ = μᵢ + σᵢ ⊙ εᵢ

    # Forward pass (decoder)
    For each zᵢ:
      x̂ᵢ = Decoder_θ(zᵢ)

    # Compute loss
    L_recon = (1/m) Σᵢ ||xᵢ - x̂ᵢ||²
    L_KL    = (1/m) Σᵢ Σⱼ (σᵢⱼ² + μᵢⱼ² - 1 - log σᵢⱼ²) / 2
    L       = L_recon + β · L_KL

    # Backward pass & update
    Compute ∇_θ L, ∇_φ L via backpropagation
    Update θ, φ using optimizer (e.g., Adam)
```
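The forward pass and loss of one minibatch step can be sketched in NumPy. The encoder and decoder here are hypothetical linear maps with random weights, standing in for the neural networks, just to exercise the loss computation (no backward pass is shown):

```python
import numpy as np

rng = np.random.default_rng(0)
m, D, d = 8, 6, 2  # minibatch size, data dim, latent dim (illustrative)
beta = 1.0

# Stand-ins for Encoder_phi and Decoder_theta: simple linear maps with
# hypothetical randomly initialized weights.
W_mu, W_ls = rng.normal(size=(d, D)) * 0.1, rng.normal(size=(d, D)) * 0.1
W_dec = rng.normal(size=(D, d)) * 0.1

X = rng.normal(size=(m, D))  # one minibatch of toy data

# Forward pass (encoder): (mu_i, sigma_i) = Encoder_phi(x_i)
mu = X @ W_mu.T
sigma = np.exp(X @ W_ls.T)

# Reparameterization: z_i = mu_i + sigma_i * eps_i
eps = rng.standard_normal((m, d))
Z = mu + sigma * eps

# Forward pass (decoder): x_hat_i = Decoder_theta(z_i)
X_hat = Z @ W_dec.T

# Loss terms exactly as in the pseudocode.
L_recon = np.mean(np.sum((X - X_hat) ** 2, axis=1))
L_KL = np.mean(0.5 * np.sum(sigma**2 + mu**2 - 1 - np.log(sigma**2), axis=1))
L = L_recon + beta * L_KL
```

In a real implementation the gradient step would be handled by an autodiff framework (e.g. PyTorch), with the reparameterized `Z` keeping the whole graph differentiable.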
--- ## GMM vs VAE: Optimization Comparison
| Aspect | GMM (EM) | VAE |
|:-------|:---------|:----|
| E-step / Encoder | Compute $\gamma_{ik} = p(z_i=k\|\mathbf{x}_i, \boldsymbol{\theta})$ exactly | Forward pass: $(\boldsymbol{\mu}, \boldsymbol{\sigma}) = \text{Encoder}_{\boldsymbol{\phi}}(\mathbf{x})$ |
| Posterior | Exact (tractable) | Approximate (learned) |
| Sampling | Weighted sum over $K$ components | Monte Carlo: $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$ |
| M-step / Decoder | Closed-form: $\boldsymbol{\mu}_k = \frac{\sum_i \gamma_{ik} \mathbf{x}_i}{\sum_i \gamma_{ik}}$ | Gradient descent on NN |
| Joint optimization | Alternating (E then M) | Simultaneous (SGD on $\boldsymbol{\theta}, \boldsymbol{\phi}$) |
| Convergence | Monotonic increase in likelihood | ELBO increases (with noise from SGD) |
| KL gap | Zero (ELBO is tight) | Non-zero (approximation gap) |
**Key insight:** VAE trades exactness for expressiveness: - GMM: Exact inference, limited model (Gaussian components) - VAE: Approximate inference, powerful model (neural networks)
--- ## Summary: Optimizing the VAE
**The VAE objective** (maximize ELBO):
$ \text{ELBO}(\boldsymbol{\phi}, \boldsymbol{\theta}; \mathbf{x}) = \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})} \left[ \log p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) \right] - D_{\text{KL}}\left( q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi}) \,\|\, p(\mathbf{z}) \right) $
**Three key ingredients:** | Challenge | Solution | |:----------|:---------| | Intractable posterior $p(\mathbf{z}\|\mathbf{x})$ | Learn encoder $q(\mathbf{z}\|\mathbf{x}, \boldsymbol{\phi})$ | | Intractable expectation | Monte Carlo sampling ($L=1$ suffices) | | Non-differentiable sampling | Reparameterization trick |
--- ## VAE Architecture Overview
--- # Questions?