**Key insight:** Let's rewrite the ELBO using the chain rule $p(\mathbf{x}, z|\boldsymbol{\theta}) = p(z|\mathbf{x}, \boldsymbol{\theta}) \cdot p(\mathbf{x}|\boldsymbol{\theta})$:
$
\begin{aligned}
\text{ELBO}(q, \boldsymbol{\theta}) &= \mathbb{E}_{z \sim q(z|\mathbf{x})} \left[ \log p(\mathbf{x}, z|\boldsymbol{\theta}) \right] - \mathbb{E}_{z \sim q(z|\mathbf{x})} \left[ \log q(z|\mathbf{x}) \right] \\
&= \mathbb{E}_{z \sim q(z|\mathbf{x})} \left[ \log p(z|\mathbf{x}, \boldsymbol{\theta}) \right] + \mathbb{E}_{z \sim q(z|\mathbf{x})} \left[ \log p(\mathbf{x}|\boldsymbol{\theta}) \right] - \mathbb{E}_{z \sim q(z|\mathbf{x})} \left[ \log q(z|\mathbf{x}) \right] \\
&= \log p(\mathbf{x}|\boldsymbol{\theta}) - \underbrace{\mathbb{E}_{z \sim q(z|\mathbf{x})} \left[ \log \frac{q(z|\mathbf{x})}{p(z|\mathbf{x}, \boldsymbol{\theta})} \right]}_{D_{\text{KL}}\left( q(z|\mathbf{x}) \,\|\, p(z|\mathbf{x}, \boldsymbol{\theta}) \right)}
\end{aligned}
$
The **KL divergence** measures how different one probability distribution is from another:
$
D_{\text{KL}}(q \,\|\, p) = \mathbb{E}_{z \sim q(z)} \left[ \log \frac{q(z)}{p(z)} \right] = \sum_z q(z) \log \frac{q(z)}{p(z)}
$
**Key properties:**
| Property | Meaning |
|:---------|:--------|
| $D_{\text{KL}}(q \,\|\, p) \geq 0$ | Always non-negative |
| $D_{\text{KL}}(q \,\|\, p) = 0 \iff q = p$ | Zero only when distributions are identical |
| $D_{\text{KL}}(q \,\|\, p) \neq D_{\text{KL}}(p \,\|\, q)$ | **Not symmetric** — order matters! |
**Intuition:** KL divergence measures the "extra bits" needed to encode samples from $q$ using a code optimized for $p$.
- Large $D_{\text{KL}}$: $q$ and $p$ are very different
- Small $D_{\text{KL}}$: $q$ approximates $p$ well
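These properties are easy to check numerically (a minimal numpy sketch; the distributions `q` and `p` are made-up examples):

```python
import numpy as np

def kl_divergence(q, p):
    """D_KL(q || p) for discrete distributions over the same support, in nats."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    mask = q > 0                               # terms with q(z) = 0 contribute nothing
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])

kl_divergence(q, p)                            # non-negative, small since q ≈ p
kl_divergence(q, q)                            # exactly 0 for identical distributions
kl_divergence(q, p), kl_divergence(p, q)       # the two orders differ: not symmetric
```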
Rearranging gives the **fundamental decomposition**:
$
\log p(\mathbf{x}|\boldsymbol{\theta}) = \text{ELBO}(q, \boldsymbol{\theta}; \mathbf{x}) + D_{\text{KL}}\left( q(z|\mathbf{x}) \,\|\, p(z|\mathbf{x}, \boldsymbol{\theta}) \right)
$
**This tells us:**
| Property | Explanation |
|:---------|:------------|
| ELBO is a **lower bound** | Since $D_{\text{KL}} \geq 0$ always |
| The **gap** is the KL divergence | From $q$ to the true posterior $p(z\|\mathbf{x}, \boldsymbol{\theta})$ |
| **Tight bound** when $q = p(z\|\mathbf{x}, \boldsymbol{\theta})$ | Setting $q$ to the true posterior makes $D_{\text{KL}} = 0$ |
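The decomposition can be verified numerically on a toy discrete model (a sketch with made-up numbers; $z$ takes two values and $\mathbf{x}$ is a single fixed observation):

```python
import numpy as np

# Toy model: z ∈ {0, 1}, one fixed observation x, prior p(z), likelihood p(x|z).
prior = np.array([0.6, 0.4])            # p(z)
lik   = np.array([0.2, 0.7])            # p(x|z) evaluated at the observed x
joint = prior * lik                     # p(x, z) by the chain rule
log_px = np.log(joint.sum())            # log p(x), marginalizing out z

q = np.array([0.5, 0.5])                # an arbitrary valid q(z|x)
elbo = np.sum(q * (np.log(joint) - np.log(q)))
posterior = joint / joint.sum()         # true posterior p(z|x)
kl = np.sum(q * np.log(q / posterior))

# The fundamental decomposition: log p(x) = ELBO + KL
assert np.isclose(log_px, elbo + kl)

# With q set to the true posterior, KL = 0 and the bound is tight:
elbo_tight = np.sum(posterior * (np.log(joint) - np.log(posterior)))
assert np.isclose(elbo_tight, log_px)
```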
At iteration $t$, we have current parameters $\boldsymbol{\theta}^{(t)}$. From the decomposition:
$
\log p(\mathbf{x}|\boldsymbol{\theta}^{(t)}) = \text{ELBO}(q, \boldsymbol{\theta}^{(t)}; \mathbf{x}) + D_{\text{KL}}\left( q(z|\mathbf{x}) \,\|\, p(z|\mathbf{x}, \boldsymbol{\theta}^{(t)}) \right)
$
First, we **maximize the ELBO w.r.t. $q$** to make the bound as tight as possible:
$
\begin{aligned}
q^{(t+1)}(z|\mathbf{x}) &= \arg\max_{q} \text{ELBO}(q, \boldsymbol{\theta}^{(t)}; \mathbf{x})\\
&= \arg\max_{q} \left( \log p(\mathbf{x}|\boldsymbol{\theta}^{(t)}) - D_{\text{KL}}\left( q(z|\mathbf{x}) \,\|\, p(z|\mathbf{x}, \boldsymbol{\theta}^{(t)}) \right) \right) \\
\end{aligned}
$
---
## The E-Step: Making the Bound Tight
Since $\log p(\mathbf{x}|\boldsymbol{\theta}^{(t)})$ is constant w.r.t. $q$, this is equivalent to minimizing the KL divergence:
$
q^{(t+1)}(z|\mathbf{x}) = \arg\min_{q} D_{\text{KL}}\left( q(z|\mathbf{x}) \,\|\, p(z|\mathbf{x}, \boldsymbol{\theta}^{(t)}) \right)
$
The minimum is achieved when the two distributions are equal. In the case of GMMs, this gives the familiar E-step update:
$
\begin{aligned}
q^{(t+1)}(z=k|\mathbf{x}_i) &= p(z=k|\mathbf{x}_i, \boldsymbol{\theta}^{(t)})\\
&= \frac{p(\mathbf{x}_i|z=k, \boldsymbol{\theta}^{(t)}) \cdot p(z=k|\boldsymbol{\theta}^{(t)})}{p(\mathbf{x}_i|\boldsymbol{\theta}^{(t)})} \\
&= \frac{\mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)})\cdot \pi_k^{(t)} }{\sum_{j=1}^K \pi_j^{(t)} \cdot \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_j^{(t)}, \boldsymbol{\Sigma}_j^{(t)})}
\end{aligned}
$
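In code, the E-step is one normalized Bayes-rule computation per point (a numpy sketch; `e_step` is a name chosen here, and the normalization is done in log space for numerical stability):

```python
import numpy as np

def e_step(X, pis, mus, covs):
    """Responsibilities gamma[i, k] = p(z_i = k | x_i, theta) for a GMM."""
    n, d = X.shape
    K = len(pis)
    log_w = np.empty((n, K))
    for k in range(K):
        diff = X - mus[k]
        cov_inv = np.linalg.inv(covs[k])
        _, logdet = np.linalg.slogdet(covs[k])
        quad = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)   # Mahalanobis terms
        log_w[:, k] = (np.log(pis[k])
                       - 0.5 * (d * np.log(2 * np.pi) + logdet + quad))
    # normalize in log space: subtract the row max before exponentiating
    log_w -= log_w.max(axis=1, keepdims=True)
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)
```

Each row of the returned matrix is a distribution over the $K$ components, i.e. it sums to 1.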
After the E-step, the bound is tight:
$
\log p(\mathbf{x}|\boldsymbol{\theta}^{(t)}) = \text{ELBO}(q^{(t+1)}, \boldsymbol{\theta}^{(t)}; \mathbf{x})
$
---
## The M-Step: Raising the Bound
Now we fix $q(z|\mathbf{x}) = q^{(t+1)}(z|\mathbf{x})$ and maximize the ELBO with respect to $\boldsymbol{\theta}$:
$
\boldsymbol{\theta}^{(t+1)} = \arg\max_{\boldsymbol{\theta}} \text{ELBO}(q^{(t+1)}, \boldsymbol{\theta}; \mathbf{x})
$
Recall the ELBO definition:
$
\text{ELBO}(q, \boldsymbol{\theta}; \mathbf{x}) = \mathbb{E}_{z \sim q(z|\mathbf{x})} \left[ \log p(\mathbf{x}, z|\boldsymbol{\theta}) \right] - \mathbb{E}_{z \sim q(z|\mathbf{x})} \left[ \log q(z|\mathbf{x}) \right]
$
Since $q$ is **fixed**, the entropy term $-\mathbb{E}_{z \sim q(z|\mathbf{x})}[\log q(z|\mathbf{x})]$ is constant w.r.t. $\boldsymbol{\theta}$:
$
\boldsymbol{\theta}^{(t+1)} = \arg\max_{\boldsymbol{\theta}} \mathbb{E}_{z \sim q^{(t+1)}(z|\mathbf{x})} \left[ \log p(\mathbf{x}, z|\boldsymbol{\theta}) \right]
$
---
## The M-Step: The Q-Function
The quantity we maximize is called the **Q-function** (expected complete-data log-likelihood):
$
Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{(t)}) = \sum_{i=1}^{n} \mathbb{E}_{z_i \sim q^{(t+1)}(z_i|\mathbf{x}_i)} \left[ \log p(\mathbf{x}_i, z_i|\boldsymbol{\theta}) \right] = \sum_{i=1}^{n} \sum_{z_i} q^{(t+1)}(z_i|\mathbf{x}_i) \log p(\mathbf{x}_i, z_i|\boldsymbol{\theta})
$
*Notation: $Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{(t)})$ means "Q as a function of $\boldsymbol{\theta}$, where $\boldsymbol{\theta}^{(t)}$ was used to compute $q^{(t+1)}$" — **not** a conditional probability!*
For GMMs, the joint factors as $p(\mathbf{x}_i, z_i=k|\boldsymbol{\theta}) = p(\mathbf{x}_i|z_i=k, \boldsymbol{\theta}) \cdot p(z_i=k|\boldsymbol{\theta}) = \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \cdot \pi_k$
Taking the log: $\log p(\mathbf{x}_i, z_i=k|\boldsymbol{\theta}) = \log \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) + \log \pi_k$
Substituting into the Q-function with $q^{(t+1)}(z_i = k|\mathbf{x}_i) = \gamma_{ik}$:
$
Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma_{ik} \left[ \log \pi_k + \log \mathcal{N}(\mathbf{x}_i | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]
$
**Key insight:** The log is now **inside** the sum over $k$, not outside!
- Each term involves only **one** component's parameters
- This is a **weighted log-likelihood** — we can maximize in closed form!
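Because the log sits inside the sum over $k$, $Q$ splits into one weighted term per component. A 1-D sketch (function names are illustrative) that also previews the M-step: the weighted-mean update maximizes $Q$ in $\boldsymbol{\mu}_k$:

```python
import numpy as np

def log_normal_1d(x, mu, var):
    """log N(x | mu, var), elementwise."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def q_function(x, gamma, pis, mus, vars_):
    """Q(theta; theta_t): the gamma-weighted complete-data log-likelihood."""
    Q = 0.0
    for k in range(gamma.shape[1]):   # each k contributes an independent term
        Q += np.sum(gamma[:, k] * (np.log(pis[k]) + log_normal_1d(x, mus[k], vars_[k])))
    return Q

x = np.array([0.0, 1.0, 2.0, 10.0])
gamma = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.0, 1.0]])
pis = gamma.mean(axis=0)                                    # N_k / n
mus = (gamma * x[:, None]).sum(axis=0) / gamma.sum(axis=0)  # weighted means
q_function(x, gamma, pis, mus, np.array([1.0, 1.0]))        # maximal over mus
```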
---
## The M-Step: Closed-Form Updates
To maximize $Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{(t)})$, we take derivatives and set to zero.
**For $\boldsymbol{\mu}_k$:** Using $\log \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = -\frac{1}{2}(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1}(\mathbf{x}_i - \boldsymbol{\mu}_k) + \text{const}$
$
\frac{\partial Q}{\partial \boldsymbol{\mu}_k} = \sum_{i=1}^{n} \gamma_{ik} \boldsymbol{\Sigma}_k^{-1}(\mathbf{x}_i - \boldsymbol{\mu}_k) = 0 \quad \Rightarrow \quad \boldsymbol{\mu}_k = \frac{\sum_{i=1}^{n} \gamma_{ik} \mathbf{x}_i}{\sum_{i=1}^{n} \gamma_{ik}}
$
**For $\pi_k$:** Maximize $\sum_k N_k \log \pi_k$ subject to $\sum_k \pi_k = 1$ using Lagrange multipliers:
$
\frac{\partial}{\partial \pi_k}\left[\sum_{j} N_j \log \pi_j + \lambda\left(\sum_{j} \pi_j - 1\right)\right] = \frac{N_k}{\pi_k} + \lambda = 0 \quad \Rightarrow \quad \pi_k = \frac{N_k}{n}
$
where $N_k = \sum_{i=1}^n \gamma_{ik}$ and $\lambda = -n$ from the constraint.
**For $\boldsymbol{\Sigma}_k$:** Similar derivation using matrix calculus gives:
$
\boldsymbol{\Sigma}_k = \frac{\sum_{i=1}^{n} \gamma_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top}{\sum_{i=1}^{n} \gamma_{ik}}
$
---
## The M-Step: Summary
Maximizing $Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{(t)})$ gives the familiar M-step updates:
$
\begin{aligned}
\pi_k^{(t+1)} &= \frac{1}{n} \sum_{i=1}^{n} \gamma_{ik} \\[0.8em]
\boldsymbol{\mu}_k^{(t+1)} &= \frac{\sum_{i=1}^{n} \gamma_{ik} \mathbf{x}_i}{\sum_{i=1}^{n} \gamma_{ik}} \\[0.8em]
\boldsymbol{\Sigma}_k^{(t+1)} &= \frac{\sum_{i=1}^{n} \gamma_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k^{(t+1)})(\mathbf{x}_i - \boldsymbol{\mu}_k^{(t+1)})^\top}{\sum_{i=1}^{n} \gamma_{ik}}
\end{aligned}
$
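The updates above translate directly into a few lines of numpy (a sketch; `m_step` is a name chosen here, taking the responsibility matrix `gamma` of shape `(n, K)`):

```python
import numpy as np

def m_step(X, gamma):
    """Closed-form M-step for a GMM given responsibilities gamma[i, k]."""
    n, d = X.shape
    Nk = gamma.sum(axis=0)                       # effective counts N_k
    pis = Nk / n                                 # mixing weights
    mus = (gamma.T @ X) / Nk[:, None]            # weighted means
    covs = np.empty((len(Nk), d, d))
    for k in range(len(Nk)):
        diff = X - mus[k]
        covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]  # weighted covariance
    return pis, mus, covs
```

With uniform responsibilities, each component's update collapses to the ordinary sample mean and covariance, as expected for a weighted MLE.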
**After the M-step:**
$
\text{ELBO}(q^{(t+1)}, \boldsymbol{\theta}^{(t+1)}; \mathbf{x}) \geq \text{ELBO}(q^{(t+1)}, \boldsymbol{\theta}^{(t)}; \mathbf{x})
$
We've **raised the ELBO** by finding better parameters!
---
## Connecting the Dots
**From theory to practice:** The variational derivation justifies exactly what we presented intuitively!
| Intuitive View | Variational Derivation |
|:---------------|:-----------------------|
| E-step: Compute soft assignments $\gamma_{ik}$ | Minimize $D_{\text{KL}}(q \,\|\, p)$ → set $q = p(z\|x, \boldsymbol{\theta})$ |
| M-step: Weighted MLE for each cluster | Maximize $Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{(t)}) = \mathbb{E}_{z \sim q(z\|x)}[\log p(x,z\|\boldsymbol{\theta})]$ |
| Log-likelihood increases | ELBO ↑ (E-step tightens, M-step raises) |
**Why does EM converge?**
---
## Why the Log-Likelihood Increases
**After E-step** (before M-step): We set $q = p(z|x, \boldsymbol{\theta}^{(t)})$, so $D_{\text{KL}} = 0$
$
\log p(\mathbf{x}|\boldsymbol{\theta}^{(t)}) = \text{ELBO}(q, \boldsymbol{\theta}^{(t)}; \mathbf{x}) + \underbrace{D_{\text{KL}}(q \,\|\, p(z|\mathbf{x}, \boldsymbol{\theta}^{(t)}))}_{= 0}
$
Therefore: $\log p(\mathbf{x}|\boldsymbol{\theta}^{(t)}) = \text{ELBO}(q, \boldsymbol{\theta}^{(t)}; \mathbf{x})$
**After M-step**: We found $\boldsymbol{\theta}^{(t+1)}$ that maximizes ELBO, but $q$ is now stale (computed with old $\boldsymbol{\theta}^{(t)}$):
$
\log p(\mathbf{x}|\boldsymbol{\theta}^{(t+1)}) = \text{ELBO}(q, \boldsymbol{\theta}^{(t+1)}; \mathbf{x}) + \underbrace{D_{\text{KL}}(q \,\|\, p(z|\mathbf{x}, \boldsymbol{\theta}^{(t+1)}))}_{\geq 0}
$
**Combining the inequalities:**
1. M-step maximized ELBO: $\text{ELBO}(q, \boldsymbol{\theta}^{(t+1)}; \mathbf{x}) \geq \text{ELBO}(q, \boldsymbol{\theta}^{(t)}; \mathbf{x})$
2. KL is non-negative: $\log p(\mathbf{x}|\boldsymbol{\theta}^{(t+1)}) \geq \text{ELBO}(q, \boldsymbol{\theta}^{(t+1)}; \mathbf{x})$
$
\boxed{\log p(\mathbf{x}|\boldsymbol{\theta}^{(t+1)}) \geq \text{ELBO}(q, \boldsymbol{\theta}^{(t+1)}; \mathbf{x}) \geq \text{ELBO}(q, \boldsymbol{\theta}^{(t)}; \mathbf{x}) = \log p(\mathbf{x}|\boldsymbol{\theta}^{(t)})}
$
**The log-likelihood is monotonically non-decreasing!** EM is guaranteed to converge to a local maximum.
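This guarantee can be observed directly. A compact 1-D GMM EM loop (a sketch with synthetic data; names are illustrative) records the log-likelihood at every iteration, and the recorded sequence never decreases:

```python
import numpy as np

def em_1d(x, pis, mus, vars_, iters=25):
    """EM for a 1-D GMM; returns the log-likelihood at each iteration."""
    lls = []
    for _ in range(iters):
        # log of pi_k * N(x_i | mu_k, var_k), shape (n, K)
        lw = (np.log(pis)
              - 0.5 * (np.log(2 * np.pi * vars_) + (x[:, None] - mus) ** 2 / vars_))
        m = lw.max(axis=1, keepdims=True)
        lls.append(np.sum(m[:, 0] + np.log(np.exp(lw - m).sum(axis=1))))  # log p(x|theta)
        gamma = np.exp(lw - m)
        gamma /= gamma.sum(axis=1, keepdims=True)            # E-step: responsibilities
        Nk = gamma.sum(axis=0)                               # M-step: weighted MLE
        pis = Nk / len(x)
        mus = (gamma * x[:, None]).sum(axis=0) / Nk
        vars_ = (gamma * (x[:, None] - mus) ** 2).sum(axis=0) / Nk
    return np.array(lls)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])
lls = em_1d(x, np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0]))
# np.diff(lls) is non-negative: each iteration raises (or keeps) the log-likelihood
```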
---
## Summary: Latent Variable Models & EM
**Latent Variable Models:**
- Introduce hidden variables $\mathbf{z}$ to model complex data distributions
- Marginal likelihood $p(\mathbf{x}|\boldsymbol{\theta}) = \int p(\mathbf{x}, z|\boldsymbol{\theta}) \, dz$ — maximizing its log is intractable because the log sits outside the sum/integral
**The EM Algorithm:**
| Step | What it does | Why it works |
|:-----|:-------------|:-------------|
| E-Step | Compute $\gamma_{ik} = p(z_i=k\|\mathbf{x}_i, \boldsymbol{\theta}^{(t)})$ | Minimizes KL divergence → makes ELBO tight |
| M-Step | Update $\boldsymbol{\theta}$ via weighted MLE | Maximizes Q-function → raises ELBO |
**Key Theoretical Insights:**
- **ELBO**: $\log p(\mathbf{x}|\boldsymbol{\theta}) = \text{ELBO}(q, \boldsymbol{\theta}; \mathbf{x}) + D_{\text{KL}}(q \,\|\, p(z|\mathbf{x}, \boldsymbol{\theta}))$
- **Convergence guarantee**: Log-likelihood is monotonically non-decreasing
- **Connection to Lecture 08**: E-step = full posterior, M-step = weighted MLE
**GMM as a Concrete Example:**
- Discrete latent $z \in \{1, \ldots, K\}$ = cluster assignment
- K-means = EM with hard assignments ($\gamma_{ik} \in \{0,1\}$)
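The last point can be made concrete: replacing soft responsibilities with one-hot ("hard") assignments turns the EM updates into the K-means updates (a minimal sketch; function names are illustrative):

```python
import numpy as np

def hard_assign(X, mus):
    """Hard E-step: gamma_ik in {0, 1}, one-hot at the nearest centroid."""
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)  # squared distances
    z = d2.argmin(axis=1)
    gamma = np.zeros((len(X), len(mus)))
    gamma[np.arange(len(X)), z] = 1.0
    return gamma

def kmeans_m_step(X, gamma):
    """With one-hot gamma, the weighted-mean update reduces to cluster means."""
    return (gamma.T @ X) / gamma.sum(axis=0)[:, None]
```

Alternating these two functions is exactly Lloyd's algorithm for K-means, run with the same machinery as soft EM.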