- **Turing Machine (1936)**: Alan Turing's abstract computational model established theoretical limits of computation and introduced the concept of a universal machine capable of simulating any other computation
- **Linear Programming (1947)**: George Dantzig's simplex algorithm enabled systematic optimization of linear objective functions under constraints, becoming foundational for operations research and constrained optimization in machine learning
- **Information Theory (1948)**: Claude Shannon's mathematical framework quantified information and uncertainty through entropy ($H(X) = -\sum p(x) \log p(x)$), establishing fundamental limits for data compression and transmission that underpin modern loss functions and information measures in deep learning
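As a quick illustration of the entropy formula above, here is a minimal Python sketch (standard library only) that computes the Shannon entropy of a discrete distribution given as a list of probabilities; the example values are purely illustrative:

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H(X) = -sum p(x) * log p(x) of a discrete distribution."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A fair coin carries exactly 1 bit of information per flip ...
print(entropy([0.5, 0.5]))   # 1.0
# ... while a heavily biased coin carries far less.
print(entropy([0.9, 0.1]))   # ~0.47
```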
---
## Early History of Neural Networks
- In 1943, McCulloch and Pitts created the first mathematical model of an artificial neuron
- Demonstrated neurons could be modeled as binary threshold units performing logical operations (AND, OR, NOT)
- Proved networks of artificial neurons could compute any logical or arithmetic function
- Provided the first formal argument that the brain could be understood as a computing device
- In 1957, the perceptron was introduced by Frank Rosenblatt
- It was a simple model that could learn to classify inputs into different categories by adjusting its weights based on errors
- These errors were computed against prelabeled data - a setting known as supervised learning
- Later, the multi-layer perceptron was developed, allowing for more complex representations of data
- In 1979, convolutional neural networks were introduced - replacing dense matrix multiplications with convolution operations
- Three years later, Hopfield networks were proposed, introducing recurrent connections and temporal dynamics
- Then the backpropagation algorithm enabled training of multi-layer networks - efficiently computing gradients
- Before the deep learning era, Deep Belief Networks were proposed as a way to pre-train deep networks layer by layer
- Finally, in 2012, AlexNet demonstrated the power of large deep convolutional networks on image classification tasks - marking the beginning of the deep learning revolution
---
## Early History of Neural Networks
```python
import random

# Stochastic gradient descent (SGD): one parameter update per training example.
# `initialize_parameters`, `loss`, `compute_gradient`, and `training_data` are
# placeholders for a concrete model and dataset.

# Initialize parameters
θ = initialize_parameters()
learning_rate = 0.01
num_epochs = 100

# Training loop
for epoch in range(num_epochs):
    # Shuffle training data so updates are not order-dependent
    random.shuffle(training_data)
    # Iterate through each training example
    for x_i, y_i in training_data:
        # Gradient of the loss w.r.t. θ for a single example
        gradient = compute_gradient(loss, θ, x_i, y_i)
        # Update parameters by stepping against the gradient
        θ = θ - learning_rate * gradient
```
- Now let's look at some key milestones in neural audio systems during this early history
- Already in 1960, Widrow and Hoff introduced the Least Mean Square filtering algorithm
- Then 27 years later, neural networks were applied to phoneme recognition
- In 1989, Peter Todd used RNNs for symbolic music generation
- In the same year, there were the first attempts to use gradient descent for musical DSP
- In 1997, neural networks were used for the first time to model analog effects
- Music transcription with neural networks dates back to 1999, with Matija Marolt's work on piano transcription
- Finally in 2009, Lee et al. demonstrated the effectiveness of deep belief networks for learning audio features with unsupervised learning - unlabeled data
- These features outperformed traditional hand-crafted features in many classification tasks
---
## Early History of Neural Audio Systems
Key Milestones
Significant developments in neural audio systems

- 1960: LMS Filtering (Widrow & Hoff)
- 1987: NN for Phoneme Recognition (Waibel et al.)
- 1989: RNN for Symbolic Music Generation (Todd)
- 1989: Gradient Descent for Musical DSP (Shynk & Moorer)
- 1997: NN for Analog Effects Modeling (Zhang & Duhamel)
- 1999: NN for Piano Transcription (Matija Marolt)
- 2006: Deep Belief Networks
- 2009: Audio Features with DBN (Lee et al.)

- Gradient Descent Based Digital Signal Processing: use gradient descent to optimize parameters of digital signal processing algorithms for tasks like audio effects modeling and synthesis.
- Feature Extraction with Neural Networks: use neural networks to automatically learn and extract relevant features from audio data for tasks like classification, transcription, and analysis.
- Symbolic Music Generation with Neural Networks: use neural networks to generate symbolic music representations, such as music notation or MIDI sequences, for composition and arrangement tasks.
What about neural audio synthesis?
Notes:
- I would like to highlight that these early works can be categorised into three main areas.
- First, gradient descent based digital signal processing - using gradient descent to optimize parameters of DSP algorithms (a minimal sketch follows these notes)
- Second, feature extraction with neural networks - using neural networks to automatically learn and extract relevant features
- And the third category is symbolic music generation with neural networks
- But what about neural audio synthesis?
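To make the first of these categories concrete, here is a minimal, hypothetical sketch of gradient-descent-based DSP: the coefficient of a one-pole lowpass filter is fitted so that its output matches a reference filter, using a finite-difference gradient of the mean squared error. The filter, signal, and learning-rate values are illustrative assumptions, not taken from the cited works.

```python
import numpy as np

# One-pole lowpass y[n] = (1 - a) * x[n] + a * y[n-1]; we fit its coefficient `a`
# by gradient descent so that the filter matches a reference filter's output.
def one_pole(x, a):
    y = np.zeros_like(x)
    prev = 0.0
    for n in range(len(x)):
        prev = (1.0 - a) * x[n] + a * prev
        y[n] = prev
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal(512)          # white-noise test signal
target = one_pole(x, 0.8)             # output of the (unknown) reference filter

a, lr, eps = 0.1, 0.2, 1e-4
for _ in range(200):
    loss = np.mean((one_pole(x, a) - target) ** 2)
    # Finite-difference gradient of the loss w.r.t. the coefficient
    grad = (np.mean((one_pole(x, a + eps) - target) ** 2) - loss) / eps
    a -= lr * grad                     # gradient descent update

print(f"estimated coefficient a = {a:.3f} (reference: 0.8)")
```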
---
## The Deep Learning Era
Deep architectures
Deep architectures and generative models transforming AI capabilities

- 2013: Variational Autoencoders (Kingma et al.)
- 2014: Generative Adversarial Nets (Goodfellow et al.)
- 2015: ResNet & Diffusion (He et al. & Sohl-Dickstein et al.)
- 2016: Style Transfer & WaveNet (Gatys & van den Oord)
- 2017: Transformers (Vaswani et al.)
- 2021: ViT & CLIP (Dosovitskiy & Radford)
- 2022: Diffusion Transformer (Peebles & Xie)
https://theaisummer.com/Autoencoder/
https://www.linkedin.com/pulse/what-generative-adversarial-networks-gans-sushant-babbar-qpc9c
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in neural information processing systems, 33, 6840-6851.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PmLR.
https://digialps.com/stability-ais-new-open-source-ai-creation-stable-audio-2-0-takes-on-suno-ai/
Notes:
- Well, for neural audio synthesis we need the inventions of the deep learning era - first an overview of key milestones in deep learning in general
- In 2013, Variational Autoencoders were introduced - ability to generate new data points by sampling from a learned distribution - the latent distribution
- Learn in an unsupervised manner to encode input data into a compressed representation and then decode it back to the original input
- In 2014, Generative Adversarial Networks were proposed - two neural networks competing against each other
- In 2015, Diffusion models were introduced - iterative denoising process to generate high-quality samples
- In 2017, Transformers revolutionized sequence modeling with self-attention mechanisms
- In 2021, CLIP demonstrated the power of multi-modal learning by connecting images and text
- Two encoders map images and text into a shared latent space - through contrastive learning, matching images and texts are mapped close to each other in that space (see the contrastive-loss sketch after these notes)
- It could for example classify images, without ever being trained on that specific task
- In 2022, Diffusion Transformers combined the strengths of diffusion models and transformers
- And finally in 2023, Mamba was introduced - a new architecture for sequence modeling
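As a rough illustration of the contrastive idea behind CLIP (a sketch of the principle, not the original implementation), the snippet below assumes two placeholder encoder networks that return embeddings of the same dimensionality and computes the symmetric cross-entropy loss over the image-text similarity matrix:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_encoder, text_encoder, images, texts, temperature=0.07):
    """Symmetric contrastive loss over a batch of matching image-text pairs.
    `image_encoder` and `text_encoder` are placeholder modules returning
    embeddings of the same dimensionality."""
    img_emb = F.normalize(image_encoder(images), dim=-1)   # (B, D), unit length
    txt_emb = F.normalize(text_encoder(texts), dim=-1)     # (B, D), unit length
    logits = img_emb @ txt_emb.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # i-th image matches i-th text
    # Cross-entropy in both directions: image -> text and text -> image
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```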
---
## The Deep Learning Era
Deep architectures
Deep architectures and generative models transforming AI capabilities

- 2013: Variational Autoencoders (Kingma et al.)
- 2014: Generative Adversarial Nets (Goodfellow et al.)
- 2015: ResNet & Diffusion (He et al. & Sohl-Dickstein et al.)
- 2016: Style Transfer & WaveNet (Gatys & van den Oord)
- 2017: Transformers (Vaswani et al.)
- 2021: ViT & CLIP (Dosovitskiy & Radford)
- 2022: Diffusion Transformer (Peebles & Xie)
Training & Optimization
Advanced learning techniques and representation learning breakthroughs

- 2013: Word2Vec (Mikolov et al.)
- 2014: Attention Mechanism (Bahdanau et al.)
- 2015: BatchNorm & Adam (Ioffe & Kingma)
- 2016: Layer Normalization (Ba et al.)

Software & Applications
Practical deployment and mainstream adoption of deep learning systems

- 2016: AlphaGo (Silver et al.)
- 2017: PyTorch (Paszke et al.)
- 2018: GPT-1 & BERT (Radford & Devlin)
- 2020: GPT-3 (Brown et al.)
- 2022: ChatGPT & Stable Diffusion (OpenAI & Stability AI)
- 2023: LLaMA (Touvron et al.)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26.
Bahdanau, D., Cho, K., & Bengio, Y. (2016). Neural Machine Translation by Jointly Learning to Align and Translate (No. arXiv:1409.0473). arXiv. https://doi.org/10.48550/arXiv.1409.0473
Adaptive Moment Estimation - combines momentum and RMSprop:
First moment (momentum):
$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}_t)$
Second moment (RMSprop):
$\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2)(\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}_t))^2$
Bias correction:
$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1-\beta_1^t}, \quad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1-\beta_2^t}$
Parameter update:
$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \alpha \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}$
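The update above maps almost line for line onto code; here is a minimal NumPy sketch of a single Adam step (a simplified illustration, not a reference implementation):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters `theta` given gradient `grad` at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad            # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment (RMSprop-style)
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)   # parameter update
    return theta, m, v
```

Here `m` and `v` start as zero arrays of the same shape as `theta`, and `t` counts update steps from 1 so the bias correction is well defined.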
Notes:
- Well, for neural audio synthesis we need the inventions of the deep learning era - first an overview of key milestones in deep learning in general
- In 2013, Variational Autoencoders were introduced - ability to generate new data points by sampling from a learned distribution - the latent distribution
- Learn in an unsupervised manner to encode input data into a compressed representation and then decode it back to the original input
- In 2014, Generative Adversarial Networks were proposed - two neural networks competing against each other
- In 2015, Diffusion models were introduced - iterative denoising process to generate high-quality samples
- In 2017, Transformers revolutionized sequence modeling with self-attention mechanisms
- In 2021, CLIP demonstrated the power of multi-modal learning by connecting images and text
- Two encoders that map images and text into a shared latent space - by using contrastive learning the images and text are mapped close to each other in the latent space
- It could for example classify images, without ever being trained on that specific task
- In 2022, Diffusion Transformers combined the strengths of diffusion models and transformers
- And finally in 2023, Mamba was introduced - a new architecture for sequence modeling
---
## Deep Neural Audio Systems
Key Milestones
Significant developments in deep neural audio systems

Deep learning foundations:
- 2013: VAE (Kingma & Welling)
- 2014: GAN (Goodfellow et al.)
- 2015: Diffusion (Sohl-Dickstein et al.)
- 2017: Transformers (Vaswani et al.)
- 2021: CLIP (Radford et al.)
- 2022: Diffusion Transformer (Peebles & Xie)

Neural audio systems:
- 2017: Neural Synthesis (Engel et al.)
- 2019: Real-time Amp Emulation (Damskägg et al.)
- 2020: Automatic Mixing (Steinmetz et al.)
- 2021: RAVE (Caillon & Esling)
- 2022: CLAP (Elizalde et al.)
- 2024: Stable Audio (Evans et al.)
- 2025: Lyria 2 (The Lyria Team)
Oord, A. van den, Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., & Kavukcuoglu, K. (2016). WaveNet: A Generative Model for Raw Audio (No. arXiv:1609.03499). https://doi.org/10.48550/arXiv.1609.03499
Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., & Simonyan, K. (2017, July). Neural audio synthesis of musical notes with wavenet autoencoders. In International conference on machine learning (pp. 1068-1077). PMLR.
Wright, A., Damskägg, E.-P., Juvela, L., & Välimäki, V. (2020). Real-Time Guitar Amplifier Emulation with Deep Learning. Applied Sciences, 10(3), 766. https://doi.org/10.3390/app10030766
Engel, J., Hantrakul, L. (Hanoi), Gu, C., & Roberts, A. (2019, September 25). DDSP: Differentiable Digital Signal Processing. International Conference on Learning Representations.
Caillon, A., & Esling, P. (2021). RAVE: A variational autoencoder for fast and high-quality neural audio synthesis (No. arXiv:2111.05011). arXiv. http://arxiv.org/abs/2111.05011
Elizalde, B., Deshmukh, S., Al Ismail, M., & Wang, H. (2023, June). Clap learning audio concepts from natural language supervision. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1-5). IEEE.
https://digialps.com/stability-ais-new-open-source-ai-creation-stable-audio-2-0-takes-on-suno-ai/
Notes:
- We left the neural audio systems before the deep learning era, saying that there was no neural audio generation yet
- But that changed with the WaveNet model in 2016
- WaveNet used a clever trick in convolutional neural networks to model raw audio waveforms - so-called dilated convolutions, which increase the receptive field of the network (see the short sketch after these notes)
- This allowed the model to capture long-range dependencies in audio signals, resulting in high-quality and realistic audio generation
- In 2017, Engel et al. introduced Neural Synthesis with WaveNet Autoencoders - a model that could generate musical notes by learning a latent representation of audio
- In 2019, the same team (Google Magenta) further advanced the field with Differentiable Digital Signal Processing (DDSP) - combining neural networks with traditional signal processing techniques
- Basically, they were predicting the parameters of an additive synthesizer with deep learning
- The key to this approach is that the synthesis process is differentiable, allowing for end-to-end training of the model
- In 2020, Steinmetz et al. proposed an approach for automatic mixing based on differentiable effects
- In 2021, Caillon and Esling introduced RAVE - a real-time audio synthesis model using variational autoencoders
- What works for images and text should also work for audio - in 2022, CLAP was introduced - a model that learns audio concepts from natural language supervision
- And finally in 2024, Stable Audio Open was released - a model based on diffusion transformers for high-quality text-to-audio generation
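To give a feel for why dilation matters, this small Python sketch (illustrative layer counts, not WaveNet's exact configuration) computes how the receptive field of stacked 1-D convolutions grows when the dilation doubles per layer, compared to undilated layers:

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of stacked 1-D convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

layers = 10
dilated = [2 ** i for i in range(layers)]       # 1, 2, 4, ..., 512
undilated = [1] * layers

print(receptive_field(2, dilated))    # 1024 samples with dilation doubling per layer
print(receptive_field(2, undilated))  # 11 samples without dilation
```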
---
OUR RECENT RESEARCH CONTRIBUTIONS
Selected work from the Computer Music and Neural Audio Systems Research Team
Audio Communication Group
Technische Universität Berlin
Notes:
- Ok, this was my overview of the academic field from its origins to the present day
- This area is receiving growing interest from research groups worldwide
- From us, as well
- So I'd like to show you three of our recent contributions in the years 2024 and 2025
- By "us" I refer to the Computer Music and Neural Audio Systems Research Team at the Audio Communication Group
---
Anira (Ackva, V.* & Schulz, F.*)
ANIRA: An Architecture for Neural network Inference in Real-time Audio applications
→ C++ Library that bridges the gap between neural audio research and real-time applications
Key Contributions
- Enables real-time safe neural network integration in DAWs and audio plugins
- Provides a framework for benchmarking neural networks in real-time scenarios
- Paper: First benchmark of neural audio effects models with different backends in real-time audio contexts
Open-source • Extensive documentation • Permissive licensing
Ackva, V., & Schulz, F. (2024). ANIRA: An Architecture for Neural Network Inference in Real-Time Audio Applications. 2024 IEEE 5th International Symposium on the Internet of Sounds (IS2), 1–10. https://doi.org/10.1109/IS262782.2024.10704099
Notes:
- The first contribution is ANIRA - an architecture for neural network inference in real-time audio applications - a project mainly by my colleague Valentin Ackva and me
- Inference is the process of using a trained neural network to make predictions on new data
- ANIRA is a C++ library that tries to bridge the gap between neural audio research and real-time applications
- It has two major focus areas - first the real-time safe integration of neural networks into DAWs, audio plugins and audio applications in general
- The second focus area is the performance evaluation of neural networks in audio applications
- For this, ANIRA provides a framework for benchmarking neural networks in real-time scenarios (a generic sketch of the underlying real-time budget check follows these notes)
- And our paper was the first benchmark of neural audio effects models with different backends in real-time audio contexts
- Finally, ANIRA is open-source, has extensive documentation and permissive licensing
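As a generic illustration of the real-time constraint such benchmarks measure against (this is not ANIRA's API, just the underlying idea in plain Python), a model's worst-case per-buffer inference time has to stay below the duration of one audio buffer:

```python
import time
import numpy as np

def benchmark_inference(run_inference, buffer_size=512, sample_rate=48000, n_runs=100):
    """Measure per-buffer inference time of `run_inference` (a placeholder callable
    taking one audio buffer) against the real-time deadline for that buffer size."""
    deadline_ms = 1000.0 * buffer_size / sample_rate     # ~10.7 ms at 512 samples / 48 kHz
    buffer = np.zeros(buffer_size, dtype=np.float32)
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        run_inference(buffer)
        timings.append((time.perf_counter() - start) * 1000.0)
    worst = max(timings)
    print(f"worst case: {worst:.2f} ms, deadline: {deadline_ms:.2f} ms, "
          f"real-time safe: {worst < deadline_ms}")
```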
---
Neural Proxies for Sound Synthesizers
(Combes, P., Weinzierl, S., Obermayer, K.)
→ How can we integrate non-differentiable synthesizers into deep learning pipelines for automatic synthesizer programming?
Key Contributions
- Method for training neural proxies for arbitrary synthesizers
- Evaluation of pretrained audio feature extraction models as proxy training representations
- Evaluation of method on synthesizer sound matching task
Open-source
Combes, P., Weinzierl, S., & Obermayer, K. (2025). Neural Proxies for Sound Synthesizers: Learning Perceptually Informed Preset Representations. Journal of the Audio Engineering Society, 73(9), 561–577. https://doi.org/10.17743/jaes.2022.0219
Training of a neural proxy to mimic the behavior of a non-differentiable synthesizer
Training of a synthesizer sound matching system using the neural proxy
Notes:
- The next contribution is Neural Proxies for Sound Synthesizers, primarily led by my colleague Paulo Combes
- The central question: how can we integrate non-differentiable synthesizers into deep learning pipelines for automatic synthesizer programming?
- In deep learning everything needs to be differentiable for our backpropagation algorithm to work
- This is why neural audio synthesis models like DDSP rely on differentiable synthesizers
- However, many high-quality synthesizers are non-differentiable, which limits their use in deep learning workflows
- Paulo's solution: neural proxies - differentiable neural networks that mimic non-differentiable synthesizer behavior
- The training process uses an audio feature extraction model (g()) to extract features from synthesizer output
- Then a neural network (f()) is trained to map synthesizer parameters to these extracted features (a rough sketch of this training step follows these notes)
- The paper also provides extensive evaluation of different audio feature extraction models as proxy training representations
- Finally, the method was evaluated on synthesizer sound matching tasks
- Using the neural proxy (f()) to train a network (e()) that predicts synthesizer parameters for a given target sound
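Below is a rough, hypothetical sketch of the proxy training step described above; `synth`, `g`, and the network sizes are placeholders rather than the paper's actual code, and the proxy `f` is trained to predict the features that `g` extracts from the rendered audio:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical proxy f: maps a 16-parameter preset to the 128-dim feature space of g.
proxy = nn.Sequential(nn.Linear(16, 256), nn.ReLU(), nn.Linear(256, 128))
optimizer = torch.optim.Adam(proxy.parameters(), lr=1e-4)

def proxy_training_step(params_batch, synth, g):
    """params_batch: (B, 16) tensor of synthesizer presets.
    synth: non-differentiable synthesizer, called outside the autograd graph.
    g: pretrained audio feature extractor returning (B, 128) embeddings."""
    with torch.no_grad():
        audio = synth(params_batch)   # render audio; no gradients flow through here
        target = g(audio)             # reference features of the rendered audio
    pred = proxy(params_batch)        # proxy predicts the same features directly
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Once trained, the differentiable proxy can stand in for the synthesizer when training the parameter estimator (e()) end to end on the sound matching task.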
---
pGESAM (Limberg, C.*, Schulz, F.*, Zhang, Z., Weinzierl, S.)
pGESAM: pitch-conditioned GEnerative SAmple Map
→ How can musicians find the perfect samples in an effective and creative way?
→ How can we generate samples that can be played expressively throughout different pitches?
Key Contributions
- Framework for the successful generation of 4-second one-shot samples from 3 data points
- Effective pitch-timbre disentanglement via semi-supervised learning (2D timbre, 1D pitch)
- Extensive evaluation on NSynth dataset
Open-source • Web Demonstration
Limberg, C., Schulz, F., Zhang, Z., & Weinzierl, S. (2025). Pitch-Conditioned Instrument Sound Synthesis from an Interactive Timbre Latent Space. 28th International Conference on Digital Audio Effects (DAFx25), 1–8. https://dafx.de/paper-archive/2025/DAFx25_paper_58.pdf
Notes:
- The last contribution is pGESAM - pitch-conditioned Generative Sample Map - a collaboration primarily between Christian Limberg and me
- Two central questions:
- How can musicians find the perfect samples in an effective and creative way?
- How can we generate samples that can be played expressively throughout different pitches?
- Key contributions: a framework generating 4-second one-shot samples from just 3 data points
- Three floats input, 4-second audio output
- These dimensions are disentangled - independent control over timbre (2D) and pitch (1D)
- Architecture overview: neural audio codec extracts embeddings (e), VAE learns low-dimensional timbre representation with disentangled pitch, pitch/timbre-conditioned transformer generates audio embeddings autoregressively
- Extensive evaluation on NSynth dataset demonstrates effectiveness
- Now I want to show you a quick demo of the pGESAM framework with our interactive web application
---
OUTLOOK
---
## Future Directions in Neural Audio Systems
Deep Learning & Model Architectures
- Advanced sequence modeling for extended, coherent audio generation
- Methods for explainability and interpretability of neural audio models
- Synthetic data generation with generative models
Deployment & Real-time Performance
- Real-time inference optimization for low-latency audio processing
- Efficient model compression for resource-constrained devices
- Sample-rate agnostic architectures for flexible synthesis
Creative & Artistic Applications
- Improved control mechanisms for user-guided generation and processing
- Multi-modal conditioning for richer, more expressive outputs
- Enhanced embodiment in neural musical instruments
Notes:
- In the deep learning research area there is active work on long-term coherent generation, model explainability, and synthetic data creation
- For real-time contexts, inference optimization, model compression, and sample-rate agnostic architectures are important topics
- Finally, for creative applications, there is research in enhanced user control and better multi-modal conditioning, which would hopefully lead to more embodiment of neural musical instruments
---
# Setup Python Environment