- Maps the decoder's output back to the vocabulary space for token prediction
- Typically implemented as a linear layer followed by a softmax activation
- Shares weights with the embedding layer to reduce the number of parameters and improve performance (Press & Wolf, 2017)
- Converts the decoder's continuous representations into logits for each token in the vocabulary
$
\begin{aligned}
\mathbf{y}(\mathbf{h}_t) &= \mathbf{h}_t \mathbf{W}^{\top} + \mathbf{b} \\
&= \mathbf{h}_t \mathbf{W}^{\top} \quad \text{(if weights are shared, } \mathbf{b} = 0\text{)}\\
\mathbf{p}_t(\mathbf{h}_t) &= \mathrm{softmax}(\mathbf{y}) \quad \text{where} \quad [\mathbf{p}_t]_i = \frac{\exp(y_i)}{\sum_{j=1}^{V} \exp(y_j)}
\end{aligned}
$
where $\mathbf{W} \in \mathbb{R}^{V \times D}$ is the shared weight matrix from the embedding layer, $V$ is the vocabulary size, and $D$ is the model dimension.
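As a rough illustration, a minimal NumPy sketch of the weight-tied output projection (all array names and sizes are illustrative, not taken from a particular implementation):
```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

V, D = 50257, 512                                   # vocabulary size, model dimension
embedding_matrix = np.random.randn(V, D) * 0.02     # W, shared with the embedding layer
hidden_state = np.random.randn(D)                   # h_t from the decoder

logits = hidden_state @ embedding_matrix.T          # y = h_t W^T (no bias when weights are tied)
probs = softmax(logits)                             # p_t, sums to 1 over the vocabulary
next_token = int(np.argmax(probs))                  # greedy prediction of the next token
```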
---
## Positional Encoding
- Since Transformers do not have inherent sequential processing, positional encodings are added to input embeddings to provide information about the order of tokens
- Can be implemented using fixed sinusoidal functions or learned embeddings
- Enables the model to capture the relative and absolute positions of tokens in the sequence
**Vanilla sinusoidal positional encoding formula:**
$
\begin{aligned}
\mathrm{PE}(t, 2i) &= \sin\left(\frac{t}{10000^{2i/d_{\text{model}}}}\right) \\
\mathrm{PE}(t, 2i+1) &= \cos\left(\frac{t}{10000^{2i/d_{\text{model}}}}\right)
\end{aligned}
$
where $t$ is the token position, $i$ is the dimension index, and $d_{\text{model}}$ is the model dimension.
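The formula maps directly to code; a minimal NumPy sketch (function and variable names are illustrative):
```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of fixed sinusoidal encodings."""
    positions = np.arange(seq_len)[:, None]            # t = 0 .. seq_len-1
    dims = np.arange(0, d_model, 2)[None, :]            # 2i = 0, 2, 4, ...
    angles = positions / (10000 ** (dims / d_model))    # t / 10000^(2i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=128, d_model=512)
# pe is added element-wise to the (128, 512) matrix of input embeddings
```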
---
## Self-Attention Mechanism
Self-attention allows each token in the input sequence to attend to all other tokens, enabling the model to capture dependencies regardless of their distance in the sequence
**Step 1: Compute Queries, Keys, and Values**
For each token, compute query ($\mathbf{Q}$), key ($\mathbf{K}$), and value ($\mathbf{V}$) vectors using learned linear projections.
$
\mathbf{Q} = \mathbf{X} \mathbf{W}_Q, \quad \mathbf{K} = \mathbf{X} \mathbf{W}_K, \quad \mathbf{V} = \mathbf{X} \mathbf{W}_V
$
where $\mathbf{X} \in \mathbb{R}^{T \times D}$ is the input sequence matrix, and $\mathbf{W}_Q \in \mathbb{R}^{D \times D_k}$, $\mathbf{W}_K \in \mathbb{R}^{D \times D_k}$, $\mathbf{W}_V \in \mathbb{R}^{D \times D_v}$ are learned weight matrices. Each token has dimension $D$, and the sequence length is $T$.
---
## Scaled Dot-Product Attention
**Step 2: Compute Attention Scores**
$
\mathbf{A} = \mathrm{softmax}\left(\frac{\mathbf{Q} \mathbf{K}^{\top}}{\sqrt{D_k}}\right)
$
where $D_k$ is the dimension of the key vectors; scaling by $\sqrt{D_k}$ prevents large dot products from pushing the softmax into regions with vanishingly small gradients. $\mathbf{A} \in \mathbb{R}^{T \times T}$ contains the attention weights for each token pair in the sequence.
**Step 3: Compute Weighted Sum of Values**
$
\mathbf{Z} = \mathbf{A} \mathbf{V}
$
$\mathbf{Z} \in \mathbb{R}^{T \times D_v}$ captures information from all tokens in the sequence, weighted by their relevance to the query token
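Steps 1–3 in a single NumPy sketch, with random matrices standing in for the learned projections (all names and sizes are illustrative):
```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # stable softmax
    return e / e.sum(axis=axis, keepdims=True)

T, D, D_k, D_v = 8, 64, 32, 32            # sequence length and dimensions
X = np.random.randn(T, D)                 # input sequence
W_Q = np.random.randn(D, D_k) * 0.1       # learned projections (random stand-ins here)
W_K = np.random.randn(D, D_k) * 0.1
W_V = np.random.randn(D, D_v) * 0.1

Q, K, V = X @ W_Q, X @ W_K, X @ W_V       # Step 1: queries, keys, values
A = softmax(Q @ K.T / np.sqrt(D_k))       # Step 2: (T, T) attention weights
Z = A @ V                                 # Step 3: (T, D_v) weighted sum of values
```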
---
## Linear Output Layer
**Step 4: Final Linear Projection**
- After obtaining the output from the self-attention mechanism, a linear layer is applied to project the output back to the model dimension
- This linear transformation allows the model to learn complex combinations of the attended information
- The output of this layer is then passed through subsequent layers in the Transformer architecture
$
\mathbf{h}_{\text{out}} = \mathbf{Z} \mathbf{W}_O + \mathbf{b}_O
$
---
## Masked / Causal Self-Attention Mechanism
- In autoregressive models, causal self-attention ensures that each token can only attend to previous tokens in the sequence, preventing information leakage from future tokens
- This is typically implemented by applying a mask to the attention scores before the softmax operation
The masked attention scores are computed as follows:
$
\begin{aligned}
\mathbf{M}_{i,j} &= \begin{cases}
0 & \text{if } j \leq i \\
-\infty & \text{if } j > i
\end{cases} \\
\mathbf{A}_{\text{masked}} &= \mathrm{softmax}\left(\frac{\mathbf{Q} \mathbf{K}^{\top}}{\sqrt{D_k}} + \mathbf{M}\right)
\end{aligned}
$
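A minimal sketch of the causal mask applied to the scaled scores (shapes and names are illustrative):
```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

T, D_k = 8, 32
Q = np.random.randn(T, D_k)
K = np.random.randn(T, D_k)

M = np.triu(np.full((T, T), -np.inf), k=1)          # -inf where j > i, 0 where j <= i
A_masked = softmax(Q @ K.T / np.sqrt(D_k) + M)      # exp(-inf) = 0, so future weights vanish
assert np.allclose(np.triu(A_masked, k=1), 0.0)     # no attention to future tokens
```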
---
## Multi-Head Self-Attention Mechanism
- Instead of performing a single attention function, multi-head attention runs multiple attention operations in parallel
- Each "head" learns different attention patterns, allowing the model to capture various aspects of relationships (e.g., syntactic, semantic) between tokens
- Enables the model to jointly attend to information from different representation subspaces
**Step 1: Project inputs to multiple heads**
For each head $i$, compute separate Q, K, V projections:
$
\mathbf{Q}_i = \mathbf{X} \mathbf{W}_Q^i, \quad \mathbf{K}_i = \mathbf{X} \mathbf{W}_K^i, \quad \mathbf{V}_i = \mathbf{X} \mathbf{W}_V^i
$
where $\mathbf{W}_Q^i \in \mathbb{R}^{D \times D_k}$, $\mathbf{W}_K^i \in \mathbb{R}^{D \times D_k}$, $\mathbf{W}_V^i \in \mathbb{R}^{D \times D_v}$ are unique weight matrices for head $i$
---
## Multi-Head Self-Attention Mechanism
**Step 2: Compute attention for each head**
$
\begin{aligned}
\mathbf{A}_i &= \mathrm{softmax}\left(\frac{\mathbf{Q}_i \mathbf{K}_i^{\top}}{\sqrt{D_k}}\right) \\
\mathbf{Z}_i &= \mathbf{A}_i \mathbf{V}_i
\end{aligned}
$
where $\mathbf{A}_i$ are the attention weights for head $i$, and $\mathbf{Z}_i$ is the output of head $i$
**Step 3: Concatenate heads and project**
$
\begin{aligned}
\mathbf{Z}_{\text{concat}} &= \mathrm{Concat}(\mathbf{Z}_1, \mathbf{Z}_2, \ldots, \mathbf{Z}_h) \\
\mathbf{h}_{\text{out}} &= \mathbf{Z}_{\text{concat}} \mathbf{W}_O + \mathbf{b}_O
\end{aligned}
$
where $h$ is the number of heads, and $\mathbf{W}_O \in \mathbb{R}^{h \cdot D_v \times D}$ projects the concatenated outputs back to the model dimension
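A compact NumPy sketch of multi-head attention; the per-head matrices $\mathbf{W}_Q^i$, $\mathbf{W}_K^i$, $\mathbf{W}_V^i$ are stacked into single matrices and split by reshaping, which is equivalent (names and sizes are illustrative):
```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, b_O, h):
    T = X.shape[0]
    D_k = W_Q.shape[1] // h                                   # per-head dimension
    def split(M):                                             # (T, h*D_k) -> (h, T, D_k)
        return M.reshape(T, h, -1).transpose(1, 0, 2)
    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)  # per-head Q_i, K_i, V_i
    A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(D_k))      # (h, T, T) per-head weights
    Z = A @ V                                                 # (h, T, D_k) per-head outputs
    Z_concat = Z.transpose(1, 0, 2).reshape(T, -1)            # concatenate heads: (T, h*D_k)
    return Z_concat @ W_O + b_O                               # project back to (T, D)

T, D, h = 8, 64, 4
X = np.random.randn(T, D)
W_Q, W_K, W_V = [np.random.randn(D, D) * 0.1 for _ in range(3)]
W_O, b_O = np.random.randn(D, D) * 0.1, np.zeros(D)
out = multi_head_attention(X, W_Q, W_K, W_V, W_O, b_O, h)     # (8, 64)
```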
---
## Cross-Attention Mechanism
- Cross-attention lets queries from one sequence attend to a different sequence (e.g., encoder outputs) rather than to the same sequence (as in self-attention)
- Used to integrate information from the encoder into the decoder in sequence-to-sequence tasks such as machine translation, but it can also be applied wherever two different sequences need to interact
**Cross-attention formula:**
$
\begin{aligned}
\mathbf{Q} &= \mathbf{X}_{\text{decoder}} \mathbf{W}_Q \\
\mathbf{K} &= \mathbf{X}_{\text{encoder}} \mathbf{W}_K \\
\mathbf{V} &= \mathbf{X}_{\text{encoder}} \mathbf{W}_V
\end{aligned}
$
where $\mathbf{X}_{\text{decoder}} \in \mathbb{R}^{T_{\text{decoder}} \times D}$ are the decoder inputs and $\mathbf{X}_{\text{encoder}} \in \mathbb{R}^{T_{\text{encoder}} \times D}$ are the encoder outputs. The resulting $\mathbf{Q} \in \mathbb{R}^{T_{\text{decoder}} \times D_k}$, $\mathbf{K} \in \mathbb{R}^{T_{\text{encoder}} \times D_k}$, and $\mathbf{V} \in \mathbb{R}^{T_{\text{encoder}} \times D_v}$ are then used in the scaled dot-product attention as usual.
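Relative to self-attention, only the sources of Q versus K, V change; a brief sketch with random stand-ins for the encoder outputs and decoder states:
```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

T_enc, T_dec, D, D_k, D_v = 10, 6, 64, 32, 32
X_encoder = np.random.randn(T_enc, D)       # encoder outputs
X_decoder = np.random.randn(T_dec, D)       # decoder hidden states
W_Q, W_K = np.random.randn(D, D_k) * 0.1, np.random.randn(D, D_k) * 0.1
W_V = np.random.randn(D, D_v) * 0.1

Q = X_decoder @ W_Q                         # queries come from the decoder
K, V = X_encoder @ W_K, X_encoder @ W_V     # keys and values come from the encoder
A = softmax(Q @ K.T / np.sqrt(D_k))         # (T_dec, T_enc) cross-attention weights
Z = A @ V                                   # (T_dec, D_v) encoder info per decoder position
```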
---
## Residual Connections
- Mitigate vanishing gradients by providing a direct gradient path to earlier layers, enabling training of very deep networks
- Let the network learn small adjustments (residuals) instead of full transformations, which simplifies optimization
- Preserve information, since layers can pass data through unchanged or refine it without loss
$
\mathbf{y} = \mathbf{x} + \mathrm{Sublayer}(\mathbf{x})
$
---
## Layer Normalization
- Stabilizes and accelerates training by normalizing inputs across features for each data point
- Reduces internal covariate shift, making training less sensitive to initialization and learning rates
$
\mathbf{y}(\mathbf{x}) = \frac{\mathbf{x} - \mu}{\sigma} \odot \boldsymbol{\gamma} + \boldsymbol{\beta}
$
where $\mu$ and $\sigma$ are the mean and standard deviation of the features in $\mathbf{x}$, and $\boldsymbol{\gamma} \in \mathbb{R}^D$, $\boldsymbol{\beta} \in \mathbb{R}^D$ are learnable parameters for scaling and shifting.
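A minimal sketch combining layer normalization with a residual connection in the post-norm arrangement of the original Transformer; `sublayer` is a placeholder for attention or the feedforward network, and a small `eps` is added for numerical stability:
```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)             # mean over the feature dimension
    sigma = x.std(axis=-1, keepdims=True)           # standard deviation over features
    return (x - mu) / (sigma + eps) * gamma + beta  # scale and shift with learnable params

D = 64
x = np.random.randn(8, D)                           # (T, D) token representations
gamma, beta = np.ones(D), np.zeros(D)               # learnable scale and shift
W_s = np.random.randn(D, D) * 0.1
sublayer = lambda z: z @ W_s                        # placeholder sub-layer

y = layer_norm(x + sublayer(x), gamma, beta)        # residual connection, then LayerNorm
```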
---
## Position-wise Feedforward Networks
- Applied independently to each position in the sequence, allowing for non-linear transformations of the token representations
- Consists of two linear layers with a ReLU activation in between, enabling the model to learn complex feature interactions
$
\mathbf{y}(\mathbf{x}) = \mathrm{ReLU}(\mathbf{x} \mathbf{W}_1 + \mathbf{b}_1) \mathbf{W}_2 + \mathbf{b}_2
$
where $\mathbf{W}_1 \in \mathbb{R}^{D \times D_{ff}}$, $\mathbf{W}_2 \in \mathbb{R}^{D_{ff} \times D}$ are weight matrices, and $\mathbf{b}_1 \in \mathbb{R}^{D_{ff}}$, $\mathbf{b}_2 \in \mathbb{R}^{D}$ are bias vectors. $D_{ff}$ is the dimension of the feedforward layer, typically larger than the model dimension $D$.
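A NumPy sketch of the two-layer feedforward block (dimensions are illustrative):
```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    hidden = np.maximum(0.0, x @ W1 + b1)      # ReLU(x W1 + b1), applied per position
    return hidden @ W2 + b2                    # project back to the model dimension

T, D, D_ff = 8, 64, 256                        # D_ff is typically 4x the model dimension
x = np.random.randn(T, D)
W1, b1 = np.random.randn(D, D_ff) * 0.1, np.zeros(D_ff)
W2, b2 = np.random.randn(D_ff, D) * 0.1, np.zeros(D)
y = position_wise_ffn(x, W1, b1, W2, b2)       # (8, 64), same shape as the input
```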
---
## Key Components of Original Transformer
- Embedding layers convert discrete tokens to continuous vector representations
- Output projection maps decoder outputs back to vocabulary space for token prediction
- Positional encodings (sinusoidal or learned) provide sequence order information
- Multi-head self-attention captures dependencies between all tokens in parallel
- Masked multi-head self-attention prevents future token leakage in autoregressive generation
- Multi-head cross-attention (encoder-decoder) enables decoder to attend to encoder outputs
- Residual connections enable deep network training by providing direct gradient paths
- Layer normalization stabilizes training and reduces sensitivity to initialization
- Position-wise feedforward networks apply non-linear transformations to each token
---
## Summary of Original Transformer
**Key Advantages**
- **Parallelizable architecture** enables efficient training on large datasets by processing all tokens simultaneously
- **Long-range dependencies** captured through direct attention connections between any token pair
- **Scalability** to billions of parameters, forming the foundation for modern LLMs (BERT, GPT, LLaMA)
**Applications**
- **Transfer learning** through pretraining on large corpora followed by fine-tuning for specific tasks
- **Versatile across domains** including NLP, computer vision, and audio processing
---
## Decoder Only Example: GPT
- The GPT architecture is a decoder-only Transformer model that utilizes masked self-attention to generate text autoregressively
- Consists of multiple layers of masked multi-head self-attention followed by position-wise feedforward networks, with residual connections and layer normalization applied throughout
- GPT models are pretrained on large text corpora using a language modeling objective, learning to predict the next token in a sequence given the previous tokens
| Component | Parameters per Layer | Total Parameters | Calculation |
|---|---|---|---|
| Embedding | — | 0.6B | $V \times D = 50{,}257 \times 12{,}288 \approx 0.62\text{B}$ |
| Multi-Head Attention | 604M | 58.0B | $4 \times D^2 = 4 \times 12{,}288^2$ per layer (Q, K, V, and O projections, split into 96 heads) × 96 layers |
| Layer Normalization | 49K | 4.7M | $2 \times 2 \times D = 4 \times 12{,}288$ per layer (2 LayerNorms per layer, each with $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$) × 96 layers |
| Feedforward Network | 1.2B | 116.0B | $2 \times D \times D_{\text{ff}} = 2 \times 12{,}288 \times 49{,}152$ per layer × 96 layers |
| Output Projection | — | (shared) | Shares weights with the embedding layer |
| Total | — | ≈175B | 0.6B + 58.0B + 4.7M + 116.0B ≈ 174.6B |
Where: $D = 12{,}288$ (model dimension), $D_{\text{ff}} = 4D = 49{,}152$ (feedforward dimension), $V = 50{,}257$ (vocabulary size), 96 layers, 96 attention heads
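The breakdown can be reproduced with a few lines of Python (weight matrices only; biases and positional embeddings are ignored):
```python
V, D, D_ff, n_layers = 50_257, 12_288, 4 * 12_288, 96

embedding = V * D                      # ~0.62B, shared with the output projection
attention = 4 * D * D * n_layers       # Q, K, V, O projections: ~58.0B
layer_norm = 2 * 2 * D * n_layers      # 2 LayerNorms per layer, gamma and beta: ~4.7M
ffn = 2 * D * D_ff * n_layers          # two linear layers per block: ~116.0B

total = embedding + attention + layer_norm + ffn
print(f"{total / 1e9:.1f}B parameters")   # ~174.6B, close to the reported 175B
```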
---
## Encoder Only Example: BERT
- The BERT architecture is an encoder-only Transformer model that utilizes bidirectional self-attention to generate contextualized token representations
- Consists of multiple layers of multi-head self-attention followed by position-wise feedforward networks, with residual connections and layer normalization applied throughout
- BERT models are pretrained on large text corpora using a masked language modeling objective, learning to predict randomly masked tokens in a sequence based on their surrounding context
- Then they are fine-tuned for specific downstream tasks such as text classification or named entity recognition
---
## Encoder-Decoder Example: Original Transformer
- The original Transformer architecture consists of an encoder-decoder structure where the encoder processes the input sequence and the decoder generates the output sequence
- The encoder is composed of multiple layers of multi-head self-attention and position-wise feedforward networks, while the decoder includes masked multi-head self-attention, cross-attention to the encoder outputs, and position-wise feedforward networks
- This architecture is particularly effective for sequence-to-sequence tasks such as machine translation
---
## Continuous Encoder Example: Vision Transformer
- Vision Transformer (ViT) applies Transformer architecture to image classification by treating images as sequences of patches
- Input images are divided into fixed-size patches, flattened and linearly projected into continuous patch embeddings
- Patch embeddings are continuous vectors unlike discrete token embeddings in NLP
- A special [CLS] token is prepended to the sequence to aggregate information for classification
- Positional encodings preserve spatial information before processing through the Transformer encoder
- A classification head applied to the [CLS] token output performs the final classification
---
## Continuous Encoder-Decoder Example: pGESAM
- The pGESAM architecture is an encoder-decoder Transformer for continuous timbre and pitch embeddings
- The encoder processes timbre (2D float via linear projection) and pitch (1D via learned embedding) representations
- The decoder autoregressively generates audio codec tokens using masked self-attention
- Cross-attention conditions generation on the encoder's continuous timbre-pitch representations
Source: Limberg, C., Schulz, F., Zhang, Z., & Weinzierl, S. (2025). Pitch-Conditioned Instrument Sound Synthesis from an Interactive Timbre Latent Space. 28th International Conference on Digital Audio Effects (DAFx25), 1–8. https://dafx.de/paper-archive/2025/DAFx25_paper_58.pdf
---
# Python Implementation