Notes from the Wired

Attention is All You Need

December 1, 2023 | 1,124 words | 6 min read

Paper Title: Attention is All You Need
Link to Paper: https://arxiv.org/abs/1706.03762
Date: 12 June 2017
Paper Type: LLM, Architecture, Attention
Short Abstract:
“Attention is All You Need” is a landmark paper that helped usher in the modern era of AI. It proposes the Transformer architecture, which dispenses with recurrence and relies entirely on attention mechanisms that weight all tokens in a sequence by their relevance to the current token.

1. Introduction

Recurrent Neural Networks (RNNs) and later models like LSTMs have long dominated the domain of sequence modeling, such as language modeling and machine translation.

However, these models rely on a hidden state that the next time step depends on. This inherently sequential nature prevents significant speedups through parallelization.

In this work, the authors propose the Transformer architecture—a model without recurrence that relies entirely on global attention. This design enables significant parallelization and achieves state-of-the-art results in machine translation.

2. Model Architecture

Most sequence-to-sequence models follow an encoder-decoder structure. The encoder maps an input sequence \(x\) to a sequence of continuous representations \(z\). For example, it converts a sentence (a sequence of words) into a sequence of token vectors that the model can process.

Given the representations \(z\), the decoder then generates an output sequence \(y\). The Transformer also follows this architecture, with the encoder shown on the left side of Figure 1 in the paper and the decoder on the right.

2.1 Encoder and Decoder Stacks

Encoder

The encoder is a stack of \(N = 6\) identical layers. Each layer has two sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network. A residual connection is applied around each sub-layer, followed by layer normalization.

Decoder

The decoder is also a stack of \(N = 6\) identical layers. In addition to the two sub-layers found in each encoder layer, the decoder inserts a third sub-layer that performs multi-head attention over the encoder's output. Its self-attention sub-layer is masked so that predictions for a position can depend only on earlier positions.

2.2 Attention

The attention mechanism takes as input a query, key, and value. When applied across all tokens at once, these become the matrices \(Q\), \(K\), and \(V\); the queries and keys have dimension \(d_k\), and the values have dimension \(d_v\).

To compute attention, we take the dot product of the queries and keys, divide it by \(\sqrt{d_k}\), and apply the softmax function. The result is a weight matrix that we use to weight the values:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
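Below is a minimal NumPy sketch of the equation above; the shapes and values are illustrative only, not from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of each query to each key
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted sum of the values

# Toy example: 4 tokens, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```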

The matrices \(Q\), \(K\), and \(V\) in the first attention layer are computed as:

$$ Q = XW_Q \quad K = XW_K \quad V = XW_V $$

where the rows of \(X\) are the input embeddings and \(W_Q\), \(W_K\), \(W_V\) are learned projection matrices.

The interpretation of the attention mechanism is as follows: The query matrix can be understood as the model asking, for each input token (e.g., a word), “What should I pay attention to?” The key matrix provides the potential answers to that question. The dot product between the queries and keys measures the similarity between them—i.e., how relevant each key is to a given query.

In essence, each query specifies what it wants to attend to, the keys represent all available options, and the dot product tells us which keys (tokens) are most relevant. The resulting weights are then applied to the value matrix, which contains the actual content the model processes.

2.3 Multi-Head Attention

Instead of performing a single attention function, it turns out to be beneficial in practice to perform several in parallel. This is similar to how convolutional neural networks use multiple filters to detect different features: using multiple attention heads allows the model to attend to different aspects of the input.

For example, one attention head might focus on punctuation (e.g., commas and periods) to help understand syntax, while another might focus on named entities like people or places.

Multi-head attention works as follows: we perform the attention function several times in parallel, each run producing an output called a head. These heads are then concatenated and passed through a final linear layer. Together, this forms one multi-head attention layer, whose output can then be used as input for further layers.
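Formally, the paper defines this as:

$$ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O \quad \text{where } \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) $$

Each head has its own learned projections \(W_i^Q\), \(W_i^K\), \(W_i^V\), and \(W^O\) is the final linear layer.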

In this paper, the authors use \(h = 8\) parallel attention heads.
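Here is a minimal NumPy sketch of multi-head self-attention (queries, keys, and values all derived from the same input \(X\)); the random matrices stand in for learned weights. Projecting once with a \(d_{\text{model}} \times d_{\text{model}}\) matrix and splitting into heads is equivalent to the paper's per-head projections:

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, h=8):
    """X: (seq_len, d_model); all W_*: (d_model, d_model); d_k = d_model / h."""
    seq_len, d_model = X.shape
    d_k = d_model // h
    # Project once, then split the last dimension into h heads.
    Q = (X @ W_q).reshape(seq_len, h, d_k).transpose(1, 0, 2)    # (h, seq, d_k)
    K = (X @ W_k).reshape(seq_len, h, d_k).transpose(1, 0, 2)
    V = (X @ W_v).reshape(seq_len, h, d_k).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)             # (h, seq, seq)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                           # softmax per row
    heads = w @ V                                                # (h, seq, d_k)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate heads
    return concat @ W_o                                          # final linear layer

rng = np.random.default_rng(0)
d_model = 512
W_q, W_k, W_v, W_o = (0.02 * rng.normal(size=(d_model, d_model)) for _ in range(4))
X = rng.normal(size=(10, d_model))
print(multi_head_attention(X, W_q, W_k, W_v, W_o).shape)  # (10, 512)
```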

2.4 Applications of Attention in Our Model

Attention is used in three different ways:

1. Encoder-decoder attention: the queries come from the previous decoder layer, while the keys and values come from the encoder output. Every position in the decoder can attend over all positions in the input sequence.
2. Encoder self-attention: queries, keys, and values all come from the previous encoder layer, so each position in the encoder can attend to all positions in the previous layer.
3. Decoder self-attention: like encoder self-attention, but masked so that each position can only attend to earlier positions (and itself), preserving the autoregressive property; see the sketch after this list.
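To illustrate the masking in the third case, here is a small NumPy sketch with toy scores: positions after the current one are set to \(-\infty\), so the softmax assigns them zero weight:

```python
import numpy as np

seq_len = 4
scores = np.zeros((seq_len, seq_len))   # stand-in for Q K^T / sqrt(d_k)
# Mask out everything above the diagonal, i.e. all future positions.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
print(weights)  # row i spreads its weight over positions 0..i only
```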

2.5 Position-wise Feed-Forward Networks

In addition to the attention mechanism, each layer in the Transformer architecture includes a fully connected feed-forward network, which is applied separately to each position. This is computed as:

$$ \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 $$
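In the paper, \(d_{\text{model}} = 512\) and the inner layer has dimensionality \(d_{ff} = 2048\). A minimal NumPy sketch, with random stand-ins for the learned weights:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Two linear layers with a ReLU in between, applied to each position."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 512, 2048, 10
W1, b1 = 0.02 * rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = 0.02 * rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(seq_len, d_model))
print(ffn(x, W1, b1, W2, b2).shape)  # (10, 512) -- same shape in, same shape out
```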

2.6 Embeddings and Softmax

To convert sequences of input tokens into model-dimensional vectors, an embedding layer is used. At the output stage, the softmax function is applied to the decoder’s output to generate a probability distribution, which is then used to predict the next token in the sequence during generation.
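A rough NumPy sketch of both ends. One detail worth noting from the paper: the embedding matrix is shared with the pre-softmax linear transformation. Greedy decoding is shown here purely for simplicity (the paper uses beam search):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 512
E = rng.normal(size=(vocab_size, d_model))  # embedding table (random stand-in)

token_ids = np.array([3, 17, 42])
x = E[token_ids]                            # token ids -> model-dimensional vectors

# Output side: project the decoder output back to vocabulary logits, then softmax.
logits = x @ E.T                            # shares weights with the embedding table
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
next_token = probs[-1].argmax()             # greedy pick of the most likely next token
```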

2.7 Positional Encoding

Since the Transformer architecture does not rely on recurrence, it lacks built-in information about the order of tokens. To address this, the authors add positional encodings to the input embeddings at the bottom of both the encoder and decoder stacks.

These encodings are defined as:

$$ PE_{\text{pos}, 2i} = \sin\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right) \\ PE_{\text{pos}, 2i+1} = \cos\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right) $$

where \(\text{pos}\) is the position in the sequence and \(i\) is the dimension index.
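A short NumPy sketch of the formulas above (the max_len and d_model values are just examples):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal encodings: sin on even dimensions, cos on odd ones."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]  # even dimension indices 2i
    angle = pos / (10000 ** (two_i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512) -- added elementwise to the input embeddings
```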

3. Why Self-Attention?

There are three main advantages to using self-attention:

1. Computational complexity per layer: a self-attention layer connects all positions with a constant number of sequentially executed operations, and it is faster than a recurrent layer whenever the sequence length is smaller than the representation dimensionality, which is usually the case.
2. Parallelization: because there is no recurrence, the amount of computation that must happen sequentially is minimal; the entire sequence can be processed at once.
3. Path length: the maximum path length between any two positions is constant, whereas it grows with distance in recurrent and convolutional architectures, making long-range dependencies easier to learn.

4. Experiments

4.1 Training and Inference

Training Details

The models were trained on the WMT 2014 English-German dataset (about 4.5 million sentence pairs) and the larger WMT 2014 English-French dataset, on a single machine with 8 NVIDIA P100 GPUs. The base model trained for 100,000 steps (about 12 hours); the big model trained for 300,000 steps (about 3.5 days). The authors used the Adam optimizer with a learning rate that increases linearly during a warmup phase and then decays, plus residual dropout and label smoothing for regularization.

Inference

For inference, the authors used beam search with a beam size of 4 and a length penalty of \(\alpha = 0.6\).

4.2 Results

On WMT 2014 English-to-German translation, the big Transformer achieved 28.4 BLEU, surpassing all previously reported models and ensembles. On English-to-French, it achieved a single-model state-of-the-art score of 41.8 BLEU, while training at a fraction of the cost of the previous best models.

5. Conclusion

The Transformer dispenses with recurrence entirely, relying on attention alone. This makes training highly parallelizable and yields state-of-the-art translation quality at a lower training cost. In short: Transformers are great!

Reply via email