Decoding the Transformer Model: Architecture, Loss Function, and Inference from the ‘Attention is All You Need’ Paper

Praveen
5 min read · Aug 18, 2024


Eiffel Tower caught on my iPhone, quality reduced a million times to fit on this page

The “Attention is All You Need” paper by Vaswani et al. revolutionized the field of natural language processing (NLP) and machine learning with its introduction of the Transformer model. Unlike previous neural network models that relied on recurrence, the Transformer leverages self-attention mechanisms to improve efficiency and performance. In this blog post, we’ll delve into the intricate details of the Transformer’s architecture, the creation of Query, Key, and Value, the loss function, and what happens during inference time.

The Transformer Architecture:

The Transformer model consists of two main components: the encoder and the decoder. Both are designed to process sequences of words (or tokens) efficiently using multi-head self-attention mechanisms.

Transformer Architecture from Original Paper

1. Encoder:

The encoder processes the input sequence and converts it into a context-rich representation. The encoder itself is made up of a stack of identical layers, each containing two main components:

a) Multi-Head Self-Attention Mechanism:

The self-attention mechanism calculates attention scores for each word in a sequence with respect to every other word in the sequence. This allows the model to weigh the importance of each word relative to the others.

Creation of Query, Key, and Value:

Each word (or token) in the input sequence is first embedded into a continuous vector space. These embeddings are then linearly transformed to generate three different vectors: Query (Q), Key (K), and Value (V).

Mathematically, this can be represented as:

Q = XW^Q,  K = XW^K,  V = XW^V

where:

  • X is the input embedding matrix,
  • W^Q, W^K, W^V are weight matrices learned during the training process.
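To make this concrete, here is a minimal NumPy sketch of the three projections applied to a toy embedding matrix (the dimensions d_model = 512 and d_k = 64 follow the original paper; the random matrices simply stand in for learned weights):

import numpy as np

d_model, d_k = 512, 64                 # model and key/query dimensions from the paper
seq_len = 5                            # toy sequence length

X = np.random.randn(seq_len, d_model)  # input embeddings, one row per token

# Learned projection matrices (random values here, purely for illustration)
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

Q = X @ W_Q   # queries, shape (seq_len, d_k)
K = X @ W_K   # keys,    shape (seq_len, d_k)
V = X @ W_V   # values,  shape (seq_len, d_k)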

The self-attention mechanism then operates as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

where:

  • Q (Query) is the matrix of query vectors,
  • K (Key) is the matrix of key vectors,
  • V (Value) is the matrix of value vectors,
  • d_k is the dimension of the key vectors.
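Translating the formula directly into code, a bare-bones scaled dot-product attention might look like the sketch below (the softmax helper and the optional mask argument are my own additions for illustration; the same function is reused later by the decoder’s masked and encoder-decoder attention):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (seq_len_q, seq_len_k) similarity scores
    if mask is not None:
        scores = np.where(mask, scores, -1e9) # masked positions get ~zero attention weight
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted sum of the value vectors

Multi-head attention simply runs h = 8 such attention functions in parallel on different learned projections of Q, K, and V and concatenates the results.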

b) Position-Wise Feed-Forward Network:

This component processes the output of the self-attention mechanism for each position independently through two linear transformations with a ReLU activation in between. This introduces non-linearity and further refines the word representations.
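A minimal sketch of this feed-forward block, assuming the paper’s FFN(x) = max(0, xW1 + b1)W2 + b2 formulation with an inner dimension of d_ff = 2048, could look like:

import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # Two linear transformations with a ReLU in between, applied to every position independently.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048
W1, b1 = 0.01 * np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = 0.01 * np.random.randn(d_ff, d_model), np.zeros(d_model)
out = position_wise_ffn(np.random.randn(5, d_model), W1, b1, W2, b2)   # shape (5, d_model)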

2. Decoder:

The decoder generates the output sequence by attending (focusing) on the relevant parts of the input sequence (using the encoded representations). Like the encoder, it has a stack of identical layers, each containing:

a) Masked Multi-Head Self-Attention Mechanism:

This mechanism is similar to the encoder’s self-attention but is masked to prevent the decoder from “cheating” by looking at the future words in the sequence during training.
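A common way to implement this masking is a lower-triangular (causal) matrix, so that position i can only attend to positions up to and including i. Here is a small illustrative sketch, which can be passed as the mask argument of the attention function sketched earlier:

import numpy as np

seq_len = 5
# Lower-triangular boolean matrix: position i may attend only to positions j <= i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
print(causal_mask.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]

Masked positions have their scores replaced by a large negative number before the softmax, so their attention weights become effectively zero.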

b) Encoder-Decoder Attention:

This layer allows the decoder to attend to the encoder’s output representations, effectively aligning the input and output sequences.

c) Position-Wise Feed-Forward Network:

Similar to the encoder, this network further processes the attention outputs.

Loss Function:

The “Attention is All You Need” paper utilizes a variant of the cross-entropy loss known as label smoothing cross-entropy. Traditional cross-entropy loss can sometimes lead to overconfident predictions, which in turn might make the model less robust. Label smoothing helps mitigate this issue by smoothing the target labels.

Label Smoothing: With label smoothing, the hard one-hot encoded vectors representing true labels are replaced by a smoothed version, which assigns a little bit of probability mass to all other classes.

Mathematically, if the original one-hot encoded target is y and ε is the smoothing factor, the new smoothed target y′ is given by:

y′ = (1 − ε) · y + ε / K

where K is the number of classes.

By employing label smoothing, the loss becomes:

Loss = − Σ_i y′_i · log(p_i)

where p_i are the predicted probabilities and the sum runs over all K classes.
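As a rough illustration, the smoothed target and the resulting loss for a single prediction might be computed as follows (the function name is my own; the paper uses a smoothing value of ε = 0.1):

import numpy as np

def label_smoothed_cross_entropy(p, target_index, eps=0.1):
    # p: predicted probabilities over K classes (non-negative, summing to 1)
    # target_index: index of the true class
    # eps: smoothing factor (0.1 in the paper)
    K = p.shape[-1]
    y_smooth = np.full(K, eps / K)        # spread a little probability mass over every class
    y_smooth[target_index] += 1.0 - eps   # keep most of the mass on the true class
    return -np.sum(y_smooth * np.log(p + 1e-12))

probs = np.array([0.7, 0.1, 0.1, 0.1])    # toy model output over K = 4 classes
loss = label_smoothed_cross_entropy(probs, target_index=0)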

Inference Time:

At inference time, the Transformer model goes through a slightly different process than during training, especially in the decoding phase.

Inference Process:

Step 1: Encoding the Input Sequence

  • Just as during training, the encoder processes the entire input sequence. Each word is embedded and transformed into the Query, Key, and Value vectors. The multi-head self-attention mechanism is applied, followed by the feed-forward network, to produce context-rich representations for each word in the input sequence.

Step 2: Decoding the Output Sequence

  • Unlike training, where the entire output sequence is available, inference is typically done one step at a time (autoregressive generation). Initially, the decoder takes a start token as input.
  • The decoder processes its input with masked self-attention, ensuring it only attends to previously generated tokens.
  • The encoder-decoder attention mechanism is utilized to align the decoder’s input with the encoder’s output representations.

Step 3: Generating the Next Token

  • At each time step, the decoder generates a probability distribution over the vocabulary for the next token using the softmax layer.
  • The token with the highest probability is chosen (greedy decoding) and added to the output sequence; alternatively, techniques like beam search allow more sophisticated selection.

Step 4: Iterative Process

  • The newly generated token is fed back into the decoder as input for the next time step.
  • The process repeats until a special end-of-sequence token is produced or a specified maximum length is reached (a minimal sketch of this loop follows below).
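Putting steps 1–4 together, the whole inference loop can be summarized in a short greedy-decoding sketch. Note that model.encode, model.decode, bos_id, and eos_id are hypothetical stand-ins for whatever concrete implementation and special-token ids are used:

def greedy_decode(model, src_tokens, bos_id, eos_id, max_len=50):
    memory = model.encode(src_tokens)         # Step 1: encode the source sequence once
    output = [bos_id]                         # Step 2: start with the start token
    for _ in range(max_len):
        probs = model.decode(output, memory)  # distribution over the vocabulary for the next token
        next_token = int(probs.argmax())      # Step 3: greedily pick the most likely token
        output.append(next_token)             # Step 4: feed it back in at the next time step
        if next_token == eos_id:              # stop at the end-of-sequence token
            break
    return output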

Conclusion:

The “Attention is All You Need” paper introduced the Transformer model, setting new benchmarks in NLP tasks, especially in machine translation. With its unique architecture featuring multi-head self-attention, parallelizable stacks of encoders and decoders, and robust label smoothing cross-entropy loss function, the Transformer model delivers unparalleled efficiency and effectiveness. Furthermore, its autoregressive decoding process ensures that it generates coherent and contextually accurate sequences during inference. The Transformer’s success has not only transformed machine translation but has paved the way for advancements in many other NLP applications.

Feel free to connect here: Praveen Kumar | LinkedIn


Written by Praveen

AI Enthusiast; Machine Learning Developer - Automotive Radar, Magna | Ex-Mercedes-Benz AG
