Transformers in NLP

Pranik Chainani
7 min read · Dec 20, 2021

Providing an intuition behind attention-based models

First introduced in the renowned paper Attention Is All You Need by Vaswani et al., Transformers have become the state of the art for many tasks in natural language processing and for sequence modeling as a whole. In fact, recent experiments have shown that Transformers also generalize well to computer vision tasks (consider An Image is Worth 16x16 Words). As such, it is worth exploring attention-based models in detail, given how well they extend to numerous domains in Machine Learning.

To start, as the title of “Attention is All You Need” suggests, Transformers are grounded in attention mechanisms. Attention, as Vaswani et al. put it, is “an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences.”

In other words, attention mechanisms can model relationships between elements of a sequence regardless of how far apart those elements are.

In fact, Transformers use a specific type of attention mechanism referred to as multi-head attention. To understand it, we must first introduce the simpler scaled dot-product attention scheme.

Let us start with scaled dot-product attention. Simply, we can express this form of attention as:
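
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V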

Here, Q (queries), K (keys), and V (values) are batches of matrices, each with shape (batch_size, length_of_sequence, num_of_features). The matrix product of the query Q with the transposed key K results in a matrix of shape (batch_size, length_of_sequence, length_of_sequence). We can interpret this new matrix as telling us roughly how much each element of the sequence should attend to every other element. This multiplication is the core of the attention layer, as it essentially determines which elements we “pay attention” to.

The products are first scaled by √d_k (the number of features per query and key) to keep them from growing too large, and the attention matrix is then normalized by the softmax nonlinearity so that each row of weights sums to one. Finally, we multiply the attention weights by the values V to obtain the desired output.

We can observe how simple it is to implement this form of attention below:

from torch import Tensor
import torch.nn.functional as f

def scaled_dot_product_attention(query: Tensor, key: Tensor, value: Tensor) -> Tensor:
    # bmm handles the batch dimension; we only transpose the last two dims of the key
    temp = query.bmm(key.transpose(1, 2))
    scale = query.size(-1) ** 0.5
    softmax = f.softmax(temp / scale, dim=-1)
    return softmax.bmm(value)
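
As a quick sanity check (with hypothetical batch and sequence sizes), we can pass random tensors through the function and confirm the output keeps the (batch_size, length_of_sequence, num_of_features) shape:

import torch

query = key = value = torch.rand(64, 16, 512)   # (batch_size, length_of_sequence, num_of_features)
out = scaled_dot_product_attention(query, key, value)
print(out.shape)                                 # torch.Size([64, 16, 512])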

Now that we have a decent idea of how scaled dot-product attention works, we simply incorporate this dot-product attention scheme as shown in the diagram below to construct our multi-head attention layer.

From the Attention is All You Need paper (where h is the number of heads)

Namely, we observe that multi-head attention is composed of several identical attention heads, where each attention head contains three linear layers (one each for the query, key, and value), followed by the scaled dot-product attention we just implemented. The outputs of all heads are concatenated and passed through a final linear layer. We can implement this with a class structure as follows:

import torch
from torch import Tensor
from torch import nn
import torch.nn.functional as f

def scaled_dot_product_attention(query: Tensor, key: Tensor, value: Tensor) -> Tensor:
    temp = query.bmm(key.transpose(1, 2))
    scale = query.size(-1) ** 0.5
    softmax = f.softmax(temp / scale, dim=-1)
    return softmax.bmm(value)

class AttentionHead(nn.Module):
    def __init__(self, dim_in, dim_k, dim_v):
        # dim_in is the number of input features (the model dimension);
        # dim_k and dim_v are the query/key and value dimensions of this head
        super().__init__()
        self.q = nn.Linear(dim_in, dim_k)
        self.k = nn.Linear(dim_in, dim_k)
        self.v = nn.Linear(dim_in, dim_v)

    def forward(self, query, key, value):
        return scaled_dot_product_attention(self.q(query), self.k(key), self.v(value))

class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, dim_in, dim_k, dim_v):
        super().__init__()
        self.heads = nn.ModuleList(
            [AttentionHead(dim_in, dim_k, dim_v) for _ in range(num_heads)]
        )
        self.linear = nn.Linear(num_heads * dim_v, dim_in)

    def forward(self, query, key, value):
        return self.linear(
            torch.cat([h(query, key, value) for h in self.heads], dim=-1)
        )
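
Again, a quick shape check (hypothetical sizes; with dim_model = 512 and 8 heads, dim_k = dim_v = 64 as in the paper):

attention = MultiHeadAttention(num_heads=8, dim_in=512, dim_k=64, dim_v=64)
x = torch.rand(64, 16, 512)
print(attention(x, x, x).shape)   # torch.Size([64, 16, 512])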

Thus, to recap, each attention head in our multi-head attention scheme computes its own query, key, and value projections, and then applies scaled dot-product attention to them.

We can interpret this intuitively: each head can attend to a different part of the input sequence, independently of the others. Increasing the number of attention heads therefore lets the model “pay attention” to more parts of the input sequence at once, which makes it more expressive.

Interestingly, it is important to note that our multi-head attention framework has no trainable components that operate over the sequence dimension. Everything instead operates over the feature dimension and is thus independent of sequence length. As such, we must provide positional information to our model, so that it knows about the relative positions of the data points in the input sequence.

The way to go about this is as follows:
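
The paper uses sinusoidal position encodings of different frequencies:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

Below is a minimal sketch of this in PyTorch (one common implementation, written to match the position_encoding(seq_len, dim_model) call used in the encoder and decoder further down):

import math
import torch

def position_encoding(seq_len, dim_model):
    position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)   # (seq_len, 1)
    div_term = torch.exp(
        torch.arange(0, dim_model, 2).float() * (-math.log(10000.0) / dim_model)
    )                                                                  # (dim_model / 2,)
    pe = torch.zeros(seq_len, dim_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even feature indices
    pe[:, 1::2] = torch.cos(position * div_term)   # odd feature indices
    return pe.unsqueeze(0)                         # (1, seq_len, dim_model), broadcasts over the batch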

The use of these seemingly unusual sinusoidal encodings allows the model to extrapolate better to longer sequence lengths. The trigonometric encodings are periodic and bounded in [-1, 1], and thus behave nicely. To see why this matters, suppose that during inference we provide an input sequence longer than any used during training. The positional encoding for the last elements of that sequence might be different from anything the model has encountered during training, yet the sinusoidal encodings still vary smoothly, allowing the learned model to extrapolate to sequence lengths longer than the ones seen before.

Now we can move on to constructing our Transformer model. We start by observing a diagram of the full scheme:

At first glance, we see that the Transformer uses an encoder-decoder architecture. The encoder (left) processes an input sequence and returns a sequence of feature (latent) vectors, often called the memory. The decoder (right) processes the target sequence while incorporating information from the encoder memory. The decoder's output is the model's prediction.

We will first start by writing up our encoder layer before we move on to the decoder.

def feed_forward(dim_input=512, dim_feedforward=2048):
    return nn.Sequential(
        nn.Linear(dim_input, dim_feedforward),
        nn.ReLU(),
        nn.Linear(dim_feedforward, dim_input),
    )

class Residual(nn.Module):
    def __init__(self, sublayer: nn.Module, dimension, dropout=0.1):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(dimension)
        self.dropout = nn.Dropout(dropout)

    def forward(self, *tensors):
        # We assume the query tensor is given first, so we can use it for the
        # residual connection (e.g. (src, src, src) in the encoder below).
        return self.norm(tensors[0] + self.dropout(self.sublayer(*tensors)))

Above, we implemented a simple feed-forward network and a residual block that we will reuse throughout our Transformer model (consider reading about residual blocks in ResNet for more background).
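
As a small (hypothetical) check, wrapping the feed-forward network in a residual block preserves the input shape:

residual_ff = Residual(feed_forward(512, 2048), dimension=512, dropout=0.1)
x = torch.rand(32, 16, 512)
print(residual_ff(x).shape)   # torch.Size([32, 16, 512])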

Now, to create our encoder, we simply combine these utility modules, following the diagram introduced above:

class TransformerEncoderLayer(nn.Module):
    def __init__(
        self,
        dim_model=512,
        num_heads=8,
        dim_feedforward=2048,
        dropout=0.1,
    ):
        super().__init__()
        dim_k = dim_v = dim_model // num_heads
        self.attention = Residual(
            MultiHeadAttention(num_heads, dim_model, dim_k, dim_v),
            dimension=dim_model,
            dropout=dropout,
        )
        self.feed_forward = Residual(
            feed_forward(dim_model, dim_feedforward),
            dimension=dim_model,
            dropout=dropout,
        )

    def forward(self, src):
        src = self.attention(src, src, src)  # Q, K, V all come from the source sequence
        return self.feed_forward(src)


class TransformerEncoder(nn.Module):
    def __init__(
        self,
        num_layers=6,
        dim_model=512,
        num_heads=8,
        dim_feedforward=2048,
        dropout=0.1,
    ):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerEncoderLayer(dim_model, num_heads, dim_feedforward, dropout)
            for _ in range(num_layers)
        ])

    def forward(self, src):
        seq_len, dimension = src.size(1), src.size(2)
        src = src + position_encoding(seq_len, dimension)
        for layer in self.layers:
            src = layer(src)
        return src
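
A quick (hypothetical) forward pass confirms the encoder preserves the (batch, seq_len, dim_model) shape:

encoder = TransformerEncoder(num_layers=6, dim_model=512, num_heads=8)
src = torch.rand(32, 16, 512)
memory = encoder(src)
print(memory.shape)   # torch.Size([32, 16, 512])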

The decoder class follows in a similar manner. It is important to note, however, that the decoder accepts two arguments: the target sequence and the memory returned by the encoder. Furthermore, the scheme introduced by Vaswani et al. uses two multi-head attention modules per decoder layer instead of one: the first attends over the target sequence itself, and the second attends over the encoder memory.

Observe our implementation below:

class TransformerDecoderLayer(nn.Module):
    def __init__(
        self,
        dim_model=512,
        num_heads=8,
        dim_feedforward=2048,
        dropout=0.1,
    ):
        super().__init__()
        dim_k = dim_v = dim_model // num_heads
        self.attention_1 = Residual(
            MultiHeadAttention(num_heads, dim_model, dim_k, dim_v),
            dimension=dim_model,
            dropout=dropout,
        )
        self.attention_2 = Residual(
            MultiHeadAttention(num_heads, dim_model, dim_k, dim_v),
            dimension=dim_model,
            dropout=dropout,
        )
        self.feed_forward = Residual(
            feed_forward(dim_model, dim_feedforward),
            dimension=dim_model,
            dropout=dropout,
        )

    def forward(self, tgt, memory):
        # self-attention over the target sequence
        tgt = self.attention_1(tgt, tgt, tgt)
        # cross-attention: queries come from the decoder, keys/values from the encoder memory
        tgt = self.attention_2(tgt, memory, memory)
        return self.feed_forward(tgt)


class TransformerDecoder(nn.Module):
    def __init__(
        self,
        num_layers: int = 6,
        dim_model: int = 512,
        num_heads: int = 8,
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerDecoderLayer(dim_model, num_heads, dim_feedforward, dropout)
            for _ in range(num_layers)
        ])
        self.linear = nn.Linear(dim_model, dim_model)

    def forward(self, tgt, memory):
        seq_len, dimension = tgt.size(1), tgt.size(2)
        tgt = tgt + position_encoding(seq_len, dimension)
        for layer in self.layers:
            tgt = layer(tgt, memory)
        return torch.softmax(self.linear(tgt), dim=-1)
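
A quick (hypothetical) check, using a target sequence of a different length than the source, shows that cross-attention handles the two lengths independently:

decoder = TransformerDecoder(num_layers=6, dim_model=512, num_heads=8)
memory = torch.rand(32, 16, 512)   # encoder output for a source of length 16
tgt = torch.rand(32, 10, 512)      # target sequence of length 10
print(decoder(tgt, memory).shape)  # torch.Size([32, 10, 512])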

Finally, we combine everything into a single Transformer class as follows:

class Transformer(nn.Module):
    def __init__(
        self,
        num_encoder_layers=6,
        num_decoder_layers=6,
        dim_model=512,
        num_heads=8,
        dim_feedforward=2048,
        dropout=0.1,
        activation=nn.ReLU(),
    ):
        super().__init__()
        self.encoder = TransformerEncoder(
            num_layers=num_encoder_layers,
            dim_model=dim_model,
            num_heads=num_heads,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
        )
        self.decoder = TransformerDecoder(
            num_layers=num_decoder_layers,
            dim_model=dim_model,
            num_heads=num_heads,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
        )

    def forward(self, src, tgt):
        return self.decoder(tgt, self.encoder(src))
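
Finally, a quick end-to-end (hypothetical) forward pass with random tensors:

model = Transformer()
src = torch.rand(32, 16, 512)   # source: batch of 32, length 16
tgt = torch.rand(32, 10, 512)   # target: batch of 32, length 10
out = model(src, tgt)
print(out.shape)                # torch.Size([32, 10, 512])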

To conclude, we have walked through a simple, intuitive explanation of a powerful family of neural networks known as Transformers. Particularly in NLP, Transformers do not rely on past hidden states to capture dependencies on previous words: they process a sentence as a whole, with no risk of losing (or “forgetting”) past information, as is the case with many RNN models. Moreover, because we incorporate a multi-head attention scheme and positional encodings, we can capture relationships between different words that are not easily captured by standard recurrent or Markov-based models.
