But how exactly do transformers work?

Attention is all you need!

“India has a very good relation with Russia. It has bought a lot of the latest technologies from them. Japan also has a good relationship, but no country has replaced the former one.”

The above lines can cause a lot of confusion to any machine learning model when it comes to the following words: “IT”, “THEM”, “FORMER”. There are three countries mentioned in the paragraph, and models get confused about which country each of these words refers to.

But are we confused while reading it? If we read carefully and attentively, it becomes very clear which countries the paragraph is talking about. But if we read just for the sake of reading, we may get confused as well. The key to understanding here is the word emphasised above: ATTENTION.

It is the self-attention layer that understands the association of these words with the correct noun in the sentence. The self-attention layer looks at the encoding of a word and compares it with the encodings of the other words. It then works out whether the encoding can be improved so that the context of the word, and of the entire sentence, is understood better. This is the theme of self-attention.

Why Transformer?

Sequence to Sequence with RNNs is great, with attention it’s even better. Then what’s so great about Transformers?

The main issue with RNNs lies in their inability to parallelise processing. RNN processing is sequential, i.e. we cannot compute the value of the next timestep unless we have the output of the current one. This makes RNN-based approaches slow.

RNN:

Advantages:

  • RNNs are popular and successful for variable-length representations such as sequences (e.g. language) and images; they are considered the core of seq2seq models (with attention).
  • Gating mechanisms such as LSTM and GRU help with long-range error propagation.

Problems:

  • Their sequentiality prohibits parallelisation within instances.
  • Long-range dependencies are still tricky, despite gating.
  • Sequence-aligned states in RNNs are wasteful, and it is hard to model hierarchical domains such as language.

Attention? Self-Attention!

This is the Transformer architecture. The encircled encoder block has two sublayers: one is multi-headed attention, the other is a feed-forward network.
This is what happens inside the multi-headed attention block.
Multi-headed attention: each head tries to attend to a different property. Here h = 3.

The first step in this method is to derive three different vectors from the embedding of each word. These vectors are termed:

  • Query Vector
  • Key Vector
  • Value Vector

To create these vectors, we first take three different matrices learnt by the model and then multiply the actual word embedding with each of these three matrices to get the vectors listed above. These new vectors have a smaller dimension than the original word embeddings (word embeddings generally have a dimension of 512; in that case these new vectors have a dimension of 64 each, as in the research paper). The figure given below shows two words, ‘Thinking’ and ‘Machines’. We get the three vectors by multiplying the word vectors X_1 and X_2 with the weight matrices W_Q, W_K and W_V.

This is how the Query, Key and Value vectors are generated. Notice that the input is the sum of the word embedding and the positional encoding.
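As a minimal sketch of this step (plain NumPy, with random matrices standing in for the learnt weights W_Q, W_K, W_V), the three projections are just matrix multiplications:

```python
import numpy as np

# Dimensions as in the paper: 512-dim embeddings projected down to 64-dim Q/K/V vectors
d_model, d_k = 512, 64

rng = np.random.default_rng(0)
X = rng.normal(size=(2, d_model))        # embeddings of two words, e.g. "Thinking" and "Machines"

W_Q = rng.normal(size=(d_model, d_k))    # weight matrices; learnt during training, random here
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # query vectors, shape (2, 64)
K = X @ W_K   # key vectors,   shape (2, 64)
V = X @ W_V   # value vectors, shape (2, 64)
```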

If we look at a sentence, each word will now have four vectors — Word Embeddings Vector, Query Vector, Key Vector, Value Vector.

Using these vectors, we will score every word against the word in question. Suppose we are trying to find the context of ‘IT’. Every word present in the sentence will be compared with this word and a score will be calculated.

This score helps us decide how much importance must be given to the different words when we are trying to understand the context of a single word. Now, let us look at how this score is calculated.

Score = Query Vector · Key Vector = q · k (a dot product between the query vector of the word in question and the key vector of each word in the sentence)

To understand this more clearly, suppose we have three words, each with its own query, key and value vectors.

If we want to find the context of the word ‘Coding’, then that is the word of interest for us. The scores are therefore calculated as below:

As we are working with a neural network architecture, we will have to compute gradients so that the loss can be minimised and we can reach a good minimum. Therefore, instead of using the above scores directly, we normalise them to get more stable gradients. This is done by dividing the scores by the square root of the key-vector dimension. Since that dimension is 64, the normalisation factor is √64 = 8.

To turn these scores into positive values that sum to 1, i.e. to give them a probabilistic range of (0, 1), we pass them through the softmax function.

Now, till here we have only used the Q and K vectors. It is time to use the V vector. We multiply each word’s value vector by its softmax weight. This keeps the relevant words in the sentence almost as they are, while the contribution of the irrelevant words is scaled down.

Finally, we add up all the weighted value vectors for each word to get the outputs of the self-attention layer. This entire process is explained in the figure below.
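Putting the scoring, scaling, softmax and weighted sum together, a minimal sketch of the whole self-attention computation (continuing the NumPy example above) looks like this:

```python
import numpy as np

def self_attention(Q, K, V):
    """Scaled dot-product attention: score, scale by sqrt(d_k), softmax, weighted sum."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # compare every query with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ V                                         # weighted sum of the value vectors

Z = self_attention(Q, K, V)   # one output (context) vector per word, shape (2, 64)
```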

Multi-headed attention

This process of self-attention has been further improved by introducing the concept of multi-headed attention. In self-attention, even though each output vector takes some part of every word into account, the process can still get biased towards the main word itself. To sort out this problem, instead of having only one set of Q/K/V matrices, multi-headed attention has multiple sets. In the research paper the Transformer has 8 heads, therefore we have 8 sets of Q/K/V matrices.

All we have to do is repeat the process above 8 times, once per head. Therefore, we get 8 different Z matrices, as shown below.

Attention filters (just for representation). Here there are only 3, but in our case there would be 8 different Z matrices.

If we summarise the process till now,

  • We got a sentence.
  • We converted it into word embeddings.
  • We applied the attention process to get a Z matrix.
  • We repeated the process 7 more times to finally get 8 different Z matrices.

Now that we are sure that the context of a sentence will be understood by the model, we send the output, the 8 Z matrices, to the next layer: the feed-forward layer. But this layer accepts only a single matrix (like the word embeddings), not 8 different matrices, so it cannot process them as they are. What do we do? Obviously, we have to condense the 8 matrices into one single matrix. But how?

For this, the encoder learns one more weight matrix, W_O, during training and uses it to condense the Z matrices. We first concatenate all the Z matrices and then multiply the result with W_O. We get one single matrix as the output, which can be given to the feed-forward network. The process is shown below:
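Continuing the same sketch, the multi-headed version runs the attention computation once per head and then condenses the concatenated outputs with a learnt matrix (called W_O in the paper; random here):

```python
h = 8
heads = []
for _ in range(h):
    # each head gets its own (learnt) projection matrices
    W_Q = rng.normal(size=(d_model, d_k))
    W_K = rng.normal(size=(d_model, d_k))
    W_V = rng.normal(size=(d_model, d_k))
    heads.append(self_attention(X @ W_Q, X @ W_K, X @ W_V))   # each head output: (2, 64)

Z_concat = np.concatenate(heads, axis=-1)    # concatenate the 8 Z matrices -> (2, 512)
W_O = rng.normal(size=(h * d_k, d_model))    # learnt output projection
Z_out = Z_concat @ W_O                       # single (2, 512) matrix for the feed-forward layer
```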

Now, this output goes through the feed-forward layer, and the result is passed on to the next encoder in the stack, where the whole process is repeated.

Till now we have understood the relevancy of the words, but how do we capture the positions of these words? The simple logic followed here is to create a new vector with exactly the same size as the word embedding and add it to the original word embedding. What are the values inside this vector? In the original paper they are computed with fixed sine and cosine functions (the formula is given below); learned positional embeddings work comparably well. What is this vector called? The Positional Encoding Vector.

Positional Encoding Vector

Given below is how this vector may look:

The formula used to compute these encodings is given below:
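From the original paper, the sine is used for the even embedding dimensions and the cosine for the odd ones:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))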

pos defines the position, while d_model is the word embedding dimension.

Here is an awesome recent Youtube video that covers position embeddings in great depth, with beautiful animations:

Visual Guide to Transformer Neural Networks — (Part 1) Position Embeddings

Taking excerpts from the video, let us try understanding the “sin” part of the formula to compute the position embeddings:

Here “pos” refers to the position of the “word” in the sequence. P0 refers to the position embedding of the first word; “d” means the size of the word/token embedding. In this example d=5. Finally, “i” refers to each of the 5 individual dimensions of the embedding (i.e. 0, 1, 2, 3, 4).

While “d” is fixed, “pos” and “i” vary. Let us try to understand the latter two.

“pos”

If we plot a sine curve and vary “pos” along the x-axis, we end up with different position values on the y-axis. Therefore, words at different positions will have different position embedding values.

There is a problem though. Since the sine curve repeats in intervals, you can see in the figure above that P0 and P6 have the same position embedding values, despite being at two very different positions. This is where the “i” part of the equation comes into play.

“i”

If you vary “i” in the equation above, you get a bunch of curves with varying frequencies. Reading off the position embedding values against these different frequencies ends up giving different values at different embedding dimensions for P0 and P6.
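To make this concrete, here is a minimal NumPy sketch of the formula above (assuming an even embedding size for simplicity). Because each dimension oscillates at a different frequency, P0 and P6 come out clearly different even where a single curve would make them look alike:

```python
import numpy as np

def positional_encoding(pos, d):
    """Sinusoidal position embedding for one position (d assumed even for simplicity)."""
    i = np.arange(d // 2)
    angles = pos / np.power(10000, (2 * i) / d)
    pe = np.empty(d)
    pe[0::2] = np.sin(angles)   # even dimensions use sin
    pe[1::2] = np.cos(angles)   # odd dimensions use cos
    return pe

print(positional_encoding(0, 6))   # P0
print(positional_encoding(6, 6))   # P6 differs from P0 across the dimensions
```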

This entire process that we have seen inside one single encoder can be summarized with the architecture given below,

Just after the self-attention layer and the feed-forward layer, you’ll find an Add & Normalize layer. This is the residual connection in the architecture: as the name suggests, the input of each sublayer is added to that sublayer’s output and the sum is then layer-normalized. In the figure above, z1 is obtained after concatenating (and projecting) the outputs of the 8 heads, in case that was causing any confusion; the same goes for z2.
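A minimal sketch of what “Add & Normalize” does, assuming plain layer normalisation over the feature dimension:

```python
import numpy as np

def add_and_norm(x, sublayer_output, eps=1e-6):
    """Residual connection followed by layer normalisation: LayerNorm(x + Sublayer(x))."""
    y = x + sublayer_output                    # add the sublayer's input back to its output
    mean = y.mean(axis=-1, keepdims=True)
    std = y.std(axis=-1, keepdims=True)
    return (y - mean) / (std + eps)            # normalise each position's vector
```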

DECODERS

Now that we have seen the operations of an encoder and looked at its architecture, it’s time to move forward and understand what exactly a decoder does, and how it does so. If you look at the architecture of a decoder, it looks somewhat similar to that of an encoder. Given below is the architecture, which I will explain shortly.

When I discussed encoders, we saw that each of them gives a concatenated Z matrix as output, which is then multiplied with a trained weight matrix. We also know that the Z matrix is derived from the query, key and value vectors in the self-attention layer. I am reminding you of these steps because, in the next step, the output of the last encoder is again converted into a set of key and value vectors, which are then sent to the encoder-decoder attention layer of every decoder. Here, the decoder tries to focus on all the important positions present in the input.

One more catch in the decoder is that its self-attention layer is not allowed to see the future positions of any word; it can only look at the output sequence of the previous predictions. This is achieved by masking all the future positions before the softmax, as sketched below.
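A minimal sketch of this masking (same NumPy style as before): future positions get a very large negative score, so the softmax gives them essentially zero weight:

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Decoder self-attention: positions to the right (the future) are masked out."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)   # True above the diagonal = future positions
    scores = np.where(future, -1e9, scores)                   # large negative score -> ~0 after softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```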

Once all the decoders have produced their output, the final vector of the last decoder is sent to the linear layer, so that we end up with a word instead of a vector. How exactly is this done? A linear layer is nothing but a fully connected layer that takes the vector it receives from the previous layer and expands it to the size of the vocabulary. Suppose we are translating a Hindi sentence into an English sentence. To make that possible, the model has to know at least some English words; say it understands around 10,000 of them. The linear layer will then have 10,000 units, and for the given input sequence each unit gives the score of one unique word. The output is therefore a vector of 10,000 scores.

Finally, a softmax converts these scores into probabilities. The unit with the highest probability is chosen, and the word represented by that unit becomes the predicted word.
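As a minimal sketch (random weights, with the 10,000-word vocabulary and 512-dimensional decoder output assumed from the text above):

```python
import numpy as np

vocab_size, d_model = 10_000, 512
rng = np.random.default_rng(0)

decoder_output = rng.normal(size=(d_model,))       # final decoder output vector for one position
W_vocab = rng.normal(size=(d_model, vocab_size))   # fully connected (linear) layer: one unit per word

logits = decoder_output @ W_vocab                  # one score per vocabulary word
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()                        # softmax turns the scores into probabilities

predicted_word_id = int(np.argmax(probs))          # the unit with the highest probability wins
```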

Let us summarise this entire process step by step.

  • The output of the last encoder is received.
  • This output is converted into two sets of vectors: key and value vectors.
  • These are sent to the encoder-decoder attention layer of every decoder.
  • This helps us get the very first word by sending the last decoder’s output to the linear layer, followed by the softmax function.
  • In the next round, the first predicted word is given as input to the self-attention layer of the first decoder. All the remaining (future) positions are masked.
  • The process continues to produce the second word.
  • The entire process is repeated until all the words are produced (a minimal sketch of this loop follows below).
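Here is a rough sketch of that loop; decoder_step, start_id and end_id are hypothetical stand-ins for the real model and its special tokens:

```python
def greedy_decode(encoder_output, decoder_step, start_id, end_id, max_len=50):
    """Generate one word at a time, feeding previous outputs back into the decoder."""
    output_ids = [start_id]
    for _ in range(max_len):
        probs = decoder_step(encoder_output, output_ids)   # masked self-attention over output_ids
        next_id = int(probs.argmax())                      # pick the most probable word
        output_ids.append(next_id)
        if next_id == end_id:                              # stop at the end-of-sentence token
            break
    return output_ids
```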

Given below is a small gif that explains this entire process. (Source: J Alammar).

http://jalammar.github.io/images/t/transformer_decoding_1.gif

Remaining Process:

http://jalammar.github.io/images/t/transformer_decoding_2.gif

This is how an entire transformer architecture is used to translate or predict new words by using the concept of attention.

References:

  1. An Attentive Survey of Attention Models
  2. https://inblog.in/Understanding-Transformers-the-CORE-behind-the-BERT-LnZDbIxlYL
  3. https://mchromiak.github.io/articles/2017/Sep/12/Transformer-Attention-is-all-you-need/#.XreFtRbhU5l
  4. Visual Guide to Transformer Neural Networks — (Part 1) Position Embeddings
  5. http://jalammar.github.io
