But what is the Attention Mechanism?
The Attention Mechanism in my understanding — from the simplest to the more advanced — explained in Plain and Broken English
"Attention operates on key, query, value" — John Hewitt, Stanford
An attention function attention(q, k, v) takes three parameters: query, key, and value.
Simple Attention — a regular python dictionary
You can think of an Attention Mechanism as getting a value from a Python dictionary.
```python
# a Python dictionary: {key: value}
d = {'brand': 'Porsche', 'model': 'Taycan', 'color': 'black'}
# lookup: d[query]
```
Now, we have a dictionary here; suppose that I want to get some information from it.
- I can use `d[query]`, for example `d['color']` → `'black'`.

"query" = what I am trying to find. "key" = what sort of information is in the dictionary. "value" = that information.

- But in this case, `query` is equal to `key`. The dictionary finds the `key` that matches my `query` and returns the `value`.
What if …
What if there is a dictionary that can find a `key` even though it is not exactly the same as the `query`?

```python
d = {'brand': 'Porsche', 'model': 'Taycan', 'color': 'black'}
```

Now, I can use `d['colour']` and it will return `'black'`, because the most similar `key` to my `query` is the word "color".
Note: You can try implementing this 'fuzzy' dictionary by using word vectors and cosine similarity. Keywords: NLTK, gensim, word vector/word embedding, cosine similarity.
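Here is a minimal sketch of such a fuzzy dictionary. The word vectors below are hand-made toy values purely for illustration; a real version would use trained embeddings (e.g. word2vec from gensim, as suggested above).

```python
import math

# Toy, hand-made 'word vectors' for illustration only; a real version
# would use trained embeddings such as word2vec or GloVe.
vectors = {
    'color':  [0.90, 0.10, 0.00],
    'colour': [0.88, 0.12, 0.01],
    'brand':  [0.10, 0.90, 0.00],
    'model':  [0.00, 0.20, 0.90],
}

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def fuzzy_get(d, query):
    # Return the value whose key's vector is most similar to the query's vector.
    best_key = max(d, key=lambda k: cosine(vectors[k], vectors[query]))
    return d[best_key]

d = {'brand': 'Porsche', 'model': 'Taycan', 'color': 'black'}
print(fuzzy_get(d, 'colour'))  # → black
```

The key 'colour' is not in the dictionary, but its vector is closest to that of 'color', so the lookup still succeeds.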
More advanced Attention — Word Vectors, Matrix Multiplication, Linear Transformation
In the explanation above, I used an analogy between a dictionary and simple attention. It's time to learn the next level of the attention mechanism. I will try my best to explain the attention that is used in a Transformer model.

Some slight differences from the analogy above: the attention mechanism can weight which "values" (all words in a sentence / all `k:v` pairs in a dictionary) it should pay attention to, and there is more than one type of attention: self-attention, cross-attention, masked self-attention.

For example: if I query for the word "performance", the dictionary above may pay attention to battery, top speed, and horsepower, and not to colour. (Think of it as summarising/compressing information from many values into one value.)
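To make the "weighted values" idea concrete, here is a toy sketch. The attention weights below are invented for illustration; a real attention layer computes them from query/key similarity.

```python
# Toy example: attention output as a weighted sum of values.
# Weights are made up here; a real layer derives them from q·k scores.
values = {
    'battery':   [1.0, 0.0],
    'top_speed': [0.0, 1.0],
    'colour':    [5.0, 5.0],
}
# Hypothetical weights for the query "performance": colour is ignored.
weights = {'battery': 0.5, 'top_speed': 0.5, 'colour': 0.0}

# Summarise/compress many values into one output vector.
output = [sum(weights[k] * v[i] for k, v in values.items()) for i in range(2)]
print(output)  # [0.5, 0.5]
```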
In NLP tasks in ML, each word is represented by a dense vector called a 'word vector' — aka 'word embedding'.

```python
import nltk
import gensim
from nltk.corpus import brown

# train a word2vec model on the Brown corpus
train_set = brown.sents()
model = gensim.models.Word2Vec(train_set)

model.wv['color'][:5]
# first 5 dimensions of the word vector
# [-0.083, 0.275, 0.309, 0.142, -0.026]
```
I need my attention function to be able to learn from data. How can I do that? How can I transform a word vector into query, key, and value?

Remember that attention is a function operating on q, k, v. I can use matrices and hope that my model can learn the weights of those matrices.
Let $x_i \in \mathbb{R}^d$ be a word vector at time step $i$ with dimension $d$, and let there be matrices $\mathbf{Q} \in \mathbb{R}^{d \times d}$, $\mathbf{K} \in \mathbb{R}^{d \times d}$, $\mathbf{V} \in \mathbb{R}^{d \times d}$ inside my attention layer.
(e.g. `tf.keras.layers.MultiHeadAttention` in TensorFlow/Keras)
Each matrix is used to perform a 'Linear Transformation' — matrix-vector multiplication — i.e. to transform `x` into `q`, `k`, `v`:

- $q_i = \mathbf{Q}x_i$
- $k_i = \mathbf{K}x_i$
- $v_i = \mathbf{V}x_i$
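The three projections above can be sketched with NumPy. The matrices are random here just to show the shapes; in a real model they are learned weights.

```python
import numpy as np

d = 4                          # embedding dimension (toy size)
rng = np.random.default_rng(0)

# Stand-ins for the learned weight matrices Q, K, V.
Q = rng.normal(size=(d, d))
K = rng.normal(size=(d, d))
V = rng.normal(size=(d, d))

x_i = rng.normal(size=d)       # a word vector at time step i

# Linear transformations: matrix-vector multiplication.
q_i = Q @ x_i
k_i = K @ x_i
v_i = V @ x_i

print(q_i.shape, k_i.shape, v_i.shape)  # (4,) (4,) (4,)
```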
(See 3Blue1Brown's video on matrix-vector multiplication.)
The reason to have three different matrices is that query, key, and value do different things:

- Calculate the dot product of `q` and `k` as a weight for how much attention to pay to `v`.
For the "self-attention" mechanism, the inputs of attention(q, k, v) are only `x` — aka attention(q=x, k=x, v=x). But for "cross-attention" the inputs are attention(q=x, k=context, v=context).
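Putting these pieces together, here is a minimal single-head sketch of (scaled) dot-product attention over sequences of vectors. The sequence lengths and dimension are toy assumptions; calling it with q=x, k=x, v=x gives self-attention, and passing a separate context gives cross-attention.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention.
    q: (n_q, d), k: (n_k, d), v: (n_k, d) -> output: (n_q, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])   # similarity of each query to each key
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ v                        # weighted sum of the values

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))        # a "sentence" of 3 word vectors, d = 4
context = rng.normal(size=(5, 4))  # e.g. an encoder output of 5 vectors

self_att = attention(q=x, k=x, v=x)               # self-attention
cross_att = attention(q=x, k=context, v=context)  # cross-attention
print(self_att.shape, cross_att.shape)  # (3, 4) (3, 4)
```

Note that the output always has one row per query, whatever the length of the key/value sequence — this is the "compress many values into one" behaviour described above.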
If you have read this far, you might have some ideas popping into your head about how the attention mechanism works, by connecting these things together: matrix-vector multiplication, weighted sums, word vectors, etc. (You can watch the lecture from Stanford for a deeper explanation.)