But what is the Attention Mechanism?
The Attention Mechanism in my understanding — from the simplest to the more advanced — explained in Plain and Broken English
"Attention operates on key, query, value" — John Hewitt, Stanford
An attention function attention(q, k, v) takes three parameters: query, key, and value.
Simple Attention — a regular python dictionary
You can think of an Attention Mechanism as getting a value from a Python dictionary.
```python
# a Python dictionary: {key: value}
d = {'brand': 'Porsche', 'model': 'Taycan', 'color': 'black'}
# lookup: d[query]
```
Now, we have a dictionary here; suppose that I want to get some information from it.
- I can use `d[query]`, for example `d['color']` → `'black'`.

"query" = what I am trying to find. "key" = what sort of information is in the dictionary. "value" = that information.

- But in this case, `query` is equal to `key`. The dictionary finds the `key` that matches my `query` and returns the `value`.
What if …
What if there is a dictionary that can find a `key` even though it is not exactly the same as the `query`?

```python
d = {'brand': 'Porsche', 'model': 'Taycan', 'color': 'black'}
```

Now, I can use `d['colour']` and it will return `'black'`, because the most similar `key` to my `query` is the word "color".
Note: You can try implementing this 'fuzzy' dictionary by using word vectors and cosine similarity. Keywords: NLTK, gensim, word vector/word embedding, cosine similarity.
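Here is a minimal sketch of such a fuzzy dictionary. The word vectors below are hand-made toy values purely for illustration; a real version would use trained embeddings (e.g. word2vec from gensim, as suggested above).

```python
import math

# Toy, hand-made 'word vectors' for illustration only; a real version
# would use trained embeddings such as word2vec or GloVe.
vectors = {
    'color':  [0.90, 0.10, 0.00],
    'colour': [0.88, 0.12, 0.01],
    'brand':  [0.10, 0.90, 0.00],
    'model':  [0.00, 0.20, 0.90],
}

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def fuzzy_get(d, query):
    # Return the value whose key's vector is most similar to the query's vector.
    best_key = max(d, key=lambda k: cosine(vectors[k], vectors[query]))
    return d[best_key]

d = {'brand': 'Porsche', 'model': 'Taycan', 'color': 'black'}
print(fuzzy_get(d, 'colour'))  # → black
```

The key 'colour' is not in the dictionary, but its vector is closest to that of 'color', so the lookup still succeeds.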
More advanced Attention — Word Vectors, Matrix Multiplication, Linear Transformation
In the explanation above, I used an analogy between a dictionary and simple attention. It's time to learn the next level of the attention mechanism. I will try my best to explain the attention that is used in a Transformer model.

Some slight differences from the analogy above: the attention mechanism can weight which "values" (all words in a sentence / all `k:v` pairs in a dictionary) it should pay attention to, and there is more than one type of attention: self-attention, cross-attention, masked self-attention.

For example: if I query for the word "performance", the dictionary above may pay attention to battery, top speed, and horsepower, and not to colour. (Think of it as summarising/compressing information from many values into one value.)
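To make the "weighted values" idea concrete, here is a toy sketch. The attention weights below are invented for illustration; a real attention layer computes them from query/key similarity.

```python
# Toy example: attention output as a weighted sum of values.
# Weights are made up here; a real layer derives them from q·k scores.
values = {
    'battery':   [1.0, 0.0],
    'top_speed': [0.0, 1.0],
    'colour':    [5.0, 5.0],
}
# Hypothetical weights for the query "performance": colour is ignored.
weights = {'battery': 0.5, 'top_speed': 0.5, 'colour': 0.0}

# Summarise/compress many values into one output vector.
output = [sum(weights[k] * v[i] for k, v in values.items()) for i in range(2)]
print(output)  # [0.5, 0.5]
```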
In NLP tasks in ML, each word is represented by a dense vector called a 'word vector' — aka 'word embedding'.

```python
import nltk
import gensim
from nltk.corpus import brown

# train a word2vec model on the Brown corpus
train_set = brown.sents()
model = gensim.models.Word2Vec(train_set)

model.wv['color'][:5]
# first 5 dimensions of the word vector
# [-0.083, 0.275, 0.309, 0.142, -0.026]
```
I need my attention function to be able to learn from data. How can I do that? How can I transform a word vector into query, key, and value?

Remember that attention is a function operating on q, k, v. I can use matrices and hope that my model can learn the weights of those matrices.
Let $x_i \in \mathbb{R}^d$ be a word vector at time step $i$ with dimension $d$, and let there be matrices $\mathbf{Q} \in \mathbb{R}^{d \times d}$, $\mathbf{K} \in \mathbb{R}^{d \times d}$, $\mathbf{V} \in \mathbb{R}^{d \times d}$ inside my attention layer.
(e.g. `tf.keras.layers.MultiHeadAttention` in TensorFlow/Keras)
Each matrix is used to perform a 'Linear Transformation' — matrix-vector multiplication — i.e. to transform `x` into `q`, `k`, `v`:

- $q_i = \mathbf{Q}x_i$
- $k_i = \mathbf{K}x_i$
- $v_i = \mathbf{V}x_i$
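The three projections above can be sketched with NumPy. The matrices are random here just to show the shapes; in a real model they are learned weights.

```python
import numpy as np

d = 4                          # embedding dimension (toy size)
rng = np.random.default_rng(0)

# Stand-ins for the learned weight matrices Q, K, V.
Q = rng.normal(size=(d, d))
K = rng.normal(size=(d, d))
V = rng.normal(size=(d, d))

x_i = rng.normal(size=d)       # a word vector at time step i

# Linear transformations: matrix-vector multiplication.
q_i = Q @ x_i
k_i = K @ x_i
v_i = V @ x_i

print(q_i.shape, k_i.shape, v_i.shape)  # (4,) (4,) (4,)
```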
(See 3Blue1Brown's video on matrix-vector multiplication.)
The reason to have three different matrices is that query, key, and value do different things:

- Calculate the dot product of `q` and `k` as a weight for how much attention to pay to `v`.
For the "self-attention" mechanism, the inputs of attention(q, k, v) are only `x` — aka attention(q=x, k=x, v=x). But for "cross-attention" the inputs are attention(q=x, k=context, v=context).
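Putting these pieces together, here is a minimal single-head sketch of (scaled) dot-product attention over sequences of vectors. The sequence lengths and dimension are toy assumptions; calling it with q=x, k=x, v=x gives self-attention, and passing a separate context gives cross-attention.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention.
    q: (n_q, d), k: (n_k, d), v: (n_k, d) -> output: (n_q, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])   # similarity of each query to each key
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ v                        # weighted sum of the values

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))        # a "sentence" of 3 word vectors, d = 4
context = rng.normal(size=(5, 4))  # e.g. an encoder output of 5 vectors

self_att = attention(q=x, k=x, v=x)               # self-attention
cross_att = attention(q=x, k=context, v=context)  # cross-attention
print(self_att.shape, cross_att.shape)  # (3, 4) (3, 4)
```

Note that the output always has one row per query, whatever the length of the key/value sequence — this is the "compress many values into one" behaviour described above.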
If you have read this far, you might have some ideas popping into your head about how the attention mechanism works, by connecting these things together: matrix-vector multiplication, weighted sums, word vectors, etc. (You can watch the lecture from Stanford for a deeper explanation.)