Feedback
Hi Mehar,
Transformers are definitely a foundational method for many ML systems these days, so this is certainly a valuable topic for the wiki. I have a couple of items of feedback which I think could improve the article.
Since transformers are all about attention, I think you should be clearer about exactly how attention works. The first sentence of the self-attention section is good, but it should be in quotes, as it is a direct quotation from the introduction of "Attention Is All You Need". Moreover, the citation should be changed to [2] to reflect that. I think another quote from the original transformer paper does an excellent job of introducing attention:
"An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key." [2]
I think that would allow you to segue nicely into what attention weights are (the weights for the weighted sum of values).
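For example, the per-element view (my own notation here, with q_i the ith query and k_j, v_j the jth key and value vectors) is:

out_i = sum_j a_{ij} v_j,   where a_{ij} = softmax_j( q_i · k_j / sqrt(d_k) )

so the attention weights a_{ij} are exactly the weights in that weighted sum of values.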
Another issue I have is with the way you present the math for how self-attention works. I think the original math from the paper is pretty clear, and you would be better off using their notation.
1) The query, key and value "vectors" should be matrices (i.e. Q = XW) instead of (q = W * X), because you get one query vector per sequence element. Maybe add a sentence afterward saying that q_i is the ith feature vector (row) of Q.
2) The per-element attention math you present doesn't make a ton of sense to me. My main complaint is the softmax, which needs to operate over each row of the QK^T matrix; "softmax of q_i \dot k_i" is unclear in my opinion.
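For reference, the full scaled dot-product attention from [2], in their notation (with Q = XW^Q, K = XW^K, V = XW^V as the projected queries, keys and values), is:

Attention(Q, K, V) = softmax( QK^T / sqrt(d_k) ) V

where the softmax is applied row-wise, so each row of the output is a weighted sum of the rows of V. I think quoting this directly would clear up the softmax confusion.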
Similarly, you mention the positional encodings being a function of sine and cosine at varying frequencies. I would just add the math from the original paper again for clarity.
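If I remember the paper's formulas correctly, they are:

PE(pos, 2i)   = sin( pos / 10000^(2i / d_model) )
PE(pos, 2i+1) = cos( pos / 10000^(2i / d_model) )

where pos is the position, i indexes the embedding dimension, and d_model is the embedding size, so even dimensions get the sine and odd dimensions the cosine.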
Some smaller nits:
- You call transformers a "novel" architecture. Seeing as they came out 6 years ago, I don't think they're particularly novel anymore.
- In your "Builds On" section, it would be great to link to the RNN, GRU and LSTM pages.
- You claim that transformers are more efficient, but you don't cite a source for that claim.
- When you discuss the Q, K, V vectors, you say that the input is "split". Perhaps use "projected"?
- There is a typo in the "calculating self attention" section: "each step to calculate the self-attention for *iith* {\displaystyle i^{th}} input word is given below."
- You are missing your "Encoder Wrap Up" section.