Talk:Transformers

From UBC Wiki

Contents

Thread title | Replies | Last modified
Replies  | 0 | 19:35, 21 March 2023
Critique | 0 | 20:22, 19 March 2023
Feedback | 0 | 17:17, 17 March 2023

Thank you, Nikhil and Matthew, for your feedback. I have incorporated your suggestions to the best of my ability, and they helped me improve this page.

MEHARBHATIA (talk) 19:35, 21 March 2023

I am happy to see this page on transformers. It is a fundamental topic and a must-read for everyone in this course. The author has provided a clear walk-through and also talked about BERT/GPT variants. Here are a few suggestions/clarifications.

  • Could you expand a little more on positional encodings? (Though I understand this could be a separate foundational page, perhaps you could link to some resources for clarity.)
  • I agree with Matthew; I think you should stick to the notation from the paper while explaining self-attention.
  • Please add links to pages such as RNN/GRU/CNN/LSTM in the “Builds On” section.
  • Figure 3 is missing a link to its source.
  • I think it would be nice to also add links to terms such as ReLU, Adam optimizer, cross-entropy, etc.
NIKHILSHENOY (talk) 20:22, 19 March 2023

Hi Mehar,

Transformers are definitely a foundational method for many ML systems these days, so this is certainly a valuable topic for the wiki. I have a couple of items of feedback which I think could improve the article.

Since transformers are all about attention, I think you should be clearer about exactly how attention works. I think the first sentence of the self-attention section is good, but it should be in quotes, as it is a direct quotation from the introduction of "Attention Is All You Need". Moreover, the citation should be changed to [2] to reflect that. I think another quote from the original transformer paper does an excellent job of introducing attention:

"An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key." [2]

I think that would allow you to segue nicely into what attention weights are (the weights for the weighted sum of values).
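For reference, the equation that accompanies this definition in the paper [2] is:

    \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V

where the softmax output is exactly the matrix of attention weights you would be introducing.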

Another issue I have is with the way you present the math for how self-attention works. I think the original math from the paper is pretty clear, and you would be better off using their notation. (A sketch of what I mean follows this list.)

  1) The query, key, and value "vectors" should be matrices (i.e. Q = XW rather than q = W * x), because you get one query vector per sequence element. Maybe add a sentence afterward saying that q_i is the ith feature vector from Q.
  2) The per-element attention math you present doesn't make a ton of sense. My main complaint is the softmax, which needs to operate over each row of the QK^T matrix; "softmax of q_i \cdot k_i" is unclear in my opinion.
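To illustrate the matrix form, here is a rough NumPy sketch of what I mean (my own code, with invented projection names W_q, W_k, W_v, not something from your article):

    import numpy as np

    def self_attention(X, W_q, W_k, W_v):
        # One query/key/value vector per sequence element: row i of Q is q_i.
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len), i.e. QK^T / sqrt(d_k)
        # The softmax operates over each row of QK^T, not over a single q_i . k_i scalar.
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V                        # weighted sum of the values

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 8))                   # 5 tokens, d_model = 8
    W = [rng.normal(size=(8, 8)) for _ in range(3)]
    out = self_attention(X, *W)                   # shape (5, 8)

Note how row i of the softmax involves q_i against every key, which is why the per-element version reads as ambiguous.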

Similarly, you mention positional encodings being a function of sine and cosine at varying frequencies. I would just add the math from the original paper again for clarity.
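For reference, the formulas from [2] are:

    PE_{(pos, 2i)}   = \sin\left(pos / 10000^{2i/d_{model}}\right)
    PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)

where pos is the position and i is the dimension index.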

Some smaller nits:

  • You call transformers a "novel" architecture. Seeing as they came out six years ago, I don't think they're particularly novel anymore.
  • In your "Builds On" section, it would be great to link to the RNN, GRU and LSTM pages.
  • You claim that transformers are more efficient, but you don't cite a source.
  • When you discuss the Q, K, V vectors, you say that the input is "split". Perhaps use "projected"?
  • There is a typo in the "calculating self attention" section: "each step to calculate the self-attention for *iith* input word is given below" (presumably meant to be the i^{th} input word).
  • You are missing your "Encoder Wrap Up" section.
MatthewNiedoba (talk) 17:17, 17 March 2023