Critique

I am happy to see this page on transformers. It is a fundamental topic and a must-read for everyone in this course. The author has provided a clear walk-through and also talked about BERT/GPT variants. Here are a few suggestions/clarifications.

Could you expand a little on positional encodings? I understand they could be covered on a separate foundational page, but perhaps you could point to a few links for clarity.
I agree with Mathew and I think you should stick to the notation from the paper while explaining self-attention.
Please add links to pages such as RNN/GRU/CNN/LSTM in the “Builds On” section.
Figure 3 is missing a link to its source.
I think it would be nice to also add links for terms such as ReLU, Adam optimizer, cross entropy, etc.

NIKHILSHENOY (talk)