Discussion 1
Critique
1) "Vision Transformers are a recent breakthrough in the field of computer vision, bringing the success of the transformer architecture in natural language processing to image and video analysis. These models use self-attention mechanisms to dynamically weigh the importance of different regions in an input, allowing them to process and understand visual data in a manner that is similar to how humans perceive the world." Do you have any citation for this? How is the concept of multi-head attention similar to how humans perceive the world? One might argue that CNNs process spatial information in a manner that is more human-like (changes in receptive fields to understand contexts) than vision transformers.
2) "We add an extra learnable “classification token” to the patch embedding at the start of the sequence." It would be great if you could explain why.
3) "The ViT is a visual model based on the architecture of a transformer originally designed for text-based tasks. The ViT model represents an input image as a series of image patches, like the series of word embeddings used when using transformers to text, and directly predicts class labels for the image." Odd placement for these sentences; I think they should be in the introductory paragraph, before the sentence: "To handle 2D images, the image is reshaped ..."
4) "Vision Transformer (ViT) achieves remarkable results compared to convolutional neural networks (CNN) while obtaining fewer computational resources for pre-training." AND "ViT exhibits an extraordinary performance when trained on enough data, breaking the performance of a similar state-of-art CNN with 4x fewer computational resources." convey the same message, so you can remove either of these.
5) "These transformers have high success rates when it comes to NLP models and are now also applied to images for image recognition tasks." This sentence is redundant considering we've been talking about ViTs for a while now.
6) "CNN uses pixel arrays, whereas ViT splits the images into visual tokens." This comes out of nowhere: why are we learning about the differences AFTER we've established that ViTs are better? It doesn't relate to the general flow of the discussion.
7) "The visual transformer divides an image into fixed-size patches, correctly embeds each of them, and includes positional embedding as an input to the transformer encoder. Moreover, ViT models outperform CNNs by almost four times when it comes to computational efficiency and accuracy." Both sentences are redundant; please remove them.
8) "The self-attention layer in ViT makes it possible to embed information globally across the overall image. " How? Also, this should be in the paragraph on ViT architectures.
9) "The model also learns on training data to encode the relative location of the image patches to reconstruct the structure of the image." This should either be where you discuss positional information OR removed completely.
10) How does "Multi-Head Self Attention Layer" have the acronym "MSP"?
11) "Layer Norm (LN): This is added prior to each block as it does not include any new dependencies between the training images. This thereby helps improve the training time and overall performance." This doesn't explain what LN does; could you please add that?
12) "The higher layers of ViT learn the global features, whereas the lower layers learn both global and local features." Do you have any references for this? And why does this allow ViTs to learn more generic patterns?
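For reference on points 2 and 11, here is a minimal sketch of what an explanation could cover, written in plain Python. It assumes the standard formulation: layer normalization rescales each token vector to zero mean and unit variance across its features (followed by a learnable scale and shift), and the classification token is an extra learnable vector prepended to the patch sequence whose final state is used as the image representation for the classifier. The function names and the eps value are illustrative assumptions, not taken from your article.

```python
import math


def layer_norm(token, gamma=None, beta=None, eps=1e-6):
    """Normalize one token vector across its feature dimension.

    gamma and beta stand in for the learnable scale and shift;
    they default to 1 and 0 here (an assumption for illustration).
    """
    d = len(token)
    mean = sum(token) / d
    var = sum((x - mean) ** 2 for x in token) / d
    gamma = gamma if gamma is not None else [1.0] * d
    beta = beta if beta is not None else [0.0] * d
    return [
        g * (x - mean) / math.sqrt(var + eps) + b
        for x, g, b in zip(token, gamma, beta)
    ]


def prepend_class_token(patch_embeddings, class_token):
    """Return a new sequence with the class token placed at position 0."""
    return [list(class_token)] + [list(p) for p in patch_embeddings]


if __name__ == "__main__":
    # Each token is normalized on its own, independently of other tokens and images.
    print(layer_norm([1.0, 2.0, 3.0, 4.0]))
    # Two patch embeddings plus one class token gives a sequence of length 3.
    print(len(prepend_class_token([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])))
```

Because the statistics are computed per token, the normalization does not depend on the other images in the batch, which may be what the quoted sentence about "new dependencies between the training images" was trying to say.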
Minor revisions
1) Grammar:
-> The simplicity of the architecture, as well as the ability to fine-tune *it on large datasets...
-> "The simplicity of the architecture, as well as the ability to fine-tune on large datasets, has made Vision Transformers a popular choice for a wide range of computer vision applications, and the field is rapidly evolving, with ongoing research exploring new architectures, training strategies, and applications. We will cover working and architectures of these below." Too many commas. Second sentence needs serious revision. Overall, this entire block can be re-written.
-> Initial work on ViT is *built on this paper...
-> Transformers are *an advanced neural *network *architecture *that *was first used in the field of natural language processing *(it was introduced in the "Attention Is All You Need" paper*).
-> The Transformer uses *a constant latent vector size D through all of its layers, so the patches are flattened and mapped to D dimensions with a trainable linear projection.
-> The quadratic cost of the self-attention will be very costly and not scale to *a realistic input size; hence, the image is divided into patches.
-> *This *is *done *because transformers are agnostic to the structure of the input elements*; hence, adding the learnable position embeddings to each patch will allow the model to learn about the structure of the image.
-> Vision Transformer (ViT) achieves remarkable results compared to convolutional neural networks (CNN) while *using fewer computational resources for pre-training.
-> In comparison to convolutional neural networks (*CNNs), Vision *Transformers (*ViTs) show a generally weaker inductive bias*, *which increases reliance on model regularization or data augmentation (AugReg) when training on smaller datasets.
-> "This layer contains a two-layer with Gaussian Error Linear Unit (GELU)." It contains a two-layer what?
-> In CNNs, *locality (instead of "locally"?), two-dimensional neighborhood structure* (no comma here) and translation equivariance are baked into each layer throughout the whole model.
-> # this breaks down the image *into s1 x s2 patches, and then *flattens them
2) Formatting:
-> "Prior to ViTs, vision based tasks such as image classification, [segmentation] and object recognition were done using convolutional neural networks." Why are there square brackets around "segmentation"?
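
Regarding the code comment quoted under Grammar ("this breaks down the image in s1xs2 patches ..."), a short sketch of what that step does may help check the wording. It is written in plain Python on nested lists rather than with whatever tensor library the article uses; the function name image_to_patches and the H x W x C layout are assumptions for illustration only.

```python
def image_to_patches(image, s1, s2):
    """Split an H x W x C nested-list image into flattened s1 x s2 patches.

    Returns a list of patches in row-major order; each patch is a flat
    list of length s1 * s2 * C.
    """
    height, width = len(image), len(image[0])
    if height % s1 or width % s2:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, s1):
        for left in range(0, width, s2):
            patch = []
            for row in range(top, top + s1):
                for col in range(left, left + s2):
                    patch.extend(image[row][col])  # all channels of this pixel
            patches.append(patch)
    return patches


if __name__ == "__main__":
    # A 4 x 4 single-channel image whose pixel values are 0..15.
    image = [[[4 * r + c] for c in range(4)] for r in range(4)]
    for patch in image_to_patches(image, 2, 2):
        print(patch)
```

Each flattened patch is what the trainable linear projection then maps to D dimensions, so the comment describes two actions (split, then flatten) performed by the same line of code.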