Course talk:CPSC522/Scaling Memory for Transformers


Contents

Thread title                     Replies    Last modified
Feedback                         1          07:21, 20 October 2023
October assignment feedback      1          07:16, 20 October 2023
October assignment feedback 2    1          07:12, 20 October 2023

Feedback

What are the two papers? This should be stated upfront. The aim is to describe the advance of one paper over the other; this reads more like a general survey than what was asked for.

Each figure needs an explicit reference on the page. You cannot take a figure from somewhere and use it as your own. (Fair use is okay if you refer to the content in your text, but you need to give an explicit reference so no one will think it is your own work.)

Please be explicit about whether "quadratic complexity" refers to time or space. As written, it seems to imply transformers are quadratic in space. (Where does quadratic space come from?)

What is " factual knowledge"? Where does it come from? Why is there a figure for EMAT, when it doesn't appear in the text?

DavidPoole (talk) 21:05, 13 October 2023

Thank you for your feedback. I've made several improvements based on your suggestions:

- I removed sections like EMAT to streamline the article.
- I retained the introduction to Transformer-XL, aligning it more closely with RMT and explaining its relevance.
- I emphasized RWKV's contributions over RMT in various sections, including a dedicated comparison section, the abstract, and the introduction.
- I've added citations for the figures.
- I've updated the term "quadratic complexity" to "quadratic time complexity" to address the specific issue you pointed out (see the sketch below).
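
For concreteness, here is a minimal NumPy sketch of where the quadratic cost comes from (the shapes are illustrative and my own, not taken from the article): self-attention forms an n-by-n score matrix, which costs O(n^2 * d) time to compute and O(n^2) memory only if it is materialized, which is why "quadratic time" is the safer claim.

    import numpy as np

    n, d = 1024, 64                      # sequence length, head dimension (illustrative)
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

    scores = Q @ K.T / np.sqrt(d)        # (n, n) score matrix: O(n^2 * d) time
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    out = weights @ V                    # another O(n^2 * d) multiply

Memory-efficient attention kernels avoid materializing the full score matrix, so the quadratic space cost can be sidestepped, but the quadratic time cost remains.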

AmirhosseinAbaskohi (talk) 07:21, 20 October 2023
 

October assignment feedback

In "Training and Backpropagation" section you mentioned "gradient checkpointing". I think you can link another page that explains that or cite a paper to help the reader get acquainted with this technique.

You mentioned "potential issues of W". It would be great if you listed some of them.

In "Reconciling Efficiency and Performance" section, I feel like I am reading a summary of the page. You also have some sentences from the contents before. Consequently, you might be able to move it to the conclusion section or create some subsections in your conclusion.

A case study or real-life example of the methods you explain would make the article more colorful and engaging, and it would also help the reader grasp the concepts more deeply.

FARDADHOMAFAR (talk) 05:44, 14 October 2023

Thank you for your valuable comments. I have incorporated your feedback into the article in the following ways:

   1. Added a link for gradient checkpointing.
   2. Consolidated two sections.
   3. Enhanced the first paragraph of the introduction by including an illustrative example.

AmirhosseinAbaskohi (talk) 07:16, 20 October 2023
 

October assignment feedback 2

Grammar

Existing text: "that combines Transformer and"
Suggestion: "that combines the Transformer and"

Existing text: "For example, Compressive Transformer[10] adds"
Suggestion: "For example, a compressive transformer[10] adds"

Existing text: "The starting group of memory tokens acts as a read memory"
Suggestion: "The starting group of memory tokens act as a read memory" (is this "a read memory" or just "read memory"?)

Existing text: "The choice of how many previous segments to backpropagate is a hyperparameter, with BPTT unroll varying from 0 to 4 previous segments."
Suggestion: This sentence is unclear.

Existing text: "It can handle sequences over 1 million tokens on a single GPU"
Suggestion: A bit awkward.

Style

  1. Math symbols like 'SG' don't use math font.
  2. ◦ is really small and hard to read; consider using it in-line as well.
  3. I suggest getting rid of the Content header and upgrading the others; it would make the sections feel distinct and more bite-sized.
  4. There are inconsistencies, like with the channel-mixing sub-block formulas ( vs. ).

Content

  1. Can "ameliorates" be defined or a more common word be used?
  2. Can things like "internal attention" be defined?
  3. The Recurrent Memory Transformer header seems like it should be on the same heading level as transformer-XL based on the way it's written.
  4. The caption on the self-attention diagram doesn't feel like it explains what is in the image.
  5. Simiarly, the EMAT diagram could indicate what pieces of the diagram are doing what (left side versus right side)
  6. The outputs for segment are somewhat hard to parse as side-by-side equations. Putting them vertically might help readability.
  7. The paragraph starting with "The RWKV architecture, named after its four fundamental elements" feels like an intro and could benefit from being sooner in the section.
  8. For the sections with , would be used instead to be consistent with RWKV?
  9. The WKV computation explanation is clear. I think it's a good example of breaking down presented information
  10. Is transformer a proper noun? Sometimes it is capitalized and sometimes not. I would assume not.

General

Some terms are left undefined and without a hyperlink, like "differentiable external memory". Image captions should focus more on the image than on the concept it presents; shorter captions would be nicer, with the longer explanations moved into the body and the image referenced as "Figure 1". Some styling is inconsistent; for example, math sometimes uses quotes. Explaining an equation in plain language can really improve readability.

I enjoyed reading this.

ClairRoss (talk) 22:13, 16 October 2023

Thank you for your feedback. I've updated the page and implemented the changes based on your suggestions, particularly focusing on resolving grammatical and stylistic issues. Regarding the inconsistencies in the channel-mixing sub-block formulas, I've kept the original naming as it appears in the paper to maintain accuracy.

In response to Dr. Poole's feedback and to enhance the article's clarity, I've removed the section related to EMAT. As for WKV, it involves matrix multiplication, where altering the order of vectors changes the output, making the suggested change infeasible. Apart from these points, I've revised the remaining sections as per your comments.
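
(To illustrate the ordering point with a generic toy example — this is my own sketch, not the article's WKV code — matrix multiplication is not commutative, so swapping the operands changes the result:)

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 3))
    B = rng.standard_normal((3, 3))

    print(np.allclose(A @ B, B @ A))   # False: the order of multiplication matters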

Regarding your suggestion to use shorter captions for figures: while that's common practice in research papers, it's less common on wiki pages, especially given that the text cannot link to the images.

AmirhosseinAbaskohi (talk) 07:12, 20 October 2023