Course talk:CPSC522/IsAttentionExplanation

From UBC Wiki

Contents

Thread title    Replies    Last modified
Crititque2      1          05:58, 14 February 2023
Critique        1          05:06, 14 February 2023

Crititque2

The page gives good insights about attention weights in NLP. I have the following points:

- In Table 3 of Paper 2, do the results generalize to other datasets? How does the data distribution change in the datasets already tested?
- Is this line a fact: "Attention weights should correlate with other feature importance measures"?
- Is scaled dot-product attention also a function of m?
- Isn't the bAbI dataset small, as it just contains 20 tasks?
- For Paper 1, what is the authors' view on how a model should highlight multiple plausible explanations?

ANUBHAVGARG (talk) 02:51, 14 February 2023

Hi Anubhav!

Thank you for your critique, please find my responses below:

1) In Table 3 of Paper 2, do the results generalize to other datasets?
Response: I am not sure if they do. The authors exclude the other two datasets from all analyses because they find that the model's performance on these datasets does not improve significantly after the attention layer is added. More information on this can be found on my page: Paper 2 -> Answering RQ1: In what cases is attention not necessary?

2) How does the data distribution change in the datasets already tested?
Response: I am not sure what you mean by this. Are you asking how different the datasets are? If so, the authors do not provide information about this (possibly because they are well-known datasets). However, Table 1 in Paper 1 lists the number of data items, average sequence length, and train/test size for these datasets (Paper 2 uses the same setup, so the information should be consistent).

3) Is this line a fact: "Attention weights should correlate with other feature importance measures"?
Response: I believe this is fairly intuitive, considering why attention is used in the first place (so that the relevant features in the input are given more importance in the decision making). Additionally, existing explanation methods tend to use attention to explain which features were considered important, so the underlying notion is that attention can be used for understanding feature importance.
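
To make "correlate" concrete, here is a minimal sketch of the Kendall rank correlation (the measure behind Paper 1's RQ1 analysis) between attention weights and another importance score for the same tokens. The numbers are made-up toy values, the name gradient_importance is hypothetical, and this plain tau-a variant simply counts tied pairs as neither concordant nor discordant; it is not meant to reproduce the authors' exact procedure.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant
    (a simplifying assumption of this sketch).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(x)), 2):
        # Positive product: both scores order tokens i and j the same way.
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy values for a four-token input (illustrative only).
attention = [0.50, 0.25, 0.15, 0.10]
gradient_importance = [0.40, 0.35, 0.05, 0.20]

tau = kendall_tau(attention, gradient_importance)
print(f"Kendall tau = {tau:.3f}")
```

A value near 1 means the two measures rank the tokens almost identically, a value near 0 means the rankings are unrelated, and a value near -1 means they are reversed.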

4) Is scaled dot-product attention also a function of m?
Response: Yes, that's why it is in the denominator under a square root.
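
As a rough illustration of where m enters, here is a small sketch of scaled dot-product scoring followed by a softmax. It assumes m denotes the dimensionality of the hidden states and the query; the function names and toy vectors are hypothetical and not taken from either paper.

```python
import math


def scaled_dot_product_scores(hidden_states, query):
    """Score each hidden state h against the query q as (h . q) / sqrt(m)."""
    m = len(query)  # assumed meaning of m: the vector dimensionality
    return [
        sum(h_i * q_i for h_i, q_i in zip(h, query)) / math.sqrt(m)
        for h in hidden_states
    ]


def softmax(scores):
    """Turn raw scores into attention weights that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Three toy hidden states with m = 4 (illustrative values only).
hidden_states = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
]
query = [1.0, 1.0, 1.0, 1.0]

scores = scaled_dot_product_scores(hidden_states, query)
weights = softmax(scores)
print(scores)
print(weights)
```

Because every score is divided by sqrt(m), larger m shrinks the raw dot products before the softmax, which keeps the resulting weights from becoming overly peaked as the dimensionality grows.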

5) Isn't the bAbI dataset small, as it just contains 20 tasks?
Response: Yes, it is small.

6) For paper 1, what is the authors' view on how a model should highlight multiple plausible explanations?
Response: I hypothesize that the authors would consider providing all plausible explanations for a given output to be a requisite for the sake of "completeness".

HarshineeSriram (talk) 05:58, 14 February 2023
 

Overall Thoughts

  • Outstanding work. I only have minor suggestions, but overall I thought this was very clear and well done.

Abstract

  • Nit: I’m not sure if the ML or DL links in the Builds On section are all that helpful. Obviously attention is an ML/DL method, but are you really reading this page if you don’t know about machine learning? I would suggest dropping the machine learning link at least.

Introduction

  • I find it confusing that you refrained from using the full citation in the introduction but plan to use it throughout the rest of the wiki page. I would pick one convention and use it throughout instead of adding the Note at the end of the intro.

Preliminaries

  • I would move the blurb about v, W_1 and W_2 being parameters of the model to directly after they are introduced and before the scaled dot-product attention.

Paper 1

  • Seeing as the Kendall rank correlation is the basis of the analysis of RQ1, it would be nice to have a description of it, or a citation I could follow to understand how this method works.
  • I think the Datasets and Tasks section needs a trim. Although it is very thorough, after reading through the rest of the experiments, I didn’t think that an understanding of each dataset was necessary to understand the authors’ conclusions about attention weights.
  • When discussing Figure 2 you reference the MIMIC dataset, but I can’t see MIMIC labeled on that figure.

Paper 2

  • No comments. Great section. I thought each research question was covered well and the figures and tables were introduced clearly.

MatthewNiedoba (talk) 21:45, 13 February 2023

Hello Matthew!

Thank you so much for your excellent and comprehensive critique. Based on your suggestions, here are the changes that I made:

1) Removed the ML links from the "Builds On" section.
2) Edited the introduction to include the full citation.
3) Moved the text about v, W_1 and W_2 being parameters.
4) Provided a description (+ citation) for Kendall rank correlation in the first paragraph of Paper 1 -> Answering RQ1.
5) Trimmed down the "Datasets and Tasks" section by only including one-sentence definitions for each dataset.
6) Edited the MIMIC example (it is now changed to Diabetes).

Once again, thank you for reviewing my page. Do let me know if you have any additional comments or suggestions.

HarshineeSriram (talk) 05:06, 14 February 2023