Critique 2

Hi Anubhav!

Thank you for your critique! Please find my responses below:

1) In Table 3 of Paper 2, do the results generalize to other datasets?
Response: I am not sure that they do. The authors exclude the other two datasets from all analyses because they find that the model's performance on these datasets does not significantly improve after the addition of the attention layer. More information on this can be found on my page: Paper 2 -> Answering RQ1: In what cases is attention not necessary?

2) How does the data distribution change in the datasets already tested?
Response: I am not sure what you mean by this. Are you asking how different the datasets are from one another? If so, the authors do not provide this information (possibly because these are well-known datasets). However, Table 1 in Paper 1 lists the number of data items, the average sequence length, and the train/test sizes for these datasets (Paper 2 uses the same setup, so the information should be consistent).

3) Is this line a fact: "Attention weights should correlate with other feature importance measures"?
Response: I believe this is more of an intuitive premise than an established fact, considering why attention is used in the first place: so that the relevant features in the input are given more weight in the decision making. Additionally, since existing explanation methods tend to use attention to explain which features were considered important, the underlying notion is that attention can be used for understanding feature importance.
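To make this premise concrete, here is a minimal sketch (with illustrative numbers only, not the authors' code or data) of how one could check whether attention weights rank input tokens similarly to another importance measure, such as gradient-based importance, using Kendall's tau as the rank-correlation statistic:

```python
import numpy as np
from scipy.stats import kendalltau

# Hypothetical per-token values for a single input sequence of 6 tokens.
attention_weights = np.array([0.05, 0.40, 0.10, 0.30, 0.10, 0.05])
gradient_importance = np.array([0.02, 0.35, 0.15, 0.28, 0.12, 0.08])  # e.g. |gradient x input|

# Kendall's tau measures how similarly the two measures rank the tokens;
# a value near 1 would support the premise that attention correlates with
# other feature importance measures.
tau, p_value = kendalltau(attention_weights, gradient_importance)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```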

4) Is scaled dot-product attention also a function of m?
Response: Yes; m appears in the denominator under a square root, so the attention scores are scaled by 1/sqrt(m).
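For reference, here is a minimal sketch of scaled dot-product attention, assuming m is the dimensionality of the hidden representations (the variable names H and q are my own, not the papers'); the 1/sqrt(m) scaling makes the scores, and hence the attention distribution, an explicit function of m:

```python
import numpy as np

def scaled_dot_product_attention(H, q):
    """H: (n_tokens, m) hidden states; q: (m,) query vector."""
    m = H.shape[1]                        # hidden dimension m
    scores = H @ q / np.sqrt(m)           # sqrt(m) sits in the denominator
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()        # softmax over tokens

H = np.random.randn(5, 64)  # 5 tokens, hidden size m = 64
q = np.random.randn(64)
print(scaled_dot_product_attention(H, q))  # attention weights summing to 1
```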

5) Isn't the bAbI dataset small, given that it contains just 20 tasks?
Response: Yes, it is small.

6) For paper 1, what is the authors' view on how a model should highlight multiple plausible explanations?
Response: I hypothesize that the authors would consider providing all plausible explanations for a given output to be a requisite for the sake of "completeness".

HarshineeSriram (talk) 05:58, 14 February 2023