Feedback

Hi Samprity,

Regarding wiki page:

  • I think this is a typo?: "Movie was not !good" -> "Movie was !good" (Otherwise it's recursive.)
  • It seems that your model assigns 50% (completely uncertain) to any token that's not seen in the training dataset. It might be worthwhile to mention this somewhere in the wiki.
  • I'm not sure if hyperlinks for the movies are necessary... but I did click a few with positive reviews to check them out. :)
  • It may be helpful to show what the reviews were for these movies, maybe just a couple, because the table right now doesn't reflect how your model works (it doesn't even show the inputs).
  • Why couldn't naive Bayes predict correctly for the Fast and Furious example? A small bit of intuition would help. Possible culprit: using your tool, "extremely entertaining" gets a 58% positive score but "7" gets an 81% negative score. ;)
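To make the unseen-token point above concrete, here is a minimal sketch of how a per-token sentiment probability might be computed. The function name and the count dictionaries are hypothetical, not from the actual tool; the key behavior is the fallback to 0.5 for tokens never seen in training.

```python
# Hypothetical sketch: per-token positive probability from simple counts.
# A token absent from both count tables has no evidence either way, so it
# falls back to 0.5 (completely uncertain), as the wiki model does.

def token_positive_probability(token, pos_counts, neg_counts):
    pos = pos_counts.get(token, 0)
    neg = neg_counts.get(token, 0)
    if pos + neg == 0:
        return 0.5  # unseen token: complete uncertainty
    return pos / (pos + neg)

# Illustrative counts (made up for this example).
pos_counts = {"entertaining": 58, "good": 55}
neg_counts = {"entertaining": 42, "good": 45}

print(token_positive_probability("entertaining", pos_counts, neg_counts))  # 0.58
print(token_positive_probability("never-seen", pos_counts, neg_counts))    # 0.5
```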

Regarding your experiment:

  • I think you can clean up the dataset a bit more. Your tokenizer assigns a weight of 61% negative to the token "1", 53% negative to the token "2", but 58% positive to the token "3"! (For comparison, the token "good" only has a positive weight of 55%.) These numbers, and artifacts such as "a" (positive 53%), can just be removed to yield better results, because using these tokens is essentially fitting to noise.
  • "One thing we observed was that the probability percentages for the same review varied if I loaded the application again." Why is this? If there is randomness in your experiment, can you mention it in the wiki? Is it because the training data is randomly picked each time?
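The two points above (dropping noisy tokens, and the run-to-run variation from random ordering) could be addressed with something like the sketch below. The function names, the cleanup rules, and the seed value are all assumptions for illustration, not the tool's actual code.

```python
import random

# Hypothetical cleanup pass: drop bare digits and very short tokens
# (e.g. "1", "2", "3", "a") that mostly fit noise rather than sentiment.
def clean_tokens(tokens):
    return [t for t in tokens if len(t) > 2 and not t.isdigit()]

# Hypothetical reproducible split: seeding the shuffle makes the
# train/test split identical on every load, so the probabilities for a
# given review no longer vary between runs.
def reproducible_split(reviews, seed=42, train_fraction=0.8):
    rng = random.Random(seed)   # fixed seed: same shuffle every run
    shuffled = list(reviews)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

print(clean_tokens(["good", "1", "a", "movie"]))  # ['good', 'movie']
```

Alternatively, the wiki could simply note that the split is randomized on each load, so reported percentages will vary slightly.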

The tool was fun to play with!

TianQiChen (talk) 03:26, 21 April 2016

Thank you for the feedback!

  • "Movie was not !good" is intentional: I am not removing the "not" as of now. Since the process occurs only once, it is not going to be recursive. If time permits I will try removing the "not" and test it out. The token "!good" gets stored in our Bayes classifier as having appeared in a negative review.
  • Yes, the model assigns 50% (completely uncertain) to any token that is not seen in the training dataset. I will mention it on the page.
  • For the reviews, I correlated the number of stars with positive sentiment. I did paste some of the reviews on the page, but most of them are too long and made the page look weird.
  • Thank you for finding the culprit for the Fast and Furious review!
  • If time permits I will try to clean up the dataset.
  • Yes, I randomly sort the training data each time, which leads to different probabilities. I will mention it on the page!
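The negation marking described in the first bullet can be sketched as a single pass over the tokens. This is an illustrative reconstruction, not the actual code: each token following a "not" gets a "!" prefix, the "not" itself is kept, and because the pass runs only once over the raw tokens, "!good" is never re-prefixed, so there is no recursion.

```python
# One-pass negation marking: prefix the token after "not" with "!".
# Running once over raw tokens means already-marked tokens like "!good"
# are never seen as input, so the transform cannot recurse.

def mark_negation(tokens):
    out = []
    negate = False
    for tok in tokens:
        if tok == "not":
            out.append(tok)        # "not" itself is kept, as described
            negate = True
        elif negate:
            out.append("!" + tok)  # mark the token following "not"
            negate = False
        else:
            out.append(tok)
    return out

print(mark_negation(["movie", "was", "not", "good"]))
# ['movie', 'was', 'not', '!good']
```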
SamprityKashyap (talk) 04:04, 21 April 2016