Comments

I liked reading this informative article. I also loved its bibliography and found it very relevant. IMHO, the article would be more insightful if it answered the following questions: Why can the functions in recurrent networks only be a sigmoid or a tanh? How does an RNN accommodate probability? BPTT also tends to get stuck in local minima. Why? Are there any solutions to the problems with BPTT?


