MDP suggestions

Also, I forgot to answer the points/clarifications. Here they are, and we will also try to put them in the Wiki.
Regarding question 1: yes, it is the action at time t picked by the policy.
Regarding question 2: the values were picked to show how varying the reward for traversing through nodes, from highly punishing to highly rewarding, can change the optimal policy. The value ranges are mostly empirical and are calibrated only to this situation. We are not sure whether adding an explanation of this to the Wiki would be useful; if you have any insights on that, we would be happy to hear them.
Regarding questions 3 and 4, we will definitely try our best :)
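To illustrate the point about question 2 (with made-up numbers, not the ones used on the Wiki page), here is a minimal Python value-iteration sketch of a toy MDP with two routes to a goal: one that traverses a node with a configurable reward, and one that detours around it. The state names, rewards, and discount are hypothetical; the only point is that making the traversal reward punishing instead of rewarding flips the greedy action at the start state.

 # A minimal sketch with hypothetical values: a tiny deterministic MDP solved by
 # value iteration, showing that the reward for traversing a node can flip the
 # optimal action at 'start'.
 
 GAMMA = 0.9
 
 def solve(traversal_reward):
     # transitions[state][action] = (next_state, reward); 'goal' is terminal
     transitions = {
         'start': {'via_node': ('node', traversal_reward),  # pass through the node
                   'detour':   ('safe1', -1.0)},            # longer route, small step costs
         'node':  {'forward':  ('goal', 10.0)},
         'safe1': {'forward':  ('safe2', -1.0)},
         'safe2': {'forward':  ('goal', 10.0)},
     }
     V = {s: 0.0 for s in transitions}
     V['goal'] = 0.0
     for _ in range(100):  # value iteration until (effectively) converged
         for s, acts in transitions.items():
             V[s] = max(r + GAMMA * V[s2] for s2, r in acts.values())
     # greedy action at 'start' under the converged values
     best = max(transitions['start'],
                key=lambda a: transitions['start'][a][1]
                              + GAMMA * V[transitions['start'][a][0]])
     return best, V['start']
 
 for r in (+1.0, -5.0):  # rewarding vs. punishing node traversal
     action, value = solve(r)
     print(f"traversal reward {r:+.1f}: best action at 'start' = {action}, V(start) = {value:.2f}")

In this sketch, with GAMMA = 0.9 the detour route is worth about 6.2 from the start, so the node route is preferred whenever the traversal reward is above roughly -2.8; the +1.0 and -5.0 settings land on opposite sides of that threshold and produce different optimal policies.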

MDAbedRahman (talk) 04:56, 5 February 2016

Thanks for the clarifications! I thought you might have used some equations to reach the values.

SamprityKashyap (talk) 06:06, 5 February 2016