Comments on the first draft
This is based on the talk and the page.
This is interesting, but I come at it from a different perspective than the one given.
A positive linear function of the reward (r' = a*r + c with a > 0) has the same optimal policy as the original reward.
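One way to see this (my own derivation, assuming an infinite-horizon discounted return with discount factor gamma < 1): if every reward is replaced by r'_t = a*r_t + c with a > 0, the return becomes

G' = \sum_{t=0}^{\infty} \gamma^t (a\, r_t + c) = a \sum_{t=0}^{\infty} \gamma^t r_t + \frac{c}{1-\gamma} = a\, G + \frac{c}{1-\gamma},

so every policy's expected return undergoes the same increasing transformation, and the ordering of policies, and hence the optimal policy, is unchanged.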
You can compute how adding a constant to the reward changes the return (the sum of discounted rewards). It doesn't follow that the algorithms will behave the same, mainly because of the initial Q-values. If you use 0 as the initial Q-values, then I would conjecture that adding a constant so that some of the rewards are positive and some are negative will work better than making them all positive or all negative. (And it will be equivalent to a different initialization of the Q-values.)
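To make the "equivalent to a different initialization" point concrete, here is a minimal sketch (my own toy example, not from the draft): tabular Q-learning on a small made-up MDP, run once with the original rewards and zero initial Q-values, and once with every reward shifted by a constant c and the initial Q-values shifted by c/(1-gamma). With the same random seed the two runs make identical action choices, and their Q-tables differ by exactly c/(1-gamma) at every step.

```python
import numpy as np

def run_q_learning(P, R, gamma, init_q, shift, steps=5000, alpha=0.1, eps=0.1, seed=0):
    """Tabular Q-learning on a small deterministic MDP.
    P[s, a] gives the next state, R[s, a] the reward; `shift` is added to
    every reward, and `init_q` is the constant used to initialize Q."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    Q = np.full((n_states, n_actions), init_q, dtype=float)
    s = 0
    for _ in range(steps):
        # epsilon-greedy selection; both runs consume the same RNG stream
        if rng.random() < eps:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(Q[s]))
        s2 = P[s, a]
        r = R[s, a] + shift
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
        s = s2
    return Q

# A tiny 4-state, 2-action MDP, invented purely for illustration.
rng = np.random.default_rng(42)
P = rng.integers(0, 4, size=(4, 2))   # next-state table
R = rng.normal(size=(4, 2))           # reward table
gamma, c = 0.9, 5.0

Q_orig  = run_q_learning(P, R, gamma, init_q=0.0,               shift=0.0)
Q_shift = run_q_learning(P, R, gamma, init_q=c / (1 - gamma),   shift=c)

# The shifted run is the original run translated by c/(1-gamma).
print(np.allclose(Q_shift - Q_orig, c / (1 - gamma)))  # expected: True (up to roundoff)
```

Shifting the rewards without also shifting the initialization breaks this equivalence, which is exactly why the choice of constant interacts with zero-initialized Q-values.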
Similarly, think about multiplying the reward by a constant.
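The same bookkeeping handles scaling (again my own working, in the infinite-horizon discounted setting): multiplying every reward by a > 0 multiplies every return, and hence every Q-value, by a. With zero initial Q-values the Q-learning iterates scale exactly, since if Q'_t = a Q_t then

Q'_{t+1}(s,a) = a\, Q_t(s,a) + \alpha\big(a\, r + \gamma \max_{a'} a\, Q_t(s',a') - a\, Q_t(s,a)\big) = a\, Q_{t+1}(s,a),

so the greedy (and epsilon-greedy) action choices are unchanged. What scaling does change is how the rewards compare to any nonzero initial Q-values, which is where it can affect behaviour in practice.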
As for adding noise: the reason we use a step size alpha that does averaging is to allow for noise; the noise should average out. I find that alpha_k = 10/(9+k) works well for averaging out the noise in the state-based examples I have tried. I would suggest that learning a model, or using experience replay (where the rewards can be averaged), may work well even for very noisy rewards.
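A small sketch of the averaging point (my own toy numbers, not from the draft): estimating the mean of a very noisy reward with the decaying step size alpha_k = 10/(9+k) drives the noise out, whereas a fixed step size keeps chasing it.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean, noise_sd, n = 1.0, 5.0, 10_000
rewards = rng.normal(true_mean, noise_sd, size=n)   # very noisy reward samples

est_decay, est_fixed = 0.0, 0.0
for k, r in enumerate(rewards, start=1):
    alpha_k = 10.0 / (9.0 + k)            # the decaying step size suggested above
    est_decay += alpha_k * (r - est_decay)
    est_fixed += 0.1 * (r - est_fixed)    # fixed step size, for comparison

print(f"decaying alpha estimate: {est_decay:.3f}")  # close to the true mean (1.0)
print(f"fixed alpha estimate:    {est_fixed:.3f}")  # still bouncing around with the noise
```

The decaying schedule satisfies the usual stochastic-approximation conditions (the steps sum to infinity but their squares do not), which is what lets the noise average out; a fixed step size leaves a persistent noise floor.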