Course talk:CPSC522/Reinforcement Learning with Linear Model of Reward Corruption
Contents

| Thread title | Replies | Last modified |
|---|---|---|
| Comments on the first draft | 0 | 02:45, 17 April 2019 |
This is based on the talk and the page.
This is interesting, but I think from a different perspective than was given.
A positive linear (affine) function of the reward — multiplying by a positive constant and/or adding a constant — has the same optimal policy as the original reward.
You can compute how adding a constant to the reward changes the return (the sum of discounted rewards). This does not mean that the algorithms will behave the same, mainly because of the initial Q-values. If you use 0 as the initial Q-values, then I would conjecture that adding a constant so that some of the rewards are positive and some are negative will work better than having them all positive or all negative (and will be equivalent to a different initialization).
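The size of that shift can be checked numerically. A minimal sketch (the discount factor and reward sequence are illustrative assumptions, not from the page): adding a constant c to every reward adds c(1 − γ^T)/(1 − γ) to a T-step discounted return, approaching c/(1 − γ) in the infinite-horizon limit.

```python
GAMMA = 0.9  # assumed discount factor, for illustration only

def discounted_return(rewards, gamma=GAMMA):
    """Sum of discounted rewards for one episode."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [1.0, -2.0, 0.5, 3.0]  # an arbitrary example episode
c = 5.0                          # constant added to every reward

g_original = discounted_return(rewards)
g_shifted = discounted_return([r + c for r in rewards])

# Adding c to every reward shifts a T-step return by c * (1 - gamma^T) / (1 - gamma).
T = len(rewards)
predicted_shift = c * (1 - GAMMA**T) / (1 - GAMMA)
assert abs((g_shifted - g_original) - predicted_shift) < 1e-9
```

Note that the shift is the same for every policy, which is why the optimal policy is unchanged even though the learned Q-values are not.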
Similarly, think about multiplying the reward by a constant.
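Multiplying every reward by a positive constant a scales every return, and hence every Q-value, by a, which leaves the greedy policy unchanged. A small illustrative check (the function, rewards, and Q-values are assumptions for the sketch):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of discounted rewards for one episode (gamma is assumed)."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [1.0, -2.0, 0.5]
a = 3.0  # positive scaling constant

# Scaling every reward by a scales the return by a.
assert abs(discounted_return([a * r for r in rewards])
           - a * discounted_return(rewards)) < 1e-9

# Since every Q-value is scaled by the same positive a, the argmax over
# actions -- and hence the greedy policy -- is unchanged.
q = {'left': 1.2, 'right': 0.7}         # hypothetical Q-values in one state
q_scaled = {act: a * v for act, v in q.items()}
assert max(q, key=q.get) == max(q_scaled, key=q_scaled.get)
```

As with the additive shift, the learning dynamics can still differ: scaling changes the effective magnitude of the TD errors relative to the initial Q-values.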
As for adding noise: the reason we use a learning rate alpha that computes (weighted) averages is to allow for noise; the noise should average out. I find that α_k = 10/(9+k) works well for averaging out noise in the state-based examples I have tried. I would suggest that learning a model, or using experience replay (where the rewards can be averaged), may work well even for very noisy rewards.
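As a sketch of why that step size averages out noise, here is the running-average update with α_k = 10/(9+k) applied to rewards corrupted by zero-mean Gaussian noise (the true reward value and noise scale are illustrative assumptions):

```python
import random

random.seed(0)  # for reproducibility

TRUE_MEAN = 2.0   # assumed underlying reward value
estimate = 0.0    # initial estimate (plays the role of an initial Q-value)

for k in range(1, 10001):
    r = TRUE_MEAN + random.gauss(0.0, 1.0)  # noisy observed reward
    alpha = 10.0 / (9.0 + k)                # decaying step size
    estimate += alpha * (r - estimate)      # running (weighted) average

# estimate is now close to TRUE_MEAN: the zero-mean noise has averaged out,
# and the influence of the arbitrary initial estimate has decayed away.
```

Because α_k decays like 10/k, it satisfies the usual stochastic-approximation conditions (the steps sum to infinity while their squares do not), while still weighting recent samples more heavily than a plain 1/k average.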