Course:CPSC522/Reinforcement Learning with Linear Model of Reward Corruption
Contents
 1 Reinforcement Learning with Linear Model of Reward Corruption
 2 Abstract
 3 Content
 3.1 Introduction
 3.2 Preliminaries
 3.3 Experiments
 3.4 Results
 3.5 Discussions
 3.5.1 Discounted Return with Corrupt Reward
 3.5.2 QLearning Update Step with Corrupt Reward
 3.5.3 Remark 1: Inherent Robustness of Reinforcement Learning
 3.5.4 Remark 2: Positive Reward Shifting as Positive Feedback Loop
 3.5.5 Remark 3: Reward Scaling as Discount Rate Modification
 3.5.6 Limitations and Future Work
 3.6 Conclusion
 4 Annotated Bibliography
 5 To Add
Reinforcement Learning with Linear Model of Reward Corruption
In this page, we introduce a framework to represent reward corruption called the linear model of reward corruption. Using this framework, we investigate the effects of various reward corruption patterns on the effectiveness of a Qlearning agent.
Principal Author: Nam Hee Gordon Kim
Abstract
In reinforcement learning (RL), the main observation an agent gathers is the reward it receives from the environment after making an particular action in a particular state. However, sensors are subject to noise or corruption, and the agent's observed reward value may not accurately reflect the true reward the agent deserves from its action. In this page, we propose a simple framework to consider the discrepancies between the true reward value returned by the environment and the reward value observed by the agent. We use this framework to formalize simple scenarios where scaling, shifting, or noise is applied when the agent observes the reward value. We use a Qlearning agent on OpenAI Gym's Taxiv2 task and evaluate whether scaling, shifting, and noise functions have adversarial effects on the agent's performance. Finally, we use theoretical derivations and algorithmic reasoning to explain the counterintuitive results.
Builds on
Foundations of reinforcement learning should be found in course notes, Course:CPSC522/Reinforcement Learning, Course:CPSC522/Markov Decision Process.
Related Pages
Another page on this Wiki, authored by Alistair Wick, discusses the effects of reward functions on an RL agent's performance.
Content
Introduction
In reinforcement learning (RL), learning is not based on a fixed dataset but an environment that reacts to an agent's actions. Sutton and Barto^{[1]} introduce RL problems "as problems [that] involve learning what to dohow to map situations to actionsso as to maximize a numerical reward signal." More precisely, RL is a class of problems pertaining to Markov Decision Process (MDP), where an agent's interaction with an environment is characterized by the specification of state space, action space, transition probability, and reward function. In the recent surge of interest in RL, the specification of reward function has been a subject of interest for the community. In a blog article, Clark and Amodei^{[2]} of OpenAI demonstrated the failure cases of an RL agent where the reward in MDP is misspecified. Amodei et al.^{[3]} formalize the issue of reward hacking as a concrete AI safety problem. Everitt et al.^{[4]} investigate other possible patterns of reward corruption, such as sensory error, wireheading, and misinterpretation on behalf of a cooperative inverse reinversement leraning agent^{[5]}.
In this page, we explore a simple case of MDP reward corruption, where the reward channel is corrupted in such a way that the observed reward is some linear function of the true reward with some noise. We investigate the effect of various different linear patterns on the behavior of the RL agent.
Preliminaries
In this section, we briefly review the concept of Markov Decision Process and introduce the Corrupt Reward Markov Decision Process framework. We furthermore introduce the linear model framework to be investigated.
Markov Decision Process
We will borrow the notation convention from Sutton and Barton's textbook^{[1]}. An MDP is characterized by a 4tuple of the following objects:
 State space:
 Action space:
 State transition probability (also called dynamics): where are the current state, the action of the agent, and the resulting state, respectively.
 Reward function: where are the current state and the action of the agent, respectively.
A simple trajectory in MDP resembles the following form:
where is the number of state transitions simulated.
Corrupt Reward Markov Decision Process
Everitt et al.^{[4]} build up on this notation and introduce the corrupt reward MDP (CRMDP), which is specified by the following objects (we modified the notation scheme to track closer to Sutton's):
 State space:
 Action space:
 State transition probability (also called dynamics): where are the current state, the action of the agent, and the resulting state, respectively.
 Finite set of rewards
 True reward function ,
 Reward corruption function ,
where the finite set of rewards is fixed, and the true reward and the observed signal are elements in . According to this description, Everitt et al.^{[4]} assume that a reward corruption is when the true reward is replaced by some other value, rather than numerical perturbations.
Linear Model of Reward Corruption
We propose to examine reward corruption function that is generalized as a linear function of true reward with Gaussian noise. Suppose we have the following trajectory as a result of a simulation of CRMDP (with a finite horizon of timesteps):
Let us suppose that the observed rewards are some corrupt observations of the true reward signals . We will assume that is a linear function of of the following form:
where are random variables corresponding to the slope and the intercept of the line, and is the noise function. Note that is a Gaussian process with a constant variance . We will assume are artifacts of a malfunctioning reward sensor and are fixed across all timesteps.
We justify this framework for the following reasons:
 Plausibility. If we assume that reward corruption is mainly due to faulty sensor mechanism, the bias, noise, and amplification introduced by machine errors can be reasonably captured by a linear model.
 Simplicity. We can characterize the faulty reward sensor behaviors with varying values of the three parameters .
 Tractability. This linear transformation closely follows linear models in classic statistical theories. This allows us to connect reinforcement learning to inference, hypothesis testing, and other wellknown statistical problems.
Experiments
In our experiments, we employ the proposed reward corruption function. We use OpenAI Gym's Taxiv2 task^{[6]} and observe the effect of various reward corruption patterns on the rate of convergence of the Qlearning algorithm. We use the standard Qlearning for computing the optimal policy of the agent.
We postulate the following properties of the transformation of the reward:
 Hypothesis 1: Inherent robustness. Without further modifications, the agent should be able to perform optimally against a mild level of noise.
 Hypothesis 2: Shift invariance. Adding a constant value to reward at every timestep should not make a significant difference in the resulting policy.
 Hypothesis 3: Positive scaling invariance. Multiplying a positive constant to reward at every timestep should not make a significant difference in the resulting policy.
Shift invariance and nonnegative scaling invariance are motivated by the operations that preserve convexity/concavity optimization theory.
We propose different corruption scenarios, simulated by a simple transformation of the true reward function. We investigate the following configurations of our linear function parameters:
 Scenario 0: No Corruption (Baseline). . There is no corruption, and thus the observed reward is equal to the true reward.
 Scenario 1: Gaussian Noise with Constant Variance. . Sample a real number from a univariate Gaussian distribution and add it to the true reward. We examine the effect of the varying values of .
 Scenario 2: Shifted Reward. . A constant is added to the true reward at every timestep. We examine the varying values of .
 Scenario 3: Scaled Reward. . The true reward is multiplied by a constant at every timestep. We examine the varying values of .
We used the following steps at each timestep to perform the experiments:
 Use Q values to select the action of the agent based on the current state (with greedy exploration)
 Get the next state and true reward from the environment
 Apply to get the observed reward
 Use to update the Q values as per standard Qlearning procedure.
We used temporal difference update for Q values with a learning rate of and used for greedy exploration strategy. Fixed rate of was used as the rate of discounted reward when updating Q values.
Results
We trained the Qlearning agent on the Taxiv2 task for 1e4 episodes using the configurations described from the above sections. Although the agent is only allowed to observe and work with observed reward values , we collected the true reward values to evaluate the agent's objective performance.
Effects of Reward Corruption on Rate of Convergence
We produced three different agents following each of Scenarios 1, 2, and 3. In each scenario, we varied the parameters of interest and plotted the accumulated true reward values (where is the number of timesteps in an episode) as a function of training episodes. The results are illustrated in the following figures:
Discussions
The following bullet points summarize our findings:
 Finding 1: Figure 1 supports Hypothesis 1: Inherent Robustness. However, with a high amount of noise variance, the agent's performance deteriorates.
 Finding 2: Figure 2 contradicts Hypothesis 2: Shift Invariance. It seems that adding a positive constant to the true reward at every time step has an adversarial effect on the agent's performance. Negative shifts, however, did not affect the agent's performance.
 Finding 3: Figure 3 supports Hypothesis 3: Positive Scaling Invariance .
Findings 1 and 2 are surprising. Intuitively, the expected return of the agent should not be affected by a zeromean noise process. Also, it is surprising that only positive reward shifts would hurt the performance.
Below, we unpack these findings with simple theoretical derivations.
Discounted Return with Corrupt Reward
Recall the definition of discounted return from Sutton and Barto:^{[1]}
QLearning Update Step with Corrupt Reward
Recall the Qlearning update step. The update itself is agnostic to timesteps, meaning the iterative updates only depend on the values in that have been computed so far. The agent calculates the updated value for an entry in a matrix based on the already populated values:
For ease of understanding, we simplify the notation and think of as an vector, and the operation is taken over the vector.
Remark 1: Inherent Robustness of Reinforcement Learning
Figure 1 suggests that some level of noise can be tolerated. Let us first examine if the Gaussian noise affects the theoretical value functions of the RL agent. Recall the definition of value functions from Sutton and Barto^{[1]}:
Thus, Bellman Equations show that modelbased RL agents are able to recover true value functions in presence of zeromean noise.
However, it is evident that increasing variance has detrimental effect on the Qlearning agent in terms of the rate of convergence.
The scenario dictates that . Therefore we can rewrite
Let us note that our matrix is initialized with zeros. Let us suppose we are doing the first few iterations of Qlearning. Then the choice of is sensitive to the extraneous noise involved in computing the entries of . For example, if for some , then the entry is likely to be negative. Then could be true. As such, the value selected with could fluctuate very easily depending on the actual value of at each iteration of Qlearning. The larger variance will allow larger values () to emerge more frequently, which makes more brittle, making the Qlearning agent's policy unlikely to converge to the global optimum.
Remark 2: Positive Reward Shifting as Positive Feedback Loop
Contrary to our hypothesis, Figure 2 suggests that the performance of the Qlearning agent is not completely invariant to the shifted reward. While negative shifts result in the same behavior as the baseline, positive shifts result in suboptimal cumulative true rewards.
We will follow similar steps as Remark 1. In our reward shifting scenario, we established: . Let's also use linearity of expectation, as well as use the fact that is independent of . Then the observed value function can be written:
Therefore, a nonzero introduces a constant bias to the value functions, and the magnitude of the bias is inversely proportional to . Theoretically, this should not be a problem in modelbased RL agents. Recall the optimal policy is such that:
Now, we explain why we observe the affects of positive reward shifts in Qlearning. Our approach is very similar to that of Remark 1,
 Recall that typical Qlearning selects update targets with .
 If then we introduce a huge bias in selecting . is very much likely to be the value that has just been populated.
 If then is likely to be true, regardless of the value.
Let us notice that the first case () can introduce a positive feedback loop, i.e. the agent will select the same few stateaction pairs over and over. Now, the second case () does not suffer the positive feedback loop, thanks to the agent selecting . Eventually, all entries of are populated and thus the agent can proceed to compute new values without bias.
Therefore, positive values of introduces a rewardhacking behavior for Qleaning agents via introducing a positive feedback loop, while the negative values of end up producing the optimal policy, given enough iterations.
Remark 3: Reward Scaling as Discount Rate Modification
Figure 3 confirms that nonnegative scaling does not negatively affect the agent's learning behavior. We observe that the agent may converge to the global optimum even faster than the baseline. However, when a negative multiplier is used, the agent is unable to learn the optimal Q values.
Again, we theoretically assert that with . By the linearity of expectation, we simply show
Then with modelbased policy:
Finally, we interpret the effect of positive reward scaling on RL agents. Earlier, we established that via the linearity of expectation. However, it is possible to frame this property as follows:
Limitations and Future Work
We only explored simple scenarios of the linear model introduced , where only one parameter is nonzero. Exploring more general cases with linear mode of reward corruption would be a natural next step. In this page, we only examined the Taxiv2 task, whose environment is simple, discrete, and deterministic. Taxiv2 can be solved via Qlearning and thus the effect of reward corruption on modelfree methods may not fully extend to other modelfree methods or modelbased methods. As stated, these observations are gathered with a Qlearning algorithm with allzero initialization, and different initial values for Qvalues may yield different empirical results.
Conclusion
We investigated a possible abstraction for reward corruption mechanism in reinforcement learning agents. The linear model of reward corruption gives us a simple framework to examine the effects of shifting, scaling, and noise on the performance of RL agents.
Annotated Bibliography
 ↑ ^{1.0} ^{1.1} ^{1.2} ^{1.3} Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
 ↑ “Faulty Reward Functions in the Wild.” OpenAI, 22 Dec. 2016, https://openai.com/blog/faultyrewardfunctions/.
 ↑ Amodei, Dario, et al. “Concrete Problems in AI Safety.” ArXiv:1606.06565 [Cs], June 2016. arXiv.org, http://arxiv.org/abs/1606.06565.
 ↑ ^{4.0} ^{4.1} ^{4.2} Everitt, Tom, et al. "Reinforcement learning with a corrupted reward channel." arXiv preprint arXiv:1705.08417 (2017).
 ↑ HadfieldMenell, Dylan, et al. Cooperative Inverse Reinforcement Learning. p. 9.
 ↑ OpenAI. Gym: A Toolkit for Developing and Comparing Reinforcement Learning Algorithms. https://gym.openai.com. Accessed 15 Apr. 2019.
To Add
Put links and content here to be added. This does not need to be organized, and will not be graded as part of the page. If you find something that might be useful for a page, feel free to put it here.
 Hypothesis: search methods work better if state space is smaller?
