# Course:CPSC522/Reinforcement Learning with Linear Model of Reward Corruption

## Reinforcement Learning with Linear Model of Reward Corruption

In this page, we introduce a framework to represent reward corruption called the linear model of reward corruption. Using this framework, we investigate the effects of various reward corruption patterns on the effectiveness of a Q-learning agent.

Principal Author: Nam Hee Gordon Kim

## Abstract

In reinforcement learning (RL), the main observation an agent gathers is the reward it receives from the environment after taking a particular action in a particular state. However, sensors are subject to noise or corruption, and the agent's observed reward value may not accurately reflect the true reward the agent deserves for its action. In this page, we propose a simple framework for describing discrepancies between the true reward value returned by the environment and the reward value observed by the agent. We use this framework to formalize simple scenarios where scaling, shifting, or noise is applied when the agent observes the reward value. We train a Q-learning agent on OpenAI Gym's Taxi-v2 task and evaluate whether scaling, shifting, and noise have adversarial effects on the agent's performance. Finally, we use theoretical derivations and algorithmic reasoning to explain the counter-intuitive results.

### Builds on

Foundations of reinforcement learning can be found in the course notes: Course:CPSC522/Reinforcement Learning and Course:CPSC522/Markov Decision Process.

### Related Pages

Another page on this Wiki, authored by Alistair Wick, discusses the effects of reward functions on an RL agent's performance.

## Content

### Introduction

In reinforcement learning (RL), learning is not based on a fixed dataset but on an environment that reacts to an agent's actions. Sutton and Barto[1] introduce RL problems "as problems [that] involve learning what to do--how to map situations to actions--so as to maximize a numerical reward signal." More precisely, RL is a class of problems pertaining to the Markov Decision Process (MDP), where an agent's interaction with an environment is characterized by the specification of a state space, an action space, transition probabilities, and a reward function. In the recent surge of interest in RL, the specification of the reward function has drawn particular attention from the community. In a blog article, Clark and Amodei[2] of OpenAI demonstrated failure cases of an RL agent where the reward in an MDP is misspecified. Amodei et al.[3] formalize the issue of reward hacking as a concrete AI safety problem. Everitt et al.[4] investigate other possible patterns of reward corruption, such as sensory error, wireheading, and misinterpretation on behalf of a cooperative inverse reinforcement learning agent[5].

In this page, we explore a simple case of MDP reward corruption, where the reward channel is corrupted in such a way that the observed reward is some linear function of the true reward with some noise. We investigate the effect of various different linear patterns on the behavior of the RL agent.

### Preliminaries

In this section, we briefly review the Markov Decision Process and introduce the Corrupt Reward Markov Decision Process framework. We then introduce the linear model of reward corruption to be investigated.

#### Markov Decision Process

We will borrow the notation convention from Sutton and Barto's textbook[1]. An MDP is characterized by a 4-tuple of the following objects:

1. State space: ${\displaystyle S=\{s_{1},s_{2},\cdots \}}$
2. Action space: ${\displaystyle A=\{a_{1},a_{2},\cdots \}}$
3. State transition probability (also called dynamics): ${\displaystyle p(s'|s,a)}$ where ${\displaystyle s,a,s'}$ are the current state, the action of the agent, and the resulting state, respectively.
4. Reward function: ${\displaystyle R(s,a)}$ where ${\displaystyle s,a}$ are the current state and the action of the agent, respectively.

A simple trajectory in an MDP takes the following form:

${\displaystyle s_{0},a_{0},r_{0},s_{1},a_{1},r_{1},\cdots ,s_{n-1},a_{n-1},r_{n-1},s_{n}}$

where ${\displaystyle n}$ is the number of state transitions simulated.
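A trajectory of this form can be sampled from any MDP specification. The sketch below simulates one from a small hand-written two-state MDP; the dynamics table and the `rollout` helper are hypothetical illustrations, not part of the Taxi-v2 task.

```python
import random

# Hypothetical two-state MDP: dynamics[(s, a)] = [(prob, s_next, reward), ...]
dynamics = {
    (0, "stay"): [(1.0, 0, 0.0)],
    (0, "go"):   [(0.8, 1, 1.0), (0.2, 0, 0.0)],
    (1, "stay"): [(1.0, 1, 2.0)],
    (1, "go"):   [(1.0, 0, 0.0)],
}

def rollout(policy, s0=0, n=5, seed=0):
    """Simulate the trajectory s_0, a_0, r_0, s_1, ..., s_n under a policy."""
    rng = random.Random(seed)
    traj, s = [], s0
    for _ in range(n):
        a = policy(s)
        # sample (s_next, r) from the categorical outcome distribution
        p, cum = rng.random(), 0.0
        for prob, s_next, r in dynamics[(s, a)]:
            cum += prob
            if p <= cum:
                break
        traj += [s, a, r]
        s = s_next
    traj.append(s)  # trailing state s_n
    return traj
```

For example, a policy that always chooses `"stay"` from state 0 yields the trajectory `[0, "stay", 0.0, 0, "stay", 0.0, 0]` for `n=2`.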

#### Corrupt Reward Markov Decision Process

Everitt et al.[4] build on this notation and introduce the corrupt reward MDP (CRMDP), which is specified by the following objects (we modified the notation scheme to stay closer to Sutton and Barto's):

1. State space: ${\displaystyle S=\{s_{1},s_{2},\cdots \}}$
2. Action space: ${\displaystyle A=\{a_{1},a_{2},\cdots \}}$
3. State transition probability (also called dynamics): ${\displaystyle p(s'|s,a)}$ where ${\displaystyle s,a,s'}$ are the current state, the action of the agent, and the resulting state, respectively.
4. Finite set of rewards ${\displaystyle R=\{r_{1},r_{2},\cdots \}}$
5. True reward function ${\displaystyle {\dot {R}}(s)}$, ${\displaystyle {\dot {R}}:S\rightarrow R}$
6. Reward corruption function ${\displaystyle C(s,r)}$, ${\displaystyle C:S\times {\dot {R}}\rightarrow {\hat {R}}}$

where the finite set of rewards ${\displaystyle R}$ is fixed, and the true reward ${\displaystyle {\dot {R}}}$ and the observed signal ${\displaystyle {\hat {R}}}$ take values in ${\displaystyle R}$. Under this description, Everitt et al.[4] model reward corruption as the true reward being replaced by some other value in ${\displaystyle R}$, rather than as a numerical perturbation.

#### Linear Model of Reward Corruption

We propose to examine a reward corruption function that is generalized as a linear function of the true reward with Gaussian noise. Suppose we have the following trajectory as a result of a simulation of a CRMDP (with a finite horizon of ${\displaystyle n}$ timesteps):

${\displaystyle s_{0},a_{0},{\hat {r}}_{0},s_{1},a_{1},{\hat {r}}_{1},\cdots ,s_{n-1},a_{n-1},{\hat {r}}_{n-1},s_{n}}$

Let us suppose that the observed rewards ${\displaystyle {\hat {r}}_{0},{\hat {r}}_{1},\cdots ,{\hat {r}}_{n-1}}$ are corrupt observations of the true reward signals ${\displaystyle {\dot {r}}_{0},{\dot {r}}_{1},\cdots ,{\dot {r}}_{n-1}}$. We will assume that ${\displaystyle {\hat {r}}_{t}}$ is a linear function of ${\displaystyle {\dot {r}}_{t}}$ of the following form:

${\displaystyle {\hat {r}}_{t}={\dot {r}}_{t}\beta +\beta _{0}+\epsilon _{t}}$

where ${\displaystyle \beta ,\beta _{0}\in {\rm {I\!R}}}$ are random variables corresponding to the slope and the intercept of the line, and ${\displaystyle \epsilon _{t}\sim N(0,\sigma ^{2})}$ is an i.i.d. Gaussian noise term with constant variance ${\displaystyle \sigma ^{2}}$. We will assume ${\displaystyle \beta ,\beta _{0},\sigma ^{2}}$ are artifacts of a malfunctioning reward sensor and are fixed across all timesteps.
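The corruption model above can be sketched in a few lines; the function name `corrupt_reward` and its keyword arguments are our own hypothetical naming, and the defaults correspond to no corruption.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_reward(r_true, beta=1.0, beta0=0.0, sigma=0.0):
    """Linear model of reward corruption:
    r_hat = r_true * beta + beta0 + eps, with eps ~ N(0, sigma^2)."""
    eps = rng.normal(0.0, sigma) if sigma > 0 else 0.0
    return r_true * beta + beta0 + eps
```

With the defaults (`beta=1, beta0=0, sigma=0`) the observed reward equals the true reward, matching the no-corruption baseline.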

Figure 0. A simple graphical model representing the reward corruption process according to the proposed linear model. Note that the true reward value is independent of the parameters of the linear model, and a machine error may cause the observed reward to be scaled (beta), shifted (beta0), or made noisy (sigma).

We justify this framework for the following reasons:

• Plausibility. If we assume that reward corruption is mainly due to a faulty sensor mechanism, then the bias, noise, and amplification introduced by machine errors can be reasonably captured by a linear model.
• Simplicity. We can characterize the faulty reward sensor behaviors with varying values of the three parameters ${\displaystyle \beta ,\beta _{0},\sigma ^{2}}$.
• Tractability. This linear transformation closely follows linear models in classic statistical theories. This allows us to connect reinforcement learning to inference, hypothesis testing, and other well-known statistical problems.

### Experiments

In our experiments, we employ the proposed reward corruption function. We use OpenAI Gym's Taxi-v2 task[6] and observe the effect of various reward corruption patterns on the rate of convergence of the Q-learning algorithm. We use standard Q-learning to compute the agent's optimal policy.

We postulate the following properties of the transformation of the reward:

• Hypothesis 1: Inherent robustness. Without further modifications, the agent should be able to perform optimally against a mild level of noise.
• Hypothesis 2: Shift invariance. Adding a constant value to reward at every timestep should not make a significant difference in the resulting policy.
• Hypothesis 3: Positive scaling invariance. Multiplying a positive constant to reward at every timestep should not make a significant difference in the resulting policy.

Shift invariance and positive scaling invariance are motivated by the operations that preserve convexity/concavity in optimization theory.

We propose different corruption scenarios, simulated by a simple transformation of the true reward function. We investigate the following configurations of our linear function parameters:

• Scenario 0: No Corruption (Baseline). ${\displaystyle \beta =1,\ \beta _{0}=0,\ \sigma _{t}^{2}=0\quad \forall t}$. There is no corruption, and thus the observed reward is equal to the true reward.
• Scenario 1: Gaussian Noise with Constant Variance. ${\displaystyle \beta =1,\ \beta _{0}=0,\ \sigma _{t}^{2}=c\quad \forall t}$. Sample a real number from a univariate Gaussian distribution and add it to the true reward. We examine the effect of the varying values of ${\displaystyle c}$.
• Scenario 2: Shifted Reward. ${\displaystyle \beta =1,\ \beta _{0}=b_{0},\ \sigma _{t}^{2}=0\quad \forall t}$. A constant is added to the true reward at every timestep. We examine the varying values of ${\displaystyle b_{0}}$.
• Scenario 3: Scaled Reward. ${\displaystyle \beta =b,\ \beta _{0}=0,\ \sigma _{t}^{2}=0\quad \forall t}$. The true reward is multiplied by a constant at every timestep. We examine the varying values of ${\displaystyle b}$.
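The four scenarios correspond to the following parameter settings. The constants `c`, `b0`, and `b` below are placeholders for the values swept in the experiments, and the dictionary layout is our own hypothetical convention:

```python
c, b0, b = 1.0, 5.0, 2.0  # placeholder sweep values

# (beta, beta0, sigma^2) per scenario of the linear corruption model
SCENARIOS = {
    "scenario_0_baseline": dict(beta=1.0, beta0=0.0, sigma2=0.0),
    "scenario_1_noise":    dict(beta=1.0, beta0=0.0, sigma2=c),
    "scenario_2_shift":    dict(beta=1.0, beta0=b0,  sigma2=0.0),
    "scenario_3_scale":    dict(beta=b,   beta0=0.0, sigma2=0.0),
}
```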

We used the following steps at each timestep ${\displaystyle t}$ to perform the experiments:

1. Use Q values to select the action of the agent ${\displaystyle a_{t}}$ based on the current state (with ${\displaystyle \epsilon }$-greedy exploration)
2. Get the next state ${\displaystyle s_{t+1}}$ and true reward ${\displaystyle {\dot {r}}_{t+1}}$ from the environment
3. Apply ${\displaystyle {\hat {r}}_{t+1}={\dot {r}}_{t+1}\beta +\beta _{0}+\epsilon _{t+1}}$ to get the observed reward ${\displaystyle {\hat {r}}_{t+1}}$
4. Use ${\displaystyle {\hat {r}}_{t+1}}$ to update the Q values as per standard Q-learning procedure.

We used the temporal difference update for Q values with a learning rate of ${\displaystyle \alpha =0.5}$ and used ${\displaystyle \epsilon =0.25}$ for the ${\displaystyle \epsilon }$-greedy exploration strategy. A fixed discount rate of ${\displaystyle \gamma =0.99}$ was used when updating Q values.
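The experimental loop described above can be sketched as a minimal tabular Q-learning routine. This is written against the classic Gym-style interface (`reset()` returning a state index, `step(a)` returning `(next_state, reward, done, info)`); the function name and corruption arguments are our own, and the hyperparameter defaults match the values stated above.

```python
import numpy as np

def q_learning_corrupt(env, n_states, n_actions, episodes=1000,
                       alpha=0.5, gamma=0.99, eps=0.25,
                       beta=1.0, beta0=0.0, sigma=0.0, seed=0):
    """Tabular Q-learning where each true reward is passed through the
    linear corruption model before the temporal-difference update."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))  # all-zero initialization
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection on the learned (corrupt) Q values
            if rng.random() < eps:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r_true, done, _ = env.step(a)
            # observed reward under the linear corruption model
            r_hat = r_true * beta + beta0 + (rng.normal(0.0, sigma) if sigma > 0 else 0.0)
            # standard temporal-difference update using the observed reward
            Q[s, a] += alpha * (r_hat + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```

On an older Gym version whose `reset` returns a bare state index, the Taxi-v2 experiment would look like `q_learning_corrupt(gym.make("Taxi-v2"), 500, 6, beta0=5.0)`.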

### Results

We trained the Q-learning agent on the Taxi-v2 task for 10,000 episodes using the configurations described in the sections above. Although the agent is only allowed to observe and work with the observed reward values ${\displaystyle {\hat {r}}_{t}}$, we collected the true reward values ${\displaystyle {\dot {r}}_{t}}$ to evaluate the agent's objective performance.

#### Effects of Reward Corruption on Rate of Convergence

We produced three different agents following each of Scenarios 1, 2, and 3. In each scenario, we varied the parameters of interest and plotted the accumulated true reward values ${\displaystyle \sum _{t=0}^{T}{\dot {r}}_{t}}$ (where ${\displaystyle T}$ is the number of timesteps in an episode) as a function of training episodes. The results are illustrated in the following figures:

Figure 1. Varying levels of sigma (standard deviation) with the fixed mean of 0 are used to apply Gaussian noise to true reward returned by OpenAI Gym's Taxi-v2 environment. The resulting corrupt reward is used to compute Q values, while the true reward is collected to observe the agent's performance. Blue line (perfect reward) shows best performance, as expected.
Figure 2. Cumulative true reward from the environment is plotted as a function of training episode. Perfect reward observation (beta0=0, blue) along with negative shifts converge to the global optimum, while positive shifts are stuck in a local optimum.
Figure 3. Cumulative true reward from the environment is plotted as a function of training episodes. Agents with positively scaled rewards converge to the global optimum. Negative rewards have an extremely adversarial effect on the agent's performance.

### Discussions

The following bullet points summarize our findings:

• Finding 1: Figure 1 supports Hypothesis 1: Inherent Robustness. However, with a high amount of noise variance, the agent's performance deteriorates.
• Finding 2: Figure 2 contradicts Hypothesis 2: Shift Invariance. It seems that adding a positive constant to the true reward at every time step has an adversarial effect on the agent's performance. Negative shifts, however, did not affect the agent's performance.
• Finding 3: Figure 3 supports Hypothesis 3: Positive Scaling Invariance.

Findings 1 and 2 are surprising. Intuitively, the expected return of the agent should not be affected by a zero-mean noise process. Also, it is surprising that only positive reward shifts would hurt the performance.

Below, we unpack these findings with simple theoretical derivations.

#### Discounted Return with Corrupt Reward

Recall the definition of discounted return from Sutton and Barto:[1]

${\displaystyle G_{t}=r_{t+1}+\gamma r_{t+2}+\gamma ^{2}r_{t+3}+\cdots =\sum _{k=0}^{\infty }\gamma ^{k}r_{t+k+1}}$
Let's introduce the notion of discounted observed return:

${\displaystyle {\hat {G}}_{t}={\hat {r}}_{t+1}+\gamma {\hat {r}}_{t+2}+\gamma ^{2}{\hat {r}}_{t+3}+\cdots =\sum _{k=0}^{\infty }\gamma ^{k}{\hat {r}}_{t+k+1}}$
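For a finite reward sequence, either return can be computed with a truncated version of this sum (a sketch; the helper name is our own):

```python
def discounted_return(rewards, gamma=0.99):
    """G = sum_k gamma^k * r_k over a finite reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))
```

The same helper applies to both true and observed rewards, so the two returns can be compared directly on a logged trajectory.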

#### Q-Learning Update Step with Corrupt Reward

Recall the Q-learning update step. The update itself is agnostic to timesteps, meaning the iterative updates only depend on the values in ${\displaystyle Q}$ that have been computed so far. The agent calculates the updated value for an entry in a matrix ${\displaystyle Q}$ based on the already populated values:

${\displaystyle Q(s,a)=r+\gamma \max _{a'}Q(s',a')}$
where ${\displaystyle a}$ is the action selected by the policy ${\displaystyle \pi }$ based on state ${\displaystyle s}$. Note that ${\displaystyle r}$ and ${\displaystyle s'}$ are generated by the environment having received the state-action pair ${\displaystyle (s,a)}$. In a usual Q-learning setup, the agent selects the next update target by
${\displaystyle s=s',\ a=\arg \max _{a'}Q(s',a')}$. Now, consider the Q-learning update step with observed reward:
${\displaystyle {\hat {Q}}(s,a)={\hat {r}}+\gamma \max _{a'}{\hat {Q}}(s',a')}$
where the entries of ${\displaystyle {\hat {Q}}}$ must be computed based on the observed reward value ${\displaystyle {\hat {r}}}$. Suppose our matrix ${\displaystyle {\hat {Q}}}$ is initialized with zeros.

For ease of understanding, we simplify the notation and think of ${\displaystyle {\hat {Q}}(s',\cdot )}$ as an ${\displaystyle n_{\text{actions}}\times 1}$ vector, and the ${\displaystyle \max }$ operation is taken over the vector.

${\displaystyle {\hat {Q}}(s,a)={\hat {r}}+\gamma \max {\hat {Q}}(s',\cdot )}$
A Q-learning agent's policy is such that the action yielding best Q-value is selected given a state. More formally,
${\displaystyle {\hat {\pi }}(s)=\arg \max {\hat {Q}}(s,\cdot )}$
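Treating ${\displaystyle {\hat {Q}}}$ as an ${\displaystyle n_{\text{states}}\times n_{\text{actions}}}$ array, this greedy policy is a one-line vectorized operation (a sketch; the function name is our own):

```python
import numpy as np

def greedy_policy(Q):
    """pi_hat(s) = argmax_a Q_hat(s, a), evaluated for every state at once."""
    return np.argmax(Q, axis=1)
```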

#### Remark 1: Inherent Robustness of Reinforcement Learning

Figure 1 suggests that some level of noise can be tolerated. Let us first examine if the Gaussian noise affects the theoretical value functions of the RL agent. Recall the definition of value functions from Sutton and Barto[1]:

${\displaystyle {\hat {v}}_{\pi }(s)=E_{\pi }\left[{\hat {G}}_{t}|S_{t}=s\right]=E_{\pi }\left[\sum _{k=0}^{\infty }\gamma ^{k}{\hat {r}}_{t+k+1}{\bigg |}S_{t}=s\right]}$
In our noise scenario, we established: ${\textstyle {\hat {r}}_{t}={\dot {r}}_{t}+\epsilon _{t}}$.
${\displaystyle {\hat {v}}_{\pi }(s)=E_{\pi }\left[\sum _{k=0}^{\infty }\gamma ^{k}({\dot {r}}_{t+k+1}+\epsilon _{t+k+1}){\bigg |}S_{t}=s\right]}$
${\displaystyle =E_{\pi }\left[\sum _{k=0}^{\infty }\gamma ^{k}{\dot {r}}_{t+k+1}+\sum _{k=0}^{\infty }\gamma ^{k}\epsilon _{t+k+1}{\bigg |}S_{t}=s\right]}$
${\displaystyle =E_{\pi }\left[\sum _{k=0}^{\infty }\gamma ^{k}{\dot {r}}_{t+k+1}{\bigg |}S_{t}=s\right]+E_{\pi }\left[\sum _{k=0}^{\infty }\gamma ^{k}\epsilon _{t+k+1}{\bigg |}S_{t}=s\right]}$
${\displaystyle =E_{\pi }\left[\sum _{k=0}^{\infty }\gamma ^{k}{\dot {r}}_{t+k+1}{\bigg |}S_{t}=s\right]+\sum _{k=0}^{\infty }\gamma ^{k}E_{\pi }\left[\epsilon _{t+k+1}|S_{t}=s\right]}$
By the independence between ${\displaystyle \epsilon _{t+k+1}}$ and ${\displaystyle S_{t}}$:
${\displaystyle =E_{\pi }\left[\sum _{k=0}^{\infty }\gamma ^{k}{\dot {r}}_{t+k+1}{\bigg |}S_{t}=s\right]+\sum _{k=0}^{\infty }\gamma ^{k}E_{\pi }\left[\epsilon _{t+k+1}\right]}$
Now, recall ${\displaystyle \epsilon _{t}\sim N(0,\sigma ^{2})}$ and therefore ${\displaystyle E_{\pi }[\epsilon _{t}]=0}$ for all ${\displaystyle t}$.
${\displaystyle =E_{\pi }\left[\sum _{k=0}^{\infty }\gamma ^{k}{\dot {r}}_{t+k+1}{\bigg |}S_{t}=s\right]=E_{\pi }\left[{\dot {G}}_{t}|S_{t}=s\right]={\dot {v}}_{\pi }(s)}$
Therefore, ${\displaystyle {\dot {v}}_{\pi }(s)={\hat {v}}_{\pi }(s)}$. Similarly, ${\displaystyle {\dot {q}}_{\pi }(s,a)={\hat {q}}_{\pi }(s,a)}$ if ${\textstyle {\hat {r}}_{t}={\dot {r}}_{t}+\epsilon _{t}}$ and ${\displaystyle \epsilon _{t}\sim N(0,\sigma ^{2})}$.

Thus, the Bellman equations show that model-based RL agents are able to recover the true value functions in the presence of zero-mean noise.

However, it is evident that increasing the variance has a detrimental effect on the Q-learning agent in terms of the rate of convergence.

The scenario dictates that ${\textstyle {\hat {r}}={\dot {r}}+\epsilon }$. Therefore we can rewrite

${\displaystyle {\hat {Q}}(s,a)={\hat {r}}+\gamma \max {\hat {Q}}(s',\cdot )}$
${\displaystyle ={\dot {r}}+\epsilon +\gamma \max {\hat {Q}}(s',\cdot )}$
Now, the Q-learning algorithm dictates that the entries of ${\displaystyle {\hat {Q}}(s',\cdot )}$ were already populated according to the recurrence:
${\displaystyle {\hat {Q}}(s',a')={\dot {r}}'+\epsilon '+\gamma \max {\hat {Q}}(s'',\cdot )}$
For all appropriate ${\displaystyle a'}$.

Let us note that our matrix ${\displaystyle {\hat {Q}}}$ is initialized with zeros, and suppose we are doing the first few iterations of Q-learning. Then the choice of ${\displaystyle \max {\hat {Q}}(s',\cdot )}$ is sensitive to the extraneous noise involved in computing the entries of ${\displaystyle {\hat {Q}}(s',\cdot )}$. For example, if ${\displaystyle \epsilon '\ll 0}$ for some ${\displaystyle a'}$, then the entry ${\displaystyle {\hat {Q}}(s',a')}$ is likely to be negative, in which case ${\displaystyle \max {\hat {Q}}(s',\cdot )=0}$ could be true. As such, the value selected by ${\displaystyle \max {\hat {Q}}(s',\cdot )}$ can fluctuate easily depending on the actual value of ${\displaystyle \epsilon }$ at each iteration of Q-learning. A larger variance allows extreme ${\displaystyle \epsilon }$ values (${\displaystyle |\epsilon |\gg 0}$) to occur more frequently, which makes ${\displaystyle \max {\hat {Q}}(s',\cdot )}$ more brittle, making the Q-learning agent's policy unlikely to converge to the global optimum.
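This brittleness argument can be illustrated with a small Monte Carlo sketch (all names here are hypothetical): suppose a single entry of the row ${\hat {Q}}(s',\cdot)$ has been populated with value ${\dot {r}}+\epsilon '$ while the rest remain zero, and measure how often noise pushes that entry below the unvisited zeros.

```python
import numpy as np

rng = np.random.default_rng(1)

def masked_fraction(sigma, r_dot=1.0, trials=10_000):
    """Fraction of trials in which noise drives the one populated entry
    Q_hat(s', a') = r_dot + eps below 0, so that the row max falls back
    to an unvisited zero-initialized entry instead."""
    eps = rng.normal(0.0, sigma, size=trials)
    return float(np.mean(r_dot + eps < 0.0))
```

With a small `sigma` the populated entry essentially always wins the max; with a large `sigma` the unvisited zeros win a substantial fraction of the time, which is the fluctuation described above.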

#### Remark 2: Positive Reward Shifting as Positive Feedback Loop

Contrary to our hypothesis, Figure 2 suggests that the performance of the Q-learning agent is not completely invariant to the shifted reward. While negative shifts result in the same behavior as the baseline, positive shifts result in sub-optimal cumulative true rewards.

We will follow similar steps as Remark 1. In our reward shifting scenario, we established: ${\textstyle {\hat {r}}_{t}={\dot {r}}_{t}+\beta _{0}}$. Let's also use linearity of expectation, as well as use the fact that ${\displaystyle \beta _{0}}$ is independent of ${\displaystyle \pi }$. Then the observed value function ${\displaystyle {\hat {v}}_{\pi }(s)}$ can be written:

${\displaystyle {\hat {v}}_{\pi }(s)=E_{\pi }\left[\sum _{k=0}^{\infty }\gamma ^{k}({\dot {r}}_{t+k+1}+\beta _{0}){\bigg |}S_{t}=s\right]}$
${\displaystyle =E_{\pi }\left[\sum _{k=0}^{\infty }\gamma ^{k}{\dot {r}}_{t+k+1}{\bigg |}S_{t}=s\right]+\sum _{k=0}^{\infty }\gamma ^{k}E_{\pi }\left[\beta _{0}|S_{t}=s\right]}$
${\displaystyle =E_{\pi }\left[\sum _{k=0}^{\infty }\gamma ^{k}{\dot {r}}_{t+k+1}{\bigg |}S_{t}=s\right]+\beta _{0}\sum _{k=0}^{\infty }\gamma ^{k}}$
${\displaystyle =E_{\pi }\left[\sum _{k=0}^{\infty }\gamma ^{k}{\dot {r}}_{t+k+1}{\bigg |}S_{t}=s\right]+{\frac {\beta _{0}}{1-\gamma }}}$
${\displaystyle =E_{\pi }\left[{\dot {G}}_{t}|S_{t}=s\right]+{\frac {\beta _{0}}{1-\gamma }}={\dot {v}}_{\pi }(s)+{\frac {\beta _{0}}{1-\gamma }}}$
And similarly, ${\displaystyle {\hat {q}}_{\pi }(s,a)={\dot {q}}_{\pi }(s,a)+{\frac {\beta _{0}}{1-\gamma }}}$.
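The constant bias ${\displaystyle \beta _{0}/(1-\gamma )}$ can be checked numerically with a truncated geometric sum (a sketch; the helper name is our own):

```python
def shift_bias(beta0, gamma, horizon=10_000):
    """Truncated sum_{k=0}^{horizon-1} beta0 * gamma^k, which approaches
    beta0 / (1 - gamma) as the horizon grows (for 0 <= gamma < 1)."""
    return sum(beta0 * gamma ** k for k in range(horizon))
```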

Therefore, a nonzero ${\displaystyle \beta _{0}}$ introduces a constant bias to the value functions, and the magnitude of the bias is inversely proportional to ${\displaystyle \gamma }$. Theoretically, this should not be a problem in model-based RL agents. Recall the optimal policy is ${\displaystyle \pi ^{*}:S\rightarrow A}$ such that:

${\displaystyle \pi ^{*}(s)=\arg \max _{a}\sum _{s',r}p(s',r|s,a)\left[r+\gamma v_{*}(s')\right]}$
where ${\displaystyle v_{*}(s)}$ is the optimal state value function. With our observed reward,

${\displaystyle {\hat {\pi }}^{*}(s)=\arg \max _{a}\sum _{s',{\dot {r}}}p(s',{\dot {r}}|s,a)\left[{\hat {r}}+\gamma {\hat {v}}_{*}(s')\right]}$
${\displaystyle =\arg \max _{a}\sum _{s',{\dot {r}}}p(s',{\dot {r}}|s,a)\left[{\dot {r}}+\beta _{0}+\gamma \left({\dot {v}}_{*}(s')+{\frac {\beta _{0}}{1-\gamma }}\right)\right]}$
${\displaystyle =\arg \max _{a}\sum _{s',{\dot {r}}}p(s',{\dot {r}}|s,a)\left[{\dot {r}}+\beta _{0}+\gamma {\dot {v}}_{*}(s')+{\frac {\gamma \beta _{0}}{1-\gamma }}\right]}$
${\displaystyle =\arg \max _{a}\sum _{s',{\dot {r}}}p(s',{\dot {r}}|s,a)\left[{\dot {r}}+\gamma {\dot {v}}_{*}(s')+{\frac {\beta _{0}}{1-\gamma }}\right]}$
${\displaystyle =\arg \max _{a}\sum _{s',{\dot {r}}}p(s',{\dot {r}}|s,a)\left[{\dot {r}}+\gamma {\dot {v}}_{*}(s')\right]+{\frac {\beta _{0}}{1-\gamma }}\sum _{s',{\dot {r}}}p(s',{\dot {r}}|s,a)}$
${\displaystyle =\arg \max _{a}\sum _{s',{\dot {r}}}p(s',{\dot {r}}|s,a)\left[{\dot {r}}+\gamma {\dot {v}}_{*}(s')\right]+{\frac {\beta _{0}}{1-\gamma }}}$
Since the ${\displaystyle \arg \max }$ operation is invariant to adding/subtracting constants,
${\displaystyle =\arg \max _{a}\sum _{s',{\dot {r}}}p(s',{\dot {r}}|s,a)\left[{\dot {r}}+\gamma {\dot {v}}_{*}(s')\right]={\dot {\pi }}^{*}(s)}$
Then the optimal policy under the true reward must be the same as the optimal policy under the observed reward, i.e. ${\displaystyle {\dot {\pi }}^{*}={\hat {\pi }}^{*}}$. Therefore, for a model-based agent, convergence onto the optimal policy should not be affected by Scenario 2.

Now, we explain why we nonetheless observe the effects of positive reward shifts in Q-learning. Our approach is very similar to that of Remark 1:

${\displaystyle {\hat {Q}}(s,a)={\dot {r}}+\beta _{0}+\gamma \max {\hat {Q}}(s',\cdot )}$
${\displaystyle {\hat {Q}}(s',a')={\dot {r}}'+\beta _{0}+\gamma \max {\hat {Q}}(s'',\cdot )}$
Again, let us suppose we are doing the first few iterations of Q-learning. Note the following:

• Recall that typical Q-learning selects update targets with ${\displaystyle s=s',\ a=\arg \max {\hat {Q}}(s',\cdot )}$.
• If ${\displaystyle \beta _{0}\gg 0}$ then we introduce a huge bias in selecting ${\displaystyle \max {\hat {Q}}(s',\cdot )}$: it is very likely to be the ${\displaystyle {\hat {Q}}(s',a')}$ value that has just been populated.
• If ${\displaystyle \beta _{0}\ll 0}$ then ${\displaystyle \max {\hat {Q}}(s',\cdot )=0}$ is likely to be true, regardless of the ${\displaystyle {\hat {Q}}(s',a')}$ value.

Notice that the first case (${\displaystyle \beta _{0}\gg 0}$) can introduce a positive feedback loop, i.e. the agent will select the same few state-action pairs over and over. The second case (${\displaystyle \beta _{0}\ll 0}$), however, does not suffer from the positive feedback loop, thanks to the agent selecting ${\displaystyle \max {\hat {Q}}(s',\cdot )=0}$. Eventually, all entries of ${\displaystyle {\hat {Q}}(s',\cdot )}$ are populated, and the agent can proceed to compute new values without bias.

Therefore, positive values of ${\displaystyle \beta _{0}}$ induce reward-hacking behavior in Q-learning agents by introducing a positive feedback loop, while negative values of ${\displaystyle \beta _{0}}$ still produce the optimal policy, given enough iterations.
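The feedback-loop asymmetry can be demonstrated in isolation with a hypothetical one-state MDP that loops forever, using purely greedy action selection (no exploration) so the feedback acts unchecked; the construction and function name are our own, not part of the experiments above.

```python
import numpy as np

def greedy_preference(beta0, steps=2000, alpha=0.5, gamma=0.9):
    """Purely greedy tabular Q-learning in a one-state self-looping MDP.
    True rewards: action 0 yields 0, action 1 yields 1, so action 1 is
    optimal. Returns the greedy action after training on shifted rewards
    r_hat = r_true + beta0."""
    true_r = np.array([0.0, 1.0])
    Q = np.zeros(2)  # all-zero initialization, as in the experiments
    for _ in range(steps):
        a = int(np.argmax(Q))      # no exploration: feedback acts unchecked
        r_hat = true_r[a] + beta0  # shifted reward
        Q[a] += alpha * (r_hat + gamma * np.max(Q) - Q[a])
    return int(np.argmax(Q))
```

With a large positive shift, the first action tried keeps winning the argmax and the other action is never visited; with a large negative shift, every visited entry drops below the zero-initialized one, so both actions get visited and the correct ordering is eventually recovered.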

#### Remark 3: Reward Scaling as Discount Rate Modification

Figure 3 confirms that positive scaling does not negatively affect the agent's learning behavior. We observe that the agent may converge to the global optimum even faster than the baseline. However, when a negative multiplier is used, the agent is unable to learn the optimal Q values.

Again, we theoretically assert that ${\displaystyle {\dot {\pi }}^{*}(s)={\hat {\pi }}^{*}(s)}$ with ${\displaystyle {\hat {r}}_{t}={\dot {r}}_{t}\beta }$. By the linearity of expectation, we simply show

${\displaystyle {\hat {v}}_{\pi }(s)=\beta E_{\pi }\left[\sum _{k=0}^{\infty }\gamma ^{k}{\dot {r}}_{t+k+1}{\bigg |}S_{t}=s\right]=\beta {\dot {v}}_{\pi }(s)}$
And ${\displaystyle {\hat {q}}_{\pi }(s,a)=\beta {\dot {q}}_{\pi }(s,a)}$ similarly.

Then with model-based policy:

${\displaystyle {\hat {\pi }}^{*}(s)=\arg \max _{a}\sum _{s',{\dot {r}}}p(s',{\dot {r}}|s,a)\left({\dot {r}}\beta +\gamma {\hat {v}}_{*}(s')\right)}$
${\displaystyle =\arg \max _{a}\sum _{s',{\dot {r}}}p(s',{\dot {r}}|s,a)\left({\dot {r}}\beta +\gamma \beta {\dot {v}}_{*}(s')\right)}$
${\displaystyle =\arg \max _{a}\beta \sum _{s',{\dot {r}}}p(s',{\dot {r}}|s,a)\left({\dot {r}}+\gamma {\dot {v}}_{*}(s')\right)}$
Now, we must suppose ${\displaystyle \beta >0}$ to proceed. Since the ${\displaystyle \arg \max }$ operation is invariant to positive scaling,
${\displaystyle =\arg \max _{a}\sum _{s',{\dot {r}}}p(s',{\dot {r}}|s,a)\left({\dot {r}}+\gamma {\dot {v}}_{*}(s')\right)={\dot {\pi }}^{*}(s)}$
Therefore, positive scaling of the true reward should still let the RL agent converge to the optimal policy. If ${\displaystyle \beta <0}$ then this equality does not hold. Q-learning with all-zero initialization honors this property by default; note that aside from the entries of ${\displaystyle {\hat {Q}}}$ having values of different scales, the algorithm itself is not inherently affected by scaling. Finding 3 corroborates this fact.

Finally, we interpret the effect of positive reward scaling on RL agents. Earlier, we established that ${\displaystyle {\hat {v}}_{\pi }(s)=\beta {\dot {v}}_{\pi }(s)}$ via the linearity of expectation. However, it is possible to frame this property as follows:

${\displaystyle {\hat {v}}_{\pi }(s)=E_{\pi }\left[\sum _{k=0}^{\infty }\beta \gamma ^{k}{\dot {r}}_{t+k+1}{\bigg |}S_{t}=s\right]}$
${\displaystyle =E_{\pi }\left[\sum _{k=0}^{\infty }\left(\beta ^{1/k}\gamma \right)^{k}{\dot {r}}_{t+k+1}{\bigg |}S_{t}=s\right]}$
Note that ${\displaystyle \gamma }$, the discount rate, is now modified by ${\displaystyle \beta ^{1/k}}$ (with ${\displaystyle \beta >0}$, for ${\displaystyle k\geq 1}$; the ${\displaystyle k=0}$ term is simply scaled by ${\displaystyle \beta }$). Interestingly, ${\displaystyle \beta ^{1/k}}$ approaches 1 as ${\displaystyle k}$ grows, independently of ${\displaystyle \gamma }$, so its influence on the value function asymptotically decreases with the number of timesteps taken. A large value of ${\displaystyle \beta }$ results in larger values of ${\displaystyle \beta ^{1/k}}$, which overall increases the value function, although the decay is steeper than that of smaller ${\displaystyle \beta }$ values. We suggest that positive scaling can be thought of as an application of non-linear scaling to the discount rate ${\displaystyle \gamma }$.

#### Limitations and Future Work

We only explored simple scenarios of the linear model ${\textstyle {\hat {r}}_{t}={\dot {r}}_{t}\beta +\beta _{0}+\epsilon _{t}}$, where only one parameter deviates from the no-corruption setting. Exploring more general cases of the linear model of reward corruption would be a natural next step. In this page, we only examined the Taxi-v2 task, whose environment is simple, discrete, and deterministic. Taxi-v2 can be solved via Q-learning, and thus the effects of reward corruption observed here may not fully extend to other model-free methods or to model-based methods. As stated, these observations were gathered with a Q-learning algorithm with all-zero initialization, and different initial Q-values may yield different empirical results.

### Conclusion

We investigated a possible abstraction for reward corruption mechanism in reinforcement learning agents. The linear model of reward corruption ${\textstyle {\hat {r}}_{t}={\dot {r}}_{t}\beta +\beta _{0}+\epsilon _{t}}$ gives us a simple framework to examine the effects of shifting, scaling, and noise on the performance of RL agents.

## Annotated Bibliography

1. Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
2. Clark, Jack, and Dario Amodei. “Faulty Reward Functions in the Wild.” OpenAI, 22 Dec. 2016, https://openai.com/blog/faulty-reward-functions/.
3. Amodei, Dario, et al. “Concrete Problems in AI Safety.” ArXiv:1606.06565 [Cs], June 2016. arXiv.org, http://arxiv.org/abs/1606.06565.
4. Everitt, Tom, et al. "Reinforcement learning with a corrupted reward channel." arXiv preprint arXiv:1705.08417 (2017).
5. Hadfield-Menell, Dylan, et al. “Cooperative Inverse Reinforcement Learning.” Advances in Neural Information Processing Systems, 2016.
6. OpenAI. Gym: A Toolkit for Developing and Comparing Reinforcement Learning Algorithms. https://gym.openai.com. Accessed 15 Apr. 2019.