# Course:CPSC522/Experiments with Reinforcement Learning

## Experimenting with Reinforcement Learning

My hypothesis is that Reinforcement Learning (RL) techniques can be studied effectively by using them to solve "toy" problems of limited scope in a virtual environment. To evaluate this hypothesis, I will attempt to compose and solve such problems using these techniques, and evaluate my results. Essentially, this is an attempt to gain experience with, and understanding of, RL through increasingly complex sandbox examples.

Principal Author: Alistair Wick

## Abstract

Here I present a grid-world sandbox simulator environment for studying basic RL techniques, with the aim of better understanding the implementation and performance of those techniques. I have started with an implementation of tabular discounted-reward Q-Learning for a single agent.

### Related Pages

More advanced and versatile RL systems can be built using Function Approximation; Neural Networks, for example, can be used to approximate the ${\displaystyle Q}$ function.

Visitors interested in the application of these advanced RL methods may find some enlightenment in the Deep Neural Network and Game of Go page, which discusses the application of "Deep Q"/Deep Reinforcement learning to the game of Go.

Action Selection for MDPs also tackles related problems.

## Content

Figure 1: Training an AI agent with Q Learning, then observing the resulting policy. The agent is rewarded for reaching one of two goals - the goal "containing" the reward is chosen at random with a 50/50 chance, so the agent first visits the closest goal before continuing on to the second if no reward is presented. Grid colors indicate the expected future reward as discovered by Q Learning.

### Overview

An AI agent occupies a 2D tiled block world of variable size and composition. The agent attempts to find a policy which maximizes its "reward", a cumulative measure of value provided by the simulation environment in response to the agent's actions. Generally, increasing the reward means reaching target "goal" tiles as quickly as possible while avoiding dangerous "punishment" tiles. The agent traverses the world by taking actions, which move the system from one state to the next until a terminal state ends the current "episode". The action space is kept simple: moves can be made up, down, left and right from a given position. The state space consists primarily of the agent's ${\displaystyle x}$ and ${\displaystyle y}$ position on the grid, but I have expanded this by adding "knowledge", a sort of memory embedded in the agent's state (more on this later), and other extensions are possible.

### Simulation Detail

I chose to restrict the state and action spaces to be finite and discrete, enabling the use of simple tabular RL methods. An easily visualized and conceptually simple world representable under these restrictions is a tiled "grid world", where positions are integer (rather than continuous) vectors. An agent in a grid world, like a piece in Chess, sits exactly on a given tile, never between two tiles or outside the grid. The agent has a small set of available actions at any given state, shown in figure 2, which comprise attempts at directional movement. These may be more or less successful depending on how the particular environment has been configured: moving into a wall always bounces the agent back to its last position, and when a stochastic environment is used, the result of a move is not deterministic. In the stochastic case, the agent may end its move in a tile adjacent to its intended target with a configurable probability ${\displaystyle p_{slide}}$ (Figure 3), and must learn to take this into account when choosing actions.

Figure 2: Available actions for a simple grid-world AI agent
Figure 3: Stochastic probabilities for the next state of an agent taking a right-move (East) action into empty grid tiles.
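The slide mechanic of figure 3 amounts to a single sampling step per move. Below is a minimal Python sketch of that step (the article's implementation is in C# under Unity and is not shown; the wall-bounce rule is assumed to be handled separately):

```python
import random

# Intended move offset -> the two perpendicular "slide" offsets.
SLIDES = {
    (1, 0): [(0, 1), (0, -1)],   # East may slide North or South
    (-1, 0): [(0, 1), (0, -1)],  # West may slide North or South
    (0, 1): [(1, 0), (-1, 0)],   # North may slide East or West
    (0, -1): [(1, 0), (-1, 0)],  # South may slide East or West
}

def resolve_move(direction, p_slide, rng=random):
    """Return the offset actually moved: the intended direction with
    probability 1 - p_slide, or each perpendicular neighbor with
    probability p_slide / 2. Wall bounces are handled elsewhere."""
    if rng.random() < p_slide:
        return rng.choice(SLIDES[direction])
    return direction
```

With ${\displaystyle p_{slide}=0}$ this reduces to deterministic movement, matching the Basic World example later in the article.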

I used the Unity game engine to set up and run my experiments, as it provides a convenient object-oriented world with 2D and 3D placement mechanics, appealing visualizations, and the opportunity to expand to more complex environment models. Creating "maps" (specific tile layouts) for the grid world environment is done in the Unity editor application: the user sets a grid size on the environment object, then places special "tagged" objects on the grid, which are picked up by the environment object and registered as tiles at their set positions, rounded to integer values. In this manner, an environment can be constructed with various spawning locations for the agent, open "empty" tiles, impassable walls, terminal "punishment" tiles, and two classes of "goal" tiles, which are also terminal when active. An episode of training or play then ends when the agent enters an active goal or punishment tile, or when its cumulative reward drops below an arbitrary threshold. The cost of movement and punishment, and the value of rewards for the two classes of goal, can be set by the user. Additional randomness can be introduced by applying independent normal distributions to the rewards of either or both goal classes.

The two classes of goal tiles (A and B) can be active or inactive, and offer no reward (and are non-terminal) in the inactive case. The choice of rewarding either A or B can be set deterministically before learning, in which case the agent learns to seek the relevant goal tiles, or the choice can be left to be made randomly by the environment at the start of each new episode. This becomes an interesting proposition with the addition of "Visited A" and "Visited B" booleans to the state space, which are each set to true when a tile of the respective class has been reached by the agent. In the case where the choice is 50/50, the agent must learn to visit the closest available goal tile, before moving on to the other class of goal if no reward is presented.
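The "Visited A"/"Visited B" expansion of the state space can be sketched as a position plus two booleans. The names and types below are illustrative only, not the article's C# code:

```python
from typing import NamedTuple

class State(NamedTuple):
    """Agent state: grid position plus 'knowledge' of which goal
    classes have already been visited this episode."""
    x: int
    y: int
    visited_a: bool = False
    visited_b: bool = False

def after_goal_visit(state: State, goal_class: str) -> State:
    """Successor state once a goal tile of the given class is reached."""
    if goal_class == "A":
        return state._replace(visited_a=True)
    return state._replace(visited_b=True)
```

Because the flags are part of the state, the tabular Q function can assign different values (and hence a different policy) to the same grid position before and after an inactive goal has been checked.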

### Q Learning

All existing demos use discounted-reward Q Learning[1] to attempt to find an estimated value function, ${\displaystyle Q}$, for possible state-action pairs: that is, ${\displaystyle Q(s,a)}$ is a function which returns the expected future reward for an agent which takes action ${\displaystyle a}$ from state ${\displaystyle s}$. Given my quantified state and action space, I was able to implement this function as a table lookup contained in a C# class TabQ, which stores a numeric value for each state/action pair.
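A minimal Python sketch of such a table lookup follows, assuming a default value of zero for unseen pairs; the article's TabQ is a C# class and its internals are not shown, so this is an illustration of the idea rather than the actual implementation:

```python
from collections import defaultdict

class TabQ:
    """Tabular Q function: one numeric value per (state, action)
    pair, defaulting to 0 for pairs never updated."""
    def __init__(self, actions):
        self.actions = list(actions)
        self._table = defaultdict(float)

    def get(self, state, action):
        return self._table[(state, action)]

    def set(self, state, action, value):
        self._table[(state, action)] = value

    def max_value(self, state):
        """max_a Q(s, a) - the quantity used for the tile coloring
        in the figures."""
        return max(self._table[(state, a)] for a in self.actions)
```

Any hashable state representation (such as an (x, y) tuple, possibly extended with the "knowledge" flags) works as a key here.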

The Q table is updated off-policy with a Bellman backup as the agent explores the world. In abstract terms, at timestep ${\displaystyle t}$ and state ${\displaystyle s_{t}}$, the agent chooses some action ${\displaystyle a_{t}}$, and observes the resulting state ${\displaystyle s_{t+1}}$ and a numeric reward ${\displaystyle r_{t}}$ from the environment. The update then pulls the value of ${\displaystyle Q(s_{t},a_{t})}$ towards that of the maximally valued action in the new state (the action which would be taken on-policy):

${\displaystyle Q(s_{t},a_{t}):=(1-\alpha _{e})Q(s_{t},a_{t})+\alpha _{e}(r_{t}+\gamma \max _{a}Q(s_{t+1},a))}$

${\displaystyle \gamma }$ here is an adjustable constant "discount factor" in ${\displaystyle [0,1]}$, which reduces the expected value of distant rewards. ${\displaystyle \alpha _{e}}$ is the learning rate for episode ${\displaystyle e}$, which adjusts the "strength" of the value updates: ${\displaystyle \alpha =1}$ is only optimal in totally deterministic environments, as we will see, but a low learning rate leads to weak updates and consequently slow learning. Most of my demos involve some level of non-determinism, so a fixed high learning rate is rarely an option. I use a per-episode exponentially decaying learning rate, beginning with fast, noisy updates before "smoothing out" the resulting non-optimal policy as the rate gradually declines. This decay is set as:

${\displaystyle \alpha _{e}:=\alpha _{end}+(\alpha _{start}-\alpha _{end})(1-\tau _{\alpha })^{e}}$

Where ${\displaystyle \tau _{\alpha }}$ controls the decay rate.
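The schedule above translates directly into code. A sketch (the function and parameter names are mine, not the article's):

```python
def alpha_for_episode(e, alpha_start, alpha_end, tau):
    """Per-episode learning rate, decaying exponentially from
    alpha_start towards alpha_end at a rate controlled by tau:
    alpha_e = alpha_end + (alpha_start - alpha_end) * (1 - tau)**e"""
    return alpha_end + (alpha_start - alpha_end) * (1.0 - tau) ** e
```

At episode 0 this returns ${\displaystyle \alpha _{start}}$ exactly, and it approaches ${\displaystyle \alpha _{end}}$ asymptotically as ${\displaystyle e}$ grows.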

The policy optimized by Q-Learning is to simply take the highest-valued action at any given state: ${\displaystyle \pi (s)=\arg \!\max _{a}Q(s,a)}$.
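The backup and the greedy policy together fit in a few lines on a plain dict-based table. This is a Python sketch under the same tabular assumptions, not the article's C# code; a terminal next state is simply one whose actions all still have the default value of 0:

```python
def q_update(q, s, a, r, s_next, actions, alpha, gamma):
    """One off-policy Bellman backup on a dict-based Q table:
    Q(s,a) := (1-alpha)*Q(s,a) + alpha*(r + gamma * max_a' Q(s',a'))."""
    best_next = max(q.get((s_next, a2), 0.0) for a2 in actions)
    q[(s, a)] = (1.0 - alpha) * q.get((s, a), 0.0) + alpha * (r + gamma * best_next)

def greedy_policy(q, s, actions):
    """pi(s) = argmax_a Q(s, a)."""
    return max(actions, key=lambda a2: q.get((s, a2), 0.0))
```

With ${\displaystyle \alpha =1}$ the old value is discarded entirely, which is why that setting only suits deterministic rewards, as discussed above.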

### Results

Basic World: Here I'll start with a simple, largely deterministic example. The agent can "spawn" (start an episode) in any of the four corners of a 19x19 square map - the random selection of one of the four spawn locations is the only element of non-determinism in this example. The agent must navigate to the single available goal tile, moves deterministically (${\displaystyle p_{slide}=0}$) with cost ${\displaystyle -0.2}$, and always gets the same ${\displaystyle +5}$ reward for entering the terminal goal state. Thanks to this, I am able to set ${\displaystyle \alpha =1.0}$ for rapid learning. Here I show a timelapse of training progress on the map with the agent hidden from view, alongside video of the agent following the resulting optimal policy. The coloring of the map tiles on the left indicates ${\displaystyle \max _{a}Q(s,a)}$, the value of the expected future reward when the agent is at that tile. I also overlay crude "arrows", which indicate the direction selected by the learned policy at that tile: the white end points in the policy direction.

Note the general training progress, especially towards the start: random movements with no positive rewards gradually lower Q around the spawn points, biasing the agent to moving towards the center, where it eventually discovers the goal. The goal reward is then propagated outward with increasing speed, as the probability of randomly moving into the propagated area increases as the area grows. The hard boundary of this area reflects the choice of ${\displaystyle \alpha =1}$, and would be smoothed out with a lower learning rate. The reward then floods the map, before the policy is gradually resolved into an optimal one (note the changing arrow directions).

See below for a graph of the cumulative reward at termination for the simple world above. This shows both exploratory training and on-policy validation (which runs without learning) over the course of 1000 episodes. In this fairly trivial case, an optimal policy for the full set of spawn states is reached after around 200 episodes. An optimal policy for the entire map is reached through random exploration after 1,000-3,000 episodes, and while this is interesting to see, it is of little use in this toy example: an on-policy agent in a world with deterministic movement can never reach tiles outside the canonical path laid out by the learned policy.

Random Reward: In the next example, I challenge a deterministic agent to distinguish between two rewards of equal expected value, where the upper, green goal A is fixed at ${\displaystyle r_{a}=10}$, and the lower, blue goal B is randomly distributed ${\displaystyle r_{b}\sim {\mathcal {N}}(10,4^{2})}$. There are still no punishment tiles or complex movement—in fact, movement is essentially constrained to one dimension, as the world is a U-shaped tunnel with a different goal at either end. This example serves to highlight the importance of tuning with this learning model, and of tuning the ${\displaystyle \alpha }$ learning rate in particular.

In a small number of episodes (around 100-200), an agent with ${\displaystyle \alpha =1}$ rapidly propagates the reward for both goals, but inevitably propagates extremely noisy rewards for B. This noise leads to cyclic "sticking points" in the learned policy, where an on-policy agent would become permanently trapped: far from optimal behavior! This may even involve the agent simply colliding with the wall, to stay in a state with higher expected reward than those adjacent to it. This cannot be resolved without altering the learning parameters, particularly ${\displaystyle \alpha }$; an agent set up like the one on the left in the above video will never learn an optimal policy. Lowering alpha slightly (not shown) leads to "waves" propagating out from B, which can also cause policy cycles (e.g. two adjacent states at the "peak" of the wave pointing at one another). Instead, alpha must be dropped extremely low, in this case down to around 0.01, and learning run for thousands of episodes to fully propagate values close to the expected reward, rather than just a random sampling of the reward. In this case, the spawn state flips occasionally between pointing to goal A and goal B during the latter phase of training. This, finally, is the behavior we would expect of a policy optimizer: randomly choosing between two equivalent, and equally attainable, rewards. Note that there is no obvious way for a Q learner to prefer fixed over stochastic rewards: the ${\displaystyle \alpha =1}$ learner chose the lower tunnel approximately as often as the upper one, but got stuck in every such case. The only way to resolve random rewards with Q learning is to spend an extended learning period with low ${\displaystyle \alpha }$, smoothing the Q function. Naturally, this becomes more difficult as the random variance and the state space grow.
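The effect of ${\displaystyle \alpha }$ on noisy rewards can be seen in isolation: for a state adjacent to a terminal goal, repeated fixed-${\displaystyle \alpha }$ backups reduce to an exponential moving average of the reward samples, whose steady-state noise shrinks as ${\displaystyle \alpha }$ does. A small illustrative sketch, using goal B's ${\displaystyle {\mathcal {N}}(10,4^{2})}$ distribution:

```python
import random

def ema_estimate(samples, alpha):
    """Repeated fixed-alpha backups toward noisy terminal rewards
    reduce to an exponential moving average: q := (1-alpha)*q + alpha*r."""
    q = 0.0
    for r in samples:
        q = (1.0 - alpha) * q + alpha * r
    return q

# Rewards drawn as for goal B: N(10, 4^2).
rng = random.Random(42)
samples = [rng.gauss(10, 4) for _ in range(5000)]
```

With ${\displaystyle \alpha =1}$ the estimate is just the most recent noisy sample, while ${\displaystyle \alpha =0.01}$ lands close to the true mean of 10, mirroring the smoothing behavior described above.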

Stochastic movement: Here, I'll briefly cover some effects of the stochastic movement described above on the agent. In this environment, the agent must navigate a short winding tunnel, before passing through a more open area with terminal punishment tiles to reach the goal. The major difference between this and previous worlds is the addition of randomness to the movement, with ${\displaystyle p_{slide}=0.3}$, a 30% chance of sliding off the desired direction of movement to either side. Sliding into a wall (or the edge of the map) simply bounces the agent back into the desired tile, but this introduces an additional cost of ${\displaystyle -1}$ over the movement cost of ${\displaystyle -0.2}$. This makes travel through the tunnel an expensive proposition, and complicates learning thanks to the random distribution of these collision costs. The agent must also find the optimal policy for passing the punishment tiles—moving close to them risks terminating the episode and removing any possibility of later reward, but passing further away introduces the cost of collisions with the map edge. As shown below, the agent finds that avoiding the punishment tiles is most important, but also moves away from the wall once past the dangerous area.

Two-tunnel: Finally, I'll show an environment combining stochastic movement, multiple spawns, and a new element: the random goal activation described in the Simulation Detail section above. This requires the expansion of the state space to include historical information, or "knowledge", of whether or not the agent has visited each goal. This is a similar setup to the one shown in figure 1, where the expanded state space is reflected in the visualization. The optimal policy, which the agent successfully learns, is to visit the closest goal before continuing on to the further goal if no reward is presented (shown here when the goal turns gray). This naturally changes if the rewards are unequal, either by value or distribution.

Significantly more episodes of training than in previous examples are required to learn successfully, thanks to both the expanded state space and the randomness of the goal rewards; as in the random reward example, the policy must be "smoothed" with decreasing ${\displaystyle \alpha }$ over time, removing policy loops. These factors compound together, as smoothing requires many visits to each state.

### Conclusions

One important observation is that Q-Learning has a tendency to slip into suboptimal policies when the results (rewards) of those policies are both stochastic and poorly differentiated from alternative policy options. A fairly benign example is the tendency of the agent in the two-tunnel environment to "prefer" one goal over another in a particular run, which occurs more frequently when collisions are punished more strongly, and (equivalently, from an expected reward perspective) when a high ${\displaystyle p_{slide}}$ causes more frequent collisions in the tunnels. The progression of learning goes something like this: the agent, with a high ${\displaystyle \alpha }$ and ${\displaystyle \epsilon }$, explores the environment and rapidly propagates noisy, stochastic rewards back to the spawn states. These noisy rewards inevitably end up appearing unequal to the agent at various points, and so the agent becomes biased towards one goal in particular - perhaps a string of immediate successes at one goal coincided with the propagation of lower, delayed rewards at the other. As ${\displaystyle \alpha }$ grows lower, which is necessary to allow smoothing of stochastic rewards, the agent's ability to make major policy changes is significantly reduced; even if it does, randomly, travel down the less-preferred tunnel, the Q updates are so weak that the true reward value is not successfully propagated back. There are several potential ways of solving problems of this nature, including manually tweaking the parameter start/end values and decay rates, or brute-forcing the problem with more episodes of experience. However, this illustrates a wider problem with RL systems in the literature: they are very difficult to train effectively, often lack reliable convergence to optimal policies, and require significant manual, human work to achieve good solutions.

Epsilon exploration is difficult to manage - low epsilon values open the door to locally optimal but globally poor policies, while even moderately high values can cause an agent to frequently self-terminate early in an episode in "dangerous" environments, which also slows or prevents meaningful exploration. Intelligent exploration strategies are therefore desirable. In a finite state space, an exploration policy which attempts to push out into states which have not recently been visited could be feasible. This would be more difficult in an infinite state space with an approximated Q function, as the agent could not simply "keep track" of where it has visited without additional approximations.
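For reference, the ${\displaystyle \epsilon }$-greedy rule discussed here is simple to state in code. A sketch, assuming the dict-based Q table used in earlier illustrations rather than the article's C# classes:

```python
import random

def epsilon_greedy(q, s, actions, epsilon, rng=random):
    """With probability epsilon, explore with a uniformly random action;
    otherwise exploit the greedy action. `q` is a dict mapping
    (state, action) pairs to values, defaulting to 0."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q.get((s, a), 0.0))
```

Note that the exploration branch draws uniformly over all actions, including the greedy one, which is the standard formulation; the trade-off described above comes entirely from how ${\displaystyle \epsilon }$ is scheduled over training.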

As shown in the random reward example, noisy rewards can lead to closely related states creating catastrophic policy loops. For example, at a particular timestep, acting on-policy may yield ${\displaystyle \pi (s_{a})\Rightarrow s_{b}}$ but ${\displaystyle \pi (s_{b})\Rightarrow s_{a}}$, creating an endless loop. This is especially problematic for an agent with deterministic movement but stochastic rewards, as the agent cannot exit these loops by chance, and will instead follow the doomed policy until its reward drops below the termination threshold. These loops can arise in otherwise well-resolved policies, typically directly adjacent to potential goal tiles. The "smoothing" method described in this article was the only reliable way I found to fix them in my tabular Q learning implementation: finalizing the training with an extended period of very low ${\displaystyle \alpha }$, dropping as low as ${\displaystyle \alpha =0.004}$ in some cases.

The lessons learned here might be transferable to (or provide motivation for) more complex Reinforcement Learning systems, especially those like DeepMind's Atari player which are based on Q Learning methods[2]. Fundamentally, exploring large state spaces by what amounts to trial and error is a difficult proposition.

## Annotated Bibliography

1. Watkins, C.J.C.H. (1989). Learning from Delayed Rewards (Ph.D. thesis). Cambridge University.
2. Mnih, Volodymyr, et al. "Playing Atari with Deep Reinforcement Learning." arXiv preprint arXiv:1312.5602 (2013).