Course:CPSC522/Grouped Prioritized Experience Replay (GPER)

From UBC Wiki

Grouped Prioritized Experience Replay (GPER)

This page explores the concept of grouping prioritized experience replay memories to improve learning efficiency in temporal difference reinforcement learning environments.

Principal Authors: Obada Alhumsi, Tommaso D’Amico

Collaborators:

Abstract

Grouped Prioritized Experience Replay is a hypothetical extension of Prioritized Experience Replay where prioritization is done over groups of experiences rather than individual state-action pairs. The page below explores the effectiveness of this method with respect to Q-Learning, Q-Learning with experience replay and Q-Learning with prioritized experience replay.

Builds on

Grouped Prioritized Experience Replay builds upon Prioritized Experience Replay which in turn builds upon the concept of Experience Replay in Temporal Difference Learning. The algorithms will use Q-Learning as a fundamental building block.

Related Pages

Deep-Q Networks are used instead of Q-Learning to apply these algorithms in continuous learning environments.

Experience Replay is also tightly related to Hippocampal Replay in the literature.

Content

Introduction

Grouped Prioritized Experience Replay is a hypothetical extension of Prioritized Experience Replay where prioritization is done over groups of experiences rather than individual state-action pairs. The intention is to make learning more efficient by directing experience replay to the specific action or state groups the agent is most uncertain about. This extension is motivated by training exercises seen in a variety of sports, where an athlete repeatedly performs actions in a set of similar environment states, or performs similar actions in a set of random environment states. The former can be seen in soccer, where grouped training exercises include practising free kicks, corner kicks, counterattacks and other more complicated states that the game presents. The latter can be seen in table tennis, where grouped training exercises include practising a player's forehand return (hitting the ball with the red side of the paddle) and backhand return (hitting the ball with the black side of the paddle). This project explores the effectiveness of this method with respect to Q-Learning, Q-Learning with experience replay and Q-Learning with prioritized experience replay.

Background

Temporal Difference Learning & TD Error

Temporal Difference Learning can be more easily understood in comparison to counterpart methods such as Monte Carlo methods. In Temporal Difference Learning, the predictions for the current state are updated at each new time step, unlike in Monte Carlo methods, where they are only updated at the end of an episode. This means that while an agent is actively exploring its environment, the feedback it receives influences its predictions of the current state at every time step. The TD error reports the difference between the current value estimate of a state and the bootstrapped target formed from the reward actually received plus the estimated value of the next state. Temporal difference learning is the basis of algorithms such as Q-learning, TD(0) and SARSA.
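For concreteness, the one-step TD error used by Q-learning is delta = r + gamma * max_a' Q(s', a') - Q(s, a), and the tabular update moves Q(s, a) a step of size alpha towards that target. A minimal sketch, with illustrative variable names and hyperparameter values:

 import numpy as np

 def td_error(Q, state, action, reward, next_state, gamma=0.99):
     """One-step TD error: delta = r + gamma * max_a' Q(s', a') - Q(s, a)."""
     return reward + gamma * np.max(Q[next_state]) - Q[state, action]

 def q_learning_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
     """Move Q(s, a) a step of size alpha in the direction of the TD error."""
     Q[state, action] += alpha * td_error(Q, state, action, reward, next_state, gamma)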

Experience Replay

Experience Replay is an extension of Q-learning where, instead of running Q-learning only on state-action pairs as they occur, the agent stores these state-action pair experiences in a memory and replays them in later updates, essentially introducing the concept of memory to the algorithm. This way the algorithm makes more efficient use of its experiences and is able to learn faster than standard Q-learning.
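A minimal sketch of such a replay memory on top of tabular Q-learning could look as follows; the buffer capacity, batch size and function names are assumptions made for illustration:

 import random
 from collections import deque

 import numpy as np

 class ReplayBuffer:
     """Fixed-size memory of (state, action, reward, next_state) transitions."""

     def __init__(self, capacity=10000):
         self.memory = deque(maxlen=capacity)

     def add(self, state, action, reward, next_state):
         self.memory.append((state, action, reward, next_state))

     def sample(self, batch_size):
         # Uniform sampling: every stored experience is equally likely to be replayed.
         return random.sample(self.memory, min(batch_size, len(self.memory)))

 def replay(Q, buffer, batch_size=32, alpha=0.1, gamma=0.99):
     """Re-apply the Q-learning update to a batch of remembered transitions."""
     for state, action, reward, next_state in buffer.sample(batch_size):
         target = reward + gamma * np.max(Q[next_state])
         Q[state, action] += alpha * (target - Q[state, action])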

Prioritized Experience Replay

Prioritized Experience Replay attempts to focus Experience Replay on areas where the agent is more uncertain of its actions by prioritizing experiences with a high temporal-difference error (with some stochastic elements). This allows the agent to learn more quickly by avoiding experiences it has already mastered and focusing on the ones where it is making more mistakes. In doing so, Prioritized Experience Replay also makes memory storage more valuable, as only relevant experiences remain stored and are replayed. For more detailed information on this topic, visit the page on Deep-Q Networks.
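In the proportional variant, transition i is replayed with probability proportional to (|delta_i| + eps)^alpha, where delta_i is its TD error. A rough sketch, using a flat array instead of the sum-tree of the original paper and omitting the importance-sampling correction:

 import numpy as np

 def sample_prioritized(td_errors, batch_size=32, alpha=0.6, eps=1e-6):
     """Proportional prioritization: P(i) is proportional to (|delta_i| + eps) ** alpha."""
     priorities = (np.abs(td_errors) + eps) ** alpha
     probs = priorities / priorities.sum()
     # Indices of the experiences to replay; high-error experiences are favoured.
     return np.random.choice(len(td_errors), size=batch_size, p=probs)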

Hypothesis

The hypothesis to be tested is that the concept of grouping priorities in experience replay by state or action can improve training efficiency and minimize error for temporal difference learning algorithms in comparison to plain Q-Learning, Q-Learning with Experience Replay and Q-Learning with Prioritized Experience Replay.

Algorithm

Grouped Prioritized Experience Replay is a form of prioritization focused on groups rather than individual experiences. To accomplish group prioritization, after each time step the latest experience is stored in memory according to its grouping. The TD errors of all stored experiences are then calculated and each group's average TD error is obtained. The group with the highest average TD error is then sampled from uniformly at random.

For example, if grouping is done by action, and after a certain time step the agent moved up, the resulting experience is placed under the action group "move up". The error of each experience is then calculated and averaged for each group. It may be that the agent is most uncertain about movements to the left; those experiences will therefore be replayed, ideally improving its "going left" performance more quickly.

A simple implementation of grouped prioritized experience replay.
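For illustration, the sampling rule described above could be sketched in Python as follows; the grouping function, buffer layout and hyperparameters are assumptions for this page, not the exact code used to produce the results below:

 from collections import defaultdict

 import numpy as np

 class GroupedReplayBuffer:
     """Stores transitions under a group key, e.g. the action taken or a state feature."""

     def __init__(self, group_fn):
         self.group_fn = group_fn          # maps a transition to its group key
         self.groups = defaultdict(list)   # group key -> list of transitions

     def add(self, state, action, reward, next_state):
         transition = (state, action, reward, next_state)
         self.groups[self.group_fn(*transition)].append(transition)

     def sample(self, Q, batch_size=32, gamma=0.99):
         # 1. Compute the TD error of every stored transition and average it per group.
         def td_error(t):
             s, a, r, s2 = t
             return r + gamma * np.max(Q[s2]) - Q[s, a]

         group_error = {g: np.mean([abs(td_error(t)) for t in ts])
                        for g, ts in self.groups.items()}

         # 2. Pick the group with the highest average TD error.
         worst_group = max(group_error, key=group_error.get)

         # 3. Sample uniformly at random within that group.
         transitions = self.groups[worst_group]
         idx = np.random.randint(len(transitions), size=min(batch_size, len(transitions)))
         return [transitions[i] for i in idx]

 # Grouping by action, as in the "move up" example above:
 # buffer = GroupedReplayBuffer(group_fn=lambda s, a, r, s2: a)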

Testing

Taxi-v3 Game Environment Render

Test Environment

The testing for this algorithm was done using a modified version of the Taxi-v3 game environment from the OpenAI Gym library.

Taxi-v3 Game with Passenger inside Taxi

In the original Taxi-v3 game and in the modified version the taxi starts off at a random square and the passenger is at a random location. The taxi drives to the passenger's location, picks up the passenger, drives to the passenger's destination and then drops off the passenger. Once the passenger is dropped off, the episode ends.

The following are the labels demonstrated in the render of the game:

  •    letters and numbers (R, G, B, Y, 0-7): locations for passenger and destinations
  •    blue: passenger
  •    magenta: destination
  •    yellow: empty taxi
  •    green: full taxi
Modified Taxi-v3 game environment

The modified version of the game extends the map and increases the number of possible locations and destinations for the passenger. This was done to obtain a larger discrete environment to train the algorithms on, so as to more clearly display the differences in training efficiency.

The original Taxi-v3 game has 500 states (5 x-axis positions, 5 y-axis positions, 4 destination locations and 5 passenger locations) and 6 actions (move up, move down, move left, move right, pickup passenger and dropoff passenger).

The modified Taxi game has 28800 states (20 x-axis positions, 20 y-axis positions, 8 destination locations and 9 passenger locations) and the same 6 actions as the original.
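As a sanity check on the state count (20 x 20 x 9 x 8 = 28800), a discrete state index can be computed in the style of the original Taxi-v3 encoding; the exact encoding used by the modified environment is an assumption here:

 def encode_state(taxi_row, taxi_col, passenger_loc, destination,
                  n_cols=20, n_passenger_locs=9, n_destinations=8):
     """Flatten the state variables into one index, Taxi-v3 style (assumed layout)."""
     index = taxi_row
     index = index * n_cols + taxi_col
     index = index * n_passenger_locs + passenger_loc
     index = index * n_destinations + destination
     return index  # 0 .. 28799 for the modified game, 0 .. 499 with the original sizes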

Rewards in the modified Taxi game were given as follows:

  • -1 for movement
  • -1 for wrong dropoff or pickup
  • +1000 for successful dropoff

Wrongful dropoff and pickup were kept at a low penalty to prevent the agent from learning to avoid those actions faster than it could find the reward. This was done because, during fine-tuning, it was found that the agents would often get stuck in local minima where they would simply move around until the end of the episode.

Methodology

In order to test the performance of Grouped Prioritized Experience Replay, a set of algorithms was used as a performance baseline. Since GPER builds upon other concepts, the following algorithms were used for comparison:

  • Q-Learning
  • Q-Learning with Experience Replay
  • Q-Learning with Proportional Prioritized Experience Replay
  • Q-Learning with Rank-Based Prioritized Experience Replay
  • Q-Learning with GPER action grouping (6 groups)
  • Q-Learning with GPER x-y position state grouping (4 groups, one per quadrant of the grid)
  • Q-Learning with GPER destination location grouping (8 groups)
  • Q-Learning with GPER passenger location grouping (9 groups)

All algorithms were trained with the same parameters and environment to avoid biases.
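The shared training loop can be sketched as follows. The hyperparameter values shown are placeholders, since the exact values used in the experiments are not listed on this page, the replay-buffer interface (add/replay) is assumed, and the sketch is written against the classic OpenAI Gym API where reset() returns a state and step() returns (state, reward, done, info):

 import numpy as np

 def train(env, replay_buffer=None, n_episodes=40000, max_turns=900,
           alpha=0.1, gamma=0.99, epsilon=0.1, batch_size=32):
     """Tabular Q-learning loop shared by all compared algorithms (sketch)."""
     Q = np.zeros((env.observation_space.n, env.action_space.n))
     for _ in range(n_episodes):
         state = env.reset()
         for _ in range(max_turns):
             # Epsilon-greedy action selection, identical for every agent.
             if np.random.rand() < epsilon:
                 action = env.action_space.sample()
             else:
                 action = int(np.argmax(Q[state]))
             next_state, reward, done, _ = env.step(action)

             # Online Q-learning update on the newest transition.
             Q[state, action] += alpha * (
                 reward + gamma * np.max(Q[next_state]) - Q[state, action])

             # The replay scheme (none, uniform, prioritized or grouped) is the
             # only part that differs between the compared algorithms.
             if replay_buffer is not None:
                 replay_buffer.add(state, action, reward, next_state)
                 replay_buffer.replay(Q, batch_size)

             state = next_state
             if done:
                 break
     return Q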

Results

For the results below the algorithms have played through 40000 games of Modified Taxi-v3 with a limit of 900 turns per game.

The table below summarizes some values from the results:

Scores of each algorithm over 40000 episodes of Modified Taxi-v3
Algorithm | Average Reward | Highest Reward | Average Episode Length | Lowest Episode Length
Q-Learning | 475.49 | 991 | 318.6 | 9
Experience Replay | 499.17 | 991 | 302.96 | 9
Proportional PER | 480.06 | 990 | 316.74 | 10
Rank-Based PER | 483.38 | 990 | 313.89 | 10
Action GPER | -578.94 | 991 | 755.99 | 9
Passenger Location GPER | -539.39 | 991 | 741.19 | 9
Destination Location GPER | 183.27 | 991 | 435.05 | 9
XY Taxi Location GPER | 355.46 | 991 | 364.41 | 9

The following plot represents the reward obtained by the agent at each episode, averaged over every 100 episodes, for each algorithm tested.

Given the rewards of -1 for movement and wrongful actions and +1000 for completing the task, the minimum score is -900 (the 900-turn limit at -1 per turn) and the maximum approaches +1000.

Average rewards over 40000 episodes for each tested algorithm

The following plot demonstrates the average episode length for each algorithm over time. In this plot, lower values are better, as they represent the average number of moves it takes an agent to complete the game over the learning process.

Average Episode Length over 40000 episodes for each tested algorithm

After plotting these, some ambiguity remained about the variance behind these averages: do curves such as Action GPER actually understand the game but only complete it about one in four times, or do they take roughly 750 steps every episode to complete the task? Below is a curve showing the change in rewards over time, further smoothed to make the trends easier to see.

Average Change in Rewards over 40000 episodes for each algorithm tested

Discussion

From the first plot it can be seen that although all algorithms appear to converge, some converge at lower scores than others. As emphasized in the third plot, this is mainly due to the averaging used to clean up the data for visualization: most algorithms reached the maximum high scores at some point, but those shown with lower averages behaved more erratically, failing to successfully terminate the game in most episodes.

It is possible that with further training these would eventually converge to the global maximum, although they appear to have settled in local maxima. Tuning the epsilon-greedy parameter may help in this case, as it would allow these algorithms to explore scenarios they do not seem to have been exposed to before.

The GPER methods tested achieved lower performance than the previously established algorithms. Given that the same parameters were used for all training runs and that the proposed adaptation introduces no extra parameters, the conclusion may be that the technique does not match the performance of the established algorithms.

The second plot, illustrating episode length over time, shows that the tested GPER algorithms never quite learned the game, failing to complete it most of the time. For example, with Passenger Location GPER the agent seems to run out of steps about three out of four times it attempts the game. This may be because the passenger location grouping is only relevant to the first part of the game, where the agent needs to find the passenger.

The best-performing GPER grouping tested was grouping by XY taxi coordinates, where the grid is divided into 4 quadrants/groups. Unlike passenger location and destination location grouping, XY grouping is useful throughout the entire game, which could have led to its better performance. However, action grouping, which is also useful throughout the game, did not perform as well. This suggests that state grouping could generally be better than action grouping; nonetheless, further testing on different environments and parameter fine-tuning should be performed before drawing such a conclusion.

It is interesting to note that XY Taxi Location GPER initially seems to perform on par with the established algorithms. This may be the grouping effect sought by this experiment: by grouping experiences by relevant states, the algorithm restricts prioritization to groups considered relevant by the designer, reducing the need to train on redundant behaviours.

The third plot shows the change in rewards of each algorithm over the learning process. The established algorithms initially have high variance in their rewards but converge to small differences, indicating they have perfected their strategy. The same cannot be said of the GPER algorithms, where large differences in score persist until the end of the learning process. This indicates that they did not converge to the solution and still only manage to complete the task occasionally.

Conclusion and Further Work

Overall, GPER does not seem to perform better than the currently used experience replay algorithms for Q-Learning. It does, however, seem to provide valuable insight into which state types may be relevant to the learning task. The graphs above show a clear difference in performance between the groupings, which suggests a difference in how relevant each state type is to the learning process. This may be useful for understanding what influences the learning process of an agent over time.

Further testing of these algorithms with different environments and parameters is required to determine their functionality and usefulness. As the results above show, little difference among the pre-established algorithms can be seen in this particular example, suggesting that more elaborate environments may be required to highlight their differences.

Another aspect worth exploring is sampling among groups probabilistically instead of always sampling from the group with the worst error; this may prevent the algorithm from getting stuck on a specific issue that might be resolved by learning other aspects of the overall task.
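One way to realize this, sketched below with assumed names, is to draw the group itself with probability proportional to its average TD error rather than always replaying the worst group:

 import numpy as np

 def sample_group(group_errors, alpha=0.6, eps=1e-6):
     """Pick a group with probability proportional to (average |TD error| + eps) ** alpha,
     instead of deterministically replaying the worst group every time."""
     groups = list(group_errors)
     priorities = (np.array([group_errors[g] for g in groups]) + eps) ** alpha
     probs = priorities / priorities.sum()
     return groups[np.random.choice(len(groups), p=probs)]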