# Better caching using reinforcement learning

## Better Caching using Reinforcement Learning

This wiki explains the experiment of applying UCB1 algorithm to the existing Linux kernel cache algorithm.

Principal Authors: Surbhi Palande
Collaborators of work: Renato Costa <renatomc@cs.ubc.ca>, Jose Pazos <jpazos@cs.ubc.ca>

## Abstract

Unfortunately, finding a good page frame reclaiming algorithm is a rather empirical job, with very little support from theory. Thus it becomes a perfect candidate for getting help from AI techniques such as "Reinforcement Learning". The main objective is to tune the parameters in such a way to achieve good system performance, without asking too many questions about why it works well. Often, it's just a matter of "let's try this approach and see what happens”. In this wiki, we explain the experiment of applying a simple Bandit algorithm called as "UCB1" to the existing Linux kernel cache. In the content that follows we explain in brief how the Linux kernel cache works and then follow it up with our simple technique. We then show some results and explain why they are not very flattering. This is followed by our vision for this project.

### Builds On

Informationally, this work builds on

• Reinforcement Learning [1]
• Lectures on Reinforcement Learning [2]
• UCB1 algorithm [3]
• Understanding of the Linux Kernel [4]
• Linux Virtual Memory Manager [5]
• Applying Machine Learning to the Marker Algorithm [6]

## Applying reinforcement learning to the Linux kernel page cache

### Linux Kernel Page Cache

The Linux Page cache maintains a radix tree for every process address space. For this radix tree it maintains two lists [7].

How Active-inactive lists in the Linux Kernel Cache works (work by Johannes Weiner @ Redhat)[7]

There is another problem with the current Linux page cache: Processes that read or write a lot of data but only do so once also have their data cached. For example, imagine a backup process. A backup process will read nearly all data stored on disk (and maybe write it to an archive file), but caching the data from this read or write operations does not make a lot of sense because it is not very likely to be accessed again soon (at least not more likely than if the backup process did not run at all). However, the Linux page cache will still cache that data and move other data out of memory. Now, accessing that data again will result in slow read requests from the hardware, effectively slowing the system down, sometimes to a point that is worse than if the page cache had been disabled completely. Though there is a way to tell the kernel “do not cache the data accessed by this process”, it needs to be explicit. There is no way in which the kernel learns this on its own. We also want to solve this problem.

### Motivation for using Reinforcement Learning

We want to prevent the above mentioned loss of information that occurs when an evicted page is brought in the cache. We rather want to use this stored information for the life of the process and beyond. Thus, we want to store the information stored in a shadow now in a field of the page itself. We can then use this information while moving a page to the inactive list from the active list. For eg: when moving a page to the inactive list, we can move all the pages which have a history of not be accessed immediately. In the current Linux kernel, this information is completely lost after the page is brought back in the cache. In our experiment we store this information as long as the process lives. Ideally we want make this information persistent across the life of the process. For now, however we store this information only while the process is alive. In the future we plan to store this information on the disk as well. This way we already have an access pattern for every previously accessed page of a restarted process. This information can be used the next time the page is accessed.
We can store the following information:

• how many times was the page moved from the inactive list to the active list
• how many times was the page accessed after eviction on an average.
• how many times was the page accessed before being evicted.

We can compute the penalty based on the above conditions and use this to make the following decisions:

• should the page move to the inactive list?
• should the page be evicted?

### Caching with Reinforcement Learning

#### UCB1 Algorithm

While the full details and intuition of this algorithm are presented [3], we present a condensed view based on the blog, here. This is what is tried and tested for the sake of this assignement and also to keep things simple.

A multi armed bandit is a simple model that is used when we want to achieve a balance between exploration and exploitation. It is also used when we do not want to perform any online training, take actions and then learn how good the actions were. The rewards of each arm are independently collected and are revealed independently too. This clearly fits our requirements with the Linux kernel cache. In general, if the action was good, we increase the reward. If the action was bad, we reduce the reward. The k-arms here are the k choices that we have - i.e which page should we evict amongst the k given pages. The reward that we get is not immediate - but as we get hits and misses on the cache. For a page that we decide to keep, we get a reward. Whereas for a page that we evicted and now needs to be back in the cache, we get a penalty. We want to maximize this reward. However, we do not know the expected reward exactly. If we knew it, it would be trivial to solve the problem. So we use the value that we so far have got, to predict if evicting the page is better than keeping it. By acting greedily, we try to evict a page which has the lowest reward and keep all pages that have higher reward. In this way we exploit our current knowledge of the cached pages. However, we also want to explore as the rewards associated with a page change with the access patterns. We do this ideally by randomly evicting a page. Exploration may produce a greater reward in the end. There are many sophisticated methods for achieving this balance using mathematical formulations. epsilon-greedy algorithms forces the non greedy algorithms to be tried indiscriminately. We increse the reward using a UCB1 bounding factor - so that the increase over time reduces the uncertainty of the action-reward.

nj: i.e the total number of time we decide to take an action (evict/keep) on THIS page from the page cache. We store this variable as "page->mlcache_plays" in our code. We increment this everytime we take an action of evicting a page out the cache i.e by shrinking the inactive page list.

T: total number of rounds - the number of times we take an action and gain a reward. So we update this count every time a reward is calculated for all the pages. In our code we call this as "step" Each time we take a favorable action we increment it by a value that gets smaller as the total number of rounds increase. Thus we start by incrementing our reward by a bigger value and optimistically choosing it. As time passes, we increment the values lesser.

We bound our rewards as follows: ${\displaystyle reward=1+{\sqrt {2*ilog2(T)/n_{j}}}}$

To understand this: imagine that we never say play the action ${\displaystyle n_{j}}$ only once, but we have taken action a lot more times - thus T is high but ${\displaystyle n_{j}}$ is 1. Thus the reward that such a page gets is higher than that of a page where ${\displaystyle n_{j}}$ is higher. Thus our reward grows for pages on which we havent taken an action than pages on which we have taken an action. Ultimately we will try to take an action on a page on which we haven't done so. Note that here, action is "evicting a page". So ultimately we never rule out making a decision of eviction on any page.

In general the algorithm that we use is as follows:

• When a page is added to the cache, it gets a reward as follows:
• If this is the first time the page is added to the cache, then no shadow entry is found. So, set reward = weighted average of the pages in the cache
• If this page was a previously evicted page then there was a shadow entry found. So set reward = shadow entry reward. We also establish that this page was wrongly evicted in favor of the pages in the cache. So we reduce the reward (i.e increase the penalty) of the rest of the pages in the cache. This is with the hope that the rest of the pages will be removed before this page.
• When a page is removed from the Inactive list, we store the rewards in a shadow entry for that page in the radix tree.
• When a page is accessed again - we get a hit and so we increase the reward (i.e decrease the penalty). This is with the hope that the page that has a lot of hits is removed the later than pages that have lesser hits.

Each time we increase the reward, we increase it using the UCB1 upper bound method. This way each page gets a chance of getting more reward and hence being evicted at some point, even if it has the more favorable settings than the other pages. When two pages have the same reward for eviction, then we look at what page had a smaller access after eviction count. If this was the same as well, then we also look at what page was shuffled more between the inactive pages and the active pages. This count indicates that the page has a quiet time while it is in the active cache - i.e it is not accessed as regularly as the other page.

#### Alternatives

Alternatives that I was considering was unsupervised, online TD learning using function approximation method. In particular I was looking at whether this model can be trained using TensorFlow or PyTorch. Initially I was also thinking of using a database cache instead of the Linux kernel cache. However, this now remains as future work and the details can be found in the relevant section.

#### What can we not do?

Ideally given any page we will like to know whether to evict the page or not. At the beginning of this project, the idea was to come up with a mathematical model which will decide if a page should be evicted or not without overfitting. However as was learnt, this decision is completely dependent on the access pattern of the page, the users intentions, could be random as well. The decision to evict a page is not based on the environment. Thus even if we associate parameters with the environment, this will not help in making a decision about evicting the page as the environment is shared amongst all the pages. We also cannot guess user intentions and make guesses about the random access patterns. Perhaps it might be possible to learn the patterns without recording them. However, for the purpose of this assignment, that is what was done. In general, page eviction can be done by using scheduler information - so we know what processes are next going to be scheduled and we can use previous access patterns. Note that in order for the previous access patterns to be available, they need to be persistent. Persistence can only be achieved for all permenant pages. We can never know much about the termporary or dynamic pages, other than during the lifetime of the process. However at the moment we are NOT using the scheduling information and we are also NOT making the recorded information persistant.

Secondly, file based accesses can only be recorded from when a file is read, written to or first time mapped for writing. The case where a mapped file is accessed as an array cannot be detectd through functions. Hence we do not try to chase the mmaped page accesses.

Technically for the purpose this experiment, to keep things simple, we also turn off NUMA support in the kernel.

#### How do we measure success?

Simple: By calculating the ratio of hits :: misses

#### Test Case

We run the following processes:

• Boot the kernel
• make -j all on the linux source code.

We measure their performance measured as a function of time and the hits :: misses ratio. We print this ratio explicitly.

#### Results

No improvement observed. Both the kernels behave similarly.
Comparision of hits to misses ratio between vanilla kernel and modified kernel with UCB1 - measured every minute while doing a make -j all on linux code. Similar behvaior seen.

We can see that the Linux kernel cache actually performs slightly better or similar. However, I believe this is because of the following reasons:

• the same pages are not reaccessed many times as the processes them selves are not long lasting. The hits and caches are lost when a process dies. Persistently saving the access information can help in better decision making. Processes like "make" usually have a definite access pattern that can be exploited by this persistence.
• the modified algorithm actually tries to keep LRU and uses the statistics when choosing between two similar pages. As the next step we should try applying UCB1 in a completely disruptive manner.

-

#### Additional things that we can do in this project

Other factors that can help from AI techniques are:

• Adjusting the ratio of the active - inactive list
• Compare behavior of different bandit algorithms
• Store the caching information persistently. This is really important and should really give better results. Did not do this due to lack of time. This also has the additional advantage that hot blocks can later be identified and used for better placement on disks. SO for a SSD, these hot blocks could be placed together, rather than placing them with the cold blocks. This way we can avoid unnecessary writes to the cold blocks due to the SSD property.

#### For the future

##### Plan 1

Eventually we want to use the experience that we learn from this system to run on a deep reinforcement learning function approximator that will provide us a decision of which page we should evict. The experience that we get over a period of time, makes this unsupervised learning problem into a supervised learning problem. Current Linux cache behaves mostly like LRU. However, it is beneficial sometimes to behave like LFU. The current algorithm for eviction is somewhere in between but certainly not the best. Hopefully function approximators could give better results. There has not been much research in applying AI techniques to caching. Hopefully because this is very much an empirical problem rather than a theorotical one, function approximators might give us a better solution than currently exists. Due to lack of time, I was not able to try out this technique for this assignment. However, this is a topic of my near future research that I will be working on with Dr Alexandra Fedorova and Tony Mason (except that I will be working on application level cache)

One problem in using neural network based reinforcement learning for kernel level caching decisions that is very outright is that, all frameworks run in userspace whereas caching decisions in the kernel are taken in the kernel space. Alternatively, we could apply this approach to user space caches and apply simpler kernel level functions for policy gradient based reinforcement learning to the problem of caching.

##### Plan 2

We could also use techniques like used in this assignment to evaluate what policy works better and accordingly switch between policies depending on the access pattern.

## Annotated Bibliography

1. [ https://youtu.be/2pWv7GOvuf0 David Silver's lecture series on Reinforcement Learning]
2. UCB1 algorithm Cite error: Invalid <ref> tag; name "ucb1" defined multiple times with different content

## Help Taken

Thankful to discussion with Professor David Poole and my fellow student - Jose Pazos <jpazos@cs.ubc.ca>who applied UCB1 algorithm to Linux kernel caching - but in a different way - where he applied UCB1 to a user space processes kernel radix tree of pages and did not keep any history to learn from. I have tried to apply UCB1 to the existing function refill_inactive() which is a generic eviction function and also kept a history that can be used for reinforcement learning.