# Course:CPSC522/Deep Neural Network

## Game of Go with Monte-Carlo Tree Search and Deep Neural Networks

This page describes the basics of Monte-Carlo Tree Search and it's usage with Deep Neural Networks to Master the Game of Go which has recently defeated the top human Go players.

Principal Author: **Tanuj Kr Aasawat**

Paper 1: Efficient selectivity and backup operators in Monte-Carlo Tree Search

Paper 2: Mastering the game of go with deep neural networks and tree search

## Abstract

The game of Go has been the most challenging classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. It has more than 10^{170} states and up to 391 legal moves. This enormous and intractable search space puts a challenge for efficient decision-making task and to directly approximate an optimal solution using policy or value function. The major breakthrough in computer Go was the introduction of Monte-Carlo Tree Search (MCTS). AlphaGo, the creation of Google's DeepMind, has finally reached a professional level of Go by combining MCTS with policy and value networks. The success in Go has also provided hope that human-level performance could be achieved in other intractable artificial intelligence domain.

This page first describes the paper which presented a novel way of using MCTS for efficient selectivity and backing-up methods which eventually won the 10^{th} KGS computer Go tournament. Later this page describes how the previous paper influenced in combining MCTS with Deep Neural Networks and creating a professional level computer-Go which has already defeated a professional human Go player. As on March 15 2016, AlphaGo defeated world Go Champion Lee Se-dol by 4-1 in a historic five match game.

### Builds on

This page builds on Deep Neural Network, Convolutional Neural Network, Monte-Carlo Tree Search, Supervised Learning and Reinforcement Learning.

### Related Pages

Related pages are: Game of Go and AlphaGo

## Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search

### Introduction

All games of perfect information (like Chess, Checkers, Go, etc.) have an optimal value function which determines the outcome of the game from every board position. These games may be solved by recursively computing the optimal value function in a search tree containing approximately b^{d} possible sequences of moves, where b is the game’s breadth (number of legal moves per position) and d is its depth (game length). In large games, such as chess (b ≈ 35, d ≈ 80) and especially Go (b ≈ 250, d ≈ 150), exhaustive search is infeasible.

The traditional approach for writing a program to play two-person game (like Chess, Go) consists in combining alpha-beta search with a heuristic position evaluator. The heuristic evaluator is based on domain specific knowledge, and provides values at the leaves of the search tree. This technique has been very successful for games such as chess, checkers etc. But this approach doesn't fits well for the Game of Go. Main difficulty in Go is most of the positions in the game of Go are very dynamic. An alternative to static evaluation that fits the dynamic nature of Go positions is Monte-Carlo evaluation. Monte-Carlo evaluation consists in averaging the outcome of several continuations. The accuracy of Monte-Carlo evaluation can be improved with tree search using selectivity thereby pruning the tree and keeping only the best-looking moves after each iteration.

Monte-Carlo tree search is a new search technique, which has revolutionized computer Go, and is rapidly replacing traditional search algorithms as the method of choice in challenging domains such as General Game Playing, real-time strategy games, etc. The basic idea is using self playing i.e to simulate many thousands of random games from the current position and checking if the current position will lead to success. New positions are added into a search tree, and each node of the tree contains a value that predicts who will win from that position. These predictions are updated by Monte-Carlo simulation: the value of a node is simply the average outcome of all simulated games that visit the position. The search tree is used to guide simulations along promising paths, by selecting the child node with the highest potential value. This results in a highly selective search that very quickly identifies good move sequences. This paper presents a new rollout policy based algorithm, **Crazy Stone**, by combining Monte-Carlo evaluation with tree search. The main contribution of this paper is new efficient selectivity and backup techniques.

### Algorithm Structure

The structure of the algorithm consists in iteratively running random simulations from the root position. This produces a tree made of several random games. This tree is stored in memory. At each node of the tree, the number of random games that passed through this node is counted, as well as the sum of the values (score) of these games, and the sum of the squares of the values. Not all the nodes are stored because storing the whole tree would waste too much time and memory. Only nodes close to the root are memorized. This is done by applying the following rules:

– Start with only one node at the root.

– Whenever a random game goes through a node that has been already visited once, create a new node at the next move, if it does not already exist.

As the number of games grows, the probability distribution for selecting a move at random is altered. In nodes that have been visited less than the threshold, moves are selected at random according to heuristics. Beyond this threshold, the node is called an internal node, and moves that have a higher value tend to be selected more often. This way, the search tree is grown in a best-first manner.

### Selectivity

The underlying idea in this technique is to not lose time exploring useless parts of the search tree, thereby the moves that look best should be searched deeper, and bad moves should be searched less.

Generally, most of the selectivity algorithms proposed in the framework of Monte-Carlo evaluation rely on the central limit theorem. When trying to compare the expected values of many random variables, this theorem allows to compute probability that "the expected value of one variable is larger than the expected value of another variable". Some prior works used this principle to propose progressive pruning which cuts off moves whose probability of being best according to the distribution of the central-limit theorem falls below some threshold. Moves that are cut off are never searched again. This method provides a very significant acceleration.

Progressive pruning can save a lot of simulations, but it is very dangerous in the framework of tree search. When doing tree search, the central limit theorem does not apply, because the outcomes of random simulations are not identically distributed: as the search tree grows, move probabilities are altered. For instance, the random simulations for a move may look bad at first, but if it turns out that this move can be followed up by a killer move, its evaluation may increase when it is searched deeper. In order to avoid the dangers of completely pruning a move, it is possible to design schemes for the allocation of simulations that reduce the probability of exploring a bad move, without ever letting this probability go to zero. The basic principle of Crazy Stone’s selectivity algorithm is to allocate simulations to each move according to its probability of being better than the current best move. This scheme seems to be sound when the objective is to obtain an accurate backed-up value, since the probability of being best corresponds to the probability that this simulation would have an influence on the final backed up value if the algorithm had enough time to converge.

### Backup Method

This is the main contribution of this paper. 1,500 positions were sampled at random from self-play games. Basic idea of this method is to determine and backup the moves from current node which led to maximum game playing. The backup operator of Crazy Stone was determined empirically. For each of these 1,500 positions, the tree search was run for 2^{19} simulations. The estimated value of the position was recorded every 2^{n} simulations, along with useful features to compute the backed-up value. Backup formulas were tuned so that the estimated value after 2^{n} simulations matches the estimated value after 2^{(n+1)} simulations. This process was iterated a few times during the development of Crazy Stone.

The adjoining table contains the mean squared error and mean error for different value-backup methods. The error is measured as the difference between the value obtained by the value-backup operator on the data available after S simulations, with the “true” value obtained after 2S simulations. The “true” value is the value obtained by searching with the “Mix” operator, which is explained in following algorithm. Data in the table clearly demonstrates that the mean operator under-estimates the node value(negative error value), whereas the max operator over-estimates(positive error value) it. The robust max operator returns the value of the move that has the maximum number of games. Most of the time, it will be the move with the best value. In case it is not the move with the best value, it is wiser not to back up the value of a move that has been searched less.

Following algorithm demonstrates the “Mix” operator, a linear combination between the robust max operator and the mean operator,that was found to provide the best value back up(error tending to zero). It is with some refinements to handle situations where the mean is superior to the max (this may actually happen, because of the non-stationarity of evaluations).

#### Algorithm

float MeanWeight = 2 * WIDTH * HEIGHT; // Threshold is WIDTH and HEIGHT if (Simulations > 16 * WIDTH * HEIGHT) //Simulations is the number of random games that were run from this node MeanWeight = float(Simulations) / (16 * WIDTH * HEIGHT); float Value = MeanValue; //MeanValue is the mean value of these simulations if (tGames[1] && tGames[0]) { float tAveragedValue[2]; for (int i = 2; −−i >= 0;) tAveragedValue[i] = (tGames[i] * tValue[ i ] + MeanWeight * Value) / (tGames[i] + MeanWeight); //tValue[i] are the backed-up values of the moves, tGames[i] their numbers of simulations if (tGames[0] < tGames[1]) //Move number 0 is the best move (move with the highest number of games) { if (tValue[1] > Value) Value = tAveragedValue[1]; else if (tValue[0] < Value) Value = tAveragedValue[0]; } else Value = tAveragedValue[0]; } else Value = tValue[0]. return Value;

### Discussion

This paper presented a new algorithm for Monte-Carlo tree search with a new efficient backup method. It was implemented in a computer-Go program that performed very well in tournaments, and won a 100-game match convincingly against a state-of-the-art Monte-Carlo Go-playing program, IndiGo indicating that the tree search algorithm presented in this paper is more efficient than the one used in IndiGo. On the other hand, results against GNU Go indicates that Crazy Stone is still weaker.

The major issue with this paper is that it uses a global tree search which won't prove reasonable for scaling it to large boards, particularly to 19*19 Go boards. Further, it needs to overcome tactical weaknesses so as to increase the winning rate.

From the discussion and results it can be analyzed that Crazy Stone is not best although it uses new technique of selecting best move and reducing search space. In the following paper discussion we will see how it develops on the previous paper's contribution for Monte Carlo tree search but with a more stronger policy and value networks so as to play large board, 19*19 Go board game, efficiently and with very high winning percentage.

## Mastering the game of Go with deep neural networks and tree search

### Introduction

AlphaGo is the brainchild of Google DeepMind. AlphaGo has defeated the European Go champion Fan Hui, five to zero. This was the first time a computer Go program had beaten a professional human player in even games on a full-sized board. As on March 15 2016, AlphaGo achieved the biggest milestone in the field of Artificial Intelligence by having a historic win over world Go Champion Lee Se-dol.

#### Framework

AlphaGo's algorithm uses a combination of Monte Carlo tree search and deep neural network implementation of "value network" and "policy network". The neural networks were trained in many stages.

In the first stage a supervised learning(SL) policy network (p_{σ}) was trained from expert human moves using a database of around 30 million moves of recorded historical games. This provided fast, efficient learning updates with immediate feedback and high-quality gradients. They also trained a fast policy network (p_{π}) that can rapidly sample actions during rollouts.

In the second stage, a reinforcement learning (RL) policy network (p_{ρ}) was trained that improved the SL policy network by optimizing the final outcome of games of self-play. This adjusted the policy towards the correct goal of winning games, rather than maximizing predictive accuracy.

In the final stage, a value network (v_{θ}) was trained that predicted the winner of games played by the RL policy network against itself.

Recently, deep convolutional neural networks have achieved unprecedented performance. They use many layers of neurons, each arranged in overlapping tiles, to construct increasingly abstract, localized representations. AlphaGo also employs a similar architecture for the game of Go. They pass in the board position as a 19 × 19 image and use convolutional layers to construct a representation of the position. We use these neural networks to reduce the effective depth and breadth of the search tree: evaluating positions using a value network, and sampling actions using a policy network.

#### Why AlphaGo is so successful?

MCTS described in the **previous paper(described in above section)** uses Monte Carlo rollouts to estimate the value of each state in a search tree. As more simulations(to increase winning prediction) are executed, the search tree grows larger and the relevant values become more accurate. The strongest current Go programs, like Crazy Stone, are based on MCTS, enhanced by policies that are trained to predict human expert moves. These policies are used to narrow the search to a beam of high-probability actions, and to sample actions during rollouts. **However, prior work has been limited to shallow policies or value functions based on a linear combination of input features.**

In the following sections we will see how strong policy and value networks helped AlphaGo to beat human Go champion. The key idea is to reduce search space, because exhaustive search is infeasible. Search space could be reduced either breadth-wise or depth-wise. Strong policies are needed to reduce search space breadth-wise. For reducing search space depth-wise, optimal value function is needed. In the following sections we will explore each one of them and analyze how it skyrocketed the performance of AlphaGo.

### The Three Stages

#### 1. Supervised Learning (SL) of policy networks

In the first stage of the training pipeline, supervised learning was used to predict expert moves in the game of Go thereby reducing the "action candidates" and eventually reducing search space breadthwise.

**Input** to the policy network p_{σ}, where σ is the corresponding weights, is a simple representation of the board state. The policy network was trained on randomly sampled state-action(s, a) pair using stochastic gradient descent to maximize the likelihood of the human move **a** selected in state **s**.

**Training Model:**

Δσ ∝ ∂log p_{σ}(a|s)/∂σ

A 13-layer convolutional neural network, called as SL policy network, was trained from 30 million positions from KGS Go Server. The SL policy network predicted expert moves on a test set with an accuracy of 57% compared to the state-of-the-art from another group of 44.4%. Adjoining graph depicts how small improvement in accuracy led to significant improvement in playing strength. As larger networks achieve better accuracy but are slower to evaluate during search, they exploited GPUs to process such large networks.

#### 2. Reinforcement Learning (RL) of policy networks

The another way to reduce "action candidates" is through self-plays i.e. playing against itself. This is the second stage of training pipeline. RL policy network p_{ρ}, where ρ is the corresponding weights, has the same topology as the SL policy network p_{σ}. Playing game with current policy network and a randomly selected previous iteration of policy network stabilizes training and prevents overfitting.

**Training Model:**

Δρ ∝ ∂log p_{ρ}(a_{t}|s_{t})z_{t}/∂ρ

z_{t} is the terminal reward at the end of the game from the perspective of the current player at time step t with value +1 for win and -1 for loss. The RL policy network won more than 80% of the games against SL policy network. Using no search at all, RL policy network won 85% of the games against Pachi, the strongest opensource Go program having sophisticated Monte-Carlo search.

#### 3. Reinforcement Learning (RL) of value networks

This is the final stage of training pipeline. It focuses on position evaluation to predict how good is the current position. If the outcome is 1, it means the position is a good board position and similarly if the outcome is 0, it means the position is a bad board position. The structure of this value network is similar to that of policy network but it outputs a single prediction instead of probability distribution. The weights of the value network are trained using stochastic gradient descent to minimize the mean squared error(MSE) between predicted value *v _{θ}(s)*, where θ is the corresponding weights, and corresponding outcome

*z*.

**Training Model:**

Δθ ∝ ∂v_{θ}(s)(z-v_{θ}(s))/∂θ

A new self-play data set was generated. It consisted of 30 million distinct positions, each sampled from a separate game. Each game was played between the RL policy network and itself until the game terminated. Training on this data set led to MSEs of 0.226 and 0.234 on the training and test set respectively, thereby indicating minimal overfitting.

### Monte-Carlo Tree Searching with policy and value networks

Alpha Go combines the policy and value networks in an MCTS algorithm that selects actions by look-ahead search.

Flow of this algorithm is: Selection → Expansion → Evaluation → Backup

Each edge *(s, a)* of the search tree stores an action value *Q(s, a)*, visit count *N(s, a)*, and prior probability *P(s, a)*. The tree is traversed by simulation (that is, descending the tree in complete games without backup), starting from the root state. At each time step *t* of each simulation, an action *a _{t}* is

**selected**from state

*s*so as to maximize action value which is proportional to the prior probabilities.

_{t}When the traversal reaches a leaf node s

_{L}at step L, the leaf node may be

**expanded**. The leaf position s

_{L}is processed just once by the SL policy network p

_{σ}. The output probabilities are stored as prior probabilities

*P*for each legal action

*a*.

The leaf node is

**evaluated**in two very different ways: first, by the value network

*v*; and second, by the outcome

_{θ}(s_{L})*z*of a random rollout played out until terminal step T using the fast rollout policy

_{L}*p*; these evaluations are combined, using a mixing parameter λ, into a leaf evaluation

_{π}*V(s*

_{L})V(s_{L}) = (1−λ)v_{θ}(s_{L}) + λz_{L}

At the end of simulation, the action values and visit counts of all traversed edges are updated. Each edge accumulates the visit count N and mean evaluation Q of all simulations passing through that edge. Once the search is complete, the algorithm chooses the most visited move from the root position.

### Discussion

Using this approach, AlphaGo evaluated thousands of times fewer positions than IBM’s Deep Blue which has even smaller search space than Go. Adjoining figure shows the results of a tournament between different Go programs. Programs were evaluated on an Elo scale: a 230 point gap corresponds to a 79% probability of winning. Horizontal lines show KGS ranks achieved online by that program. Needless to say that these results now do not matter as AlphaGo has already unlocked the greatest milestone of AI so far.

Policy and value networks requires several orders of magnitude more computation than traditional search heuristics. To efficiently combine MCTS with deep neural networks, AlphaGo uses an asynchronous multi-threaded search that executes simulations on CPUs, and computes policy and value networks in parallel on GPUs. They also implemented a distributed version of AlphaGo that exploited multiple machines, 40 search threads, 1,202 CPUs and 176 GPUs and it is currently being in use to play against world Go champion Lee Se-dol.

## Annotated Bibliography

[1] Mastering the game of Go with deep neural networks and tree search

[2] Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search

[3] Monte-Carlo tree search and rapid action value estimation in computer Go

[4] Crazy Stone

[5] AlphaGo

[6] Watch AlphaGo vs World Go Champ: First Match

[7] How AlphaGo works