2 Programming [50 Points]
2.1 Setup
2.2 Introduction and Objective
It’s important to note that the MCTS you implement in this homework isn’t exactly the same as the one presented in lecture. The main difference is that instead of performing a rollout for the simulation step, you’ll use a neural network to get the value. Make sure to read the whole spec before coding; it’ll save you a huge amount of time.
You will need to submit two files to the Gradescope autograder:
1. Your copy of the Google Colab notebook eecs492hw4.ipynb. This notebook will contain both your code and your written answers to the conceptual questions!
2. A file called MCTS.py containing your completed MCTS code (see the tip in Section 2.4).
2.3 Othello
2.4 Google Colab Notebook
You can find the Google Colab notebook to start here.
Important: Make your own copy of the Google Colab notebook before starting to code!
If your runtime disconnects because of inactivity, you will have to run the necessary code blocks again. Also note that we are not using a GPU in Google Colab, so do not manually enable one.
Tip: You can directly edit the cell that contains MCTS.py. When you are done, just copy the code into a file called MCTS.py and submit it to the autograder along with your copy of the eecs492hw4.ipynb file to Gradescope for the coding portion.
2.5 High-Level Overview of the AlphaZero Algorithm
1. For numIters (number of iterations), execute episodes numEps times.
2. For each episode, perform MCTS (MCTsearch()) on the tree until a terminal node is reached (this is where your implementation in the MCTS class will be used).
3. Each search will generate training samples.
4. Train the neural network on the generated training samples.
5. Have the neural network play against the old version of the neural network arenaCompare times.
6. If it did better, make the new neural network the best one.
7. Repeat.
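To make the flow concrete, here is a minimal sketch of this outer loop in Python. The object names and methods used below (coach, net, executeEpisode, train, clone, pit) are illustrative stand-ins, not the notebook’s exact API; only the overall flow of steps 1–7 is taken from the list above.

```python
# A hedged sketch of the outer AlphaZero loop. The objects (coach, net) and
# their methods (executeEpisode, train, clone, pit) are illustrative stand-ins,
# not the notebook's exact API; only the overall flow mirrors steps 1-7.

def alphazero_loop(coach, net, args):
    best_net = net
    for _ in range(args['numIters']):                  # step 1: learning iterations
        examples = []
        for _ in range(args['numEps']):                # step 2: self-play episodes
            # executeEpisode() repeatedly calls MCTsearch() until the game ends
            # and returns training samples (step 3).
            examples += coach.executeEpisode()

        candidate = best_net.clone()
        candidate.train(examples)                      # step 4: train the network

        # step 5: play the new network against the old one arenaCompare times
        new_wins, old_wins = coach.pit(candidate, best_net, args['arenaCompare'])

        if new_wins > old_wins:                        # step 6: keep the better net
            best_net = candidate
        # step 7: repeat
    return best_net
```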
Overview
The code in the MCTS cell block contains additional comments to help you understand which functions are relevant to your code. Below is high-level information that may be helpful for your understanding of the algorithm.
eecs492hw4.ipynb
There is an argument variable (args) for you to tune hyperparameters and experiment with if you like. For the actual project, do not modify these values; the existing hyperparameters are needed for grading. Only modify the hyperparameters if you are curious about how the model will perform when they are tuned. The most important hyperparameters are given below.
– numIters: The number of iterations the algorithm will take when learning.
– numEps: The number of episodes each iteration takes. This determines how many training samples the model will have each iteration.
– arenaCompare: The number of times the new neural network plays against the old one.
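For orientation, these hyperparameters live in the args variable as keys along the lines of the sketch below. The values shown are placeholders for illustration only; keep the values that already ship with the notebook, since those are what the autograder expects.

```python
# Illustrative shape of the args hyperparameters. The values here are
# placeholders, NOT the graded settings; do not copy them into the notebook.
args = {
    'numIters': 3,       # iterations the algorithm takes when learning
    'numEps': 10,        # episodes per iteration (number of training samples)
    'arenaCompare': 20,  # games the new network plays against the old one
}
```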
Coach class
This class is the helper used to train your algorithm. Most of the learning will be done here.
– This performs the training of the model
– This is where MCTsearch() will be indirectly called through getActionProb(). Your implementation will update the policy values, and when executeEpisode() tries to take an action, it will take one according to the policy values you have updated. The actions this function takes will be compared against the instructor’s solution set, so although you won’t write this function, it will test your search implementation indirectly (see the sketch below).
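As a rough picture of that indirect call chain, here is a hedged sketch of how executeEpisode() might use getActionProb(). The game API (reset/ended/step), the visit-count helper, and all internals shown here are assumptions for illustration, not the notebook’s actual Coach code.

```python
import numpy as np

# Hedged sketch of the call chain executeEpisode() -> getActionProb() ->
# MCTsearch(). Everything below is an illustrative assumption.

def get_action_prob(mcts, board, num_searches):
    # Each MCTsearch() call updates the statistics your SELECT / SIMULATE /
    # BACKPROPAGATE code maintains; the policy is read off the visit counts.
    for _ in range(num_searches):
        mcts.MCTsearch(board)
    counts = np.asarray(mcts.visit_counts(board), dtype=float)  # assumed helper
    return counts / counts.sum()

def execute_episode(game, mcts, num_searches):
    board, samples = game.reset(), []
    while game.ended(board) == 0:
        pi = get_action_prob(mcts, board, num_searches)
        samples.append((board, pi))                   # training sample for the network
        board = game.step(board, int(np.argmax(pi)))  # act per the updated policy
    return samples
```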
MCTS.py
– gameEnded(): Determines if the current board position is the end of the game. Returns r, a value that is 0 if the game hasn’t ended, 1 if the player won, and -1 if the player lost.
– Computes r, the reward, from the neural network. Use this for the simulation step instead of a rollout. Returns the reward of the simulated rollout.
– NOTE: Assume that the neural network gives a reward of the appropriate sign whenever it is called. That is, if the neural network returns +1, the current player won the game; if the neural network returns -1, the opponent won the game. This assumption allows us to not track whose turn it is in the game tree. Instead, we know that when we return a reward back up to a parent node, the parent is our opponent and therefore the reward should be negated. (See the hints for each function below for more specific information on where these reward adjustments take place.)
– Returns all valid actions you can take, as a list of integers.
– getConfidenceVal(): Returns u, an upper confidence bound value for a given state and action (a hedged PUCT-style sketch follows this list).
– updateValues(): Updates the necessary values for backpropagation for a given state and action.
– expand(): Expands the state in MCTS.
– simulate(): This is the simulate phase of MCTS.
– select(): This is the selection phase of MCTS.
– backpropagate(): This is the backpropagation phase of MCTS.
– MCTsearch(): This is the main function for your Monte Carlo Tree Search.
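Since the confidence value is only described in prose above, here is a hedged sketch of the standard AlphaZero-style PUCT bound such a function typically computes. The exact formula, the exploration constant cpuct, and the statistics tables named here (Qsa, Nsa, Ns, Ps) are assumptions for illustration; follow whatever the notebook actually defines.

```python
import math

# Hedged sketch of an upper-confidence value in the usual AlphaZero/PUCT form.
# The table names and the formula are illustrative assumptions, not the graded code.

def get_confidence_val(Qsa, Nsa, Ns, Ps, s, a, cpuct=1.0):
    """u = Q(s, a) + cpuct * P(s, a) * sqrt(N(s)) / (1 + N(s, a))"""
    q = Qsa.get((s, a), 0.0)      # mean value observed for taking a in s
    n_sa = Nsa.get((s, a), 0)     # visit count for the (s, a) pair
    prior = Ps[s][a]              # network prior probability for action a
    return q + cpuct * prior * math.sqrt(Ns.get(s, 0)) / (1 + n_sa)
```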
SELECT
Hints for first TODO:
• Remember that self.gameEnded returns 0 if the game is not over, 1 if the current player won, and -1 if the current player lost. Thus, if r is something besides 0, we should treat it as an end state.
– Note that in an end state, the state, board, and action should all be None, thus we should return something that looks like None, None, None, reward
– In this case, take a look at how expand and simulate may help in this scenario.
∗ The simulate function gives the reward from the simulated rollout (neural network prediction) starting from the current state.
∗ The expand function adds the current state to the set of visited states.
• Remember that when we return from the select function, we are returning a level up to the opposite player. For example, if a reward is -1 for the max player, we should return 1 for the min player. Thus you should reverse the value returned.
Hint for second TODO: Use the self.getConfidenceVal function to get the highest confidence value and the associated best action to take. Note that this is conceptually where we iterate over the possible actions in the select phase of MCTS to find the best course of action.
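Putting the two hints together, a hedged sketch of a SELECT step might look like the following. The attributes and helpers used here (self.visited, getValidActions, the exact return convention, and whether getConfidenceVal already does the argmax for you) are assumptions for illustration; defer to the notebook’s actual signatures.

```python
# Hedged sketch of SELECT, written as a method of the MCTS class.
def select(self, state, board):
    r = self.gameEnded(board)
    if r != 0:
        # First TODO, end state: nothing to descend into, and the reward goes
        # back to the parent (the opposing player), so negate it.
        return None, None, None, -r

    if state not in self.visited:
        # Unvisited leaf: expand it, then use the network value in place of a
        # rollout, again negating before handing the result to the parent.
        self.expand(state)
        return None, None, None, -self.simulate(state, board)

    # Second TODO: find the action with the highest upper-confidence value
    # (shown as an explicit loop; the notebook's getConfidenceVal may already
    # return the best value and action for you).
    best_u, best_a = float('-inf'), None
    for a in self.getValidActions(board):       # helper name assumed
        u = self.getConfidenceVal(state, a)
        if u > best_u:
            best_u, best_a = u, a
    return state, board, best_a, 0              # continue the descent with best_a
```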
BACKPROPAGATE
– When r’ is not 0, this indicates that we have reached an end state (game win or game loss) in the rollout. As a result, we can set the overall “accumulated” reward to the current reward in the sequence since it is an end state.
– In the other case when r’ is 0, we know that this must be an intermediate state, action pair that we have not discovered before. When we come across such a pair, we must call self.updateValues with the current accumulated reward which propagates rewards into the search tree. Afterwards, we must adjust the reward to that of the other player (max → min, min → max).
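As one hedged reading of these two cases, a BACKPROPAGATE pass could look like the sketch below. The shape of path (a list of (state, action, reward) triples recorded during selection, leaf last) and the iteration order are assumptions for illustration.

```python
# Hedged sketch of the backpropagation pass described above.
def backpropagate(self, path):
    reward = 0
    for state, action, r in reversed(path):   # walk from the leaf back to the root
        if r != 0:
            # End state (or simulated result): set the accumulated reward to it.
            reward = r
        else:
            # Intermediate (state, action) pair: push the accumulated reward
            # into the tree, then flip its sign for the parent (other player).
            self.updateValues(state, action, reward)
            reward = -reward
```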
SIMULATE
Hint for second TODO: At this step, we have just obtained a board state that produces a reward value that isn’t equal to 0 (meaning the rollout has concluded in either a loss or a win). After we complete a rollout in the MCTS algorithm, what should we do to update our values from all the nodes we visited during the rollout?
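The hint is pointing at backpropagation over every node visited during the search. As a hedged illustration of how the phases might fit together, the sketch below records the visited (state, action, reward) triples during selection and hands them to backpropagate once a nonzero reward comes back; the bookkeeping and the nextState helper are assumptions, not the notebook’s exact MCTsearch.

```python
# Hedged illustration of one full search: select until a nonzero reward is
# returned, then backpropagate over everything visited along the way.
def MCTsearch(self, state, board):
    path, r = [], 0
    while r == 0:
        state, board, action, r = self.select(state, board)
        path.append((state, action, r))
        if r == 0:
            # Advance to the child chosen by select (helper name assumed).
            state, board = self.nextState(board, action)
    self.backpropagate(path)   # update values for every node visited this search
    return r
```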
2.6 Helpful Tips / Bug Board
Bug 1: Returning r instead of -r in the select end cases
If this is your bug, your output will look like this: [[...(5, 0), (4, 3), (2, 1), (3, 1), (4, 0), (1, 0), (4,2), (4, 1)...]]
Note that we truncated the entire output, but if this is where your output starts being different from the instructor’s solution, double-check this.
Bug 2: Not using the old reward in self.updateValues within backpropagate
If this is your bug, your output might look similar to the output for bug 1. Double-check that you are using the old reward in the else case (non-end state) for backprop.