EECS 492 HW4

2 Programming [50 Points]

For this coding section, there are two conceptual questions at the end that you must answer to receive full credit on the homework. Please answer these questions in the eecs492hw4.ipynb file and submit it along with MCTS.py on Gradescope for the coding portion.

2.1 Setup

All programming assignments for this course use Python. If you're new to Python or not very experienced with it, please come to office hours; we're happy to help! We recommend checking out the Python section in the Resources document on Google Drive. We have also shared a Debugging Tips document to help you debug your code.

2.2 Introduction and Objective

In this assignment, we'll be implementing AlphaZero for Othello. There are many parts to AlphaZero, but we will only focus on the Monte Carlo Tree Search (MCTS) portion of it. As long as you have a basic knowledge of MCTS and machine learning (which you should at this point in the course!), you'll be fine with understanding the project. To summarize, you'll be implementing the MCTsearch(), select(), backpropagate(), and simulate() functions in the MCTS class provided in one of the cells of the Google Colab notebook. Once you have finished your code, you will submit your MCTS.py file (which you create from your code in the Colab notebook) to Gradescope.

It's important to note that the MCTS you implement in this homework isn't exactly the same as the version presented in lecture. The main difference is that instead of a rollout for the simulation step, you'll use a neural network to get the value. Make sure to read the whole spec before coding. It'll save you a huge amount of time.

2.2.1 Submission Instructions

You will need to submit two files to the Gradescope autograder:

1. Your copy of the Google Colab notebook eecs492hw4.ipynb. This notebook will contain both your code and your written answers to the conceptual questions!

2. The implemented MCTS class in MCTS.py. Once you are done with your code in the Colab notebook, copy the code in the MCTS cell into a file called MCTS.py. (The Gradescope autograder can't grade Colab notebooks, which is why this copy-and-paste into a Python file is necessary!)

2.3 Othello

An understanding of Othello isn't needed for the implementation, but knowing the game helps with understanding the test cases and the actions you can take. You don't need an in-depth understanding of the game; you can find details here.

2.4 Google Colab Notebook

Because this programming section runs neural networks, some computers may be unable to run this assignment locally. To make sure everyone can run the code efficiently, we require students to write and run their code in a Google Colab notebook. To start this programming assignment, please make a copy of eecs492hw4.ipynb.

Important: Make your own copy of the notebook before starting to code! To make a copy of the Google Colab notebook:

1. When you open the notebook, go to “File” > “Save a copy in Drive” in the upper left-hand corner.
2. The copied notebook will default to being in your main Drive. Feel free to update the location so you know where the file is!

You can find the Google Colab notebook to start here.

If your runtime disconnects because of inactivity, you will have to run the necessary code blocks again. Also note that we are not using a GPU in our Google Colab, so do not manually enable the GPU.

Tip: You can directly edit the cell that contains MCTS.py. When you are done, just copy the code into a file called MCTS.py and submit it to the autograder along with your copy of the eecs492hw4.ipynb file on Gradescope for the coding portion.

2.5 High-Level Overview of the AlphaZero Algorithm

This section will help you answer the conceptual questions at the end of the coding section. The bolded variables are hyperparameters discussed in the eecs492hw4.ipynb section. This isn't necessary for the implementation; it's just nice to know how the whole thing works. (A short sketch of this loop follows the list below.)
1. For each of the numIters iterations, execute numEps episodes
2. For each episode, perform MCTS (MCTsearch()) on the tree until a terminal node is reached (this is where your implementation in the MCTS class will be used)
3. Each search generates training samples
4. Train the neural network on the generated training samples
5. Have the new neural network play against the old version of the neural network arenaCompare times
6. If the new network did better, make it the new best network
7. Repeat.
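To make the steps above concrete, here is a minimal, illustrative sketch of the training loop in Python. The names used here (coach, executeEpisode, pit, nnet, train) are placeholders standing in for whatever the Coach class in the notebook actually exposes; they are assumptions for illustration only, not the graded code.

# Illustrative sketch of the AlphaZero training loop (hypothetical names).
def learn_sketch(coach, args):
    for _ in range(args['numIters']):                  # step 1: learning iterations
        train_examples = []
        for _ in range(args['numEps']):                # step 2: self-play episodes driven by MCTS
            train_examples += coach.executeEpisode()   # step 3: each episode yields training samples
        new_net = coach.nnet.train(train_examples)     # step 4: train on the generated samples
        wins, losses = coach.pit(new_net, coach.nnet, args['arenaCompare'])  # step 5: arena games
        if wins > losses:                              # step 6: keep the better network
            coach.nnet = new_net
        # step 7: repeat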

Overview

All relevant code for the homework will be in the eecs492hw4.ipynb file. Note that one of the cells in this file should be copied into an MCTS.py file and submitted along with the eecs492hw4.ipynb file to the coding section on Gradescope. Your MCTsearch() implementation will not directly influence what gets graded. Your implementation will update policy values for a given state and action, so when the algorithm in executeEpisode() tries to take an action, it will take one according to the policy values you have updated.

The code in the MCTS cell block contains additional comments to help you understand which functions are relevant to your code. Below is high-level information that may be helpful for your understanding of the algorithm.

eecs492hw4.ipynb

This is the main file you will use to run and test your solution. This file sets up the program so you can start training with MCTS. You do not need to worry about setting up the board or running the code; that is done for you.

There is an argument variable (args) for you to tune hyperparameters and experiment with if you like. For the actual project, do not modify these values; the existing hyperparameters are needed for grading. Only modify the hyperparameters if you're curious about how the model performs when tuned. The most important hyperparameters are listed below, followed by an illustrative example.

• numIters
– The number of iterations the algorithm will take when learning
• numEps
– The number of episodes each iteration takes. This will determine how many training samples the model will have each iteration
• arenaCompare
– The number of times the new neural network plays against the old one.
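For reference, a hyperparameter configuration like this is typically stored in a single args object. The values below are made-up placeholders to show the shape of such a configuration; do not copy them, since grading relies on the values already set in the notebook.

# Hypothetical example of an args configuration (placeholder values only).
args = {
    'numIters': 3,        # number of learning iterations
    'numEps': 10,         # self-play episodes per iteration
    'arenaCompare': 20,   # games the new network plays against the old one
}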

Coach class

This class contains the helper functions used to train your algorithm. Most of the learning happens here.

It is defined in one of the cells in the .ipynb file. These are the most relevant methods:
• learn(self)
– This performs the training of the model
• executeEpisode(self)
– This is where MCTsearch() will be indirectly called through getActionProb(). Your implementation will update the policy values. When executeEpisode() tries to take an action, it will take one according to the policy values you have updated. The action this function takes will be compared against the instructor's solution set. Although you won't write this function, it will test your search implementation indirectly.

MCTS.py

This file will be created by you from the appropriate cell in eecs492hw4.ipynb and submitted to Gradescope. These are the relevant methods for the homework; you shouldn't call any methods outside of these. The bolded methods are the ones you have to implement for the homework.
• gameEnded(self, canonicalBoard)
– Determines if the current board position is the end of the game. Returns r: 0 if the game hasn't ended, 1 if the player won, and -1 if the player lost
• predict(self, state, canonicalBoard)
– Computes r, the reward, from the neural network. Use this for the simulation step instead of a rollout. Returns the reward of the simulated rollout
– NOTE: Assume that the neural network gives a reward of the appropriate sign whenever it is called. That is, if the neural network returns +1, the current player won the game; if it returns -1, the opponent won the game. This assumption allows us to avoid tracking whose turn it is in the game tree. Instead, we know that when we return a reward back up to a parent node, the parent is our opponent, and therefore the reward should be negated. (See the hints for each function below for more specific information on where these reward adjustments take place, and the toy sketch after this method list for an illustration.)
• getValidActions(self, state)
– Returns all valid actions you can take in terms of a list of integers
• getConfidenceVal(self, state, action)
– Returns u, an upper confidence bound value for a given state and action
• updateValues(self, r, state, action)
– Updates the necessary values for backpropagation for a given state and action
• expand(self, state)
– Expands the state in MCTS
• simulate(self, state, board)
– This is the simulate phase of MCTS
• select(self, state, board)
– This is the selection phase of MCTS
• backpropagate(self, seq)
– This is the backpropagation phase of MCTS
• MCTsearch(self, initial_board)
– This is the main function for your Monte Carlo Tree Search
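Because the sign convention in predict() trips people up, here is a tiny, self-contained toy snippet (not part of the MCTS.py API) showing how a reward flips sign each time it is handed up one level in the tree, since each parent is the opponent of its child.

# Toy illustration of the alternating-sign convention; not part of MCTS.py.
leaf_reward = 1.0                 # the player to move at the leaf wins
r = leaf_reward
rewards_up_the_path = []
for _ in range(3):                # leaf -> parent -> grandparent -> great-grandparent
    r = -r                        # each parent sees the opposite outcome
    rewards_up_the_path.append(r)
print(rewards_up_the_path)        # prints [-1.0, 1.0, -1.0]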
Below are some specific hints and tips that will help with the implementation of each function:

SELECT

Hints for the first TODO:

• Remember that self.gameEnded returns 0 if the game is not over, 1 if the current player won, and -1 if the current player lost. Thus, if r is something besides 0, we should treat it as an end state.

– Note that in an end state, the state, board, and action should all be None; thus we should return something that looks like None, None, None, reward
• The other case we must handle at the beginning is if we haven’t visited the current state before and thus there is no information about it and its rewards.
– In this case, take a look at how expand and simulate may help in this scenario.
∗ The simulate function gives the reward from the simulated rollout (neural network prediction) starting from the current state.
∗ The expand function adds the current state to the set of visited.

• Remember that when we return from the select function, we are returning a level up to the opposite player. For example, if a reward is -1 for the max player, we should return 1 for the min player. Thus you should reverse the value returned.

Hint for the second TODO: Use the self.getConfidenceVal function to get the highest confidence value and the associated best action to take. Note that this is conceptually where we iterate over possible actions in the select phase of MCTS to find the best action to take.
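The skeleton below restates these hints as code. It is only a rough sketch: the attribute name self.visited is an assumption, the exact return values must follow the TODO comments in MCTS.py, and the final branch is left unfinished on purpose.

# Rough sketch of the selection logic described in the hints above; NOT a drop-in solution.
def select_sketch(self, state, board):
    r = self.gameEnded(board)
    if r != 0:
        # End state: no state/board/action to return; flip the sign because the
        # value is handed up to the opposite player (see Bug 1 below).
        return None, None, None, -r
    if state not in self.visited:        # "visited" is a hypothetical attribute name
        # Unvisited state: expand it, get a value from the neural-network
        # simulation instead of a rollout, and hand it up (negated) one level.
        self.expand(state)
        r = self.simulate(state, board)
        return None, None, None, -r
    # Otherwise pick the action with the highest upper confidence bound.
    best_action, best_u = None, float('-inf')
    for action in self.getValidActions(state):
        u = self.getConfidenceVal(state, action)
        if u > best_u:
            best_u, best_action = u, action
    # ... apply best_action to obtain (state', board') and return (state', board', action, r') ...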

BACKPROPAGATE

Hint for the first TODO (a sketch of this logic follows the list):
• Consider the cases for r’:
– When r’ is not 0, this indicates that we have reached an end state (game win or game loss) in the rollout. As a result, we can set the overall “accumulated” reward to the current reward in the sequence since it is an end state.
– In the other case when r’ is 0, we know that this must be an intermediate state, action pair that we have not discovered before. When we come across such a pair, we must call self.updateValues with the current accumulated reward which propagates rewards into the search tree. Afterwards, we must adjust the reward to that of the other player (max → min, min → max).
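As a rough sketch only, the two cases above might look like the following. The exact contents of seq (a LifoQueue filled during MCTsearch()) depend on your own implementation, so treat this as a restatement of the hints rather than a solution.

# Rough sketch of the backpropagation cases described above; NOT a drop-in solution.
def backpropagate_sketch(self, seq):
    r = 0
    while not seq.empty():
        state, action, r_prime = seq.get()       # deepest entries come out of the LifoQueue first
        if r_prime != 0:
            r = r_prime                          # end state: reset the accumulated reward
        else:
            self.updateValues(r, state, action)  # propagate the accumulated (old) reward
            r = -r                               # switch perspective to the other player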

SIMULATE

Everything you need to know is provided in the comments in the code (MCTS.py).

MCTSEARCH

Hints for the first TODO:
• Since we know that the selection phase of MCTS (self.select()) returns a tuple of (state’, board’, action, r’), think about how to leverage the returned values in setting the new board, state, and r.
• In addition, at each selection step, you will need to update the sequence properly with the newly generated r'.

Hint for the second TODO: At this step, we have just obtained a board state that produces a reward value that isn't equal to 0 (meaning the rollout has concluded in either a loss or a win). After we complete a rollout in the MCTS algorithm, what should we do to update the values of all the nodes we visited during the rollout?
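Putting the two hints together, the overall search loop might be shaped roughly like the sketch below. The state-key helper stringRepresentation() and the exact tuple pushed onto the LifoQueue are assumptions for illustration; your implementation must follow the TODO comments in MCTS.py.

# Rough sketch of the MCTsearch() control flow; NOT a drop-in solution.
from queue import LifoQueue

def MCTsearch_sketch(self, initial_board):
    board = initial_board
    state = self.stringRepresentation(board)   # hypothetical helper that builds the state key
    seq = LifoQueue()
    r = 0
    while r == 0:                              # keep selecting until a win/loss value comes back
        next_state, next_board, action, r = self.select(state, board)
        seq.put((state, action, r))            # record the step (note: the OLD state, per Bug 3)
        state, board = next_state, next_board
    # The rollout ended in a win or loss: push the rewards back up through
    # every node visited on this path.
    self.backpropagate(seq)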

2.6 Helpful Tips / Bug Board

In this section, we list some common bugs to look into if your solution doesn't match the instructor's solution:

Bug 1: Returning r instead of -r in the select end cases

If this is your bug, your output might look like: [[(2, 1), (3, 1), (4, 2), (5, 1), (4, 1), (3, 0), (4, 3), (1,3), (0, 3), (1, 1)...

Note that we truncated the full output, but if this is where your output starts differing from the instructor's solution, double-check this.

Bug 2: Not using old reward in self.updateValues within backpropagate

If this is your bug, your output might look similar to the output for bug 1. Double check that you are using the old reward in the else case (non-end state) for backprop.

Bug 3: Using state' when inserting into the sequence LifoQueue

If this is your bug, you might get an error when running the Coach cell that looks like this:

KeyError: (b'\x00\x00...
Bug 4: Not setting r1 properly in backprop else case

If this is your bug, your output will look like this: [[...(5, 0), (4, 3), (2, 1), (3, 1), (4, 0), (1, 0), (4,2), (4, 1)...]]

2.7 Coding Section Conceptual Questions

For the following questions, you may find it helpful to run our Othello visualizer in the Google Colab notebook and tweak the hyperparameters to guide your answers. Please answer in the cells provided at the end of the Google Colab notebook. (Note that changing hyperparameters may cause your code to produce output that doesn't match the expected output even if your implementation is correct, so we advise experimenting only after getting full points on Gradescope.)

2.7.1 Question 1 (5 pts)

What might happen to the moves the agent makes if you use an implementation of MCTS where backpropagate doesn't actually update any values in the tree (but still runs without error)? Why?

2.7.2 Question 2 (5 pts)

Give two reasons why having a very high number of iterations and episodes might be non-ideal (changing the hyperparameters to such values might give you a clue). If your code does not work, answer to the best of your ability and justify your reasoning.
