CSC 498: Assignment 1
To complete the exercise, you can use the TeX template provided in the materials GitHub repository. Insert your answers into the solution space below each question. In case you are unfamiliar with LaTeX, you may also submit handwritten solutions, but make sure they are clean and legible.
Submit the exercise before 23:59 on the due date on Quercus. To submit, please bundle your completed exercise sheet, your Jupyter notebook, and any material for the bonus task into one zip file. Name the zip file studentnumber_lastname_firstname.zip and upload it on Quercus.
Each student will have 3 grace days throughout the semester for late assignment submissions. Late submissions that exceed those grace days will lose 33% of their value for every late day beyond the allotted grace days. Late submissions that exceed three days of delay after the grace days have been used will unfortunately not be accepted. The official policy of the Registrar's Office at UTM regarding missed exams can be found here: https://www.utm.utoronto.ca/registrar/current-students/examinations. If you have a compelling reason for missing the deadline, please contact the course staff as soon as possible to discuss a late hand-in.
For assignment questions, please use Piazza and the office hours, but refrain from posting complete or partial solutions.
I Theoretical background
1. Chain-walk MDP (10)
Figure 1: Simple Chain MDP
Consider the simple 5-state MDP shown in Figure 1. The agent starts at state s_1 and has two actions available in each of the states s_i, each with reward 0. Taking any action from state s_{-1} or s_4 results in a reward r > 0, and the agent stays in that state. The actions are deterministic and always succeed. Assume a discount factor γ < 1.
1. Compute the optimal value for each state with a discount factor of γ = 0.9. Show the equations and any simplifications so we can follow your reasoning. (3)
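Hint: the optimal values satisfy the Bellman optimality equation, written here for this chain's deterministic transitions (s' denotes the successor of s under action a):

\[ V^*(s) = \max_{a} \left[ r(s, a) + \gamma\, V^*(s') \right] \]

Since the agent keeps collecting r once it reaches a rewarding end state, unrolling this equation produces a geometric series in γ.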
2. Runtime comparison (10)
II Coding assignment
To start, download all necessary code for assignment 1 from GitHub at https://github.com/pairlab/csc498-material. Set up your Python environment and make sure you can run Jupyter.
Run the first section of the Jupyter notebook assignment1.ipynb (this requires you to run all cells within Assignment 1, Task 1).
III Policy Evaluation
Time | State  | Action | Reward
-----|--------|--------|-------
1    | s_0    | a_1    | −1
2    | s_{-1} | a_1    | +1
3    | s_0    | a_1    | +1
4    | s_1    | a_1    | +5
5    | s_2    | N/A    | N/A
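For reference, the discounted return of such a trajectory can be checked numerically. Below is a minimal Python sketch; the reward list is taken from the table above, while the discount gamma = 0.9 is only a placeholder value:

gamma = 0.9
rewards = [-1, +1, +1, +5]  # rewards received at times 1 through 4

# G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
# Accumulate backwards so each return reuses the one after it.
returns = []
G = 0.0
for r in reversed(rewards):
    G = r + gamma * G
    returns.append(G)
returns.reverse()

for t, G in enumerate(returns, start=1):
    print(f"G_{t} = {G:.3f}")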
IV Bonus challenge
You will be able to get full points in all exercises without these questions, but we strongly encourage you to at least try to complete them. The bonus points will improve your overall exercise score in the final grade calculation.
For each of the bonus questions, we will only provide minimal guidance and a high level task description. This means you are strongly encouraged to play around, think about different strategies and discuss your findings in your submission. Upload a description of your solution and relevant code alongside your submission.
Next, you need to discretize the state and action space in order to use a policy iteration or value iteration approach. You are free to use any strategy here; there are no bounds on your creativity (except your hardware limitations). We do suggest starting simple, though; see the sketch below.
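For instance, here is a minimal sketch of a uniform grid discretization, assuming per-dimension state bounds are known. The bounds, dimensionality, and bin count below are placeholder choices, not part of the assignment:

import numpy as np

# Minimal sketch: uniform grid discretization of a continuous state.
# The bounds and bin count below are placeholders; adapt them to your
# environment's observation space.
low = np.array([-1.0, -1.0])   # placeholder lower bounds per dimension
high = np.array([1.0, 1.0])    # placeholder upper bounds per dimension
n_bins = 10                    # bins per dimension; start simple

# Precompute interior bin edges for every state dimension.
edges = [np.linspace(l, h, n_bins + 1)[1:-1] for l, h in zip(low, high)]

def discretize(state):
    """Map a continuous state to a tuple of bin indices."""
    return tuple(int(np.digitize(s, e)) for s, e in zip(state, edges))

# Example: a continuous state becomes a grid cell usable as a table index.
print(discretize(np.array([0.05, -0.73])))  # prints (5, 1)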
Finally, evaluate your agent using at least 16 independent runs of the original environment. Does the final reward align with the estimated value function of your agent? Are there failure cases, and can you explain them? We expect the whole code to run in under 15 minutes. An example evaluation loop is sketched below.
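As an illustration, an evaluation loop could follow this sketch; env and policy stand in for your own environment and trained agent, and the Gym-style reset/step interface is an assumption:

import numpy as np

def evaluate(env, policy, n_runs=16, max_steps=1000):
    """Run the policy for n_runs independent episodes and return
    the per-episode total rewards (sketch; Gym-style API assumed)."""
    totals = []
    for _ in range(n_runs):
        state = env.reset()
        total = 0.0
        for _ in range(max_steps):
            action = policy(state)              # your agent's greedy action
            state, reward, done, _ = env.step(action)
            total += reward
            if done:
                break
        totals.append(total)
    return np.array(totals)

# Report mean and standard deviation over all runs, as required below.
# returns = evaluate(env, policy)
# print(returns.mean(), returns.std())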
To obtain full points, we expect clean code and a small written report containing a short discussion of your choice of ML model, your discretization scheme, and a graph of reward over time steps showing the mean and standard deviation over all your runs. In addition, please add a small discussion of the final results. Please provide your code and all written parts together in the form of a single Jupyter notebook.