CSC 498 Introduction to Reinforcement Learning: Assignment 3




CSC 498: Assignment 3 

To complete the exercise, you can use the TeX template provided in the materials GitHub. Insert your answers into the solution space below each question. In case you are unfamiliar with LaTeX, you may also submit handwritten solutions, but make sure they are clean and legible. Submit the exercise before 23:59 on the due date on Quercus. 

To submit, please bundle your completed exercise sheet and your Jupyter notebook into one zip file. Name the zip file studentnumber_lastname_firstname.zip and upload it on Quercus. 

Each student will have 3 grace days throughout the semester for late assignment submissions. Late submissions that exceed those grace days will lose 33% of their value for every late day beyond the allotted grace days. Late submissions that exceed three days of delay after the grace days have been used will unfortunately not be accepted. The official policy of the Registrar’s Office at UTM regarding missed exams can be found here: https://www.utm.utoronto.ca/registrar/current-students/examinations. If you have a compelling reason for missing the deadline, please contact the course staff as soon as possible to discuss the hand-in. For this assignment, you can hand in up to one week late with no penalty. 

For assignment questions, please use Piazza and the office hours, but refrain from posting complete or partial solutions. 

I Policy Gradient Theory 

We proceed to define two policy classes.

Softmax Policy Class

Let $\theta \in \mathbb{R}^d$, and let $\psi(s, a) : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ be any function from state-action pairs to $\mathbb{R}^d$. Now consider the policy class defined by:

$$\pi_\theta(a \mid s) = \frac{\exp\left(\psi(s, a)^T \theta\right)}{\sum_{a'} \exp\left(\psi(s, a')^T \theta\right)} \qquad (1)$$
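For concreteness, the snippet below is a minimal NumPy sketch of evaluating and sampling from this softmax policy class for a small discrete action set; the feature map psi, the dimensions, and the random seed are illustrative placeholders, not part of the assignment scaffolding.

import numpy as np

def softmax_policy_probs(psi, theta, s, actions):
    # Compute pi_theta(a|s) for a discrete action set under the softmax policy class.
    logits = np.array([psi(s, a) @ theta for a in actions])
    logits -= logits.max()                      # subtract max for numerical stability
    unnormalized = np.exp(logits)
    return unnormalized / unnormalized.sum()

# Hypothetical example: 2-dimensional features psi(s, a) and 3 discrete actions.
rng = np.random.default_rng(0)
theta = rng.normal(size=2)
psi = lambda s, a: np.array([s * (a + 1), float(a)])   # placeholder feature map
probs = softmax_policy_probs(psi, theta, s=0.5, actions=[0, 1, 2])
action = rng.choice(3, p=probs)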

Gaussian Policy Class 

Let $\theta \in \mathbb{R}^d$ again, but now let $\phi(s) : \mathcal{S} \to \mathbb{R}^d$ be any function from states to $\mathbb{R}^d$. Then, if $\mathcal{A} = \mathbb{R}$, we define the policy class as follows: 

$$\pi_\theta(a \mid s) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}\left(a - \phi(s)^T \theta\right)^2\right) \qquad (2)$$

i.e. a Gaussian centered at $\phi(s)^T \theta$ with constant variance $\sigma^2$. 
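Sampling from the Gaussian policy class then amounts to drawing from a normal distribution whose mean is the linear function $\phi(s)^T \theta$; the sketch below uses a hypothetical polynomial feature map and an arbitrary $\sigma$ purely for illustration.

import numpy as np

def sample_gaussian_action(phi, theta, s, sigma, rng):
    # Draw a ~ N(phi(s)^T theta, sigma^2), i.e. one action from the Gaussian policy class.
    mean = phi(s) @ theta
    return rng.normal(loc=mean, scale=sigma)

rng = np.random.default_rng(0)
theta = rng.normal(size=3)
phi = lambda s: np.array([1.0, s, s ** 2])   # placeholder feature map
a = sample_gaussian_action(phi, theta, s=0.2, sigma=0.5, rng=rng)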

1. (5 points) What is the score function for the softmax policy class? What is the corresponding policy gradient update equation?

Solution: 

2. (5 points) What is the score function for the Gaussian policy class? What is the corresponding update equation? 

Solution: 

II Policy Gradient Implementation 

We will be using the continuous MountainCar task from OpenAI Gym. For more details please see the documentation here: https://gym.openai.com/envs/Pendulum-v0/. 

1. Implement REINFORCE with a Gaussian policy (10 points) 

To start, download all necessary code for assignment 3 from the course GitHub, linked on the course webpage. Set up your Python environment and make sure you can run Jupyter. If you have not done so already, you will need to set up the Gym package from OpenAI. 

Run the first section of the jupyter notebook assignment3.ipynb. 

Task 2 contains scaffolding code for a Gaussian policy agent. To be exact, the agent defines its policy with the following distribution: 

$$\pi(a \mid s, \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}\left(a - \phi(s)^T \theta\right)^2\right) \qquad (3)$$

where $\sigma$ is a constant, and $\theta \in \mathbb{R}^3$ is the policy parameter vector. Note: the actions are bounded in [min-action, max-action]; when sampling actions that end up beyond this range, you can either resample them or clip them to the maximum/minimum. 
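As a rough illustration of the clipping option, the sketch below samples from the Gaussian policy and clips the result into the valid range; min_action, max_action, phi, and sigma are stand-ins for whatever names the notebook scaffolding actually uses.

import numpy as np

def sample_clipped_action(phi, theta, s, sigma, min_action, max_action, rng):
    # Sample from N(phi(s)^T theta, sigma^2) and clip into [min_action, max_action].
    a = rng.normal(loc=phi(s) @ theta, scale=sigma)
    return float(np.clip(a, min_action, max_action))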

In the code cell provided in the notebook, please replace the sections marked TODO with your own code. Your agent should be able to find a solution that is roughly optimal. You are free to choose parameters such as the learning rate or the number of episodes if you find this helps performance. 

You may find your algorithm does not perform exceptionally well; this is fine, so long as it shows improvement as it trains. We added some plotting utilities to help you visualize this. 

Please document the performance of this method, and perform some experimentation with multiple seeds/parameter values. 
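One simple way to organize the seed/parameter experiments is sketched below; run_reinforce is a hypothetical wrapper around your own training loop that returns a list of per-episode returns, not a function provided by the notebook.

import matplotlib.pyplot as plt

def run_sweep(run_reinforce, seeds=(0, 1, 2), lr=1e-3, episodes=500):
    # Train one agent per seed and plot the per-episode returns on a shared axis.
    for seed in seeds:
        returns = run_reinforce(seed=seed, lr=lr, episodes=episodes)
        plt.plot(returns, label=f"seed {seed}")
    plt.xlabel("episode")
    plt.ylabel("return")
    plt.legend()
    plt.show()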

2. REINFORCE with Value Function Learning (10 points) 

In the subsequent part of Task 2, you will need to implement a linear value function estimator parameterized by ξ: 

$$\hat{V}_\xi(s) = \phi(s)^T \xi \qquad (4)$$

you will use this quantity to rescale the rewards, such that: 

$$A_t = r_t + \gamma \hat{V}_\xi(s_{t+1}) - \hat{V}_\xi(s_t) \qquad (5)$$

and use $A_t$ when computing the updates for your policy. The parameters $\xi$ are updated with gradient descent, in a separate update immediately after the policy is updated. It uses the update equation from value function approximation, i.e. 

$$\xi \leftarrow \xi + \alpha_V \sum_{t=1}^{T} \left(r_t + \gamma \phi(s_{t+1})^T \xi - \phi(s_t)^T \xi\right) \phi(s_t) \qquad (6)$$

This uses a separate learning rate $\alpha_V$ that you can tune to increase performance. 
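The sketch below shows one way to implement equations (4) and (6) over a single collected episode, assuming lists of states, rewards, and next states; the function and variable names are illustrative, and the terminal-state bootstrap is ignored for brevity.

import numpy as np

def update_value_function(xi, states, rewards, next_states, phi, gamma, alpha_v):
    # One semi-gradient update of the linear estimate V_xi(s) = phi(s)^T xi, as in (6).
    grad = np.zeros_like(xi)
    for s, r, s_next in zip(states, rewards, next_states):
        td_error = r + gamma * (phi(s_next) @ xi) - (phi(s) @ xi)   # A_t from (5)
        grad += td_error * phi(s)
    return xi + alpha_v * grad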

Please document the performance of this method, and perform some experimentation with multiple seeds/parameter values.

Does this increase or decrease the performance of our model? What could we potentially do to improve the quality of this estimator? 

III Approximate Q Learning Theory 

1. (5 points) What is the main challenge when trying to solve an MDP using function approximation compared to tabular Q-learning?

Solution: 

2. (5 points) Name one advantage and one disadvantage DQN has over Policy Gradient methods like REINFORCE. 

Solution: 

3. (5 points) Many Q-learning methods opt to output a vector $Q(s, \cdot) \in \mathbb{R}^{|A|}$ rather than each value $Q(s, a) \in \mathbb{R}$ individually. What is one advantage and one disadvantage of this approach? 

Solution: 

4. (5 points) In supervised learning, models are trained to minimize the difference between their output and data sampled from some fixed distribution. For example, regression with mean-squared error has the following error function: 

$$L(\theta) = \mathbb{E}_{(x, y) \sim D} \left\| f_\theta(x) - y \right\|^2$$

for some family of models f with parameter θ, and some fixed dataset D. We then optimize for θ to minimize this loss. 

This looks similar to the Q-learning problem, but they are not the same. Outline and explain one difference between these two problems. 

Solution: 

IV DQN

As before, you will find the scaffolding code for a simple DQN agent in the Jupyter Notebook. Fill out all sections marked with ??? to complete the core logic of the DQN agent. Note that you will need to use PyTorch for this exercise, so you might need to refer to the PyTorch documentation (https://pytorch.org/docs/stable/index.html).

1. Vanilla DQN Implementation (10 points) 

For the first subtask, complete the vanilla DQN implementation. The train method returns all losses computed during training. Visualize the losses using matplotlib and discuss the results. As always, only modify the code in the marked places and do not add or remove any libraries or dependencies. Your agent might not converge to a stable solution; if so, vary the parameters to investigate their impact on performance. 
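A minimal way to visualize the returned losses is sketched below, assuming train returns a flat list of per-step loss values; adapt the names to the notebook's actual interface.

import matplotlib.pyplot as plt

def plot_losses(losses, window=50):
    # Plot raw DQN losses together with a simple moving average for readability.
    plt.plot(losses, alpha=0.3, label="loss")
    if len(losses) >= window:
        smoothed = [sum(losses[i - window:i]) / window for i in range(window, len(losses) + 1)]
        plt.plot(range(window - 1, len(losses)), smoothed, label=f"moving average ({window})")
    plt.xlabel("training step")
    plt.ylabel("loss")
    plt.legend()
    plt.show()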

Note that the NN-based training can take some time, so it is acceptable if you do not achieve very good performance with limited computational capacity. In addition, the agent currently does not use experience replay, which is sometimes necessary for the agent to converge properly. As a bonus (5 points), you can implement an additional agent using experience replay. 
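If you attempt the bonus, a uniform replay buffer can be as simple as the sketch below; the transition format (state, action, reward, next_state, done) is an assumption and should be matched to the scaffolding code.

import random
from collections import deque

class ReplayBuffer:
    # Fixed-size buffer that stores transitions and returns uniform random minibatches.
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)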

2. Double DQN Implementation (10 points) 

For the second subtask, complete the Double DQN implementation. Note that you have to change several parts of the previous algorithm; all relevant sections are marked. The code logic is mostly similar to before but needs some additional variables, so several parts only have to be tweaked slightly. 

Run the algorithm, plot the loss function, and discuss the differences from the previous implementation.
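As a hedged sketch of the key difference, Double DQN selects the greedy next action with the online network but evaluates it with the target network; q_net and target_net below are placeholder names for the two PyTorch modules in the notebook, and the tensors are assumed to be appropriately shaped float tensors.

import torch

def double_dqn_targets(q_net, target_net, rewards, next_states, dones, gamma):
    # Compute r + gamma * Q_target(s', argmax_a Q_online(s', a)), zeroed on terminal states.
    with torch.no_grad():
        next_actions = q_net(next_states).argmax(dim=1, keepdim=True)         # online net selects
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)   # target net evaluates
        return rewards + gamma * next_q * (1.0 - dones)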
