COMP3702 Artificial Intelligence (Semester 2, 2024)
Assignment 3: Reinforcement Learning
Key information:
Reinforcement Learning
Gymnasium API
Additional dependencies are required in some environments as per the documentation: e.g. pip install swig for Box2D environments and pip install pygame for the visualisation libraries.
To get an understanding of these environments, you can visualise them using the human render mode using code as below:
|
import gymnasium as gym
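# Initialise the environment. You can replace it with another one, e.g. "LunarLander-v2" or -v3.
# Ensure the environment family (e.g. classic-control, box2d) is installed using pip/conda.
env = gym.make("CartPole-v1", render_mode='human')
# Reset the environment to generate the first observation
observation, info = env.reset(seed=42)
for _ in range(500):
    # Sample a random action (a trained policy would choose the action here)
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)
    # Start a new episode once the current one terminates or is truncated
    if terminated or truncated:
        observation, info = env.reset()
env.close()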
|
PyTorch
If your computer has a dedicated GPU, select PyTorch with GPU support, and download the appropriate drivers (ROCm for AMD GPUs and CUDA for Nvidia GPUs). Make sure you only install one - either the GPU or the CPU version! Note that the classic-control environments can train faster using CPU than GPU.
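As a minimal sketch of the CPU/GPU choice described above, a script might select its device at run time as follows. The helper name `pick_device` and its `prefer_gpu` flag are illustrative assumptions, not part of the assignment code. `torch.cuda.is_available()` is the standard PyTorch check, and ROCm builds of PyTorch also report through the `torch.cuda` interface. The helper falls back to "cpu" if PyTorch cannot be imported.

```python
def pick_device(prefer_gpu: bool = True) -> str:
    """Return "cuda" if a GPU build of PyTorch can see a device, else "cpu".

    `pick_device` and `prefer_gpu` are hypothetical names used for illustration.
    """
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    if prefer_gpu and torch.cuda.is_available():
        return "cuda"
    return "cpu"


if __name__ == "__main__":
    # For the classic-control environments, prefer_gpu=False may train faster.
    print(pick_device(prefer_gpu=False))
```

The returned string can be passed to `torch.device(...)` and then to `.to(device)` on your network and tensors.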
Task
For background, we recommend that you read through the following tutorials. You can make use of the code in your solutions with attribution:
• Official DQN PyTorch Tutorial: https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html
When training reinforcement learning algorithms, we typically assess solution quality using the 100-step moving average episode reward (R100) received by the learning agent, where each step of the moving average is one episode. At episode t, R100 is the average episode reward earned by your learning agent over the episodes [t − 100, t]. If the Q-values imply a poor-quality policy, this value will be low; if they correspond to a high-value policy, it will be higher. We use a moving average because rewards may only be received occasionally, and the episode reward is affected by sources of randomness, including the exploration strategy. You will need to write a function that plots R100 against episode number for analysis in the report.
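The moving average described above can be sketched in a few lines of plain Python. This is one possible reading, not a reference implementation: the function name `moving_average` and the `window` parameter are assumptions, and for the first episodes (before a full window exists) the sketch averages over however many rewards are available.

```python
from collections import deque


def moving_average(episode_rewards, window=100):
    """Return the trailing mean of the last `window` episode rewards, per episode.

    Early entries average over fewer than `window` episodes (an assumption of
    this sketch; other conventions, such as padding with zeros, are possible).
    """
    recent = deque(maxlen=window)  # keeps only the most recent `window` rewards
    averages = []
    for reward in episode_rewards:
        recent.append(reward)      # the oldest reward is dropped automatically
        averages.append(sum(recent) / len(recent))
    return averages


if __name__ == "__main__":
    print(moving_average([1, 2, 3, 4], window=2))
```

The returned list has one entry per episode, so it can be passed directly to `plt.plot` with the episode index on the x-axis.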
The report
Question 1. Q-learning vs Value Iteration
Question 2. Comparing CartPole-v0 and CartPole-v1
Table 1: Differences between CartPole-v0 and -v1
Environment | max_episode_steps | reward_threshold
CartPole-v0 | 200 | 195.0
CartPole-v1 | 500 | 475.0
An episode ends if any one of the following occurs:
a) Implement a function to plot the R100 value vs Episode reward. You will need to import a plotting library, e.g. import matplotlib.pyplot as plt, and can implement a function similar to that used in the PyTorch tutorial. Copy or screenshot your code implementation for your answer, citing any resources you used to develop this. (As part of this, you may also want to implement saving and loading of results and/or plots). (5 marks)
b) Plot the R100 value vs Episode number for CartPole-v0 and CartPole-v1 DQN models. Ensure your axes are correctly labelled and indicate what each plot represents (e.g., using a legend or caption). (5 marks)
c) Describe and compare the learnt policies for CartPole-v0 and CartPole-v1. You may make use of the saved video examples on Blackboard titled “CartPole-v0.mp4” and “CartPole-v1.mp4”. Based on your observation of these learnt policies, the definition of the environment and your plots, explain why you think the values of max_episode_steps and reward_threshold were increased from v0 to v1. (5 marks)
Note: you may need to train the model several times to observe the desired behaviour differences. You can use the human render mode to visualise the policies extracted from the trained neural networks as in the Tutorial solutions or simply describe the saved videos on Blackboard.
Question 3. Loss function and Target network
a) With reference to TD-learning, describe the loss function used to train the neural network in DQN. Use equations and highlight the components corresponding to the “target value” and the neural network’s current state-action value predictions. (5 marks)
Question 4. Learning-rate
a) Plot the quality of the policy learned by DQN, as given by R100, against episode number for three different fixed values of the learning_rate (which is called α in the lecture notes and in many texts and online tutorials). For this question, do not adjust α over time, rather keep it the same value throughout the learning process. Your plot should display the solution quality up to an episode count where either the performance stabilises (typically > 1000 episodes) or a clear difference in learning rates can be observed. (5 marks)
b) With reference to your plot(s), comment on the effect of varying the learning_rate. (5 marks)
c) Use a plot (either self-drawn or sourced and cited) to describe what happens when the learning_rate is too high. (5 marks)
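As a toy illustration of the learning-rate effect asked about in Question 4 (not a DQN experiment, and not a model answer), the sketch below runs plain gradient descent on f(x) = x². The function name, the step count and the two learning rates are arbitrary choices for illustration. Each update multiplies x by (1 − 2α), so a small α shrinks the loss steadily, while α > 1 makes the iterate overshoot the minimum by more each step.

```python
def gradient_descent(learning_rate, steps=20, x0=1.0):
    """Minimise f(x) = x**2 starting from x0; return the loss after each update."""
    x = x0
    losses = []
    for _ in range(steps):
        grad = 2.0 * x                 # derivative of x**2
        x = x - learning_rate * grad   # standard gradient descent update
        losses.append(x * x)
    return losses


if __name__ == "__main__":
    small = gradient_descent(0.1)  # loss decreases towards zero
    large = gradient_descent(1.1)  # loss grows: every step overshoots
    print(small[-1] < small[0], large[-1] > large[0])
```

Plotting both loss lists against the step index gives one decaying curve and one growing curve. Training a neural network is noisier than this one-dimensional example, so treat it only as intuition.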
Question 5. Epsilon
Question 6. DQN vs Double DQN or Duelling DQN
An implementation of Duelling DQN is included in the tutorial11 code.
Question 7. Applying DQN beyond CartPole (20 marks)
This question effectively repeats the experiments from the previous few questions, except that you choose which experiments to perform (e.g. which hyperparameters to vary) and run them on your chosen environment instead of CartPole. For example, choose some hyperparameters (learning rate, epsilon, tau/target sync interval, number of hidden layers, etc.), try a few values for each, and select the best value for each based on your results, which allows you to justify your choice of those values.
Criteria:
- Experiment with ≥ 3 hyperparameters and provide evidence and justification for your selection
- Provide evidence of experimentation including plots comparing performance of various hyperparameter settings
- Describe what you observed (the effect of each parameter change) and explain why you chose or modified that parameter
- Report final settings of hyperparameter values
- Demonstrate that you can solve a level: define what counts as a good policy or “solved” for your selected environment and show that your agent achieves this, e.g. with an R100 value or a description/screenshot of the behaviour
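One way to organise the Question 7 experiments is to vary a single hyperparameter at a time around a baseline configuration. The sketch below only generates the configurations. The function name `one_at_a_time_configs`, the hyperparameter names and every numeric value are placeholders chosen for illustration, not recommended settings.

```python
def one_at_a_time_configs(baseline, candidates):
    """Yield (name, value, config), changing one hyperparameter from the baseline."""
    for name, values in candidates.items():
        for value in values:
            config = dict(baseline)  # copy so the baseline is never mutated
            config[name] = value
            yield name, value, config


# Placeholder values only: substitute the settings you actually want to compare.
baseline = {"learning_rate": 1e-3, "epsilon_decay": 0.995, "tau": 0.005}
candidates = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "epsilon_decay": [0.99, 0.995, 0.999],
    "tau": [0.001, 0.005, 0.05],
}
runs = list(one_at_a_time_configs(baseline, candidates))

if __name__ == "__main__":
    for name, value, config in runs:
        print(name, value, config)
```

Each `config` would then be passed to your own training routine, with the resulting R100 curve stored under `(name, value)` so that the settings for one hyperparameter can be plotted together.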
Bee Cart Pole credit to Anonymous COMP3702 student and Generative AI.
Academic Misconduct
It is the responsibility of the student to ensure that you understand what constitutes Academic Misconduct and to ensure that you do not break the rules. If you are unclear about what is required, please ask.
In the coding part of COMP3702 assignments, you are allowed to draw on publicly accessible resources and provided tutorial solutions, but you must reference or attribute each source in comments next to the referenced code, and include a list of the references you have drawn on in your solution.py docstring.
• https://guides.library.uq.edu.au/referencing/chatgpt-and-generative-ai-tools
Failure to reference use of generative AI tools constitutes student misconduct under the Student Code of Conduct.
It is the responsibility of the student to take reasonable precautions to guard against unauthorised access by others to his/her work, however stored in whatever format, both before and after assessment. You must not show your code to, or share your code with, any other student under any circumstances. You must not post your code to public discussion forums (including Ed Discussion) or save your code in publicly accessible repositories (check your security settings). You must not look at or copy code from any other student.
All submitted files (code and report) will be subject to electronic plagiarism detection and misconduct proceedings will be instituted against students where plagiarism or collusion is suspected. The electronic plagiarism detection can detect similarities in code structure even if comments, variable names, formatting etc. are modified. If you collude to develop your code or answer your report questions, you will be caught.
For more information, please consult the following University web pages:
• Information regarding Academic Integrity and Misconduct:
– https://my.uq.edu.au/information-and-services/manage-my-program/student-integrity-and-conduct/academic-integrity-and-student-conduct
– http://ppl.app.uq.edu.au/content/3.60.04-student-integrity-and-misconduct
• Information on Student Services:
– https://www.uq.edu.au/student-services/
Late submission
It may take the autograder up to an hour to grade your submission. It is your responsibility to upload your code early enough and often enough that you can resolve any issues revealed by the autograder before the deadline. Submitting non-functional code just before the deadline, without allowing enough time to update your code in response to autograder feedback, is not considered a valid reason to submit late without penalty.
Assessment submissions received after the due time (or any approved extended deadline) will be subject to a late penalty of 10% per 24 hours of the maximum possible mark for the assessment item.
In the event of exceptional circumstances, you may submit a request for an extension. You can find guidelines on acceptable reasons for an extension at https://my.uq.edu.au/information-and-services/manage-my-program/exams-and-assessment/applying-extension, and submit the request using the UQ Application for Extension of Assessment form.