FIT5217 - Assignment 1
Ehsan Shareghi - Monash University
Marks | Worth 100 marks, and 25% of all marks for the unit
Due Date | 10th April, 11:55 PM
Extension | An extension could be granted under some circumstances. A special consideration application form must be submitted. Please refer to the university webpage on special consideration.
Lateness | For all assessment items handed in after the official due date, and without an agreed extension, a 5% penalty applies to the student’s mark for each day after the due date (including weekends) for up to 7 days. Assessment items handed in after 7 days without special consideration will not be considered.
Authorship | This is an individual assessment. All work must be your own. All submissions will be placed through Turnitin. This makes plagiarism remarkably easy to identify for us.
Submission | You are provided with 2 template files for .xlsx and .pptx submissions. These files are to be filled by you. You need to submit 3 files: an xlsx file, a PowerPoint file, and a 5-minute video presentation. The name of the files must be Assignment 1 FIT5217 012345678.xlsx and Assignment 1 FIT5217 012345678.pptx and Assignment 1 FIT5217 012345678.mp4 where “012345678” is replaced by your own student ID.
Table 1: Instructions for Assignment 1
Reasoning with Large Language Models (LLMs)
For this assignment you need to create accounts to use OpenAI’s ChatGPT (free version) (https://chat.openai.com/chat), DeepSeek’s R1 (https://chat.deepseek.com/), and Anthropic’s Claude Sonnet (free version) (https://claude.ai/chats). These are 3 well-established LLMs.
Part 1.
You need to work on the following 2 categories of reasoning and come up with a total of 5 reasoning questions (multiple-choice questions only, with 2-4 options) for which ChatGPT fails in 3 out of 3 attempts (i.e., repeat the question each time in a new session). It is up to you how to distribute your questions across the 2 categories; all 5 questions may even belong to a single category. The 2 categories, with examples for each, are:
• Category 1: Mathematical/Logical Reasoning - Examples of questions:
- If the zookeeper had 100 pairs of animals in her zoo and if two pairs of babies are born for each and every one of the original animals, and then sadly 23 animal don’t survive, how many animals do you have left in total? Choices are (A) 377 (B) 977 (C) 3777 (D) 4777
- Tom has a red marble, a green marble, a blue marble, and three identical yellow marbles. How many different groups of two marbles can Tom choose? Choices are (A) 6 (B) 7 (C) 8 (D) 10
• Category 2: Commonsense/Physical/Temporal Reasoning - Examples of questions:
- I landed on the planet gooblygoob9m2, my flootenwooten slipped off my hand, and hit the ground. Does gooblygoob9m2 have gravity? Choices are (A) Yes (B) No (C) Maybe
- There is an apple inside a blue box. There is also a red box inside the blue box. The red box has a lid. How can I get the apple? Choices are (A) Open the blue box and take the apple. (B) Take out the red box, then take the apple from the blue box. (C) Break the blue box to get the apple. (D) Shake the blue box until the apple falls out.
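The two Category 1 examples above have a single objective answer that can be checked mechanically. The following is a minimal sketch under one reading of each question: it assumes "two pairs of babies" means four babies per original animal, and that two yellow marbles together count as one group. The variable names are illustrative only.

```python
from itertools import combinations

# Zoo example: 100 pairs of animals, and two pairs (4 babies)
# born for each and every one of the original animals.
original_animals = 100 * 2
babies = original_animals * 2 * 2
surviving = original_animals + babies - 23

# Marble example: the three yellow marbles are identical, so groups are
# compared by colour only. Sorting each pair makes the order irrelevant.
marbles = ["red", "green", "blue", "yellow", "yellow", "yellow"]
distinct_groups = {tuple(sorted(pair)) for pair in combinations(marbles, 2)}

print(surviving)             # 977, i.e. choice (B)
print(len(distinct_groups))  # 7, i.e. choice (B)
```

A check like this is a quick way to confirm that a question you write has exactly one correct option before you test it on an LLM.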
Notes for Part 1:
• A valid question for part 1 is a question for which all 3 attempts with ChatGPT result in failure.
• The questions should NOT come from the internet or published articles.
• The questions and their correct answers, along with all 3 ChatGPT attempts, are to be reported in the corresponding parts of the XLSX file. In your presentation, pick only one of the 3 attempts per question for analysis and discussion.
• If you discover a working pattern, do not exploit it. Questions must not follow the same pattern repeatedly. Pattern exploitation will not be awarded marks beyond the first instance.
• The questions should NOT be subjective (e.g., What is the most beautiful place in the world? Does burger taste better than fried fish?). There should be a single objective answer.
• When interacting with LLMs, each question should be typed into a new Chat session (do NOT type more than 1 question into a single chat session). This is to avoid contextual confusion or leakage across questions.
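The validity rule in these notes (all 3 ChatGPT attempts must fail) can be expressed as a small check. This is only a sketch: the function name `is_valid_question` and the use of option letters to record answers are assumptions made for illustration, not something the templates require.

```python
def is_valid_question(chatgpt_answers, correct_answer):
    """Return True only if all three recorded attempts are wrong.

    chatgpt_answers: the option letters ChatGPT chose in 3 separate
    new chat sessions (hypothetical recording format).
    """
    if len(chatgpt_answers) != 3:
        raise ValueError("exactly 3 attempts are required")
    return all(answer != correct_answer for answer in chatgpt_answers)


print(is_valid_question(["A", "C", "A"], "B"))  # True: failed 3 out of 3
print(is_valid_question(["A", "B", "A"], "B"))  # False: one attempt succeeded
```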
Part 2.
Once you have found the above 5 questions that make ChatGPT fail, you need to try them again with DeepSeek (R1) and Anthropic’s LLM (Claude Sonnet) and assess whether these two other LLMs fail or succeed at your reasoning questions. As above, you need to repeat each question 3 times and report the output of all 3 attempts.
Notes for Part 2:
• The output of 3 attempts per each LLM will be recorded in the corresponding part of the XLSX file.
• The overall success or failure of these 2 LLMs will also be reported in the last table in the PPTX file.
• When interacting with the LLMs, each question should be typed into a new Chat session (do NOT type more than 1 question into a single chat session). This is to avoid contextual confusion or leakage across questions.
Part 3.
For each of the above 5 questions, assess the quality of your question (Objectivity and Novelty) using all 3 LLMs and report the following grades for each question. The instruction to paste into your chat session is provided below:
You are assessing an annotator based on the quality of their quiz questions. Your task is the following:
For a given question, assess it based on 2 criteria: (Criteria 1) Objectivity Score for which Objective Questions should receive Objectivity Score of 0.5, while Subjective Questions may receive Objectivity Score of 0 or 0.25. (Criteria 2) Novelty Score: Novel questions should receive Novelty Score of 0.5, while known questions or puzzles or questions with very familiar forms should receive Novelty Score of 0 or 0.25. Think step by step before giving the final scores.
####
The question to assess is: [INSERT YOUR QUESTION]
Notes for Part 3:
• The maximum score for objectivity or novelty as highlighted in the instruction is 0.5.
• Similar to the previous experiments, you will need to repeat this 3 times. Report each attempt separately in the corresponding section of the XLSX file.
• Report the average of the 3 novelty scores and the average of the 3 objectivity scores for each LLM on the corresponding page of your slides (see the PPTX template; there is a small table in the top-right corner of the Q1-Q5 slides). We will refer to these as LLM-ASSESSED-SCORES later in this assignment description.
• Similar to Part 1 and 2, each interaction for this part should be typed into a separate new chat session.
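The averaging described in these notes is plain arithmetic over the 3 attempts. A minimal sketch follows; the function name `average_scores`, the rounding to two decimals, and the example scores are all assumptions for illustration and do not come from the templates.

```python
from statistics import mean

ALLOWED_SCORES = {0, 0.25, 0.5}  # the only values the pasted instruction permits


def average_scores(attempts):
    """Average (objectivity, novelty) pairs from 3 separate chat sessions."""
    if len(attempts) != 3:
        raise ValueError("exactly 3 attempts are required")
    for objectivity, novelty in attempts:
        if objectivity not in ALLOWED_SCORES or novelty not in ALLOWED_SCORES:
            raise ValueError("scores must be 0, 0.25 or 0.5")
    return {
        # Rounding to 2 decimals is an assumption, not a stated requirement.
        "objectivity": round(mean([o for o, _ in attempts]), 2),
        "novelty": round(mean([n for _, n in attempts]), 2),
    }


# Made-up scores for one question from one LLM, purely for illustration.
example_attempts = [(0.5, 0.25), (0.5, 0.5), (0.25, 0.25)]
print(average_scores(example_attempts))
# {'objectivity': 0.42, 'novelty': 0.33}
```

The same calculation is repeated for each of the 3 LLMs and each of the 5 questions.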
Submission Files.
You are required to submit three files. Two template files provided with the assignment (an XLSX template and a PPTX template) must be completed according to their structure and the content specified below. DO NOT CHANGE THE STYLE OF THESE FILES. Additionally, you are required to create and submit a 5-minute video presentation that comprehensively covers the content outlined in the PPTX template. This video, along with the completed XLSX and PPTX files, makes up the 3 files that should be uploaded to Moodle. The name of the files must be Assignment 1 FIT5217 012345678.xlsx and Assignment 1 FIT5217 012345678.pptx and the video recording Assignment 1 FIT5217 012345678.mp4 where “012345678” is replaced by your own student ID.
Assignment Mark Break Down - Total of 100 Marks.
Please open the PPTX file before reading the following. Here is the breakdown of marks:
• 90 Marks - Presentation:
- If the PPTX file format is altered, 0 marks will be given. No page should be added to or removed from the PPTX file. Not submitting the mp4 incurs a 90% penalty. This is a very strict marking criterion. Not submitting the PPTX file incurs a 10% penalty.
- 45 Marks - Analysis: For each question you will get 9 marks for the analysis of ChatGPT’s failure on the question. This needs to be clearly and explicitly grounded in the findings of 1-2 published papers in the literature, with proper citation of the relevant papers on the slide. You SHOULD NOT use the same reference more than once in your presentation. The 9 marks are given on the basis of “sufficient analysis” (6 marks per question) and “proper reference” (3 marks per question). The reference marks will only be given if the reference is a published paper on arXiv (https://arxiv.org/) or on the ACL Anthology (https://aclanthology.org/).
- 5 Marks - Reporting LLM-ASSESSED-SCORES: On each of the question slides, there is a table in the top-right corner. This is where you fill in the LLM-ASSESSED-SCORES for each question. Remember that the objectivity and novelty scores you report are the average of 3 attempts by each LLM. The inclusion of the table is worth 5 marks in total.
- 5 Marks - Table on last slide: Filling in the table on the last slide of the PPTX, comparing failure/success of the 3 LLMs across 3 runs per LLM.
- 35 Marks - Recording of Presentation: Recordings above 5 minutes or below 4.5 minutes are NOT acceptable and will not get any of the 35 marks. This is a very strict marking criterion. The presentation speed should NOT be adjusted by software (e.g., sped up), but you are allowed to trim pauses in between slides if you want. The presentation needs to:
+ 5 marks: Clearly cover a description of each question and its correct answer on each slide
+ 5 marks: Clearly cover commenting on the novelty and objectivity scores that are given to your question by the LLMs and whether they are/are not aligned with your judgement and why.
+ 5 marks: Clearly cover a description of ChatGPT’s response on each slide
+ 20 marks: Clearly cover the analysis of ChatGPT’s failure on each slide (see the criteria for analysis)
+ Show the last slide (see the table on the last page of the PPTX template) at the end of the recording (no need to explain this)
• 10 Marks - XLSX File:
- This is a binary mark. You will either get 10 marks for a correct submission or 0 marks for missing columns, rows, or cells, or for adjusting the file. No row should be added to or removed from the XLSX file. If no file is submitted, 0 marks will be given for the XLSX File.
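As a sanity check, the components listed above add up to the stated totals. The sketch below simply restates the figures from this breakdown; the dictionary keys and variable names are illustrative labels, not official terms.

```python
# Presentation components, as listed in the breakdown above.
presentation = {
    "analysis": 45,
    "LLM-ASSESSED-SCORES tables": 5,
    "table on last slide": 5,
    "recording": 35,
}
xlsx_file = 10

# Analysis: 5 questions, each with 6 marks for sufficient analysis
# and 3 marks for proper reference.
analysis_total = 5 * (6 + 3)

# Recording: the four marked sub-items.
recording_total = sum([5, 5, 5, 20])

total = sum(presentation.values()) + xlsx_file

print(analysis_total)   # 45
print(recording_total)  # 35
print(total)            # 100
```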
Optional Readings
• LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning
• HellaSwag: Can a Machine Really Finish Your Sentence?
• PIQA: Reasoning about Physical Commonsense in Natural Language
• TruthfulQA: Measuring How Models Mimic Human Falsehoods
• WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale
• Online Resource: Google Big Bench
• GPQA: A Graduate-Level Google-Proof Q&A Benchmark