Hello, if you have any need, please feel free to consult us, this is my wechat: wx91due
DATA1002 / 1902 - Informatics: Data and Computation
2024 Sem2
Group Project Stage 3
THE PROJECT WORK FOR THIS STAGE:
Task |
Description |
Group/ individual
|
The models created in this stage must all be predicting (in different ways) one common attribute in the one common dataset. You are allowed to use a dataset you already have from Stage 1 or 2, but you are equally free to change dataset and even domain. There are no requirements for particular origin or volume in the dataset for this stage but note that many machine learning techniques do not work well unless the dataset is quite clean. We have made available a dataset (on a topic of our choice) and any group can use that data instead, if they prefer. We recommend that you do some preliminary data analysis to convince yourself that there is some relationship between the other attributes and the one you are going to predict (otherwise predictions will not be very effective). You also need to choose how you will measure the effectiveness of predicting; we recommend that you use one of the measures that is built-in for scikit-learn to calculate, given the test data and the predictions made for those items. For higher levels than pass, you need more than one measure that you will calculate on each model. | ||
1 |
Identify an attribute that you will all make predictions about and find a dataset that contains this attribute. The attribute you are predicting may be quantitative or nominal |
Group |
2 |
Decide on the measure of success for the predictive models you will be producing. You will need to justify your choice of measure. |
Group |
3 |
Divide the dataset into a training set and a test set. We suggest having at least one-tenth of the original dataset in the test dataset. |
Group |
4 |
Coordinate in choosing the methods you will use, to each produce a predictive model for this attribute, using the training dataset (the coordination is needed to avoid duplication between members, and to enable a good conclusion for your report). |
Group |
Each member needs to produce one predictive model, that will predict the chosen attribute from the values of some or all the other attributes. Details are in the marking scheme below. It is required that all the members have different ways to produce their predictive model. So you need to coordinate among the members, in case two members want to do the same approach, one at least will need to change (a bit – maybe you can each use the same general training technique, but scale the data attributes differently, or use a different subset of the input attributes, etc.)!
Each member then needs to work with the training set and the test set, to produce the material for their section in Part A of the report. This will involve writing Python code (we recommend using scikit-learn) to produce a predictive model based on the training set, and then running the model on the test set and calculating the agreed metric for how good the predictions are. Part A needs to include the code you each write, higher levels of mark require additional discussion and explanation (as indicated in the marking scheme) |
||
5 |
Use Python (for example, the scikit-learn library) to produce a predictive model for the chosen attribute, from the training dataset, using the kind of model and the training method, which was allocated to you by the group. If your method for training has hyper-parameters, you should adjust them as well as possible, but only using parts of the training dataset in doing so [You must not use any of the test dataset for this.]
|
Individual |
6 |
Evaluate the quality of the predictive model you produced, in terms of the measure of success that the group chose.
|
Individual |
7 |
Write your section in Part A of the report, in which you present the work you have done individually.
|
Individual |
Working together as a group, you need to write up a presentation of what you have found about machine learning approaches. This needs to be written to communicate with readers whose focus is data science, in particular, they want to learn more about machine learning and when different approaches work well or not. We realise that your models are likely to be limited, and indeed it may be that none of the models you produced give good predictions– that’s ok, just be honest in saying what you found.
The structure of the report is described in the submission section below. The structure will serve as the basics for grading for this project. From the combined document, you need to produce a PDF. As well, there needs to be a file which compresses a folder, within which are subfolders for each member, the subfolders contain the dataset the member worked with, and the code or spreadsheet for producing their analysis (both summaries and charts). One person submits both PDF and zipped folder, to the submission links on Canvas, on behalf of the whole group. Every member of the group will get the marks earned by the combined submission. |
||
8 |
Write Part B of the report, that discusses the different models and their strengths and weaknesses. This should be written for a reader who is interested in machine learning.
|
Group |
9 |
Produce a PDF of the whole report, with all individual sections and the jointly written Part B and produce the compressed folder with all the data and code from each member. Submit it all.
|
Group
|
SUBMISSION FOR STAGE 3
Deliverable |
Description |
Report |
|
The report should have a front page, that gives the group name, and lists the members involved (giving their SID and unikey, not their name), and then the body of the report has two parts as follows:
|
|
Part A |
|
1. There is an initial section which briefly states the domain and the dataset you are using, and which attribute will be predicted. It also indicates how you split this into training and test data. This section is not marked as such, it is just so the marker can understand the setting for the rest of the report. (max 1 page)
2. There should be one section for each member (the section should state the SID/unikey of the group member who did the work reported in this section, max 4 pages per individual). In this section, there should be some subsections.
|
|
Part B |
|
3. This section is jointly written by the group. It is written for readers who are interested in machine learning In it, you describe the different ways the members produced predictive members, and comment on the evaluations, to draw conclusions about the benefits of the different approaches (see the marking scheme for more guidance on what is expected here). (max 2 pages)
|
|
Write whatever is needed to show the reader that you have earned the marks, and don’tsay more than that! In most cases, the code to produce a summary or chart will be fairly short (a few dozen lines at most), and the evaluation of a chart should not take more than a half- page.
|
|
|
|
Data and Code
|
This should be submitted through the Canvas system, as a single zip or tar.gz file. So you should put have a single folder, with subfolders for each member. The subfolder for a member should contain the Python code to calculate a predictive model and calculate some measure of effectiveness of the model (as well, if you have done any further transforms on attributes before training/testing, the code for these should also be part of what is in your folder). Compress the top folder (with all these subfolders and their contents), then submit the single compressed file.
|
MARKING
M1: PRODUCING PREDICTIVE MODELS [2 POINTS] - INDIVIDUAL
This component is assessed based on the corresponding subsections of each separate member section in Part A of the report; the uploaded data and code may be checked by the marker as supporting evidence for claims made in the report.
Full marks |
In addition to the satisfaction of the Distinction criteria, there is a clear explanation and an argument for why this is a reasonable approach to consider for the task. This discussion should go well beyond simply reporting that the model predicts well, to argue that one could reasonably hope that it might be good, in several ways.
|
Distinction |
In addition to the satisfaction of the Pass criteria, you have at least two methods and one of the methods used that goes beyond what is covered in the Ed ML module, i.e. a more sophisticated preparation and use of the relevant ML method.
|
Pass |
You applied Python on the agreed training dataset, such that you produced a predictive model for the agreed attribute; The code that you wrote to produce the model (including doing any preliminary attribute transformations) must be explicitly shown in the report. The ways in which you produced your models should all be different from other members in the team (this could be different algorithmic training techniques, different choice of hyper-parameters, different scaling, or choice of input attributes, etc).
|
Flawed |
Some predictive model is produced using Python. |
This component is assessed based on the corresponding subsections of each separate member section in Part A of the report; the uploaded data and code may be checked by the marker as supporting evidence for claims made in the report.
Full marks |
In addition to the satisfaction of the Distinction criteria, there is a reasonable discussion relating the outcome of the measurements to the nature of the training approach, characteristics of the dataset and any transformations done.
|
Distinction |
In addition to the satisfaction of the Pass criteria, each member has correctly reported on more than one measure of performance of the model on the test dataset; the code that does this measurement must be explicitly shown in the report. Also, for each approach there is a sensible discussion of the interpretation of the measurements (for example, whether it is indicating overfitting or underfitting).
|
Pass |
Each member has correctly reported on some measure of performance of the model on the test dataset; the code that does this measurement must be explicitly shown in the report. The ways in which the various members’ models are produced should all be different from one another (this could be different algorithmic training techniques, different choice of hyper-parameters, different scaling, or choice of input attributes, etc).
|
Flawed |
Some reasonable attempts to evaluate the effectiveness of some of the predictive models or no accompanied supporting code & data.
|
M3: CONCLUSION [1 POINT] - SHARED
Full marks |
In addition to the satisfaction of the Distinction criteria, the Conclusion section also discusses honestly and with insight, the limitations and uncertainties about the comparisons made between different machine learning techniques (for example, what are limitations of the measurements which were used). It draws the reader in and engages their attention with vivid and stylish prose.
|
Distinction |
In addition to the satisfaction of the Pass criteria, the Conclusion section provides useful insight into strengths and weaknesses of the different machine learning methods for this task. It also indicates features of the dataset that impact on the outcomes. It clearly links to the readers’ background and aims. The structure needs to be logical and well-organised.
|
Pass |
The Conclusion section provides some accurate and clear information about the machine learning techniques that were used for this task, and how the resulting predictive models performed.
|
Flawed |
The Conclusion section describes the machine learning techniques that were used.
|
GROUP ISSUES
Rules
- If there is any group of 4 where all the members are happy to keep the group unchanged, then it will not be forced to change.
- Any group of 3 members from Stage 2, can choose to stay together; however, they may receive an extra person joining the group for Stage 3.
- If for any reason any members in a group want to leave, then they should inform the tutor at the start of week 10 lab, by not joining their former group.
- If someone who wants to leave a group after week 10, they need to urgently contact their lab tutor1 , naming the lab and group they wish to leave. The lab tutor will endeavor to form groups of the proper size, by combining people who have left groups, and/or by adding such people to existing groups with less than 4 members.
- If several people (from a previous group) all want to leave that group but stay together with one another, then they can let the tutor know; the tutor will try to achieve this, but it is not guaranteed.
- If someone wants to join a specific existing group which less than four members, they should tell the tutor, but again this can’t be certain.
Sometimes, people ask to have someone else removed from a group (usually, for non- contributing in Stage 1 or 2). This is not allowed. Instead, the people who are unhappy with someone, can choose to leave the group themselves (as described above), thus leaving the other person in the former group.
Group process
DISPUTE RESOLUTION
- You need to inform your lab tutor2 .
- Make sure that your email names the group, and is explicit about the difficulty
- also make sure this email is copied to all the members of the group (including anyone you are complaining about).
- If necessary, the coordinator will split a group, and leave anyone who didn’t participate effectively, in a group by themselves (they will need to achieve all the outcomes on their own).
- This option is only available up until Monday of week 11, which is the last day with time to resolve the issue before the due date.
- For any group issues that arise after Monday of week 11, you will need to try to resolve theproblem on your own, and you will continue to be treated as a single group which all get the same mark for the group component at this Stage, based on whatever is submitted (though you should still let the coordinator know about them).