Introduction to Machine Learning Assessment

Practical analysis with an assessed report. Each student will have access to the same dataset and will also be required to generate simulated data as part of the assessment.

The assessment comprises two parts, each worth 50% of the module grade.  The first part involves writing R code for a machine learning (ML) algorithm and using the code to analyse some simulated data.  The second part requires the use of ML algorithms already implemented in R to analyse a real dataset.  The questions will guide the writing of the R code and the data analysis.

The students will submit one markdown file containing both the code used to generate hypotheses/plots and the comments supporting the various decisions (the comments are limited to approximately 1500 words).

Accompanying this assessment document is an R markdown file that provides guidance for writing the code and structuring the report.  A CSV data file is also provided.

Indicative marks for each part of the assessment are provided below.

Part 1: Constructing a k-nearest neighbour classifier.

You are going to construct a k-nearest neighbour classifier from scratch and will use it to analyse a simulated binary classification problem.

Note: if you get completely stuck when trying to write the code for the classifier, you could make use of an existing implementation of k-nearest neighbours in R, but if you do this you will miss out on the marks associated with writing the code.

a. Using the code provided in the accompanying R markdown file, generate training, validation, and test datasets of sizes 100, 50 and 50 respectively.

Note that you should edit the value of seedNumber so that it is equal to the last 4 digits of your USN.

Once you have simulated these datasets, plot the training data, colouring the points according to their class label. [2 marks]
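
A minimal sketch of such a plot is given below, assuming the simulated training set is a data frame called trainData with numeric columns X1 and X2 and a 0/1 class-label column y (the accompanying R markdown file may use different names):

```r
# Sketch: scatterplot of the training data, coloured by class label.
# Assumes trainData has numeric columns X1, X2 and a 0/1 label column y.
plot(trainData$X1, trainData$X2,
     col  = ifelse(trainData$y == 1, "red", "blue"),
     pch  = 19, xlab = "X1", ylab = "X2",
     main = "Training data coloured by class label")
legend("topright", legend = c("Class 0", "Class 1"),
       col = c("blue", "red"), pch = 19)
```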

b. Let xT be an observation whose class label we would like to predict.  Write an R function named indicesOfKNearestNeighbours to identify the indices of the k nearest neighbours of xT in the training dataset. [8 marks]
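
One possible structure for this function is sketched below, assuming the training features are held in a numeric matrix or data frame trainX and that Euclidean distance is used (both assumptions you are free to change):

```r
# Sketch: return the row indices of the k training points closest to xT.
# trainX: numeric matrix/data frame of training features; xT: numeric vector
# with one entry per feature column. Euclidean distance is assumed.
indicesOfKNearestNeighbours <- function(xT, trainX, k) {
  # Squared Euclidean distance from xT to every training observation
  distances <- rowSums(sweep(as.matrix(trainX), 2, xT)^2)
  # Indices of the k smallest distances
  order(distances)[seq_len(k)]
}
```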

c. Write a function called predictClassLabel that predicts the class label of xT on the basis of the majority vote of its k nearest neighbours. This function should return the predicted class label of xT (which should be 0 or 1). Note that you may assume that we do not encounter tied votes and should state how the issue of tied votes can be straightforwardly avoided when we have a binary classification problem (as in this case). [8 marks]
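
A minimal sketch, reusing the previous function and assuming the training labels are a numeric 0/1 vector trainY aligned with the rows of trainX; choosing an odd value of k is one simple way to rule out tied votes in a binary problem:

```r
# Sketch: majority-vote prediction of the class label of xT.
# Assumes trainY is a numeric 0/1 vector; an odd k avoids tied votes.
predictClassLabel <- function(xT, trainX, trainY, k) {
  neighbourIdx    <- indicesOfKNearestNeighbours(xT, trainX, k)
  neighbourLabels <- trainY[neighbourIdx]
  # Predict 1 if more than half of the k neighbours belong to Class 1
  ifelse(mean(neighbourLabels) > 0.5, 1, 0)
}
```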

d. Using the validation dataset, together with the predictClassLabel function that you just wrote, show how the (validation) accuracy (= correct classification rate) of your classifier changes as you vary k. On the basis of your results, choose an “optimal” value of k and provide a justification for your choice. [8 marks]

Also show how the training accuracy of the classifier changes as you vary k, and explain why we should not use the training accuracy to determine the “optimal” value of k. [3 marks]
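
A sketch of the validation loop, assuming validX/validY hold the validation features and labels (the same loop applied to trainX/trainY gives the training accuracy); only odd values of k are considered so that ties cannot occur:

```r
# Sketch: validation accuracy as a function of k (odd values only).
kValues  <- seq(1, 25, by = 2)
validAcc <- sapply(kValues, function(k) {
  preds <- apply(validX, 1, predictClassLabel,
                 trainX = trainX, trainY = trainY, k = k)
  mean(preds == validY)          # correct classification rate
})
plot(kValues, validAcc, type = "b", xlab = "k", ylab = "Validation accuracy")
```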

e. Using your classifier with the single “optimal” value of k that you just chose, assess the predictive performance of your classifier on the test dataset, reporting the accuracy and confusion matrix. [2 marks]

By making predictions on a grid of values for X1 and X2, visualise the decision boundary and/or decision regions. Show the test data on this plot. [10 marks]
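
One way to approach this is sketched below, assuming testX/testY hold the test data and kOpt is the chosen value of k; the grid resolution and colours are arbitrary choices:

```r
# Sketch: test-set accuracy, confusion matrix, and decision regions.
testPreds <- apply(testX, 1, predictClassLabel,
                   trainX = trainX, trainY = trainY, k = kOpt)
mean(testPreds == testY)                       # test accuracy
table(Predicted = testPreds, Actual = testY)   # confusion matrix

# Predict on a grid spanning the range of X1 and X2 to show decision regions
grid <- expand.grid(
  X1 = seq(min(trainX[, 1]), max(trainX[, 1]), length.out = 100),
  X2 = seq(min(trainX[, 2]), max(trainX[, 2]), length.out = 100))
gridPreds <- apply(grid, 1, predictClassLabel,
                   trainX = trainX, trainY = trainY, k = kOpt)
plot(grid$X1, grid$X2,
     col = ifelse(gridPreds == 1, "mistyrose", "lightblue"),
     pch = 15, cex = 0.5, xlab = "X1", ylab = "X2")
points(testX[, 1], testX[, 2],
       col = ifelse(testY == 1, "red", "blue"), pch = 19)  # overlay test data
```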

f. Create a new function called predictClassProbability that returns the probability of xT belonging to Class 1. Using your predictClassProbability function, together with the roc function from the pROC package, plot an ROC curve for your classifier and report the AUROC. [9 marks]
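
A minimal sketch, assuming the objects defined above; the estimated probability is simply the fraction of the k neighbours that belong to Class 1:

```r
# Sketch: neighbour-vote probability of Class 1, plus ROC curve via pROC.
library(pROC)

predictClassProbability <- function(xT, trainX, trainY, k) {
  neighbourIdx <- indicesOfKNearestNeighbours(xT, trainX, k)
  mean(trainY[neighbourIdx] == 1)   # proportion of neighbours in Class 1
}

testProbs <- apply(testX, 1, predictClassProbability,
                   trainX = trainX, trainY = trainY, k = kOpt)
rocObj <- roc(response = testY, predictor = testProbs)
plot(rocObj)      # ROC curve
auc(rocObj)       # AUROC
```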

Part 2: Analysing scRNA-seq data

The dataset comprises 408 cells (rows). For each cell we have measured the expression of 500 genes (the first 500 columns) and recorded a class label (the final column, “classification”). This is the dataset referred to below as the Cuomo data.

a. Perform a principal components analysis of the data. Produce a 2-d scatterplot of the data showing the projection of the data onto the first two principal components, in which the colour of the points indicates the class labels. [4 marks]
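
A minimal sketch, assuming the CSV has been read into a data frame called cuomoData with the gene expression values in the first 500 columns and the class labels in the final "classification" column:

```r
# Sketch: PCA of the expression data and projection onto the first two PCs.
exprData <- cuomoData[, 1:500]
# Scaling is omitted here because zero-variance genes (see part d) would make
# prcomp(scale. = TRUE) fail; you may prefer to remove them and then scale.
pca <- prcomp(exprData, center = TRUE, scale. = FALSE)
plot(pca$x[, 1], pca$x[, 2],
     col = as.numeric(as.factor(cuomoData$classification)), pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "Projection onto the first two principal components")
```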

b. Provide a 2-d visualisation of the data using tSNE. Consider values of the perplexity equal to 1, 5, and 50. Which of these values for the perplexity do you think is most sensible and why? Briefly comment on the differences between the results obtained using PCA vs tSNE. Based on your results, which do you prefer and why? [6 marks]
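
A sketch using the Rtsne package (one possible choice of implementation); tSNE is stochastic, so the seed is fixed for reproducibility:

```r
# Sketch: tSNE embeddings at three perplexities, coloured by class label.
library(Rtsne)

set.seed(seedNumber)
for (perp in c(1, 5, 50)) {
  tsneOut <- Rtsne(as.matrix(exprData), perplexity = perp,
                   check_duplicates = FALSE)
  plot(tsneOut$Y[, 1], tsneOut$Y[, 2],
       col = as.numeric(as.factor(cuomoData$classification)), pch = 19,
       xlab = "tSNE 1", ylab = "tSNE 2",
       main = paste("Perplexity =", perp))
}
```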

c. Identify highly correlated variables. How many do you find? [2 marks]

d. Identify zero and near zero variance variables. How many do you find? [2 marks]
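
Parts (c) and (d) can both be tackled with helper functions from caret, as sketched below; zero-variance genes give NA correlations, so they are identified first and excluded before counting the highly correlated variables (the 0.9 cutoff is an assumption you should choose and justify yourself):

```r
# Sketch for parts (c) and (d) using caret.
library(caret)

nzvIdx <- nearZeroVar(exprData)    # zero and near-zero variance variables (part d)
length(nzvIdx)

# Drop the near-zero variance genes before computing correlations (part c)
exprReduced <- if (length(nzvIdx) > 0) exprData[, -nzvIdx] else exprData
corMatrix   <- cor(exprReduced)
highCorrIdx <- findCorrelation(corMatrix, cutoff = 0.9)
length(highCorrIdx)                # number of highly correlated variables
```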

e. Split the data into training and test datasets using a 70/30 train/test split. [2 marks]
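
A minimal sketch using caret::createDataPartition, which stratifies the split by the class label; caret expects a factor outcome for classification, so the label column is converted first:

```r
# Sketch: stratified 70/30 train/test split.
library(caret)
cuomoData$classification <- as.factor(cuomoData$classification)

set.seed(seedNumber)   # fix the seed so the split is reproducible
trainIdx   <- createDataPartition(cuomoData$classification, p = 0.7, list = FALSE)
cuomoTrain <- cuomoData[trainIdx, ]
cuomoTest  <- cuomoData[-trainIdx, ]
```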

f. Using caret, train a random forest classifier on the Cuomo training data. What preprocessing should be performed and why? Note that you may fix the ntree parameter to 100 (i.e. no need to optimise/tune this parameter), but you should optimise mtry. Report the value of mtry that you find to be optimal. Assess the predictive performance of the resulting classifier on the test dataset, reporting the accuracy and confusion matrix. [10 marks]
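
A sketch of one possible workflow is given below (it needs the randomForest package installed); removing zero-variance predictors via preProcess = "zv" and the particular mtry grid are assumptions that you should justify or change:

```r
# Sketch: random forest with caret, fixed ntree = 100, tuning mtry by 5-fold CV.
set.seed(seedNumber)
rfFit <- train(classification ~ ., data = cuomoTrain,
               method     = "rf",
               ntree      = 100,                      # fixed, as allowed
               preProcess = "zv",                     # drop zero-variance genes
               tuneGrid   = data.frame(mtry = c(5, 10, 22, 50, 100)),
               trControl  = trainControl(method = "cv", number = 5))
rfFit$bestTune$mtry                                   # optimal mtry

rfPreds <- predict(rfFit, newdata = cuomoTest)
confusionMatrix(rfPreds, cuomoTest$classification)    # accuracy + confusion matrix
```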

g. Produce a dotchart showing the importance of the 10 most important variables, as measured by the fitted random forest. [4 marks]
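
A minimal sketch using caret's varImp on the fitted model:

```r
# Sketch: dotchart of the 10 most important variables from the fitted forest.
rfImp <- varImp(rfFit)$importance
top10 <- rfImp[order(rfImp$Overall, decreasing = TRUE)[1:10], , drop = FALSE]
dotchart(rev(top10$Overall), labels = rev(rownames(top10)),
         xlab = "Variable importance", pch = 19)
```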

h. Using caret, train a k-nearest neighbour classifier on the Cuomo training data. What is the optimal value for k? Assess the predictive performance of the resulting classifier on the test dataset, reporting the accuracy and confusion matrix. [6 marks]
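
A sketch mirroring the random forest workflow; centring and scaling are assumed here because k-nearest neighbours is distance-based, and only odd values of k are tried:

```r
# Sketch: kNN with caret, tuning k by 5-fold CV.
set.seed(seedNumber)
knnFit <- train(classification ~ ., data = cuomoTrain,
                method     = "knn",
                preProcess = c("zv", "center", "scale"),
                tuneGrid   = data.frame(k = seq(1, 25, by = 2)),
                trControl  = trainControl(method = "cv", number = 5))
knnFit$bestTune$k                                  # optimal k

knnPreds <- predict(knnFit, newdata = cuomoTest)
confusionMatrix(knnPreds, cuomoTest$classification)
```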

i. Using the results of your previous principal components analysis, consider the projection of the Cuomo dataset into 2-dimensions. Split the resulting 2-dimensional dataset into training and test datasets (note: this should be the same train/test split that you used earlier). Using caret, train a k-nearest neighbours classifier on the resulting 2-dimensional training dataset, finding an optimal value for k. Assess the predictive performance of the resulting classifier on the (2-d) test dataset, reporting the accuracy and confusion matrix. [6 marks]
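
One possible sketch, reusing the PCA scores and the train/test indices from the earlier steps:

```r
# Sketch: kNN on the first two principal components, same train/test split.
pcData <- data.frame(PC1 = pca$x[, 1],
                     PC2 = pca$x[, 2],
                     classification = cuomoData$classification)
pcTrain <- pcData[trainIdx, ]
pcTest  <- pcData[-trainIdx, ]

set.seed(seedNumber)
knnFit2d <- train(classification ~ ., data = pcTrain,
                  method    = "knn",
                  tuneGrid  = data.frame(k = seq(1, 25, by = 2)),
                  trControl = trainControl(method = "cv", number = 5))
knnFit2d$bestTune$k                                # optimal k on the 2-d data

knn2dPreds <- predict(knnFit2d, newdata = pcTest)
confusionMatrix(knn2dPreds, pcTest$classification)
```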

j. Compare and contrast the predictive performances achieved using: (1) the random forest classifier; (2) the k-nearest neighbour classifier (trained on the original, high-dimensional data); and (3) the k-nearest neighbour classifier trained on the 2-dimensional dataset that was obtained using PCA. Which classifier(s) have the best/worst predictive performance? Why do you think this is the case? [8 marks]

Students should submit

● One markdown file. All code, plots and comments will be embedded in the markdown file. The maximum number of words is 1500. There is a maximum limit of 10 plots. The resulting knitted PDF or HTML file should also be uploaded.

● The submitted R code (embedded in the markdown file) will be used to assess the reproducibility of the analysis; the code should run without errors (compilation or runtime). The packages used for the analysis should be clearly listed at the top of the markdown file.

A fully-answered question will demonstrate that students can do all of the following:

● apply at least two ML approaches to answer a scientific question

● critically assess the suitability of the models (e.g. in terms of prerequisites, convergence)

● summarise appropriate outputs from the models

● clearly explain the scientific findings from the analysis.

Marking Rubric:

Part 1 – coding
● Refer (<60%): Code that has little relation to the proposed analysis and model.
● Pass (60–69%): Mostly correct implementation of the analysis and model, but with errors in minor aspects that don’t affect the main conclusions.
● High Pass (70–74%): Code that is technically correct but moderately lacking in structure.
● Distinction (>75%): Clearly written, accurate code implementing all required aspects of the analysis and model.

Part 2 – real data analysis
● Refer (<60%): Code that has little relation to the proposed analysis and model.
● Pass (60–69%): Mostly correct implementation of the analysis and model, but with errors in minor aspects that don’t affect the main conclusions.
● High Pass (70–74%): Code that is technically correct but moderately lacking in structure.
● Distinction (>75%): Clearly written, accurate code implementing all required aspects of the analysis and model.

Discussion and interpretation of results
● Refer (<60%): Wrong quantities summarised, or fundamentally incorrectly summarised.
● Pass (60–69%): Mostly correct quantities summarised.
● High Pass (70–74%): Correct quantities summarised and correctly interpreted.
● Distinction (>75%): All appropriate quantities summarised accurately, with an insightful explanation of their interpretation.

Visualisation
● Refer (<60%): Incorrect or uninterpretable visualisation.
● Pass (60–69%): Interpretable visualisation that conveys most of the desired information.
● High Pass (70–74%): Correct visualisation that conveys the desired information.
● Distinction (>75%): Correct and clear visualisation that conveys the desired information, with full and accurate labels, annotations and legends.
