2023/2024
Machine Learning and Forecasting (BUSN9108)
Plagiarism: The electronic version will be analysed using the Turnitin plagiarism-detection software. You will have the opportunity to submit drafts of your report to Turnitin where you will be able to see the originality report (but you may have to wait until the next day to see this). There will thus be no excuse for including any unreferenced material from other sources. You will probably include tables and graphs; it is essential that you reference the source at the bottom of every such entry. Plagiarism is most common when students copy and paste descriptive material from some source (and remember that doing this and then changing the occasional word still constitutes plagiarism). Please also see
https://www.kent.ac.uk/education/academic-integrity/guide-for-students/what-is-plagiarism
Please note that a report that consists mainly of material you have not written yourself is not going to be acceptable, even if it is properly referenced. We are interested in what you can write about the information you have found.
Please ensure the similarity score of your report is no more than 25%.
We advise that you keep a copy of this assignment.
Submission methods
• Only submit your PDF-formatted report to the turnitin system on Moodle;
• Do not print your Word-formatted file to a PDF file. Instead, you should save your Word-formatted file to a PDF file. The turnitin system does not accept a printed PDF file;
• Do not copy and paste any part of this document onto your report;
• Submit your work via moodle.kent.ac.uk by 20:00 pm, 22nd February 2024 (Hong Kong time).
Total marks: 100%
Weighting: 100%
Submission method
• Only submit your PDF-formatted report to the turnitin system on Moodle;
• Do not print your Word-formatted file to a PDF file. Instead, you should save your Word-formatted file to a PDF file. The reason is that the turnitin system does not allow you to upload a printed PDF file.
• Do not copy and paste this page onto your report.
Length • Your answer sheet should include up to 10 pages (excluding the cover page). The font size is 11. If it is over ten pages, the markers reserve the right not to mark the extra pages.
Attention • Any uninterpreted figures, tables, or equations will not be marked.
• Do not simply copy and paste SPSS (or Weka) interface/outputs onto your assignment. The focus of this assignment is not on how well you use the SPSS (or Weka).
• You can use any data analysis software package in answering the questions in this assignment.
Question 1.
Download dataset Box-Plot_Dataset.sav from module BUSN9108 on moodle.kent.ac.uk. Answer the following three questions.
(1).Use a boxplot to compare the Miles_Per_Gallon of the three groups Origin=1, 2, 3. Provide two of your findings. [3 marks]
(2).Plot the Q-Q plot of Acceleration to check if it follows the normal distribution. What is your conclusion? [4 marks]
(3).Use both the Kolmogorov-Smirnov(K-S) test and the Shapiro-Wilk (S-W) test to test whether the variable Acceleration is normally distributed. What is your conclusion and why? [4 marks]
Question 2.
A company surveyed how frequently consumers of various age groups use two different payment methods when making purchases. Sample data for 3000 customers shows the results by four age groups.
(1).Test for the independence between method of payment and age group. Formulate the problem statistically by posing it as a hypothesis test. What is the p-value? Using α =0.05, what is your conclusion? [4 marks]
(2).Assume the payment is an ordinal variable and Method 1 is better than Method 2. What is the γ association measure? Please keep 4 decimal places. [4 marks]
Question 3.
A dataset was split into two subsets: training set and test set. Two regression models with the same number of independent variables were built based on the training dataset. Their AIC values on the training dataset and the test dataset are given below (AIC=Akaike Information Criterion). [8 Marks]
Which model will you select and why? Discuss it.
Question 4.
Download dataset Regression_Dataset.sav from module BUSN9108 on moodle.kent.ac.uk. The dataset is sampled from a dataset in the UCI data bank, which implies that the dataset is a subset of the original dataset. You may find the description of the original dataset on https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data .
(1).Develop a linear regression model with variable critical_temp as the dependent variable and the other variables as independent variables (you may select some of them). Provide your modelling process and final model. [18 marks]
(2).Discuss the performance of your model. [3 marks]
Question 5.
Download dataset Classification_Dataset.sav from module BUSN9108 on moodle.kent.ac.uk. The dataset is sampled from a dataset in the UCI data bank, which implies that the dataset is a subset of the original dataset. You may find the description of the original dataset on https://archive.ics.uci.edu/ml/datasets/seismic-bumps.
(1).Develop a logistic regression model with variable class as the dependent variable and the other variables as independent variables when the variable selection method is Backward: Ward. Provide your modelling process. [10 marks]
(2).Provide your final model and interpret the coefficient of a categorical variable (i.e., a nominal or ordinal variable) and that of a continuous variable (i.e., a scale variable) in your model, respectively. [7 marks]
Question 6.
Download dataset credit-card-clients.sav from module BUSN9108 on moodle.kent.ac.uk. The dataset is sampled from a dataset in the UCI data repository, which implies that the dataset is a subset of the original dataset. You may find the description of the original dataset on https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients. Use three modelling methods to build three classification models with variable Y as the dependent variable and the other variables as independent variables (you may select some of them), respectively. Please note: (A) Logistic regression should not be used in answering this question, and (B) if you want to use Decision trees, you can only use one of the three decision trees: QUEST, CRT and CHAID.
Answer the following questions.
(1).Provide your modelling process for each method. [10 marks]
(2).Select the modelling method with the best performance and explain how you will use it in the future. [3 marks]
Question 7.
A construction equipment company, CX, started its business about 100 years ago. Recently, one of its product, compact excavator, has become the leading product of CX, but the sales team has suffered with repeated over productions and under productions, due to wrong sales forecast. The production team needs to schedule how many compact excavators need to be produced one month in advance. See dataset TimeSeries_Dataset.xlsx from module BUSN9108 on moodle.kent.ac.uk for the sales data.
Ms Chan, the senior manager of the sales team, is very anxious with the accuracy of the current sales forecasting model, which uses the moving average of the last three months. The sales team was quite puzzled by the great variability in the sales every month, and the current forecast model seems to be too simplistic and could not justify why it averages the last three periods. Some of her sales team suggested that the historical data for the number of sales might contain seasonal dependencies, but they have no idea how to confirm the existence of the seasonal patterns and how to model this feature. Some others also suggested using a regression model, but they could not find any relevant external factors that could explain the behaviour of the sales time series, so they decided to enhance the forecast based on the historical time series data at this stage.
(1).Before any modelling, visually analyse each of the systematic patterns (e.g. trend and seasonality) in the time series and discuss their existence or/and patterns. [3 marks]
(2).Divide the data into the training dataset/period (up to the end of 2015) for estimating forecast models, and the test (hold-out) dataset/period (Jan-2016 onward) to evaluate your model forecasts. Develop a Holt’s exponential smoothing (HES) model and a Holt-Winter’s model. When estimating the model parameter(s), you are suggested to use the MSE(mean squared error) in the fitting period only.
a) Present the two models, respectively. Note: you need to provide the mathematical models. [5 marks]
b) Based on Part a), which model do you recommend finally and why? [2 marks]
Question 8.
Download dataset Cluster-analysis_Dataset.sav from module BUSN9108 on moodle.kent.ac.uk. The dataset contains the following variables measuring the geometric parameters of a kind of plant:
v1. area A,
v2. perimeter P,
v3. length of kernel,
v4. width of kernel,
v5. asymmetry coefficient
v6. length of kernel groove
Answer the following questions.
(1).Use Ward’s method to determine the number of clusters and explain the reason, [3 marks]
(2).Based on the number of clusters determined from the above step, use the k-means clustering method to cluster the observations and interpret the outcome. [3 marks]
Question 9.
Download dataset FactorAnalysis_Dataset.sav from module BUSN9108 on moodle.kent.ac.uk. This dataset contains responses to a questionnaire on factors related to the quality of a public place. Each observation represents a response from a user. Answer the following questions.
(1).How many factors do you select and how do you select them? [2 marks]
(2).What is the cumulative percentage of variance accounted for by your selected factors? Interpret it. [2 marks]
(3).If you use rotation method “Varimax” and use extraction method “Principal component analysis”, which variables are your factor(s) associated with? [2 marks]