INF2167 Assignment #2: Cross-Validation in Machine Learning
Task Overview
You are a data scientist working for a business called ShopCo, a retailer that sells a trendy toy (Labubu) both online and in brick-and-mortar stores. The VP of Sales wants to understand the drivers of weekly units sold and asks you to compare linear vs. non-linear ML models using a train/test workflow. Leadership would like to see evidence of model validation, because they plan to use the model's results to inform subsequent inventory and staffing decisions.
• The VP of Sales is knowledgeable about statistics
• They ask you to put together a slideshow outlining your methodological approach and results
• You are instructed to submit a .pdf copy of the slides ahead of a meeting to give stakeholders time to review your work.
In your slides, focus on telling a story about interesting relationships that you find in the data, rather than describing the analysis process itself. This means that you can feel empowered to only show select code (or no code, depending on your style), and that you should omit visuals / analyses that are not relevant to the narrative.
Please submit two files on Quercus:
1. A .pdf copy of a slide deck generated through RStudio showing your methodological approach and results
2. The .Rmd file that you used to generate the slides
Data Overview
The goal is to predict units sold per week using the other variables provided in the dataset. Which variables you include and how you build the model is your choice.
I am including an overview of the different variables in the dataset here for your convenience:
| Variable | Type | Description | Example Values |
| --- | --- | --- | --- |
| week_id | Numeric (int) | Week number (1–104) used to simulate seasonal variation across ~2 years. | 17, 83 |
| region | Factor (3 levels) | Geographic region of ShopCo’s operations. Captures local economic differences. | East, West, Central |
| season | Factor (2 levels) | Indicates whether the observation occurred during the Holiday season (e.g., Nov–Dec) or the Regular sales period. | Holiday, Regular |
| channel | Factor (2 levels) | Sales channel (Online or Store), capturing the mode of customer engagement. | Online, Store |
| price | Numeric (continuous) | Average retail price (in dollars) of the focal product for that week. Higher prices generally reduce units sold. | 18.75, 24.30 |
| discount_rate | Numeric (0–0.5) | Fractional discount applied to the product that week (0 = no discount, 0.5 = 50% off). Expect non-linear effects. | 0.10, 0.25 |
| ad_spend | Numeric (continuous) | Marketing and advertising budget for that week (USD). Exhibits diminishing returns on sales. | 12500, 37000 |
| site_speed_ms | Numeric (continuous) | Average website load time in milliseconds (ms). Very slow sites reduce sales, but ultra-fast ones bring little extra gain. | 900, 1500 |
| loyalty_score | Numeric (0–10) | Index reflecting average customer loyalty or membership engagement. Higher scores = more repeat customers. | 4.8, 9.2 |
| competitor_price | Numeric (continuous) | Average price of a similar competitor product that week. If competitors charge more, ShopCo tends to sell more. | 22.50, 19.75 |
| units_sold | Numeric (integer, DV) | Dependent variable: total number of units sold that week. This is what students aim to model and predict. | 370, 890 |
| revenue | Numeric (continuous) | Derived variable: total revenue = units_sold × price × (1 - discount_rate). | 6875.40, 19890.25 |
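As a quick check of the revenue definition above, the formula can be applied to a single week in R. The values below are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the dataset. Because revenue is computed directly from units_sold, treating it as a predictor of units_sold would feed the outcome back into the model, which is worth keeping in mind when choosing variables.

```r
# Hypothetical week (illustrative values, not rows from the assignment data).
week <- data.frame(units_sold = 400, price = 20.00, discount_rate = 0.25)

# revenue = units_sold x price x (1 - discount_rate)
week$revenue <- week$units_sold * week$price * (1 - week$discount_rate)

print(week$revenue)  # 400 * 20 * 0.75 = 6000
```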
Presentation Content
There is no single “best way” to set up a presentation. Aim to make the slides visually pleasing and easy to follow.
Your presentation should include:
• A title slide
• An introduction / overview of the analysis goals
• Some brief information about exploratory data analysis (e.g., histograms, summary tables)
• A description of how models were built, and why you selected certain predictors
• A description of which predictors you model with straight-line (linear) terms vs. polynomial terms, and why
• A description of how the data are split into a training set and a test set, and information about cross-validation
• Evaluation of model performance
• Interpretations of the results and their business implications: what do the results from the modeling approach tell us?
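The modeling items in the list above (linear vs. polynomial terms, a train/test split, cross-validation, and evaluation) can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not a prescribed solution. The data are simulated because the assignment dataset is not reproduced here, and the data frame name `shop`, the helpers `rmse` and `cv_rmse`, the 80/20 split, the 5 folds, and the particular predictors and transformations are all choices made for the example.

```r
# Simulated stand-in data (ASSUMPTION: the real dataset is not used here).
set.seed(2167)
n <- 300
shop <- data.frame(
  price         = runif(n, 15, 30),
  discount_rate = runif(n, 0, 0.5),
  ad_spend      = runif(n, 5000, 40000)
)
# Hypothetical data-generating process, chosen only so the example runs.
shop$units_sold <- round(
  900 - 20 * shop$price +
    1500 * shop$discount_rate - 1800 * shop$discount_rate^2 +
    60 * log(shop$ad_spend) + rnorm(n, sd = 40)
)

# Hold out 20% as a test set that is not touched until the final evaluation.
test_idx <- sample(n, size = round(0.2 * n))
train <- shop[-test_idx, ]
test  <- shop[test_idx, ]

# Candidate models: straight-line terms vs. polynomial / log terms.
formulas <- list(
  linear     = units_sold ~ price + discount_rate + ad_spend,
  polynomial = units_sold ~ price + poly(discount_rate, 2) + log(ad_spend)
)

# Root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# k-fold cross-validation: fit on k - 1 folds, score on the held-out fold,
# and average the fold-level RMSE values.
cv_rmse <- function(model_formula, data, k = 5) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  fold_rmse <- sapply(seq_len(k), function(i) {
    fit <- lm(model_formula, data = data[folds != i, ])
    held_out <- data[folds == i, ]
    rmse(held_out$units_sold, predict(fit, newdata = held_out))
  })
  mean(fold_rmse)
}

# Cross-validate each candidate on the training set only.
cv_results <- sapply(formulas, cv_rmse, data = train)

# Refit the candidate with the lowest CV error and score it once on the test set.
best <- names(which.min(cv_results))
final_fit <- lm(formulas[[best]], data = train)
test_rmse <- rmse(test$units_sold, predict(final_fit, newdata = test))

print(round(cv_results, 1))
cat("Selected:", best, "| test RMSE:", round(test_rmse, 1), "\n")
```

Running cross-validation inside the training set means the test set is used only once, so the test RMSE remains an honest estimate of out-of-sample error.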