MATH253 Week 10 Tutorial
R Tutorial
This tutorial sheet is related to material covered in chapter 12 (how to do tests/calculations/plots from this chapter in R).
Solutions will be available on Canvas on Friday at 5pm.
Part A
In an experiment to investigate the performance of a multi-user computer system, the following data were collected, consisting of observations on the average time taken (y, in seconds) for each terminal to complete a particular task when the same task was submitted simultaneously to x terminals.
x | 40 | 50 | 60 | 45 | 40 | 10 | 30 | 20 | 50 | 30 | 65 | 40 | 65 | 65
y | 9.9 | 17.8 | 18.4 | 16.5 | 11.9 | 5.5 | 11.0 | 8.1 | 15.1 | 13.3 | 21.8 | 13.8 | 18.6 | 19.8
The data can be found on Canvas in the file Tutorial10 timings.xlsx.
- Download the file Tutorial10 timings.xlsx to your computer into a folder dedicated to R. Make sure that this folder is set up as your working directory in RStudio.
- In RStudio open a new R script.
- Load the file Tutorial10 timings using the readxl package, creating a variable called timeDF. (See Tutorial 2 for details on how to load data using the readxl package.)
- Make sure you save your R script in the folder dedicated to R. It is a good idea to keep saving it after each task you complete.
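The loading step can be sketched as follows. This is only a sketch: it assumes the file is named exactly as written on this sheet and sits in the working directory, and that its columns are called Terminals and Time (the names used later in this tutorial). If the file or the readxl package is not available, the sketch falls back to typing in the 14 observations from the table above, so the rest of Part A can still be followed.

```r
# File name as written on the sheet; adjust it if your downloaded copy is named differently
path <- "Tutorial10 timings.xlsx"

if (requireNamespace("readxl", quietly = TRUE) && file.exists(path)) {
  # Read the first sheet of the Excel file into a data frame
  timeDF <- as.data.frame(readxl::read_excel(path))
} else {
  # Fallback (assumption): the same data typed in from the table in Part A
  timeDF <- data.frame(
    Terminals = c(40, 50, 60, 45, 40, 10, 30, 20, 50, 30, 65, 40, 65, 65),
    Time = c(9.9, 17.8, 18.4, 16.5, 11.9, 5.5, 11.0, 8.1, 15.1, 13.3,
             21.8, 13.8, 18.6, 19.8)
  )
}

# Quick check of what was loaded: column names, types and first values
str(timeDF)
```

Running str(timeDF) straight after loading is a cheap way to confirm that both columns arrived as numbers and that there are 14 rows.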
2. Carry out a Simple Linear Regression analysis for these data as follows. Create new variables for the columns in timeDF as follows:
term <- timeDF$Terminals
time <- timeDF$Time
Use the command plot to create a scatterplot with term on the horizontal axis and time on the vertical axis. From your scatterplot, would you say that a straight line provides a reasonable model for the given data?
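One possible way to draw the scatterplot is sketched here. The data are typed in from the table in Part A so that the example runs on its own; the axis labels and the title are illustrative choices, and the correlation at the end is an optional numerical companion to the plot rather than something the task asks for.

```r
# Data typed in from the table in Part A (stands in for the Excel file)
timeDF <- data.frame(
  Terminals = c(40, 50, 60, 45, 40, 10, 30, 20, 50, 30, 65, 40, 65, 65),
  Time = c(9.9, 17.8, 18.4, 16.5, 11.9, 5.5, 11.0, 8.1, 15.1, 13.3,
           21.8, 13.8, 18.6, 19.8)
)
term <- timeDF$Terminals
time <- timeDF$Time

# Scatterplot: explanatory variable on the horizontal axis, response on the vertical axis
plot(term, time,
     xlab = "Number of terminals (x)",
     ylab = "Average time in seconds (y)",
     main = "Time against number of terminals")

# Optional: sample correlation between x and y
r <- cor(term, time)
r
```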
To fit the model we use the command lm, and to print the results we use the command summary. So, we run the following code:
lin <- lm(time ~ term, data=timeDF)
lin
summary(lin)
Note that in lm we first define the column of the response variable y, which in our case is time. This is followed by the tilde symbol ~ and then the column of the explanatory variable x, which in our case is term. The expression time ~ term tells R that our responses in time depend on term. The order is important here, so take care to put the column of the response variable y on the left-hand side of ~ and the column of the explanatory variable x on the right-hand side of ~.
Using data = timeDF, we tell R which data set to work with, which in our case is timeDF.
Note that calling lin gives only the coefficients of the linear regression. The code summary(lin) provides more information, such as the test statistics and p-values for the two-sided tests for the slope and intercept, R², the test statistic for the ANOVA F-test, etc.
Write down the fitted regression equation.
Denoting the intercept and slope parameters by β0 and β1 respectively, R carries out tests of the hypotheses H0 : β0 = 0 versus H1 : β0 ≠ 0 and H0 : β1 = 0 versus H1 : β1 ≠ 0. The results appear in the rows labelled (Intercept) (for β0) and term (for β1) in the R output.
Report the conclusions from the two hypothesis tests, of H0 : β0 = 0 versus H1 : β0 ≠ 0 and H0 : β1 = 0 versus H1 : β1 ≠ 0.
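If you prefer to pull the numbers out of the summary rather than read them off the printed output, the following sketch shows one way. It types the data in from the table in Part A so that it runs on its own; the names coefTable, b0, b1 and pvals are illustrative, and the 5% level in the last line is an assumed significance level, not one fixed by this sheet.

```r
# Data typed in from the table in Part A
term <- c(40, 50, 60, 45, 40, 10, 30, 20, 50, 30, 65, 40, 65, 65)
time <- c(9.9, 17.8, 18.4, 16.5, 11.9, 5.5, 11.0, 8.1, 15.1, 13.3,
          21.8, 13.8, 18.6, 19.8)

lin <- lm(time ~ term)

# Coefficient table: one row per parameter, with columns
# Estimate, Std. Error, t value and Pr(>|t|)
coefTable <- summary(lin)$coefficients
coefTable

# Estimates of the intercept (beta0) and slope (beta1)
b0 <- coefTable["(Intercept)", "Estimate"]
b1 <- coefTable["term", "Estimate"]
cat("Fitted line: y =", round(b0, 3), "+", round(b1, 3), "* x\n")

# Two-sided p-values for H0: beta0 = 0 and H0: beta1 = 0
pvals <- coefTable[, "Pr(>|t|)"]

# TRUE means H0 is rejected at the (assumed) 5% significance level
pvals < 0.05
```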
Give the estimated value of the error variance σ². Note: R outputs the estimated value of σ, which is called Residual standard error in the output.
Report and interpret the R² value. Note: We use the value Multiple R-squared. The value of Adjusted R-squared is not covered in this module; it will be covered in higher years of your studies.
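Both quantities can also be extracted from the summary object, as in this sketch. The data are typed in from the table in Part A, the variable names are illustrative, and the cross-check uses the usual estimate SSE / (n - 2) for simple linear regression.

```r
# Data typed in from the table in Part A
term <- c(40, 50, 60, 45, 40, 10, 30, 20, 50, 30, 65, 40, 65, 65)
time <- c(9.9, 17.8, 18.4, 16.5, 11.9, 5.5, 11.0, 8.1, 15.1, 13.3,
          21.8, 13.8, 18.6, 19.8)

lin <- lm(time ~ term)
s <- summary(lin)

# Residual standard error is the estimate of sigma; square it for sigma^2
sigmaHat <- s$sigma
sigma2Hat <- sigmaHat^2

# Cross-check: sum of squared residuals divided by the residual degrees of freedom (n - 2)
sse <- sum(residuals(lin)^2)
sigma2Check <- sse / df.residual(lin)

# Multiple R-squared: proportion of the variation in y explained by the fitted line
r2 <- s$r.squared

c(sigma2Hat = sigma2Hat, sigma2Check = sigma2Check, r2 = r2)
```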
3. Using the normality test for residuals, the histogram of residuals, and the normal probability plot of residuals, decide if the assumption of normally distributed errors appears to be justified here.
First we need to find the residuals by running the command residuals(lin).
Now use these residuals to perform the normality test and to construct the histogram and the normal probability plot, using the commands discussed in Tutorials 6 and 8.
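The three checks might look like the sketch below. It assumes that the normality test from the earlier tutorials is the Shapiro-Wilk test (shapiro.test) and that the normal probability plot is drawn with qqnorm and qqline; if Tutorials 6 and 8 used different commands, use those instead. The data are typed in from the table in Part A.

```r
# Data typed in from the table in Part A
term <- c(40, 50, 60, 45, 40, 10, 30, 20, 50, 30, 65, 40, 65, 65)
time <- c(9.9, 17.8, 18.4, 16.5, 11.9, 5.5, 11.0, 8.1, 15.1, 13.3,
          21.8, 13.8, 18.6, 19.8)

lin <- lm(time ~ term)
res <- residuals(lin)

# Normality test (assumed here to be Shapiro-Wilk); H0: the residuals are normally distributed
sw <- shapiro.test(res)
sw

# Histogram of the residuals
hist(res, xlab = "Residual", main = "Histogram of residuals")

# Normal probability plot with a reference line
qqnorm(res)
qqline(res)
```

With only 14 observations the histogram is coarse, so it is worth reading all three checks together rather than relying on one of them.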
4. Plot the fitted line.
First we use the command plot to plot the points, and then abline with reference to the linear regression model:
plot(term, time, col="blue", pch=19)
abline(lin, col="red")
5. Plot the residuals versus the fitted values.
First we use the command fitted.values with reference to the linear regression model, which calculates the fitted values for all observed x-values:
fit <- fitted.values(lin)
Now use the command plot to create the plot with the fitted values fit on the horizontal axis and the residuals (found earlier) on the vertical axis.
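A minimal sketch of this plot follows, with the data typed in from the table in Part A. The dashed horizontal line at zero and the axis labels are optional additions that are not required by the task; the line simply makes it easier to judge whether the residuals scatter evenly around zero.

```r
# Data typed in from the table in Part A
term <- c(40, 50, 60, 45, 40, 10, 30, 20, 50, 30, 65, 40, 65, 65)
time <- c(9.9, 17.8, 18.4, 16.5, 11.9, 5.5, 11.0, 8.1, 15.1, 13.3,
          21.8, 13.8, 18.6, 19.8)

lin <- lm(time ~ term)
fit <- fitted.values(lin)  # fitted value for each observed x
res <- residuals(lin)      # observed y minus fitted value

# Residuals against fitted values
plot(fit, res,
     xlab = "Fitted values", ylab = "Residuals",
     col = "blue", pch = 19)

# Reference line at zero (optional)
abline(h = 0, lty = 2)
```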
Discuss whether the simple linear regression model is appropriate here.
6. Compute 95% prediction and confidence intervals when the task is submitted to 50 and 70 terminals. Also compute 90% prediction intervals when the task is submitted to 50 and 70 terminals.
First we create a data frame with the x0-values, in our case 50 and 70, and call it pr (for example):
pr <- data.frame(term=c(50, 70))
Now we use the command predict to construct the prediction and confidence intervals in the following way:
predict(lin, pr, interval = c("prediction"), level = 0.95)
predict(lin, pr, interval = c("confidence"), level = 0.95)
The first parameter in the command predict tells R which model to use, which in our case we defined earlier as lin. The second parameter tells R which x0-values to use for prediction; we defined them as pr. We then specify whether we want prediction or confidence intervals by using interval = c("prediction") or interval = c("confidence"). Finally, we set the confidence level using the parameter level.
R outputs the column fit for the fitted values, and the columns lwr and upr for the lower and upper endpoints of the intervals.
Report the R results, and explain the different interpretations of the prediction and confidence intervals. Discuss the factors affecting the widths of the various intervals you have computed.
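The 90% prediction intervals only need a different value of level. The sketch below computes all three sets of intervals and puts their widths side by side, which may help with the discussion of what affects the widths. The data are typed in from the table in Part A; the helper width and the data frame widths are illustrative names, not part of the tutorial.

```r
# Data typed in from the table in Part A
term <- c(40, 50, 60, 45, 40, 10, 30, 20, 50, 30, 65, 40, 65, 65)
time <- c(9.9, 17.8, 18.4, 16.5, 11.9, 5.5, 11.0, 8.1, 15.1, 13.3,
          21.8, 13.8, 18.6, 19.8)

lin <- lm(time ~ term)
pr <- data.frame(term = c(50, 70))

# Each call returns a matrix with columns fit, lwr and upr
ci95 <- predict(lin, pr, interval = "confidence", level = 0.95)
pi95 <- predict(lin, pr, interval = "prediction", level = 0.95)
pi90 <- predict(lin, pr, interval = "prediction", level = 0.90)
pi90

# Hypothetical helper: width of each interval (upper minus lower endpoint)
width <- function(m) unname(m[, "upr"] - m[, "lwr"])

widths <- data.frame(term = pr$term,
                     ci95 = width(ci95),
                     pi95 = width(pi95),
                     pi90 = width(pi90))
widths

# Mean of the observed x-values, for judging how far each x0 lies from the centre of the data
mean(term)
```

Note that 70 lies outside the range of the observed numbers of terminals (10 to 65), which is worth keeping in mind when interpreting the intervals at that value.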
Part B
Eye melanoma is a type of cancer occurring in the eye. A clinician wants to find out if the size of a tumour depends on the age of the patient. He collected a random sample of 40 patients and recorded the total volume of the tumour (in mm³) and the age of the patient (in years). The data set is available on Canvas in the file Tutorial10 tumour.xlsx.
1. Find the fitted regression line to predict the volume of the tumour from the age of the patient.
2. Find the 90% confidence interval for the mean tumour volume for an 80-year-old patient. Find also the 90% prediction interval for the tumour volume of an 80-year-old patient. Give a practical interpretation of these intervals.
3. Does the tumour volume depend on the age of the patient? Use a formal statistical test to answer this question.
4. State all assumptions about the errors in simple linear regression. Decide whether the errors follow a normal distribution.
5. Decide whether the simple linear regression model seems appropriate here.