Hello, if you have any need, please feel free to consult us, this is my wechat: wx91due
7SSGN110 Environmental Data Analysis | Practical 3 | Hypothesis testing
1. Introduction
1.1. About this practical
This practical will help you develop your skills using R to analyse data using inferential statistics. It will also require you to manipulate data, building on what you did in Practical 1 and Practical 2. At various points in the instructions it will be assumed that you can look back to those practical instructions to understand what to do. The practical is composed of two sections:
1. Pebbles: Introduction to the basics of running inferential statistics in R & testing for differences between two groups.
2. Forests: Introduction to comparing 3 or more groups
1.2. Practical data
You can access the files for this practical, ‘Pebbles.csv’, ‘Scot_Forests.csv’ via KEATS.
1.3. Getting started with R
Before you start the data analysis In R, ensure that you have:
• Created a folder on your computer for this week’s practical
• Set the working directory for the practical work in R Studio
• Loaded the required packages. We will be using the packages tidyr, ggplot & car
• Read the Pebbles.csv’ and ‘Scot_Forests.csv’ files into R.
Hint: Refer to Practical 2, section 4.1 for guidance on how to do these things.
2. Testing for differences between two groups
2.1. Background
A coastal geomorphologist postulates that bedrock geology is one of the fundamental controls over the size of the material making up beaches. To test this, they wish to determine whether pebble size differs between beaches in sandstone vs. limestone regions. They take a sample of 50 pebbles from each beach and measure the length of the B-axis for each pebble, in millimetres. They wish to use a t-test to analyse whether there is a significant difference in the mean pebble size between the two beaches.
Q1: What would be the null and alternative hypothesis of the two-tailed t-test?
2.2. Descriptive statistics
Double click on the pebbles.data file shown in the workspace window (top right hand side of the window). This will open the pebbles.data dataframe in a new tab. You will see that there are two columns: ‘beach’ describes which beach the data is from and ‘pebblesize’ gives us all the measured pebble sizes.
Q2. What type of data formatting is being used - wide or long?
Let’s first calculate descriptive statistics for the data set. We can do this using the summary() function
summary(pebbles.data)
## beach pebblesize
## Length:100 Min. :15.00
## Class :character 1st Qu.:38.00
## Mode :character Median :44.00
## Mean :45.17
## 3rd Qu.:52.00
## Max. :70.00
You will notice that the summary statistics produced are for the pebble data set as a whole, and is not taking into account the different geologies.
To produce summary statistics using the summary function we need to transform our data into wide format. We can do this using the spread() function as part of the tidyr package.
pebbles.data$row <- 1:nrow(pebbles.data) # adds a column to the pebbles data so that spread will work properly.
pebbles.wide <- spread(pebbles.data, key = beach, value = pebblesize) # spreads the data from long to wide
summary(pebbles.wide) # produces the summary statistics for each column
## row Limestone Sandstone
## Min. : 1.00 Min. :15.00 Min. :24.00
## 1st Qu.: 25.75 1st Qu.:36.00 1st Qu.:43.00
## Median : 50.50 Median :40.50 Median :49.00
## Mean : 50.50 Mean :41.06 Mean :49.28
## 3rd Qu.: 75.25 3rd Qu.:47.50 3rd Qu.:57.75
## Max. :100.00 Max. :65.00 Max. :70.00
## NA's :50 NA's :50
Q3. What is the mean pebble size for the Limestone & Sandstone beach locations?
2.3. Plotting data
Before we conduct any statistical tests, it is important to graph our data to identify any patterns or trends. To do this, create boxplots and histograms of the data. We are going to use the package ggplot2 to do this.
We are first going to produce a boxplot of our data. ggplot2 uses data in long format, so we will need to use the original pebbles.data dataframe to do this.
Q4. Complete the following code to produce a boxplot of the pebbles data. You need to insert the correct variables in our data frame, so the boxplot shows the distribution of pebble size at each location. (Hint: refer to practical 2 if you get stuck)
plot1 <- ggplot(pebbles.data, aes(x = INSERT VARIABLE HERE, y = INSERT VARIABLE HERE))
plot1 + geom_boxplot() + ylab("Pebblesize (mm)")
Your completed boxplot should look like this:
Q5. Complete the following code to produce a histogram of the pebbles data. You need to insert the correct variable and an appropriate binwidth so the distribution of the pebble size for each location is shown. (Hint: refer to practical 2 if you get stuck)
plot2 <- ggplot(pebbles.data, aes(INSERT VARIABLE HERE))
plot2 + geom_histogram(binwidth = INSERT BINWIDTH HERE) + facet_grid(beach ~ .)
Your completed histogram may look something like this:
Q6. From looking at the boxplots, do you think there is any difference in pebble size between the sites?
Q7. From looking at the histograms do you think the data are normally distributed
2.3.1. Checking assumptions
Before we run any statistical tests we should also explore the data to check if the assumptions of the tests are valid. The primary assumption of the t-test is that the data being analysed are normally distributed.
We can first check for normality visually using QQ plots.
The base R functions qqnorm() and qqplot() can be used to produce these plots
qqnorm(): produces a normal QQ plot of the variable, qqline(): adds a ‘normal’ reference line
qqnorm(pebbles.wide$Limestone, pch = 1) # Vector of data to plot, plot symbol
qqline(pebbles.wide$Limestone, col = "steelblue") # Vector of data to add reference line to, colour of line
Q8. Produce a QQ plot for the limestone data. Based on the Q-Q plots, do you think the pebble size data is normally distributed?
We can now test for normality using statistical methods. In R the Shapiro-Wilk test for normality is run on vectors of data. We have already converted our data into wide format (pebbles.wide). However, an alternative approach is to use the subset() function to subset the data into individual vectors for each variable.
In the code below we used the subset function with an expression to tell R which data to retain for the new data set. The operator == means “is equal to” (note it is two = signs together). We also needed to ensure we put the value in the beach column between speech marks (“). So the first command above tells R to retain all rows of data in which the value in the ‘beach’ column is equal to ‘Sandstone’. Similarly select =”pebblesize" means retain only the pebblesize column.
sandstone <- subset(pebbles.data, beach == "Sandstone", select = "pebblesize")
We can check the format of the subsetted data by using the str() command. This command tells us about the structure of the data of interest - so how it is organized and the type of data displayed (character, factor, numeric, etc)
str(sandstone)
## 'data.frame': 50 obs. of 1 variable:
## $ pebblesize: int 52 57 43 62 45 65 61 45 41 45 ...
The output of str() tells us that the subsetted data is still in dataframe format. We can convert this into vector format by using the following code:
sandstone.vec <- sandstone$pebblesize
str(sandstone.vec)
## int [1:50] 52 57 43 62 45 65 61 45 41 45 ...
Q9: Repeat the code above to subset the limestone beach data.
To test for normality in R you can use the shapiro.test() function to run a Shapiro-Wilk test. The shapiro.test() function should be passed a numeric vector of data values (e.g. a column from your ‘wide’ data frames’) and can be used with functions like apply().
Before we start using this function, we need to state our null and alternative hypotheses and select the alpha value (the cut off point in which you can reject the null hypothesis).
Q10: State the null and alternative hypotheses and the alpha value for the Shapiro-Wilk test
We can run the Shapiro-Wilk test in three ways:
shapiro.test(pebbles.wide$Sandstone)
shapiro.test(sandstone.vec) # Using the vector we created through subsetting the data
apply(pebbles.wide[, 2:3], 2, shapiro.test) #Using the apply function to select wide columns to run the test on
The result of the Shapiro-Wilk test for sandstone is as follows:
shapiro.test(pebbles.wide$Sandstone)
##
## Shapiro-Wilk normality test
##
## data: pebbles.wide$Sandstone
## W = 0.98197, p-value = 0.6375
Q11: Do the results of the test for normality of Sandstone data indicate they are normally distributed?
Q12: Repeat the test for limestone. Do the results of the test for normality indicate they are normally distributed?
Next, we will test the assumption of homogeneity of variances using Levene’s test. The Levene’s test is a type of F-test and is more appropriate for samples with a larger number of observations. We are going to use the ‘car’ package to do this, so make sure it is installed and loaded into R Studio.
To conduct a Levene’s test, run the following code:
leveneTest(pebblesize ~ beach, data = pebbles.data, center = mean)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 1 1.1046 0.2959
## 98
Q13: Do the results of the Levene’s test indicate homogeneity of variances?
You can also run a ‘F’ test using the function var.test().
2.4. Conducting a t-test
We will assume that the data are normally distributed and that variances are homogenous. To test whether mean pebble sizes are statistically different between the two beaches, we will use an independent t-test.
To run a two-tailed T-test in R, you need to specify the two groups of data you would like to compare, and whether we are assuming equal variances. As we are assuming equal variances for our pebble data, we add argument var.equal= TRUE to our function. If we had unequal variances and therefore wanted to conduct a Welch’s t-test then we would add the argument var.equal= FALSE.
Before we run the test, we need to state our null and alternative hypothesis and select an alpha value. Do this now.
To run the t-test, use the following line of code:
t.test(pebbles.wide$Sandstone, pebbles.wide$Limestone, var.equal = TRUE)
##
## Two Sample t-test
##
## data: pebbles.wide$Sandstone and pebbles.wide$Limestone
## t = 4.1006, df = 98, p-value = 8.513e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4.241983 12.198017
## sample estimates:
## mean of x mean of y
## 49.28 41.06
Q14: What was the t statistic?
Q15: Which hypothesis would you accept (null or alternative) and why?
Q16: What can we conclude from these findings?
3. Working with 3 or more groups
3.1. Background
In this part of the practical, you will analyse ecological data from the Scottish Highlands to examine how different land uses have impacted ecosystem changes. The dataset we are using is from Warner et al (2021) who were interested in examining the differences in ecosystem function in native reforestation, unforested and mature forest plots in the Scottish Highlands. To do this the researchers measured a series of environmental variables in matched plots across 16 reforestation sites in the Scottish Highlands. Data was collected using a 10 × 10 m plot in each reforestation site. The same data was collected in a matched plot in an unforested, grazed area associated with each site and in a matched plot in an unforested area within the fenced site.
Warner et al. (2021) collected measurements from each plot for a range of variables. Today we are going to look at 4 of these:
1. Topsoil carbon. Carbon content estimate for topsoil in plot, calculated from % carbon and bulk density (in kg/m2)
2. Topsoil nitrogen. Nitrogen content estimate for topsoil in plot, calculated from % nitrogen and bulk density (in kg/m2)
3. Shrub height. Mean height of shrub layer (mean of shrub height measured across the three quadrats per plot(in cm)).
4. Moss cover. Mean depth of moss layer (mean of depth of moss layer measured across the three quadrats per plot(in cm)).
The aims of the project are:
· To assess the response of the ecosystem function of reforested sites in relation to the unforested state and the target state (mature forests)
· To assess how the ecosystem function differs in areas of differing grazing status
Before you start, think about how forest and grazing status may affect the variables above. What might be your provisional hypotheses for these data? What statistical tests could we use to address these hypotheses?
3.2. Import the data into R
Before we decide on the type of data analysis to undertake, we first need to import the data into R Studio. First, create a new script in R. As we are now working with a different data set this makes sense. To do this go to File>New>Script.
In your new script, enter code to set the working directory. Then load the data into R Studio:
forest.data <- read.csv("Scot_Forests.csv", header = T)
We first want to look at the dataset to see the data are organised and if we need to do any dataframe manipulation before we start our analysis. Look the at data structure and summary:
head(forest.data)
## Name Forest Grazing Carbon Nitrogen Shrub.Height Moss
## 1 Athnamulloch 1 Reforestation Ungrazed 13.71 0.56 16.13 9.93
## 2 Athnamulloch 1 Unforested Grazed 26.93 0.91 15.47 4.00
## 3 Athnamulloch 1 Unforested Ungrazed 8.04 0.30 20.33 8.40
## 4 Athnamulloch 2 Reforestation Ungrazed 27.47 1.70 27.00 3.87
## 5 Athnamulloch 2 Unforested Grazed 16.70 0.47 13.53 6.60
## 6 Athnamulloch 2 Unforested Ungrazed 42.47 1.86 28.27 3.40
summary(forest.data)
## Name Forest Grazing Carbon
## Length:52 Length:52 Length:52 Min. : 5.67
## Class :character Class :character Class :character 1st Qu.:14.42
## Mode :character Mode :character Mode :character Median :27.20
## Mean :26.87
## 3rd Qu.:38.02
## Max. :52.06
## Nitrogen Shrub.Height Moss
## Min. :0.2100 Min. : 1.07 Min. : 0.270
## 1st Qu.:0.5975 1st Qu.:16.13 1st Qu.: 2.635
## Median :1.0000 Median :24.43 Median : 4.670
## Mean :1.0642 Mean :24.15 Mean : 5.404
## 3rd Qu.:1.4125 3rd Qu.:30.53 3rd Qu.: 7.567
## Max. :2.6900 Max. :55.13 Max. :16.330
You will see that the data are in wide format and there are many variables.
We are first interested in how the environmental variables differ by forest status. At this stage the ‘Name’ and ’Grazing grouping variables are rather superfluous. We can remove these by subsetting the dataframe.
In the code below, we are telling R to drop variables x and z. The ‘-’ sign indicates dropping variables
forest.sub<-subset(forest.data, select = -c(Name,Grazing))
head(forest.sub)
## Forest Carbon Nitrogen Shrub.Height Moss
## 1 Reforestation 13.71 0.56 16.13 9.93
## 2 Unforested 26.93 0.91 15.47 4.00
## 3 Unforested 8.04 0.30 20.33 8.40
## 4 Reforestation 27.47 1.70 27.00 3.87
## 5 Unforested 16.70 0.47 13.53 6.60
## 6 Unforested 42.47 1.86 28.27 3.40
We now have a dataframe that includes forest status information for each plot and associated measurements of topsoil carbon, topsoil nitrogen, shrub height and moss cover.
3.3. Testing for differences between 3 or more groups
We are interested in how the ecological variables differ in forested, unforested and mature forested plots. On this basis we are working with 3 grouping variables (forest status). On this basis, we will need to use statistical approaches that allow for 3 or more groups to be compared - ANOVA or Kruskall-Wallis.
We first want to determine if there is a difference in topsoil carbon content between plots of differing forest status. Because we are only interested in the carbon content for the moment, we can also drop the other variables so we have a dataframe that just contains the forest status and topsoil carbon content data:
forest.carbon<-subset(forest.sub, select = c(Forest,Carbon))
head(forest.carbon)
## Forest Carbon
## 1 Reforestation 13.71
## 2 Unforested 26.93
## 3 Unforested 8.04
## 4 Reforestation 27.47
## 5 Unforested 16.70
## 6 Unforested 42.47
Because we have dropped so many variables. We can put this data back into wide format using the spread function:
forest.carbon$row <- 1:nrow(forest.carbon) # adds a column to the forest so that spread will work properly.
forest.carbon.wide <- spread(forest.carbon, key = Forest, value = Carbon) # spreads the data from long to wide
summary(forest.carbon.wide) # produces the summary statistics for each column
## row Mature Reforestation Unforested
## Min. : 1.00 Min. :12.74 Min. : 5.67 Min. : 8.04
## 1st Qu.:13.75 1st Qu.:33.47 1st Qu.:10.64 1st Qu.:17.50
## Median :26.50 Median :36.15 Median :16.64 Median :26.54
## Mean :26.50 Mean :34.64 Mean :20.87 Mean :27.10
## 3rd Qu.:39.25 3rd Qu.:39.75 3rd Qu.:27.56 3rd Qu.:36.47
## Max. :52.00 Max. :52.06 Max. :51.68 Max. :49.48
## NA's :42 NA's :38 NA's :24
Before we decide on which statistical test to use, we first need to check the assumptions of the data
Q17: Adapting the code from pebble size data analysis undertaken earlier, determine whether the soil carbon content is normally distributed
Hint: If you chose to plot the data graphically using ggplot2, the data should be in long format.
We then need to check for homogeneity of variances:
leveneTest(Carbon ~ Forest, data = forest.carbon, center = mean)
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 2 0.6855 0.5086
## 49
The results of this analysis indicate that the groups have equal variances and are normally distributed. On this basis, we can use an ANOVA test to compare the means of each group.
Q18: What will the null and alternative hypothesis be for this test?
The R function aov() is used to run a one-way ANOVA. We can summarize the results of this using the summary.aov() function:
carbon.aov <- aov(Carbon ~ Forest, data = forest.carbon)
summary(carbon.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Forest 2 1108 554.0 3.403 0.0413 *
## Residuals 49 7978 162.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Q19: What was the F statistic?
Q20: Which hypothesis would you accept (null or alternative) and why?
Q21: What could we suggest about how topsoil carbon varies by forest status?
Q22: Run the above analyses again for nitrogen content, shrub cover and moss cover.
You may find that your data does not meet the assumptions needed to run ANOVA. In this case you may need to use the non-parametric equivalent Kruskall Wallis test. Example code for this is:
kruskal.test(MEASUREMENT VARIABLE ~ GROUPING VARIABLE, data = DATAFRAME)
You may also find that the results of ANOVA and Kruskall Wallis indicate there are differences between groups. In this case, you would need to undertake post-hoc pairwise comparison of groups. For data that meets the assumptions on ANOVA you can use Tukey’s HSD using the function TukeyHSD(), and for non-parametric data you can use the function pairwise.wilcox.test().
Following the analyses above, what can you conclude about how the ecosystem function of reforested sites differs in relation to the unforested state and the target state (mature forests)
3.4. Analysing grazing status
The second aim of this study was to assess how the ecosystem function differs in areas of differing grazing status. The plots used in this study have also been grouped into ‘grazed’ and ‘ungrazed’. Therefore, we can look at how the environmental variables differ between these two groups and make inferences about why this may be.
You can manipulate the code from the preceding sections of the practical to undertake this analysis. The steps you will need to take generally are as follows:
· Manipulate your dataset so it contains the appropriate data and is in the correct format for analysis
· Check the data visually (e.g. using boxplots, histograms, etc.)
· Check assumptions for tests (e.g., normality, equality of variance)
· Decide what tests to run (e.g., ANOVA, Kruskal-Wallis, t-test, Wilcoxon-Mann-Whitney)
· Set your null and alternative hypothesis based on your chosen test
· Run the test on
· Draw conclusions about the influence of grazing status on
Hint: You may come across data that does not meet the assumptions for a t-test. In this case, you can use a Wilcox test (wilcox.test()). If two samples (vectors) are supplied this is equivalent to the Mann-Whitney U test
Q23: What inferences can you make about grazing status and ecosystem responses