Lab #1 Instructions
Describing and Displaying Data
This lab will help you learn how to summarize and display categorical and quantitative data in R. In particular, you will learn how to obtain frequency and contingency tables for categorical data and display the data with bar charts and pie charts. You will also learn how to obtain the appropriate measures of center and spread for quantitative data and display the data with histograms and boxplots. Finally, you will study how to display data over time with a time plot. Use this document as a reference while working on the Lab 1 assignment. In this file, R commands are shown in red and their outputs (except graphs) are shown in blue.
1. Summarizing and Displaying Categorical Data
Categorical variables such as Gender (possible values: male, female) and Smoker (possible values: smoker, non-smoker) can be summarized by giving the counts (frequencies) or proportions (relative frequencies) of observations falling into each category (each distinct value of the categorical variable).
To demonstrate the graphical and numerical tools in R, we will use the Framingham Heart Study data file introduced in the Introductory Lab; however, we will add one more column, Smoker (column 4), to the data in the Introlab-Data.txt data file. The new variable is defined below. For your convenience, the definitions of the other three variables in the data file are also provided:
| Column | Variable | Description of Variable |
|--------|----------|-------------------------|
| 1 | Gender | M-Male, F-Female |
| 2 | Age | 30-64 years |
| 3 | Systolic | Systolic blood pressure (82-300 mmHg) |
| 4 | Smoker | 0 if not a current smoker, 1 if current smoker |
The extended data set is shown in the table below:
| Gender | Age | Systolic | Smoker |
|--------|-----|----------|--------|
| F | 59 | 170 | 1 |
| M | 35 | 130 | 0 |
| M | 46 | 136 | 0 |
| F | 43 | 96 | 0 |
| M | 53 | 120 | 0 |
| M | 50 | 110 | 0 |
| M | 33 | 100 | 0 |
| M | 57 | 145 | 1 |
| F | 41 | 132 | 0 |
| F | 40 | 112 | 0 |
| M | 54 | 140 | 0 |
| M | 53 | 148 | 1 |
| F | 53 | 165 | 1 |
| M | 49 | 100 | 0 |
As mentioned in the introductory lab, we recommend saving the Introlab-Data.txt data file in a folder called “Lab #1” and setting the working directory to this folder. (Setting the working directory is explained in the introductory lab; an internet search will also show you how to do this if you are not familiar with it.) To import the data from Introlab-Data.txt into R and save it as Lab1.data, use Lab1.data <- read.table("Introlab-Data.txt", header=TRUE).
> Lab1.data <- read.table("Introlab-Data.txt", header=TRUE)
> Lab1.data
Gender Age Systolic
1 F 59 170
2 M 35 130
3 M 46 136
4 F 43 96
5 M 53 120
6 M 50 110
7 M 33 100
8 M 57 145
9 F 41 132
10 F 40 112
11 M 54 140
12 M 53 148
13 F 53 165
14 M 49 100
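The import step can also be tried without the file on disk. The following is a minimal sketch, assuming the file consists of a header row followed by the 14 whitespace-separated rows printed above; the names raw_text and Lab1.check are illustrative and are not part of the lab.

```r
# Inline copy of the three columns printed above (assumed file layout).
raw_text <- "Gender Age Systolic
F 59 170
M 35 130
M 46 136
F 43 96
M 53 120
M 50 110
M 33 100
M 57 145
F 41 132
F 40 112
M 54 140
M 53 148
F 53 165
M 49 100"

# read.table() accepts the data as a string through its text argument.
Lab1.check <- read.table(text = raw_text, header = TRUE)

dim(Lab1.check)   # number of rows and columns
str(Lab1.check)   # column names and types
```

Checking dim() and str() right after an import is a quick way to confirm that the header was recognized and that each column has the expected type.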
To add the entries of the last column (Smoker) to Lab1.data in R, use the following commands, in which the Smoker variable is defined first.
> Smoker = c(1, rep(0,6), 1, rep(0,3), rep(1,2),0)
>
> Lab1.data = cbind(Lab1.data, Smoker)
>
> Lab1.data
Gender Age Systolic Smoker
1 F 59 170 1
2 M 35 130 0
3 M 46 136 0
4 F 43 96 0
5 M 53 120 0
6 M 50 110 0
7 M 33 100 0
8 M 57 145 1
9 F 41 132 0
10 F 40 112 0
11 M 54 140 0
12 M 53 148 1
13 F 53 165 1
14 M 49 100 0
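As an alternative to cbind(), a column can be assigned directly with the $ operator. The sketch below rebuilds the three original columns by hand so that it runs on its own (the values are copied from the output above); it is one possible approach, not a replacement for the commands shown.

```r
# Rebuild the first three columns from the printed output above.
Lab1.data <- data.frame(
  Gender   = c("F","M","M","F","M","M","M","M","F","F","M","M","F","M"),
  Age      = c(59,35,46,43,53,50,33,57,41,40,54,53,53,49),
  Systolic = c(170,130,136,96,120,110,100,145,132,112,140,148,165,100)
)

# Assign the new column directly; no separate Smoker object is left
# in the workspace.
Lab1.data$Smoker <- c(1, rep(0, 6), 1, rep(0, 3), rep(1, 2), 0)

# The new column must have one entry per row.
stopifnot(length(Lab1.data$Smoker) == nrow(Lab1.data))

Lab1.data
```

Because this version does not create a standalone Smoker vector, a later attach(Lab1.data) would not report Smoker as masked by the global environment.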
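The attach() function may be used to access the variables in a data set directly by name. The message printed by attach() below indicates that the Smoker vector created earlier in the workspace takes precedence over the Smoker column of Lab1.data; the two are identical here. The output shows that the class of Gender is “character”. It also shows how to isolate observations: the systolic blood pressure is found first for females and then for females who are older than 45 years.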
> attach(Lab1.data)
The following object is masked _by_ .GlobalEnv:
Smoker
> Gender
[1] "F" "M" "M" "F" "M" "M" "M" "M" "F" "F" "M" "M" "F" "M"
>
> class(Gender)
[1] "character"
> Age
[1] 59 35 46 43 53 50 33 57 41 40 54 53 53 49
>
> Systolic
[1] 170 130 136 96 120 110 100 145 132 112 140 148 165 100
>
> Gender
[1] "F" "M" "M" "F" "M" "M" "M" "M" "F" "F" "M" "M" "F" "M"
>
> Gender =="F"
[1] TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
[13] TRUE FALSE
>
> Systolic[Gender =="F"]
[1] 170 96 132 112 165
>
> Systolic[Gender =="F" & Age >45]
[1] 170 165
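The same selections can be made without attach(). The sketch below uses with() and subset(), both from base R, on a hand-built copy of the data; the object names sys_female and sys_female_45 are illustrative.

```r
# Hand-built copy of the data set shown above.
Lab1.data <- data.frame(
  Gender   = c("F","M","M","F","M","M","M","M","F","F","M","M","F","M"),
  Age      = c(59,35,46,43,53,50,33,57,41,40,54,53,53,49),
  Systolic = c(170,130,136,96,120,110,100,145,132,112,140,148,165,100),
  Smoker   = c(1,0,0,0,0,0,0,1,0,0,0,1,1,0)
)

# with() evaluates the expression using the columns of Lab1.data.
sys_female <- with(Lab1.data, Systolic[Gender == "F"])
sys_female      # 170 96 132 112 165

# subset() filters rows by a condition and keeps the selected column.
sys_female_45 <- subset(Lab1.data, Gender == "F" & Age > 45)$Systolic
sys_female_45   # 170 165
```

Both forms leave the search path unchanged, so there is no masking message and no risk of picking up a workspace object with the same name as a column.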
In versions of R before 4.0.0, read.table() converted character columns to factors by default, so the class of Gender is “factor”. In this case, you may use the as.character() function to convert the elements of Gender to characters. In the following, the systolic blood pressure is found for females and then for females who are older than 45 years.
> Age
[1] 59 35 46 43 53 50 33 57 41 40 54 53 53 49
>
> Systolic
[1] 170 130 136 96 120 110 100 145 132 112 140 148 165 100
>
> Gender
[1] F M M F M M M M F F M M F M
Levels: F M
> as.character(Gender) =="F"
[1] TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
[13] TRUE FALSE
>
> Systolic[as.character(Gender) =="F"]
[1] 170 96 132 112 165
>
> Systolic[as.character(Gender) =="F" & Age >45]
[1] 170 165
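The class of Gender can be made independent of the R version by setting the stringsAsFactors argument of read.table() explicitly. The sketch below uses only the first four rows of the data, supplied inline, to keep it short; raw_text, as_factor and as_char are illustrative names.

```r
# First four rows of the data, supplied inline (shortened for illustration).
raw_text <- "Gender Age
F 59
M 35
M 46
F 43"

# Force the factor behaviour described above.
as_factor <- read.table(text = raw_text, header = TRUE, stringsAsFactors = TRUE)

# Force the character behaviour.
as_char <- read.table(text = raw_text, header = TRUE, stringsAsFactors = FALSE)

class(as_factor$Gender)   # "factor"
class(as_char$Gender)     # "character"

# After conversion, the factor version gives the same comparison result.
as.character(as_factor$Gender) == "F"
```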
The cut() function divides the range of a variable into intervals and codes each value of the variable according to the interval in which it falls. In the following example, five class intervals (bins) are defined for Systolic. Note that the bins are selected automatically by R and that, by default, the intervals are open on the left and closed on the right.
> Systolic
[1] 170 130 136 96 120 110 100 145 132 112 140 148 165 100
> Bins = cut(Systolic,5) # Divide Systolic into five class intervals
> Bins
[1] (155,170] (126,140] (126,140] (95.9,111] (111,126] (95.9,111]
[7] (95.9,111] (140,155] (126,140] (111,126] (126,140] (140,155]
[13] (155,170] (95.9,111]
Levels: (95.9,111] (111,126] (126,140] (140,155] (155,170]
>
> data.frame(Systolic,Bins)
Systolic Bins
1 170 (155,170]
2 130 (126,140]
3 136 (126,140]
4 96 (95.9,111]
5 120 (111,126]
6 110 (95.9,111]
7 100 (95.9,111]
8 145 (140,155]
9 132 (126,140]
10 112 (111,126]
11 140 (126,140]
12 148 (140,155]
13 165 (155,170]
14 100 (95.9,111]
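Once the values are coded into bins, the bins can be counted like any other categorical variable. A minimal sketch, assuming the same Systolic values as above; bin_counts is an illustrative name.

```r
Systolic <- c(170,130,136,96,120,110,100,145,132,112,140,148,165,100)

# Five equal-width bins chosen by R, open on the left, closed on the right.
Bins <- cut(Systolic, 5)

# Count the observations in each bin.
bin_counts <- table(Bins)
bin_counts

# Every observation falls in exactly one bin.
sum(bin_counts)   # 14
```

The lowest bin starts slightly below the minimum (95.9 rather than 96) because cut() extends the range a little so that the smallest value is not excluded by the open left endpoint.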
You may use the seq() function to define the starting point of the bin sequence and the interval width (bin width). In the following, five break points were created by choosing the starting point as 96 and the bin width as 20; these five break points define four class intervals. Note that the intervals were set to be closed on the left and open on the right by choosing “right = FALSE” in the command.
> bins = seq(96, by =20, length=5)
> bins
[1] 96 116 136 156 176
> Bins = as.character(cut(Systolic,bins, right=FALSE))
> data.frame(Systolic, Bins)
Systolic Bins
1 170 [156,176)
2 130 [116,136)
3 136 [136,156)
4 96 [96,116)
5 120 [116,136)
6 110 [96,116)
7 100 [96,116)
8 145 [136,156)
9 132 [116,136)
10 112 [96,116)
11 140 [136,156)
12 148 [136,156)
13 165 [156,176)
14 100 [96,116)
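The relationship between break points and intervals can be checked directly. The sketch below repeats the binning with the same break points, counts the observations per interval, and shows what happens to a value equal to the last break point; the value 176 is a hypothetical observation, not part of the data.

```r
Systolic <- c(170,130,136,96,120,110,100,145,132,112,140,148,165,100)

# Five break points starting at 96 with width 20.
bins <- seq(96, by = 20, length.out = 5)
bins

# Intervals closed on the left, open on the right.
Bins <- cut(Systolic, bins, right = FALSE)

nlevels(Bins)   # 4: five break points bound four intervals
table(Bins)

# A hypothetical value equal to the last break point falls outside
# [156,176) and is coded as NA.
outside <- cut(176, bins, right = FALSE)
is.na(outside)  # TRUE
```

If the largest break point can occur in the data, the include.lowest = TRUE argument of cut() closes the final interval so that such a value is kept.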
(a) Summaries for Categorical Data: Frequency and Contingency Tables
Frequency and Relative Frequency Table: The table() function can be used to obtain a frequency table for a variable. Dividing the frequencies by the length of the variable gives the relative frequencies. The following example displays the frequencies and relative frequencies for Gender.
> table(Gender) # Frequency table for "Gender"
Gender
F M
5 9
>
> table(Gender)/length(Gender) #Relative frequency for "Gender"
Gender
F M
0.3571429 0.6428571
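Base R also provides prop.table(), which divides each count in a table by the table total. The sketch below is an equivalent way to obtain the relative frequencies, with optional rounding; freq and rel_freq are illustrative names.

```r
Gender <- c("F","M","M","F","M","M","M","M","F","F","M","M","F","M")

freq <- table(Gender)          # counts: F 5, M 9

# Divide each count by the total number of observations.
rel_freq <- prop.table(freq)
rel_freq

# Rounded to three decimal places for reporting.
round(rel_freq, 3)

# The relative frequencies always add up to 1.
sum(rel_freq)
```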
Wrap the result in the data.frame() function to display the frequency table in a column layout. The following example shows the frequency tables obtained for Gender using the table() and data.frame() functions.
> table(Gender)
Gender
F M
5 9
>
> data.frame(table(Gender))
Gender Freq
1 F 5
2 M 9
Alternatively, install the “gmodels” package with install.packages("gmodels") and then use its CrossTable() function to obtain a frequency table. The following example shows the frequency table for Gender obtained using this function. Note that the installation of the package is not shown. Use help(CrossTable) to learn more about the CrossTable() function.
> library(gmodels)
>
> CrossTable(Gender)
Cell Contents
| N |
| N / Table Total |
Total Observations in Table: 14
| F | M |
| 5 | 9 |
| 0.357 | 0.643 |
The following example shows the frequency table for Gender restricted to the observations with systolic blood pressure greater than 135, using the table(), data.frame() and CrossTable() functions.
> table(Gender[Systolic >135]) #Frequency table for Gender with systolic > 135
F M
2 4
>
> table(Gender[Systolic >135])/length(Gender[Systolic >135]) #Relative Frequency table for Gender with systolic > 135
F M
0.3333333 0.6666667
>
> data.frame(table(Gender[Systolic >135])) #Frequency table for Gender with systolic > 135
Var1 Freq
1 F 2
2 M 4
>
> CrossTable(Gender[Systolic >135])
Cell Contents
| N |
| N / Table Total |
Total Observations in Table: 6
| F | M |
| 2 | 4 |
| 0.333 | 0.667 |
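In the data.frame() output above, the first column is labelled Var1 because the subsetted vector passed to table() carries no variable name. One possible remedy, sketched below, is to pass the vector to table() as a named argument; freq_high and freq_high_df are illustrative names.

```r
Gender   <- c("F","M","M","F","M","M","M","M","F","F","M","M","F","M")
Systolic <- c(170,130,136,96,120,110,100,145,132,112,140,148,165,100)

# Naming the argument labels the table dimension "Gender".
freq_high <- table(Gender = Gender[Systolic > 135])
freq_high

# The label carries over to the data frame column name.
freq_high_df <- data.frame(freq_high)
freq_high_df

# Relative frequencies within the subset.
round(prop.table(freq_high), 3)
```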