STAT 151 Lab #1 Instructions Describing and Displaying Data


This lab will help you learn how to summarize and display categorical and quantitative data in R. In particular, you will learn how to obtain frequency and contingency tables for categorical data and display the data with bar charts and pie charts. You will also learn how to obtain the appropriate measures of center and spread for quantitative data and display the data with histograms and boxplots. Finally, you will study how to display data over time with a time plot. Use this document as a reference while working on the Lab 1 assignment. In this file, R commands are shown in red (on lines beginning with the > prompt) and their outputs (except graphs) are shown in blue.

1. Summarizing and Displaying Categorical Data

Categorical variables, such as the gender (possible values: male, female) and smoker (possible values: smoker, non-smoker) variables described below, can be summarized by the counts (frequencies) or proportions (relative frequencies) of observations falling into each category (each distinct value of the categorical variable).

To demonstrate the graphical and numerical tools in R, we will use the Framingham Heart Study data file introduced in the Introductory Lab; however, we will add one more column: Smoker (column 4) to the Introlab-Data.txt data file. The new variable is defined below. For your convenience, we will also provide the definitions of the other three variables in the data file:

Column   Variable   Description of Variable
1        Gender     M-Male, F-Female
2        Age        30-64 years
3        Systolic   Systolic blood pressure (82-300 mmHg)
4        Smoker     0 if not a current smoker, 1 if current smoker

The extended data file is shown in the table below.

Gender   Age   Systolic   Smoker
F        59    170        1
M        35    130        0
M        46    136        0
F        43     96        0
M        53    120        0
M        50    110        0
M        33    100        0
M        57    145        1
F        41    132        0
F        40    112        0
M        54    140        0
M        53    148        1
F        53    165        1
M        49    100        0


As mentioned in the introductory lab, it is recommended that you save the Introlab-Data.txt data file in a folder called “Lab #1” and set the working directory to this folder. (Setting the working directory is explained in the introductory lab; you may also search the internet for instructions if you are not familiar with the procedure.) To import the data from Introlab-Data.txt into R and save it as Lab1.data, use Lab1.data <- read.table("Introlab-Data.txt", header=TRUE).

> Lab1.data <- read.table("Introlab-Data.txt", header=TRUE)
> Lab1.data
   Gender Age Systolic
1       F  59      170
2       M  35      130
3       M  46      136
4       F  43       96
5       M  53      120
6       M  50      110
7       M  33      100
8       M  57      145
9       F  41      132
10      F  40      112
11      M  54      140
12      M  53      148
13      F  53      165
14      M  49      100
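As a quick check that an import worked, it can help to inspect the structure of the result. The sketch below is only an illustration: it writes the first three rows of the data to a temporary file (a stand-in for Introlab-Data.txt; the names tmp_file and demo.data are hypothetical) and reads it back with read.table().

```r
# Write a small stand-in file so the sketch runs without Introlab-Data.txt.
tmp_file <- tempfile(fileext = ".txt")
writeLines(c("Gender Age Systolic",
             "F 59 170",
             "M 35 130",
             "M 46 136"), tmp_file)

# header = TRUE tells read.table() that the first line holds the column names.
demo.data <- read.table(tmp_file, header = TRUE)

str(demo.data)   # column names and types
dim(demo.data)   # number of rows and columns
```

Passing the full path to read.table(), as done here, is an alternative to changing the working directory.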

To add the entries in the last column (Smoker) to Lab1.data in R, use the following commands, in which the Smoker variable is defined first.

> Smoker = c(1, rep(0,6), 1, rep(0,3), rep(1,2),0)
>
> Lab1.data = cbind(Lab1.data, Smoker)
>
> Lab1.data
   Gender Age Systolic Smoker
1       F  59      170      1
2       M  35      130      0
3       M  46      136      0
4       F  43       96      0
5       M  53      120      0
6       M  50      110      0
7       M  33      100      0
8       M  57      145      1
9       F  41      132      0
10      F  40      112      0
11      M  54      140      0
12      M  53      148      1
13      F  53      165      1
14      M  49      100      0
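An alternative to cbind() is to assign the new column directly with the $ operator. The sketch below is a minimal illustration, not the lab's prescribed method; it rebuilds the first three columns inline so that it runs without the data file.

```r
# Rebuild the first three columns inline (same values as the table above).
Lab1.data <- data.frame(
  Gender   = c("F","M","M","F","M","M","M","M","F","F","M","M","F","M"),
  Age      = c(59,35,46,43,53,50,33,57,41,40,54,53,53,49),
  Systolic = c(170,130,136,96,120,110,100,145,132,112,140,148,165,100),
  stringsAsFactors = FALSE
)

# Assign the new column in place; no separate Smoker vector is created.
Lab1.data$Smoker <- c(1, rep(0, 6), 1, rep(0, 3), rep(1, 2), 0)

str(Lab1.data)
```

Because no standalone Smoker vector is created in the workspace, a later attach(Lab1.data) has nothing of that name to be masked by.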


The attach() function may be used to access the variables in a data set directly by name. The output below shows that the class of Gender is “character”. It also shows how to isolate observations: first the systolic blood pressures of females are extracted, and then those of females older than 45 years. The message that Smoker is masked by .GlobalEnv appears because a Smoker vector already exists in the workspace; R uses that copy, which here holds the same values.

> attach(Lab1.data)

The following object is masked _by_ .GlobalEnv:

Smoker

> Gender

[1] "F" "M" "M" "F" "M" "M" "M" "M" "F" "F" "M" "M" "F" "M"

>

> class(Gender)

[1] "character"

> Age

[1] 59 35 46 43 53 50 33 57 41 40 54 53 53 49

>

> Systolic

[1] 170 130 136  96 120 110 100 145 132 112 140 148 165 100

>

> Gender

[1] "F" "M" "M" "F" "M" "M" "M" "M" "F" "F" "M" "M" "F" "M"

>

> Gender =="F"

 [1]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE

[13]  TRUE FALSE

>

> Systolic[Gender =="F"]

[1] 170  96 132 112 165

>

> Systolic[Gender =="F" & Age >45]

[1] 170 165
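The same selections can be made without attach(). The following is a minimal sketch using with() and subset() from base R; the data frame is rebuilt inline so the block is self-contained, and the names sbp_female and sbp_female_over45 are hypothetical.

```r
Lab1.data <- data.frame(
  Gender   = c("F","M","M","F","M","M","M","M","F","F","M","M","F","M"),
  Age      = c(59,35,46,43,53,50,33,57,41,40,54,53,53,49),
  Systolic = c(170,130,136,96,120,110,100,145,132,112,140,148,165,100),
  stringsAsFactors = FALSE
)

# with() evaluates the expression using the columns of Lab1.data.
sbp_female        <- with(Lab1.data, Systolic[Gender == "F"])
sbp_female_over45 <- with(Lab1.data, Systolic[Gender == "F" & Age > 45])

sbp_female          # 170  96 132 112 165
sbp_female_over45   # 170 165

# subset() returns the matching rows rather than a single vector.
subset(Lab1.data, Gender == "F" & Age > 45)
```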

In versions of R before 4.0.0, read.table() converts character columns to factors by default, so the class of Gender is “factor”. In this case, you may use the as.character() function to obtain the elements of Gender as characters. In the following, the systolic blood pressures of females, and of females older than 45 years, are found.

> Age

[1] 59 35 46 43 53 50 33 57 41 40 54 53 53 49

>

> Systolic

[1] 170 130 136  96 120 110 100 145 132 112 140 148 165 100

>

> Gender

 [1] F M M F M M M M F F M M F M

Levels: F M

> as.character(Gender) =="F"

 [1]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE

[13]  TRUE FALSE

>

> Systolic[as.character(Gender) =="F"]

[1] 170  96 132 112 165

>

> Systolic[as.character(Gender) =="F" & Age >45]

[1] 170 165
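To see the difference between the two classes without an older R installation, a factor can be created by hand. This is a small sketch; the names Gender_chr and Gender_fac are hypothetical.

```r
Gender_chr <- c("F","M","M","F","M","M","M","M","F","F","M","M","F","M")

# factor() stores the distinct values as levels, as older read.table() did.
Gender_fac <- factor(Gender_chr)

class(Gender_chr)    # "character"
class(Gender_fac)    # "factor"
levels(Gender_fac)   # "F" "M"

# as.character() recovers the original strings from the factor.
identical(as.character(Gender_fac), Gender_chr)   # TRUE
```

read.table() also accepts a stringsAsFactors argument, so the behaviour can be set explicitly instead of relying on the default of the installed R version.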

The cut() function divides the range of a variable into intervals and codes each value according to the interval into which it falls. In the following example, five class intervals (bins) are defined for Systolic. Note that the bins are selected automatically by R and that the intervals are open on the left and closed on the right.

> Systolic
 [1] 170 130 136  96 120 110 100 145 132 112 140 148 165 100
> Bins = cut(Systolic,5)  # Divide Systolic into five class intervals
> Bins
 [1] (155,170]  (126,140]  (126,140]  (95.9,111] (111,126]  (95.9,111]
 [7] (95.9,111] (140,155]  (126,140]  (111,126]  (126,140]  (140,155]
[13] (155,170]  (95.9,111]
Levels: (95.9,111] (111,126] (126,140] (140,155] (155,170]
>
> data.frame(Systolic,Bins)
   Systolic       Bins
1       170  (155,170]
2       130  (126,140]
3       136  (126,140]
4        96 (95.9,111]
5       120  (111,126]
6       110 (95.9,111]
7       100 (95.9,111]
8       145  (140,155]
9       132  (126,140]
10      112  (111,126]
11      140  (126,140]
12      148  (140,155]
13      165  (155,170]
14      100 (95.9,111]
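Once the values have been coded into bins, the bins can be counted like any other categorical variable. The sketch below applies table(), which is introduced in part (a) below, to the result of cut(); the name bin_counts is hypothetical.

```r
Systolic <- c(170,130,136,96,120,110,100,145,132,112,140,148,165,100)

# Let R choose five equal-width intervals, as in the example above.
Bins <- cut(Systolic, 5)

# Count how many observations fall into each interval.
bin_counts <- table(Bins)
bin_counts

sum(bin_counts)   # 14, every observation falls into exactly one bin
```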

You may use the seq() function to define the starting point of the bin sequence and the interval width (bin width). In the following, five break points are generated with 96 as the starting point and 20 as the bin width; they define four class intervals. Note that the intervals are set to be closed on the left and open on the right by choosing “right = FALSE” in the command.

> bins = seq(96, by=20, length=5)
> bins
[1]  96 116 136 156 176
> Bins = as.character(cut(Systolic,bins, right=FALSE))
> data.frame(Systolic,  Bins)
   Systolic      Bins
1       170 [156,176)
2       130 [116,136)
3       136 [136,156)
4        96  [96,116)
5       120 [116,136)
6       110  [96,116)
7       100  [96,116)
8       145 [136,156)
9       132 [116,136)
10      112  [96,116)
11      140 [136,156)
12      148 [136,156)
13      165 [156,176)
14      100  [96,116)
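When breaks are supplied explicitly, n break points delimit n - 1 intervals, and any value outside the breaks is coded as NA. The following sketch checks both points for the break sequence used above; the name breaks is hypothetical, and length.out is the full name of the seq() argument abbreviated as length in the example.

```r
Systolic <- c(170,130,136,96,120,110,100,145,132,112,140,148,165,100)

breaks <- seq(96, by = 20, length.out = 5)   # 96 116 136 156 176
length(breaks) - 1                           # 4 intervals

# right = FALSE gives intervals closed on the left and open on the right.
Bins <- cut(Systolic, breaks, right = FALSE)

table(Bins)   # counts per interval
anyNA(Bins)   # FALSE, the breaks cover every observation
```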

(a) Summaries for Categorical Data: Frequency and Contingency Tables

Frequency and Relative Frequency Table: The table() function can be used to obtain a frequency table for a variable. Dividing the frequencies by the length of the variable gives the relative frequencies. The following example displays the frequencies and relative frequencies for Gender.

> table(Gender)       # Frequency table for "Gender"

Gender

F M

5 9

>

> table(Gender)/length(Gender)  #Relative frequency for "Gender"

Gender

F         M

0.3571429 0.6428571
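Base R also provides prop.table(), which divides each count by the table total. The sketch below is an equivalent way to obtain the relative frequencies shown above; the names freq and rel_freq are hypothetical.

```r
Gender <- c("F","M","M","F","M","M","M","M","F","F","M","M","F","M")

freq     <- table(Gender)     # counts: F 5, M 9
rel_freq <- prop.table(freq)  # counts divided by their total (14)

round(rel_freq, 3)            # F 0.357, M 0.643
```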

Use the data.frame() function to have a nicer frequency table. The following example shows the frequency tables obtained for Gender using the table() and data.frame() functions.

> table(Gender)

Gender

F M

5 9

>

> data.frame(table(Gender))

Gender Freq

1      F    5

2      M    9
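Since the result of data.frame(table(Gender)) is an ordinary data frame, further columns can be added to it. The following sketch appends relative frequencies and percentages; the column names RelFreq and Percent and the object name freq_table are hypothetical.

```r
Gender <- c("F","M","M","F","M","M","M","M","F","F","M","M","F","M")

freq_table <- data.frame(table(Gender))   # columns: Gender, Freq

# Add the relative frequency and the percentage of each category.
freq_table$RelFreq <- freq_table$Freq / sum(freq_table$Freq)
freq_table$Percent <- round(100 * freq_table$RelFreq, 1)

freq_table
```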

You may also install the “gmodels” package with install.packages("gmodels") and then use its CrossTable() function to obtain a frequency table. The following example shows the frequency table for Gender obtained using this function. Note that installing the package is not shown. Use help(CrossTable) to learn more about the CrossTable() function.

> library(gmodels)

>

> CrossTable(Gender)

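   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  14

          |         F |         M |
          |-----------|-----------|
          |         5 |         9 |
          |     0.357 |     0.643 |
          |-----------|-----------|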

The following example shows the frequency table for Gender among individuals with systolic blood pressure greater than 135, using the table(), data.frame(), and CrossTable() functions.

> table(Gender[Systolic >135])  #Frequency table for Gender with systolic > 135

F M

2 4

>

> table(Gender[Systolic >135])/length(Gender[Systolic >135]) #Relative Frequency table for Gender with systolic > 135

F         M

0.3333333 0.6666667

>

> data.frame(table(Gender[Systolic >135])) #Frequency table for Gender with systolic > 135

Var1 Freq

1    F     2

2    M    4

>

> CrossTable(Gender[Systolic >135])

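   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  6

          |         F |         M |
          |-----------|-----------|
          |         2 |         4 |
          |     0.333 |     0.667 |
          |-----------|-----------|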
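The heading of this part also mentions contingency tables, which cross-classify two categorical variables. As a minimal sketch using only base R (not necessarily the approach taken in the rest of the lab), table() accepts two variables; the name counts is hypothetical.

```r
Gender <- c("F","M","M","F","M","M","M","M","F","F","M","M","F","M")
Smoker <- c(1, rep(0, 6), 1, rep(0, 3), rep(1, 2), 0)

# Rows are the Gender categories, columns are the Smoker categories.
counts <- table(Gender, Smoker)
counts

addmargins(counts)               # add row and column totals
prop.table(counts, margin = 1)   # proportions within each gender (rows sum to 1)
```

For these data the table contains 3 female non-smokers, 2 female smokers, 7 male non-smokers, and 2 male smokers.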


