Stats 10: Introduction to Statistical Reasoning




Lab 2: Data Cleaning/Preparation and Visualization

Fall 2024

All rights reserved, Adam Chaffee, Michael Tsiang and Maria Cha, 2017-2024.

Do not post, share, or distribute anywhere or with anyone without explicit permission.

Some exercises based on labs by Nicolas Christou.

Objectives

1. Understand logical statements and subsetting.
2. Reinforce knowledge on visualization techniques.

Collaboration Policy

In lab you are encouraged to work in pairs or small groups to discuss the concepts on the assignments. However, DO NOT copy each other’s work, as this constitutes cheating. The work you submit must be entirely your own. If you get stuck in lab, feel free to reach out to other groups or talk to your TA.

Intro Logical Statements/Relational Operators

Logical Expressions: Type ?Comparison to see the R documentation on the list of all relational operators you can apply. Many logical expressions in R use these relational operators.

Try running the lines of code below that use the relational operators >, >=, <=, ==, !=:

4 > 3 # Is 4 greater than 3?
c(3, 8) >= 3 # Is each of 3 and 8 greater than or equal to 3?
c(3, 8) <= 3 # Is each of 3 and 8 less than or equal to 3?
c(1, 4, 9) == 9 # Is each of 1, 4, and 9 exactly equal to 9?
c(1, 4, 9) != 9 # Is each of 1, 4, and 9 not (exactly) equal to 9?

Notice that the output is a logical vector (i.e., it contains only TRUE and FALSE values) with the same length as the vector on the left of the relational operator. Each element is compared separately.

Applications of logical statements: calculations

We can perform certain calculations on logical vectors because R reads TRUE as 1 and FALSE as 0. Create the NCbirths object from last lab and try these examples:

sum(NCbirths$weight > 8) #the number of babies that weighed more than 8 pounds

mean(NCbirths$weight > 8) #the proportion of babies that weighed more than 8 pounds
mean(NCbirths$gender == "female") #the proportion of female babies

mean(NCbirths$gender != "male") #gives the proportion of babies not assigned male
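
If NCbirths is not loaded, the same idea can be tried on a small made-up vector. This is only a sketch: toy_weights and its values are hypothetical and are not part of the lab data.

```r
# Hypothetical toy data: five baby weights in pounds (not from NCbirths)
toy_weights <- c(6.5, 8.2, 7.1, 9.0, 8.0)

# The comparison returns one TRUE/FALSE per element
# (8.0 is not strictly greater than 8, so it gives FALSE)
heavy <- toy_weights > 8
heavy

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUEs
n_heavy <- sum(heavy)      # 2

# mean() of a logical vector is the proportion of TRUEs
prop_heavy <- mean(heavy)  # 0.4
```

Storing the logical vector in its own object (heavy) before summarizing makes it easy to check that the condition does what you expect.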

Applications of logical statements: subsets

We can combine logical statements with square brackets to subset data based on conditions.
Examples with NCbirths:

fem_weights<-NCbirths$weight[NCbirths$gender=="female"]

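With the line above we created a vector called fem_weights that contains the weights of all the female babies. We can combine multiple conditions element by element using & (and) and | (or), but these will be discussed in future labs. Note that the doubled form && compares single values only, so it is not the right choice for subsetting a vector.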
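
The same square-bracket pattern works on whole data frames. The sketch below uses a made-up data frame; toy and its values are hypothetical, chosen only to illustrate the syntax.

```r
# Hypothetical toy data frame (names and values are made up for illustration)
toy <- data.frame(
  gender = c("female", "male", "female", "male"),
  weight = c(7.2, 8.5, 6.9, 7.8)
)

# Subset a vector: keep the weights where the condition is TRUE
toy_fem_weights <- toy$weight[toy$gender == "female"]

# Subset a data frame: the condition goes before the comma to select rows;
# leaving the column position empty keeps every column
toy_fem_rows <- toy[toy$gender == "female", ]
```

The first form returns a plain vector, while the second returns a smaller data frame, which is handy when several columns are needed for the same subset of rows.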

Good coding practices

Please consider implementing the following in your code:

1. Use the pound symbol (#) often to comment on different code sections. Consider using them to label your exercise numbers and question parts, and to help describe what your code does.

2. Use good spacing. Adding a space between arguments and inside of functions makes your code easier to read. You can also skip lines for clarity.

3. Create as many objects as you like to make it easier to follow. For example, consider my line above creating the fem_weights object. An alternative way to code this using best practices is below:

## Create an object with the baby weights from NCbirths

baby_weight <- NCbirths$weight

## Create an object with the baby genders from NCbirths

baby_gender <- NCbirths$gender

## Create a logical vector to describe if the gender is female

is_female <- baby_gender == "female"

## Create the vector of weights containing only females

fem_weights <- baby_weight[is_female]

Exercise 1

We will be working with data on college graduates obtained from the American Community Survey 2010-2012 Public Use Microdata Series. You can learn more about the data and its analysis in the FiveThirtyEight article "The Economic Guide To Picking A College Major" (https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/).

a. Download the data ‘recent-grads.csv’ from Bruinlearn and read it into R. When you read in the data, name your object “grads”. How many variables and observations does the data have?

Hint: Try dim(grads) to find the answer.

b. The Bureau of Labor Statistics, U.S. Department of Labor, reports that the unemployment rate for college graduates was 4.0% in December 2012. What proportion of the majors had unemployment rates lower than 4.0%?

c. Create a Pareto chart for the ‘Major_category’ variable in the data. What are the three major categories with the highest frequencies?

d. Report the mean and standard deviation of the ‘Median’ earnings of the majors in the ‘Engineering’ major category.

e. Report the mean and standard deviation of the ‘Median’ earnings of all majors that are NOT in the ‘Engineering’ major category. Compare the values with part (d).

f. Create a box plot of the ‘Median’ earnings of all observations in the data, with a good title.

g. Based on what you see in part (f), describe the shape of the distribution. Does the mean seem to be a good measure of center for the data? Report a more useful statistic for this data.
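
For part (c), a Pareto chart is taken here to mean a bar chart whose bars are ordered from most to least frequent. This is an assumption about the course's usage, since some texts also add a cumulative-percentage line. A minimal sketch on a made-up categorical vector (toy_category is hypothetical and not taken from recent-grads.csv):

```r
# Hypothetical categorical data (not from recent-grads.csv)
toy_category <- c("Arts", "Business", "Arts", "Health", "Arts", "Business")

# Count each category, then sort the counts from largest to smallest
toy_counts <- sort(table(toy_category), decreasing = TRUE)

# Draw the sorted bars; las = 2 rotates the labels so long names fit
barplot(toy_counts, main = "Toy Pareto chart", ylab = "Frequency", las = 2)
```

Sorting the table before plotting is what turns an ordinary bar chart into the ordered version, and the first few names of the sorted table are the most frequent categories.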

Exercise 2

The data ‘life_expectancy.csv’ presents life expectancies at birth (Life) and GDP per capita for 166 countries between 1760 and 2018. The source of these data is the United Nations and the Department of Economic and Social Affairs.

a. Construct a scatterplot of Life expectancy against GDP per capita. Note: Life expectancy should be on the vertical axis. How does GDP per capita appear to be associated with life expectancy?

b. Construct the boxplot and histogram of ‘GDP per capita.’ Describe the distribution based on shape, center and variability. Are there any outliers found in the boxplot?

c. Report the center (typical value) of the ‘GDP per capita’ variable. Use an appropriate measure of center for this distribution.

d. Make a subset of the data for the year 2012, and name it ‘life2012.’ Suppose that a country generally should have a GDP per capita greater than $12,000 to be considered a ‘developed’ nation in 2012. List the names of the countries (Entity) that are considered developed nations in 2012.

e. Find the average population of the ‘developed’ nations in 2012. Compare it with the population of the United States in 2012 from the data.

f. Plot Life expectancy against GDP per capita of the developed nations in 2012. Also, compute the correlation coefficient. Describe the association of the two variables. Hint: use the function cor().

Exercise 3

Use R to access the Maas river data. These data contain the concentration of lead and zinc in ppm at 155 locations at the banks of the Maas river in the Netherlands. You can read the data in R as follows:
maas <-
read.table("http://www.stat.ucla.edu/~nchristo/statistics12/soil.txt",
header = TRUE)

a. Compute the summary statistics for lead and zinc using the summary() function.

b. Plot two histograms: one of lead and one of zinc. Describe the shapes of the two distributions.

c. Plot two histograms: one of log(lead) and one of log(zinc). How are they different from the results in (b)?

d. Plot log(lead) against log(zinc) and compute the correlation coefficient. Describe the association of the two variables.

e. According to CDC guidelines, lead-contaminated soil can pose a risk through direct ingestion, uptake in vegetable gardens, or tracking into homes. Uncontaminated soil contains lead concentrations less than 50 parts per million (ppm), but soil lead levels in many urban areas exceed 200 ppm [AAP 1993]. The EPA’s standard for lead in bare soil is 400 ppm by weight in play areas and 1200 ppm in non-play areas [EPA 2000a].

The level of risk for surface soil based on lead concentration in ppm is given on the table below:

Mean concentration (ppm)    Level of risk
Below 120                   Lead-free
Between 120-400             Lead-safe
Above 400                   Significant environmental lead hazard

Use techniques similar to last lab to give different colors and sizes to the lead concentration at these 155 locations. You do not need to use the maps package to create a map of the area. Just plot the points without a map.
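
One possible way to turn the three risk levels into plot colors and sizes is to start every point at the lowest level and then overwrite the entries that pass each threshold, using logical subsetting. This is only a sketch on made-up values: toy_lead, toy_x, toy_y, and the chosen colors and sizes are all hypothetical, and last lab's approach may differ.

```r
# Hypothetical lead concentrations (ppm) and made-up coordinates
toy_lead <- c(45, 150, 410, 95, 300, 520)
toy_x <- c(1, 2, 3, 4, 5, 6)
toy_y <- c(2, 1, 3, 2, 4, 3)

# Start with every location at the lowest risk level (below 120 ppm)
toy_col <- rep("green", length(toy_lead))
toy_cex <- rep(0.8, length(toy_lead))

# Overwrite locations at 120 ppm or more (lead-safe range)
toy_col[toy_lead >= 120] <- "orange"
toy_cex[toy_lead >= 120] <- 1.2

# Overwrite again for locations above 400 ppm (significant hazard)
toy_col[toy_lead > 400] <- "red"
toy_cex[toy_lead > 400] <- 1.8

# One color and one size per point, matched by position
plot(toy_x, toy_y, col = toy_col, cex = toy_cex, pch = 19,
     main = "Toy lead risk levels", xlab = "x", ylab = "y")
```

Because each later assignment only touches the points that pass a higher threshold, the middle range ends up covering 120 to 400 ppm without needing to combine two conditions.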

Exercise 4

The data for this exercise represent approximately the centers (given by longitude and latitude) of each one of the City of Los Angeles neighborhoods. See also the Los Angeles Times project on the City of Los Angeles neighborhoods at: http://projects.latimes.com/mappingla/neighborhoods/. You can access these data at:

LA <-
read.table("http://www.stat.ucla.edu/~nchristo/statistics12/la_data.txt",
header = TRUE)

a. Plot the data point locations. Use good formatting for the axes and title. Then add the outline of LA County by typing: 

library(maps) # provides the map() function
map("county", "california", add = TRUE)

b. Construct a scatterplot of school performance against income. Discuss their relationship.

Note: Ignore the data points on the plot for which Schools = 0. Use what you learned about subsetting with logical statements to first create the objects you need for the scatter plot. Then, create the scatter plot. Alternate methods may only receive half credit.

c. Construct a scatterplot of school performance against diversity. Discuss their relationship.
