Lab 2: Data Cleaning/Preparation and Visualization
All rights reserved, Adam Chaffee, Michael Tsiang and Maria Cha, 2017-2024.
Do not post, share, or distribute anywhere or with anyone without explicit permission.
Objectives
Collaboration Policy
Intro to Logical Statements/Relational Operators
Try running the lines of code below that use the relational operators >, >=, <=, ==, !=:
Notice that the output is a logical vector (i.e., it contains only TRUE and FALSE values) with the same length as the vector on the left side of the relational operator.
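The original code lines for this step are not reproduced here, so the following is only a minimal sketch of the idea on a small made-up vector. The name `weights` and its values are assumptions for illustration and do not come from NCbirths.

```r
# Hypothetical weights in pounds (made-up values, not from NCbirths)
weights <- c(6.5, 8.2, 7.0, 9.1, 8.0)

weights > 8     # strictly greater than 8
weights >= 8    # greater than or equal to 8
weights <= 8    # less than or equal to 8
weights == 8    # equal to 8
weights != 8    # not equal to 8

# Each comparison returns one TRUE/FALSE per element of weights
length(weights > 8) == length(weights)
```

Each comparison is applied element by element, which is why the result always has as many entries as `weights`.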
Applications of logical statements: calculations
sum(NCbirths$weight > 8) #the number of babies that weighed more than 8 pounds
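The calculation above works because R treats TRUE as 1 and FALSE as 0 in arithmetic. A minimal sketch on a made-up vector (the name `weights` and its values are assumptions, not NCbirths data) shows that `sum()` counts the TRUE values and `mean()` gives their proportion:

```r
# Hypothetical weights in pounds (made-up values, not from NCbirths)
weights <- c(6.5, 8.2, 7.0, 9.1, 8.0)

# sum() of a logical vector counts how many elements are TRUE
n_heavy <- sum(weights > 8)

# mean() of a logical vector gives the proportion of TRUE values
prop_heavy <- mean(weights > 8)

n_heavy
prop_heavy
```

The same pattern applies to any column of a data frame, such as NCbirths$weight.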
Applications of logical statements: subsets
fem_weights<-NCbirths$weight[NCbirths$gender=="female"]
With the line above, we created a vector called fem_weights that contains the weights of all the female babies. We can combine multiple conditions using & (and) and | (or), but these will be discussed in future labs.
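As a brief preview only, here is a sketch of subsetting with one condition and with two conditions combined element by element. The data frame `births` is a made-up stand-in for NCbirths, and all of its values are assumptions for illustration.

```r
# Toy data frame standing in for NCbirths (all values are made up)
births <- data.frame(
  weight = c(6.5, 8.2, 7.0, 9.1, 8.0),
  gender = c("female", "male", "female", "female", "male")
)

# Single condition: weights of the female babies
fem_weights <- births$weight[births$gender == "female"]

# Both conditions must hold (element-wise AND)
heavy_fem <- births$weight[births$gender == "female" & births$weight > 8]

# At least one condition must hold (element-wise OR)
fem_or_heavy <- births$weight[births$gender == "female" | births$weight > 8]
```

The single-character forms are used here because they compare whole vectors element by element, which is what subsetting needs.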
Good coding practices
1. Use the pound symbol (#) often to comment on different code sections. Consider using them to label your exercise numbers and question parts, and to help describe what your code does.
2. Use good spacing. Adding a space between arguments and inside of functions makes your code easier to read. You can also skip lines for clarity.
## Create an object with the baby weights from NCbirths
## Create an object with the baby genders from NCbirths
## Create a logical vector to describe if the gender is female
## Create the vector of weights containing only females
fem_weights <- baby_weight[is_female]
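The commented steps above can be sketched in full as follows. This is an illustration under assumptions: `births` is a made-up stand-in for NCbirths, and the name `baby_gender` is hypothetical (only `baby_weight`, `is_female`, and `fem_weights` appear in the lab text).

```r
# Toy data frame standing in for NCbirths (all values are made up)
births <- data.frame(
  weight = c(6.5, 8.2, 7.0, 9.1, 8.0),
  gender = c("female", "male", "female", "female", "male")
)

## Create an object with the baby weights
baby_weight <- births$weight

## Create an object with the baby genders (baby_gender is an assumed name)
baby_gender <- births$gender

## Create a logical vector to describe if the gender is female
is_female <- baby_gender == "female"

## Create the vector of weights containing only females
fem_weights <- baby_weight[is_female]
```

Breaking the one-line version into named steps makes each intermediate object easy to inspect and comment on.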
Exercise 1
We will be working with data on college graduates obtained from the American Community Survey 2010-2012 Public Use Microdata Series. You can learn more about the data and a related analysis in The Economic Guide To Picking A College Major (https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/).
a. Download the data ‘recent-grads.csv’ from Bruinlearn and read it into R. When you read in the data, name your object “grads”. How many variables and observations does the data have?
Hint: Try dim(grads) to find the answer.
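To illustrate how the hint works without using the real file, the sketch below writes a tiny made-up CSV to a temporary location and reads it back. The object name `grads_demo`, the column names, and the values are all assumptions for illustration and are not the contents of ‘recent-grads.csv’.

```r
# Write a tiny stand-in CSV to a temporary file (made-up rows, not the real data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Major,Median", "A,50000", "B,42000", "C,61000"), tmp)

# Read it in the same way you would read a downloaded CSV file
grads_demo <- read.csv(tmp)

# dim() reports the number of rows (observations) first, then columns (variables)
dim(grads_demo)
nrow(grads_demo)
ncol(grads_demo)
```

For the exercise itself, the path passed to read.csv() would point to wherever you saved the downloaded file.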
b. The Bureau of Labor Statistics, U.S. Department of Labor, reports that the unemployment rate for college graduates was 4.0% in December 2012. What proportion of the majors had unemployment rates lower than 4.0%?
c. Create a Pareto chart for the ‘Major_category’ variable in the data. What are the three major categories with the highest frequencies?
d. Report the mean and standard deviation of the ‘Median’ earnings of the majors in the ‘Engineering’ major category.
e. Report the mean and standard deviation of the ‘Median’ earnings of all majors that are NOT in the ‘Engineering’ major category. Compare the values with those from part d.
Exercise 2
a. Construct a scatterplot of Life expectancy against GDP per capita. Note: Life expectancy should be on the vertical axis. How does GDP per capita appear to be associated with life expectancy?
b. Construct a boxplot and a histogram of ‘GDP per capita’. Describe the distribution based on shape, center, and variability. Are there any outliers in the boxplot?
d. Make a subset of the data for the year 2012, and name it ‘life2012’. Suppose that a country generally should have a GDP per capita greater than $12,000 to be considered a ‘developed’ nation in 2012. List the names of the countries (Entity) that are considered developed nations in 2012.
e. Find the average population of the ‘developed’ nations in 2012. Compare it with the population of the United States in 2012 from the data.
Exercise 3
a. Compute the summary statistics for lead and zinc using the summary() function.
b. Plot two histograms: one of lead and one of zinc. Describe the shapes of the two distributions.
The level of risk for surface soil based on lead concentration in ppm is given in the table below:
Use techniques similar to those from the last lab to give different colors and sizes to the lead concentrations at these 155 locations. You do not need to use the maps package to create a map of the area. Just plot the points without a map.
Exercise 4
The data for this exercise represent the approximate centers (given by longitude and latitude) of each of the City of Los Angeles neighborhoods. See also the Los Angeles Times project on the City of Los Angeles neighborhoods at: http://projects.latimes.com/mappingla/neighborhoods/. You can access these data at:
a. Plot the data point locations. Use good formatting for the axes and title. Then add the outline of LA County by typing:
map("county", "california", add=TRUE)
Note: Ignore the data points on the plot for which Schools = 0. Use what you learned about subsetting with logical statements to first create the objects you need for the scatter plot. Then, create the scatter plot. Alternate methods may only receive half credit.