Midterm Exam
ECON-GA 4003, Fall 2023
Exam Rules
This exam is open book, open note, and open internet (including ChatGPT). However, the work submitted here must be your own, and you may not collaborate with other students in the exam or ask "new questions" to online communities. We expect you to acknowledge this and follow these rules. A failure to comply with these expectations will result in a 0 on the exam.
Please type, "I acknowledge the exam rules listed above and will not collaborate with my classmates or others to complete the exam" in the cell below to acknowledge that you read these instructions.
# Imports can go here
PSID
The Panel Study of Income Dynamics (https://psidonline.isr.umich.edu/GettingStarted.aspx) is a frequently used survey within economics to study intergenerational changes in income (and other variables). The structure of this survey is unique: in contrast to the CPS, which collects a random sample of individuals month-to-month, the PSID collects data about the same individuals year after year. Additionally, as new family members come into the family (for example, through birth or marriage), they enter the survey and are also tracked over time.
You have been given some relatively raw data and the codebooks to help you interpret that raw data. The data is structured such that all of a single person's data is included in a single row of the data.
This exam will walk you through the steps of cleaning this data and will then ask you to use this data to answer a number of questions.
Question 1 (5 points)
You have been given a file called J326246.csv. Read this file into a dataframe called psid_raw.
Note: You should also convert this data into the Pandas "nullable integer" type (https://pandas.pydata.org/docs/user_guide/integer_na.html). You can do this by writing something like psid_raw = psid_raw.astype("Int64")
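As a rough illustration of why the nullable type matters, the sketch below reads a tiny made-up CSV (a stand-in, not the exam file) that contains one missing value. The column codes and values are assumptions chosen only for the demonstration.

```python
import io

import pandas as pd

# Hypothetical stand-in for J326246.csv: two made-up rows, one with a gap.
toy_csv = io.StringIO("ER30000,ER30001,ER30002\n1,4,1\n1,,2\n")

psid_toy = pd.read_csv(toy_csv)
print(psid_toy.dtypes)  # ER30001 is read as float64 because of the gap

# The nullable integer type keeps whole numbers and stores the gap as <NA>.
psid_toy = psid_toy.astype("Int64")
print(psid_toy.dtypes)
```

With the default NumPy-backed integers, a single missing entry forces the whole column to float; "Int64" avoids that.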
In [ ]:
Question 2 (10 points)
As you look at the data included in this file, you should note that the column names are not particularly informative: they include things like ER30000, V1101, etc.
We have included two files that could help with this:
· J326246_codebook.pdf
· J326246_labels.txt
We ask that you replace the column names with more informative names built from the files above.
Bonus points: You will receive 10 extra points on the exam if you automate this process by matching the names.
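One possible way to automate the matching is sketched below. It assumes that each line of the labels file pairs a variable code with a quoted label; the actual layout of J326246_labels.txt may differ, and the sample lines, the slugify helper, and the resulting names are all hypothetical.

```python
import re

import pandas as pd

# ASSUMPTION: each line pairs a variable code with a quoted label.
# The real layout of J326246_labels.txt may differ.
labels_text = '''
ER30000   "RELEASE NUMBER"
ER30001   "1968 INTERVIEW NUMBER"
ER30002   "PERSON NUMBER 68"
'''


def slugify(label):
    """Lowercase a label and replace runs of other characters with underscores."""
    return re.sub(r"[^0-9a-z]+", "_", label.lower()).strip("_")


# Build {code: readable_name} from every line that matches the assumed pattern.
pattern = re.compile(r'^\s*(\w+)\s+"(.*)"\s*$')
name_map = {}
for line in labels_text.splitlines():
    match = pattern.match(line)
    if match:
        code, label = match.groups()
        name_map[code] = slugify(label)

toy = pd.DataFrame({"ER30000": [1], "ER30001": [4], "ER30002": [1]})
toy = toy.rename(columns=name_map)
print(list(toy.columns))
```

Codes that do not appear in the mapping are left unchanged by rename, so a partial match does not raise an error.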
In [ ]:
Question 3 (5 points)
In the section describing ER30001, the PSID codebook (J326246_codebook.pdf) says
Create a unique identifier for each individual in the data. Save this as a new column named id.
Also, you should exclude everyone who has a 1968 interview number greater than 3,000.
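The codebook passage is not reproduced here, so the rule used in this sketch is an assumption: it combines the 1968 interview number and the person number as interview number times 1,000 plus person number. The column names and rows are made up.

```python
import pandas as pd

# Made-up rows; the column names are hypothetical readable names.
toy = pd.DataFrame(
    {
        "interview_number_1968": [4, 4, 2999, 3001],
        "person_number_1968": [1, 2, 170, 1],
    },
    dtype="Int64",
)

# ASSUMPTION: interview number * 1000 + person number is unique per person,
# which holds as long as person numbers stay below 1000.
toy["id"] = toy["interview_number_1968"] * 1000 + toy["person_number_1968"]

# Keep only interview numbers of 3,000 or less.
toy = toy[toy["interview_number_1968"] <= 3000]
print(toy)
```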
In [ ]:
Question 4 (35 points)
One of the strengths of the PSID is the fact that it allows us to match children with their parents.
We will build the relationships between fathers and their children during this question.
Question 4.1
We will begin by finding all of the men who were between 35 and 45 (inclusive on both ends) and listed as the "Head of Household" in 1970.
Create a dataframe of these individuals and call it hoh_70. Make sure you keep the following columns:
· The 1968 interview number
· The 1968 person number
· The identifier column that we created called id
· Relationship to the head in 1970
· Age of the individual in 1970
· The labor income of the head in 1970
Exclude any household with a 1968 interview number greater than 3,000.
Hint: You should have 320 observations.
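The filtering step can be sketched with boolean masks on made-up data. The column names are hypothetical, and the coding used here (sex 1 for male, relationship 1 for head) is an assumption that should be checked against the codebook.

```python
import pandas as pd

# Made-up rows with hypothetical column names.
toy = pd.DataFrame(
    {
        "id": [4001, 4002, 5001, 6001],
        "sex": [1, 2, 1, 1],
        "relation_to_head_1970": [1, 2, 1, 3],
        "age_1970": [40, 38, 50, 36],
        "head_labor_income_1970": [9000, 9000, 12000, 7000],
    }
)

# ASSUMPTION: sex code 1 means male and relationship code 1 means head.
is_male_head = (toy["sex"] == 1) & (toy["relation_to_head_1970"] == 1)
in_age_range = toy["age_1970"].between(35, 45)  # inclusive on both ends by default

keep_cols = ["id", "relation_to_head_1970", "age_1970", "head_labor_income_1970"]
hoh_toy = toy.loc[is_male_head & in_age_range, keep_cols]
print(hoh_toy)
```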
In [ ]:
Question 4.2
Let's begin by finding all of the people who were children (or step-children) of the head in 1970 and who were between 35 and 45 (inclusive) in 1995. Create a dataframe of the children of the 1970 head of household and call it children_95. We will only want to keep columns including the following data:
· The 1968 interview number
· The 1968 person number
· The identifier column that we created called id
· Relationship to the head in 1995
· The 1968 ID of the father
· The person number of the father
· Age of the individual in 1995
· Sex of the individual
· Labor income of the head in 1994
· Labor income of the wife in 1994
You should exclude any child who still lives with the 1970 head of household.
Hint: There should be 1,101 of them.
Bonus points (3 points): How many of the children of the 1970 head of household still live with their parents?
Question 4.3
Create an identifier that can be used to match the children to their father. Call this column father_id.
Question 4.4
Create a dataframe that matches children with their fathers. Use the suffixes argument from pd.merge and the .rename method to wind up with the following columns:
· id_father: The identification number of the father
· income_father: The income of the father in 1970
· id_child: The identification number of the child
· is_head: Whether the child was the head of household or not
· sex_child: Whether the child was a male or female
· income_child: The child's income
Hint: Please remember that the child may or may not be the head of the household. For simplicity, you can assume that the head of the household in 1994 and 1995 was the same.
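To show how the suffixes argument and .rename interact, here is a minimal sketch on made-up father and child tables. All names and values are hypothetical, and the is_head column is left out for brevity.

```python
import pandas as pd

# Made-up fathers and children; all names and values are hypothetical.
fathers = pd.DataFrame({"id": [4001, 5001], "income": [9000, 12000]})
children = pd.DataFrame(
    {
        "id": [4003, 4004, 5003],
        "father_id": [4001, 4001, 5001],
        "sex": [1, 2, 1],
        "income": [30000, 22000, 41000],
    }
)

# Overlapping column names ("id", "income") receive the suffixes.
matched = pd.merge(
    children,
    fathers,
    left_on="father_id",
    right_on="id",
    how="inner",
    suffixes=("_child", "_father"),
)

# Non-overlapping columns get no suffix, so rename them by hand.
matched = matched.drop(columns="father_id").rename(columns={"sex": "sex_child"})
print(matched)
```

Only column names that appear in both tables are suffixed, which is why sex still needs an explicit rename.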
In [ ]:
Question 5 (15 points)
In this question, we will compute some simple statistics on the data that you've created so far.
If you were unable to finish question 4, we have created a csv file with the data called q4_answer.csv. You can use this as a starting point if you'd like.
Question 5.1
What is the 25th, 50th, and 75th percentile of income for a head of household that met our inclusion conditions in Question 4.1?
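For reference, several percentiles can be requested in one call. The incomes below are made up, and the series name is a hypothetical choice.

```python
import pandas as pd

# Made-up incomes for illustration.
incomes = pd.Series([4000, 6000, 8000, 10000, 12000], name="income_father")

# quantile accepts a list of probabilities and interpolates linearly by default.
quartiles = incomes.quantile([0.25, 0.50, 0.75])
print(quartiles)
```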
In [ ]:
Question 5.2
For this sub-question, consider only children of the heads of household in 1970 who were between 35 and 45 in the year 1995.
· What was the median income of male children?
· What percent of male children had non-zero income?
· What was the median income of female children?
· What percent of female children had non-zero income?
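The four numbers above can be produced with a single grouped aggregation. This is a sketch on made-up data; the string labels in sex_child are an assumption, since the raw data is likely to use numeric codes.

```python
import pandas as pd

# Made-up children; the "male"/"female" labels are hypothetical.
toy = pd.DataFrame(
    {
        "sex_child": ["male", "male", "male", "female", "female", "female"],
        "income_child": [0, 30000, 50000, 0, 0, 24000],
    }
)

# One row per group: the median, and the share of rows with non-zero income.
summary = toy.groupby("sex_child")["income_child"].agg(
    median_income="median",
    pct_nonzero=lambda s: 100 * (s != 0).mean(),
)
print(summary)
```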
In [ ]:
Question 5.3
Interpret the results from 5.2. Does this raise any other questions?
Question 5.4
What is the most children any one father has?
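Counting distinct children per father is one way to approach this. The pairs below are made up, and the column names follow the hypothetical layout from Question 4.4.

```python
import pandas as pd

# Made-up father-child pairs.
toy = pd.DataFrame(
    {
        "id_father": [4001, 4001, 4001, 5001],
        "id_child": [4003, 4004, 4005, 5003],
    }
)

# Count distinct children for each father, then take the largest count.
children_per_father = toy.groupby("id_father")["id_child"].nunique()
most_children = children_per_father.max()
print(most_children)
```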
Question 6 (20 points)
In this question, we will attempt to measure mobility by computing the transition rates between income quartiles.
Question 6.1
In [ ]:
To remove the potential effects of gender on income, we will restrict this analysis to sons only. Create a new dataframe called df6 from the question 4 dataset and make sure it only includes the sons.
Question 6.2
Create a new column called father_income_quartile and assign a value 1, 2, 3, or 4 based on whether the father's income was in the 1st, 2nd, 3rd, or 4th quartile of income.
Hint: Don't forget that people can have multiple children.
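One reading of the hint is that each father should be ranked once, no matter how many sons he has. The sketch below follows that reading on made-up data; whether this is the intended treatment is an assumption.

```python
import pandas as pd

# Made-up father-son pairs; father 4001 appears twice because he has two sons.
toy = pd.DataFrame(
    {
        "id_father": [4001, 4001, 5001, 6001, 7001],
        "income_father": [4000, 4000, 8000, 12000, 16000],
    }
)

# Rank each father once so that large families do not shift the cutoffs.
fathers = toy.drop_duplicates("id_father").set_index("id_father")
quartile_by_father = pd.qcut(fathers["income_father"], q=4, labels=[1, 2, 3, 4])

# Map each father's quartile back onto every one of his rows.
toy["father_income_quartile"] = toy["id_father"].map(quartile_by_father).astype(int)
print(toy)
```

Note that pd.qcut raises an error when tied incomes make two cutoffs equal; its duplicates parameter controls that case.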
Question 6.3
Create a column called son_income_quartile and assign a value 1, 2, 3, or 4 based on whether the son's income was in the 1st, 2nd, 3rd, or 4th quartile of income.
Question 6.4
Create a table that counts the number of times a son who had a father in the 1st/2nd/3rd/4th quartile ended up in the 1st/2nd/3rd/4th quartile of income.
The table should have the father's income quartile on the rows and the son's income quartile on the columns. It should look something like:
Son quartiles >   1st   2nd   3rd   4th
Father 1st         X     X     X     X
Father 2nd         X     X     X     X
Father 3rd         X     X     X     X
Father 4th         X     X     X     X
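A table of this shape can be built with pd.crosstab. The quartile assignments below are made up for illustration.

```python
import pandas as pd

# Made-up quartile assignments for six father-son pairs.
toy = pd.DataFrame(
    {
        "father_income_quartile": [1, 1, 2, 3, 4, 4],
        "son_income_quartile": [1, 2, 2, 4, 3, 4],
    }
)

# Rows are the father's quartile, columns are the son's quartile.
counts = pd.crosstab(toy["father_income_quartile"], toy["son_income_quartile"])
print(counts)
```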
Question 6.5
Convert the counts from 6.4 into probabilities.
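The sketch below assumes that the intended probabilities are conditional on the father's quartile, so that each row sums to one. The question does not state this, and the data is made up.

```python
import pandas as pd

# Made-up quartile assignments for six father-son pairs.
toy = pd.DataFrame(
    {
        "father_income_quartile": [1, 1, 2, 3, 4, 4],
        "son_income_quartile": [1, 2, 2, 4, 3, 4],
    }
)

# ASSUMPTION: normalize within each father quartile (row-wise).
probs = pd.crosstab(
    toy["father_income_quartile"],
    toy["son_income_quartile"],
    normalize="index",
)
print(probs)
```

Passing normalize="all" instead would give joint probabilities over the whole table.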
Question 6.6
What do you observe in the inter-generational transition between quartiles? Anything interesting?
In [ ]:
Question 7 (15 points)
Using df6, create a compelling scatter chart that puts the income of the father on the x-axis and the income of the son on the y-axis.
This graph should be high quality and will be judged on its aesthetic.
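As a starting point only, here is a bare-bones matplotlib scatter on made-up data. The column names, labels, and styling choices are assumptions, and a polished chart would need more work.

```python
import matplotlib

matplotlib.use("Agg")  # draw off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up father and son incomes.
toy = pd.DataFrame(
    {
        "income_father": [4000, 8000, 12000, 16000],
        "income_child": [20000, 35000, 30000, 60000],
    }
)

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(toy["income_father"], toy["income_child"], alpha=0.6, edgecolors="white")
ax.set_xlabel("Father's income (dollars)")
ax.set_ylabel("Son's income (dollars)")
ax.set_title("Father and son incomes (made-up data)")

# Remove the top and right borders for a cleaner look.
for side in ("top", "right"):
    ax.spines[side].set_visible(False)
fig.tight_layout()
```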
Question 8 (15 points)
Create another chart, similar to the one in question 7, that shows the relationship between the father's income and their child's income, but now you should find a way to highlight the difference between the male and female children.
This graph should also be high quality and will be judged on its aesthetic.
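One simple way to highlight the difference is to draw each group as its own series with a legend. This sketch uses made-up data and hypothetical string labels for sex_child; other designs (side-by-side panels, for instance) may work better.

```python
import matplotlib

matplotlib.use("Agg")  # draw off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up incomes; the "male"/"female" labels are hypothetical.
toy = pd.DataFrame(
    {
        "income_father": [4000, 8000, 12000, 16000, 6000, 14000],
        "income_child": [20000, 35000, 30000, 60000, 15000, 28000],
        "sex_child": ["male", "male", "male", "male", "female", "female"],
    }
)

fig, ax = plt.subplots(figsize=(6, 4))

# One scatter call per group so each gets its own color and legend entry.
for sex, group in toy.groupby("sex_child"):
    ax.scatter(group["income_father"], group["income_child"], alpha=0.7, label=sex)

ax.set_xlabel("Father's income (dollars)")
ax.set_ylabel("Child's income (dollars)")
ax.legend(title="Child")
fig.tight_layout()
```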