ECON 513: Practice of Econometrics
Assignment 2
Due on February 26, 2024
Question 1 [40 points]
For this exercise, you will use the data set “cps13_construction.dta” provided with this assignment. This data set contains the records of 4,577 construction workers. Specifically, for each individual you observe gender (male = 1 for males; male = 0 for females), age, education (highest grade completed), hourly wage (in dollars), and an indicator for union membership.
Using Stata, answer the following questions:
1. [10 points]
What are the sample mean and median of age? Consider the sub-samples of male and female workers. Using the Stata command ttest with the option unequal, test whether average age is different between these two sub-samples. Do you reject the null hypothesis that average age is the same between men and women? Briefly explain the tests presented at the bottom of the Stata output after running the command ttest. Now, compare the age distribution for males and females using a histogram.
2. [10 points]
Run a regression of hourly wage on a linear function of age. Now run a regression of hourly wage on a quadratic function of age. Finally, run a regression of hourly wage on a cubic function of age. Which of these three models fits the data best? Does the fit improve as you include higher-degree polynomials of age?
Use the last regression with a cubic function of age and the command rvfplot, yline(0) to visually check whether the assumption of homoskedasticity is violated. Explain what this graph does and what you make of it. Can you replicate the exact same graph without using the command rvfplot, yline(0)? (Hint: you will need to generate the fitted values and the residuals from the regression)
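As a rough illustration of what the hint asks for, the sketch below computes fitted values and residuals by hand and pairs them up, which is the information a residual-versus-fitted plot displays. It is written in Python rather than Stata, uses a single regressor instead of the cubic specification to keep the arithmetic short, and the `age` and `wage` values are made up for the example; they are not taken from cps13_construction.dta.

```python
def ols_fit(x, y):
    """Return (intercept, slope) of a one-regressor OLS fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data, for illustration only (not from the assignment data set).
age = [22, 30, 38, 46, 54, 62]
wage = [14.0, 19.5, 23.0, 24.5, 24.0, 21.0]

intercept, slope = ols_fit(age, wage)

# Fitted values go on the horizontal axis, residuals on the vertical axis.
fitted = [intercept + slope * a for a in age]
residuals = [w - f for w, f in zip(wage, fitted)]
pairs = list(zip(fitted, residuals))

for f, r in pairs:
    print(f"fitted = {f:7.3f}   residual = {r:7.3f}")
```

Plotting the pairs as a scatter with a horizontal line at zero gives the same kind of picture; a spread of residuals that widens or narrows along the horizontal axis is the visual sign of heteroskedasticity.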
3. [10 points]
Consider again the regression of hourly wage on a quadratic function of age. Consider a 10-year increase in age. By how much would hourly wage change? Briefly explain how this number is computed.
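The arithmetic behind a discrete change in a quadratic specification can be sketched as follows. With wage = b0 + b1·age + b2·age², the predicted change for a move from one age to another is b1 times the change in age plus b2 times the change in age squared, so it depends on the starting age. The coefficient values `b1` and `b2` below are hypothetical placeholders, not estimates from the data, and the function name is made up for the example.

```python
def wage_change(age_start, b1, b2, delta=10.0):
    """Predicted wage change when age rises from age_start to age_start + delta
    in the model wage = b0 + b1*age + b2*age**2 (the intercept cancels out)."""
    age_end = age_start + delta
    return b1 * (age_end - age_start) + b2 * (age_end ** 2 - age_start ** 2)


# Hypothetical coefficients, for illustration only.
b1 = 1.2
b2 = -0.012

for start in (25, 35, 45):
    change = wage_change(start, b1, b2)
    print(f"age {start} -> {start + 10}: predicted change = {change:.3f}")
```

Because the change depends on the starting age, any answer should state the age at which the 10-year increase is evaluated.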
Now, regress hourly wage on age, the indicator for male, and their interaction. Interpret the estimated coefficients of age and male × age. What is the marginal effect of male on hourly wage? Interpret this coefficient and explain how Stata computes it.
4. [10 points]
Run a regression of hourly wage on a quadratic function of age and education. Now, add the variable male to the set of explanatory variables. Compare the estimated coefficient of education when gender is not controlled for (former regression) and when gender is controlled for (latter regression). What do you notice? Would there be an omitted variable bias if gender is omitted? If yes, what would the sign of this bias be?
Question 2 [60 points]
Consider the linear model
$$y_i = x_i'\beta + u_i, \quad \text{where } u_i \mid x \sim N(0, \sigma^2), \quad \forall i = 1, \dots, N \tag{1}$$
and assume random sampling.
1. [5 points]
What is the expected value of yi conditional on xi? And what is the variance of yi conditional on xi?
What is the distribution of yi conditional on xi? Motivate your answer.
2. [5 points]
You decide that the model in equation (1) is suitable to study the relationship between out-of-pocket medical expenses and health. To perform this analysis, you rely on the data set “meps_p15_2011.dta” provided with this assignment. The data set includes 5,587 individuals age 18-64. For each individual in the sample you observe gender, age, education (highest grade), health (1=Poor; 2=Fair; 3=Good; 4=Very Good; 5=Excellent), annual family income, total and out-of-pocket medical expenses in a year.
Consider the variable medexp_oop, which records out-of-pocket medical expenses. What are the mean and the median of this variable? Why is there a large difference between the mean and the median? Explain.
3. [5 points]
Now consider the logarithm of medexp_oop: log_medexp_oop = ln(medexp_oop). Using the command correlate, investigate the pairwise correlations between log_medexp_oop and the following variables: gender, age, and education. Interpret the results that you obtain.
4. [5 points]
Now run a regression of log_medexp_oop on a constant, gender, age, and education. Interpret the results that you get and compare the estimated coefficients with the pairwise correlations you obtained in part 3.
5. [5 points]
Run a regression of log_medexp_oop on a constant and health. Since health is a categorical variable, use the factor variable notation and omit the category health=1 (Hint: you should use ib1.health to indicate that you want to omit the category health=1. Read the Stata help by typing help fvvarlist for more information on factor variable notation). Explain why it is necessary to omit one health category in this regression. Interpret the estimated coefficients for the health categories you obtain.
6. [5 points]
Type help functions in Stata, then click on statistical functions and then again on Student’s t and noncentral Student’s t distributions.
Please read about the built-in functions t(.,.) and invt(.,.).
Now type display invt(5579,0.95). As you can see, you obtain the value 1.6451268. What does this value represent? Draw the t distribution with 5579 degrees of freedom and place this value in the graph. Now type display t(5579,1.6451268). You obtain the value 0.95. What does this value represent? How would you illustrate what this value represents in the graph?
Use the command invt(5579,.) to obtain the two critical values that would allow you to carry out a two-tailed test of hypothesis at the 5% significance level. Use a graph to illustrate these two critical values and the rejection areas.
7. [10 points]
Run a regression of log_medexp_oop on a constant, gender, age, education, and health (as a categorical variable and omitting poor health).
Explain how Stata computes the t-statistic, the p-value and 95% confidence interval for the coefficient of health=fair returned by the regression output.
Use the Stata built-in functions t(.,.) and invt(.,.) to manually compute the p-value and the 95% confidence interval for the coefficient of health=fair.
Do you reject the null at the 5% significance level? What is the smallest significance level at which you would reject the null?
8. [5 points]
Test the null hypothesis that the coefficient of health=fair is equal to -0.05 against the alternative that it is different from -0.05. To do that, compute the t-statistic and use a 5% significance level. Do you reject the hypothesis?
Test the same hypothesis using the Stata command test. What type of statistic does Stata use? Is its value consistent with the t-statistic you have just computed? Explain.
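The mechanics of testing against a non-zero null value can be sketched as below: the t-statistic is the distance between the estimate and the hypothesized value, measured in standard errors. For a single linear restriction, the square of this t-statistic equals the corresponding F statistic with one numerator degree of freedom, which is a useful check when comparing outputs. The estimate and standard error used here are hypothetical numbers, not results from meps_p15_2011.dta.

```python
def t_statistic(beta_hat, se, beta_null):
    """t-statistic for H0: beta = beta_null."""
    return (beta_hat - beta_null) / se


# Hypothetical estimate and standard error, for illustration only.
beta_hat = -0.20
se = 0.08
beta_null = -0.05  # null value stated in the question

t_stat = t_statistic(beta_hat, se, beta_null)
f_stat = t_stat ** 2  # single restriction: F(1, n - k) equals t squared

print(f"t = {t_stat:.4f}")
print(f"F = {f_stat:.4f}")
```

The t-statistic is then compared with the two-tailed critical values from the t distribution with the regression's residual degrees of freedom.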
9. [5 points]
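Now consider the linear model
$$y_i = x_i'\beta + u_i, \quad \text{with } E[u_i \mid x_i] = 0 \text{ and } E[u_i^2 \mid x_i] = \sigma^2, \quad \forall i = 1, \dots, N \tag{2}$$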
and assume random sampling.
What is the main difference with the model in equation (1)?
Under the assumptions in equation (2), what is the expected value of yi conditional on xi? And what is the variance of yi conditional on xi? What is the distribution of yi conditional on xi?
10. [10 points]
Consider again the regression in part 7. Test the null hypothesis that the coefficient of health=fair is 0 against the alternative that it is different from 0. Use approximate inference and a 5% significance level. Provide the p-value as well as the confidence interval and compare them with those provided by Stata in the regression output. Do you reject the null? Does it make any difference if you rely on exact or approximate inference? (Hint: you will need to use the Stata built-in functions for the standard normal distribution, normal(.) and invnormal(.), which return the cumulative distribution function and its inverse.)
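The steps of approximate (large-sample) inference can be sketched as follows, using the standard normal distribution in place of the t distribution. The sketch is in Python; `NormalDist().cdf` and `NormalDist().inv_cdf` play the roles of Stata's normal() and invnormal(). The function name, the estimate, and the standard error are hypothetical and are not taken from the regression output.

```python
from statistics import NormalDist


def normal_inference(beta_hat, se, beta_null=0.0, level=0.95):
    """Two-sided test and confidence interval based on the standard normal."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    z = (beta_hat - beta_null) / se
    # Two-sided p-value: probability mass in both tails beyond |z|.
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    # Critical value leaving (1 - level) / 2 in each tail.
    crit = std_normal.inv_cdf(1 - (1 - level) / 2)
    ci = (beta_hat - crit * se, beta_hat + crit * se)
    return z, p_value, ci


# Hypothetical estimate and standard error, for illustration only.
z, p_value, ci = normal_inference(beta_hat=-0.20, se=0.08)

print(f"z = {z:.4f}")
print(f"p-value = {p_value:.4f}")
print(f"95% CI = [{ci[0]:.4f}, {ci[1]:.4f}]")
```

With several thousand degrees of freedom, the t and normal critical values are very close, so the comparison with the exact results is mostly a matter of the third or fourth decimal place.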