SESS0023 Applied Econometrics 2023-2024
This CW contributes 25% to your final mark
Please read Guidance on AI use in the assessment (available on Moodle)
before you start working on this CW
Group Coursework Description
NOTE: Affiliate students who are at SSEES for Term 1 only will be assessed by a single individual coursework, which is uploaded on Moodle under a different link (see section ‘Assessment for Term 1 Affiliate Students’).
CONTENT:
1. General information.
2. Group and dataset allocation.
3. Submission.
4. Exercise 1. Modelling Educational Attainment.
4.1 Data and description of variables.
4.2 Questions for Exercise 1.
5. Exercise 2. Modelling AR(p) process.
6. Exemplary question for Exercise 1, its answers and presentation.
1. General information
You have to prepare a short project in which you will demonstrate the ability to conduct basic empirical (comparative) analysis. The empirical part of the project has to be done with the use of the statistical package Stata. The project will consist of no more than 2500 words. The project should include all relevant Stata tables and graphs. The project mark will contribute 25% towards the final overall mark.
There are TWO EXERCISES in the Coursework. The questions to be answered are identical for all of you. Each exercise is worth 50% of the total Coursework mark.
2. Group and dataset allocation
You’ll be working on the Coursework in GROUPS of up to three members. However, each of you will be allocated an individual dataset, which you will be asked to combine with those of your group members at some stage (see the description of the exercises in Sections 4 and 5).
You can find the allocation of the datasets for Exercises 1 & 2 in the PDF file on the course Moodle page under “Allocation of datasets for coursework”. In this file, you will find your student number along with the number of the dataset (data_1.dta, data_2.dta, etc.) for Exercise 1 and the time series variable (y1, y2, etc.) which you should use for Exercise 2. Please download the appropriate datasets from the corresponding “Datasets for coursework” folder.
NB: If you are not on the list, please email Dr Svetlana Makarova ([email protected]) immediately. In your message, indicate your student number, programme and year of study.
3. Submission
Download the Front Page (see Moodle, the CW-1 section) for the coursework, attach it to your coursework and:
1) indicate the student numbers of all group members;
2) fill in the table for confirming contribution of each member.
NB: If you have concerns regarding unequal contribution of group member(s), please email Dr Svetlana Makarova ([email protected]).
There are TWO parts in the submission:
Part 1: An electronic version (in Word or PDF format) must be uploaded to Turnitin via the link provided on the course Moodle page (do not forget to attach the Front Page).
Part 2: A log file containing all records of your work with the empirical data while preparing the Coursework Exercises must be uploaded to Moodle via the link provided on the course Moodle page. Please note that a do-file is not required; it will not replace the log file and will not be counted as part of the submission.
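As an illustration of how such a log file can be produced, the sketch below opens a plain-text log at the start of a Stata session and closes it at the end. The file name coursework_log.log is a hypothetical choice, not a required name, and the text option is an assumption about the preferred format.

```stata
* Start recording everything typed and displayed in this session.
* "coursework_log.log" is a hypothetical file name; "text" writes plain text
* rather than Stata's default SMCL format; "replace" overwrites an old file.
log using coursework_log.log, text replace

* ... all your work with the data goes here ...

* Stop recording and close the file before uploading it.
log close
```

If the work is spread over several sessions, the append option can be used in place of replace so that each session is added to the same log file.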
Deadline: 3 PM on Thursday 25 January 2024.
4. Exercise 1. (50%)
The dataset data_N.dta (where N is your number on the list) contains cross-section data. Variables are named identically in all datasets, but the datasets themselves differ. This means that the same questions should be answered using your individual dataset; however, the numerical results will differ and might therefore lead to a different interpretation of the empirical outputs.
4.1 Data and variables description
(see Ch. Dougherty, 2007, Introduction to Econometrics, 3rd ed., Oxford University Press)
The data set is a sub-set of a major US data-base, the National Longitudinal Survey of Youth (NLSY79). Each dataset contains data for each respondent on the following variables (C indicates a continuous variable, D a dummy variable):
Personal variables
ID C Respondent identification number
FEMALE D Sex of respondent (0 if male, 1 if female)
MALE D Sex of respondent (1 if male, 0 if female)
AGE C Age in 2002
HEIGHT85 C height in inches in 1985
WEIGHT85 C weight in pounds in 1985
WEIGHT02 C weight in pounds in 2002
S C years of schooling (highest grade completed as of 2002)
Ethnicity:
ETHBLACK D black
ETHHISP D hispanic
ETHWHITE D non-black, non-hispanic
Highest educational qualification:
EDUCPROF D Professional degree
EDUCPHD D Doctorate
EDUCMAST D Master’s degree
EDUCBA D Bachelor’s degree
EDUCAA D Associate’s (two-year college) degree
EDUCHSD D High school diploma or equivalent
EDUCDO D High school drop-out
Marital status
SINGLE D Single, never married
MARRIED D Married, spouse present
DIVORCED D Divorced or separated
Score on a component of the ASVAB battery (scaled with mean 50, standard deviation 10):
ASVAB02 C arithmetic reasoning
ASVAB03 C word knowledge
ASVAB04 C paragraph comprehension
ASVAB05 C Numerical operations (speed test)
ASVAB06 C Coding speed (speed test)
ASVABC C composite of ASVAB02 (with double weight), ASVAB03 and ASVAB04
Faith
FAITHN D None
FAITHC D Catholic
FAITHJ D Jewish
FAITHP D Protestant
FAITHO D Other
Family background variables
SM C mother’s years of schooling
SF C father’s years of schooling
SIBLINGS C number of siblings
LIBRARY D Member of family possessed a library card when respondent was 14
POV78 D Family living in poverty in 1978
Region of residence (census classification):
URBAN D living in an urban area
REGNE D north-east
REGNC D north-central
REGW D west
REGS D south
Work-related variables
EXP C total years of work experience
EARNINGS C current hourly earnings in 1996 constant dollars
HOURS C hours worked per week
TENURE C years worked with present employer
COLLBARG D pay set by collective bargaining, 2002
Category of employment:
CATGOV D Government
CATPRI D Private sector
CATSE D Self-employed
4.2 Questions for Exercise 1
In this exercise you will formulate and estimate a model explaining the average hourly wage rate, named EARNINGS. On what characteristics might personal earnings depend? In particular, do earnings depend on schooling (named S in the file), working experience (EXP), gender (MALE) and/or other factors?
Please address all of the following aspects (the percentage in brackets indicates how much each aspect contributes to the overall mark for Exercise 1):
1. (5%) Plot the scatter diagrams for earnings against experience and/or schooling based on each individual dataset. Explain the scatter diagram(s) you obtained. What other characteristics that are available in the file might affect earnings? Explain your choice (e.g. you might wish to plot scatter diagrams or compute a correlation matrix to support your conclusions).
2. (25%) Can an individual's earnings be explained by working experience, years of schooling and gender? Choose one of the datasets (indicate the dataset name explicitly), regress EARNINGS on S, EXP and MALE, and interpret the regression results by answering the following questions:
i. Formulate an econometric model for explaining earnings by schooling, experience and gender. Present Stata estimation output as a table and as an equation.
ii. Formulate and perform test for overall significance of the regression model. Explain your result. What is R-squared for this model? Give its interpretation. Do you think it is high or low and what does this mean?
iii. Test the significance of the individual coefficients. For this, formulate the appropriate null and alternative hypotheses for performing the t-test. Explain your conclusion. Give a precise interpretation of all the estimated coefficients.
iv. Perform residual analysis for your model and explain your finding. In particular, comment on whether your model suffers from heteroscedasticity or not.
v. Follow the steps below to combine the datasets of all members of your group into a new dataset. Make sure that you delete duplicated observations.
a) Keep one individual dataset loaded in Stata's memory (for example, data_100.dta).
b) From the drop-down menu choose: Data -> Combine datasets -> Append datasets and fill in the related fields in the pop-up window (for example, choose data_200.dta).
c) Sort the observations by the command:
sort ID
Open the data browser and check for duplicated observations.
d) If needed, delete the duplicated observations by using the command: duplicates drop ID, force
e) Save the newly created file by using the drop-down menu: File -> Save as (for example, as combined_data.dta).
vi. Re-estimate the original model using the expanded dataset. Compare the estimation output with that obtained in point 2.iv above and explain possible differences. Analyse the residuals and compare your finding(s) with those in point 2.iv. Perform statistical testing for heteroscedasticity and explain your results. What are the consequences of your findings for hypothesis testing and parameter estimates? Suggest and implement remedies if heteroscedasticity is a problem for your model.
3. (15%; continue working with the combined dataset) To decide how the model for earnings above can be improved, answer the following questions:
i. Does this model allow for testing whether the marginal effect of schooling on earnings depends on gender? If your answer is ‘yes’, explain how you can test this. If your answer is ‘no’, suggest your approach to answering this question (e.g. introducing more variables into the model, changing the functional form, etc.) and explain the results.
ii. Discuss briefly what other factors (among those given in the dataset) might affect earnings. What other functional form might be suitable for modelling earnings (e.g. log-log or log-level)? Support your conclusions with brief quantitative and/or graphical evidence.
4. (3%) What are the overall conclusions of your investigation? Are there any policy conclusions which can be drawn from it?
5. Log file for this exercise accounts for 2% of the overall mark for this Exercise.
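The menu-based steps a) to e) in question 2.v can also be typed as commands, which has the advantage that they are recorded in the log file. The sketch below is one possible command-line equivalent under stated assumptions, not a required procedure: data_100.dta and data_200.dta are the hypothetical file names used as examples in the steps, and the last three commands are common Stata tools for the regression and residual analysis rather than ones prescribed by this handout.

```stata
* Hypothetical file names: replace with the datasets allocated to your group.
use data_100.dta, clear          // step a) load one individual dataset
append using data_200.dta        // step b) append a second dataset
sort ID                          // step c) sort by respondent ID
duplicates report ID             // count how many IDs appear more than once
duplicates drop ID, force        // step d) keep one observation per ID
save combined_data.dta, replace  // step e) save the combined file

* One common way to estimate the model and inspect the residuals
* (an illustration, not the only acceptable approach):
regress EARNINGS S EXP MALE
rvfplot, yline(0)                // residuals plotted against fitted values
estat hettest                    // Breusch-Pagan / Cook-Weisberg test
```

A third group member's dataset can be added by repeating the append line with that file name before sorting.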
5. Exercise 2. (50%).
File: TS_Exercise_2.dta
The file contains 92 time series, but each of you needs ONLY TWO variables: the variable time, which indicates time at a generic frequency, and the variable yN, where N is your number on the list of allocated datasets and variables (e.g. if your number is 100, then you will be working with variable y100).
You may wish to delete all other ‘y-variables’ that you don’t need (use Stata command ‘drop’ and save new dataset as TS_yN.dta, e.g. TS_y100.dta, if your number in the list is 100).
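A minimal sketch of this step is given below, assuming your allocated number is 100 (a hypothetical value). It uses keep, which retains only the listed variables, as an alternative to listing every unwanted variable after drop; either command gives the same reduced dataset.

```stata
* y100 and TS_y100.dta are hypothetical names for allocation number 100.
use TS_Exercise_2.dta, clear
keep time y100              // retain only the time index and your own series
save TS_y100.dta, replace   // save the reduced dataset under a new name
```

For the group questions, the same keep line can list all of the group's variables, for example keep time followed by the three allocated names.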
All ‘y-variables’ are either stationary or unit root processes that become stationary after taking the first difference. In this exercise you will be asked to perform visual analysis to decide whether a particular ‘y-variable’ that is allocated to you (denoted here as yN) is stationary or not, transform it into a stationary form if necessary, and then fit an autoregressive process of a proper order to it.
Use the variables named yN1, yN2, yN3, where N1, N2, N3 are the corresponding numbers of your group members on the allocation list.
1. (48%) Answer the following questions:
i. Plot (separately) time series graphs and correlograms for variables yN1, yN2, yN3 and comment on their stationarity.
ii. If at least one of the variables yN1, yN2, yN3 is nonstationary, then choose it (say explicitly which variable has been chosen), generate its first difference and name it as Z. Check Z for stationarity by using a time series graph and a correlogram; then go to question 1.iv. Otherwise, go to question 1.iii.
iii. If you decide that all the variables yN1 – yN3 are stationary, then choose one of them (it can be any variable, but indicate clearly which one you have chosen), rename it as Z and go to the next question.
iv. Fit AR(p) model for variable Z and justify your choice of p.
2. Log file for this exercise accounts for 2% of the overall mark for this Exercise.
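The graphical steps in questions 1.i and 1.ii could be carried out along the following lines. This is only a sketch under assumptions: y100 is a hypothetical allocated variable, and the sketch assumes that the variable time is a regularly spaced integer index that tsset will accept. The pac command is included because partial autocorrelations are commonly inspected when choosing the order p; it is not prescribed by this handout.

```stata
* y100 is a hypothetical variable name; use your own allocated variables.
use TS_Exercise_2.dta, clear
tsset time                  // declare the data to be a time series

tsline y100                 // time series graph of the level
ac y100                     // correlogram (autocorrelations) of the level

* Only if y100 appears nonstationary: take the first difference.
generate Z = D.y100         // D. is Stata's first-difference operator
tsline Z                    // time series graph of the difference
ac Z                        // correlogram of the difference
pac Z                       // partial autocorrelations of the difference
```

If the chosen variable is judged stationary instead, generate Z = y100 or rename y100 Z can take the place of the differencing line.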
6. Exemplary question for Exercise 1, its answers and presentation
Below is the exemplary answer showing what is expected from you in terms of answering one particular question related to your results.
Question.
Fit educational attainment by regressing S on ASVABC and SM and interpret the regression coefficient on the respondent’s mother’s schooling.
Answer:
The table below gives the regression output (remember to choose Courier New 9 font to obtain proper formatting):
. reg S ASVABC SM

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =  139.54
       Model |   1137.7605     2  568.880251           Prob > F      =  0.0000
    Residual |  2189.23765   537  4.07679264           R-squared     =  0.3420
-------------+------------------------------           Adj R-squared =  0.3395
       Total |  3326.99815   539  6.17253831           Root MSE      =  2.0191

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   .1138736   .0098421    11.57   0.000     .0945399    .1332074
          SM |   .2358294   .0361032     6.53   0.000     .1649085    .3067502
       _cons |   5.128721   .5209361     9.85   0.000     4.105398    6.152043
1. The estimated coefficient on mother’s schooling is 0.24, and it is significant at the 5% significance level because the p-value for this coefficient is reported as 0.000, i.e. it is less than 0.0005 (and hence < 0.05). We can therefore reject the null hypothesis that the coefficient on SM equals zero in favour of the alternative hypothesis that it is not equal to zero.
2. The magnitude of 0.24 indicates that the respondent’s schooling increases (on average) by 0.24 years for each additional year of the mother’s education, holding ASVABC constant.
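To make the link between the table and the conclusion explicit, the t-statistic, p-value and confidence interval for SM can be recomputed by hand from the reported coefficient and standard error. The sketch below uses only numbers from the output above and Stata's built-in ttail() and invttail() functions; 537 is the residual degrees of freedom shown in the table.

```stata
* All numbers are taken from the regression output above.
scalar b  = .2358294                 // coefficient on SM
scalar se = .0361032                 // its standard error
scalar df = 537                      // residual degrees of freedom

display "t statistic:  " b/se                      // compare with the t column (6.53)
display "two-sided p:  " 2*ttail(df, abs(b/se))    // compare with the P>|t| column
display "95% CI lower: " b - invttail(df, .025)*se
display "95% CI upper: " b + invttail(df, .025)*se
```

The displayed values should agree, up to rounding, with the SM row of the table, which is a quick check that the output has been read correctly.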