Problem Set 2- EDPA 5002 Spring 2025

Due March 31st

The general instructions for this problem set are the same as for the earlier one. Please follow the ethics of working in a group: each member should make an equal and significant contribution to the homework, and each group should submit one assignment. Please attach the log file and do-file along with the typed Word document.

Part 1:

Data description: The State of the World’s Children 2023 (SOWC 2023). Search online for this dataset and its statistical report. The data are available in separate spreadsheets on various measures. The topic of this report is highly relevant in more than one respect.

1. In the report, read the key messages section. Using bullet points, state the 3 key messages that you found most influential. Why were they so impactful for you?

2. The State of the World’s Children 2023 examines the role of vaccination in young children’s health and development. Specifically, it addresses how primary healthcare can be strengthened to improve immunization services. The report includes one-page case studies on a few countries. Pick one country that interests you and summarize the key points, being sure to include your own interpretation of them. Because each case study is only one page long, keep your answer concise, say 2 paragraphs.

3. Now, let’s do some basic analysis. We cleaned the Education data for SOWC 2015 in Class. Here, let’s switch gears and look at the Empowerment of Women data. Women’s empowerment seems to have a direct influence on education and is gaining traction in recent research, both in how data are collected and in how empowerment is measured and calibrated. Navigate the UNICEF website and download the Women’s Economic Empowerment Excel file for 2023, which is part of the SOWC 2023 data files. Clean and retain the following variables: Social Institutions and Gender Index, Educational Attainment, Labor Force Participation Rate, Mobile Ownership, and Financial Inclusion. Note that some variables are categorical and some are defined for both males and females. Clean the columns in Excel so that each one is clearly identified.

4. Now convert this Excel file to Stata data and present overall descriptive statistics (for all countries). Note that you have to generate a country-level identifier, etc., similar to what we did in Class for the 2015 data. Tabulate Social Institutions and Gender Index and present the output. What percent of the countries have the highest value of this index? Report the summary statistics for Educational Attainment, Labor Force Participation Rate, Mobile Ownership, and Financial Inclusion for both males and females. Make a table for this output. Consider only non-missing values for all the variables.
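As a starting point for question 4, the import and summary steps might look like the sketch below. The file name and all variable names (country, sigi, educ_f, and so on) are assumptions made for illustration; substitute the names you chose when cleaning the spreadsheet.

```stata
* Sketch only: file and variable names are hypothetical
import excel using "sowc2023_women_empowerment.xlsx", firstrow clear

* Numeric country-level identifier built from the country name
encode country, generate(country_id)

* Frequency table for the categorical index (missing values are excluded by default)
tabulate sigi

* N, mean, SD, min, and max over non-missing values, one row per variable
tabstat educ_f educ_m lfpr_f lfpr_m mobile_f mobile_m fin_f fin_m, ///
    statistics(n mean sd min max) columns(statistics)
```

The N column reports the non-missing count for each variable separately, so it also shows how coverage differs across measures.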

5. Finally, are the means for males and females the same for mobile ownership and for financial inclusion? Do they differ statistically? (We could run more advanced t-tests by filtering the data by country or by geographical region, as given in the summary notes. Here, let’s conduct the standard simple t-test for a difference in means.) Stata command for ttest if one variable is categorical:

ttest varname, by(categorical variable) 

Stata Command if both variables are continuous:

ttest v1 == v2

where v1 and v2 are variable names.
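In this dataset the male and female values sit in two separate columns for the same countries, so the two-variable form is the one that applies. A minimal sketch, assuming the hypothetical variable names mobile_f, mobile_m, fin_f, and fin_m:

```stata
* Paired t-test: each country contributes one female and one male value
ttest mobile_f == mobile_m
ttest fin_f == fin_m

* Same comparison, treating the two columns as independent samples
ttest mobile_f == mobile_m, unpaired
```

With two variable names, ttest runs a paired test by default and uses only countries where both values are present; the unpaired option switches to a two-sample test. State which one you used when you report the result.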

Part 2: In this part, we will replicate what we did with SOWC 2015 data to 2005 data.

We will use the country-level State of the World’s Children (2015 and 2005) data gathered by UNICEF. We have already processed the 2015 data in Class. Part of this assignment involves processing the 2005 data and merging them in. You will use these data to investigate the aggregate country-level relationship between the outcome, literacy rate (2015), and the wealth of the country, represented by gross national income (GNI in 2005), while accounting for literacy rates in 2005. The 2005 data are available in Excel format on Canvas. Parts 2-4 of this assignment refer to these data.

Nothing to submit for Part 2, but we will review your do file to ensure this section has been completed. Additionally, the dataset you create in this part will be the basis of your analyses for the following parts.

1. Process the basic indicators data in Excel so that you can import it into Stata. Then, import it into Stata and go through the necessary steps to clean the data, including:

· Renaming and labeling variables

· Dealing with missing data (both absolutely missing and footnoted)

· Use the information below for the missing values, similar to what we did in class

· The SOWC (2005) document on Canvas includes the footnote definitions below; the Excel file does not:

· a: Range $765 or less.

· b: Range $766 to $3035.

· c: Range $3036 to $9385.

· d: Range $9386 or more.

· Convert string variables to numeric

· Round to the nearest hundredth

We are going to do two additional steps that we have not done on previous data:

· Convert GNI (given in 2003 dollars) to 2013 dollars, using the BLS CPI inflation calculator, so that GNI in both years are on the same scale: https://www.bls.gov/data/inflation_calculator.htm

· Add the suffix 2005 to all the variables (except country) so that we can easily identify which variables came from the earlier year when we merge data together.
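The cleaning steps listed above could be strung together roughly as follows. This is a sketch under several assumptions: the file name and the raw variable names GNI and litrate are hypothetical, GNI is assumed to import as a string whose only non-numeric entries are the footnote letters a to d, and the CPI factor is a placeholder. Follow the approach from Class where it differs.

```stata
* Sketch only: file name and raw variable names are hypothetical
import excel using "sowc2005_basic_indicators.xlsx", firstrow clear

* Keep a record of footnoted GNI ranges (a-d), then blank them before destring
generate str1 gni_range = GNI if inlist(GNI, "a", "b", "c", "d")
replace GNI = "" if gni_range != ""
destring GNI litrate, replace

* Convert 2003 dollars to 2013 dollars
* ASSUMED placeholder factor: replace it with the ratio the BLS calculator gives you
local cpi_factor = 1.27
replace GNI = round(GNI * `cpi_factor', 0.01)
replace litrate = round(litrate, 0.01)

* Add the 2005 suffix to every variable except country
ds country, not
foreach v of varlist `r(varlist)' {
    rename `v' `v'2005
}
```

Keeping gni_range preserves the income-band information for countries whose GNI is footnoted rather than reported, in case you want to tabulate it later.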

2. First, rename the variables in the class merged data file so that they have a suffix of 2015.

Merge the 2005 data onto the previously merged dataset (2015 basic indicators and education data) that we created in Class. When you conduct the merge, you’ll see that about a dozen countries fail to merge because of differences in country names between 2015 and 2005. Write the Stata code to substitute the 2015 names into the 2005 data for the cases where they differ. Note that there will not be a match for South Sudan (a new country), and that Serbia and Montenegro were a single country in 2005 and are two independent countries in 2015. We will ignore these 3 cases so as not to make things more complicated. Now, try your merge again. It should merge 194 cases successfully (_merge==3). We now have the dataset on which we will conduct our analyses.
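The rename-and-merge cycle might look like the sketch below. The file names are hypothetical, and the two name substitutions are illustrations only; build your own list from the countries that show up as unmatched.

```stata
* Sketch only: file names are hypothetical, and the two name changes shown
* are illustrations to confirm against your own list of unmatched countries
use "sowc2005_clean.dta", clear
replace country = "Libya" if country == "Libyan Arab Jamahiriya"
replace country = "State of Palestine" if country == "Occupied Palestinian Territory"

merge 1:1 country using "sowc2015_merged.dta"
tabulate _merge

* Countries still unmatched (master only = 1, using only = 2)
list country _merge if _merge != 3
```

Running the list command after a first merge attempt shows the unmatched names from both files, which makes it easy to pair each 2005 spelling with its 2015 counterpart.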

Finally, before we proceed to our analysis, we are going to transform the GNI variables to make them easier to use in regression, i.e., take log transformations in order to make relationships between GNI and other variables more linear. For each GNI variable, create a new variable that is the natural log of GNI, called logGNI2005 or logGNI2015. The natural log transformation is a non-linear transformation that pulls extreme values in, so it is often used for data that have a long right tail. We will discuss this transformation in more detail in the coming Class; for now, it is enough to consider it a useful function to help linearize relationships.
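A minimal sketch of the transformation, assuming the GNI variables are named GNI2005 and GNI2015 after the suffix step; the histograms are optional and only illustrate how the log pulls in the right tail.

```stata
* ln() returns missing for zero, negative, or missing GNI values
generate logGNI2005 = ln(GNI2005)
generate logGNI2015 = ln(GNI2015)
label variable logGNI2005 "Natural log of GNI, 2005"
label variable logGNI2015 "Natural log of GNI, 2015"

* Compare the long right tail of raw GNI with the logged version
histogram GNI2005, name(raw, replace)
histogram logGNI2005, name(logged, replace)
graph combine raw logged
```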

3. Conduct the following checks and univariate descriptive analyses for your own reference:

· List out the first 10-30 cases in the dataset and check – by comparing to the raw spreadsheets – that you have loaded the data correctly.

· Obtain appropriate descriptive univariate statistics and/or histograms on the variables, litrate2015, litrate2005, and logGNI2005.

4. Carry out the following bivariate descriptive analyses for your own reference:

· Obtain a scatterplot of continuous outcome litrate2015 versus the predictor litrate2005 using the msize(tiny) option. Inspect this plot to get some sense of the relationship between the variables.

· Display variables litrate2015 and litrate2005 versus predictor logGNI2005, using scatterplots. Inspect these plots to get some sense of the relationship between logged GNI and literacy rates (over time). If you wish, generate the same plots using the untransformed version of GNI to see how the natural log helps to linearize relationships.
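The plots described above can be produced along these lines. The overlay of both literacy rates on a single set of axes is one possible layout, not a requirement.

```stata
* Literacy in 2015 against literacy in 2005
twoway scatter litrate2015 litrate2005, msize(tiny)

* Both literacy rates against logged GNI, on one set of axes
twoway (scatter litrate2015 logGNI2005, msize(tiny)) ///
       (scatter litrate2005 logGNI2005, msize(tiny) msymbol(X)), ///
       ytitle("Literacy rate") legend(order(1 "2015" 2 "2005"))

* Optional: untransformed GNI for comparison (assumes it is named GNI2005)
twoway scatter litrate2015 GNI2005, msize(tiny)
```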

Nothing to submit here.

Part 3: Gain scores vs. Lagged Scores

1. Generate a “gain score” for literacy rates by subtracting the 2005 variable from the 2015 variable.

2. Conduct two OLS regressions with logGNI2005 as the key question predictor, one in a “gain score” framework and one in a “lagged score” framework. Write a sentence or two for each regression describing the coefficient on log GNI in terms of countries’ “growth” in literacy rates over the past decade.

Note that when describing an X variable that is in log scale, the appropriate interpretation of the regression coefficient (when the outcome is not in log scale) is: a one percent change in X is associated with a β/100 change in Y (the outcome). Again, we will cover log scales in more detail in the coming Class.
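The two frameworks, and the arithmetic behind the β/100 rule, can be sketched as follows. The variable name litgain is a hypothetical choice.

```stata
* Gain score: change in literacy between 2005 and 2015
generate litgain = litrate2015 - litrate2005

* Gain-score framework
regress litgain logGNI2005

* Lagged-score framework: 2005 literacy enters as a control
regress litrate2015 litrate2005 logGNI2005

* Change in Y for a 1% increase in GNI: beta*ln(1.01), roughly beta/100
display _b[logGNI2005] * ln(1.01)
display _b[logGNI2005] / 100
```

The two display lines refer to the most recent regression (the lagged-score model here). They differ only slightly because ln(1.01) is about 0.00995, which is why β/100 is an acceptable approximation.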

Submit your sentences.

3.  You will have noticed that you get a different answer with the gain variable as compared to the lagged literacy rate. Do some exploratory analyses to see if you can determine why this is the case (I suggest examining the distributions of litrate2005 and litrate2015, perhaps dividing the sample between low- and high-GNI countries) and write up your conclusions in a short paragraph.

Submit your sentences and any supporting tables or figures you think appropriate.

For the remaining parts, we are going to focus on the lagged literacy rate model, as the gain score model does not describe the actual relationship in the data.

4. Literacy rate is measured with error. How does measurement error affect the coefficients on 2005 literacy rate and log GNI? Write a sentence or two describing the direction of the biases and a brief explanation of why they exist.

Submit your sentences.

5. Let’s consider whether or not there is an interaction between literacy rate and GNI in 2005 when predicting literacy rates in 2015. Because we will be interacting continuous variables, we first need to center or standardize them. In this case, because the units of the variables are meaningful, we will center them. This will generate two new variables, each with the prefix c_, which are the literacy and GNI variables centered such that the mean value of each is now equal to 0. (If you’d like to reassure yourself that you get the same answer when you use the centered variables in the regression, rerun your lagged score model using the centered variables. You’ll notice the coefficients on log GNI and literacy rate remain the same; only the constant is different.) Now rerun your lagged score model, adding an interaction between the centered, continuous predictor log GNI and the centered, continuous predictor literacy rate. Paste your Stata output for the regression with the interaction into your answer document.
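One way to do the centering by hand and fit the interaction model is sketched below. It uses summarize rather than a user-written centering command, and it centers each variable on the mean of all its non-missing values, which is an assumption about the intended sample.

```stata
* Center by hand: subtract each variable's mean
summarize litrate2005, meanonly
generate c_litrate2005 = litrate2005 - r(mean)
summarize logGNI2005, meanonly
generate c_logGNI2005 = logGNI2005 - r(mean)

* Check: slopes match the uncentered lagged-score model, only the constant moves
regress litrate2015 c_litrate2005 c_logGNI2005

* Lagged-score model with the interaction; ## includes both main effects
regress litrate2015 c.c_litrate2005##c.c_logGNI2005
```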

Interactions between two continuous variables can be difficult to interpret. A graphical inspection is always useful for interactions, whether between a binary and a continuous variable or, as here, between two continuous variables. To do so, graph the fitted values from the regression you just ran, with literacy rate in 2015 on the vertical axis and log GNI on the horizontal axis. Your graph should have three lines: one showing the prototypical relationship for countries with average literacy rates in 2005, and two showing the prototypical relationships for countries 20 percentage points above and below the mean (the standard deviation of literacy rates is ~20 percentage points). Make sure each line is labeled and is graphed only over an appropriate range of the data.
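One possible route to the three-line graph uses margins and marginsplot, as sketched here. It assumes centered variables named c_litrate2005 and c_logGNI2005, and the -2 to 2 grid for centered log GNI is a placeholder to replace after inspecting your data.

```stata
regress litrate2015 c.c_litrate2005##c.c_logGNI2005

* Fitted values at mean literacy and +/- 20 points, across centered log GNI
* ASSUMPTION: the -2 to 2 range is a placeholder; check it with summarize
margins, at(c_litrate2005=(-20 0 20) c_logGNI2005=(-2(0.5)2))

marginsplot, xdimension(c_logGNI2005) noci recast(line) ///
    xtitle("Centered log GNI, 2005") ytitle("Predicted literacy rate, 2015")
```

This draws all three lines over the same grid. Restricting each line to the GNI range actually observed for that literacy group takes extra work, for example separate margins calls with different grids.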

Finally, using this graph, write a paragraph describing the relationship between GNI, literacy rates in the past, and current literacy rates, considering the interaction effect. Consider the policy implications.

Submit your Stata output, your clearly labeled graph, and your paragraph.

Part 4: Multiple Imputation

1. Set up for multiple imputation: Following the steps described in the class slides, set up your data for multiple imputation. The variables we wish to impute are GNI2005 and litrate2005. We will use lifexp2005, enrlrate2005, and IMR2005 to impute our question predictors. Be sure to mi set the data, “register” the former variables as imputed variables, and “register” the latter variables, which we will use to impute them, as regular variables. Then impute the two variables using the mi impute chained command to create 5 additional datasets with imputed values, using the augment and force options, as well as the rseed() option so that you can replicate your findings.

Now that you have imputed your variables, use mi passive: to generate the log version of GNI. Unfortunately, mi passive won’t accept the user written command center, so you’ll need to write some code that creates the centered versions of log GNI and literacy rate in 2005, taking into account that the mean you are centering against is different for each iteration of mi. (It’s less fun but you can also use mi passive: egen with the options for standardize.) You can also use mi passive to generate the interaction between the two centered variables.

Then use the mi xeq: summarize command to summarize the data for logGNI2005 and litrate2005. Write a few sentences describing what you see in these summary statistics.
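The setup, imputation, and passive-variable steps might be sketched as follows. The mi style (wide), the regress imputation method, the seed value, and the helper and interaction variable names (mean_logGNI, mean_lit, c_lit_x_gni) are all assumptions; follow the class slides where they differ.

```stata
* Declare the mi style and register variables
mi set wide
mi register imputed GNI2005 litrate2005
mi register regular lifexp2005 enrlrate2005 IMR2005

* Chained equations, 5 imputations; the seed value is arbitrary
mi impute chained (regress) GNI2005 litrate2005 = lifexp2005 enrlrate2005 IMR2005, ///
    add(5) rseed(12345) augment force

* Passive variables are rebuilt within each imputation (m = 0, 1, ..., 5)
* (drop any logGNI2005 or c_ variables created earlier, before mi set)
mi passive: generate logGNI2005 = ln(GNI2005)
mi passive: egen mean_logGNI = mean(logGNI2005)
mi passive: egen mean_lit = mean(litrate2005)
mi passive: generate c_logGNI2005 = logGNI2005 - mean_logGNI
mi passive: generate c_litrate2005 = litrate2005 - mean_lit
mi passive: generate c_lit_x_gni = c_litrate2005 * c_logGNI2005

* Summary statistics for the original data (m=0) and each imputed dataset
mi xeq: summarize logGNI2005 litrate2005
```

Because mi passive runs the command separately in each imputation, the egen means, and therefore the centered variables, are specific to each imputed dataset.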

Submit your description of the summary statistics for the imputed (and non-imputed) data. 

2. Note that while imputation has increased the number of observations, it was not possible for Stata to impute all of the missing data (thus our use of the force option above). Why is this the case? Write a few sentences and include tabulations of the data that help explain why there are still some cases missing data.

Submit your sentences and tabulation(s).

3. Using the imputed data from this section, rerun the lagged score model from Part 3, Question 5. Write a sentence or two commenting on whether or not your conclusions about the relationships change when using multiple imputation. You do not need to recreate the graph or comment on the meaning of the interaction term here; just focus on comparing the outcome from this model to the model from Part 3.
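A minimal sketch of the pooled model, assuming the centered variables and a passive interaction variable already exist in the mi data; the name c_lit_x_gni is hypothetical.

```stata
* Pooled lagged-score model across the imputed datasets
mi estimate: regress litrate2015 c_litrate2005 c_logGNI2005 c_lit_x_gni

* Same model on the original (non-imputed) data, for comparison
mi xeq 0: regress litrate2015 c_litrate2005 c_logGNI2005 c_lit_x_gni
```

Comparing the number of observations and the standard errors across the two outputs is a quick way to see what the imputation added.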

Paste in your regression output from Stata and submit your sentence(s).

Copy and paste your Stata .do and log files at the end of your problem set document, using 8pt Courier font. No need to edit these for formatting.

Submit your .do and log files.
