EC224 Stata commands for the empirical data analysis project Spring 2025
Instructor: Ekaterina Gnedenko
There are two commands that you can use to convert a categorical variable into a set of dummy regressors. One of them is
tabulate race, generate(racebinary)
(the new binary regressors will be named racebinary1, racebinary2, and so on, one binary regressor for each category of the original categorical variable race).
Another incorporates the conversion into the regression command itself, for example with the i. factor-variable prefix (i.race) or the older xi: prefix.
The former command has the advantage that you can choose manually, from the set of new binary regressors, the specific race categories you would like to have in your regression; the latter drops the first category by default and uses all the other binary race regressors. Each coefficient is then interpreted as the difference between the average earnings of that race category and those of the reference (dropped) category.
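The two approaches can be sketched as follows. This is only an illustration: it assumes Stata's bundled nlsw88 example data, in which wage stands in for earnings, and it uses factor-variable notation, which requires Stata 11 or later.

```stata
* Sketch using Stata's bundled nlsw88 data; wage stands in for earnings
sysuse nlsw88, clear

* Approach 1: create one dummy per category (racebinary1, racebinary2, racebinary3)
tabulate race, generate(racebinary)

* Choose the reference category yourself by leaving its dummy out
* (racebinary1 is omitted here, so it becomes the reference group)
regress wage racebinary2 racebinary3

* Approach 2: let regress build the dummies; the lowest category is dropped by default
regress wage i.race

* The base category can still be changed, e.g. make category 2 the reference
regress wage ib2.race
```

The first two regressions should give identical coefficients, because both omit the first category.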
You can also try the following useful steps to take care of the categorical regressors:
If you want to see the frequencies of several variables at once, use the tab1 command. It produces a separate frequency distribution for each variable listed:
tab1 gender race
Suppose that you use secondary data (from the Internet) and you have a variable called gender.
When you tabulate that variable, all you see in the tab output is the frequencies of its two categories, with no indication of which numeric code (0 or 1) stands for Male and which for Female.
You might want to add the nol (nolabel) option, which displays the numeric codes of the categories of the gender variable:
tab gender, nol
Knowing the numeric codes of your categorical variable can be helpful if you want to recode it.
For example, suppose you would like to recode a binary variable named gender, which takes the value 1 for male, into a binary variable that takes the value 1 for female, and to name the new variable female. Use the following command:
recode gender (1=0) (0=1), gen(female)
You can use the same command to create one binary variable out of the categorical variable:
Creating a dummy variable from the categorical variable:
recode X (1/8=0) (9/12=1), gen(Xsummer)
***You could use any two numbers you want to represent each category. Assigning 0 and 1 to these types of indicator variables, however, is a common practice. It makes it easier to interpret the slope coefficients on binary variables this way.
Next, to check that the command did what you intended, type
tab X Xsummer //generates a cross-tabulation of the two variables
Since the new variable does not have any value labels, we may want to attach the labels to it.
We can do it in TWO steps:
lab val Xsummer season //the defined value label season is attached to the binary variable Xsummer
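The recoding and the two labeling steps could look like the sketch below. The toy data set and the label text ("not summer", "summer") are assumptions made for illustration; substitute the wording that fits your own variable.

```stata
* Toy data: X takes the values 1 through 12 (hypothetical, for illustration only)
clear
set obs 12
generate X = _n

recode X (1/8=0) (9/12=1), gen(Xsummer)
tab X Xsummer                 // check the recoding

* Step 1: define a value label named season (the label text is an assumption)
label define season 0 "not summer" 1 "summer"
* Step 2: attach the label to the new variable
label values Xsummer season

tab Xsummer                   // shows the labels
tab Xsummer, nol              // shows the numeric codes
```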
Another way to create a dummy variable from the categorical variable
To create a separate dummy for each level of a categorical variable area (1 - urban, 2 - suburban, 3 - rural), use
tabulate area, gen(area) //it generates a series of 3 dummies, area1 – area3, one for each possible value of the original categorical variable area.
Creating dummy variables out of the continuous variables:
Suppose we have a continuous variable density measuring population density across cities. We will use the generate command and the logical operator & to create a binary variable that takes the value 1 if the population density is between 100 and 300 people per square mile and 0 otherwise:
gen suburban = (density >= 100) & (density <= 300) if !missing(density)
There are three additional ways to create dummy variables: one is to use generate, which creates one dummy variable at a time; another is to use tabulate, which creates whole sets of dummies at once; and the third is to use xi, which may allow you to avoid the issue of dummy-creation altogether.
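As a minimal sketch of the generate approach, assume a numeric variable age and a hypothetical dummy called young that marks people under 25. The dummy can be built in two statements or in one:

```stata
* Toy data; age, young and young2 are hypothetical names
clear
input age
19
31
24
47
end

* Two statements: start at 0, then switch to 1 where the condition holds
generate young = 0
replace young = 1 if age < 25

* One statement that gives the same result
generate young2 = (age < 25)

* Verify that the two versions agree
assert young == young2
list
```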
This statement does the same thing as the first two statements. age<25 is an expression, and Stata evaluates it, returning 1 if the expression is true and 0 if it is false.
If you have missing values in your data, it would be better if you type
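A minimal sketch of the missing-value-safe version, again assuming a numeric variable age and hypothetical dummy names. Stata treats a missing numeric value as larger than any number, so age<25 evaluates to 0 for a missing age unless the if qualifier excludes it:

```stata
* Toy data with one missing age
clear
input age
19
31
.
47
end

* Without the qualifier, the observation with missing age is coded 0
generate young_naive = (age < 25)

* With the qualifier, the dummy is missing wherever age is missing
generate young = (age < 25) if !missing(age)

list
```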
You do not have to type the parentheses around the expression; writing age<25 without them on the right-hand side of generate is good enough. Here are some more illustrations of generating dummy variables:
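A sketch of such illustrations, assuming a string variable sex, a numeric variable age, and a 0/1 variable enrolled (the data and the dummy names male and eligible are hypothetical):

```stata
* Toy data, for illustration only
clear
input str6 sex age enrolled
"male"   60 0
"male"   45 1
"female" 58 1
"male"   30 1
end

* 1 for males, 0 otherwise (comparisons with a string variable need quotes)
generate male = sex == "male"

* Combine conditions with & (and) and | (or); enrolled is already 0/1
generate eligible = sex == "male" & (age > 55 | (age > 40 & enrolled)) if !missing(age)

list
```

Stata treats any nonzero value, including a missing value, as true, so the & enrolled shortcut is safe only when enrolled has no missing values.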
In the above line, enrolled is itself a dummy variable—a variable taking on values zero and one. We could have typed & enrolled==1 but typing & enrolled is good enough.
Summary stats
Once you have deleted the observations with missing values and transformed all the categorical variables (such as education and marital status) into binary variables (also known as dummy variables, which take only two values: 0 or 1), you can use the following commands to complete the analysis of your data for the project paper:
sum Y X1 X2 X3 X4 X5 X6
To learn which observations (individuals or countries, depending on your topic) account for the minimum and maximum values shown in the output table of the summarize command, use the list command:
list country if gdppercapita>9000
To learn how many observations (individuals or countries, depending on your topic) satisfy a condition on the values reported by the summarize command (for example, gdppercapita greater than $9,000 per year), use the count command:
count if gdppercapita>9000
If you would like to run your regression on a subset of the observations in your sample (for example, without the country of Malta), type the following command in the Command window (do not forget the straight double quotes; they are necessary for values of string, as opposed to numeric, variables):
regress Y X1 if country != "Malta"
*interpret the slope coefficient for X1 and comment on the high likelihood of the omitted variable bias in the single-regressor model
Generate a scatterplot of Y against X1 with the fitted line superimposed on it:
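One way to draw such a graph is to overlay a scatter plot and a linear fit with twoway. The sketch below assumes Stata's bundled auto data, where price plays the role of Y and mpg the role of X1:

```stata
* Sketch with Stata's bundled auto data: price plays the role of Y, mpg of X1
sysuse auto, clear

* Scatter plot with the OLS fitted line drawn on top of it
twoway (scatter price mpg) (lfit price mpg)
```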
Transforming variables to capture possible nonlinearities in the data
If the scatter plots from the graph matrix command indicate nonlinearities, you can handle them by transforming your Y and (or) X variables using the following commands:
To generate the polynomial terms and natural logarithms of the variables use:
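A minimal sketch of these transformations, assuming the bundled auto data; the new variable names (mpg2, mpg3, lnprice, lnmpg) are hypothetical:

```stata
sysuse auto, clear

* Polynomial terms
generate mpg2 = mpg^2
generate mpg3 = mpg^3

* Natural logarithms (ln() returns missing for zero or negative values)
generate lnprice = ln(price)
generate lnmpg = ln(mpg)

* Quadratic and log-log specifications
regress price mpg mpg2
regress lnprice lnmpg
```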
Estimate the multiple regression model using the regress command with the robust option, which reports heteroskedasticity-robust standard errors of the coefficients:
Note that I added the interaction between X5 and X6 into the multiple regression model above.
To generate the interaction terms, use the generate command, for example:
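The interaction term and the robust regression could be sketched as follows. The example assumes the bundled auto data, with weight and foreign standing in for X5 and X6; the name weightXforeign is hypothetical.

```stata
sysuse auto, clear

* Interaction term: the product of the two regressors
generate weightXforeign = weight*foreign

* Multiple regression with heteroskedasticity-robust standard errors
regress price mpg weight foreign weightXforeign, robust
```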
Estimate the instrumental variable regression model using the added instrumental variables. The ivregress command requires an estimator name (2sls, liml, or gmm) right after the command name; two-stage least squares (2sls) is the most practical choice:
ivregress 2sls Y X2 (X1 = X3 X4) X5 X6 X5X6, first
The first option also displays the first-stage regression.
It is useful to test the endogeneity of the key regressor with the Hausman test, which Stata performs when you run the estat endog command after ivregress has been executed.
Test for the relevance of the instruments in the context of the instrumental variable regression:
Test the joint significance of the slope coefficients on all the instruments by running the first-stage regression as a separate command (the F test reported in the top right corner of the IV regression output tests whether ALL slope coefficients are zero; we need only the instruments):
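A sketch of the relevance test, using simulated data so that it runs on its own. The variable names follow the text (X1 endogenous, X3 and X4 instruments), but the data-generating process is an assumption made purely for illustration.

```stata
* Simulated data, for illustration only; variable names follow the text
clear
set seed 2025
set obs 500
foreach v in X2 X3 X4 X5 X6 u {
    generate `v' = rnormal()
}
generate X5X6 = X5*X6
generate X1 = 0.5*X3 + 0.5*X4 + 0.3*X2 + 0.5*u + rnormal()
generate Y  = 1 + 2*X1 + X2 + X5 + X6 + u

* First stage: endogenous regressor on the instruments and all exogenous regressors
regress X1 X3 X4 X2 X5 X6 X5X6

* Joint F test on the instruments only
test X3 X4

* Alternative: first-stage statistics reported after the IV estimation itself
ivregress 2sls Y X2 (X1 = X3 X4) X5 X6 X5X6
estat firststage
```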
Next, we need to test the exogeneity of the instruments. Note that this test must be run right after the ivregress command!
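One way to run this test is estat overid, Stata's test of overidentifying restrictions, which is available only when there are more instruments than endogenous regressors. The sketch below again relies on simulated data whose data-generating process is assumed for illustration.

```stata
* Simulated data, for illustration only; variable names follow the text
clear
set seed 2025
set obs 500
foreach v in X2 X3 X4 X5 X6 u {
    generate `v' = rnormal()
}
generate X5X6 = X5*X6
generate X1 = 0.5*X3 + 0.5*X4 + 0.3*X2 + 0.5*u + rnormal()
generate Y  = 1 + 2*X1 + X2 + X5 + X6 + u

* Two instruments for one endogenous regressor: one overidentifying restriction
ivregress 2sls Y X2 (X1 = X3 X4) X5 X6 X5X6

* Must follow ivregress directly, before any other estimation command
estat overid
```

Running another estimation command, such as regress for the first stage, replaces the stored IV results, which is why the test has to come right after ivregress.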
See the video by Stata's Chuck Huber on Blackboard for the details of implementing instrumental variable regression in Stata.
A note on missing R2 in the ivregress command output:
For two-stage least squares, some of the regressors enter the model as instruments when the parameters are estimated. However, since our goal is to estimate the structural model, the actual values, not the instruments for the endogenous right-hand-side variables, are used to determine the model sum of squares (MSS). The model’s residuals
Create a table of all your competing regression models using either the outreg2 command (search the Internet for help on this command) or the following series of commands:
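One built-in possibility is to store each model with estimates store and then combine them with estimates table. The sketch below assumes the bundled auto data; the model names m1 to m3 are hypothetical.

```stata
sysuse auto, clear

* Fit the competing models and store each set of results under a name
regress price mpg, robust
estimates store m1

regress price mpg weight, robust
estimates store m2

regress price mpg weight foreign, robust
estimates store m3

* Coefficients with standard errors, plus N and R-squared for each model
estimates table m1 m2 m3, b(%9.3f) se(%9.3f) stats(N r2)
```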