EC203 Stata Assignment #1 Fall 2024
Read and explore data, summarize variables and graph distributions and relationships
NOTE: This assignment has TWO PARTS:
1. Submitting your log file and do-file on Gradescope and
2. Answering questions on Gradescope.
TIP FOR SUCCESS:
• Use comments throughout to create a “roadmap” for both you and the TA
• You are not required to write the answers in your do-file, but doing so may help you. It also creates a reference that combines the concepts with the actual data cleaning and analysis.
1. Open Stata
2. Begin a new do-file
3. Download the data set from Blackboard under “Stata Assignment #1”
4. Turn the Excel data set into a Stata-formatted data set.**
5. Save the data set as lastname_firstname_ps1_data.
6. Report the number of variables and the number of observations.**
7. Report the type of data set and the “unit of observation”.**
8. Generate new variables from existing ones.
9. Summarize a variable using the sum command; interpret a few summary statistics.**
10. Generate a histogram.
11. Calculate two z-scores.**
12. Replace 0s with missing values for certain variables.
13. Summarize the same variable after the changes made above.**
14. Generate a new binary variable based on the values of an existing variable.
15. Summarize a variable based on the values of another variable.**
16. Create a scatterplot with a trend line.
17. Clean up the scatterplot.
18. Find the correlation coefficient between two variables.**
19. Save the data set.
20. Triple-check your do-file has all (successful) commands.
21. Rerun the do-file to check for errors.
22. Start a log file (as a .log file, NOT a .smcl file) and run your do-file from beginning to end.
DOUBLE-CHECK your log file.
23. Submit your log file and your do-file in Gradescope.
**Means there is an associated Gradescope question (or multiple) with this task.
***Submit your do-file & log file on Gradescope before Monday, Nov. 4th at 11:59pm. No exceptions!
“Late” begins at 12:05am ET (Boston time) on Tuesday, Nov. 5th and you lose 10% per day after this deadline.
****BE SURE YOU SUBMIT THE CORRECT FILES IN THE LOG FILE ASSIGNMENT AND DO-FILE ASSIGNMENT ON GRADESCOPE!!
Getting Started with Any Stata Project
READ THIS:
• The instructions below should help you through each step; please read them carefully.
• Show all your work in the do-file (Copy/paste ALL (successful) commands into the do-file)
• You will be graded on both commands in your do-file and your answers to the questions on Gradescope.
1. Open Stata.
2. Begin a new do-file. Click on “New Do-file Editor.”
a. Use comments to place your name, BU ID, “Stata Assignment #1”, and “EC203 Fall 2024” on four separate lines. Note that you can type “help comment” in the Command window to get more info. To get information about any command in Stata, you can use the “help” command.
b. Save the do file as “lastname_firstname_ps1.do” – (I suggest inside an EC203\Stata\Assignment1 folder that you create.)
c. Copy/paste the following lines underneath your personal information (separated by at least one line) to ensure ease of running the do file from beginning to end without error:
clear all
set more off
capture log close
Note: to receive full credit, you must use comments to provide a “roadmap” for all of the different commands that will be in your do file. For example, use “Getting the mean and standard deviation” as a comment when using the sum command. Comments help you follow what each group of lines is designed to do. Also, chunks of commands can be separated by a few empty lines so that you can easily browse and find certain parts of your do file(s). Note that sometimes the command is obvious (no comment required), like when you are using the rename command to rename a variable. In this case, please at least put the question # in your do-file as a comment.
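As a rough sketch of how the top of a do-file might look after Step 2, consider the lines below. The name and ID are hypothetical placeholders, and the question-number comment is only one possible way to build the “roadmap”.

```stata
* Jane Doe                 (hypothetical name; use your own)
* U00000000                (hypothetical BU ID; use your own)
* Stata Assignment #1
* EC203 Fall 2024

clear all
set more off
capture log close

* Question 8: generating the percent-score variables
* (commands for this step would go here)
```

A line that starts with * is treated as a comment, so none of the header lines are executed as commands.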
3. Download the Stata Assignment #1 Data Excel file from Blackboard.
• Save this data set in the same folder in which you saved your do file.
4. Bring the data into Stata: You need to bring the Excel file that you downloaded into Stata to use Stata for your data analysis. Use the help command and Stata documentation to understand the difference between the use command and the import command. Pick the correct command and use it to bring the Excel file into Stata. Make sure this command is included in your do file. (Hint: you’ll need to add/click an option to tell Stata to treat the first row in Excel as variable names).
a. In GRADESCOPE, answer a question about the difference between use and import.
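A minimal sketch of Step 4, assuming the downloaded workbook is named assignment1_data.xlsx (a hypothetical name; use the actual file name from Blackboard) and sits in Stata's current working directory:

```stata
* Assumption: "assignment1_data.xlsx" is a placeholder file name
* firstrow tells Stata to treat the first Excel row as variable names
* clear drops any data already in memory before importing
import excel using "assignment1_data.xlsx", firstrow clear
```

If the file is in a different folder, you would need to give the full path inside the quotes or change the working directory first with the cd command.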
5. You should now have your data in Stata. It can be extremely helpful to save your data at this point as a Stata data set. (Remember it is only an Excel file being viewed within Stata at this point.) To save the data in memory as a Stata data set, use the command save. Save this data set by using the drop down menu (File→) and save it as lastname_firstname_ps1_data.
a. Be sure to add , replace to the end of this line when you copy/paste it to your do-file. It should look like below on your do-file. DO NOT accidentally also copy/paste the “result” after the save command. (The command has the . in front, the “result” is below it.)
save [path]/lastname_firstname_ps1_data, replace
• Notice that the file type by default is .dta. This is the file extension for Stata-formatted data!
• Also notice that the first time you do this, it will show the “result” that “[path]/Gelsheimer_Stacey_ps1_data” not found. (Note: this is the result you do NOT copy/paste into your do-file.) This result appears because you have said replace if it exists so that you can run your do-file again, but the file does not exist the first time you run it. You do not need to be alarmed by this result, but the command should run successfully without error. (There should be no red result.)
6. Notice the Properties window inside the Stata interface. In GRADESCOPE, answer the following:
a. How many variables are there in this data set?
b. How many observations are there in this data set?
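If you would like to cross-check the Properties window from the Command window, the following sketch is one possible way to do it (it is not required by the assignment):

```stata
* Short description of the data in memory: number of observations and variables
describe, short

* Number of observations only
count
```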
7. Now browse your data using the command browse or using the Browse button above the results window (the one that has a magnifying glass looking at a spreadsheet). This data set contains the grades of students who took EC204 during the spring semester.
In GRADESCOPE, answer the following questions:
a. What type of data set is this? (Cross-sectional, Time Series, Panel or Pooled Cross-sectional)
b. What is the “unit of observation” in the data set?
8. Generate two new variables:
a. One which equals the percent score for midterm 1 (call it midterm1perc).
i. Midterm 1 was out of 43.5 points.
b. One that equals the percent score for the midterm 2 (call it midterm2perc).
i. Midterm 2 was out of 55.75 points.
c. For example, someone who earned 40 points on midterm 1 should have a value of 91.95 (40/43.5*100) for the new midterm variable. Note: don’t worry about rounding these values.
i. Note that generating a new variable that equals some mathematical expression involving another variable will calculate the value for each and every observation all at once.
ii. Example: gen x=y+2 will take every observation’s value of y, add 2 and that new value will be the value for that observation’s x (and this will happen for every observation all at once)!
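The sketch below shows what Step 8 could look like. It assumes the original score variables are named midterm1 and midterm2, which is a guess; check the Variables window for the real names in your data set.

```stata
* Assumption: midterm1 and midterm2 are hypothetical names for the raw score variables
gen midterm1perc = midterm1 / 43.5 * 100
gen midterm2perc = midterm2 / 55.75 * 100

* Check the worked example from part c: 40 points on midterm 1 is about 91.95 percent
display 40 / 43.5 * 100
```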
9. In GRADESCOPE, answer a few questions related to the directions below:
a. Use the sum command to see the average midterm 1 and midterm 2 scores (the original variables) across the entire data set. In GRADESCOPE, report the min, max and mean.
b. Use the sum command to see the average midterm 1 and midterm 2 exam percentages (using your newly created variables) across the entire data set. In GRADESCOPE, report the min, max and mean.
*******For the remainder of this assignment, you will be using the midterm 1 and midterm 2 grades as measured in percent (the newly created variables). You can ignore the score versions for the rest of the assignment.
c. In GRADESCOPE, interpret the standard deviation of midterm2perc.
d. Now look at the detailed summary statistics for midterm2perc and answer the corresponding questions in GRADESCOPE.
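A short sketch of the summary commands in Step 9, using the percent variables created in Step 8 (sum is an abbreviation of summarize):

```stata
* Getting the mean, standard deviation, min and max
summarize midterm1perc midterm2perc

* Detailed summary statistics (percentiles, median, skewness, etc.)
summarize midterm2perc, detail
```

In the detailed output, the median is the row labeled 50% in the Percentiles column.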
10. Generate a histogram of the midterm2perc variable, using frequencies (instead of densities) as the Y-axis variable and make the “width” of the bins = 5. Be sure that your graph has an appropriate title (such as “Midterm 2 Grades EC204 Spring 2024”) and clean up the X-axis by changing it to “Midterm 2 Grades” (both without quotes). Be sure that the title and X-axis label end up in the code you copy/paste into your do-file (by modifying them in the dialogue box, NOT using the Graph Editor).
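The dialogue box in Step 10 should produce a command along the lines of the sketch below; the exact text Stata generates may differ slightly. The /// at the end of a line continues the command on the next line, and it works in a do-file but not in the Command window.

```stata
* Histogram of midterm 2 percent grades: frequencies on the Y-axis, bins of width 5
histogram midterm2perc, frequency width(5) ///
    title("Midterm 2 Grades EC204 Spring 2024") ///
    xtitle("Midterm 2 Grades")
```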
11. Use the display command to calculate two z-scores, one for an observation that earned a 90% on midterm 2 and one that earned a 65%.
a. Before calculating the z-score, round all pieces involved in the formula to 1 decimal place.
b. In GRADESCOPE, answer a few questions.
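One possible way to carry out Step 11 is sketched below. It pulls the mean and standard deviation from the stored results of summarize instead of typing them in by hand; the scalar names mean_mt2 and sd_mt2 are made up for this illustration. Typing the rounded numbers from your sum output directly into display works just as well.

```stata
* z-score = (value - mean) / standard deviation
quietly summarize midterm2perc

* Round the mean and standard deviation to 1 decimal place
* (mean_mt2 and sd_mt2 are hypothetical scalar names)
scalar mean_mt2 = round(r(mean), 0.1)
scalar sd_mt2   = round(r(sd), 0.1)

* z-scores for a 90% and a 65% on midterm 2
display (90 - mean_mt2) / sd_mt2
display (65 - mean_mt2) / sd_mt2
```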
12. Notice (from the reported minimum and/or the histogram) that there are students who earned a 0 on midterm 1 and students who earned a 0 on midterm 2. Remembering from class that “extreme values” can have large impacts on averages, let’s replace the values of 0 with missing values so that they are not included in our calculation. (In reality, no one earned 0 points. Rather, they missed the exam, so their score should probably be reported as missing for any real analyses.)
a. That is, replace the values of midterm1perc that equal 0 with a . (dot/period)
i. The symbol . (dot/period) in Stata represents a missing value for a numeric variable.
ii. Hint: This can be done in one line with a replace … if command.
iii. Hint #2: Don’t forget about the difference between = and ==
b. Do the same for midterm2perc
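A minimal sketch of the replace … if pattern described in Step 12. Note the single = for assignment and the double == for the comparison:

```stata
* Replace exam percentages of 0 with missing (.)
replace midterm1perc = . if midterm1perc == 0
replace midterm2perc = . if midterm2perc == 0

* Optional check: how many observations are now missing for each variable
count if missing(midterm1perc)
count if missing(midterm2perc)
```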
13. Now (after replacing the 0s with .) repeat Step #9b from above. In GRADESCOPE, answer a few questions. (Also, notice the difference in the averages compared to before. You might need to think about this for your final project!)
14. Now generate a new binary variable (called above_med_midterm2) that equals 1 for all students who earned above the median midterm 2 (percentage) grade and a 0 for all students who earned at or below the median.
NOTE: When you use an if statement in Stata that includes the greater-than inequality (“>”), all missing values also satisfy the condition, because Stata treats a missing value as larger than any number. This means that using:
gen above_med_midterm2 = 1 if midterm2perc>80
(for example only; 80 is not the correct median) would assign a 1 to all students who earned above an 80 on midterm 2 AND all the students for whom there was no midterm 2 grade (those that have a missing value). You can imagine why this might throw off any analyses you might do involving this new variable! Assigning such a student to the “above median midterm 2 grade” group would be the opposite of the truth! (See more details at the end of this document.) Here is how you fix it:
gen above_med_midterm2 = 1 if midterm2perc>80 & midterm2perc!=. /*the “.” is the missing value*/
As you may already know, != means “does NOT equal”, so the last piece of this statement says to assign a 1 for this new variable to any student who has a midterm 2 grade above 80 AND NOT equal to . (missing). Note that you could also achieve the same with:
gen above_med_midterm2 = 1 if midterm2perc>80 & midterm2perc < .
Don’t forget to make sure to replace the values of your new variable equal to 0 for the second group if your code doesn’t do so automatically!
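Putting Step 14 together, a sketch might look like the following. It reads the median from the stored result r(p50) that summarize, detail leaves behind instead of typing a number; the scalar name med_mt2 is made up for this illustration.

```stata
* Store the median of midterm 2 percent grades (med_mt2 is a hypothetical name)
quietly summarize midterm2perc, detail
scalar med_mt2 = r(p50)

* 1 for students above the median, excluding missing values
gen above_med_midterm2 = 1 if midterm2perc > med_mt2 & midterm2perc < .

* 0 for students at or below the median; missing grades stay missing
* because a missing value is never <= a number
replace above_med_midterm2 = 0 if midterm2perc <= med_mt2

* Check the group sizes, including any missing values
tabulate above_med_midterm2, missing
```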
15. Use the sum command to see the average midterm 1 (%) grade ONLY for students who have a value of 1 for above_med_midterm2. Then use the sum command to see the average midterm 1 (%) grade for the other group. In GRADESCOPE, report these means.
a. Note: There should be only 45 observations in each group if you did the task above correctly.
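Step 15 can be sketched with an if qualifier on the sum command, one line per group:

```stata
* Average midterm 1 (%) grade for students above the midterm 2 median
summarize midterm1perc if above_med_midterm2 == 1

* Average midterm 1 (%) grade for students at or below the midterm 2 median
summarize midterm1perc if above_med_midterm2 == 0
```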
16. Make a scatterplot with midterm2perc as the Y variable and midterm1perc as the X variable, and add a line of best fit (under “fit plots” and “linear prediction”).
a. Take a look at what it looks like in its most raw version. Is it ready to be presented to an audience, or could we clean it up and make it more presentable? You guessed it! Let’s clean it up…
17. Add “Midterm 2 Grades” to the Y-axis using the dialogue box. Again, make sure you have a clear and concise title. (Use “Relationship between Midterm 1 and Midterm 2 Grades”, without quotes.) Improve the x-axis by labeling it “Midterm 1 Grades” and hide the legend.
a. Be sure the command from Part #16 and Part #17 both end up in your do-file! Run the commands one at a time to notice how much better your graph looks!
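For reference, the commands from Part #16 and Part #17 might look roughly like the sketch below. The dialogue box may generate slightly different text, so treat this as an illustration only.

```stata
* Part 16: raw scatterplot with a line of best fit
twoway (scatter midterm2perc midterm1perc) (lfit midterm2perc midterm1perc)

* Part 17: same graph with title, axis labels and no legend
twoway (scatter midterm2perc midterm1perc) (lfit midterm2perc midterm1perc), ///
    title("Relationship between Midterm 1 and Midterm 2 Grades") ///
    ytitle("Midterm 2 Grades") xtitle("Midterm 1 Grades") ///
    legend(off)
```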
18. Use the corr command to find the correlation between midterm1perc and midterm2perc. In GRADESCOPE, report the correlation coefficient.
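A one-line sketch of this step (corr is an abbreviation of correlate):

```stata
* Correlation coefficient between the two percent variables
correlate midterm1perc midterm2perc
```

Only observations with non-missing values for both variables are used in the calculation.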
19. Save your data. Save the revised dataset once again using the same name as before and include the appropriate code in your do file.
20. Now triple-check that your do-file has the commands from every step listed in this assignment.
21. Rerun your do-file from beginning to end and confirm it runs without error. (Be sure to check the results window!)
22. Start a log file using the drop-down menu. (File→Log→Begin)
a. Name the file firstname_lastname_log1 and CHOOSE A .LOG FILE TYPE (NOT A .SMCL)
b. Add the log using line that Stata creates to the top of your do file UNDERNEATH THE THREE LINES OF CODE THAT ended with capture log close.
c. Add , replace at the end of the log using line so that Stata knows you’re willing to overwrite any version that currently exists if you rerun your do file at a later time.
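After Step 22, the top of the do-file might look like the sketch below. The path is omitted here, so this assumes the log should be written to the current working directory. The closing log close line is a suggestion and is not required by the steps above.

```stata
clear all
set more off
capture log close

* Using the .log extension makes Stata write a plain-text log instead of .smcl
log using "firstname_lastname_log1.log", replace

* ... all of the assignment commands go here ...

* Suggested last line of the do-file: close the log when everything has run
log close
```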
23. Your do file should now have EVERY successful command that you ran, plus comments acting as a “roadmap” of what you are doing at various stages. Run your do file one last time from beginning to end and make sure there are no errors. The results window should show “end of do-file” with no red errors if it runs successfully. Your log should also be completely clear of any errors. (Your log using line will need to have , replace at the end of it to overwrite any previous versions of your log file. BE SURE TO CHECK YOUR LOG FILE FOR COMPLETION AND ACCURACY PRIOR TO SUBMITTING IT!)
• Congratulations! You’re done with Assignment 1! You should have familiarized yourself with various commands within Stata, how to load data into Stata, how to browse your data to better understand it, how to generate new variables, and how to visualize some of your variables using histograms and scatterplots. Well done!
***Submit your do-file & log file on Gradescope before Monday, Nov. 4th at 11:59pm. No exceptions!
“Late” begins at 12:05am ET (Boston time) on Tuesday, Nov. 5th and you lose 10% per day after this deadline.
*****BE SURE YOU SUBMIT THE CORRECT FILES IN GRADESCOPE!! (Don’t accidentally submit your data set, or confuse the files and submit them into the wrong “dropbox”!)