DS-210 Programming for Data Science

Homework 1


Before you start...

Deadline: Assignments are due on Wednesday nights at 11pm. You can submit up to 48 hours late with a 10% penalty.

Collaboration policy: You may verbally collaborate on required homework problems. However, you must write your solutions independently without showing them to other students. If you choose to collaborate on a problem, you are allowed to discuss it with at most 2 other students currently enrolled in the class.

The header of each assignment you submit must include the field “Collaborators:” with the names of the students with whom you have had discussions concerning your solutions. If you didn’t collaborate with anyone, write “Collaborators: none.” Failure to list collaborators may result in a credit deduction.

You may use external resources such as software documentation, textbooks, lecture notes, and videos to supplement your general understanding of the course topics. You may use references such as books and online resources for well-known facts. However, you must always cite the source.

You may not look up answers to a homework assignment in the published literature or on the web. You may not share written work with anyone else.

ChatGPT/Generative AI policy: You may use ChatGPT/Generative AI as a resource to help you complete the assignment. However, it must be used constructively to help you understand things you are unsure of, and you must build upon it with original code. You must cite your interaction by providing a screenshot of your prompt and the corresponding response. In addition, you must explain all code from the AI that you use in your assignment: describe how the code works and how it helped you. Failure to do so could result in a credit deduction. The official GAIA Policy can be found here: https://www.bu.edu/cds-faculty/culture-community/gaia-policy/

Submitting: Solutions should be submitted via Gradescope. More details will be provided on Piazza. Please submit a solution to this homework as a single IPython notebook (.ipynb).

Grading: Whenever we ask for a solution, you may receive partial credit if your solution is not sufficiently efficient or close to optimal. For instance, if we ask you to solve a specific problem that has a polynomial-time algorithm that is easy to implement, but the solution you provide is exponentially slower, you are likely to receive partial credit. You may also lose credit for code that is unorganized, difficult to read, and/or redundant.

Explaining: Always explain your work clearly in your writeup and with comments/markdown, even if it is correct. A good explanation can help you get points back if there are mistakes in your code. A missing or bad explanation can result in points being deducted.

Questions


1. (10 points) Read the Markdown guide at

https://medium.com/analytics-vidhya/the-ultimate-markdown-guide-forjupyter-notebook-d5e5abf728fd

Create a Markdown cell that roughly looks like the content of the following box (that is, do not include the box):


Title

Section 1: Different fonts

Regular. Bold. Italic.

Section 2: Enumeration

• First bullet
• Second bullet
   1. A
   2. B
• Third bullet
   ◦ Sub-bullet
   ◦ Sub-bullet

Section 3: Code

This is inline code: [x*x for x in X], and this is a block of code (note the syntax highlighting!):

# comment
def foo(x, y, z):
    return x + 10 * y + 100 * z

2. (30 points) Execute a simple data pipeline that involves:

• Basic data validation (i.e., make sure no relevant attributes are missing) and, if needed, data cleansing (changing data types, removing entries with empty properties, etc.).

• Partitioning the data set into a training and test set.

• Selection of the set of features that will be used in the learning process.

• Training a decision tree.

• Estimation of the quality of predictions by the final decision tree.
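The first two steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not a required approach: the toy records, the attribute names other than G3, and the helper functions `clean` and `split` are all hypothetical, and in a notebook you would more likely load a real file and use a data-frame library.

```python
import random

# Hypothetical toy records; None or "" marks a missing value.
records = [
    {"age": "15", "absences": "4", "G3": "12"},
    {"age": "16", "absences": None, "G3": "10"},
    {"age": "17", "absences": "0", "G3": "15"},
    {"age": "15", "absences": "2", "G3": ""},
    {"age": "18", "absences": "6", "G3": "8"},
    {"age": "16", "absences": "1", "G3": "14"},
]
required = ["age", "absences", "G3"]


def clean(rows, columns):
    """Drop rows with a missing required value and cast the rest to int."""
    cleaned = []
    for row in rows:
        if any(row.get(col) in (None, "") for col in columns):
            continue  # validation failed: a relevant attribute is missing
        cleaned.append({col: int(row[col]) for col in columns})
    return cleaned


def split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and cut it into (train, test) lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


valid_rows = clean(records, required)
train_rows, test_rows = split(valid_rows)
print(len(records), len(valid_rows), len(train_rows), len(test_rows))
```

Fixing the random seed keeps the partition identical across runs, which makes later comparisons between tree sizes easier to interpret.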

Execute this pipeline for different target decision tree sizes and different sizes of the set of features used for learning and prediction. For the former, you can try various numbers of nodes that are multiples of 5. For the latter, you can select the 3, 6, 9, etc. features that you believe should be most important for what you are trying to predict. You may find a nested for-loop helpful for avoiding duplicated code.
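One possible shape for the nested loop is sketched below. It assumes scikit-learn is available and uses synthetic data from `make_regression` as a stand-in for a cleaned data set. Note two further assumptions: scikit-learn limits tree size through `max_leaf_nodes` (leaves rather than total nodes), and "the first k columns" stands in for your own choice of the most important features.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; replace with your cleaned data set).
X, y = make_regression(n_samples=300, n_features=9, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

results = []  # one (n_leaves, n_features, test_mse) tuple per configuration
for n_leaves in (5, 10, 15, 20):
    for n_features in (3, 6, 9):
        # Column slicing is a placeholder for a deliberate feature selection.
        tree = DecisionTreeRegressor(max_leaf_nodes=n_leaves, random_state=0)
        tree.fit(X_train[:, :n_features], y_train)
        predictions = tree.predict(X_test[:, :n_features])
        mse = mean_squared_error(y_test, predictions)
        results.append((n_leaves, n_features, mse))

for n_leaves, n_features, mse in results:
    print(f"leaves={n_leaves:2d} features={n_features} test MSE={mse:.1f}")
```

Collecting the scores in a single list keeps the training code in one place and gives the plotting step one structure to read from.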

Compare the outcomes of your decision trees and plot a graph that displays the prediction accuracy vs. the number of nodes and features used. You may want to use Seaborn or Matplotlib, depending on the graph you want to display. In general, Seaborn builds on Matplotlib and offers a higher-level interface for statistical plots, while Matplotlib gives lower-level control and also supports spatial 3D axes. It may be familiar to those of you coming from DS 110. For this assignment, we recommend using a 3D graph (i.e., different-colored points on a 2D graph or a spatial 3D graph) to display your data more compactly.
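As one illustration of the "colored points on a 2D graph" option, the sketch below uses Matplotlib's `scatter` with a color bar. The scores come from an arbitrary placeholder formula and are not real results; the variable names are hypothetical, and you would substitute the errors measured by your own pipeline.

```python
import matplotlib.pyplot as plt

node_counts = [5, 10, 15, 20]
feature_counts = [3, 6, 9]

# Placeholder scores from an arbitrary formula, NOT real results;
# substitute the test errors measured by your own pipeline.
points = [(n, f, 20.0 / n + 1.0 / f) for n in node_counts for f in feature_counts]
xs, ys, scores = zip(*points)

fig, ax = plt.subplots()
scatter = ax.scatter(xs, ys, c=scores, cmap="viridis", s=120)
ax.set_xlabel("Number of nodes")
ax.set_ylabel("Number of features")
ax.set_title("Test error by tree size and feature count")
fig.colorbar(scatter, ax=ax, label="Mean squared error")
```

In a Jupyter notebook the figure is normally rendered when the cell finishes, so no explicit display call is included here.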

Your code should be in separate code blocks which each output useful information.

Suggested data set:

https://archive.ics.uci.edu/ml/datasets/Student+Performance

Feel free to use a different data set if you find it more interesting for personal reasons, but if you do so, explain why you made this choice. Otherwise, if you use the suggested data set, predict attribute G3 and do not use G1 and G2. Additionally, this data set has grades for two subjects (Mathematics and Portuguese). Select just one of them.

Since the goal in this data set is to predict a numerical value, measure your accuracy as the expected square of the difference between your prediction and the actual value on the test set (the mean squared error), or with another similar quality measure.
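For reference, the squared-difference measure described above is the mean squared error. A minimal pure-Python sketch follows; the function name and the example values are made up for illustration, and library implementations of the same measure exist.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up values for illustration only.
actual = [10, 12, 8, 15]
predicted = [11, 12, 6, 14]
print(mean_squared_error(actual, predicted))  # (1 + 0 + 4 + 1) / 4 = 1.5
```

Lower values mean better predictions, and because the errors are squared, a few large misses weigh more heavily than many small ones.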

Summary: Write a short summary of what you learned. How did the accuracy depend on the size of a decision tree? How did the accuracy depend on the number of features you selected? Did you learn anything interesting about applying decision trees for predictive data analysis? Did you learn anything interesting about the data set? (This must be done in Markdown.)

Note: Please briefly explain all your design choices and what you do in the notebook whenever it is not obvious. Please use Markdown, which you learned in Question 1, whenever you create a Jupyter notebook in this and later homeworks.

3. (Optional, no credit) How much time did you spend on this homework? The answer will have no impact on the credit you receive, but it may help us adjust the difficulty of future homework assignments.
