DSCI 310: Reproducible and trustworthy workflows for data science.

Course description

Data science methods to automate the running and testing of code and analytic reports, manage data analysis software dependencies, package and deploy software for data analysis, and collaborate with others using version control.

Pre-reqs: DSCI 100 and either (a) one of CPSC 203, CPSC 210, CPEN 221 or (b) one of MATH 210, ECON 323 and one of CPSC 107, CPSC 110.

See the Faculty of Science Credit Exclusion Lists: www.calendar.ubc.ca/vancouver/index.cfm?tree=12,215,410,414

Long version: Data science skills and tools are increasingly in demand across a large variety of disciplines. DSCI 310 aims to further students’ existing data science knowledge with reproducible and trustworthy workflows for creating and deploying data analyses, reports, and software. Particular focus will be placed on teaching the skills and tools currently used in academic research and industry settings.

Without deliberate and conscious effort towards project organization, tool choice, and workflows, complex and large data science projects can quickly grow out of hand and become irreproducible and untrustworthy. This course will focus on reproducible and trustworthy workflows for writing computer scripts, analytic reports and data analysis pipelines, as well as packaging, automated testing and deployment of software written for data analysis. An emphasis is also placed on how to collaborate effectively with others using version control tools, such as Git and GitHub. Such workflows act to mitigate chaos and maximize transparency, reproducibility, and productivity.

While the course will be based on the use of the two leading languages in data science, Python and R, and related current tools (conda, Docker, Git, GitHub, Jupyter, etc.), the concepts and skills taught in the course aim to be discipline and tool agnostic, focussing on the importance of reproducible and trustworthy workflows for data analysis and the implications of failing to implement these when performing a data analysis.

Students who have completed this course will be able to complete complex data analysis projects with minimal technical debt, meaning that others can transparently follow how the analysis was done, reproduce the analysis for themselves if desired, and easily pick up and extend the analysis in new areas. Strategies for collaboration on data science projects will also be emphasized.

Textbook

We will be using a collection of resources available online. These include:

Hardware & software

Students are required to bring a laptop to both lectures and tutorials. Students who do not own a laptop, Chromebook, or tablet may be able to borrow a laptop from the UBC library.

Course-level learning outcomes

By the end of the course, students will be able to:

  • Defend and justify the importance of creating data science workflows that are reproducible and trustworthy, and of the elements that go into such a workflow (e.g., writing clear, robust, accurate and reproducible code, managing and sharing compute environments, defining collaboration strategies, etc.).
  • Constructively criticize the workflows and data analyses of others with regard to their reproducibility and trustworthiness.
  • Develop a data science project (including code and non-code documents such as reports) that uses reproducible and trustworthy workflows.
  • Demonstrate how to effectively share and collaborate on data science projects and software by creating robust code packages, using reproducible compute environments, and leveraging collaborative development tools.
  • Defend and justify the benefit of, and employ, automated testing regimes, continuous integration and continuous deployment for managing and maintaining data science projects and packages.
  • Demonstrate strong communication, teamwork, and collaborative skills by working on a significant data science project with peers throughout the course.

Teaching team

Note that your TAs are students too; they may have class right before their office hours, and they may run a few minutes late. Please be patient!

Position | Name | Email | Office Hours | Office Location
TA | Tony Liang | - | Monday 3:30 PM - 4:30 PM | SCRF 200
Instructor | Daniel Chen | daniel.chen[-at-]stat.ubc.ca | Tuesday 8:00 AM - 12:30 PM | ORCH 4074
TA | Amy Kong | - | Tuesday 10:30 AM - 11:30 AM | ORCH 4074
TA | Zizhen Guo | - | Thursday 1:00 PM - 2:00 PM | ORCH 4062

Assessment

Course breakdown

Deliverable | Grade | Learning objectives addressed
Individual assignments | 5% | 1, 2, 4, 5
Project milestone 1 | 10% | 3, 6
Project milestone 2 | 10% | 3, 4, 6
Project milestone 3 | 10% | 3, 4, 5, 6
Final project | 20% | 3, 4, 5, 6
Peer review | 4.5% | 2
Teamwork | 10% | 6
GitHub username quiz | 0.5% | NA
Mid-term exam | 10% | 1, 2, part of 4
Final exam | 20% | 1, 2, 4, 5
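
As a rough illustration of how the weights above combine, the following Python sketch computes a weighted course grade. The function name `course_grade`, the dictionary keys, and the assumption that every deliverable is scored from 0 to 100 are illustrative only and are not part of the official grading procedure.

```python
from typing import Mapping

# Weights (in percent) copied from the course breakdown table.
WEIGHTS = {
    "Individual assignments": 5.0,
    "Project milestone 1": 10.0,
    "Project milestone 2": 10.0,
    "Project milestone 3": 10.0,
    "Final project": 20.0,
    "Peer review": 4.5,
    "Teamwork": 10.0,
    "GitHub username quiz": 0.5,
    "Mid-term exam": 10.0,
    "Final exam": 20.0,
}


def course_grade(scores: Mapping[str, float]) -> float:
    """Combine per-deliverable scores (assumed to be 0-100) into one grade.

    Hypothetical helper: the syllabus only lists the weights, so the
    plain weighted average used here is an assumption.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100.0
```

The weights sum to 100, so a student who scores the same mark on every deliverable ends up with that mark as the course grade under this sketch.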

Schedule at a glance

Week Date Topic Reading Assessments due Notes
1 2023/01/10 How do reproducible and trustworthy workflows impact data analysis? Start working on your installation instructions
2 2023/01/17 Version control for transparency and collaboration Collaboration with version control Individual assignment 1 due & GitHub username quiz
3 2023/01/24 Integrated development environments, filenames and data science project organization Individual assignment 2 and Version Control “quiz” Team assignment for group projects & drafting of team work contract
4 2023/01/31 Managing dependencies using virtual environments
5 2023/02/07 Managing dependencies using containerization Individual assignment 3 due
6 2023/02/14 Introduction to testing code for data science Milestone 1 due
7 2023/02/21 Reading Break
8 2023/02/28 Non-interactive scripts and data analysis pipelines Mid-term exam
9 2023/03/07 Reproducible reports Milestone 2 due
10 2023/03/14 Advanced version control workflows Individual assignment 4 due
11 2023/03/21 Packaging and documenting code Milestone 3 due
12 2023/03/28 Automated testing and continuous integration Individual assignment 5 & Peer review due
13 2023/04/04 Deploying and publishing packages, copyright and licenses Final project & Team work reflection due

Assessment schedule

In general, assignments will be due at 11:59 PM on Saturdays.

Assessment | Description | Due date | Due week
Individual assignment 0 | Computer setup | 2023/01/12 | 1
Individual assignment 1 | Setting up your computer | 2023/01/21 23:59 | 2
GitHub username quiz | - | 2023/01/21 23:59 | 2
Individual assignment 2 | Version control practice (merge conflict). Worth 0.5%; the 1% is split with the version control quiz | 2023/01/28 23:59 | 3
Version control quiz | Not a real quiz; it is a set of questions you answer via the Canvas quiz section. Worth 0.5%; the 1% is split with individual assignment 2 | 2023/01/28 23:59 | 3
Individual assignment 3 | Dockerfile practice | 2023/02/11 23:59 | 5
Milestone 1 | Question, data & rough draft of analysis in one monolithic literate code document, reproducible environment (full.ipynb, Dockerfile, docker-compose.yml) | 2023/02/18 23:59 | 6
Mid-term exam | - | 2023/03/03 09:00 | 8
Milestone 2 | Functions abstracted to a file/module & tests (reduced.ipynb, .R & test_*.R, function documentation) | 2023/03/11 23:59 | 9
Individual assignment 4 | Reproducible reports practice | 2023/03/18 23:59 | 10
Milestone 3 | Literate code document broken into scripts and a report & data analysis pipeline to stitch everything together (.R files & Make pipeline, bookdown or rticle report) | 2023/03/25 23:59 | 11
Peer review | Review of another group’s project | 2023/04/01 23:59 | 12
Individual assignment 5 | Packaging practice | 2023/04/01 23:59 | 12
Final project | Package & CI (the full monty package, including docs) | 2023/04/08 23:59 | 13
Team work | Reflection on how the group worked together, as well as individual performance | 2023/04/09 11:59 | 13
Final exam | - | TBD | -

Policies

Late registration

Students who register for the class late have 1 week from their registration date on Canvas to complete all prior assignments.

Late assignments / mid-term exam absence

Students must be present at the invigilation venue (in class, on Zoom, examination centre, etc.) to take the mid-term exam; otherwise they will be considered to have missed the mid-term exam and will be assigned a grade of zero.

Students who will miss the mid-term exam must provide a self-declaration prior to the mid-term exam and make arrangements (e.g., schedule an oral make-up mid-term exam) with the Instructor. Failing to present a declaration within a reasonable timeframe before the mid-term exam will result in a grade of zero.

A late submission is defined as any work submitted after the deadline. For a late submission, the student will receive a 75% scaling of their grade for the first occurrence, 50% scaling of their grade for the second occurrence, and will receive a grade of 0 for subsequent occurrences.
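
The late-submission scaling above can be sketched in Python as follows. This is only an illustration: the function name `scaled_grade` and the interpretation that "75% scaling" means multiplying the earned grade by 0.75 are assumptions, not an official implementation of the policy.

```python
# Multipliers for the first and second late submission, as described above.
# Any later occurrence receives a grade of 0.
LATE_MULTIPLIERS = (0.75, 0.50)


def scaled_grade(raw_grade: float, late_occurrence: int) -> float:
    """Return the grade after applying the late-submission scaling.

    late_occurrence is 0 for an on-time submission, 1 for a student's
    first late submission, 2 for the second, and so on.
    (Hypothetical helper; assumes scaling means simple multiplication.)
    """
    if late_occurrence < 0:
        raise ValueError("late_occurrence must be non-negative")
    if late_occurrence == 0:
        return raw_grade
    if late_occurrence <= len(LATE_MULTIPLIERS):
        return raw_grade * LATE_MULTIPLIERS[late_occurrence - 1]
    return 0.0
```

Under these assumptions, a submission that earned 80 would count as 60 if it is the student's first late submission, 40 if it is the second, and 0 for any later one.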

Students who miss an assignment or quiz can request an academic concession. From the UBC Senate policy on academic concession, grounds for academic concession can be illness, conflicting responsibilities, or compassionate grounds. Examples of compassionate grounds, from the above policy, include “a traumatic event experienced by the student, a family member, or a close friend; an act of sexual assault or other sexual misconduct experienced by the student, a family member, or a close friend; a death in the family or of a close friend.”

To request an academic concession, students should immediately email a completed and signed academic concession form to the course Instructor. Upon receiving the form, the Instructor will make a decision about how to proceed. Failure to present valid documentation may result in a failing grade.

Re-grading

If you have concerns about the way your work was graded, please contact the TA who graded it within one week of having the grade returned to you through Slack. After this one-week window, we may deny your request for re-evaluation. Also, please keep in mind that your grade may go up or down as a result of re-grading.

Missed final exam

Students who miss the final exam must report to their faculty advising office within 72 hours of the missed exam, and must supply supporting documentation. Only your faculty advising office can grant deferred standing in a course. You must also notify your instructor prior to (if possible) or immediately after the exam. Your instructor will let you know when you are expected to write your deferred exam. Deferred exams will ONLY be provided to students who have applied for and received deferred standing from their faculty.

Academic concession policy

Please see UBC’s concession policy for detailed information on dealing with missed coursework, quizzes, and exams under circumstances of an acute and unanticipated nature.

Academic integrity

The academic enterprise is founded on honesty, civility, and integrity. As members of this enterprise, all students are expected to know, understand, and follow the codes of conduct regarding academic integrity. At the most basic level, this means submitting only original work done by you and acknowledging all sources of information or ideas and attributing them to others as required. This also means you should not cheat, copy, or mislead others about what is your work. Violations of academic integrity (i.e., misconduct) lead to the breakdown of the academic enterprise, and therefore serious consequences arise and harsh sanctions are imposed. For example, incidences of plagiarism or cheating may result in a mark of zero on the assignment or exam and more serious consequences may apply if the matter is referred to the President’s Advisory Committee on Student Discipline. Careful records are kept in order to monitor and prevent recurrences.

A more detailed description of academic integrity, including the University’s policies and procedures, may be found in the Academic Calendar at http://calendar.ubc.ca/vancouver/index.cfm?tree=3,54,111,0.

Plagiarism

Students must correctly cite any code or text that has been authored by someone else or by the student themselves for other assignments. Cases of plagiarism may include, but are not limited to:

  • the reproduction (copying and pasting) of code or text with no or minimal reformatting (e.g., changing the name of the variables)
  • the translation of an algorithm or a script from one language to another
  • the generation of code by automatic code-generation software

An “adequate acknowledgement” requires a detailed identification of the (parts of the) code or text reused and a full citation of the original source code that has been reused.

The above attribution policy applies only to assignments. No code or text may be copied (with or without attribution) from any source during a quiz or exam. Answers must always be in your own words. At a minimum, copying will result in a grade of 0 for the related question.

Repeated plagiarism of any form could result in larger penalties, including failure of the course.

Attribution

Parts of this syllabus (particularly the policies) have been copied and derived from the UBC MDS Policies.
