Python for Data Science Course


A 7-week, 2-credit-hour course focused on using Python for data science. Topics include data wrangling, interaction with data sources, visualization, running scripts, the Python ecosystem, functions, and modeling.

If you are looking for information on the associated bootcamp, click here.

Lecture Slides | Labs | Lab Solutions

Page Contents

Objectives
Pre-work
Agenda
Grading
Final Project

Primary Objectives

  1. Expose students to the Python data science ecosystem’s libraries, capabilities, and vocabulary.
  2. Build students’ proficiency in the core data wrangling skills: importing data, reshaping data, transforming data, and exporting data.
  3. Develop students’ ability to use Python within both interactive (Jupyter, REPL) and non-interactive (scripts) environments.
  4. Explore various methods of producing output in Python: plotting, exporting various data formats, converting notebooks to static files as deliverables, and writing to a SQL database.
  5. Expose students to modeling via scikit-learn and discuss the fundamentals of building models in Python.
  6. Teach students how and when to teach themselves, through a discussion of widely available Python resources.

Pre-work

I try to limit pre-work as much as possible, but having Python, Jupyter, and the relevant packages installed is an unavoidable necessity. The steps below install all three via Anaconda (a popular Python distribution):

  1. Visit the Anaconda download page
  2. Select your operating system
  3. Click the “Download” button for Python 3.7; this starts the download of the Anaconda installer
  4. Open the installer when the download completes, and then follow the prompts. If you are prompted about installing PyCharm, elect not to do so.
  5. Once installed, open the Anaconda Navigator and launch a Jupyter Notebook to ensure it works.
  6. Follow the package installation instructions to ensure the pandas and seaborn packages are installed.
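
One way to confirm the installation is to run a short check in a Jupyter notebook cell or as a script. The following is a minimal sketch, not part of the official course materials: the helper name `missing_packages` and the `REQUIRED` tuple are illustrative assumptions, and the check relies only on the standard library's `importlib.util.find_spec`.

```python
import importlib.util
import sys

# Packages named in the pre-work steps; extend this tuple as needed (assumption).
REQUIRED = ("pandas", "seaborn")


def missing_packages(names=REQUIRED):
    """Return the package names that cannot be found in the current environment."""
    # find_spec returns None when a top-level package is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    missing = missing_packages()
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

If any package is reported as not installed, revisit the package installation instructions before the first class session.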

I also suggest, but do not require, reading the preface and Chapter 1 of the Python Data Science Handbook for an overview of data science and to gain familiarity with the Python environment and Jupyter notebooks.

Course Agenda

Class sessions will be structured as 110 minutes of lecture, a 10-minute break, and 110 minutes of lab. Additional breaks will be given if time permits.

Course Grading

Course Project

The final project accounts for 50% of the course grade. In it, students apply the data science skills covered in the course (and, optionally, skills beyond what’s covered) to real datasets. The project must include data wrangling and a deliverable that explains the group’s findings. Students may complete the project alone or in teams of up to four; however, larger groups will be held to a higher standard.

Data

Teams must use at least two datasets (and more is better!) in their project. The choice of data is at each team’s discretion, but I highly recommend selecting data from a domain in which your team has interest and/or expertise. Domain knowledge is an important aspect of using data science in industry, and it will make the project more enjoyable.

Some suggested datasets:

Rubric

The project grading system gives students a great deal of flexibility. Teams earn points by completing tasks from a list. Each task's point value depends primarily on two things: how difficult the task is and how integral it is to the data science workflow. Larger teams are expected to earn more points for an equivalent grade.

See the course rubric page for complete details of how projects will be graded.