Python for Data Science Course

A 7-week, 2-credit hour course focused on using Python for data science. Topics include data wrangling, interaction with data sources, visualization, running scripts, the Python ecosystem, functions, and modeling.

If you are looking for information on the associated bootcamp, click here.

Lecture Slides | Labs | Lab Solutions

Page Contents

Objectives | Pre-work | Agenda | Grading | Final Project

Primary Objectives

  1. Expose students to the Python data science ecosystem’s libraries, capabilities, and vocabulary.
  2. Build students’ proficiency in the core data wrangling skills: importing data, reshaping data, transforming data, and exporting data (a brief pandas sketch of these steps follows this list).
  3. Develop students’ ability to use Python within both interactive (Jupyter, REPL) and non-interactive (scripts) environments.
  4. Explore various methods of producing output in Python: plotting, exporting various data formats, converting notebooks to static files as deliverables, and writing to a SQL database.
  5. Expose students to modeling via scikit-learn and discuss the fundamentals of building models in Python.
  6. Teach students how and when to teach themselves, through a discussion of widely-available Python resources.
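
To make the wrangling skills in objective 2 concrete, here is a minimal pandas sketch of those four steps; the file and column names are hypothetical and only meant to illustrate the shape of the workflow.

    import pandas as pd

    # Import: read a (hypothetical) CSV of monthly sales data
    sales = pd.read_csv("sales.csv")  # columns: store, month, revenue

    # Transform: add a derived column
    sales["revenue_thousands"] = sales["revenue"] / 1000

    # Reshape: pivot to one row per store and one column per month
    wide = sales.pivot(index="store", columns="month", values="revenue_thousands")

    # Export: write the reshaped data back out
    wide.to_csv("sales_wide.csv")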

Pre-work

I try to limit pre-work as much as possible, but having Python, Jupyter, and the relevant packages installed is an unavoidable necessity. Below are instructions to do so via Anaconda (a popular Python distribution):

  1. Visit the Anaconda download page.
  2. Click the “Download” button for Python 3.7 - this will begin to download the Anaconda installer.
  3. Open the installer when the download completes, and then follow the prompts. If you are prompted about installing PyCharm, elect not to do so.
  4. Once installed, open the Anaconda Navigator and launch a Jupyter Notebook to ensure it works.
  5. Follow the package installation instructions to ensure the following packages are installed (a quick import check, shown after this list, can verify the result):
  • pandas
  • seaborn
  • altair
  • scikit-learn
  • requests
  • tensorflow
  • keras
  • gensim
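
If you want to confirm the installation worked, a quick check like the sketch below (run in a Jupyter cell or the Python interpreter) attempts to import each package from the list above and reports any that are missing. Note that scikit-learn is imported under the name sklearn.

    import importlib

    # Import names for the packages listed above (scikit-learn imports as "sklearn")
    packages = ["pandas", "seaborn", "altair", "sklearn", "requests",
                "tensorflow", "keras", "gensim"]

    missing = []
    for name in packages:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)

    print("Missing:", missing if missing else "none -- all packages installed")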

I would also suggest, but not mandate, reading the preface and Chapter 1 of the Python Data Science Handbook for an overview of data science and to gain familiarity with the Python environment and Jupyter notebooks.

Course Agenda

Class sessions will be structured as 110 minutes of lecture, a 10-minute break, and 110 minutes of lab. Additional breaks will be given if time permits.

Session 1: Refresher on Python and DataFrames; Importing and Exporting; Joining
Session 2: Abbreviated Due to Switching to Remote – Introduction of Final Project
  • Supplemental Reading: None
  • Lab: None
  • Project Work: Look over the project rubric. Decide on a topic/dataset. Choose a partner if you want one.
  • Due: Last week’s lab.
Session 3: Grouping and Reshaping; Control Flow; Functions
Session 4: Applying Functions to Data; Version Control with Git and GitHub; Visualization with Matplotlib, Seaborn, and Altair
Session 5: Classical Modeling and Feature Engineering; Deep Learning and NLP Tools (High-level)
Session 6: Topics I Wish We Had More Time for

Conda, Python Package Ecosystem, Spark, Requests & APIs, Docker, Cloud Computing

  • Supplemental Reading: None
  • Lab: None – work on your project
  • Project Work: Finish your project!
  • Due: Last week’s lab.
Session 7: Final Exam; Student-suggested Topics

No one will pay attention if I teach before the final, so we’ll reserve the first two hours of class for the test and then the remaining time for Q&A and fun topics.

  • Due: Final Project (due the Sunday before class)

Course Grading

  • 30% Assignments – completion-based, 5% each
  • 15% Final Exam
  • 5% Course Project Check-in – have code in a GitHub repo that imports your data, due at the beginning of session 5
  • 50% Course Project – see below

Course Project

The final project requires students to apply the data science skills covered in the course (and optionally, beyond what’s covered) to real datasets.

A total of 55% of the course grade is related to the final project (the check-in plus the final submission). The project must include data wrangling and the creation of a deliverable that explains the findings. Students may complete the project alone or in teams of two; however, teams of two will need to earn more points to score the same grade as an individual (more details in the rubric).

Data

Teams must use at least two datasets in their project. The data used is at each team’s discretion, but I highly recommend selecting data from a domain in which your team has interest and/or expertise, as domain knowledge is an important aspect of using data science in industry – and it will make the project more enjoyable.

Some suggested datasets:

  • World economic data, such as this.
  • Something from the FiveThirtyEight website data or GitHub repo.
  • Anything else you can find on the internet, within reasonable ethical boundaries. It’s straightforward to find CSVs of baby names over time or an API to fetch NBA data. You could even scrape a website using Python (see packages like Scrapy and Beautiful Soup). Acquiring the data via an API or scraping will net you extra points, as those are beyond the scope of what we discussed in this course. (A minimal sketch of both approaches follows this list.)
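
For a rough sense of what the API and scraping approaches look like, here is a minimal sketch using requests and Beautiful Soup; the URLs are placeholders and the response structure will depend on whichever source you actually use.

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical API call: many public APIs return JSON you can load straight into pandas
    resp = requests.get("https://example.com/api/players")  # placeholder URL
    if resp.ok:
        records = resp.json()  # assumes the endpoint returns a JSON list of records

    # Hypothetical scrape: grab the text of every table row on a page
    page = requests.get("https://example.com/stats")  # placeholder URL
    soup = BeautifulSoup(page.text, "html.parser")
    rows = [tr.get_text(strip=True) for tr in soup.find_all("tr")]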

Grading Details

The first project-related grade is the Check-in, due at the beginning of Session 5. You must show that you have a GitHub repo with code in it that imports your data and does some data wrangling – joining, filtering, cleaning, etc. This is worth 5% of your course grade, so take advantage of the easy points.
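
For a rough sense of scale, a check-in submission might contain something like the sketch below; the file and column names are hypothetical, and the wrangling steps will depend on your data.

    import pandas as pd

    # Hypothetical datasets -- replace with your own
    games = pd.read_csv("games.csv")   # one row per game
    teams = pd.read_csv("teams.csv")   # one row per team

    # Joining: attach team metadata to each game
    merged = games.merge(teams, on="team_id", how="left")

    # Filtering and cleaning: keep recent seasons and drop rows with missing scores
    recent = merged[merged["season"] >= 2015].dropna(subset=["score"])

    print(recent.head())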

The other 50 percentage points come from your final submission, which is graded via the rubric. The rubric allows for a great deal of flexibility on your side. Points are earned by completing tasks from the list, with different tasks worth different point values based primarily on two things: the difficulty of the task and how integral it is to the data science workflow. Larger teams are expected to earn more points for an equivalent grade.