UC Berkeley Lecture Notes. Data 8: The Foundations of Data Science

The UC Berkeley Foundations of Data Science course combines three perspectives: inferential thinking, computational thinking, and real-world relevance. Given data arising from some real-world phenomenon, how does one analyze that data so as to understand that phenomenon? The course teaches critical concepts and skills in computer programming and statistical inference, in conjunction with hands-on analysis of real-world datasets, including economic data, document collections, geographical data, and social networks. It delves into social issues surrounding data analysis such as privacy and design.


Each offering site includes links to assignments, slides, and readings. You are welcome to use any of the materials you find.


All materials for the course, including the textbook and assignments, are available for free online under a Creative Commons license.

Textbook: Computational and Inferential Thinking: The Foundations of Data Science is a free online textbook that includes interactive Jupyter notebooks and public data sets for all examples. The textbook source is maintained as an open source project.

Assignments: All assignments from the current course offering, as well as assignments from the Fall 2016 offering, are available as Jupyter notebooks. The notebooks assume a Python 3 installation with the standard modules from an Anaconda installation, such as NumPy and Matplotlib, as well as the datascience and okpy modules.
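
Before opening a notebook, it can help to confirm that the packages named above are present. The following is a minimal sketch using only the Python standard library. It assumes the packages were installed under their usual PyPI distribution names, and the helper `installed_versions` is a hypothetical name introduced here for illustration.

```python
from importlib import metadata

# Distribution names mirror the modules named in the text. Treating them as
# PyPI distribution names is an assumption about how they were installed.
REQUIRED = ["numpy", "matplotlib", "datascience", "okpy"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(REQUIRED)
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Any entry reported as "not installed" would need to be added to the environment before the notebooks can run.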

Lecture Materials: All lecture videos from Fall 2016 are hosted on YouTube. PDF slides from Fall 2016, and Google Slides and Jupyter notebooks from Spring 2017, are linked from the respective course calendars. To request access to the source of the slides for instructional purposes, please fill out our Data 8 Instructor Interest form.


All of the software components of the course are maintained as open-source projects. We encourage you to contact us if you want any help using them.

The datascience module: The course uses a module for table manipulation, charts, and maps that provides an interface appropriate for an introductory course. The Table class is similar to a DataFrame in Pandas, but explicitly does not support row indexes, hierarchical indexes, time series data, missing values, slicing, and many other advanced features that can complicate table manipulation for novices. The charting features use Matplotlib, but customize the output to match the pedagogical goals of the course. The mapping features are implemented with Folium, but aim to simplify working with tables and GeoJSON files. Although the datascience module can certainly be used outside the context of the course, it was specifically designed to support the Data 8 curriculum and to prepare students to transition to more standard tools such as Pandas.
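
To make the design idea concrete, here is a toy sketch of a table with named columns whose rows are addressed only by position, with no row index. It is not the datascience module's implementation or API: the class `MiniTable`, its methods, and the example data are all hypothetical and written only for illustration.

```python
class MiniTable:
    """Hypothetical toy table: named columns, rows addressed only by position."""

    def __init__(self, columns=None):
        # `columns` maps a label to a list of values; all lists share one length.
        self._columns = {label: list(values)
                         for label, values in (columns or {}).items()}
        lengths = {len(values) for values in self._columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same number of rows")

    @property
    def num_rows(self):
        """Number of rows, or 0 for a table with no columns."""
        return len(next(iter(self._columns.values()), []))

    def column(self, label):
        """Return a copy of one column's values."""
        return list(self._columns[label])

    def take(self, positions):
        """Build a new table from the rows at the given positions."""
        return MiniTable({label: [values[i] for i in positions]
                          for label, values in self._columns.items()})

    def where(self, label, predicate):
        """Keep the rows whose value in `label` satisfies `predicate`."""
        keep = [i for i, value in enumerate(self._columns[label])
                if predicate(value)]
        return self.take(keep)

    def sort(self, label, descending=False):
        """Return a new table with rows ordered by the values in `label`."""
        order = sorted(range(self.num_rows),
                       key=lambda i: self._columns[label][i],
                       reverse=descending)
        return self.take(order)


# Made-up example data, for illustration only.
cities = MiniTable({
    "city": ["Albany", "Berkeley", "Oakland"],
    "population": [20_000, 120_000, 430_000],
})
large = (cities.where("population", lambda p: p > 100_000)
               .sort("population", descending=True))
print(large.column("city"))  # prints ['Oakland', 'Berkeley']
```

Because every operation returns a new table and rows carry no labels, there is no index to keep aligned after filtering or sorting, which is the kind of complication the paragraph above says the course avoids for novices.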

The OK autograder and submission system: The assignments depend on a Python-based autograder that includes client-side tests available to students at any time and server-side tests intended for correctness-based grading. Each assignment is distributed with a folder of named tests, each of which contains test cases that are invoked from within a notebook.
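
The idea of named tests run against a notebook's variables can be sketched with the standard library alone. This is not the OK client: the `TESTS` dictionary, the `grade` function, and the doctest-style case format are assumptions made for illustration, with Python's built-in `doctest` module standing in for the autograder.

```python
import doctest

# Hypothetical named tests. In the course these would live in a folder of
# test files; a dictionary stands in for that folder here.
TESTS = {
    "q1": """
    >>> total
    6
    >>> isinstance(total, int)
    True
    """,
}


def grade(name, env):
    """Run the named test's cases against `env`; return (passed, attempted)."""
    parser = doctest.DocTestParser()
    # Copy `env` so that running the test cannot modify the caller's namespace.
    test = parser.get_doctest(TESTS[name], dict(env), name, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the runner's failure reports; only the counts are returned.
    results = runner.run(test, out=lambda text: None)
    return results.attempted - results.failed, results.attempted


# Stand-in for the variables a student has defined in a notebook.
student_env = {"total": sum([1, 2, 3])}
print(grade("q1", student_env))  # prints (2, 2)
```

In a notebook, a call such as `grade("q1", globals())` would check the student's current variables, which matches the client-side role described above; server-side grading would run a separate set of cases.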

Hosted Computing Environment: We provide a hosted environment in which our students edit and execute their notebooks. It has two components: a Kubernetes-based deployment of JupyterHub that we have specifically designed for courses, and an assignment server that loads assignments into each student's environment.

If you want more information about any of these tools, please fill out our Data 8 Instructor Interest form or email denero@berkeley.edu.
