This module has two themes related to big data. In the lectures and notebooks, we’ll introduce some ways that you can scale up your analysis. We’ll focus on how to make your data fit in memory, and how to identify and speed up bottlenecks in your code. In the readings and class discussions, we’ll consider broader ethical questions.

Learning objectives

By the end of this module, you should be able to:

  1. Reduce the size of a dataset by optimizing data types
  2. Identify bottlenecks in your code
  3. Implement parallel processing to make use of multiple cores
  4. Evaluate when to use a database rather than pandas
  5. Critically evaluate the ethical issues raised by big data

Video 10a: Fitting data in memory

In this video, we’ll discuss some strategies to handle large datasets, making more types of analysis feasible with a regular laptop computer. We’ll focus on economizing on data types, and on how to sample a dataset.

As you watch the video, follow along with the code here.
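As a taste of what the video covers, here is a minimal sketch (not the course notebook itself; the column names are made up) of the two strategies: shrinking a DataFrame by choosing smaller data types, and prototyping on a random sample.

```python
import numpy as np
import pandas as pd

# A toy DataFrame: an integer column and a column of repeated strings
df = pd.DataFrame({
    "count": np.arange(1_000_000),  # defaults to int64
    "city": np.random.choice(["LA", "SF", "SD"], size=1_000_000),
})

before = df.memory_usage(deep=True).sum()

# Downcast integers to the smallest type that fits (here int64 -> int32)
df["count"] = pd.to_numeric(df["count"], downcast="integer")
# Repeated strings compress well as a categorical (stored as integer codes)
df["city"] = df["city"].astype("category")

after = df.memory_usage(deep=True).sum()
print(f"{before / after:.1f}x smaller")

# Sampling: work with a 1% random subset while prototyping
sample = df.sample(frac=0.01, random_state=42)
print(len(sample))
```

Once the code works on the sample, you can rerun it on the full dataset.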

Video 10b: Profiling

This video introduces profiling, or how to assess the speed of different sections of your code. By identifying bottlenecks, we can prioritize improving the sections of code that run the slowest.

As you watch the video, follow along with the code here.
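For a first look at profiling, here is a small sketch using the standard library’s cProfile on two hypothetical functions (the lecture notebook may use Jupyter magics like %timeit or %prun instead, which wrap the same idea).

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # A deliberately slow version: an explicit Python loop
    total = 0
    for i in range(n):
        total += i ** 2
    return total

def fast_sum(n):
    # Same result via a generator expression and built-in sum
    return sum(i ** 2 for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
slow_sum(200_000)
fast_sum(200_000)
profiler.disable()

# Print the five most time-consuming calls; the bottleneck shows up first
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

The report tells you where the time actually goes, so you optimize the slow function rather than guessing.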

Video 10c: Parallel processing

Parallel processing makes use of multiple cores on a computer. In this lecture, we’ll show how to parallelize functions on a regular laptop computer, and we’ll discuss in conceptual terms the MapReduce framework, which provides more advanced parallel processing for cloud computing.

As you watch the video, follow along with the code here.

Video 10d: Databases and SQL

The pandas library is wonderful for many purposes, but sometimes a database is a better solution. In this lecture, we’ll discuss when and why you would want to shift to a database, how to set one up, and how to integrate a database into Python-based workflows. We’ll also explore the basics of SQL, which is the standard language of databases.

As you watch the video, follow along with the code here.
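To make this concrete, here is a small sketch (with hypothetical table and column names) of moving between pandas and a SQLite database, and running a first SQL query. SQLite ships with Python, so no setup is needed.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({
    "city": ["LA", "SF", "LA", "SD"],
    "trips": [10, 5, 7, 3],
})

# A throwaway in-memory database; a real project would pass a filename
conn = sqlite3.connect(":memory:")
df.to_sql("trips_by_city", conn, index=False)

# The aggregation happens inside the database; only the (small)
# result comes back into pandas
query = """
    SELECT city, SUM(trips) AS total_trips
    FROM trips_by_city
    GROUP BY city
    ORDER BY total_trips DESC
"""
result = pd.read_sql(query, conn)
print(result)
conn.close()
```

This pattern pays off when the raw table is too big for memory: the database does the heavy lifting, and pandas only ever sees the aggregated result.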

Please take the quiz below to check your understanding of this module.

Quiz for currently enrolled UCLA students

Quiz for other learners

The homework assignment is here. Please submit on GitHub.

Here are some tips on submitting via GitHub. They are from last year’s course; the URLs are different but the steps are the same.

Class practice

This notebook practices the concepts that we’ve developed in the lecture notebooks. We’ll work through it in class.

It’s the Module 10 class activity in your GitHub repository here. 

Think of your submission as something that you can include in your post-graduation portfolio. Your final submission should include:

  • All the code that creates your results (we should be able to run it and get the same results)
  • The maps, charts, and other visuals
  • A brief narrative that integrates these visuals, and explains your process

Note that you will submit a self- and peer evaluation separately.

In most cases, a Jupyter notebook or (preferably) a slide deck produced through the notebook will make the most sense. Use markdown cells to write the narrative. If you think an alternative format makes more sense, talk to us first.

If you produce a large data file (e.g. through text analysis or web scraping) that takes a long time to generate, you’ll probably want to have separate notebooks for each stage of your analysis. The first notebook(s) would save your data file(s), and a subsequent notebook would do the analysis.

In week 6 or 7, each group will do an informal lightning presentation. This is a chance to cross-fertilize ideas. Share your general approach, and ask for ideas in places where you are stuck. We’ll do signups closer to the time.

The process and milestones will be different for each group, but here’s what we suggest.

By early in Week 5:

  • Confirm that you’ll have access to all the data you thought you would
  • Refine your project plan taking into account our comments and data availability
  • Agree on communication protocols and accountability mechanisms, and schedule check-ins
  • Figure out how you will share code. I suggest GitHub, of course :) Note: sharing notebooks on git can be cumbersome because any change to the outputs is marked as a change. You can clear the cell output before pushing to git, or use IPython rather than Jupyter.
  • Set milestones and a rough schedule

After that, there are fewer rules of thumb, but here are some suggestions to get started:

  • Explore a small piece of your data. If you are doing text analysis, read many examples to figure out what you are looking to extract.
  • Get some code working on a small part of your data before scaling up
  • Sketch what type of visuals you might want to aim for

Most importantly, talk to both of us early and often! For quick questions, Slack is normally the easiest.

Grading criteria:

  • Depth of analysis: how extensively you use the concepts and tools introduced in class
  • Quality of maps/charts/other visualizations and overall communication and presentation
  • Creativity (whether in framing the question, coding, or anything else)

How to submit: submit a link to your GitHub repository (make sure to invite us both). This should include the notebook and any other files. Ideally, you’d have shared the GitHub repository with us long before so we can help you in the process. Just one submission per group is fine. If you have large data files, you might want to share them separately (e.g. on Box or Google Drive), but it’s easiest to have everything in your git repository.

Optional deadline: Submit a draft by Friday of week 8 (May 26). We’ll try to turn around comments in a few days (although the timeline depends on how many groups submit a draft).

Deadline:
Friday of week 10 (June 9) at 5pm, but if you’d like an extension through finals week, you just have to ask. No need to justify your request.

You have now completed Module 10 – the last module in the course. Congratulations! Please check out the “What’s Next” section.