In this module, we’ll focus on cleaning, joining, and similar tasks that are essential to creating a usable dataset. These “data wrangling” tasks are perhaps the least glamorous part of data science, but also the most time-consuming. Don’t be surprised if they take up 80% or more of your time.
We’ll use an example from the California traffic collisions database, which is a well-documented and structured source. In the next module, we’ll look at some different sources, including spatial data and joins in geopandas.
Learning objectives
By the end of this module, you should be able to:
- Evaluate whether a left, right, inner, and/or outer join is required
- Implement a tabular join in pandas
- Interpret a codebook or data dictionary
- Use the pandas groupby functions to compute means and other aggregated quantities
Required readings
This week, the readings are more technical “how to” examples, and go beyond what we’ll cover in the lectures. I suggest reading them after you watch the video lectures, and using them to (i) reinforce the topics from the videos, and (ii) keep a mental note of more advanced techniques that you might want to come back to for your own projects.
Both of these books are available via the publisher’s website. If you click the links below, you should see an option to get access via the UCLA Library (use your SSO login).
McKinney, Wes. 2022. Python for Data Analysis, 3rd Edition. O’Reilly.
- Chapter 8, section 8.2
- Chapter 10, especially sections 10.1, 10.2, and 10.3
Kazil, Jacqueline and Katharine Jarmul. 2016. Data Wrangling With Python. O’Reilly.
- Skim Chapters 6 and 7. This treatment is more advanced, but will give you ideas about the types of steps to follow.
Optional readings
This optional reading focuses on data integration efforts by municipalities and other public agencies.
Video 3a: Joins
This lecture introduces different ways to join two datasets, using the example of the California Transportation Injury Mapping System.
As you watch the video, follow along with the code here.
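As a warm-up before the video, here is a minimal sketch of how the four join types differ in pandas. The two tables, their column names (`case_id`, `severity`, `party_type`), and their values are made up for illustration; they are not the actual fields of the collisions database, so check the codebook for the real names.

```python
import pandas as pd

# Hypothetical toy tables: one row per collision, and one row per party
# involved in a collision. Column names and values are illustrative only.
collisions = pd.DataFrame({
    "case_id": [1, 2, 3],
    "severity": ["fatal", "minor", "severe"],
})
parties = pd.DataFrame({
    "case_id": [1, 1, 2, 4],
    "party_type": ["driver", "pedestrian", "driver", "bicyclist"],
})

# Run the same merge with each join type and compare the row counts
joined = {
    how: collisions.merge(parties, on="case_id", how=how)
    for how in ["inner", "left", "right", "outer"]
}
for how, df in joined.items():
    print(how, len(df))

# indicator=True adds a _merge column showing where each row came from,
# which helps when deciding which join type you need
audit = collisions.merge(parties, on="case_id", how="outer", indicator=True)
print(audit["_merge"].value_counts())
```

In this toy example, the inner join keeps only cases found in both tables, the left join also keeps collision 3 (which has no parties), the right join also keeps the party record for case 4 (which has no collision), and the outer join keeps everything. Note that collision 1 appears twice in every result because it matches two party rows.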
Video 3b: Aggregation
We’ll explore how to aggregate and create group-level summary statistics such as means, counts, and standard deviations.
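The idea can be sketched with a small example, assuming a hypothetical table with one row per collision. The column names (`county`, `number_injured`) and the values are invented for illustration and may not match the real dataset.

```python
import pandas as pd

# Hypothetical toy data: one row per collision (values are made up)
collisions = pd.DataFrame({
    "county": ["Los Angeles", "Los Angeles", "Alameda", "Alameda", "Alameda"],
    "number_injured": [1, 3, 0, 2, 4],
})

# Group by county, then compute several summary statistics at once.
# The result has one row per county and one column per statistic.
summary = (
    collisions.groupby("county")["number_injured"]
    .agg(["count", "mean", "std"])
)
print(summary)
```

Passing a list of function names to `agg` is one convenient way to get several statistics in a single table; calling `.mean()` or `.count()` directly on the grouped column works too if you only need one.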
Please take the quiz below to check your understanding of this module.
This notebook practices the concepts that we’ve developed in the lecture notebooks. We’ll work through it in class.
It’s the Module 3 class activity in your GitHub repository here.
Final Project Proposal
Note: Sign up for a group before you submit the assignment. Just one submission per group is needed.
Please submit a brief proposal (about 1 page) regarding what you’d like to do for the course project. This is not a graded assignment, but is meant to (i) get you started, and (ii) let me provide suggestions, resources, etc. Your overall question is the most important part: what question do you want to answer? It’s fine (and expected) that you won’t know exactly how you will complete the analysis, but the more preliminary ideas you have, the more I can help.
The project should (i) be broadly related to urban planning/studies, and (ii) use the data science techniques that we’ll cover this quarter. In practice, that means some combination of web scraping, data processing, analysis techniques such as machine learning, and visualization. But the balance between these will depend on your project—e.g. if you have a fairly “clean” input data set, I’ll expect more on the analysis and visualization side.
Use the #projects channel in Slack to discuss ideas and recruit team members.
Your proposal should include:
- Names of team members
- Research question
- Proposed data sources
- Sketch of how you think you’ll go about the analysis
Please work in teams of 2-4 people. For PhD students only: feel free to propose a solo project that is related to your own dissertation research.
Please upload to BruinLearn a PDF or Word document, or link to a Google Doc.