In this module, we’ll focus on the cleaning, joining, and similar tasks that are essential to creating a usable dataset. These types of “data wrangling” tasks are perhaps the least glamorous part of data science, but the most time consuming. Don’t be surprised if they take up 80% or more of your time.
We’ll use an example from the California traffic collisions database, which is a well-documented and structured source. In the next module, we’ll look at some different sources, including spatial data and joins in geopandas.
By the end of this module, you should be able to:
- Evaluate whether a left, right, inner, and/or outer join is required
- Implement a tabular join in pandas
- Interpret a codebook or data dictionary
- Use the pandas groupby functions to compute means and other aggregated quantities
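As a preview of these objectives, here is a minimal sketch of the join types and a groupby aggregation in pandas. The two tables and their column names (`case_id`, `county`, `party_age`) are made up for illustration and are not the actual schema of the collisions database; check the codebook for the real field names.

```python
import pandas as pd

# Hypothetical toy tables; column names are illustrative assumptions,
# not the real schema of the California collisions database.
collisions = pd.DataFrame({
    "case_id": [1, 2, 3],
    "county": ["Los Angeles", "Los Angeles", "Orange"],
})
parties = pd.DataFrame({
    "case_id": [1, 1, 2, 4],
    "party_age": [34, 51, 27, 45],
})

# Left join: keep every collision, even if it has no matching party record.
# indicator=True adds a _merge column showing where each row came from.
left = collisions.merge(parties, on="case_id", how="left", indicator=True)

# Inner join: keep only case_ids that appear in both tables.
inner = collisions.merge(parties, on="case_id", how="inner")

# Outer join: keep every case_id that appears in either table.
outer = collisions.merge(parties, on="case_id", how="outer")

# Groupby: mean party age per county, computed on the matched rows only.
mean_age = inner.groupby("county")["party_age"].mean()

print(len(left), len(inner), len(outer))
print(mean_age)
```

Comparing the row counts of the three results, and inspecting the `_merge` column of the left join, is a quick way to see which records failed to match before deciding which join a given analysis needs.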
This week, the readings are more technical “how to” examples, and go beyond what we’ll cover in the lectures. I suggest reading them after you watch the video lectures, and using them to (i) reinforce the topics from the videos, and (ii) keep a mental note of more advanced techniques that you might want to come back to for your own projects.
Both of these books are available via the publisher’s website. If you click the links below, you should see an option to get access via the UCLA Library (use your SSO login).
McKinney, Wes. 2022. Python for Data Analysis, 3rd Edition. O’Reilly.
- Chapter 8, section 8.2
- Chapter 10, especially sections 10.1, 10.2, and 10.3
Kazil, Jacqueline and Katharine Jarmul. 2016. Data Wrangling With Python. O’Reilly.
- Skim Chapters 6 and 7. This treatment is more advanced, but will give you ideas about the types of steps to follow.
This optional reading focuses on data integration efforts by municipalities and other public agencies.
Kitchin, Rob and Niamh Moore-Cherry. 2021. “Fragmented governance, the urban data ecosystem and smart city-regions: the case of Metropolitan Boston,” Regional Studies 55(12): 1913-1923.
Final Project Proposal
Note: Sign up for a group before you submit the assignment. Just one submission per group is needed.
Please submit a brief proposal (about 1 page) regarding what you’d like to do for the course project. This is not a graded assignment, but is meant to (i) get you started, and (ii) let me provide suggestions, resources, etc. Your overall question is the most important part: what question do you want to answer? It’s fine (and expected) that you won’t know exactly how you will complete the analysis, but the more preliminary ideas you have, the more I can help.
The project should (i) be broadly related to urban planning/studies, and (ii) use the data science techniques that we’ll cover this quarter. In practice, that means some combination of web scraping, data processing, analysis techniques such as machine learning, and visualization. But the balance between these will depend on your project—e.g. if you have a fairly “clean” input data set, I’ll expect more on the analysis and visualization side.
Use the #projects channel in Slack to discuss ideas and recruit team members.
Your proposal should include:
- Names of team members
- Research question
- Proposed data sources
- Sketch of how you think you’ll go about the analysis
Please work in teams of 2-4 people. For PhD students only: feel free to propose a solo project that is related to your own dissertation research.
Please upload to BruinLearn a PDF or Word document, or link to a Google Doc.
You have now completed Module 3. Please navigate to the homepage or to the next module by using the green navigation bar at the top.