This week we’ll focus on how to scrape web pages.

We’ll use examples from the City of Seattle permits database and from Craigslist.

As before, the course videos will provide the basic concepts and background. As you watch them, follow along in the notebook (which is in your GitHub repository). Pause the video to explore the objects, try different things, and experiment. Don’t worry if you don’t get every last detail; we’ll use the class time to practice and introduce more examples.

Learning objectives

By the end of this module, you should be able to:

  1. Evaluate when web scraping is needed and when a simpler solution will do
  2. Design a scraping approach that takes the structure of a web page into account
  3. Implement a web scraper for a given page
  4. Critically analyze ethical, legal, and representational concerns (e.g., who is excluded) around web scraping

Required readings

Boeing, Geoff and Paul Waddell. 2016. New Insights into Rental Housing Markets across the United States: Web Scraping and Analyzing Craigslist Rental Listings. Journal of Planning Education and Research 37(4): 457-476.

(Here is an update about some of the legal questions raised in the Boeing and Waddell paper.)

Think about the following questions as you read the paper:

  • What are the biases in the Craigslist housing data? Are they more or less severe than in other housing market data? How should planners handle this?
  • What ethical or legal concerns arise in scraping Craigslist data? Does this change if you are a planner in city government rather than an outside researcher?
  • What other questions might you be able to explore through scraping Craigslist (or similar websites)?

Optional readings

Combs, Jennifer, Danielle Kerrigan and David Wachsmuth. 2020. Short-term rentals in Canada. Canadian Journal of Urban Research 29(1): 119-134.

Video 2a: Scraping permits

We’ll use the BeautifulSoup library to scrape web pages using the example of land use permits in Seattle.

As you watch the video, follow along with the code here.
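If you’d like a preview before watching, here is a minimal sketch of the basic BeautifulSoup workflow: download a page, parse the HTML, and pull out tags. The URL and the tags we look for are placeholders for illustration, not the actual structure of the Seattle permits site; the notebook walks through the real pages.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL -- the real permit pages are linked in the notebook
url = "https://example.com/permits/3032124"

# Download the page and parse the HTML
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# find() returns the first matching tag; find_all() returns a list of all matches
title = soup.find("h1")
tables = soup.find_all("table")

print(title.get_text(strip=True) if title else "No <h1> found")
print(f"Found {len(tables)} table(s) on the page")
```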

Video 2b: Parsing text

Sometimes, web scraping returns a mass of unstructured text. We’ll take a first look at parsing text to extract critical information.

As you watch the video, follow along with the code here.
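As a preview, here is a small sketch of the kind of text parsing the video covers, using Python’s built-in re module to pull structured values out of a blob of text. The sample string and the patterns are made up for illustration; the real permit text is in the notebook.

```python
import re

# Made-up example of the unstructured text a scraper might return
text = ("Decision Date: 01/15/2021. Project: construct a 4-story "
        "apartment building with 25 units.")

# Pull out a date in MM/DD/YYYY format
date_match = re.search(r"\d{2}/\d{2}/\d{4}", text)
if date_match:
    print("Date:", date_match.group(0))

# Pull out the unit count: one or more digits followed by the word "units"
units_match = re.search(r"(\d+) units", text)
if units_match:
    print("Units:", int(units_match.group(1)))
```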

Video 2c: Scraping Craigslist

This lecture uses the example of Craigslist housing posts to explore scraping more complex webpages.

As you watch the video, follow along with the code here.
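As a rough preview, scraping a listings page usually means finding every element of a given class and pulling fields out of each one. The URL and class names below are assumptions for illustration; Craigslist’s markup changes over time, so check the actual page source as shown in the video.

```python
import requests
from bs4 import BeautifulSoup

# Listings URL and class names are assumptions, not Craigslist's current markup
url = "https://losangeles.craigslist.org/search/apa"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# One element per listing (assumed class name)
rows = soup.find_all("li", class_="result-row")
for row in rows:
    price_tag = row.find("span", class_="result-price")
    title_tag = row.find("a", class_="result-title")
    # Guard against missing tags: not every listing has every field
    price = price_tag.get_text() if price_tag else None
    title = title_tag.get_text() if title_tag else None
    print(price, title)
```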

Video 2d: Parsing Craigslist

We’ll continue to practice scraping web pages and parsing text, and also show how to handle errors (“exceptions”) gracefully.

As you watch the video, follow along with the code here.
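The core pattern for graceful error handling is a try/except block inside the loop, so one bad listing doesn’t crash the whole scrape. Here is a minimal sketch; the parse_price helper and the sample strings are made up for illustration.

```python
def parse_price(text):
    # int() raises ValueError if the cleaned string isn't a number
    return int(text.replace("$", "").replace(",", ""))

prices = []
for raw in ["$1,500", "$950", "contact for price"]:
    try:
        prices.append(parse_price(raw))
    except ValueError:
        # Skip the malformed entry instead of crashing the whole loop
        print(f"Could not parse {raw!r}; skipping")

print(prices)  # [1500, 950]
```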

Quiz

Please take the quiz below to check your understanding of this module.

Quiz for currently enrolled UCLA students

Quiz for other learners

Class activity

This notebook practices the concepts that we’ve developed in the lecture notebooks. We’ll work through it in class.

It’s the Module 2 class activity in your GitHub repository here.

Homework

Access the homework assignment here. Please submit on GitHub.

Here are some tips on submitting via GitHub. They are from last year’s course; the URLs are different but the steps are the same.

You have now completed Module 2. Please navigate to the homepage or to the next module by using the green navigation bar at the top.