This week we’ll focus on how to scrape web pages.
We’ll use examples from the City of Seattle permits database and from Craigslist.
As before, the course videos provide the basic concepts and background. As you watch them, follow along in the notebook (which is in your GitHub repository). Pause the video to explore the objects, try different things, and experiment. Don’t worry if you don’t get every last detail – we’ll use class time to practice and introduce more examples.
Learning objectives
By the end of this module, you should be able to:
- Evaluate when web scraping is needed and when a simpler solution will do
- Design a scraping approach that considers the structure of a web page
- Implement a web scraper for a given page
- Critically analyze ethical, legal, and representational concerns (e.g. who is excluded) around web scraping
Required readings
(Here is an update about some of the legal questions raised in the Boeing and Waddell paper.)
Think about the following questions as you read the paper:
- What are the biases in the Craigslist housing data? Are they more or less severe than in other housing market data? How should planners handle this?
- What ethical or legal concerns arise in scraping Craigslist data? Does this change if you are a planner in city government rather than an outside researcher?
- What other questions might you be able to explore through scraping Craigslist (or similar websites)?
Optional readings
Video 2a: Scraping permits
We’ll use the BeautifulSoup library to scrape web pages using the example of land use permits in Seattle.
As you watch the video, follow along with the code here.
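Before diving into the video, it may help to see the basic pattern BeautifulSoup uses: parse the HTML, then search it for tags and attributes. Here is a minimal sketch that parses an HTML string standing in for a permits page; the markup and class names (`permit-id`, `address`) are hypothetical, and the real Seattle pages have their own structure.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a page of permit records.
html = """
<table>
  <tr><td class="permit-id">3041234</td><td class="address">123 Main St</td></tr>
  <tr><td class="permit-id">3045678</td><td class="address">456 Pine St</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Loop over each table row and pull out the fields we care about.
permits = []
for row in soup.find_all("tr"):
    permit_id = row.find("td", class_="permit-id").text
    address = row.find("td", class_="address").text
    permits.append({"id": permit_id, "address": address})

print(permits)
```

In a real scraper you would fetch the page first (for example with the `requests` library) and pass the response text to `BeautifulSoup`; parsing from a string here keeps the sketch self-contained.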
Video 2b: Parsing text
Sometimes, web scraping returns a mass of unstructured text. We’ll take a first look at parsing text to extract critical information.
As you watch the video, follow along with the code here.
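As a taste of what the video covers, regular expressions are one common way to pull structured fields out of a blob of scraped text. The permit description below is made up for illustration; only the extraction pattern matters.

```python
import re

# Hypothetical permit description, as might come back from a scrape.
text = ("Land use application to allow a 4-story building with "
        "20 apartment units and parking for 15 vehicles.")

# Search for a number immediately preceding the phrase we care about.
match = re.search(r"(\d+)\s+apartment units", text)
n_units = int(match.group(1)) if match else None

print(n_units)  # 20
```

Guarding with `if match else None` matters in practice: many scraped descriptions won’t contain the phrase at all, and `match.group(1)` on a failed search would raise an error.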
Video 2c: Scraping Craigslist
This lecture uses the example of Craigslist housing posts to explore scraping more complex web pages.
As you watch the video, follow along with the code here.
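More complex pages mean more nesting: each post is a container element holding several fields, so the scraper first finds the containers and then searches within each one. The sketch below uses markup loosely modeled on a Craigslist results page; the class names (`result-row`, `result-title`, `result-price`) are placeholders, and the real site’s structure differs and changes over time.

```python
from bs4 import BeautifulSoup

# Hypothetical listing markup; real Craigslist pages are more elaborate.
html = """
<ul>
  <li class="result-row">
    <a class="result-title">Sunny 1BR near park</a>
    <span class="result-price">$1,500</span>
  </li>
  <li class="result-row">
    <a class="result-title">Studio downtown</a>
    <span class="result-price">$1,200</span>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# First find each post's container, then search within it for each field.
posts = []
for row in soup.find_all("li", class_="result-row"):
    title = row.find("a", class_="result-title").text
    price = row.find("span", class_="result-price").text
    posts.append((title, price))

print(posts)
```

Searching within each container (rather than across the whole page) keeps titles and prices correctly paired even when some posts are missing a field.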
Video 2d: Parsing Craigslist
We’ll continue to practice scraping web pages and parsing text, and also show how to handle errors (“exceptions”) gracefully.
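The core idea behind handling exceptions in a scraper is that one malformed post shouldn’t crash the whole run. A minimal sketch, using made-up dictionaries in place of scraped rows:

```python
# Hypothetical scraped records; the second one is missing a price,
# as often happens with real posts.
rows = [{"title": "Sunny 1BR", "price": "$1,500"},
        {"title": "Studio downtown"}]

prices = []
for row in rows:
    try:
        # Strip the currency symbol and comma before converting to int.
        prices.append(int(row["price"].replace("$", "").replace(",", "")))
    except KeyError:
        # Record the gap and keep going rather than stopping the scrape.
        prices.append(None)

print(prices)  # [1500, None]
```

Catching the specific exception (`KeyError`) rather than a bare `except` is good practice: unexpected errors still surface instead of being silently swallowed.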
Please take the quiz below to check your understanding of this module.
This notebook practices the concepts that we’ve developed in the lecture notebooks. We’ll work through it in class.
It’s the Module 2 class activity in your GitHub repository here.
Here are some tips on submitting via GitHub. They are from last year’s course; the URLs are different but the steps are the same.
You have now completed Module 2. Please navigate to the homepage or to the next module by using the green navigation bar at the top.