In this module and the next, we’ll consider how to analyze natural language – any block of text – in Python. For example, we could analyze general plans, tweets, or journal articles.

This module will cover the parsing side – how to take a chunk of text, perhaps from a PDF, and split it into sentences and words. 

Learning objectives

By the end of this module, you should be able to:

  1. Read in a PDF or other text file and clean and simplify the text
  2. Analyze and visualize a text document using bags of words models
  3. Apply techniques such as stopword removal, tokenizing, and lemmatizing to process text data and prepare it for more detailed analysis

Required Readings

These two papers use text analysis to understand urban planning documents. Choose one of them to read in depth, and skim the other. What do they do, and to what extent do you find the analysis helpful in understanding the contents of a plan? Note that we’ll discuss sentiment analysis and topic modeling in the next module.

Fu, Xinyu, Chaosu Li, and Wei Zhai. 2023. "Using Natural Language Processing to Read Plans." Journal of the American Planning Association 89 (1): 107–119. DOI: 10.1080/01944363.2022.2038659

Brinkley, Catherine, and Carl Stahmer. 2021. "What Is in a Plan? Using Natural Language Processing to Read 461 California City General Plans." Journal of Planning Education and Research, Online First. DOI: 10.1177/0739456X21995890

Video 8a: Reading PDFs

Text often comes in the form of PDFs, which are nice to look at but hard to extract usable text from. In this lecture, we’ll use the pdfminer library to read a PDF into Python, and explore how to use regex to clean up the resulting text.

As you watch the video, follow along with the code here.
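
If you’d like a preview of the idea before watching, here is a minimal sketch, assuming pdfminer.six is installed (the maintained fork, imported as pdfminer) and using "plan.pdf" as a stand-in filename:

    # Minimal sketch: pull raw text out of a PDF, then tidy it with a regex.
    import re
    from pdfminer.high_level import extract_text

    raw = extract_text("plan.pdf")      # "plan.pdf" is a placeholder filename

    # Collapse runs of whitespace (including stray line breaks) into single spaces
    clean = re.sub(r"\s+", " ", raw).strip()

    print(clean[:500])                  # peek at the first 500 characters

The lecture goes into more detail on using regex to clean up the extracted text.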

Video 8b: Bags of words

In this lecture, we’ll discuss the simplest type of text analysis – word counts, or “bags of words” models. 

As you watch the video, follow along with the code here.
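
As a minimal sketch of a bag-of-words count, assuming clean holds the cleaned-up text from the PDF step above:

    # Minimal bag-of-words sketch: count how often each word appears.
    from collections import Counter

    words = clean.lower().split()           # crude whitespace split; proper tokenizing comes in the next lecture
    counts = Counter(words)

    for word, n in counts.most_common(10):  # ten most frequent words
        print(word, n)

Note that common words like “the” will dominate these counts, which is why the next lecture covers stopword removal.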

Video 8c: Tokenizing

This lecture walks through the final steps to clean up a text document. We’ll look at how to remove stopwords – the small words such as “the” or “with” that aren’t normally relevant to an analysis. We’ll also discuss tokenizing – splitting a document into sentences or words. Finally, we’ll consider lemmatizing – reducing words to a common base form, such as “constructed” and “constructing” to “construct.”

As you watch the video, follow along with the code here.
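
One common way to carry out these steps is with the NLTK library; here is a minimal sketch under that assumption (the download() calls fetch the required data the first time you run it):

    # Minimal sketch of tokenizing, stopword removal, and lemmatizing with NLTK.
    import nltk
    nltk.download("punkt")        # tokenizer data; newer NLTK releases may also need "punkt_tab"
    nltk.download("stopwords")
    nltk.download("wordnet")

    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    text = "The city is constructing new housing near the transit line."

    tokens = word_tokenize(text.lower())                            # split into word tokens
    stop = set(stopwords.words("english"))
    tokens = [t for t in tokens if t.isalpha() and t not in stop]   # drop stopwords and punctuation

    lemmatizer = WordNetLemmatizer()
    # Treat every token as a verb for simplicity; "constructing" becomes "construct"
    lemmas = [lemmatizer.lemmatize(t, pos="v") for t in tokens]
    print(lemmas)

The lecture walks through these steps in more detail and applies them to a full document.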

Please take the quiz below to check your understanding of this module.

Quiz for currently enrolled UCLA students

Quiz for other learners

This notebook practices the concepts that we’ve developed in the lecture notebooks. We’ll work through it in class.

It’s the Module 8 class activity in your GitHub repository here.

The homework assignment is here. Please submit on GitHub.

Here are some tips on submitting via GitHub. They are from last year’s course; the URLs are different but the steps are the same.

You have now completed Module 8. Please navigate to the homepage or to the next module by using the green navigation bar at the top.