In this module and the next, we’ll consider how to analyze natural language – any block of text – in Python. For example, we could analyze general plans, tweets, or journal articles.
This module will cover the parsing side – how to take a chunk of text, perhaps from a PDF, and split it into sentences and words.
Learning objectives
By the end of this module, you should be able to:
- Read in a PDF or other text file and clean and simplify the text
- Analyze and visualize a text document using bag-of-words models
- Apply techniques such as stopword removal, tokenizing, and lemmatizing to process text data and prepare it for more detailed analysis
Required Readings
These two papers use text analysis to understand urban planning documents. Choose one of them to read in depth, and skim the other. What do they do, and to what extent do you find the analysis helpful in understanding the contents of a plan? Note that we’ll discuss sentiment analysis and topic modeling in the next module.
Video 8a: Reading PDFs
Text often comes in the form of PDFs, which are nice to look at but hard to extract the text from in a usable format. In this lecture, we’ll use the pdfminer library to read a PDF into Python, and explore the use of regex to clean up the text.
As you watch the video, follow along with the code here.
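To give a sense of what the reading and cleaning steps might look like, here is a minimal sketch rather than the lecture notebook itself. It assumes the `pdfminer.six` package is installed, and it uses that package's `extract_text` function. The helper names `read_pdf_text` and `clean_text`, the sample string, and the particular regex rules are illustrative assumptions; a real plan document will likely need its own patterns.

```python
import re


def read_pdf_text(path):
    """Return the raw text of a PDF (assumes the pdfminer.six package is installed)."""
    # Imported inside the function so the rest of the sketch runs without pdfminer.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def clean_text(raw):
    """Apply a few illustrative regex clean-up rules to extracted text."""
    text = raw.replace("\x0c", "\n")                    # form feeds often mark page breaks
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)        # rejoin words hyphenated across lines
    text = re.sub(r"^[ \t]*\d+[ \t]*$", "", text, flags=re.MULTILINE)  # drop lines that are only a page number
    text = re.sub(r"\s+", " ", text)                    # collapse runs of whitespace and newlines
    return text.strip()


# A made-up string standing in for text extracted from a PDF.
sample = "The General Plan guides devel-\nopment in the city.\n\n12\n\x0cHousing   is a\npriority."
cleaned = clean_text(sample)
print(cleaned)
```

One caveat: the hyphen rule also merges genuine compounds that happen to break across lines (for example, "low-" followed by "income"), so it is worth checking the output before relying on it.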
Video 8b: Bags of words
In this lecture, we’ll discuss the simplest type of text analysis – word counts, or “bags of words” models.
As you watch the video, follow along with the code here.
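As a rough illustration of a word-count model, the sketch below lowercases a text, pulls out the words with a simple regex, and counts them with `collections.Counter` from the standard library. The function name `bag_of_words` and the sample sentence are made up for this example, and the lecture notebook may take a different approach.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each word appears, ignoring case and punctuation."""
    # Match runs of letters, allowing one internal apostrophe (e.g. "city's").
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(words)


# A made-up sentence standing in for a cleaned document.
sample = "Housing is a priority. The plan supports housing near transit, and transit near housing."
counts = bag_of_words(sample)

print(counts["housing"])       # number of times "housing" appears
print(counts.most_common(3))   # the three most frequent words
```

Word order is discarded entirely, which is why the model is called a "bag": only the counts remain.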
Video 8c: Tokenizing
This lecture walks through the final steps to clean up a text document. We’ll look at how to remove stopwords – the small words such as “the” or “with” that aren’t normally relevant to an analysis. We’ll also discuss tokenizing – splitting a document into sentences or words. Finally, we’ll consider lemmatizing – reducing related word forms to a common base form (the lemma), such as “constructing” and “constructed” to “construct.”
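The three steps can be sketched with the standard library alone. This is a toy version under stated assumptions: the stopword set `STOPWORDS` and the lookup table `LEMMAS` are tiny hand-written stand-ins, the tokenizers are simple regexes, and the function names are hypothetical. In practice, libraries such as NLTK or spaCy supply full stopword lists, tokenizers, and lemmatizers, and the lecture notebook may use one of them.

```python
import re

# Tiny hand-written stand-ins; real stopword lists and lemma dictionaries are much larger.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "with", "is", "are", "in", "for"}
LEMMAS = {
    "constructing": "construct",
    "constructed": "construct",
    "constructs": "construct",
    "homes": "home",
    "cities": "city",
}


def split_sentences(text):
    """Tokenize into sentences by splitting after ., ! or ? followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def split_words(sentence):
    """Tokenize a sentence into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z]+", sentence.lower())


def preprocess(text):
    """Return one list of lemmatized, stopword-free tokens per sentence."""
    result = []
    for sentence in split_sentences(text):
        tokens = [w for w in split_words(sentence) if w not in STOPWORDS]
        # Look each token up in the lemma table; unknown words pass through unchanged.
        result.append([LEMMAS.get(w, w) for w in tokens])
    return result


sample = "The city is constructing homes. Cities constructed homes with the plan."
processed = preprocess(sample)
print(processed)
```

The sentence splitter here will stumble on abbreviations such as "U.S." or "Dr.", which is one reason dedicated tokenizers are usually preferred for real documents.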
Please take the quiz below to check your understanding of this module.
Quiz for currently enrolled UCLA students
This notebook practices the concepts that we’ve developed in the lecture notebooks. We’ll work through it in class.
It’s the Module 8 class activity in your GitHub repository here.
Here are some tips on submitting via GitHub. They are from last year’s course; the URLs are different but the steps are the same.
You have now completed Module 8.