In this module and the next, we’ll look at one common use of machine learning models – classification. For example, which neighborhoods are likely to gentrify? On which parcels are Accessory Dwelling Units (e.g. backyard units) likely to be built? Which polluters are likely to exceed their permitted discharges?
These are examples of supervised machine learning problems. In other words, we know the right answer for at least a subset of the data, and we want to use those known cases to predict which categories the remaining observations fall into.
Machine learning is a very large field, and there are entire courses on the theory and applications. Here, we will give a very high-level overview. We’ll skate over the theoretical underpinnings and focus on implementing the models in Python.
In this module, we’ll walk through data preparation and the process of estimating a common machine learning model: the random forest.
By the end of this module, you should be able to:
- Perform more complex joins and other data wrangling operations
- Split a dataset into training and testing portions
- Estimate a random forest model
- Interpret an individual decision tree from a random forest
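To give a feel for where these objectives lead, here is a minimal sketch of the split, estimate, and interpret steps. It assumes the scikit-learn library, which this overview does not name, and uses a small synthetic dataset: the column names (`lot_size`, `dist_to_transit`, `has_adu`) and the rule that generates the labels are invented purely for illustration and are not the course data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import export_text

# Synthetic parcels (hypothetical columns, made up for this sketch)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "lot_size": rng.uniform(100, 1000, n),
    "dist_to_transit": rng.uniform(0, 5, n),
})
# Invented labeling rule: large lots near transit get an ADU
df["has_adu"] = ((df["lot_size"] > 500) & (df["dist_to_transit"] < 2.5)).astype(int)

feature_names = ["lot_size", "dist_to_transit"]
X = df[feature_names]
y = df["has_adu"]

# Hold out 25% of the rows to test the model on data it has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Estimate a random forest made up of 50 decision trees
rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X_train, y_train)

# Share of test-set parcels classified correctly
accuracy = rf.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")

# Print the splitting rules of the first tree in the forest
print(export_text(rf.estimators_[0], feature_names=feature_names))
```

Fixing `random_state` makes the split and the forest reproducible from run to run. The printed tree is only one of the 50 in the forest, so it illustrates how a single tree splits the data rather than summarizing the whole model.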