In this module, we’ll delve deeper into random forest models. We’ll introduce confusion matrices and other ways to assess predictive performance: how do a model’s predictions compare to the true values? We’ll explore how to interpret the results. For example, which variables turn out to be most important, and in which direction do they affect the outcome? And we’ll extend the concepts from random forests to other types of machine learning models, particularly neural networks. The scikit-learn library uses the same syntax across model types, so once you are familiar with random forests, other models are relatively simple to fit.
Learning Objectives
By the end of this module, you should be able to:
- Critically assess the predictive accuracy of a machine learning model
- Interpret the results of a random forest model using feature importances and partial dependence plots
- Standardize data
- Apply random forest concepts to other types of machine learning models, such as neural networks
Required Readings
These two articles consider some of the ethical challenges with predictive modeling. Think about whether these problems are inherent to machine learning, and/or how they might be mitigated.
Video 6a: Assessing performance
In the last module, we learned how to estimate a random forest model. But we skimmed over how to assess its predictive performance. Here, we investigate confusion matrices and other ways to evaluate a model’s predictions.
As you watch the video, follow along with the code here.
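Separately from the course notebook, here is a minimal sketch of what a confusion matrix looks like in scikit-learn. It uses synthetic data from `make_classification` rather than the course dataset, and the sample size, number of features, and `random_state` values are arbitrary assumptions chosen only to make the example reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the course dataset)
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4, random_state=0
)

# Hold out a test set so predictions are compared to values the model never saw
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)

# Summary metrics derived from the same four counts
accuracy = accuracy_score(y_test, y_pred)    # (tp + tn) / total
precision = precision_score(y_test, y_pred)  # tp / (tp + fp)
recall = recall_score(y_test, y_pred)        # tp / (tp + fn)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Accuracy alone can mislead when one class is rare, which is why the sketch also reports precision and recall from the same matrix.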
Video 6b: Interpreting results
A machine learning model can make use of dozens or even thousands of predictors. But which are the most important, and in which direction does a variable change the outcome? For example, does increasing the value of a parcel of land make it more or less likely to have an accessory dwelling unit (ADU)? This lecture shows how to use feature importances and partial dependence plots to interpret a model’s results.
As you watch the video, follow along with the code here.
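As a rough illustration of the two interpretation tools, the sketch below ranks features by importance and then computes the numbers behind a partial dependence plot for the top-ranked feature. The data are synthetic and the feature names (`feature_0`, and so on) are hypothetical placeholders; the course notebook may use different data and plotting helpers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence

# Synthetic stand-in data with hypothetical feature names (assumptions)
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Feature importances sum to 1; larger values mean the forest relied on
# that feature more, but they say nothing about direction
importances = rf.feature_importances_
order = np.argsort(importances)[::-1]
for i in order:
    print(f"{feature_names[i]}: {importances[i]:.3f}")

# Partial dependence: the average predicted probability of the positive
# class as one feature is varied over a grid and the others are left as observed
top = int(order[0])
pd_result = partial_dependence(
    rf, X, features=[top], kind="average", grid_resolution=20
)
avg = pd_result["average"][0]

# A crude read on direction: compare the two ends of the curve
direction = "rises" if avg[-1] > avg[0] else "falls"
print(f"Predicted probability {direction} as {feature_names[top]} increases")
```

Comparing only the endpoints hides any non-monotonic shape in between, so plotting the full curve is usually more informative than this one-line summary.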
Video 6c: Neural networks and logistic regression
Random forests are just one type of machine learning model. This lecture explores two more – neural networks and logistic regression. It also demonstrates how to standardize data – a necessary step for a neural network and also, as we’ll see in the next module, for cluster analysis.
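To make the standardization step concrete, here is a minimal sketch that rescales synthetic data with `StandardScaler` and then fits a neural network and a logistic regression using the same `fit` and `score` calls as a random forest. The data, the deliberately rescaled first column, the hidden layer size, and the `max_iter` values are all illustrative assumptions, not the settings used in the lecture.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the course dataset)
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4, random_state=0
)
X[:, 0] *= 1000  # put one feature on a much larger scale than the others

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Learn the mean and standard deviation from the training data only,
# then apply the same transformation to the test data
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Both models share the estimator interface used for random forests
models = {
    "neural network": MLPClassifier(
        hidden_layer_sizes=(16,), max_iter=1000, random_state=0
    ),
    "logistic regression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in models.items():
    model.fit(X_train_std, y_train)
    scores[name] = model.score(X_test_std, y_test)  # accuracy on the test set
    print(f"{name}: {scores[name]:.3f}")
```

Fitting the scaler on the training set alone avoids leaking information from the test set into the model.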
Please take the quiz below to check your understanding of this module.
This notebook practices the concepts that we’ve developed in the lecture notebooks. We’ll work through it in class.
It’s the Module 6 class activity in your GitHub repository here.
Here are some tips on submitting via GitHub. They are from last year’s course; the URLs are different but the steps are the same.