MAINT: regularize notebook titles

website
Jake VanderPlas 2017-08-14 13:01:07 -07:00
parent de0cc6bd31
commit e3a225725c
21 changed files with 2758 additions and 758 deletions


@ -23,8 +23,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Keyboard Shortcuts in the IPython Shell\n",
"\n",
"# Keyboard Shortcuts in the IPython Shell"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you spend any amount of time on the computer, you've probably found a use for keyboard shortcuts in your workflow.\n",
"Most familiar perhaps are the Cmd-C and Cmd-V (or Ctrl-C and Ctrl-V) for copying and pasting in a wide variety of programs and systems.\n",
"Power-users tend to go even further: popular text editors like Emacs, Vim, and others provide users an incredible range of operations through intricate combinations of keystrokes.\n",


@ -23,8 +23,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# IPython Magic Commands\n",
"\n",
"# IPython Magic Commands"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The previous two sections showed how IPython lets you use and explore Python efficiently and interactively.\n",
"Here we'll begin discussing some of the enhancements that IPython adds on top of the normal Python syntax.\n",
"These are known in IPython as *magic commands*, and are prefixed by the ``%`` character.\n",


@ -23,8 +23,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Input and Output History\n",
"\n",
"# Input and Output History"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Previously we saw that the IPython shell allows you to access previous commands with the up and down arrow keys, or equivalently the Ctrl-p/Ctrl-n shortcuts.\n",
"Additionally, in both the shell and the notebook, IPython exposes several ways to obtain the output of previous commands, as well as string versions of the commands themselves.\n",
"We'll explore those here."


@ -23,8 +23,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# IPython and Shell Commands\n",
"\n",
"# IPython and Shell Commands"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When working interactively with the standard Python interpreter, one of the frustrations is the need to switch between multiple windows to access Python tools and system command-line tools.\n",
"IPython bridges this gap, and gives you a syntax for executing shell commands directly from within the IPython terminal.\n",
"The magic happens with the exclamation point: anything appearing after ``!`` on a line will be executed not by the Python kernel, but by the system command-line.\n",


@ -23,8 +23,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Errors and Debugging\n",
"\n",
"# Errors and Debugging"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Code development and data analysis always require a bit of trial and error, and IPython contains tools to streamline this process.\n",
"This section will briefly cover some options for controlling Python's exception reporting, followed by exploring tools for debugging errors in code."
]


@ -23,8 +23,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Profiling and Timing Code\n",
"\n",
"# Profiling and Timing Code"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the process of developing code and creating data processing pipelines, there are often trade-offs you can make between various implementations.\n",
"Early in developing your algorithm, it can be counterproductive to worry about such things. As Donald Knuth famously quipped, \"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.\"\n",
"\n",


@ -23,8 +23,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# More IPython Resources\n",
"\n",
"# More IPython Resources"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this chapter, we've just scratched the surface of using IPython to enable data science tasks.\n",
"Much more information is available both in print and on the Web, and here we'll list some other resources that you may find helpful."
]


@ -2,7 +2,10 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--BOOK_INFORMATION-->\n",
"<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
@ -13,7 +16,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--NAVIGATION-->\n",
"< [More IPython Resources](01.08-More-IPython-Resources.ipynb) | [Contents](Index.ipynb) | [Understanding Data Types in Python](02.01-Understanding-Data-Types.ipynb) >"
@ -21,14 +27,20 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"# Introduction to NumPy\n"
"# Introduction to NumPy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"This chapter, along with chapter 3, outlines techniques for effectively loading, storing, and manipulating in-memory data in Python.\n",
"The topic is very broad: datasets can come from a wide range of sources and a wide range of formats, including be collections of documents, collections of images, collections of sound clips, collections of numerical measurements, or nearly anything else.\n",
@ -56,7 +68,9 @@
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -77,7 +91,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"For the pieces of the package discussed here, I'd recommend NumPy version 1.8 or later.\n",
"By convention, you'll find that most people in the SciPy/PyData world will import NumPy using ``np`` as an alias:"
@ -87,7 +104,9 @@
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -96,14 +115,20 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Throughout this chapter, and indeed the rest of the book, you'll find that this is the way we will import and use NumPy."
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Reminder about Built In Documentation\n",
"\n",
@ -126,7 +151,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--NAVIGATION-->\n",
"< [More IPython Resources](01.08-More-IPython-Resources.ipynb) | [Contents](Index.ipynb) | [Understanding Data Types in Python](02.01-Understanding-Data-Types.ipynb) >"

File diff suppressed because it is too large


@ -2,7 +2,10 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--BOOK_INFORMATION-->\n",
"<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
@ -13,7 +16,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--NAVIGATION-->\n",
"< [High-Performance Pandas: eval() and query()](03.12-Performance-Eval-and-Query.ipynb) | [Contents](Index.ipynb) | [Visualization with Matplotlib](04.00-Introduction-To-Matplotlib.ipynb) >"
@ -23,8 +29,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Further Resources\n",
"\n",
"# Further Resources"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"In this chapter, we've covered many of the basics of using Pandas effectively for data analysis.\n",
"Still, much has been omitted from our discussion.\n",
"To learn more about Pandas, I recommend the following resources:\n",
@ -42,7 +56,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--NAVIGATION-->\n",
"< [High-Performance Pandas: eval() and query()](03.12-Performance-Eval-and-Query.ipynb) | [Contents](Index.ipynb) | [Visualization with Matplotlib](04.00-Introduction-To-Matplotlib.ipynb) >"


@ -2,7 +2,10 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--BOOK_INFORMATION-->\n",
"<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
@ -13,7 +16,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--NAVIGATION-->\n",
"< [Machine Learning](05.00-Machine-Learning.ipynb) | [Contents](Index.ipynb) | [Introducing Scikit-Learn](05.02-Introducing-Scikit-Learn.ipynb) >"
@ -23,8 +29,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# What Is Machine Learning?\n",
"\n",
"# What Is Machine Learning?"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Before we take a look at the details of various machine learning methods, let's start by looking at what machine learning is, and what it isn't.\n",
"Machine learning is often categorized as a subfield of artificial intelligence, but I find that categorization can often be misleading at first brush.\n",
"The study of machine learning certainly arose from research in this context, but in the data science application of machine learning methods, it's more helpful to think of machine learning as a means of *building models of data*.\n",
@ -39,7 +53,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Categories of Machine Learning\n",
"\n",
@ -60,7 +77,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Qualitative Examples of Machine Learning Applications\n",
"\n",
@ -72,7 +92,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Classification: Predicting discrete labels\n",
"\n",
@ -83,7 +106,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"![](figures/05.01-classification-1.png)\n",
"[figure source in Appendix](06.00-Figure-Code.ipynb#Classification-Example-Figure-1)"
@ -91,7 +117,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Here we have two-dimensional data: that is, we have two *features* for each point, represented by the *(x,y)* positions of the points on the plane.\n",
"In addition, we have one of two *class labels* for each point, here represented by the colors of the points.\n",
@ -106,7 +135,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"![](figures/05.01-classification-2.png)\n",
"[figure source in Appendix](06.00-Figure-Code.ipynb#Classification-Example-Figure-2)"
@ -114,7 +146,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Now that this model has been trained, it can be generalized to new, unlabeled data.\n",
"In other words, we can take a new set of data, draw this model line through it, and assign labels to the new points based on this model.\n",
@ -123,7 +158,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"![](figures/05.01-classification-3.png)\n",
"[figure source in Appendix](06.00-Figure-Code.ipynb#Classification-Example-Figure-3)"
@ -131,7 +169,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"This is the basic idea of a classification task in machine learning, where \"classification\" indicates that the data has discrete class labels.\n",
"At first glance this may look fairly trivial: it would be relatively easy to simply look at this data and draw such a discriminatory line to accomplish this classification.\n",
@ -151,7 +192,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Regression: Predicting continuous labels\n",
"\n",
@ -162,7 +206,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"![](figures/05.01-regression-1.png)\n",
"[figure source in Appendix](06.00-Figure-Code.ipynb#Regression-Example-Figure-1)"
@ -170,7 +217,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"As with the classification example, we have two-dimensional data: that is, there are two features describing each data point.\n",
"The color of each point represents the continuous label for that point.\n",
@ -184,7 +234,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"![](figures/05.01-regression-2.png)\n",
"[figure source in Appendix](06.00-Figure-Code.ipynb#Regression-Example-Figure-2)"
@ -192,7 +245,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Notice that the *feature 1-feature 2* plane here is the same as in the two-dimensional plot from before; in this case, however, we have represented the labels by both color and three-dimensional axis position.\n",
"From this view, it seems reasonable that fitting a plane through this three-dimensional data would allow us to predict the expected label for any set of input parameters.\n",
@ -201,7 +257,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"![](figures/05.01-regression-3.png)\n",
"[figure source in Appendix](06.00-Figure-Code.ipynb#Regression-Example-Figure-3)"
@ -209,7 +268,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"This plane of fit gives us what we need to predict labels for new points.\n",
"Visually, we find the results shown in the following figure:"
@ -217,7 +279,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"![](figures/05.01-regression-4.png)\n",
"[figure source in Appendix](06.00-Figure-Code.ipynb#Regression-Example-Figure-4)"
@ -225,7 +290,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"As with the classification example, this may seem rather trivial in a low number of dimensions.\n",
"But the power of these methods is that they can be straightforwardly applied and evaluated in the case of data with many, many features.\n",
@ -244,7 +312,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Clustering: Inferring labels on unlabeled data\n",
"\n",
@ -257,7 +328,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"![](figures/05.01-clustering-1.png)\n",
"[figure source in Appendix](06.00-Figure-Code.ipynb#Clustering-Example-Figure-2)"
@ -265,7 +339,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"By eye, it is clear that each of these points is part of a distinct group.\n",
"Given this input, a clustering model will use the intrinsic structure of the data to determine which points are related.\n",
@ -274,7 +351,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"![](figures/05.01-clustering-2.png)\n",
"[figure source in Appendix](06.00-Figure-Code.ipynb#Clustering-Example-Figure-2)"
@ -282,7 +362,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"*k*-means fits a model consisting of *k* cluster centers; the optimal centers are assumed to be those that minimize the distance of each point from its assigned center.\n",
"Again, this might seem like a trivial exercise in two dimensions, but as our data becomes larger and more complex, such clustering algorithms can be employed to extract useful information from the dataset.\n",
@ -293,7 +376,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Dimensionality reduction: Inferring structure of unlabeled data\n",
"\n",
@ -306,7 +392,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"![](figures/05.01-dimesionality-1.png)\n",
"[figure source in Appendix](06.00-Figure-Code.ipynb#Dimensionality-Reduction-Example-Figure-1)"
@ -314,7 +403,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Visually, it is clear that there is some structure in this data: it is drawn from a one-dimensional line that is arranged in a spiral within this two-dimensional space.\n",
"In a sense, you could say that this data is \"intrinsically\" only one dimensional, though this one-dimensional data is embedded in higher-dimensional space.\n",
@ -325,7 +417,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"![](figures/05.01-dimesionality-2.png)\n",
"[figure source in Appendix](06.00-Figure-Code.ipynb#Dimensionality-Reduction-Example-Figure-2)"
@ -333,7 +428,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Notice that the colors (which represent the extracted one-dimensional latent variable) change uniformly along the spiral, which indicates that the algorithm did in fact detect the structure we saw by eye.\n",
"As with the previous examples, the power of dimensionality reduction algorithms becomes clearer in higher-dimensional cases.\n",
@ -345,7 +443,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Summary\n",
"\n",
@ -371,7 +472,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--NAVIGATION-->\n",
"< [Machine Learning](05.00-Machine-Learning.ipynb) | [Contents](Index.ipynb) | [Introducing Scikit-Learn](05.02-Introducing-Scikit-Learn.ipynb) >"


@ -2,7 +2,10 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--BOOK_INFORMATION-->\n",
"<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
@ -13,7 +16,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--NAVIGATION-->\n",
"< [What Is Machine Learning?](05.01-What-Is-Machine-Learning.ipynb) | [Contents](Index.ipynb) | [Hyperparameters and Model Validation](05.03-Hyperparameters-and-Model-Validation.ipynb) >"
@ -23,8 +29,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introducing Scikit-Learn\n",
"\n",
"# Introducing Scikit-Learn"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"There are several Python libraries which provide solid implementations of a range of machine learning algorithms.\n",
"One of the best known is [Scikit-Learn](http://scikit-learn.org), a package that provides efficient versions of a large number of common algorithms.\n",
"Scikit-Learn is characterized by a clean, uniform, and streamlined API, as well as by very useful and complete online documentation.\n",
@ -37,14 +51,20 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Data Representation in Scikit-Learn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Machine learning is about creating models from data: for that reason, we'll start by discussing how data can be represented in order to be understood by the computer.\n",
"The best way to think about data within Scikit-Learn is in terms of tables of data."
@ -52,7 +72,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Data as table\n",
"\n",
@ -65,7 +88,9 @@
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -150,7 +175,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Here each row of the data refers to a single observed flower, and the number of rows is the total number of flowers in the dataset.\n",
"In general, we will refer to the rows of the matrix as *samples*, and the number of rows as ``n_samples``.\n",
@ -161,7 +189,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"#### Features matrix\n",
"\n",
@ -178,7 +209,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"#### Target array\n",
"\n",
@ -197,7 +231,9 @@
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -219,7 +255,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"For use in Scikit-Learn, we will extract the features matrix and target array from the ``DataFrame``, which we can do using some of the Pandas ``DataFrame`` operations discussed in the [Chapter 3](03.00-Introduction-to-Pandas.ipynb):"
]
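
The suppressed code cells in this hunk presumably perform that extraction; a sketch under that assumption, with the iris dataset loaded as in the first cell of this notebook:

    import seaborn as sns

    iris = sns.load_dataset('iris')         # as presumably loaded earlier
    X_iris = iris.drop('species', axis=1)   # features matrix, shape (150, 4)
    y_iris = iris['species']                # target array, shape (150,)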
@ -228,7 +267,9 @@
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -251,7 +292,9 @@
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -272,14 +315,20 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"To summarize, the expected layout of features and target values is visualized in the following diagram:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"![](figures/05.02-samples-features.png)\n",
"[figure source in Appendix](06.00-Figure-Code.ipynb#Features-and-Labels-Grid)"
@ -287,21 +336,30 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"With this data properly formatted, we can move on to consider the *estimator* API of Scikit-Learn:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Scikit-Learn's Estimator API"
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The Scikit-Learn API is designed with the following guiding principles in mind, as outlined in the [Scikit-Learn API paper](http://arxiv.org/abs/1309.0238):\n",
"\n",
@ -324,7 +382,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Basics of the API\n",
"\n",
@ -344,7 +405,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Supervised learning example: Simple linear regression\n",
"\n",
@ -356,7 +420,9 @@
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -382,14 +448,20 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"With this data in place, we can use the recipe outlined earlier. Let's walk through the process: "
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"#### 1. Choose a class of model\n",
"\n",
@ -401,7 +473,9 @@
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -410,14 +484,20 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Note that other more general linear regression models exist as well; you can read more about them in the [``sklearn.linear_model`` module documentation](http://Scikit-Learn.org/stable/modules/linear_model.html)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"#### 2. Choose model hyperparameters\n",
"\n",
@ -444,7 +524,9 @@
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -465,7 +547,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Keep in mind that when the model is instantiated, the only action is the storing of these hyperparameter values.\n",
"In particular, we have not yet applied the model to any data: the Scikit-Learn API makes very clear the distinction between *choice of model* and *application of model to data*."
@ -473,7 +558,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"#### 3. Arrange data into a features matrix and target vector\n",
"\n",
@ -486,7 +574,9 @@
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -507,7 +597,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"#### 4. Fit the model to your data\n",
"\n",
@ -519,7 +612,9 @@
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -539,7 +634,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"This ``fit()`` command causes a number of model-dependent internal computations to take place, and the results of these computations are stored in model-specific attributes that the user can explore.\n",
"In Scikit-Learn, by convention all model parameters that were learned during the ``fit()`` process have trailing underscores; for example in this linear model, we have the following:"
@ -549,7 +647,9 @@
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -571,7 +671,9 @@
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -591,7 +693,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"These two parameters represent the slope and intercept of the simple linear fit to the data.\n",
"Comparing to the data definition, we see that they are very close to the input slope of 2 and intercept of -1.\n",
@ -604,7 +709,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"#### 5. Predict labels for unknown data\n",
"\n",
@ -617,7 +725,9 @@
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -626,7 +736,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"As before, we need to coerce these *x* values into a ``[n_samples, n_features]`` features matrix, after which we can feed it to the model:"
]
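
Continuing the sketch above, the coercion and prediction steps would look like this (``xfit`` is an assumed name):

    xfit = np.linspace(-1, 11)       # new x values to predict for
    Xfit = xfit[:, np.newaxis]       # coerce into [n_samples, n_features] form
    yfit = model.predict(Xfit)       # predicted labels for the new data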
@ -635,7 +748,9 @@
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -645,7 +760,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Finally, let's visualize the results by plotting first the raw data, and then this model fit:"
]
@ -654,7 +772,9 @@
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -675,14 +795,20 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Typically the efficacy of the model is evaluated by comparing its results to some known baseline, as we will see in the next example"
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Supervised learning example: Iris classification\n",
"\n",
@ -700,7 +826,9 @@
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -711,7 +839,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"With the data arranged, we can follow our recipe to predict the labels:"
]
@ -720,7 +851,9 @@
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -732,7 +865,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Finally, we can use the ``accuracy_score`` utility to see the fraction of predicted labels that match their true value:"
]
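
A sketch, assuming ``ytest`` and ``y_model`` hold the held-out labels and the predictions from the suppressed cells above:

    from sklearn.metrics import accuracy_score
    accuracy_score(ytest, y_model)   # fraction of predictions matching the truth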
@ -741,7 +877,9 @@
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -762,14 +900,20 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"With an accuracy topping 97%, we see that even this very naive classification algorithm is effective for this particular dataset!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Unsupervised learning example: Iris dimensionality\n",
"\n",
@ -789,7 +933,9 @@
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -801,7 +947,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Now let's plot the results. A quick way to do this is to insert the results into the original Iris ``DataFrame``, and use Seaborn's ``lmplot`` to show the results:"
]
@ -810,7 +959,9 @@
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -832,7 +983,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We see that in the two-dimensional representation, the species are fairly well separated, even though the PCA algorithm had no knowledge of the species labels!\n",
"This indicates to us that a relatively straightforward classification will probably be effective on the dataset, as we saw before."
@ -840,7 +994,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Unsupervised learning: Iris clustering\n",
"\n",
@ -856,7 +1013,9 @@
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -869,7 +1028,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"As before, we will add the cluster label to the Iris ``DataFrame`` and use Seaborn to plot the results:"
]
@ -878,7 +1040,9 @@
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -900,7 +1064,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"By splitting the data by cluster number, we see exactly how well the GMM algorithm has recovered the underlying label: the *setosa* species is separated perfectly within cluster 0, while there remains a small amount of mixing between *versicolor* and *virginica*.\n",
"This means that even without an expert to tell us the species labels of the individual flowers, the measurements of these flowers are distinct enough that we could *automatically* identify the presence of these different groups of species with a simple clustering algorithm!\n",
@ -909,14 +1076,20 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Application: Exploring Hand-written Digits"
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"To demonstrate these principles on a more interesting problem, let's consider one piece of the optical character recognition problem: the identification of hand-written digits.\n",
"In the wild, this problem involves both locating and identifying characters in an image. Here we'll take a shortcut and use Scikit-Learn's set of pre-formatted digits, which is built into the library."
@ -924,7 +1097,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Loading and visualizing the digits data\n",
"\n",
@ -935,7 +1111,9 @@
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -957,7 +1135,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The images data is a three-dimensional array: 1,797 samples each consisting of an 8 × 8 grid of pixels.\n",
"Let's visualize the first hundred of these:"
@ -967,7 +1148,9 @@
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -996,7 +1179,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"In order to work with this data within Scikit-Learn, we need a two-dimensional, ``[n_samples, n_features]`` representation.\n",
"We can accomplish this by treating each pixel in the image as a feature: that is, by flattening out the pixel arrays so that we have a length-64 array of pixel values representing each digit.\n",
@ -1008,7 +1194,9 @@
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -1031,7 +1219,9 @@
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -1052,14 +1242,20 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We see here that there are 1,797 samples and 64 features."
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Unsupervised learning: Dimensionality reduction\n",
"\n",
@ -1072,7 +1268,9 @@
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -1096,7 +1294,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We see that the projected data is now two-dimensional.\n",
"Let's plot this data to see if we can learn anything from its structure:"
@ -1106,7 +1307,9 @@
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -1130,7 +1333,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"This plot gives us some good intuition into how well various numbers are separated in the larger 64-dimensional space. For example, zeros (in black) and ones (in purple) have very little overlap in parameter space.\n",
"Intuitively, this makes sense: a zero is empty in the middle of the image, while a one will generally have ink in the middle.\n",
@ -1142,7 +1348,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Classification on digits\n",
"\n",
@ -1154,7 +1363,9 @@
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -1165,7 +1376,9 @@
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -1177,7 +1390,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Now that we have predicted our model, we can gauge its accuracy by comparing the true values of the test set to the predictions:"
]
@ -1186,7 +1402,9 @@
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -1207,7 +1425,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"With even this extremely simple model, we find about 80% accuracy for classification of the digits!\n",
"However, this single number doesn't tell us *where* we've gone wrong—one nice way to do this is to use the *confusion matrix*, which we can compute with Scikit-Learn and plot with Seaborn:"
@ -1217,7 +1438,9 @@
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -1243,7 +1466,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"This shows us where the mis-labeled points tend to be: for example, a large number of twos here are mis-classified as either ones or eights.\n",
"Another way to gain intuition into the characteristics of the model is to plot the inputs again, with their predicted labels.\n",
@ -1254,7 +1480,9 @@
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -1284,7 +1512,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Examining this subset of the data, we can gain insight regarding where the algorithm might be not performing optimally.\n",
"To go beyond our 80% classification rate, we might move to a more sophisticated algorithm such as support vector machines (see [In-Depth: Support Vector Machines](05.07-Support-Vector-Machines.ipynb)), random forests (see [In-Depth: Decision Trees and Random Forests](05.08-Random-Forests.ipynb)) or another classification approach."
@ -1292,14 +1523,20 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Summary"
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"In this section we have covered the essential features of the Scikit-Learn data representation, and the estimator API.\n",
"Regardless of the type of estimator, the same import/instantiate/fit/predict pattern holds.\n",
@ -1310,7 +1547,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--NAVIGATION-->\n",
"< [What Is Machine Learning?](05.01-What-Is-Machine-Learning.ipynb) | [Contents](Index.ipynb) | [Hyperparameters and Model Validation](05.03-Hyperparameters-and-Model-Validation.ipynb) >"


@ -2,7 +2,10 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--BOOK_INFORMATION-->\n",
"<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
@ -13,20 +16,30 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--NAVIGATION-->\n",
"< [Introducing Scikit-Learn](05.02-Introducing-Scikit-Learn.ipynb) | [Contents](Index.ipynb) | [Feature Engineering](05.04-Feature-Engineering.ipynb) >"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Hyperparameters and Model Validation"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"source": [
"# Hyperparameters and Model Validation\n",
"\n",
"In the previous section, we saw the basic recipe for applying a supervised machine learning model:\n",
"\n",
"1. Choose a class of model\n",
@ -41,7 +54,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Thinking about Model Validation\n",
"\n",
@ -54,7 +70,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Model validation the wrong way\n",
"\n",
@ -66,7 +85,9 @@
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -78,7 +99,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Next we choose a model and hyperparameters. Here we'll use a *k*-neighbors classifier with ``n_neighbors=1``.\n",
"This is a very simple and intuitive model that says \"the label of an unknown point is the same as the label of its closest training point:\""
@ -88,7 +112,9 @@
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -98,7 +124,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Then we train the model, and use it to predict labels for data we already know:"
]
@ -107,7 +136,9 @@
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -117,7 +148,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Finally, we compute the fraction of correctly labeled points:"
]
@ -126,7 +160,9 @@
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -147,7 +183,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We see an accuracy score of 1.0, which indicates that 100% of points were correctly labeled by our model!\n",
"But is this truly measuring the expected accuracy? Have we really come upon a model that we expect to be correct 100% of the time?\n",
@ -159,7 +198,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Model validation the right way: Holdout sets\n",
"\n",
@ -172,7 +214,9 @@
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -202,7 +246,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We see here a more reasonable result: the nearest-neighbor classifier is about 90% accurate on this hold-out set.\n",
"The hold-out set is similar to unknown data, because the model has not \"seen\" it before."
@ -210,7 +257,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Model validation via cross-validation\n",
"\n",
@ -232,7 +282,9 @@
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -254,7 +306,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"What comes out are two accuracy scores, which we could combine (by, say, taking the mean) to get a better measure of the global model performance.\n",
"This particular form of cross-validation is a *two-fold cross-validation*—that is, one in which we have split the data into two sets and used each in turn as a validation set.\n",
@ -272,7 +327,9 @@
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -293,7 +350,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Repeating the validation across different subsets of the data gives us an even better idea of the performance of the algorithm.\n",
"\n",
@ -306,7 +366,9 @@
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -339,7 +401,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Because we have 150 samples, the leave one out cross-validation yields scores for 150 trials, and the score indicates either successful (1.0) or unsuccessful (0.0) prediction.\n",
"Taking the mean of these gives an estimate of the error rate:"
@ -349,7 +414,9 @@
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -369,7 +436,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Other cross-validation schemes can be used similarly.\n",
"For a description of what is available in Scikit-Learn, use IPython to explore the ``sklearn.cross_validation`` submodule, or take a look at Scikit-Learn's online [cross-validation documentation](http://scikit-learn.org/stable/modules/cross_validation.html)."
@ -377,7 +447,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Selecting the Best Model\n",
"\n",
@ -399,7 +472,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### The Bias-variance trade-off\n",
"\n",
@ -422,7 +498,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"To look at this in another light, consider what happens if we use these two models to predict the y-value for some new data.\n",
"In the following diagrams, the red/lighter points indicate data that is omitted from the training set:\n",
@ -439,7 +518,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"If we imagine that we have some ability to tune the model complexity, we would expect the training score and validation score to behave as illustrated in the following figure:\n",
"\n",
@ -459,7 +541,9 @@
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"source": [
"### Validation curves in Scikit-Learn\n",
@ -487,7 +571,9 @@
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -503,7 +589,9 @@
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"source": [
"Now let's create some data to which we will fit our model:"
@ -513,7 +601,9 @@
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -533,7 +623,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We can now visualize our data, along with polynomial fits of several degrees:"
]
@ -542,7 +635,9 @@
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -575,7 +670,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The knob controlling model complexity in this case is the degree of the polynomial, which can be any non-negative integer.\n",
"A useful question to answer is this: what degree of polynomial provides a suitable trade-off between bias (under-fitting) and variance (over-fitting)?\n",
@ -588,7 +686,9 @@
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -618,7 +718,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"This shows precisely the qualitative behavior we expect: the training score is everywhere higher than the validation score; the training score is monotonically improving with increased model complexity; and the validation score reaches a maximum before dropping off as the model becomes over-fit.\n",
"\n",
@ -629,7 +732,9 @@
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -653,14 +758,20 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Notice that finding this optimal model did not actually require us to compute the training score, but examining the relationship between the training score and validation score can give us useful insight into the performance of the model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Learning Curves\n",
"\n",
@ -672,7 +783,9 @@
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -693,7 +806,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We will duplicate the preceding code to plot the validation curve for this larger dataset; for reference let's over-plot the previous results as well:"
]
@ -702,7 +818,9 @@
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -733,7 +851,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The solid lines show the new results, while the fainter dashed lines show the results of the previous smaller dataset.\n",
"It is clear from the validation curve that the larger dataset can support a much more complicated model: the peak here is probably around a degree of 6, but even a degree-20 model is not seriously over-fitting the data—the validation and training scores remain very close.\n",
@ -753,7 +874,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"![](figures/05.03-learning-curve.png)\n",
"[figure source in Appendix](06.00-Figure-Code.ipynb#Learning-Curve)"
@ -761,7 +885,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The notable feature of the learning curve is the convergence to a particular score as the number of training samples grows.\n",
"In particular, once you have enough points that a particular model has converged, *adding more training data will not help you!*\n",
@ -770,7 +897,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Learning curves in Scikit-Learn\n",
"\n",
@ -781,7 +911,9 @@
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -821,7 +953,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"This is a valuable diagnostic, because it gives us a visual depiction of how our model responds to increasing training data.\n",
"In particular, when your learning curve has already converged (i.e., when the training and validation curves are already close to each other) *adding more training data will not significantly improve the fit!*\n",
@ -836,7 +971,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Validation in Practice: Grid Search\n",
"\n",
@ -854,7 +992,9 @@
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -869,7 +1009,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Notice that like a normal estimator, this has not yet been applied to any data.\n",
"Calling the ``fit()`` method will fit the model at each grid point, keeping track of the scores along the way:"
@ -879,7 +1022,9 @@
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -888,7 +1033,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Now that this is fit, we can ask for the best parameters as follows:"
]
@ -897,7 +1045,9 @@
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -919,7 +1069,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Finally, if we wish, we can use the best model and show the fit to our data using code from before:"
]
@ -928,7 +1081,9 @@
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -954,7 +1109,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The grid search provides many more options, including the ability to specify a custom scoring function, to parallelize the computations, to do randomized searches, and more.\n",
"For information, see the examples in [In-Depth: Kernel Density Estimation](05.13-Kernel-Density-Estimation.ipynb) and [Feature Engineering: Working with Images](05.14-Image-Features.ipynb), or refer to Scikit-Learn's [grid search documentation](http://Scikit-Learn.org/stable/modules/grid_search.html)."
@ -962,7 +1120,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Summary\n",
"\n",
@ -975,7 +1136,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--NAVIGATION-->\n",
"< [Introducing Scikit-Learn](05.02-Introducing-Scikit-Learn.ipynb) | [Contents](Index.ipynb) | [Feature Engineering](05.04-Feature-Engineering.ipynb) >"


@ -2,7 +2,10 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--BOOK_INFORMATION-->\n",
"<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
@ -13,7 +16,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--NAVIGATION-->\n",
"< [Hyperparameters and Model Validation](05.03-Hyperparameters-and-Model-Validation.ipynb) | [Contents](Index.ipynb) | [In Depth: Naive Bayes Classification](05.05-Naive-Bayes.ipynb) >"
@ -23,8 +29,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Feature Engineering\n",
"\n",
"# Feature Engineering"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The previous sections outline the fundamental ideas of machine learning, but all of the examples assume that you have numerical data in a tidy, ``[n_samples, n_features]`` format.\n",
"In the real world, data rarely comes in such a form.\n",
"With this in mind, one of the more important steps in using machine learning in practice is *feature engineering*: that is, taking whatever information you have about your problem and turning it into numbers that you can use to build your feature matrix.\n",
@ -36,7 +50,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Categorical Features\n",
"\n",
@ -49,7 +66,9 @@
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -63,7 +82,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"You might be tempted to encode this data with a straightforward numerical mapping:"
]
@ -72,7 +94,9 @@
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -81,7 +105,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"It turns out that this is not generally a useful approach in Scikit-Learn: the package's models make the fundamental assumption that numerical features reflect algebraic quantities.\n",
"Thus such a mapping would imply, for example, that *Queen Anne < Fremont < Wallingford*, or even that *Wallingford - Queen Anne = Fremont*, which (niche demographic jokes aside) does not make much sense.\n",
@ -94,7 +121,9 @@
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -119,7 +148,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Notice that the 'neighborhood' column has been expanded into three separate columns, representing the three neighborhood labels, and that each row has a 1 in the column associated with its neighborhood.\n",
"With these categorical features thus encoded, you can proceed as normal with fitting a Scikit-Learn model.\n",
@ -131,7 +163,9 @@
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -155,7 +189,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"There is one clear disadvantage of this approach: if your category has many possible values, this can *greatly* increase the size of your dataset.\n",
"However, because the encoded data contains mostly zeros, a sparse output can be a very efficient solution:"
@ -165,7 +202,9 @@
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -187,14 +226,20 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Many (though not yet all) of the Scikit-Learn estimators accept such sparse inputs when fitting and evaluating models. ``sklearn.preprocessing.OneHotEncoder`` and ``sklearn.feature_extraction.FeatureHasher`` are two additional tools that Scikit-Learn includes to support this type of encoding."
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Text Features\n",
"\n",
@ -209,7 +254,9 @@
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -220,7 +267,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"For a vectorization of this data based on word count, we could construct a column representing the word \"problem,\" the word \"evil,\" the word \"horizon,\" and so on.\n",
"While doing this by hand would be possible, the tedium can be avoided by using Scikit-Learn's ``CountVectorizer``:"
@ -230,7 +280,9 @@
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -255,7 +307,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The result is a sparse matrix recording the number of times each word appears; it is easier to inspect if we convert this to a ``DataFrame`` with labeled columns:"
]
@ -264,7 +319,9 @@
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -330,7 +387,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"There are some issues with this approach, however: the raw word counts lead to features which put too much weight on words that appear very frequently, and this can be sub-optimal in some classification algorithms.\n",
"One approach to fix this is known as *term frequency-inverse document frequency* (*TFIDF*) which weights the word counts by a measure of how often they appear in the documents.\n",
@ -341,7 +401,9 @@
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -409,14 +471,20 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"For an example of using TF-IDF in a classification problem, see [In Depth: Naive Bayes Classification](05.05-Naive-Bayes.ipynb)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Image Features\n",
"\n",
@ -430,7 +498,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Derived Features\n",
"\n",
@ -446,7 +517,9 @@
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -472,7 +545,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Still, we can fit a line to the data using ``LinearRegression`` and get the optimal result:"
]
@ -481,7 +557,9 @@
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -506,7 +584,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"It's clear that we need a more sophisticated model to describe the relationship between $x$ and $y$.\n",
"\n",
@ -518,7 +599,9 @@
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -542,7 +625,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The derived feature matrix has one column representing $x$, and a second column representing $x^2$, and a third column representing $x^3$.\n",
"Computing a linear regression on this expanded input gives a much closer fit to our data:"
@ -552,7 +638,9 @@
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -575,7 +663,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"This idea of improving a model not by changing the model, but by transforming the inputs, is fundamental to many of the more powerful machine learning methods.\n",
"We explore this idea further in [In Depth: Linear Regression](05.06-Linear-Regression.ipynb) in the context of *basis function regression*.\n",
@ -584,7 +675,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Imputation of Missing Data\n",
"\n",
@ -597,7 +691,9 @@
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -612,7 +708,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"When applying a typical machine learning model to such data, we will need to first replace such missing data with some appropriate fill value.\n",
"This is known as *imputation* of missing values, and strategies range from simple (e.g., replacing missing values with the mean of the column) to sophisticated (e.g., using matrix completion or a robust model to handle such data).\n",
@ -625,7 +724,9 @@
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -652,7 +753,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We see that in the resulting data, the two missing values have been replaced with the mean of the remaining values in the column. This imputed data can then be fed directly into, for example, a ``LinearRegression`` estimator:"
]
@ -661,7 +765,9 @@
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -682,7 +788,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Feature Pipelines\n",
"\n",
@ -700,7 +809,9 @@
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -713,7 +824,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"This pipeline looks and acts like a standard Scikit-Learn object, and will apply all the specified steps to any input data."
]
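A hedged sketch of such a pipeline, continuing the imputation example and again substituting the modern `SimpleImputer`:

```python
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

model = make_pipeline(SimpleImputer(strategy='mean'),
                      PolynomialFeatures(degree=2),
                      LinearRegression())

model.fit(X, y)     # X contains the missing values; y is the target vector
model.predict(X)    # imputation and feature expansion happen automatically
```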
@ -722,7 +836,9 @@
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -742,7 +858,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"All the steps of the model are applied automatically.\n",
"Notice that for the simplicity of this demonstration, we've applied the model to the data it was trained on; this is why it was able to perfectly predict the result (refer back to [Hyperparameters and Model Validation](05.03-Hyperparameters-and-Model-Validation.ipynb) for further discussion of this).\n",
@ -752,7 +871,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--NAVIGATION-->\n",
"< [Hyperparameters and Model Validation](05.03-Hyperparameters-and-Model-Validation.ipynb) | [Contents](Index.ipynb) | [In Depth: Naive Bayes Classification](05.05-Naive-Bayes.ipynb) >"

View File

@ -2,7 +2,10 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--BOOK_INFORMATION-->\n",
"<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
@ -13,7 +16,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--NAVIGATION-->\n",
"< [Feature Engineering](05.04-Feature-Engineering.ipynb) | [Contents](Index.ipynb) | [In Depth: Linear Regression](05.06-Linear-Regression.ipynb) >"
@ -23,8 +29,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# In Depth: Naive Bayes Classification\n",
"\n",
"# In Depth: Naive Bayes Classification"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The previous four sections have given a general overview of the concepts of machine learning.\n",
"In this section and the ones that follow, we will be taking a closer look at several specific algorithms for supervised and unsupervised learning, starting here with naive Bayes classification.\n",
"\n",
@ -35,7 +49,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Bayesian Classification\n",
"\n",
@ -69,7 +86,9 @@
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -81,7 +100,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Gaussian Naive Bayes\n",
"\n",
@ -94,7 +116,9 @@
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -116,7 +140,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"One extremely fast way to create a simple model is to assume that the data is described by a Gaussian distribution with no covariance between dimensions.\n",
"This model can be fit by simply finding the mean and standard deviation of the points within each label, which is all you need to define such a distribution.\n",
@ -125,7 +152,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"![(run code in Appendix to generate image)](figures/05.05-gaussian-NB.png)\n",
"[figure source in Appendix](06.00-Figure-Code.ipynb#Gaussian-Naive-Bayes)"
@ -134,7 +164,9 @@
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"source": [
"The ellipses here represent the Gaussian generative model for each label, with larger probability toward the center of the ellipses.\n",
@ -147,7 +179,9 @@
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -158,7 +192,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Now let's generate some new data and predict the label:"
]
@ -167,7 +204,9 @@
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -178,7 +217,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Now we can plot this new data to get an idea of where the decision boundary is:"
]
@ -187,7 +229,9 @@
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -210,7 +254,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We see a slightly curved boundary in the classifications—in general, the boundary in Gaussian naive Bayes is quadratic.\n",
"\n",
@ -221,7 +268,9 @@
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -249,7 +298,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The columns give the posterior probabilities of the first and second label, respectively.\n",
"If you are looking for estimates of uncertainty in your classification, Bayesian approaches like this can be a useful approach.\n",
@ -260,7 +312,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Multinomial Naive Bayes\n",
"\n",
@ -273,7 +328,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Example: Classifying Text\n",
"\n",
@ -287,7 +345,9 @@
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -329,7 +389,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"For simplicity here, we will select just a few of these categories, and download the training and testing set:"
]
@ -338,7 +401,9 @@
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -350,7 +415,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Here is a representative entry from the data:"
]
@ -359,7 +427,9 @@
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -393,7 +463,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"In order to use this data for machine learning, we need to be able to convert the content of each string into a vector of numbers.\n",
"For this we will use the TF-IDF vectorizer (discussed in [Feature Engineering](05.04-Feature-Engineering.ipynb)), and create a pipeline that attaches it to a multinomial naive Bayes classifier:"
@ -403,7 +476,9 @@
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -416,7 +491,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"With this pipeline, we can apply the model to the training data, and predict labels for the test data:"
]
@ -425,7 +503,9 @@
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -435,7 +515,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Now that we have predicted the labels for the test data, we can evaluate them to learn about the performance of the estimator.\n",
"For example, here is the confusion matrix between the true and predicted labels for the test data:"
@ -445,7 +528,9 @@
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -470,7 +555,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Evidently, even this very simple classifier can successfully separate space talk from computer talk, but it gets confused between talk about religion and talk about Christianity.\n",
"This is perhaps an expected area of confusion!\n",
@ -483,7 +571,9 @@
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -494,7 +584,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Let's try it out:"
]
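The helper and its invocations are collapsed in the diff; a hedged reconstruction (the example strings are assumptions) is:

```python
def predict_category(s, train=train, model=model):
    """Return the newsgroup name predicted for a single string."""
    pred = model.predict([s])
    return train.target_names[pred[0]]

predict_category('sending a payload to the ISS')        # e.g. 'sci.space'
predict_category('determining the screen resolution')   # e.g. 'comp.graphics'
```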
@ -503,7 +596,9 @@
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -525,7 +620,9 @@
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -547,7 +644,9 @@
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -567,7 +666,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Remember that this is nothing more sophisticated than a simple probability model for the (weighted) frequency of each word in the string; nevertheless, the result is striking.\n",
"Even a very naive algorithm, when used carefully and trained on a large set of high-dimensional data, can be surprisingly effective."
@ -575,7 +677,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## When to Use Naive Bayes\n",
"\n",
@ -604,7 +709,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--NAVIGATION-->\n",
"< [Feature Engineering](05.04-Feature-Engineering.ipynb) | [Contents](Index.ipynb) | [In Depth: Linear Regression](05.06-Linear-Regression.ipynb) >"

View File

@ -2,7 +2,10 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--BOOK_INFORMATION-->\n",
"<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
@ -13,7 +16,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--NAVIGATION-->\n",
"< [In Depth: Naive Bayes Classification](05.05-Naive-Bayes.ipynb) | [Contents](Index.ipynb) | [In-Depth: Support Vector Machines](05.07-Support-Vector-Machines.ipynb) >"
@ -23,8 +29,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# In Depth: Linear Regression\n",
"\n",
"# In Depth: Linear Regression"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Just as naive Bayes (discussed earlier in [In Depth: Naive Bayes Classification](05.05-Naive-Bayes.ipynb)) is a good starting point for classification tasks, linear regression models are a good starting point for regression tasks.\n",
"Such models are popular because they can be fit very quickly, and are very interpretable.\n",
"You are probably familiar with the simplest form of a linear regression model (i.e., fitting a straight line to data) but such models can be extended to model more complicated data behavior.\n",
@ -38,7 +52,9 @@
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -50,7 +66,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Simple Linear Regression\n",
"\n",
@ -68,7 +87,9 @@
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -91,7 +112,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We can use Scikit-Learn's ``LinearRegression`` estimator to fit this data and construct the best-fit line:"
]
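A hedged sketch of the fit, assuming the scattered `x`, `y` data generated above:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

model = LinearRegression(fit_intercept=True)
model.fit(x[:, np.newaxis], y)      # Scikit-Learn expects a 2D feature matrix

xfit = np.linspace(0, 10, 1000)
yfit = model.predict(xfit[:, np.newaxis])

plt.scatter(x, y)
plt.plot(xfit, yfit);
```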
@ -100,7 +124,9 @@
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -129,7 +155,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The slope and intercept of the data are contained in the model's fit parameters, which in Scikit-Learn are always marked by a trailing underscore.\n",
"Here the relevant parameters are ``coef_`` and ``intercept_``:"
@ -139,7 +168,9 @@
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -158,14 +189,20 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We see that the results are very close to the inputs, as we might hope."
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The ``LinearRegression`` estimator is much more capable than this, however—in addition to simple straight-line fits, it can also handle multidimensional linear models of the form\n",
"$$\n",
@ -181,7 +218,9 @@
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -205,7 +244,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Here the $y$ data is constructed from three random $x$ values, and the linear regression recovers the coefficients used to construct the data.\n",
"\n",
@ -215,7 +257,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Basis Function Regression\n",
"\n",
@ -238,7 +283,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Polynomial basis functions\n",
"\n",
@ -249,7 +297,9 @@
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -274,7 +324,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We see here that the transformer has converted our one-dimensional array into a three-dimensional array by taking the exponent of each value.\n",
"This new, higher-dimensional data representation can then be plugged into a linear regression.\n",
@ -287,7 +340,9 @@
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -298,7 +353,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"With this transform in place, we can use the linear model to fit much more complicated relationships between $x$ and $y$. \n",
"For example, here is a sine wave with noise:"
@ -308,7 +366,9 @@
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -336,14 +396,20 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Our linear model, through the use of 7th-order polynomial basis functions, can provide an excellent fit to this non-linear data!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Gaussian basis functions\n",
"\n",
@ -354,7 +420,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"![](figures/05.06-gaussian-basis.png)\n",
"[figure source in Appendix](#Gaussian-Basis)"
@ -362,7 +431,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The shaded regions in the plot are the scaled basis functions, and when added together they reproduce the smooth curve through the data.\n",
"These Gaussian basis functions are not built into Scikit-Learn, but we can write a custom transformer that will create them, as shown here and illustrated in the following figure (Scikit-Learn transformers are implemented as Python classes; reading Scikit-Learn's source is a good way to see how they can be created):"
@ -372,7 +444,9 @@
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -423,14 +497,20 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We put this example here just to make clear that there is nothing magic about polynomial basis functions: if you have some sort of intuition into the generating process of your data that makes you think one basis or another might be appropriate, you can use them as well."
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Regularization\n",
"\n",
@ -442,7 +522,9 @@
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -470,7 +552,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"With the data projected to the 30-dimensional basis, the model has far too much flexibility and goes to extreme values between locations where it is constrained by data.\n",
"We can see the reason for this if we plot the coefficients of the Gaussian bases with respect to their locations:"
@ -480,7 +565,9 @@
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -517,7 +604,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The lower panel of this figure shows the amplitude of the basis function at each location.\n",
"This is typical over-fitting behavior when basis functions overlap: the coefficients of adjacent basis functions blow up and cancel each other out.\n",
@ -527,7 +617,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Ridge regression ($L_2$ Regularization)\n",
"\n",
@ -544,7 +637,9 @@
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -566,7 +661,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The $\\alpha$ parameter is essentially a knob controlling the complexity of the resulting model.\n",
"In the limit $\\alpha \\to 0$, we recover the standard linear regression result; in the limit $\\alpha \\to \\infty$, all model responses will be suppressed.\n",
@ -575,7 +673,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Lasso regression ($L_1$ regularization)\n",
"\n",
@ -592,7 +693,9 @@
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -614,7 +717,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"With the lasso regression penalty, the majority of the coefficients are exactly zero, with the functional behavior being modeled by a small subset of the available basis functions.\n",
"As with ridge regularization, the $\\alpha$ parameter tunes the strength of the penalty, and should be determined via, for example, cross-validation (refer back to [Hyperparameters and Model Validation](05.03-Hyperparameters-and-Model-Validation.ipynb) for a discussion of this)."
@ -622,7 +728,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Example: Predicting Bicycle Traffic"
]
@ -630,7 +739,9 @@
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"source": [
"As an example, let's take a look at whether we can predict the number of bicycle trips across Seattle's Fremont Bridge based on weather, season, and other factors.\n",
@ -650,7 +761,9 @@
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -661,7 +774,9 @@
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -672,7 +787,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Next we will compute the total daily bicycle traffic, and put this in its own dataframe:"
]
@ -681,7 +799,9 @@
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -692,7 +812,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We saw previously that the patterns of use generally vary from day to day; let's account for this in our data by adding binary columns that indicate the day of the week:"
]
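The cell is collapsed in the diff; presumably it builds one indicator column per weekday, along these lines:

```python
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
for i in range(7):
    # 1.0 on the matching day of the week, 0.0 otherwise
    daily[days[i]] = (daily.index.dayofweek == i).astype(float)
```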
@ -701,7 +824,9 @@
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -712,7 +837,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Similarly, we might expect riders to behave differently on holidays; let's add an indicator of this as well:"
]
@ -721,7 +849,9 @@
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -734,7 +864,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We also might suspect that the hours of daylight would affect how many people ride; let's use the standard astronomical calculation to add this information:"
]
@ -743,7 +876,9 @@
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -782,7 +917,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We can also add the average temperature and total precipitation to the data.\n",
"In addition to the inches of precipitation, let's add a flag that indicates whether a day is dry (has zero precipitation):"
@ -792,7 +930,9 @@
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -810,7 +950,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Finally, let's add a counter that increases from day 1, and measures how many years have passed.\n",
"This will let us measure any observed annual increase or decrease in daily crossings:"
@ -820,7 +963,9 @@
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -829,7 +974,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Now our data is in order, and we can take a look at it:"
]
@ -838,7 +986,9 @@
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -1001,7 +1151,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"With this in place, we can choose the columns to use, and fit a linear regression model to our data.\n",
"We will set ``fit_intercept = False``, because the daily flags essentially operate as their own day-specific intercepts:"
@ -1011,7 +1164,9 @@
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -1030,7 +1185,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Finally, we can compare the total and predicted bicycle traffic visually:"
]
@ -1039,7 +1197,9 @@
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -1059,7 +1219,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"It is evident that we have missed some key features, especially during the summer time.\n",
"Either our features are not complete (i.e., people decide whether to ride to work based on more than just these) or there are some nonlinear relationships that we have failed to take into account (e.g., perhaps people ride less at both high and low temperatures).\n",
@ -1070,7 +1233,9 @@
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -1104,7 +1269,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"These numbers are difficult to interpret without some measure of their uncertainty.\n",
"We can compute these uncertainties quickly using bootstrap resamplings of the data:"
@ -1114,7 +1282,9 @@
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -1126,7 +1296,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"With these errors estimated, let's again look at the results:"
]
@ -1135,7 +1308,9 @@
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -1166,7 +1341,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We first see that there is a relatively stable trend in the weekly baseline: there are many more riders on weekdays than on weekends and holidays.\n",
"We see that for each additional hour of daylight, 129 ± 9 more people choose to ride; a temperature increase of one degree Celsius encourages 65 ± 4 people to grab their bicycle; a dry day means an average of 548 ± 33 more riders, and each inch of precipitation means 665 ± 62 more people leave their bike at home.\n",
@ -1179,7 +1357,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--NAVIGATION-->\n",
"< [In Depth: Naive Bayes Classification](05.05-Naive-Bayes.ipynb) | [Contents](Index.ipynb) | [In-Depth: Support Vector Machines](05.07-Support-Vector-Machines.ipynb) >"

View File

@ -2,7 +2,10 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--BOOK_INFORMATION-->\n",
"<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
@ -13,7 +16,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--NAVIGATION-->\n",
"< [In-Depth: Decision Trees and Random Forests](05.08-Random-Forests.ipynb) | [Contents](Index.ipynb) | [In-Depth: Manifold Learning](05.10-Manifold-Learning.ipynb) >"
@ -23,8 +29,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# In Depth: Principal Component Analysis\n",
"\n",
"# In Depth: Principal Component Analysis"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Up until now, we have been looking in depth at supervised learning estimators: those estimators that predict labels based on labeled training data.\n",
"Here we begin looking at several unsupervised estimators, which can highlight interesting aspects of the data without reference to any known labels.\n",
"\n",
@ -39,7 +53,9 @@
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -51,7 +67,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Introducing Principal Component Analysis\n",
"\n",
@ -64,7 +83,9 @@
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -87,7 +108,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"By eye, it is clear that there is a nearly linear relationship between the x and y variables.\n",
"This is reminiscent of the linear regression data we explored in [In Depth: Linear Regression](05.06-Linear-Regression.ipynb), but the problem setting here is slightly different: rather than attempting to *predict* the y values from the x values, the unsupervised learning problem attempts to learn about the *relationship* between the x and y values.\n",
@ -100,7 +124,9 @@
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -122,7 +148,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The fit learns some quantities from the data, most importantly the \"components\" and \"explained variance\":"
]
@ -131,7 +160,9 @@
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -151,7 +182,9 @@
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -168,7 +201,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"To see what these numbers mean, let's visualize them as vectors over the input data, using the \"components\" to define the direction of the vector, and the \"explained variance\" to define the squared-length of the vector:"
]
@ -177,7 +213,9 @@
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -209,7 +247,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"These vectors represent the *principal axes* of the data, and the length of the vector is an indication of how \"important\" that axis is in describing the distribution of the data—more precisely, it is a measure of the variance of the data when projected onto that axis.\n",
"The projection of each data point onto the principal axes are the \"principal components\" of the data.\n",
@ -219,7 +260,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"![](figures/05.09-PCA-rotation.png)\n",
"[figure source in Appendix](06.00-Figure-Code.ipynb#Principal-Components-Rotation)"
@ -227,7 +271,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"This transformation from data axes to principal axes is an *affine transformation*, which basically means it is composed of a translation, rotation, and uniform scaling.\n",
"\n",
@ -236,7 +283,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### PCA as dimensionality reduction\n",
"\n",
@ -249,7 +299,9 @@
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -271,7 +323,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The transformed data has been reduced to a single dimension.\n",
"To understand the effect of this dimensionality reduction, we can perform the inverse transform of this reduced data and plot it along with the original data:"
@ -281,7 +336,9 @@
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -304,7 +361,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The light points are the original data, while the dark points are the projected version.\n",
"This makes clear what a PCA dimensionality reduction means: the information along the least important principal axis or axes is removed, leaving only the component(s) of the data with the highest variance.\n",
@ -315,7 +375,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### PCA for visualization: Hand-written digits\n",
"\n",
@ -329,7 +392,9 @@
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -351,7 +416,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Recall that the data consists of 8×8 pixel images, meaning that they are 64-dimensional.\n",
"To gain some intuition into the relationships between these points, we can use PCA to project them to a more manageable number of dimensions, say two:"
@ -361,7 +429,9 @@
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -382,7 +452,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We can now plot the first two principal components of each point to learn about the data:"
]
@ -391,7 +464,9 @@
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -416,7 +491,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Recall what these components mean: the full data is a 64-dimensional point cloud, and these points are the projection of each data point along the directions with the largest variance.\n",
"Essentially, we have found the optimal stretch and rotation in 64-dimensional space that allows us to see the layout of the digits in two dimensions, and have done this in an unsupervised manner—that is, without reference to the labels."
@ -424,7 +502,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### What do the components mean?\n",
"\n",
@ -450,7 +531,9 @@
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"source": [
"![](figures/05.09-digits-pixel-components.png)\n",
@ -459,7 +542,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The upper row of panels shows the individual pixels, and the lower row shows the cumulative contribution of these pixels to the construction of the image.\n",
"Using only eight of the pixel-basis components, we can only construct a small portion of the 64-pixel image.\n",
@ -468,7 +554,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"But the pixel-wise representation is not the only choice of basis. We can also use other basis functions, which each contain some pre-defined contribution from each pixel, and write something like\n",
"\n",
@ -484,7 +573,9 @@
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"source": [
"![](figures/05.09-digits-pca-components.png)\n",
@ -493,7 +584,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Unlike the pixel basis, the PCA basis allows us to recover the salient features of the input image with just a mean plus eight components!\n",
"The amount of each pixel in each component is the corollary of the orientation of the vector in our two-dimensional example.\n",
@ -502,7 +596,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Choosing the number of components\n",
"\n",
@ -514,7 +611,9 @@
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -537,7 +636,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"This curve quantifies how much of the total, 64-dimensional variance is contained within the first $N$ components.\n",
"For example, we see that with the digits the first 10 components contain approximately 75% of the variance, while you need around 50 components to describe close to 100% of the variance.\n",
@ -547,7 +649,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## PCA as Noise Filtering\n",
"\n",
@ -563,7 +668,9 @@
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -591,7 +698,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Now lets add some random noise to create a noisy dataset, and re-plot it:"
]
@ -600,7 +710,9 @@
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -622,7 +734,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"It's clear by eye that the images are noisy, and contain spurious pixels.\n",
"Let's train a PCA on the noisy data, requesting that the projection preserve 50% of the variance:"
@ -632,7 +747,9 @@
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -653,7 +770,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Here 50% of the variance amounts to 12 principal components.\n",
"Now we compute these components, and then use the inverse of the transform to reconstruct the filtered digits:"
@ -663,7 +783,9 @@
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -685,14 +807,20 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"This signal preserving/noise filtering property makes PCA a very useful feature selection routine—for example, rather than training a classifier on very high-dimensional data, you might instead train the classifier on the lower-dimensional representation, which will automatically serve to filter out random noise in the inputs."
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Example: Eigenfaces\n",
"\n",
@ -705,7 +833,9 @@
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -727,7 +857,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Let's take a look at the principal axes that span this dataset.\n",
"Because this is a large dataset, we will use ``RandomizedPCA``—it contains a randomized method to approximate the first $N$ principal components much more quickly than the standard ``PCA`` estimator, and thus is very useful for high-dimensional data (here, a dimensionality of nearly 3,000).\n",
@ -738,7 +871,9 @@
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -761,7 +896,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"In this case, it can be interesting to visualize the images associated with the first several principal components (these components are technically known as \"eigenvectors,\"\n",
"so these types of images are often called \"eigenfaces\").\n",
@ -772,7 +910,9 @@
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -796,7 +936,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The results are very interesting, and give us insight into how the images vary: for example, the first few eigenfaces (from the top left) seem to be associated with the angle of lighting on the face, and later principal vectors seem to be picking out certain features, such as eyes, noses, and lips.\n",
"Let's take a look at the cumulative variance of these components to see how much of the data information the projection is preserving:"
@ -806,7 +949,9 @@
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -828,7 +973,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We see that these 150 components account for just over 90% of the variance.\n",
"That would lead us to believe that using these 150 components, we would recover most of the essential characteristics of the data.\n",
@ -839,7 +987,9 @@
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -853,7 +1003,9 @@
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -882,7 +1034,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The top row here shows the input images, while the bottom row shows the reconstruction of the images from just 150 of the ~3,000 initial features.\n",
"This visualization makes clear why the PCA feature selection used in [In-Depth: Support Vector Machines](05.07-Support-Vector-Machines.ipynb) was so successful: although it reduces the dimensionality of the data by nearly a factor of 20, the projected images contain enough information that we might, by eye, recognize the individuals in the image.\n",
@ -891,7 +1046,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Principal Component Analysis Summary\n",
"\n",
@ -910,7 +1068,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--NAVIGATION-->\n",
"< [In-Depth: Decision Trees and Random Forests](05.08-Random-Forests.ipynb) | [Contents](Index.ipynb) | [In-Depth: Manifold Learning](05.10-Manifold-Learning.ipynb) >"

View File

@ -2,7 +2,10 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--BOOK_INFORMATION-->\n",
"<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
@ -13,7 +16,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--NAVIGATION-->\n",
"< [In Depth: k-Means Clustering](05.11-K-Means.ipynb) | [Contents](Index.ipynb) | [In-Depth: Kernel Density Estimation](05.13-Kernel-Density-Estimation.ipynb) >"
@ -23,8 +29,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# In Depth: Gaussian Mixture Models\n",
"\n",
"# In Depth: Gaussian Mixture Models"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The *k*-means clustering model explored in the previous section is simple and relatively easy to understand, but its simplicity leads to practical challenges in its application.\n",
"In particular, the non-probabilistic nature of *k*-means and its use of simple distance-from-cluster-center to assign cluster membership leads to poor performance for many real-world situations.\n",
"In this section we will take a look at Gaussian mixture models (GMMs), which can be viewed as an extension of the ideas behind *k*-means, but can also be a powerful tool for estimation beyond simple clustering.\n",
@ -36,7 +50,9 @@
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@ -48,7 +64,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Motivating GMM: Weaknesses of k-Means\n",
"\n",
@@ -62,7 +81,9 @@
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@@ -77,7 +98,9 @@
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -101,7 +124,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"From an intuitive standpoint, we might expect that the clustering assignment for some points is more certain than others: for example, there appears to be a very slight overlap between the two middle clusters, such that we might not have complete confidence in the cluster assigment of points between them.\n",
"Unfortunately, the *k*-means model has no intrinsic measure of probability or uncertainty of cluster assignments (although it may be possible to use a bootstrap approach to estimate this uncertainty).\n",
@@ -116,7 +142,9 @@
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@@ -143,7 +171,9 @@
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -164,7 +194,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"An important observation for *k*-means is that these cluster models *must be circular*: *k*-means has no built-in way of accounting for oblong or elliptical clusters.\n",
"So, for example, if we take the same data and transform it, the cluster assignments end up becoming muddled:"
@@ -174,7 +207,9 @@
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -198,7 +233,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"By eye, we recognize that these transformed clusters are non-circular, and thus circular clusters would be a poor fit.\n",
"Nevertheless, *k*-means is not flexible enough to account for this, and tries to force-fit the data into four circular clusters.\n",
@@ -214,7 +252,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Generalizing EM: Gaussian Mixture Models\n",
"\n",
@@ -226,7 +267,9 @@
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -249,7 +292,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"But because GMM contains a probabilistic model under the hood, it is also possible to find probabilistic cluster assignments—in Scikit-Learn this is done using the ``predict_proba`` method.\n",
"This returns a matrix of size ``[n_samples, n_clusters]`` which measures the probability that any point belongs to the given cluster:"
@@ -259,7 +305,9 @@
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -281,7 +329,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We can visualize this uncertainty by, for example, making the size of each point proportional to the certainty of its prediction; looking at the following figure, we can see that it is precisely the points at the boundaries between clusters that reflect this uncertainty of cluster assignment:"
]
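Continuing that sketch, one way to encode the certainty as point size; squaring the maximum probability is simply a plotting choice that exaggerates the differences:

```python
import matplotlib.pyplot as plt

# Reuses X, labels, and probs from the sketch above
size = 50 * probs.max(1) ** 2
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=size)
plt.show()
```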
@@ -290,7 +341,9 @@
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -311,7 +364,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Under the hood, a Gaussian mixture model is very similar to *k*-means: it uses an expectationmaximization approach which qualitatively does the following:\n",
"\n",
@@ -332,7 +388,9 @@
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@@ -372,7 +430,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"With this in place, we can take a look at what the four-component GMM gives us for our initial data:"
]
@@ -381,7 +442,9 @@
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -402,7 +465,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Similarly, we can use the GMM approach to fit our stretched dataset; allowing for a full covariance the model will fit even very oblong, stretched-out clusters:"
]
@@ -411,7 +477,9 @@
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -432,14 +500,20 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"This makes clear that GMM addresses the two main practical issues with *k*-means encountered before."
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Choosing the covariance type\n",
"\n",
@@ -454,7 +528,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"![(Covariance Type)](figures/05.12-covariance-type.png)\n",
"[figure source in Appendix](06.00-Figure-Code.ipynb#Covariance-Type)"
@@ -462,7 +539,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## GMM as *Density Estimation*\n",
"\n",
@@ -476,7 +556,9 @@
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -498,7 +580,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"If we try to fit this with a two-component GMM viewed as a clustering model, the results are not particularly useful:"
]
@@ -507,7 +592,9 @@
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -528,7 +615,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"But if we instead use many more components and ignore the cluster labels, we find a fit that is much closer to the input data:"
]
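A sketch of this density-estimation use of GMM, assuming moon-shaped data along the lines the text describes:

```python
from sklearn.datasets import make_moons
from sklearn.mixture import GaussianMixture

Xmoon, ymoon = make_moons(200, noise=0.05, random_state=0)

# Many components, labels ignored: density estimation, not clustering
gmm16 = GaussianMixture(n_components=16, covariance_type='full',
                        random_state=0).fit(Xmoon)
```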
@@ -537,7 +627,9 @@
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -558,7 +650,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Here the mixture of 16 Gaussians serves not to find separated clusters of data, but rather to model the overall *distribution* of the input data.\n",
"This is a generative model of the distribution, meaning that the GMM gives us the recipe to generate new random data distributed similarly to our input.\n",
@@ -569,7 +664,9 @@
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -590,14 +687,20 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"GMM is convenient as a flexible means of modeling an arbitrary multi-dimensional distribution of data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### How many components?\n",
"\n",
@@ -613,7 +716,9 @@
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -640,7 +745,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The optimal number of clusters is the value that minimizes the AIC or BIC, depending on which approximation we wish to use. The AIC tells us that our choice of 16 components above was probably too many: around 8-12 components would have been a better choice.\n",
"As is typical with this sort of problem, the BIC recommends a simpler model.\n",
@@ -651,7 +759,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Example: GMM for Generating New Data\n",
"\n",
@@ -665,7 +776,9 @@
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -687,7 +800,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Next let's plot the first 100 of these to recall exactly what we're looking at:"
]
@@ -696,7 +812,9 @@
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -723,7 +841,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We have nearly 1,800 digits in 64 dimensions, and we can build a GMM on top of these to generate more.\n",
"GMMs can have difficulty converging in such a high dimensional space, so we will start with an invertible dimensionality reduction algorithm on the data.\n",
@@ -734,7 +855,9 @@
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -757,7 +880,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The result is 41 dimensions, a reduction of nearly 1/3 with almost no information loss.\n",
"Given this projected data, let's use the AIC to get a gauge for the number of GMM components we should use:"
@@ -767,7 +893,9 @@
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -791,7 +919,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"It appears that around 110 components minimizes the AIC; we will use this model.\n",
"Let's quickly fit this to the data and confirm that it has converged:"
@@ -801,7 +932,9 @@
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -820,7 +953,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Now we can draw samples of 100 new points within this 41-dimensional projected space, using the GMM as a generative model:"
]
@@ -829,7 +965,9 @@
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -850,7 +988,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Finally, we can use the inverse transform of the PCA object to construct the new digits:"
]
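A sketch covering this fit-sample-reconstruct sequence, reusing ``data`` and ``pca`` from the sketch above; 110 components follows the AIC result quoted in the text:

```python
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=110, covariance_type='full',
                      random_state=0).fit(data)
print(gmm.converged_)  # sanity-check convergence

# Sample in the projected space, then map back to pixel space
data_new, _ = gmm.sample(100)
digits_new = pca.inverse_transform(data_new)  # 100 new 8x8 "digits"
```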
@@ -859,7 +1000,9 @@
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -880,7 +1023,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The results for the most part look like plausible digits from the dataset!\n",
"\n",
@@ -890,7 +1036,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--NAVIGATION-->\n",
"< [In Depth: k-Means Clustering](05.11-K-Means.ipynb) | [Contents](Index.ipynb) | [In-Depth: Kernel Density Estimation](05.13-Kernel-Density-Estimation.ipynb) >"

View File

@@ -2,7 +2,10 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--BOOK_INFORMATION-->\n",
"<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
@@ -13,20 +16,30 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--NAVIGATION-->\n",
"< [In Depth: Gaussian Mixture Models](05.12-Gaussian-Mixtures.ipynb) | [Contents](Index.ipynb) | [Application: A Face Detection Pipeline](05.14-Image-Features.ipynb) >"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# In-Depth: Kernel Density Estimation"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"source": [
"# In-Depth: Kernel Density Estimation\n",
"\n",
"In the previous section we covered Gaussian mixture models (GMM), which are a kind of hybrid between a clustering estimator and a density estimator.\n",
"Recall that a density estimator is an algorithm which takes a $D$-dimensional dataset and produces an estimate of the $D$-dimensional probability distribution which that data is drawn from.\n",
"The GMM algorithm accomplishes this by representing the density as a weighted sum of Gaussian distributions.\n",
@@ -40,7 +53,9 @@
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@@ -52,7 +67,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Motivating KDE: Histograms\n",
"\n",
@@ -67,7 +85,9 @@
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@@ -82,7 +102,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We have previously seen that the standard count-based histogram can be created with the ``plt.hist()`` function.\n",
"By specifying the ``normed`` parameter of the histogram, we end up with a normalized histogram where the height of the bins does not reflect counts, but instead reflects probability density:"
@@ -92,7 +115,9 @@
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -112,7 +137,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Notice that for equal binning, this normalization simply changes the scale on the y-axis, leaving the relative heights essentially the same as in a histogram built from counts.\n",
"This normalization is chosen so that the total area under the histogram is equal to 1, as we can confirm by looking at the output of the histogram function:"
@@ -122,7 +150,9 @@
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -144,7 +174,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"One of the issues with using a histogram as a density estimator is that the choice of bin size and location can lead to representations that have qualitatively different features.\n",
"For example, if we look at a version of this data with only 20 points, the choice of how to draw the bins can lead to an entirely different interpretation of the data!\n",
@@ -155,7 +188,9 @@
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@@ -167,7 +202,9 @@
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -195,7 +232,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"On the left, the histogram makes clear that this is a bimodal distribution.\n",
"On the right, we see a unimodal distribution with a long tail.\n",
@@ -210,7 +250,9 @@
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -249,7 +291,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The problem with our two binnings stems from the fact that the height of the block stack often reflects not on the actual density of points nearby, but on coincidences of how the bins align with the data points.\n",
"This mis-alignment between points and their blocks is a potential cause of the poor histogram results seen here.\n",
@@ -262,7 +307,9 @@
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -288,7 +335,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The result looks a bit messy, but is a much more robust reflection of the actual data characteristics than is the standard histogram.\n",
"Still, the rough edges are not aesthetically pleasing, nor are they reflective of any true properties of the data.\n",
@@ -300,7 +350,9 @@
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -327,7 +379,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"This smoothed-out plot, with a Gaussian distribution contributed at the location of each input point, gives a much more accurate idea of the shape of the data distribution, and one which has much less variance (i.e., changes much less in response to differences in sampling).\n",
"\n",
@@ -337,7 +392,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Kernel Density Estimation in Practice\n",
"\n",
@@ -356,7 +414,9 @@
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -397,14 +457,20 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The result here is normalized such that the area under the curve is equal to 1."
]
},
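For reference, a minimal sketch of Scikit-Learn's ``KernelDensity`` estimator on a one-dimensional sample (the data is again an assumed stand-in):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(42)
x = np.concatenate([rng.normal(0, 1, 13), rng.normal(5, 1, 7)])

# fit expects a 2D array, hence the column-vector reshape
kde = KernelDensity(bandwidth=1.0, kernel='gaussian')
kde.fit(x[:, None])

# score_samples returns the log of the probability density
x_d = np.linspace(-4, 8, 1000)
density = np.exp(kde.score_samples(x_d[:, None]))
```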
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Selecting the bandwidth via cross-validation\n",
"\n",
@@ -422,7 +488,9 @@
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@@ -438,7 +506,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Now we can find the choice of bandwidth which maximizes the score (which in this case defaults to the log-likelihood):"
]
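A sketch of the grid search described here, reusing ``x`` from the sketch above; the bandwidth grid is an assumption:

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV, LeaveOneOut

bandwidths = 10 ** np.linspace(-1, 1, 100)
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': bandwidths},
                    cv=LeaveOneOut())
grid.fit(x[:, None])
print(grid.best_params_)
```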
@@ -447,7 +518,9 @@
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -467,14 +540,20 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The optimal bandwidth happens to be very close to what we used in the example plot earlier, where the bandwidth was 1.0 (i.e., the default width of ``scipy.stats.norm``)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Example: KDE on a Sphere\n",
"\n",
@@ -491,7 +570,9 @@
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@@ -508,7 +589,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"With this data loaded, we can use the Basemap toolkit (mentioned previously in [Geographic Data with Basemap](04.13-Geographic-Data-With-Basemap.ipynb)) to plot the observed locations of these two species on the map of South America."
]
@@ -517,7 +601,9 @@
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -553,7 +639,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Unfortunately, this doesn't give a very good idea of the density of the species, because points in the species range may overlap one another.\n",
"You may not realize it by looking at this plot, but there are over 1,600 points shown here!\n",
@@ -568,7 +657,9 @@
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -623,14 +714,20 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Compared to the simple scatter plot we initially used, this visualization paints a much clearer picture of the geographical distribution of observations of these two species."
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Example: Not-So-Naive Bayes\n",
"\n",
@@ -662,7 +759,9 @@
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@@ -705,14 +804,20 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### The anatomy of a custom estimator"
]
},
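For reference, a condensed sketch of the estimator the following discussion steps through, reconstructed from that discussion; details may differ from the notebook's exact code:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.neighbors import KernelDensity

class KDEClassifier(BaseEstimator, ClassifierMixin):
    """Bayesian generative classification based on KDE."""
    def __init__(self, bandwidth=1.0, kernel='gaussian'):
        self.bandwidth = bandwidth
        self.kernel = kernel

    def fit(self, X, y):
        # One KDE model and one log-prior per class
        self.classes_ = np.sort(np.unique(y))
        training_sets = [X[y == yi] for yi in self.classes_]
        self.models_ = [KernelDensity(bandwidth=self.bandwidth,
                                      kernel=self.kernel).fit(Xi)
                        for Xi in training_sets]
        self.logpriors_ = [np.log(Xi.shape[0] / X.shape[0])
                           for Xi in training_sets]
        return self

    def predict_proba(self, X):
        # Per-class log-likelihood plus log-prior, then normalize
        logprobs = np.array([m.score_samples(X)
                             for m in self.models_]).T
        result = np.exp(logprobs + self.logpriors_)
        return result / result.sum(1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), 1)]
```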
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Let's step through this code and discuss the essential features:\n",
"\n",
@@ -738,7 +843,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Next comes the class initialization method:\n",
"\n",
@@ -756,7 +864,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Next comes the ``fit()`` method, where we handle training data:\n",
"\n",
@@ -783,7 +894,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Finally, we have the logic for predicting labels on new data:\n",
"```python\n",
@@ -804,7 +918,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Using our custom estimator\n",
"\n",
@@ -816,7 +933,9 @@
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@@ -834,7 +953,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Next we can plot the cross-validation score as a function of bandwidth:"
]
@@ -843,7 +965,9 @@
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@@ -876,7 +1000,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We see that this not-so-naive Bayesian classifier reaches a cross-validation accuracy of just over 96%; this is compared to around 80% for the naive Bayesian classification:"
]
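A sketch of the naive Bayes baseline being referenced; the printed score is the mean cross-validation accuracy, which the text puts at around 80%:

```python
from sklearn.datasets import load_digits
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

digits = load_digits()
scores = cross_val_score(GaussianNB(), digits.data, digits.target)
print(scores.mean())
```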
@@ -885,7 +1012,9 @@
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
@ -907,7 +1036,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"One benefit of such a generative classifier is interpretability of results: for each unknown sample, we not only get a probabilistic classification, but a *full model* of the distribution of points we are comparing it to!\n",
"If desired, this offers an intuitive window into the reasons for a particular classification that algorithms like SVMs and random forests tend to obscure.\n",
@@ -922,7 +1054,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--NAVIGATION-->\n",
"< [In Depth: Gaussian Mixture Models](05.12-Gaussian-Mixtures.ipynb) | [Contents](Index.ipynb) | [Application: A Face Detection Pipeline](05.14-Image-Features.ipynb) >"

View File

@@ -2,7 +2,10 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--BOOK_INFORMATION-->\n",
"<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
@@ -13,7 +16,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--NAVIGATION-->\n",
"< [Application: A Face Detection Pipeline](05.14-Image-Features.ipynb) | [Contents](Index.ipynb) | [Appendix: Figure Code](06.00-Figure-Code.ipynb) >"
@@ -23,8 +29,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Further Machine Learning Resources\n",
"\n",
"# Further Machine Learning Resources"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"This chapter has been a quick tour of machine learning in Python, primarily using the tools within the Scikit-Learn library.\n",
"As long as the chapter is, it is still too short to cover many interesting and important algorithms, approaches, and discussions.\n",
"Here I want to suggest some resources to learn more about machine learning for those who are interested."
@@ -32,7 +46,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Machine Learning in Python\n",
"\n",
@@ -49,7 +66,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## General Machine Learning\n",
"\n",
@@ -67,7 +87,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--NAVIGATION-->\n",
"< [Application: A Face Detection Pipeline](05.14-Image-Features.ipynb) | [Contents](Index.ipynb) | [Appendix: Figure Code](06.00-Figure-Code.ipynb) >"

File diff suppressed because it is too large