add notebook and images for video 4 and update README
parent 09e58ad686
commit 9aad784fb9
|
@@ -0,0 +1,530 @@
|
|||
{
|
||||
"metadata": {
|
||||
"name": "",
|
||||
"signature": "sha256:787cafbc85447c6885f074cc2d7b714baad44bcb52b51f8dc0f24816215d083d"
|
||||
},
|
||||
"nbformat": 3,
|
||||
"nbformat_minor": 0,
|
||||
"worksheets": [
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Training a machine learning model with scikit-learn\n",
|
||||
"*From the video series: [Introduction to machine learning with scikit-learn](https://github.com/justmarkham/scikit-learn-videos)*"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Agenda\n",
|
||||
"\n",
|
||||
"- What is the **K-nearest neighbors** classification model?\n",
|
||||
"- What are the four steps for **model training and prediction** in scikit-learn?\n",
|
||||
"- How can I apply this pattern to **other machine learning models**?"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Reviewing the iris dataset"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"collapsed": false,
|
||||
"input": [
|
||||
"from IPython.display import HTML\n",
|
||||
"HTML('<iframe src=http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data width=300 height=200></iframe>')"
|
||||
],
|
||||
"language": "python",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"html": [
|
||||
"<iframe src=http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data width=300 height=200></iframe>"
|
||||
],
|
||||
"metadata": {},
|
||||
"output_type": "pyout",
|
||||
"prompt_number": 2,
|
||||
"text": [
|
||||
"<IPython.core.display.HTML at 0x3e1bf98>"
|
||||
]
|
||||
}
|
||||
],
|
||||
"prompt_number": 2
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- 150 **observations**\n",
|
||||
"- 4 **features** (sepal length, sepal width, petal length, petal width)\n",
|
||||
"- **Response** variable is the iris species\n",
|
||||
"- **Classification** problem since response is categorical\n",
|
||||
"- More information in the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Iris)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## K-nearest neighbors (KNN) classification"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"1. Pick a value for K.\n",
|
||||
"2. Search for the K observations in the training data that are \"nearest\" to the measurements of the unknown iris.\n",
|
||||
"3. Use the most popular response value from the K nearest neighbors as the predicted response value for the unknown iris."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Example training data\n",
|
||||
"\n",
|
||||
"![Training data](images/04_knn_dataset.png)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### KNN classification map (K=1)\n",
|
||||
"\n",
|
||||
"![1NN classification map](images/04_1nn_map.png)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### KNN classification map (K=5)\n",
|
||||
"\n",
|
||||
"![5NN classification map](images/04_5nn_map.png)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"*Image Credits: [Data3classes](http://commons.wikimedia.org/wiki/File:Data3classes.png#/media/File:Data3classes.png), [Map1NN](http://commons.wikimedia.org/wiki/File:Map1NN.png#/media/File:Map1NN.png), [Map5NN](http://commons.wikimedia.org/wiki/File:Map5NN.png#/media/File:Map5NN.png) by Agor153. Licensed under CC BY-SA 3.0*"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Loading the data"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"collapsed": false,
|
||||
"input": [
|
||||
"# import load_iris function from datasets module\n",
|
||||
"from sklearn.datasets import load_iris\n",
|
||||
"\n",
|
||||
"# save \"bunch\" object containing iris dataset and its attributes\n",
|
||||
"iris = load_iris()\n",
|
||||
"\n",
|
||||
"# store feature matrix in \"X\"\n",
|
||||
"X = iris.data\n",
|
||||
"\n",
|
||||
"# store response vector in \"y\"\n",
|
||||
"y = iris.target"
|
||||
],
|
||||
"language": "python",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"prompt_number": 3
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"collapsed": false,
|
||||
"input": [
|
||||
"# print the shapes of X and y\n",
|
||||
"print(X.shape)\n",
|
||||
"print(y.shape)"
|
||||
],
|
||||
"language": "python",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"output_type": "stream",
|
||||
"stream": "stdout",
|
||||
"text": [
|
||||
"(150L, 4L)\n",
|
||||
"(150L,)\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"prompt_number": 4
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## scikit-learn 4-step modeling pattern"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"**Step 1:** Import the class you plan to use"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"collapsed": false,
|
||||
"input": [
|
||||
"from sklearn.neighbors import KNeighborsClassifier"
|
||||
],
|
||||
"language": "python",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"prompt_number": 5
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"**Step 2:** \"Instantiate\" the \"estimator\"\n",
|
||||
"\n",
|
||||
"- \"Estimator\" is scikit-learn's term for model\n",
|
||||
"- \"Instantiate\" means \"make an instance of\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"collapsed": false,
|
||||
"input": [
|
||||
"knn = KNeighborsClassifier(n_neighbors=1)"
|
||||
],
|
||||
"language": "python",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"prompt_number": 6
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Name of the object does not matter\n",
|
||||
"- Can specify tuning parameters (aka \"hyperparameters\") during this step\n",
|
||||
"- All parameters not specified are set to their defaults"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"collapsed": false,
|
||||
"input": [
|
||||
"print(knn)"
|
||||
],
|
||||
"language": "python",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"output_type": "stream",
|
||||
"stream": "stdout",
|
||||
"text": [
|
||||
"KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
|
||||
" metric_params=None, n_neighbors=1, p=2, weights='uniform')\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"prompt_number": 7
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"**Step 3:** Fit the model with data (aka \"model training\")\n",
|
||||
"\n",
|
||||
"- Model is learning the relationship between X and y\n",
|
||||
"- Occurs in-place: `fit` updates the `knn` object itself, so the result does not need to be assigned"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"collapsed": false,
|
||||
"input": [
|
||||
"knn.fit(X, y)"
|
||||
],
|
||||
"language": "python",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"metadata": {},
|
||||
"output_type": "pyout",
|
||||
"prompt_number": 8,
|
||||
"text": [
|
||||
"KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
|
||||
" metric_params=None, n_neighbors=1, p=2, weights='uniform')"
|
||||
]
|
||||
}
|
||||
],
|
||||
"prompt_number": 8
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"**Step 4:** Predict the response for a new observation\n",
|
||||
"\n",
|
||||
"- New observations are called \"out-of-sample\" data\n",
|
||||
"- Uses the information it learned during the model training process"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"collapsed": false,
|
||||
"input": [
|
||||
"knn.predict([[3, 5, 4, 2]])"
|
||||
],
|
||||
"language": "python",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"metadata": {},
|
||||
"output_type": "pyout",
|
||||
"prompt_number": 9,
|
||||
"text": [
|
||||
"array([2])"
|
||||
]
|
||||
}
|
||||
],
|
||||
"prompt_number": 9
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- Returns a NumPy array\n",
|
||||
"- Can predict for multiple observations at once"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"collapsed": false,
|
||||
"input": [
|
||||
"X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]\n",
|
||||
"knn.predict(X_new)"
|
||||
],
|
||||
"language": "python",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"metadata": {},
|
||||
"output_type": "pyout",
|
||||
"prompt_number": 10,
|
||||
"text": [
|
||||
"array([2, 1])"
|
||||
]
|
||||
}
|
||||
],
|
||||
"prompt_number": 10
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Using a different value for K"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"collapsed": false,
|
||||
"input": [
|
||||
"# instantiate the model (using the value K=5)\n",
|
||||
"knn = KNeighborsClassifier(n_neighbors=5)\n",
|
||||
"\n",
|
||||
"# fit the model with data\n",
|
||||
"knn.fit(X, y)\n",
|
||||
"\n",
|
||||
"# predict the response for new observations\n",
|
||||
"knn.predict(X_new)"
|
||||
],
|
||||
"language": "python",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"metadata": {},
|
||||
"output_type": "pyout",
|
||||
"prompt_number": 11,
|
||||
"text": [
|
||||
"array([1, 1])"
|
||||
]
|
||||
}
|
||||
],
|
||||
"prompt_number": 11
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Using a different classification model"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"collapsed": false,
|
||||
"input": [
|
||||
"# import the class\n",
|
||||
"from sklearn.linear_model import LogisticRegression\n",
|
||||
"\n",
|
||||
"# instantiate the model (using the default parameters)\n",
|
||||
"logreg = LogisticRegression()\n",
|
||||
"\n",
|
||||
"# fit the model with data\n",
|
||||
"logreg.fit(X, y)\n",
|
||||
"\n",
|
||||
"# predict the response for new observations\n",
|
||||
"logreg.predict(X_new)"
|
||||
],
|
||||
"language": "python",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"metadata": {},
|
||||
"output_type": "pyout",
|
||||
"prompt_number": 12,
|
||||
"text": [
|
||||
"array([2, 0])"
|
||||
]
|
||||
}
|
||||
],
|
||||
"prompt_number": 12
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Resources\n",
|
||||
"\n",
|
||||
"- [Nearest Neighbors](http://scikit-learn.org/stable/modules/neighbors.html) (user guide), [KNeighborsClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) (class documentation)\n",
|
||||
"- [Logistic Regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) (user guide), [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) (class documentation)\n",
|
||||
"- [Videos from An Introduction to Statistical Learning](http://www.dataschool.io/15-hours-of-expert-machine-learning-videos/)\n",
|
||||
"    - Classification Problems and K-Nearest Neighbors (Chapter 2)\n",
|
||||
"    - Introduction to Classification (Chapter 4)\n",
|
||||
"    - Logistic Regression and Maximum Likelihood (Chapter 4)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Comments or Questions?\n",
|
||||
"\n",
|
||||
"- Email: <kevin@dataschool.io>\n",
|
||||
"- Website: http://dataschool.io\n",
|
||||
"- Twitter: [@justmarkham](https://twitter.com/justmarkham)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"collapsed": false,
|
||||
"input": [
|
||||
"from IPython.display import HTML\n",
|
||||
"def css_styling():\n",
|
||||
"    styles = open(\"styles/custom.css\", \"r\").read()\n",
|
||||
"    return HTML(styles)\n",
|
||||
"css_styling()"
|
||||
],
|
||||
"language": "python",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"html": [
|
||||
"<style>\n",
|
||||
" @font-face {\n",
|
||||
" font-family: \"Computer Modern\";\n",
|
||||
" src: url('http://mirrors.ctan.org/fonts/cm-unicode/fonts/otf/cmunss.otf');\n",
|
||||
" }\n",
|
||||
" div.cell{\n",
|
||||
" width: 90%;\n",
|
||||
"/* margin-left:auto;*/\n",
|
||||
"/* margin-right:auto;*/\n",
|
||||
" }\n",
|
||||
" ul {\n",
|
||||
" line-height: 145%;\n",
|
||||
" font-size: 90%;\n",
|
||||
" }\n",
|
||||
" li {\n",
|
||||
" margin-bottom: 1em;\n",
|
||||
" }\n",
|
||||
" h1 {\n",
|
||||
" font-family: Helvetica, serif;\n",
|
||||
" }\n",
|
||||
" h4{\n",
|
||||
" margin-top: 12px;\n",
|
||||
" margin-bottom: 3px;\n",
|
||||
" }\n",
|
||||
" div.text_cell_render{\n",
|
||||
" font-family: Computer Modern, \"Helvetica Neue\", Arial, Helvetica, Geneva, sans-serif;\n",
|
||||
" line-height: 145%;\n",
|
||||
" font-size: 130%;\n",
|
||||
" width: 90%;\n",
|
||||
" margin-left:auto;\n",
|
||||
" margin-right:auto;\n",
|
||||
" }\n",
|
||||
" .CodeMirror{\n",
|
||||
" font-family: \"Source Code Pro\", source-code-pro,Consolas, monospace;\n",
|
||||
" }\n",
|
||||
"/* .prompt{\n",
|
||||
" display: None;\n",
|
||||
" }*/\n",
|
||||
" .text_cell_render h5 {\n",
|
||||
" font-weight: 300;\n",
|
||||
" font-size: 16pt;\n",
|
||||
" color: #4057A1;\n",
|
||||
" font-style: italic;\n",
|
||||
" margin-bottom: 0.5em;\n",
|
||||
" margin-top: 0.5em;\n",
|
||||
" display: block;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .warning{\n",
|
||||
" color: rgb( 240, 20, 20 )\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<script>\n",
|
||||
" MathJax.Hub.Config({\n",
|
||||
" TeX: {\n",
|
||||
" extensions: [\"AMSmath.js\"]\n",
|
||||
" },\n",
|
||||
" tex2jax: {\n",
|
||||
" inlineMath: [ ['$','$'], [\"\\\\(\",\"\\\\)\"] ],\n",
|
||||
" displayMath: [ ['$$','$$'], [\"\\\\[\",\"\\\\]\"] ]\n",
|
||||
" },\n",
|
||||
" displayAlign: 'center', // Change this to 'center' to center equations.\n",
|
||||
" \"HTML-CSS\": {\n",
|
||||
" styles: {'.MathJax_Display': {\"margin\": 4}}\n",
|
||||
" }\n",
|
||||
" });\n",
|
||||
"</script>"
|
||||
],
|
||||
"metadata": {},
|
||||
"output_type": "pyout",
|
||||
"prompt_number": 1,
|
||||
"text": [
|
||||
"<IPython.core.display.HTML at 0x3e321d0>"
|
||||
]
|
||||
}
|
||||
],
|
||||
"prompt_number": 1
|
||||
}
|
||||
],
|
||||
"metadata": {}
|
||||
}
|
||||
]
|
||||
}
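Review note: the three KNN steps listed in the notebook (pick K, find the K nearest training observations, use the most popular response among them) can be sketched in plain Python. This is only an illustration, not scikit-learn's implementation. The function name `knn_predict`, the use of Euclidean distance, and the toy data are assumptions made for the example; the toy data are not iris measurements.

```python
import math
from collections import Counter


def knn_predict(X_train, y_train, x_new, k):
    """Predict a label for x_new by majority vote among its k nearest neighbors."""
    # Step 2: Euclidean distance from x_new to every training observation
    distances = []
    for row, label in zip(X_train, y_train):
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(row, x_new)))
        distances.append((dist, label))
    # keep the k closest (distance, label) pairs
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: the most popular label among those neighbors is the prediction
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two features, two classes
X_train = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [3.0, 3.0], [3.2, 2.9], [1.4, 1.3]]
y_train = ["a", "a", "a", "b", "b", "b"]
x_new = [1.5, 1.4]

# Step 1: pick a value for K, then predict
print(knn_predict(X_train, y_train, x_new, k=1))  # b
print(knn_predict(X_train, y_train, x_new, k=5))  # a
```

With K=1 the single closest point decides the label; with K=5 the vote is 3 to 2 for the other class. This is the same kind of change the notebook shows when it switches from `n_neighbors=1` to `n_neighbors=5`.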
|
|
@@ -27,3 +27,8 @@ This repo contains IPython notebooks from my scikit-learn video series, as seen
|
|||
- How do we load the iris dataset into scikit-learn?
|
||||
- How do we describe a dataset using machine learning terminology?
|
||||
- What are scikit-learn's four key requirements for working with data?
|
||||
|
||||
4. Training a machine learning model with scikit-learn ([video](https://www.youtube.com/watch?v=RlQuVL6-qe8&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=4), [notebook](http://nbviewer.ipython.org/github/justmarkham/scikit-learn-videos/blob/master/04_model_training.ipynb), blog post)
|
||||
- What is the K-nearest neighbors classification model?
|
||||
- What are the four steps for model training and prediction in scikit-learn?
|
||||
- How can I apply this pattern to other machine learning models?
|
||||
|
|
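Review note: the README's last question for video 4 (applying the pattern to other models) can be illustrated with a short sketch, assuming a recent scikit-learn running on Python 3. The helper `fit_and_predict` is hypothetical and exists only for this example. `max_iter=1000` is an assumption added to reduce the chance of a convergence warning; the notebook itself uses `LogisticRegression()` with default parameters.

```python
# Step 1: import the classes you plan to use
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier


def fit_and_predict(estimator, X, y, X_new):
    """Steps 3 and 4: fit an instantiated estimator, then predict for new rows."""
    estimator.fit(X, y)  # fit happens in-place on the estimator
    return estimator.predict(X_new)  # returns a NumPy array of predictions


iris = load_iris()
X, y = iris.data, iris.target

# new observations must be 2D: one inner list per observation
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]

# Step 2: instantiate each estimator; only the class and its parameters differ
models = {
    "knn_k1": KNeighborsClassifier(n_neighbors=1),
    "knn_k5": KNeighborsClassifier(n_neighbors=5),
    "logreg": LogisticRegression(max_iter=1000),  # max_iter is an assumption
}

predictions = {
    name: fit_and_predict(model, X, y, X_new) for name, model in models.items()
}
for name, pred in predictions.items():
    print(name, pred)
```

Because every estimator exposes the same `fit` and `predict` methods, swapping models only changes the import and the instantiation line.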
Binary file not shown. Size: 20 KiB
Binary file not shown. Size: 21 KiB
Binary file not shown. Size: 14 KiB