add notebook and image for video 3

pull/7/head
Kevin Markham 2015-04-21 16:51:22 -04:00
parent 2d10b2696e
commit b24e2d9c94
2 changed files with 589 additions and 0 deletions

View File

@ -0,0 +1,589 @@
{
"metadata": {
"name": "",
"signature": "sha256:47353a005f0e929371e234ee800892c59aae1d73917cdc29dc3c451c61605a28"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Getting started in scikit-learn with the famous iris dataset\n",
"*From the video series: [Introduction to machine learning with scikit-learn](https://github.com/justmarkham/scikit-learn-videos)*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Agenda\n",
"\n",
"- What is the famous iris dataset, and how does it relate to machine learning?\n",
"- How do we load the iris dataset into scikit-learn?\n",
"- How do we describe a dataset using machine learning terminology?\n",
"- What are scikit-learn's four key requirements for working with data?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introducing the iris dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Iris](images/03_iris.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- 50 samples of 3 different species of iris (150 samples total)\n",
"- Measurements: sepal length, sepal width, petal length, petal width"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from IPython.display import HTML\n",
"HTML('<iframe src=http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data width=300 height=200></iframe>')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<iframe src=http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data width=300 height=200></iframe>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 2,
"text": [
"<IPython.core.display.HTML at 0x3e2bef0>"
]
}
],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Machine learning on the iris dataset\n",
"\n",
"- Framed as a **supervised learning** problem: Predict the species of an iris using the measurements\n",
"- Famous dataset for machine learning because prediction is **easy**\n",
"- Learn more about the iris dataset: [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Iris)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading the iris dataset into scikit-learn"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# import load_iris function from datasets module\n",
"from sklearn.datasets import load_iris"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# save \"bunch\" object containing iris dataset and its attributes\n",
"iris = load_iris()\n",
"type(iris)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 4,
"text": [
"sklearn.datasets.base.Bunch"
]
}
],
"prompt_number": 4
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# print the iris data\n",
"print iris.data"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[[ 5.1 3.5 1.4 0.2]\n",
" [ 4.9 3. 1.4 0.2]\n",
" [ 4.7 3.2 1.3 0.2]\n",
" [ 4.6 3.1 1.5 0.2]\n",
" [ 5. 3.6 1.4 0.2]\n",
" [ 5.4 3.9 1.7 0.4]\n",
" [ 4.6 3.4 1.4 0.3]\n",
" [ 5. 3.4 1.5 0.2]\n",
" [ 4.4 2.9 1.4 0.2]\n",
" [ 4.9 3.1 1.5 0.1]\n",
" [ 5.4 3.7 1.5 0.2]\n",
" [ 4.8 3.4 1.6 0.2]\n",
" [ 4.8 3. 1.4 0.1]\n",
" [ 4.3 3. 1.1 0.1]\n",
" [ 5.8 4. 1.2 0.2]\n",
" [ 5.7 4.4 1.5 0.4]\n",
" [ 5.4 3.9 1.3 0.4]\n",
" [ 5.1 3.5 1.4 0.3]\n",
" [ 5.7 3.8 1.7 0.3]\n",
" [ 5.1 3.8 1.5 0.3]\n",
" [ 5.4 3.4 1.7 0.2]\n",
" [ 5.1 3.7 1.5 0.4]\n",
" [ 4.6 3.6 1. 0.2]\n",
" [ 5.1 3.3 1.7 0.5]\n",
" [ 4.8 3.4 1.9 0.2]\n",
" [ 5. 3. 1.6 0.2]\n",
" [ 5. 3.4 1.6 0.4]\n",
" [ 5.2 3.5 1.5 0.2]\n",
" [ 5.2 3.4 1.4 0.2]\n",
" [ 4.7 3.2 1.6 0.2]\n",
" [ 4.8 3.1 1.6 0.2]\n",
" [ 5.4 3.4 1.5 0.4]\n",
" [ 5.2 4.1 1.5 0.1]\n",
" [ 5.5 4.2 1.4 0.2]\n",
" [ 4.9 3.1 1.5 0.1]\n",
" [ 5. 3.2 1.2 0.2]\n",
" [ 5.5 3.5 1.3 0.2]\n",
" [ 4.9 3.1 1.5 0.1]\n",
" [ 4.4 3. 1.3 0.2]\n",
" [ 5.1 3.4 1.5 0.2]\n",
" [ 5. 3.5 1.3 0.3]\n",
" [ 4.5 2.3 1.3 0.3]\n",
" [ 4.4 3.2 1.3 0.2]\n",
" [ 5. 3.5 1.6 0.6]\n",
" [ 5.1 3.8 1.9 0.4]\n",
" [ 4.8 3. 1.4 0.3]\n",
" [ 5.1 3.8 1.6 0.2]\n",
" [ 4.6 3.2 1.4 0.2]\n",
" [ 5.3 3.7 1.5 0.2]\n",
" [ 5. 3.3 1.4 0.2]\n",
" [ 7. 3.2 4.7 1.4]\n",
" [ 6.4 3.2 4.5 1.5]\n",
" [ 6.9 3.1 4.9 1.5]\n",
" [ 5.5 2.3 4. 1.3]\n",
" [ 6.5 2.8 4.6 1.5]\n",
" [ 5.7 2.8 4.5 1.3]\n",
" [ 6.3 3.3 4.7 1.6]\n",
" [ 4.9 2.4 3.3 1. ]\n",
" [ 6.6 2.9 4.6 1.3]\n",
" [ 5.2 2.7 3.9 1.4]\n",
" [ 5. 2. 3.5 1. ]\n",
" [ 5.9 3. 4.2 1.5]\n",
" [ 6. 2.2 4. 1. ]\n",
" [ 6.1 2.9 4.7 1.4]\n",
" [ 5.6 2.9 3.6 1.3]\n",
" [ 6.7 3.1 4.4 1.4]\n",
" [ 5.6 3. 4.5 1.5]\n",
" [ 5.8 2.7 4.1 1. ]\n",
" [ 6.2 2.2 4.5 1.5]\n",
" [ 5.6 2.5 3.9 1.1]\n",
" [ 5.9 3.2 4.8 1.8]\n",
" [ 6.1 2.8 4. 1.3]\n",
" [ 6.3 2.5 4.9 1.5]\n",
" [ 6.1 2.8 4.7 1.2]\n",
" [ 6.4 2.9 4.3 1.3]\n",
" [ 6.6 3. 4.4 1.4]\n",
" [ 6.8 2.8 4.8 1.4]\n",
" [ 6.7 3. 5. 1.7]\n",
" [ 6. 2.9 4.5 1.5]\n",
" [ 5.7 2.6 3.5 1. ]\n",
" [ 5.5 2.4 3.8 1.1]\n",
" [ 5.5 2.4 3.7 1. ]\n",
" [ 5.8 2.7 3.9 1.2]\n",
" [ 6. 2.7 5.1 1.6]\n",
" [ 5.4 3. 4.5 1.5]\n",
" [ 6. 3.4 4.5 1.6]\n",
" [ 6.7 3.1 4.7 1.5]\n",
" [ 6.3 2.3 4.4 1.3]\n",
" [ 5.6 3. 4.1 1.3]\n",
" [ 5.5 2.5 4. 1.3]\n",
" [ 5.5 2.6 4.4 1.2]\n",
" [ 6.1 3. 4.6 1.4]\n",
" [ 5.8 2.6 4. 1.2]\n",
" [ 5. 2.3 3.3 1. ]\n",
" [ 5.6 2.7 4.2 1.3]\n",
" [ 5.7 3. 4.2 1.2]\n",
" [ 5.7 2.9 4.2 1.3]\n",
" [ 6.2 2.9 4.3 1.3]\n",
" [ 5.1 2.5 3. 1.1]\n",
" [ 5.7 2.8 4.1 1.3]\n",
" [ 6.3 3.3 6. 2.5]\n",
" [ 5.8 2.7 5.1 1.9]\n",
" [ 7.1 3. 5.9 2.1]\n",
" [ 6.3 2.9 5.6 1.8]\n",
" [ 6.5 3. 5.8 2.2]\n",
" [ 7.6 3. 6.6 2.1]\n",
" [ 4.9 2.5 4.5 1.7]\n",
" [ 7.3 2.9 6.3 1.8]\n",
" [ 6.7 2.5 5.8 1.8]\n",
" [ 7.2 3.6 6.1 2.5]\n",
" [ 6.5 3.2 5.1 2. ]\n",
" [ 6.4 2.7 5.3 1.9]\n",
" [ 6.8 3. 5.5 2.1]\n",
" [ 5.7 2.5 5. 2. ]\n",
" [ 5.8 2.8 5.1 2.4]\n",
" [ 6.4 3.2 5.3 2.3]\n",
" [ 6.5 3. 5.5 1.8]\n",
" [ 7.7 3.8 6.7 2.2]\n",
" [ 7.7 2.6 6.9 2.3]\n",
" [ 6. 2.2 5. 1.5]\n",
" [ 6.9 3.2 5.7 2.3]\n",
" [ 5.6 2.8 4.9 2. ]\n",
" [ 7.7 2.8 6.7 2. ]\n",
" [ 6.3 2.7 4.9 1.8]\n",
" [ 6.7 3.3 5.7 2.1]\n",
" [ 7.2 3.2 6. 1.8]\n",
" [ 6.2 2.8 4.8 1.8]\n",
" [ 6.1 3. 4.9 1.8]\n",
" [ 6.4 2.8 5.6 2.1]\n",
" [ 7.2 3. 5.8 1.6]\n",
" [ 7.4 2.8 6.1 1.9]\n",
" [ 7.9 3.8 6.4 2. ]\n",
" [ 6.4 2.8 5.6 2.2]\n",
" [ 6.3 2.8 5.1 1.5]\n",
" [ 6.1 2.6 5.6 1.4]\n",
" [ 7.7 3. 6.1 2.3]\n",
" [ 6.3 3.4 5.6 2.4]\n",
" [ 6.4 3.1 5.5 1.8]\n",
" [ 6. 3. 4.8 1.8]\n",
" [ 6.9 3.1 5.4 2.1]\n",
" [ 6.7 3.1 5.6 2.4]\n",
" [ 6.9 3.1 5.1 2.3]\n",
" [ 5.8 2.7 5.1 1.9]\n",
" [ 6.8 3.2 5.9 2.3]\n",
" [ 6.7 3.3 5.7 2.5]\n",
" [ 6.7 3. 5.2 2.3]\n",
" [ 6.3 2.5 5. 1.9]\n",
" [ 6.5 3. 5.2 2. ]\n",
" [ 6.2 3.4 5.4 2.3]\n",
" [ 5.9 3. 5.1 1.8]]\n"
]
}
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Machine learning terminology\n",
"\n",
"- Each row is an **observation** (also known as: sample, example, instance, record)\n",
"- Each column is a **feature** (also known as: predictor, attribute, independent variable, input, regressor, covariate)"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# print the names of the four features\n",
"print iris.feature_names"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']\n"
]
}
],
"prompt_number": 6
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# print integers representing the species of each observation\n",
"print iris.target"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n",
" 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n",
" 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n",
" 2 2]\n"
]
}
],
"prompt_number": 7
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# print the encoding scheme for species: 0 = setosa, 1 = versicolor, 2 = virginica\n",
"print iris.target_names"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"['setosa' 'versicolor' 'virginica']\n"
]
}
],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Each value we are predicting is the **response** (also known as: target, outcome, label, dependent variable)\n",
"- **Classification** is supervised learning in which the response is categorical\n",
"- **Regression** is supervised learning in which the response is ordered and continuous"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Requirements for working with data in scikit-learn\n",
"\n",
"1. Features and response are **separate objects**\n",
"2. Features and response should be **numeric**\n",
"3. Features and response should be **NumPy arrays**\n",
"4. Features and response should have **specific shapes**"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# check the types of the features and response\n",
"print type(iris.data)\n",
"print type(iris.target)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"<type 'numpy.ndarray'>\n",
"<type 'numpy.ndarray'>\n"
]
}
],
"prompt_number": 9
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# check the shape of the features (first dimension = number of observations, second dimensions = number of features)\n",
"print iris.data.shape"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"(150L, 4L)\n"
]
}
],
"prompt_number": 10
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# check the shape of the response (single dimension matching the number of observations)\n",
"print iris.target.shape"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"(150L,)\n"
]
}
],
"prompt_number": 11
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# store feature matrix in \"X\"\n",
"X = iris.data\n",
"\n",
"# store response vector in \"y\"\n",
"y = iris.target"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Resources\n",
"\n",
"- scikit-learn documentation: [Dataset loading utilities](http://scikit-learn.org/stable/datasets/)\n",
"- Jake VanderPlas: Fast Numerical Computing with NumPy ([slides](https://speakerdeck.com/jakevdp/losing-your-loops-fast-numerical-computing-with-numpy-pycon-2015), [video](https://www.youtube.com/watch?v=EEUXKG97YRw))\n",
"- Scott Shell: [An Introduction to NumPy](http://www.engr.ucsb.edu/~shell/che210d/numpy.pdf) (PDF)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Comments or Questions?\n",
"\n",
"- Email: <kevin@dataschool.io>\n",
"- Website: http://dataschool.io\n",
"- Twitter: [@justmarkham](https://twitter.com/justmarkham)"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from IPython.core.display import HTML\n",
"def css_styling():\n",
" styles = open(\"styles/custom.css\", \"r\").read()\n",
" return HTML(styles)\n",
"css_styling()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<style>\n",
" @font-face {\n",
" font-family: \"Computer Modern\";\n",
" src: url('http://mirrors.ctan.org/fonts/cm-unicode/fonts/otf/cmunss.otf');\n",
" }\n",
" div.cell{\n",
" width: 90%;\n",
"/* margin-left:auto;*/\n",
"/* margin-right:auto;*/\n",
" }\n",
" ul {\n",
" line-height: 145%;\n",
" font-size: 90%;\n",
" }\n",
" li {\n",
" margin-bottom: 1em;\n",
" }\n",
" h1 {\n",
" font-family: Helvetica, serif;\n",
" }\n",
" h4{\n",
" margin-top: 12px;\n",
" margin-bottom: 3px;\n",
" }\n",
" div.text_cell_render{\n",
" font-family: Computer Modern, \"Helvetica Neue\", Arial, Helvetica, Geneva, sans-serif;\n",
" line-height: 145%;\n",
" font-size: 130%;\n",
" width: 90%;\n",
" margin-left:auto;\n",
" margin-right:auto;\n",
" }\n",
" .CodeMirror{\n",
" font-family: \"Source Code Pro\", source-code-pro,Consolas, monospace;\n",
" }\n",
"/* .prompt{\n",
" display: None;\n",
" }*/\n",
" .text_cell_render h5 {\n",
" font-weight: 300;\n",
" font-size: 16pt;\n",
" color: #4057A1;\n",
" font-style: italic;\n",
" margin-bottom: 0.5em;\n",
" margin-top: 0.5em;\n",
" display: block;\n",
" }\n",
"\n",
" .warning{\n",
" color: rgb( 240, 20, 20 )\n",
" }\n",
"</style>\n",
"<script>\n",
" MathJax.Hub.Config({\n",
" TeX: {\n",
" extensions: [\"AMSmath.js\"]\n",
" },\n",
" tex2jax: {\n",
" inlineMath: [ ['$','$'], [\"\\\\(\",\"\\\\)\"] ],\n",
" displayMath: [ ['$$','$$'], [\"\\\\[\",\"\\\\]\"] ]\n",
" },\n",
" displayAlign: 'center', // Change this to 'center' to center equations.\n",
" \"HTML-CSS\": {\n",
" styles: {'.MathJax_Display': {\"margin\": 4}}\n",
" }\n",
" });\n",
"</script>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 1,
"text": [
"<IPython.core.display.HTML at 0x3e42240>"
]
}
],
"prompt_number": 1
}
],
"metadata": {}
}
]
}

BIN
images/03_iris.png 100644

Binary file not shown.

After

Width:  |  Height:  |  Size: 150 KiB