add notebook and image for video 3

2015-04-21 16:51:22 -04:00 · b24e2d9c94
parent 2d10b2696e
commit b24e2d9c94
2 changed files with 589 additions and 0 deletions
--- a/03_getting_started_with_iris.ipynb
+++ b/03_getting_started_with_iris.ipynb
@@ -0,0 +1,589 @@
+{
+ "metadata": {
+  "name": "",
+  "signature": "sha256:47353a005f0e929371e234ee800892c59aae1d73917cdc29dc3c451c61605a28"
+ },
+ "nbformat": 3,
+ "nbformat_minor": 0,
+ "worksheets": [
+  {
+   "cells": [
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "# Getting started in scikit-learn with the famous iris dataset\n",
+      "*From the video series: [Introduction to machine learning with scikit-learn](https://github.com/justmarkham/scikit-learn-videos)*"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "## Agenda\n",
+      "\n",
+      "- What is the famous iris dataset, and how does it relate to machine learning?\n",
+      "- How do we load the iris dataset into scikit-learn?\n",
+      "- How do we describe a dataset using machine learning terminology?\n",
+      "- What are scikit-learn's four key requirements for working with data?"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "## Introducing the iris dataset"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "![Iris](images/03_iris.png)"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "- 50 samples of 3 different species of iris (150 samples total)\n",
+      "- Measurements: sepal length, sepal width, petal length, petal width"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "from IPython.display import HTML\n",
+      "HTML('<iframe src=http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data width=300 height=200></iframe>')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<iframe src=http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data width=300 height=200></iframe>"
+       ],
+       "metadata": {},
+       "output_type": "pyout",
+       "prompt_number": 2,
+       "text": [
+        "<IPython.core.display.HTML at 0x3e2bef0>"
+       ]
+      }
+     ],
+     "prompt_number": 2
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "## Machine learning on the iris dataset\n",
+      "\n",
+      "- Framed as a **supervised learning** problem: Predict the species of an iris using the measurements\n",
+      "- Famous dataset for machine learning because prediction is **easy**\n",
+      "- Learn more about the iris dataset: [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Iris)"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "## Loading the iris dataset into scikit-learn"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# import load_iris function from datasets module\n",
+      "from sklearn.datasets import load_iris"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 3
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# save \"bunch\" object containing iris dataset and its attributes\n",
+      "iris = load_iris()\n",
+      "type(iris)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "metadata": {},
+       "output_type": "pyout",
+       "prompt_number": 4,
+       "text": [
+        "sklearn.datasets.base.Bunch"
+       ]
+      }
+     ],
+     "prompt_number": 4
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# print the iris data\n",
+      "print(iris.data)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "[[ 5.1  3.5  1.4  0.2]\n",
+        " [ 4.9  3.   1.4  0.2]\n",
+        " [ 4.7  3.2  1.3  0.2]\n",
+        " [ 4.6  3.1  1.5  0.2]\n",
+        " [ 5.   3.6  1.4  0.2]\n",
+        " [ 5.4  3.9  1.7  0.4]\n",
+        " [ 4.6  3.4  1.4  0.3]\n",
+        " [ 5.   3.4  1.5  0.2]\n",
+        " [ 4.4  2.9  1.4  0.2]\n",
+        " [ 4.9  3.1  1.5  0.1]\n",
+        " [ 5.4  3.7  1.5  0.2]\n",
+        " [ 4.8  3.4  1.6  0.2]\n",
+        " [ 4.8  3.   1.4  0.1]\n",
+        " [ 4.3  3.   1.1  0.1]\n",
+        " [ 5.8  4.   1.2  0.2]\n",
+        " [ 5.7  4.4  1.5  0.4]\n",
+        " [ 5.4  3.9  1.3  0.4]\n",
+        " [ 5.1  3.5  1.4  0.3]\n",
+        " [ 5.7  3.8  1.7  0.3]\n",
+        " [ 5.1  3.8  1.5  0.3]\n",
+        " [ 5.4  3.4  1.7  0.2]\n",
+        " [ 5.1  3.7  1.5  0.4]\n",
+        " [ 4.6  3.6  1.   0.2]\n",
+        " [ 5.1  3.3  1.7  0.5]\n",
+        " [ 4.8  3.4  1.9  0.2]\n",
+        " [ 5.   3.   1.6  0.2]\n",
+        " [ 5.   3.4  1.6  0.4]\n",
+        " [ 5.2  3.5  1.5  0.2]\n",
+        " [ 5.2  3.4  1.4  0.2]\n",
+        " [ 4.7  3.2  1.6  0.2]\n",
+        " [ 4.8  3.1  1.6  0.2]\n",
+        " [ 5.4  3.4  1.5  0.4]\n",
+        " [ 5.2  4.1  1.5  0.1]\n",
+        " [ 5.5  4.2  1.4  0.2]\n",
+        " [ 4.9  3.1  1.5  0.1]\n",
+        " [ 5.   3.2  1.2  0.2]\n",
+        " [ 5.5  3.5  1.3  0.2]\n",
+        " [ 4.9  3.1  1.5  0.1]\n",
+        " [ 4.4  3.   1.3  0.2]\n",
+        " [ 5.1  3.4  1.5  0.2]\n",
+        " [ 5.   3.5  1.3  0.3]\n",
+        " [ 4.5  2.3  1.3  0.3]\n",
+        " [ 4.4  3.2  1.3  0.2]\n",
+        " [ 5.   3.5  1.6  0.6]\n",
+        " [ 5.1  3.8  1.9  0.4]\n",
+        " [ 4.8  3.   1.4  0.3]\n",
+        " [ 5.1  3.8  1.6  0.2]\n",
+        " [ 4.6  3.2  1.4  0.2]\n",
+        " [ 5.3  3.7  1.5  0.2]\n",
+        " [ 5.   3.3  1.4  0.2]\n",
+        " [ 7.   3.2  4.7  1.4]\n",
+        " [ 6.4  3.2  4.5  1.5]\n",
+        " [ 6.9  3.1  4.9  1.5]\n",
+        " [ 5.5  2.3  4.   1.3]\n",
+        " [ 6.5  2.8  4.6  1.5]\n",
+        " [ 5.7  2.8  4.5  1.3]\n",
+        " [ 6.3  3.3  4.7  1.6]\n",
+        " [ 4.9  2.4  3.3  1. ]\n",
+        " [ 6.6  2.9  4.6  1.3]\n",
+        " [ 5.2  2.7  3.9  1.4]\n",
+        " [ 5.   2.   3.5  1. ]\n",
+        " [ 5.9  3.   4.2  1.5]\n",
+        " [ 6.   2.2  4.   1. ]\n",
+        " [ 6.1  2.9  4.7  1.4]\n",
+        " [ 5.6  2.9  3.6  1.3]\n",
+        " [ 6.7  3.1  4.4  1.4]\n",
+        " [ 5.6  3.   4.5  1.5]\n",
+        " [ 5.8  2.7  4.1  1. ]\n",
+        " [ 6.2  2.2  4.5  1.5]\n",
+        " [ 5.6  2.5  3.9  1.1]\n",
+        " [ 5.9  3.2  4.8  1.8]\n",
+        " [ 6.1  2.8  4.   1.3]\n",
+        " [ 6.3  2.5  4.9  1.5]\n",
+        " [ 6.1  2.8  4.7  1.2]\n",
+        " [ 6.4  2.9  4.3  1.3]\n",
+        " [ 6.6  3.   4.4  1.4]\n",
+        " [ 6.8  2.8  4.8  1.4]\n",
+        " [ 6.7  3.   5.   1.7]\n",
+        " [ 6.   2.9  4.5  1.5]\n",
+        " [ 5.7  2.6  3.5  1. ]\n",
+        " [ 5.5  2.4  3.8  1.1]\n",
+        " [ 5.5  2.4  3.7  1. ]\n",
+        " [ 5.8  2.7  3.9  1.2]\n",
+        " [ 6.   2.7  5.1  1.6]\n",
+        " [ 5.4  3.   4.5  1.5]\n",
+        " [ 6.   3.4  4.5  1.6]\n",
+        " [ 6.7  3.1  4.7  1.5]\n",
+        " [ 6.3  2.3  4.4  1.3]\n",
+        " [ 5.6  3.   4.1  1.3]\n",
+        " [ 5.5  2.5  4.   1.3]\n",
+        " [ 5.5  2.6  4.4  1.2]\n",
+        " [ 6.1  3.   4.6  1.4]\n",
+        " [ 5.8  2.6  4.   1.2]\n",
+        " [ 5.   2.3  3.3  1. ]\n",
+        " [ 5.6  2.7  4.2  1.3]\n",
+        " [ 5.7  3.   4.2  1.2]\n",
+        " [ 5.7  2.9  4.2  1.3]\n",
+        " [ 6.2  2.9  4.3  1.3]\n",
+        " [ 5.1  2.5  3.   1.1]\n",
+        " [ 5.7  2.8  4.1  1.3]\n",
+        " [ 6.3  3.3  6.   2.5]\n",
+        " [ 5.8  2.7  5.1  1.9]\n",
+        " [ 7.1  3.   5.9  2.1]\n",
+        " [ 6.3  2.9  5.6  1.8]\n",
+        " [ 6.5  3.   5.8  2.2]\n",
+        " [ 7.6  3.   6.6  2.1]\n",
+        " [ 4.9  2.5  4.5  1.7]\n",
+        " [ 7.3  2.9  6.3  1.8]\n",
+        " [ 6.7  2.5  5.8  1.8]\n",
+        " [ 7.2  3.6  6.1  2.5]\n",
+        " [ 6.5  3.2  5.1  2. ]\n",
+        " [ 6.4  2.7  5.3  1.9]\n",
+        " [ 6.8  3.   5.5  2.1]\n",
+        " [ 5.7  2.5  5.   2. ]\n",
+        " [ 5.8  2.8  5.1  2.4]\n",
+        " [ 6.4  3.2  5.3  2.3]\n",
+        " [ 6.5  3.   5.5  1.8]\n",
+        " [ 7.7  3.8  6.7  2.2]\n",
+        " [ 7.7  2.6  6.9  2.3]\n",
+        " [ 6.   2.2  5.   1.5]\n",
+        " [ 6.9  3.2  5.7  2.3]\n",
+        " [ 5.6  2.8  4.9  2. ]\n",
+        " [ 7.7  2.8  6.7  2. ]\n",
+        " [ 6.3  2.7  4.9  1.8]\n",
+        " [ 6.7  3.3  5.7  2.1]\n",
+        " [ 7.2  3.2  6.   1.8]\n",
+        " [ 6.2  2.8  4.8  1.8]\n",
+        " [ 6.1  3.   4.9  1.8]\n",
+        " [ 6.4  2.8  5.6  2.1]\n",
+        " [ 7.2  3.   5.8  1.6]\n",
+        " [ 7.4  2.8  6.1  1.9]\n",
+        " [ 7.9  3.8  6.4  2. ]\n",
+        " [ 6.4  2.8  5.6  2.2]\n",
+        " [ 6.3  2.8  5.1  1.5]\n",
+        " [ 6.1  2.6  5.6  1.4]\n",
+        " [ 7.7  3.   6.1  2.3]\n",
+        " [ 6.3  3.4  5.6  2.4]\n",
+        " [ 6.4  3.1  5.5  1.8]\n",
+        " [ 6.   3.   4.8  1.8]\n",
+        " [ 6.9  3.1  5.4  2.1]\n",
+        " [ 6.7  3.1  5.6  2.4]\n",
+        " [ 6.9  3.1  5.1  2.3]\n",
+        " [ 5.8  2.7  5.1  1.9]\n",
+        " [ 6.8  3.2  5.9  2.3]\n",
+        " [ 6.7  3.3  5.7  2.5]\n",
+        " [ 6.7  3.   5.2  2.3]\n",
+        " [ 6.3  2.5  5.   1.9]\n",
+        " [ 6.5  3.   5.2  2. ]\n",
+        " [ 6.2  3.4  5.4  2.3]\n",
+        " [ 5.9  3.   5.1  1.8]]\n"
+       ]
+      }
+     ],
+     "prompt_number": 5
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "## Machine learning terminology\n",
+      "\n",
+      "- Each row is an **observation** (also known as: sample, example, instance, record)\n",
+      "- Each column is a **feature** (also known as: predictor, attribute, independent variable, input, regressor, covariate)"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# print the names of the four features\n",
+      "print(iris.feature_names)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']\n"
+       ]
+      }
+     ],
+     "prompt_number": 6
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# print integers representing the species of each observation\n",
+      "print(iris.target)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
+        " 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n",
+        " 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n",
+        " 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n",
+        " 2 2]\n"
+       ]
+      }
+     ],
+     "prompt_number": 7
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# print the encoding scheme for species: 0 = setosa, 1 = versicolor, 2 = virginica\n",
+      "print(iris.target_names)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "['setosa' 'versicolor' 'virginica']\n"
+       ]
+      }
+     ],
+     "prompt_number": 8
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "- Each value we are predicting is the **response** (also known as: target, outcome, label, dependent variable)\n",
+      "- **Classification** is supervised learning in which the response is categorical\n",
+      "- **Regression** is supervised learning in which the response is ordered and continuous"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "## Requirements for working with data in scikit-learn\n",
+      "\n",
+      "1. Features and response are **separate objects**\n",
+      "2. Features and response should be **numeric**\n",
+      "3. Features and response should be **NumPy arrays**\n",
+      "4. Features and response should have **specific shapes**"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# check the types of the features and response\n",
+      "print(type(iris.data))\n",
+      "print(type(iris.target))"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "<type 'numpy.ndarray'>\n",
+        "<type 'numpy.ndarray'>\n"
+       ]
+      }
+     ],
+     "prompt_number": 9
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# check the shape of the features (first dimension = number of observations, second dimension = number of features)\n",
+      "print(iris.data.shape)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "(150L, 4L)\n"
+       ]
+      }
+     ],
+     "prompt_number": 10
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# check the shape of the response (single dimension matching the number of observations)\n",
+      "print(iris.target.shape)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "(150L,)\n"
+       ]
+      }
+     ],
+     "prompt_number": 11
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# store feature matrix in \"X\"\n",
+      "X = iris.data\n",
+      "\n",
+      "# store response vector in \"y\"\n",
+      "y = iris.target"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 12
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "## Resources\n",
+      "\n",
+      "- scikit-learn documentation: [Dataset loading utilities](http://scikit-learn.org/stable/datasets/)\n",
+      "- Jake VanderPlas: Fast Numerical Computing with NumPy ([slides](https://speakerdeck.com/jakevdp/losing-your-loops-fast-numerical-computing-with-numpy-pycon-2015), [video](https://www.youtube.com/watch?v=EEUXKG97YRw))\n",
+      "- Scott Shell: [An Introduction to NumPy](http://www.engr.ucsb.edu/~shell/che210d/numpy.pdf) (PDF)"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "## Comments or Questions?\n",
+      "\n",
+      "- Email: <kevin@dataschool.io>\n",
+      "- Website: http://dataschool.io\n",
+      "- Twitter: [@justmarkham](https://twitter.com/justmarkham)"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "from IPython.core.display import HTML\n",
+      "def css_styling():\n",
+      "    with open(\"styles/custom.css\", \"r\") as f:\n",
+      "        return HTML(f.read())\n",
+      "css_styling()"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<style>\n",
+        "    @font-face {\n",
+        "        font-family: \"Computer Modern\";\n",
+        "        src: url('http://mirrors.ctan.org/fonts/cm-unicode/fonts/otf/cmunss.otf');\n",
+        "    }\n",
+        "    div.cell{\n",
+        "        width: 90%;\n",
+        "/*        margin-left:auto;*/\n",
+        "/*        margin-right:auto;*/\n",
+        "    }\n",
+        "    ul {\n",
+        "        line-height: 145%;\n",
+        "        font-size: 90%;\n",
+        "    }\n",
+        "    li {\n",
+        "        margin-bottom: 1em;\n",
+        "    }\n",
+        "    h1 {\n",
+        "        font-family: Helvetica, serif;\n",
+        "    }\n",
+        "    h4{\n",
+        "        margin-top: 12px;\n",
+        "        margin-bottom: 3px;\n",
+        "       }\n",
+        "    div.text_cell_render{\n",
+        "        font-family: Computer Modern, \"Helvetica Neue\", Arial, Helvetica, Geneva, sans-serif;\n",
+        "        line-height: 145%;\n",
+        "        font-size: 130%;\n",
+        "        width: 90%;\n",
+        "        margin-left:auto;\n",
+        "        margin-right:auto;\n",
+        "    }\n",
+        "    .CodeMirror{\n",
+        "            font-family: \"Source Code Pro\", source-code-pro,Consolas, monospace;\n",
+        "    }\n",
+        "/*    .prompt{\n",
+        "        display: None;\n",
+        "    }*/\n",
+        "    .text_cell_render h5 {\n",
+        "        font-weight: 300;\n",
+        "        font-size: 16pt;\n",
+        "        color: #4057A1;\n",
+        "        font-style: italic;\n",
+        "        margin-bottom: 0.5em;\n",
+        "        margin-top: 0.5em;\n",
+        "        display: block;\n",
+        "    }\n",
+        "\n",
+        "    .warning{\n",
+        "        color: rgb( 240, 20, 20 )\n",
+        "        }\n",
+        "</style>\n",
+        "<script>\n",
+        "    MathJax.Hub.Config({\n",
+        "                        TeX: {\n",
+        "                           extensions: [\"AMSmath.js\"]\n",
+        "                           },\n",
+        "                tex2jax: {\n",
+        "                    inlineMath: [ ['$','$'], [\"\\\\(\",\"\\\\)\"] ],\n",
+        "                    displayMath: [ ['$$','$$'], [\"\\\\[\",\"\\\\]\"] ]\n",
+        "                },\n",
+        "                displayAlign: 'center', // Change this to 'center' to center equations.\n",
+        "                \"HTML-CSS\": {\n",
+        "                    styles: {'.MathJax_Display': {\"margin\": 4}}\n",
+        "                }\n",
+        "        });\n",
+        "</script>"
+       ],
+       "metadata": {},
+       "output_type": "pyout",
+       "prompt_number": 1,
+       "text": [
+        "<IPython.core.display.HTML at 0x3e42240>"
+       ]
+      }
+     ],
+     "prompt_number": 1
+    }
+   ],
+   "metadata": {}
+  }
+ ]
+}
--- a/images/03_iris.png
+++ b/images/03_iris.png