diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..2adfa4c
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,2 @@
+.ipynb_checkpoints/
+*.pyc
diff --git a/01_machine_learning_intro.ipynb b/01_machine_learning_intro.ipynb
new file mode 100644
index 0000000..e989db0
--- /dev/null
+++ b/01_machine_learning_intro.ipynb
@@ -0,0 +1,248 @@
+{
+ "metadata": {
+  "name": "",
+  "signature": "sha256:3a45466f81f7926609b8d5a7f9daaac6a202c78255a1369eb02391279866cba5"
+ },
+ "nbformat": 3,
+ "nbformat_minor": 0,
+ "worksheets": [
+  {
+   "cells": [
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "# What is machine learning, and how does it work?\n",
+      "*From the video series: [Introduction to machine learning with scikit-learn](https://github.com/justmarkham/scikit-learn-videos)*"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "![Machine learning](images/01_robot.png)"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "## Agenda\n",
+      "\n",
+      "- What is machine learning?\n",
+      "- What are the two main categories of machine learning?\n",
+      "- What are some examples of machine learning?\n",
+      "- How does machine learning \"work\"?"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "## What is machine learning?\n",
+      "\n",
+      "One definition: \"Machine learning is the semi-automated extraction of knowledge from data\"\n",
+      "\n",
+      "- **Knowledge from data**: Starts with a question that might be answerable using data\n",
+      "- **Automated extraction**: A computer provides the insight\n",
+      "- **Semi-automated**: Requires many smart decisions by a human"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "## What are the two main categories of machine learning?\n",
+      "\n",
+      "**Supervised learning**: Making predictions using data\n",
+      " \n",
+      "- Example: Is a given email \"spam\" or \"ham\"?\n",
+      "- There is an outcome we are trying to predict"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "![Spam filter](images/01_spam_filter.png)"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "**Unsupervised learning**: Extracting structure from data\n",
+      "\n",
+      "- Example: Segment grocery store shoppers into clusters that exhibit similar behaviors\n",
+      "- There is no \"right answer\""
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "![Clustering](images/01_clustering.png)"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "## How does machine learning \"work\"?\n",
+      "\n",
+      "High-level steps of supervised learning:\n",
+      "\n",
+      "1. First, train a **machine learning model** using **labeled data**\n",
+      "\n",
+      "    - \"Labeled data\" has been labeled with the outcome\n",
+      "    - \"Machine learning model\" learns the relationship between the attributes of the data and its outcome\n",
+      "\n",
+      "2. Then, make **predictions** on **new data** for which the label is unknown"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "![Supervised learning](images/01_supervised_learning.png)"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "The primary goal of supervised learning is to build a model that \"generalizes\": It accurately predicts the **future** rather than the **past**!"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "## Questions about machine learning\n",
+      "\n",
+      "- How do I choose **which attributes** of my data to include in the model?\n",
+      "- How do I choose **which model** to use?\n",
+      "- How do I **optimize** this model for best performance?\n",
+      "- How do I ensure that I'm building a model that will **generalize** to unseen data?\n",
+      "- Can I **estimate** how well my model is likely to perform on unseen data?"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "## Resources\n",
+      "\n",
+      "- Book: [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/) (section 2.1, 14 pages)\n",
+      "- Video: [Learning Paradigms](http://work.caltech.edu/library/014.html) (13 minutes)"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "## Comments or Questions?\n",
+      "\n",
+      "- Email: \n",
+      "- Website: http://dataschool.io\n",
+      "- Twitter: [@justmarkham](https://twitter.com/justmarkham)"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "from IPython.core.display import HTML\n",
+      "def css_styling():\n",
+      "    styles = open(\"styles/custom.css\", \"r\").read()\n",
+      "    return HTML(styles)\n",
+      "css_styling()"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "\n",
+        ""
+       ],
+       "metadata": {},
+       "output_type": "pyout",
+       "prompt_number": 1,
+       "text": [
+        ""
+       ]
+      }
+     ],
+     "prompt_number": 1
+    }
+   ],
+   "metadata": {}
+  }
+ ]
+}
\ No newline at end of file
diff --git a/images/01_clustering.png b/images/01_clustering.png
new file mode 100644
index 0000000..e74111e
Binary files /dev/null and b/images/01_clustering.png differ
diff --git a/images/01_robot.png b/images/01_robot.png
new file mode 100644
index 0000000..d22bc09
Binary files /dev/null and b/images/01_robot.png differ
diff --git a/images/01_spam_filter.png b/images/01_spam_filter.png
new file mode 100644
index 0000000..285afda
Binary files /dev/null and b/images/01_spam_filter.png differ
diff --git a/images/01_supervised_learning.png b/images/01_supervised_learning.png
new file mode 100644
index 0000000..87174b5
Binary files /dev/null and b/images/01_supervised_learning.png differ
diff --git a/styles/custom.css b/styles/custom.css
new file mode 100644
index 0000000..29ca1cc
--- /dev/null
+++ b/styles/custom.css
@@ -0,0 +1,67 @@
+
+
\ No newline at end of file