update to scikit-learn 0.23.2 and Python 3.9.1

master
Kevin Markham 2021-03-02 08:53:12 -05:00
parent cec096b944
commit 4e8af9d831
11 changed files with 519 additions and 1229 deletions


@ -4,16 +4,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# What is machine learning, and how does it work? ([video #1](https://www.youtube.com/watch?v=elojMnjn4kk&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=1))\n",
"# What is Machine Learning, and how does it work? ([video #1](https://www.youtube.com/watch?v=elojMnjn4kk&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=1))\n",
"\n",
"Created by [Data School](http://www.dataschool.io/). Watch all 9 videos on [YouTube](https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A). Download the notebooks from [GitHub](https://github.com/justmarkham/scikit-learn-videos)."
"Created by [Data School](https://www.dataschool.io). Watch all 10 videos on [YouTube](https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A). Download the notebooks from [GitHub](https://github.com/justmarkham/scikit-learn-videos)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Machine learning](images/01_robot.png)"
"![Machine Learning](images/01_robot.png)"
]
},
{
@ -22,19 +22,19 @@
"source": [
"## Agenda\n",
"\n",
"- What is machine learning?\n",
"- What are the two main categories of machine learning?\n",
"- What are some examples of machine learning?\n",
"- How does machine learning \"work\"?"
"- What is Machine Learning?\n",
"- What are the two main categories of Machine Learning?\n",
"- What are some examples of Machine Learning?\n",
"- How does Machine Learning \"work\"?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What is machine learning?\n",
"## What is Machine Learning?\n",
"\n",
"One definition: \"Machine learning is the semi-automated extraction of knowledge from data\"\n",
"One definition: \"Machine Learning is the semi-automated extraction of knowledge from data\"\n",
"\n",
"- **Knowledge from data**: Starts with a question that might be answerable using data\n",
"- **Automated extraction**: A computer provides the insight\n",
@ -45,7 +45,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## What are the two main categories of machine learning?\n",
"## What are the two main categories of Machine Learning?\n",
"\n",
"**Supervised learning**: Making predictions using data\n",
" \n",
@ -81,14 +81,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## How does machine learning \"work\"?\n",
"## How does Machine Learning \"work\"?\n",
"\n",
"High-level steps of supervised learning:\n",
"\n",
"1. First, train a **machine learning model** using **labeled data**\n",
"1. First, train a **Machine Learning model** using **labeled data**\n",
"\n",
" - \"Labeled data\" has been labeled with the outcome\n",
" - \"Machine learning model\" learns the relationship between the attributes of the data and its outcome\n",
" - \"Machine Learning model\" learns the relationship between the attributes of the data and its outcome\n",
"\n",
"2. Then, make **predictions** on **new data** for which the label is unknown"
]
@ -111,7 +111,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Questions about machine learning\n",
"## Questions about Machine Learning\n",
"\n",
"- How do I choose **which attributes** of my data to include in the model?\n",
"- How do I choose **which model** to use?\n",
@ -126,7 +126,7 @@
"source": [
"## Resources\n",
"\n",
"- Book: [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/) (section 2.1, 14 pages)\n",
"- Book: [An Introduction to Statistical Learning](https://www.statlearning.com/) (section 2.1, 14 pages)\n",
"- Video: [Learning Paradigms](http://work.caltech.edu/library/014.html) (13 minutes)"
]
},
@ -137,102 +137,9 @@
"## Comments or Questions?\n",
"\n",
"- Email: <kevin@dataschool.io>\n",
"- Website: http://dataschool.io\n",
"- Website: https://www.dataschool.io\n",
"- Twitter: [@justmarkham](https://twitter.com/justmarkham)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>\n",
" @font-face {\n",
" font-family: \"Computer Modern\";\n",
" src: url('http://mirrors.ctan.org/fonts/cm-unicode/fonts/otf/cmunss.otf');\n",
" }\n",
" div.cell{\n",
" width: 90%;\n",
"/* margin-left:auto;*/\n",
"/* margin-right:auto;*/\n",
" }\n",
" ul {\n",
" line-height: 145%;\n",
" font-size: 90%;\n",
" }\n",
" li {\n",
" margin-bottom: 1em;\n",
" }\n",
" h1 {\n",
" font-family: Helvetica, serif;\n",
" }\n",
" h4{\n",
" margin-top: 12px;\n",
" margin-bottom: 3px;\n",
" }\n",
" div.text_cell_render{\n",
" font-family: Computer Modern, \"Helvetica Neue\", Arial, Helvetica, Geneva, sans-serif;\n",
" line-height: 145%;\n",
" font-size: 130%;\n",
" width: 90%;\n",
" margin-left:auto;\n",
" margin-right:auto;\n",
" }\n",
" .CodeMirror{\n",
" font-family: \"Source Code Pro\", source-code-pro,Consolas, monospace;\n",
" }\n",
"/* .prompt{\n",
" display: None;\n",
" }*/\n",
" .text_cell_render h5 {\n",
" font-weight: 300;\n",
" font-size: 16pt;\n",
" color: #4057A1;\n",
" font-style: italic;\n",
" margin-bottom: 0.5em;\n",
" margin-top: 0.5em;\n",
" display: block;\n",
" }\n",
"\n",
" .warning{\n",
" color: rgb( 240, 20, 20 )\n",
" }\n",
"</style>\n",
"<script>\n",
" MathJax.Hub.Config({\n",
" TeX: {\n",
" extensions: [\"AMSmath.js\"]\n",
" },\n",
" tex2jax: {\n",
" inlineMath: [ ['$','$'], [\"\\\\(\",\"\\\\)\"] ],\n",
" displayMath: [ ['$$','$$'], [\"\\\\[\",\"\\\\]\"] ]\n",
" },\n",
" displayAlign: 'center', // Change this to 'center' to center equations.\n",
" \"HTML-CSS\": {\n",
" styles: {'.MathJax_Display': {\"margin\": 4}}\n",
" }\n",
" });\n",
"</script>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from IPython.core.display import HTML\n",
"def css_styling():\n",
" styles = open(\"styles/custom.css\", \"r\").read()\n",
" return HTML(styles)\n",
"css_styling()"
]
}
],
"metadata": {
@ -251,7 +158,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.9.1"
}
},
"nbformat": 4,

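The notebook above describes supervised learning in two steps: train a model on labeled data, then predict on new data whose label is unknown. A minimal sketch of those two steps follows, assuming a toy nearest-neighbor model written in plain Python. The class name `NearestNeighborModel` and every data value are hypothetical and are not part of the notebooks or of scikit-learn.

```python
import math

# Hypothetical labeled data: (height_cm, weight_kg) measurements with an outcome label.
# All values are made up for illustration.
labeled_features = [(150.0, 50.0), (160.0, 55.0), (180.0, 85.0), (190.0, 95.0)]
labels = ["small", "small", "large", "large"]


class NearestNeighborModel:
    """Toy 1-nearest-neighbor model (illustrative only, not scikit-learn's code)."""

    def fit(self, features, outcomes):
        # Step 1: "training" here simply stores the labeled observations.
        self.features = list(features)
        self.outcomes = list(outcomes)
        return self

    def predict(self, new_features):
        # Step 2: each new observation receives the label of its closest stored observation.
        predictions = []
        for point in new_features:
            distances = [math.dist(point, known) for known in self.features]
            nearest_index = distances.index(min(distances))
            predictions.append(self.outcomes[nearest_index])
        return predictions


model = NearestNeighborModel().fit(labeled_features, labels)

# New data for which the label is unknown (values are made up).
predictions = model.predict([(155.0, 52.0), (185.0, 90.0)])
print(predictions)
```

The later notebooks replace this toy class with scikit-learn estimators, which follow the same fit-then-predict shape.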

@ -4,9 +4,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Setting up Python for machine learning: scikit-learn and Jupyter Notebook ([video #2](https://www.youtube.com/watch?v=IsXXlYVBt1M&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=2))\n",
"# Setting up Python for Machine Learning: scikit-learn and Jupyter Notebook ([video #2](https://www.youtube.com/watch?v=IsXXlYVBt1M&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=2))\n",
"\n",
"Created by [Data School](http://www.dataschool.io/). Watch all 9 videos on [YouTube](https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A). Download the notebooks from [GitHub](https://github.com/justmarkham/scikit-learn-videos).\n",
"Created by [Data School](https://www.dataschool.io). Watch all 10 videos on [YouTube](https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A). Download the notebooks from [GitHub](https://github.com/justmarkham/scikit-learn-videos).\n",
"\n",
"**Note:** Since the video recording, the official name of the \"IPython Notebook\" was changed to \"Jupyter Notebook\". However, the functionality is the same."
]
@ -38,7 +38,7 @@
"\n",
"### Benefits:\n",
"\n",
"- **Consistent interface** to machine learning models\n",
"- **Consistent interface** to Machine Learning models\n",
"- Provides many **tuning parameters** but with **sensible defaults**\n",
"- Exceptional **documentation**\n",
"- Rich set of functionality for **companion tasks**\n",
@ -46,14 +46,14 @@
"\n",
"### Potential drawbacks:\n",
"\n",
"- Harder (than R) to **get started with machine learning**\n",
"- Harder (than R) to **get started with Machine Learning**\n",
"- Less emphasis (than R) on **model interpretability**\n",
"\n",
"### Further reading:\n",
"\n",
"- Ben Lorica: [Six reasons why I recommend scikit-learn](http://radar.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html)\n",
"- scikit-learn authors: [API design for machine learning software](http://arxiv.org/pdf/1309.0238v1.pdf)\n",
"- Data School: [Should you teach Python or R for data science?](http://www.dataschool.io/python-or-r-for-data-science/)"
"- Ben Lorica: [Six reasons why I recommend scikit-learn](https://www.oreilly.com/content/six-reasons-why-i-recommend-scikit-learn/)\n",
"- scikit-learn authors: [API design for machine learning software](https://arxiv.org/pdf/1309.0238v1.pdf)\n",
"- Data School: [Should you teach Python or R for data science?](https://www.dataschool.io/python-or-r-for-data-science/)"
]
},
{
@ -69,9 +69,9 @@
"source": [
"## Installing scikit-learn\n",
"\n",
"**Option 1:** [Install scikit-learn library](http://scikit-learn.org/stable/install.html) and dependencies (NumPy and SciPy)\n",
"**Option 1:** [Install scikit-learn library](https://scikit-learn.org/stable/install.html) and dependencies (NumPy and SciPy)\n",
"\n",
"**Option 2:** [Install Anaconda distribution](https://www.anaconda.com/download/) of Python, which includes:\n",
"**Option 2:** [Install Anaconda distribution](https://www.anaconda.com/products/individual) of Python, which includes:\n",
"\n",
"- Hundreds of useful packages (including scikit-learn)\n",
"- IPython and Jupyter Notebook\n",
@ -124,9 +124,9 @@
"\n",
"### IPython, Jupyter, and Markdown resources:\n",
"\n",
"- [nbviewer](http://nbviewer.jupyter.org/): view notebooks online as static documents\n",
"- [IPython documentation](http://ipython.readthedocs.io/en/stable/)\n",
"- [Jupyter Notebook quickstart](http://jupyter.readthedocs.io/en/latest/content-quickstart.html)\n",
"- [nbviewer](https://nbviewer.jupyter.org/): view notebooks online as static documents\n",
"- [IPython documentation](https://ipython.readthedocs.io/en/stable/)\n",
"- [Jupyter Notebook quickstart](https://jupyter.readthedocs.io/en/latest/content-quickstart.html)\n",
"- [GitHub's Mastering Markdown](https://guides.github.com/features/mastering-markdown/): short guide with lots of examples"
]
},
@ -149,102 +149,9 @@
"## Comments or Questions?\n",
"\n",
"- Email: <kevin@dataschool.io>\n",
"- Website: http://dataschool.io\n",
"- Website: https://www.dataschool.io\n",
"- Twitter: [@justmarkham](https://twitter.com/justmarkham)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>\n",
" @font-face {\n",
" font-family: \"Computer Modern\";\n",
" src: url('http://mirrors.ctan.org/fonts/cm-unicode/fonts/otf/cmunss.otf');\n",
" }\n",
" div.cell{\n",
" width: 90%;\n",
"/* margin-left:auto;*/\n",
"/* margin-right:auto;*/\n",
" }\n",
" ul {\n",
" line-height: 145%;\n",
" font-size: 90%;\n",
" }\n",
" li {\n",
" margin-bottom: 1em;\n",
" }\n",
" h1 {\n",
" font-family: Helvetica, serif;\n",
" }\n",
" h4{\n",
" margin-top: 12px;\n",
" margin-bottom: 3px;\n",
" }\n",
" div.text_cell_render{\n",
" font-family: Computer Modern, \"Helvetica Neue\", Arial, Helvetica, Geneva, sans-serif;\n",
" line-height: 145%;\n",
" font-size: 130%;\n",
" width: 90%;\n",
" margin-left:auto;\n",
" margin-right:auto;\n",
" }\n",
" .CodeMirror{\n",
" font-family: \"Source Code Pro\", source-code-pro,Consolas, monospace;\n",
" }\n",
"/* .prompt{\n",
" display: None;\n",
" }*/\n",
" .text_cell_render h5 {\n",
" font-weight: 300;\n",
" font-size: 16pt;\n",
" color: #4057A1;\n",
" font-style: italic;\n",
" margin-bottom: 0.5em;\n",
" margin-top: 0.5em;\n",
" display: block;\n",
" }\n",
"\n",
" .warning{\n",
" color: rgb( 240, 20, 20 )\n",
" }\n",
"</style>\n",
"<script>\n",
" MathJax.Hub.Config({\n",
" TeX: {\n",
" extensions: [\"AMSmath.js\"]\n",
" },\n",
" tex2jax: {\n",
" inlineMath: [ ['$','$'], [\"\\\\(\",\"\\\\)\"] ],\n",
" displayMath: [ ['$$','$$'], [\"\\\\[\",\"\\\\]\"] ]\n",
" },\n",
" displayAlign: 'center', // Change this to 'center' to center equations.\n",
" \"HTML-CSS\": {\n",
" styles: {'.MathJax_Display': {\"margin\": 4}}\n",
" }\n",
" });\n",
"</script>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from IPython.core.display import HTML\n",
"def css_styling():\n",
" styles = open(\"styles/custom.css\", \"r\").read()\n",
" return HTML(styles)\n",
"css_styling()"
]
}
],
"metadata": {
@ -263,7 +170,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.9.1"
}
},
"nbformat": 4,

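Because this commit pins the notebooks to specific Python and scikit-learn versions, it can help to confirm what is installed after following either installation option. The sketch below is one way to do that; the helper name `installed_version` is an assumption made for illustration, and the package list simply mirrors the dependencies named in the notebook (NumPy, SciPy, scikit-learn).

```python
import importlib
import sys


def installed_version(package_name):
    """Return the package's version string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(package_name)
    except ImportError:
        return None
    # Most scientific packages expose __version__; fall back if one does not.
    return getattr(module, "__version__", "unknown")


print("Python:", sys.version.split()[0])
for name in ("numpy", "scipy", "sklearn"):
    print(name + ":", installed_version(name))
```

A result of `None` for any package suggests that it is not installed in the active environment.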

@ -6,9 +6,9 @@
"source": [
"# Getting started in scikit-learn with the famous iris dataset ([video #3](https://www.youtube.com/watch?v=hd1W4CyPX58&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=3))\n",
"\n",
"Created by [Data School](http://www.dataschool.io/). Watch all 9 videos on [YouTube](https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A). Download the notebooks from [GitHub](https://github.com/justmarkham/scikit-learn-videos).\n",
"Created by [Data School](https://www.dataschool.io). Watch all 10 videos on [YouTube](https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A). Download the notebooks from [GitHub](https://github.com/justmarkham/scikit-learn-videos).\n",
"\n",
"**Note:** This notebook uses Python 3.6 and scikit-learn 0.19.1. The original notebook (shown in the video) used Python 2.7 and scikit-learn 0.16, and can be downloaded from the [archive branch](https://github.com/justmarkham/scikit-learn-videos/tree/archive)."
"**Note:** This notebook uses Python 3.9.1 and scikit-learn 0.23.2. The original notebook (shown in the video) used Python 2.7 and scikit-learn 0.16."
]
},
{
@ -17,9 +17,9 @@
"source": [
"## Agenda\n",
"\n",
"- What is the famous iris dataset, and how does it relate to machine learning?\n",
"- What is the famous iris dataset, and how does it relate to Machine Learning?\n",
"- How do we load the iris dataset into scikit-learn?\n",
"- How do we describe a dataset using machine learning terminology?\n",
"- How do we describe a dataset using Machine Learning terminology?\n",
"- What are scikit-learn's four key requirements for working with data?"
]
},
@ -47,8 +47,19 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# added empty cell so that the cell numbering matches the video"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
@ -57,14 +68,14 @@
" <iframe\n",
" width=\"300\"\n",
" height=\"200\"\n",
" src=\"http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data\"\n",
" src=\"https://www.dataschool.io/files/iris.txt\"\n",
" frameborder=\"0\"\n",
" allowfullscreen\n",
" ></iframe>\n",
" "
],
"text/plain": [
"<IPython.lib.display.IFrame at 0x10caa2470>"
"<IPython.lib.display.IFrame at 0x7fe408230e80>"
]
},
"execution_count": 2,
@ -74,17 +85,17 @@
],
"source": [
"from IPython.display import IFrame\n",
"IFrame('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', width=300, height=200)"
"IFrame('https://www.dataschool.io/files/iris.txt', width=300, height=200)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Machine learning on the iris dataset\n",
"## Machine Learning on the iris dataset\n",
"\n",
"- Framed as a **supervised learning** problem: Predict the species of an iris using the measurements\n",
"- Famous dataset for machine learning because prediction is **easy**\n",
"- Famous dataset for Machine Learning because prediction is **easy**\n",
"- Learn more about the iris dataset: [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Iris)"
]
},
@ -130,7 +141,9 @@
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
@ -170,10 +183,10 @@
" [5.4 3.4 1.5 0.4]\n",
" [5.2 4.1 1.5 0.1]\n",
" [5.5 4.2 1.4 0.2]\n",
" [4.9 3.1 1.5 0.1]\n",
" [4.9 3.1 1.5 0.2]\n",
" [5. 3.2 1.2 0.2]\n",
" [5.5 3.5 1.3 0.2]\n",
" [4.9 3.1 1.5 0.1]\n",
" [4.9 3.6 1.4 0.1]\n",
" [4.4 3. 1.3 0.2]\n",
" [5.1 3.4 1.5 0.2]\n",
" [5. 3.5 1.3 0.3]\n",
@ -298,7 +311,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Machine learning terminology\n",
"## Machine Learning terminology\n",
"\n",
"- Each row is an **observation** (also known as: sample, example, instance, record)\n",
"- Each column is a **feature** (also known as: predictor, attribute, independent variable, input, regressor, covariate)"
@ -378,7 +391,7 @@
"## Requirements for working with data in scikit-learn\n",
"\n",
"1. Features and response are **separate objects**\n",
"2. Features and response should be **numeric**\n",
"2. Features should always be **numeric**, and response should be **numeric** for regression problems\n",
"3. Features and response should be **NumPy arrays**\n",
"4. Features and response should have **specific shapes**"
]
@ -458,9 +471,9 @@
"source": [
"## Resources\n",
"\n",
"- scikit-learn documentation: [Dataset loading utilities](http://scikit-learn.org/stable/datasets/)\n",
"- scikit-learn documentation: [Dataset loading utilities](https://scikit-learn.org/stable/datasets.html)\n",
"- Jake VanderPlas: Fast Numerical Computing with NumPy ([slides](https://speakerdeck.com/jakevdp/losing-your-loops-fast-numerical-computing-with-numpy-pycon-2015), [video](https://www.youtube.com/watch?v=EEUXKG97YRw))\n",
"- Scott Shell: [An Introduction to NumPy](http://www.engr.ucsb.edu/~shell/che210d/numpy.pdf) (PDF)"
"- Scott Shell: [An Introduction to NumPy](https://sites.engineering.ucsb.edu/~shell/che210d/numpy.pdf) (PDF)"
]
},
{
@ -470,102 +483,9 @@
"## Comments or Questions?\n",
"\n",
"- Email: <kevin@dataschool.io>\n",
"- Website: http://dataschool.io\n",
"- Website: https://www.dataschool.io\n",
"- Twitter: [@justmarkham](https://twitter.com/justmarkham)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>\n",
" @font-face {\n",
" font-family: \"Computer Modern\";\n",
" src: url('http://mirrors.ctan.org/fonts/cm-unicode/fonts/otf/cmunss.otf');\n",
" }\n",
" div.cell{\n",
" width: 90%;\n",
"/* margin-left:auto;*/\n",
"/* margin-right:auto;*/\n",
" }\n",
" ul {\n",
" line-height: 145%;\n",
" font-size: 90%;\n",
" }\n",
" li {\n",
" margin-bottom: 1em;\n",
" }\n",
" h1 {\n",
" font-family: Helvetica, serif;\n",
" }\n",
" h4{\n",
" margin-top: 12px;\n",
" margin-bottom: 3px;\n",
" }\n",
" div.text_cell_render{\n",
" font-family: Computer Modern, \"Helvetica Neue\", Arial, Helvetica, Geneva, sans-serif;\n",
" line-height: 145%;\n",
" font-size: 130%;\n",
" width: 90%;\n",
" margin-left:auto;\n",
" margin-right:auto;\n",
" }\n",
" .CodeMirror{\n",
" font-family: \"Source Code Pro\", source-code-pro,Consolas, monospace;\n",
" }\n",
"/* .prompt{\n",
" display: None;\n",
" }*/\n",
" .text_cell_render h5 {\n",
" font-weight: 300;\n",
" font-size: 16pt;\n",
" color: #4057A1;\n",
" font-style: italic;\n",
" margin-bottom: 0.5em;\n",
" margin-top: 0.5em;\n",
" display: block;\n",
" }\n",
"\n",
" .warning{\n",
" color: rgb( 240, 20, 20 )\n",
" }\n",
"</style>\n",
"<script>\n",
" MathJax.Hub.Config({\n",
" TeX: {\n",
" extensions: [\"AMSmath.js\"]\n",
" },\n",
" tex2jax: {\n",
" inlineMath: [ ['$','$'], [\"\\\\(\",\"\\\\)\"] ],\n",
" displayMath: [ ['$$','$$'], [\"\\\\[\",\"\\\\]\"] ]\n",
" },\n",
" displayAlign: 'center', // Change this to 'center' to center equations.\n",
" \"HTML-CSS\": {\n",
" styles: {'.MathJax_Display': {\"margin\": 4}}\n",
" }\n",
" });\n",
"</script>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from IPython.core.display import HTML\n",
"def css_styling():\n",
" styles = open(\"styles/custom.css\", \"r\").read()\n",
" return HTML(styles)\n",
"css_styling()"
]
}
],
"metadata": {
@ -584,7 +504,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.9.1"
}
},
"nbformat": 4,

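The four requirements listed in the iris notebook (separate objects, numeric values, NumPy arrays, specific shapes) can be checked directly on the dataset. The following is a minimal sketch, assuming scikit-learn and NumPy are installed; the variable names `X` and `y` follow the convention used elsewhere in these notebooks.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Requirement 1: features and response are separate objects.
X = iris.data
y = iris.target

# Requirement 2: both are numeric here (the species are encoded as integers).
print(X.dtype, y.dtype)

# Requirement 3: both are NumPy arrays.
print(isinstance(X, np.ndarray), isinstance(y, np.ndarray))

# Requirement 4: X is 2-D (observations x features), y is 1-D (one value per observation).
print(X.shape)  # (150, 4)
print(y.shape)  # (150,)
```

The first dimension of `X` must match the length of `y`, since each observation needs exactly one response value.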

@ -4,11 +4,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Training a machine learning model with scikit-learn ([video #4](https://www.youtube.com/watch?v=RlQuVL6-qe8&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=4))\n",
"# Training a Machine Learning model with scikit-learn ([video #4](https://www.youtube.com/watch?v=RlQuVL6-qe8&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=4))\n",
"\n",
"Created by [Data School](http://www.dataschool.io/). Watch all 9 videos on [YouTube](https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A). Download the notebooks from [GitHub](https://github.com/justmarkham/scikit-learn-videos).\n",
"Created by [Data School](https://www.dataschool.io). Watch all 10 videos on [YouTube](https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A). Download the notebooks from [GitHub](https://github.com/justmarkham/scikit-learn-videos).\n",
"\n",
"**Note:** This notebook uses Python 3.6 and scikit-learn 0.19.1. The original notebook (shown in the video) used Python 2.7 and scikit-learn 0.16, and can be downloaded from the [archive branch](https://github.com/justmarkham/scikit-learn-videos/tree/archive)."
"**Note:** This notebook uses Python 3.9.1 and scikit-learn 0.23.2. The original notebook (shown in the video) used Python 2.7 and scikit-learn 0.16."
]
},
{
@ -19,7 +19,7 @@
"\n",
"- What is the **K-nearest neighbors** classification model?\n",
"- What are the four steps for **model training and prediction** in scikit-learn?\n",
"- How can I apply this pattern to **other machine learning models**?"
"- How can I apply this pattern to **other Machine Learning models**?"
]
},
{
@ -29,6 +29,15 @@
"## Reviewing the iris dataset"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# added empty cell so that the cell numbering matches the video"
]
},
{
"cell_type": "code",
"execution_count": 2,
@ -41,14 +50,14 @@
" <iframe\n",
" width=\"300\"\n",
" height=\"200\"\n",
" src=\"http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data\"\n",
" src=\"https://www.dataschool.io/files/iris.txt\"\n",
" frameborder=\"0\"\n",
" allowfullscreen\n",
" ></iframe>\n",
" "
],
"text/plain": [
"<IPython.lib.display.IFrame at 0x10fb4e4a8>"
"<IPython.lib.display.IFrame at 0x7f8c18558700>"
]
},
"execution_count": 2,
@ -58,7 +67,7 @@
],
"source": [
"from IPython.display import IFrame\n",
"IFrame('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', width=300, height=200)"
"IFrame('https://www.dataschool.io/files/iris.txt', width=300, height=200)"
]
},
{
@ -119,7 +128,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"*Image Credits: [Data3classes](http://commons.wikimedia.org/wiki/File:Data3classes.png#/media/File:Data3classes.png), [Map1NN](http://commons.wikimedia.org/wiki/File:Map1NN.png#/media/File:Map1NN.png), [Map5NN](http://commons.wikimedia.org/wiki/File:Map5NN.png#/media/File:Map5NN.png) by Agor153. Licensed under CC BY-SA 3.0*"
"*Image Credits: [Data3classes](https://commons.wikimedia.org/wiki/File:Data3classes.png#/media/File:Data3classes.png), [Map1NN](https://commons.wikimedia.org/wiki/File:Map1NN.png#/media/File:Map1NN.png), [Map5NN](https://commons.wikimedia.org/wiki/File:Map5NN.png#/media/File:Map5NN.png) by Agor153. Licensed under CC BY-SA 3.0*"
]
},
{
@ -228,9 +237,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
" metric_params=None, n_jobs=1, n_neighbors=1, p=2,\n",
" weights='uniform')\n"
"KNeighborsClassifier(n_neighbors=1)\n"
]
}
],
@ -256,9 +263,7 @@
{
"data": {
"text/plain": [
"KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
" metric_params=None, n_jobs=1, n_neighbors=1, p=2,\n",
" weights='uniform')"
"KNeighborsClassifier(n_neighbors=1)"
]
},
"execution_count": 8,
@ -390,8 +395,8 @@
"# import the class\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"# instantiate the model (using the default parameters)\n",
"logreg = LogisticRegression()\n",
"# instantiate the model\n",
"logreg = LogisticRegression(solver='liblinear')\n",
"\n",
"# fit the model with data\n",
"logreg.fit(X, y)\n",
@ -406,9 +411,9 @@
"source": [
"## Resources\n",
"\n",
"- [Nearest Neighbors](http://scikit-learn.org/stable/modules/neighbors.html) (user guide), [KNeighborsClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) (class documentation)\n",
"- [Logistic Regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) (user guide), [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) (class documentation)\n",
"- [Videos from An Introduction to Statistical Learning](http://www.dataschool.io/15-hours-of-expert-machine-learning-videos/)\n",
"- [Nearest Neighbors](https://scikit-learn.org/stable/modules/neighbors.html) (user guide), [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) (class documentation)\n",
"- [Logistic Regression](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) (user guide), [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) (class documentation)\n",
"- [Videos from An Introduction to Statistical Learning](https://www.dataschool.io/15-hours-of-expert-machine-learning-videos/)\n",
" - Classification Problems and K-Nearest Neighbors (Chapter 2)\n",
" - Introduction to Classification (Chapter 4)\n",
" - Logistic Regression and Maximum Likelihood (Chapter 4)"
@ -421,102 +426,9 @@
"## Comments or Questions?\n",
"\n",
"- Email: <kevin@dataschool.io>\n",
"- Website: http://dataschool.io\n",
"- Website: https://www.dataschool.io\n",
"- Twitter: [@justmarkham](https://twitter.com/justmarkham)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>\n",
" @font-face {\n",
" font-family: \"Computer Modern\";\n",
" src: url('http://mirrors.ctan.org/fonts/cm-unicode/fonts/otf/cmunss.otf');\n",
" }\n",
" div.cell{\n",
" width: 90%;\n",
"/* margin-left:auto;*/\n",
"/* margin-right:auto;*/\n",
" }\n",
" ul {\n",
" line-height: 145%;\n",
" font-size: 90%;\n",
" }\n",
" li {\n",
" margin-bottom: 1em;\n",
" }\n",
" h1 {\n",
" font-family: Helvetica, serif;\n",
" }\n",
" h4{\n",
" margin-top: 12px;\n",
" margin-bottom: 3px;\n",
" }\n",
" div.text_cell_render{\n",
" font-family: Computer Modern, \"Helvetica Neue\", Arial, Helvetica, Geneva, sans-serif;\n",
" line-height: 145%;\n",
" font-size: 130%;\n",
" width: 90%;\n",
" margin-left:auto;\n",
" margin-right:auto;\n",
" }\n",
" .CodeMirror{\n",
" font-family: \"Source Code Pro\", source-code-pro,Consolas, monospace;\n",
" }\n",
"/* .prompt{\n",
" display: None;\n",
" }*/\n",
" .text_cell_render h5 {\n",
" font-weight: 300;\n",
" font-size: 16pt;\n",
" color: #4057A1;\n",
" font-style: italic;\n",
" margin-bottom: 0.5em;\n",
" margin-top: 0.5em;\n",
" display: block;\n",
" }\n",
"\n",
" .warning{\n",
" color: rgb( 240, 20, 20 )\n",
" }\n",
"</style>\n",
"<script>\n",
" MathJax.Hub.Config({\n",
" TeX: {\n",
" extensions: [\"AMSmath.js\"]\n",
" },\n",
" tex2jax: {\n",
" inlineMath: [ ['$','$'], [\"\\\\(\",\"\\\\)\"] ],\n",
" displayMath: [ ['$$','$$'], [\"\\\\[\",\"\\\\]\"] ]\n",
" },\n",
" displayAlign: 'center', // Change this to 'center' to center equations.\n",
" \"HTML-CSS\": {\n",
" styles: {'.MathJax_Display': {\"margin\": 4}}\n",
" }\n",
" });\n",
"</script>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from IPython.core.display import HTML\n",
"def css_styling():\n",
" styles = open(\"styles/custom.css\", \"r\").read()\n",
" return HTML(styles)\n",
"css_styling()"
]
}
],
"metadata": {
@ -535,7 +447,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.9.1"
}
},
"nbformat": 4,

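The model-training notebook applies the same pattern (import the class, instantiate, fit, predict) to two different estimators. A minimal sketch of that pattern follows, using the `KNeighborsClassifier(n_neighbors=1)` and `LogisticRegression(solver='liblinear')` settings that appear in the diff. The measurements in `X_new` are hypothetical values chosen for illustration.

```python
# Step 1: import the classes
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Load features and response as separate NumPy arrays.
X, y = load_iris(return_X_y=True)

# Hypothetical new observations (sepal length, sepal width, petal length, petal width).
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]

# Step 2: instantiate the estimator
knn = KNeighborsClassifier(n_neighbors=1)
# Step 3: fit the model with data
knn.fit(X, y)
# Step 4: predict the response for new observations
knn_pred = knn.predict(X_new)

# The same four steps with a different model; only the class and its parameters change.
logreg = LogisticRegression(solver='liblinear')
logreg.fit(X, y)
logreg_pred = logreg.predict(X_new)

print(knn_pred)
print(logreg_pred)
```

Each prediction is an integer code (0, 1, or 2) for one of the three iris species. The two models are not guaranteed to agree on every new observation.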
File diff suppressed because one or more lines are too long



@ -4,11 +4,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Encoding categorical features ([video #10](https://www.youtube.com/watch?v=irHhDMbw3xo&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=10))\n",
"# Building a Machine Learning workflow ([video #10](https://www.youtube.com/watch?v=irHhDMbw3xo&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=10))\n",
"\n",
"Created by [Data School](https://www.dataschool.io). Watch all 10 videos on [YouTube](https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A). Download the notebooks from [GitHub](https://github.com/justmarkham/scikit-learn-videos).\n",
"\n",
"**Note:** This notebook uses scikit-learn 0.20. Some of the code below will not work if you are using an earlier version of scikit-learn."
"**Note:** This notebook uses Python 3.9.1 and scikit-learn 0.23.2. The original notebook (shown in the video) used Python 3.7 and scikit-learn 0.20.2."
]
},
{
@ -297,33 +297,71 @@
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(889, 1)\n",
"(889,)\n"
]
"data": {
"text/plain": [
"(889, 1)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(X.shape)\n",
"print(y.shape)"
"X.shape"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"outputs": [
{
"data": {
"text/plain": [
"(889,)"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"logreg = LogisticRegression(solver='lbfgs')"
"y.shape"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LogisticRegression"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"logreg = LogisticRegression()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import cross_val_score"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
@ -331,19 +369,18 @@
"0.6783406335301212"
]
},
"execution_count": 13,
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.model_selection import cross_val_score\n",
"cross_val_score(logreg, X, y, cv=5, scoring='accuracy').mean()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 17,
"metadata": {},
"outputs": [
{
@ -354,7 +391,7 @@
"Name: Survived, dtype: float64"
]
},
"execution_count": 14,
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
@ -372,7 +409,7 @@
},
{
"cell_type": "code",
"execution_count": 15,
"execution_count": 18,
"metadata": {},
"outputs": [
{
@ -451,7 +488,7 @@
"4 0 3 male S"
]
},
"execution_count": 15,
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
@ -462,7 +499,7 @@
},
{
"cell_type": "code",
"execution_count": 16,
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
@ -473,7 +510,7 @@
},
{
"cell_type": "code",
"execution_count": 17,
"execution_count": 20,
"metadata": {},
"outputs": [
{
@ -488,7 +525,7 @@
" [0., 1.]])"
]
},
"execution_count": 17,
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
@ -499,7 +536,7 @@
},
{
"cell_type": "code",
"execution_count": 18,
"execution_count": 21,
"metadata": {},
"outputs": [
{
@ -508,7 +545,7 @@
"[array(['female', 'male'], dtype=object)]"
]
},
"execution_count": 18,
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
@ -519,7 +556,7 @@
},
{
"cell_type": "code",
"execution_count": 19,
"execution_count": 22,
"metadata": {},
"outputs": [
{
@ -534,7 +571,7 @@
" [0., 1., 0.]])"
]
},
"execution_count": 19,
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
@ -545,7 +582,7 @@
},
{
"cell_type": "code",
"execution_count": 20,
"execution_count": 23,
"metadata": {},
"outputs": [
{
@ -554,7 +591,7 @@
"[array(['C', 'Q', 'S'], dtype=object)]"
]
},
"execution_count": 20,
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
@ -563,32 +600,6 @@
"ohe.categories_"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0., 1., 0., 0., 1.],\n",
" [1., 0., 1., 0., 0.],\n",
" [1., 0., 0., 0., 1.],\n",
" ...,\n",
" [1., 0., 0., 0., 1.],\n",
" [0., 1., 1., 0., 0.],\n",
" [0., 1., 0., 1., 0.]])"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ohe.fit_transform(df[['Sex', 'Embarked']])"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -598,7 +609,7 @@
},
{
"cell_type": "code",
"execution_count": 22,
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
@ -607,7 +618,7 @@
},
{
"cell_type": "code",
"execution_count": 23,
"execution_count": 25,
"metadata": {},
"outputs": [
{
@ -680,7 +691,7 @@
"4 3 male S"
]
},
"execution_count": 23,
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
@ -691,7 +702,7 @@
},
{
"cell_type": "code",
"execution_count": 24,
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
@ -701,7 +712,7 @@
},
{
"cell_type": "code",
"execution_count": 25,
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
@ -712,7 +723,7 @@
},
{
"cell_type": "code",
"execution_count": 26,
"execution_count": 28,
"metadata": {},
"outputs": [
{
@ -727,7 +738,7 @@
" [0., 1., 0., 1., 0., 3.]])"
]
},
"execution_count": 26,
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
@ -738,7 +749,7 @@
},
{
"cell_type": "code",
"execution_count": 27,
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
@ -748,7 +759,7 @@
},
{
"cell_type": "code",
"execution_count": 28,
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
@ -757,7 +768,7 @@
},
{
"cell_type": "code",
"execution_count": 29,
"execution_count": 31,
"metadata": {},
"outputs": [
{
@ -766,7 +777,7 @@
"0.7727924839713071"
]
},
"execution_count": 29,
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
@ -786,7 +797,16 @@
},
{
"cell_type": "code",
"execution_count": 30,
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"# added empty cell so that the cell numbering matches the video"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"scrolled": true
},
@ -861,7 +881,7 @@
"790 3 male Q"
]
},
"execution_count": 30,
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
@ -873,7 +893,7 @@
},
{
"cell_type": "code",
"execution_count": 31,
"execution_count": 34,
"metadata": {
"scrolled": true
},
@ -881,15 +901,15 @@
{
"data": {
"text/plain": [
"Pipeline(memory=None,\n",
" steps=[('columntransformer', ColumnTransformer(n_jobs=None, remainder='passthrough', sparse_threshold=0.3,\n",
" transformer_weights=None,\n",
" transformers=[('onehotencoder', OneHotEncoder(categorical_features=None, categories=None,\n",
" dtype=<class 'numpy.float64'>, handle_unknown='error...enalty='l2', random_state=None, solver='lbfgs',\n",
" tol=0.0001, verbose=0, warm_start=False))])"
"Pipeline(steps=[('columntransformer',\n",
" ColumnTransformer(remainder='passthrough',\n",
" transformers=[('onehotencoder',\n",
" OneHotEncoder(),\n",
" ['Sex', 'Embarked'])])),\n",
" ('logisticregression', LogisticRegression())])"
]
},
"execution_count": 31,
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
@ -900,7 +920,7 @@
},
{
"cell_type": "code",
"execution_count": 32,
"execution_count": 35,
"metadata": {},
"outputs": [
{
@ -909,7 +929,7 @@
"array([1, 0, 1, 1, 0])"
]
},
"execution_count": 32,
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
@ -927,7 +947,7 @@
},
{
"cell_type": "code",
"execution_count": 33,
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
@ -941,7 +961,7 @@
},
{
"cell_type": "code",
"execution_count": 34,
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
@ -953,7 +973,7 @@
},
{
"cell_type": "code",
"execution_count": 35,
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
@ -965,67 +985,13 @@
},
{
"cell_type": "code",
"execution_count": 36,
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"pipe = make_pipeline(column_trans, logreg)"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.7727924839713071"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"X_new = X.sample(5, random_state=99)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Pipeline(memory=None,\n",
" steps=[('columntransformer', ColumnTransformer(n_jobs=None, remainder='passthrough', sparse_threshold=0.3,\n",
" transformer_weights=None,\n",
" transformers=[('onehotencoder', OneHotEncoder(categorical_features=None, categories=None,\n",
" dtype=<class 'numpy.float64'>, handle_unknown='error...enalty='l2', random_state=None, solver='lbfgs',\n",
" tol=0.0001, verbose=0, warm_start=False))])"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe.fit(X, y)"
]
},
{
"cell_type": "code",
"execution_count": 40,
@ -1034,7 +1000,7 @@
{
"data": {
"text/plain": [
"array([1, 0, 1, 1, 0])"
"0.7727924839713071"
]
},
"execution_count": 40,
@ -1043,8 +1009,49 @@
}
],
"source": [
"cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"X_new = X.sample(5, random_state=99)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 0, 1, 1, 0])"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe.fit(X, y)\n",
"pipe.predict(X_new)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Comments or Questions?\n",
"\n",
"- Email: <kevin@dataschool.io>\n",
"- Website: https://www.dataschool.io\n",
"- Twitter: [@justmarkham](https://twitter.com/justmarkham)"
]
}
],
"metadata": {
@ -1063,7 +1070,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
"version": "3.9.1"
}
},
"nbformat": 4,

@ -1,41 +1,41 @@
# Introduction to machine learning with scikit-learn
# Introduction to Machine Learning with scikit-learn
This video series will teach you how to solve machine learning problems using Python's popular scikit-learn library. There are **10 video tutorials** totaling 4.5 hours, each with a corresponding **Jupyter notebook**. The notebook contains everything you see in the video: code, output, images, and comments.
This video series will teach you how to solve Machine Learning problems using Python's popular scikit-learn library. There are **10 video tutorials** totaling 4.5 hours, each with a corresponding **Jupyter notebook**. The notebook contains everything you see in the video: code, output, images, and comments.
**Note:** The notebooks in this repository have been updated to use Python 3.6 and scikit-learn 0.19.1. The original notebooks (shown in the video) used Python 2.7 and scikit-learn 0.16, and can be downloaded from the [archive branch](https://github.com/justmarkham/scikit-learn-videos/tree/archive). You can read about how I updated the code in this [blog post](https://www.dataschool.io/how-to-update-your-scikit-learn-code-for-2018/).
**Note:** The notebooks in this repository have been updated to use Python 3.9.1 and scikit-learn 0.23.2. The original notebooks (shown in the video) used Python 2.7 and scikit-learn 0.16, and can be downloaded from the [archive branch](https://github.com/justmarkham/scikit-learn-videos/tree/archive). You can read about how I updated the code in this [blog post](https://www.dataschool.io/how-to-update-your-scikit-learn-code-for-2018/).
You can [watch the entire series](https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A) on YouTube, and [view all of the notebooks](http://nbviewer.jupyter.org/github/justmarkham/scikit-learn-videos/tree/master/) using nbviewer.
[![Watch the first tutorial video](images/youtube.png)](https://www.youtube.com/watch?v=elojMnjn4kk&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=1 "Watch the first tutorial video")
Once you complete this video series, I recommend enrolling in my online course, [Machine Learning with Text in Python](http://www.dataschool.io/learn/), to gain a deeper understanding of scikit-learn and Natural Language Processing.
Once you complete this video series, I recommend enrolling in my online course, [Machine Learning with Text in Python](https://www.dataschool.io/learn/), to gain a deeper understanding of scikit-learn and Natural Language Processing.
## Table of Contents
1. What is machine learning, and how does it work? ([video](https://www.youtube.com/watch?v=elojMnjn4kk&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=1), [notebook](01_machine_learning_intro.ipynb))
- What is machine learning?
- What are the two main categories of machine learning?
- What are some examples of machine learning?
- How does machine learning "work"?
1. What is Machine Learning, and how does it work? ([video](https://www.youtube.com/watch?v=elojMnjn4kk&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=1), [notebook](01_machine_learning_intro.ipynb))
- What is Machine Learning?
- What are the two main categories of Machine Learning?
- What are some examples of Machine Learning?
- How does Machine Learning "work"?
2. Setting up Python for machine learning: scikit-learn and Jupyter Notebook ([video](https://www.youtube.com/watch?v=IsXXlYVBt1M&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=2), [notebook](02_machine_learning_setup.ipynb))
2. Setting up Python for Machine Learning: scikit-learn and Jupyter Notebook ([video](https://www.youtube.com/watch?v=IsXXlYVBt1M&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=2), [notebook](02_machine_learning_setup.ipynb))
- What are the benefits and drawbacks of scikit-learn?
- How do I install scikit-learn?
- How do I use the Jupyter Notebook?
- What are some good resources for learning Python?
3. Getting started in scikit-learn with the famous iris dataset ([video](https://www.youtube.com/watch?v=hd1W4CyPX58&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=3), [notebook](03_getting_started_with_iris.ipynb))
- What is the famous iris dataset, and how does it relate to machine learning?
- What is the famous iris dataset, and how does it relate to Machine Learning?
- How do we load the iris dataset into scikit-learn?
- How do we describe a dataset using machine learning terminology?
- How do we describe a dataset using Machine Learning terminology?
- What are scikit-learn's four key requirements for working with data?
4. Training a machine learning model with scikit-learn ([video](https://www.youtube.com/watch?v=RlQuVL6-qe8&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=4), [notebook](04_model_training.ipynb))
4. Training a Machine Learning model with scikit-learn ([video](https://www.youtube.com/watch?v=RlQuVL6-qe8&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=4), [notebook](04_model_training.ipynb))
- What is the K-nearest neighbors classification model?
- What are the four steps for model training and prediction in scikit-learn?
- How can I apply this pattern to other machine learning models?
- How can I apply this pattern to other Machine Learning models?
5. Comparing machine learning models in scikit-learn ([video](https://www.youtube.com/watch?v=0pP4EwWJgIU&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=5), [notebook](05_model_evaluation.ipynb))
5. Comparing Machine Learning models in scikit-learn ([video](https://www.youtube.com/watch?v=0pP4EwWJgIU&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=5), [notebook](05_model_evaluation.ipynb))
- How do I choose which model to use for my supervised learning task?
- How do I choose the best tuning parameters for that model?
- How do I estimate the likely performance of my model on out-of-sample data?
@ -70,7 +70,7 @@ Once you complete this video series, I recommend enrolling in my online course,
- What is the purpose of an ROC curve?
- How does Area Under the Curve (AUC) differ from classification accuracy?
10. Encoding categorical features ([video](https://www.youtube.com/watch?v=irHhDMbw3xo&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=10), [notebook](10_categorical_features.ipynb))
10. Building a Machine Learning workflow ([video](https://www.youtube.com/watch?v=irHhDMbw3xo&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=10), [notebook](10_categorical_features.ipynb))
- Why should you use a Pipeline?
- How do you encode categorical features with OneHotEncoder?
- How do you apply OneHotEncoder to selected columns with ColumnTransformer?
@ -80,7 +80,7 @@ Once you complete this video series, I recommend enrolling in my online course,
## Bonus Video
At the PyCon 2016 conference, I taught a **3-hour tutorial** that builds upon this video series and focuses on **text-based data**. You can watch the [tutorial video](https://www.youtube.com/watch?v=ZiKMIuYidY0&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=10) on YouTube.
At the PyCon 2016 conference, I taught a **3-hour tutorial** that builds upon this video series and focuses on **text-based data**. You can watch the [tutorial video](https://www.youtube.com/watch?v=ZiKMIuYidY0&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=11) on YouTube.
Here are the topics I covered: