Jupyter notebooks from the scikit-learn video series. YouTube playlist: https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A
 
 
Kevin Markham e5a6565083 update gitignore 2018-06-28 13:42:55 -04:00
images rewrite README and add thumbnail 2016-08-09 16:21:56 -04:00
styles add first notebook and supporting files 2015-04-08 00:49:52 -04:00
.gitignore update gitignore 2018-06-28 13:42:55 -04:00
01_machine_learning_intro.ipynb convert notebooks to v4, update links, make compatible with python 3 and sklearn 0.17 2016-08-03 14:47:50 -04:00
02_machine_learning_setup.ipynb convert notebooks to v4, update links, make compatible with python 3 and sklearn 0.17 2016-08-03 14:47:50 -04:00
03_getting_started_with_iris.ipynb convert notebooks to v4, update links, make compatible with python 3 and sklearn 0.17 2016-08-03 14:47:50 -04:00
04_model_training.ipynb convert notebooks to v4, update links, make compatible with python 3 and sklearn 0.17 2016-08-03 14:47:50 -04:00
05_model_evaluation.ipynb convert notebooks to v4, update links, make compatible with python 3 and sklearn 0.17 2016-08-03 14:47:50 -04:00
06_linear_regression.ipynb python-3-zip fix 2016-10-22 21:03:09 -04:00
07_cross_validation.ipynb convert notebooks to v4, update links, make compatible with python 3 and sklearn 0.17 2016-08-03 14:47:50 -04:00
08_grid_search.ipynb convert notebooks to v4, update links, make compatible with python 3 and sklearn 0.17 2016-08-03 14:47:50 -04:00
09_classification_metrics.ipynb convert notebooks to v4, update links, make compatible with python 3 and sklearn 0.17 2016-08-03 14:47:50 -04:00
README.md remove binder link and requirements file 2018-06-28 13:40:05 -04:00

README.md

Introduction to machine learning with scikit-learn

This video series will teach you how to solve machine learning problems using Python's popular scikit-learn library. It was featured on Kaggle's blog in 2015.

There are 9 video tutorials totaling 4 hours, each with a corresponding Jupyter notebook. Each notebook contains everything you see in its video: code, output, images, and comments.

You can watch the entire series on YouTube, and view all of the notebooks using nbviewer.

Watch the first tutorial video

Once you complete this video series, I recommend enrolling in my online course, Machine Learning with Text in Python, to gain a deeper understanding of scikit-learn and Natural Language Processing.

Table of Contents

  1. What is machine learning, and how does it work? (video, notebook, blog post)
    • What is machine learning?
    • What are the two main categories of machine learning?
    • What are some examples of machine learning?
    • How does machine learning “work”?
  2. Setting up Python for machine learning: scikit-learn and IPython Notebook (video, notebook, blog post)
    • What are the benefits and drawbacks of scikit-learn?
    • How do I install scikit-learn?
    • How do I use the IPython Notebook?
    • What are some good resources for learning Python?
  3. Getting started in scikit-learn with the famous iris dataset (video, notebook, blog post)
    • What is the famous iris dataset, and how does it relate to machine learning?
    • How do we load the iris dataset into scikit-learn?
    • How do we describe a dataset using machine learning terminology?
    • What are scikit-learn's four key requirements for working with data?
  4. Training a machine learning model with scikit-learn (video, notebook, blog post)
    • What is the K-nearest neighbors classification model?
    • What are the four steps for model training and prediction in scikit-learn?
    • How can I apply this pattern to other machine learning models?
  5. Comparing machine learning models in scikit-learn (video, notebook, blog post)
    • How do I choose which model to use for my supervised learning task?
    • How do I choose the best tuning parameters for that model?
    • How do I estimate the likely performance of my model on out-of-sample data?
  6. Data science pipeline: pandas, seaborn, scikit-learn (video, notebook, blog post)
    • How do I use the pandas library to read data into Python?
    • How do I use the seaborn library to visualize data?
    • What is linear regression, and how does it work?
    • How do I train and interpret a linear regression model in scikit-learn?
    • What are some evaluation metrics for regression problems?
    • How do I choose which features to include in my model?
  7. Cross-validation for parameter tuning, model selection, and feature selection (video, notebook, blog post)
    • What is the drawback of using the train/test split procedure for model evaluation?
    • How does K-fold cross-validation overcome this limitation?
    • How can cross-validation be used for selecting tuning parameters, choosing between models, and selecting features?
    • What are some possible improvements to cross-validation?
  8. Efficiently searching for optimal tuning parameters (video, notebook, blog post)
    • How can K-fold cross-validation be used to search for an optimal tuning parameter?
    • How can this process be made more efficient?
    • How do you search for multiple tuning parameters at once?
    • What do you do with those tuning parameters before making real predictions?
    • How can the computational expense of this process be reduced?
  9. Evaluating a classification model (video, notebook, blog post)
    • What is the purpose of model evaluation, and what are some common evaluation procedures?
    • How is classification accuracy used, and what are its limitations?
    • How does a confusion matrix describe the performance of a classifier?
    • What metrics can be computed from a confusion matrix?
    • How can you adjust classifier performance by changing the classification threshold?
    • What is the purpose of an ROC curve?
    • How does Area Under the Curve (AUC) differ from classification accuracy?
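As a rough illustration of the workflow behind tutorials 3, 4, 5, 7, and 8, here is a minimal sketch. It is not taken from the notebooks: it assumes a recent scikit-learn release in which `train_test_split`, `cross_val_score`, and `GridSearchCV` live in `sklearn.model_selection` (older releases such as 0.17 kept them in other modules), and the values `n_neighbors=5`, `test_size=0.4`, `random_state=4`, and the 1 to 30 search range are illustrative choices only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the iris dataset: X is the feature matrix, y is the response vector
iris = load_iris()
X, y = iris.data, iris.target

# Hold out part of the data to estimate out-of-sample performance
# (test_size and random_state are illustrative assumptions)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=4
)

# Model training and prediction pattern:
# import the class (done above), instantiate, fit, predict
knn = KNeighborsClassifier(n_neighbors=5)  # n_neighbors=5 is an illustrative choice
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Classification accuracy on the held-out data
test_accuracy = knn.score(X_test, y_test)
print("Train/test split accuracy:", test_accuracy)

# 10-fold cross-validation averages accuracy over ten different splits
cv_scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=5), X, y, cv=10, scoring="accuracy"
)
print("Mean cross-validated accuracy:", cv_scores.mean())

# Grid search repeats cross-validation for each candidate tuning parameter
param_grid = {"n_neighbors": list(range(1, 31))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring="accuracy")
grid.fit(X, y)
print("Best parameters found:", grid.best_params_)
```

Because every scikit-learn estimator exposes the same `fit` and `predict` interface, swapping `KNeighborsClassifier` for another classifier should leave the rest of this sketch unchanged.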

Bonus Video

At the PyCon 2016 conference, I taught a 3-hour tutorial that builds upon this video series and focuses on text-based data. You can watch the tutorial video on YouTube.

Here are the topics I covered:

  1. Model building in scikit-learn (refresher)
  2. Representing text as numerical data
  3. Reading a text-based dataset into pandas
  4. Vectorizing our dataset
  5. Building and evaluating a model
  6. Comparing models
  7. Examining a model for further insight
  8. Practicing this workflow on another dataset
  9. Tuning the vectorizer (discussion)
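To give a feel for topics 2 through 5, the sketch below shows one common way to turn text into numerical features and fit a classifier on them. It is an assumption-laden illustration rather than the tutorial's own code: the tiny corpus and its labels are made up, and `CountVectorizer` with `MultinomialNB` is simply one reasonable pairing from scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training corpus and labels, purely for illustration (0 = ham, 1 = spam)
train_texts = [
    "call you tonight",
    "call me a cab",
    "please call me please",
    "win a free prize now",
    "free prize call now",
]
train_labels = [0, 0, 0, 1, 1]

# Learn the vocabulary and build a document-term matrix of token counts
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(train_texts)

# Fit a Naive Bayes classifier on the token counts
nb = MultinomialNB()
nb.fit(X_train_dtm, train_labels)

# New text must be transformed with the SAME fitted vectorizer,
# so that its columns line up with the training vocabulary
new_texts = ["win a free cab"]
X_new_dtm = vect.transform(new_texts)
predicted = nb.predict(X_new_dtm)

print("Vocabulary size:", len(vect.vocabulary_))
print("Predicted label:", predicted[0])
```

Note that `transform` (not `fit_transform`) is used on the new text: tokens the vectorizer never saw during fitting are ignored instead of adding new columns.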

Visit this GitHub repository to access the tutorial notebooks and many other recommended resources.