Initial commit to fork

master
Taylor Smith 2018-05-31 15:45:00 -05:00
parent 95b7eb85a6
commit 13781004a7
84 changed files with 9789 additions and 0 deletions

7
.coveragerc 100644
@@ -0,0 +1,7 @@
[run]
source = packtml
include = */packtml/*
omit =
    */packtml/setup.py
    */packtml/utils/plotting.py
    */setup.py

119
.gitignore vendored 100644
@@ -0,0 +1,119 @@
# scratch code
scratch/
# Any data unpackaged by tensorflow
MNIST_data/
# In-progress word docs
~$*.doc*
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# Ignore PyCharm stuff...
.idea/
# Mac stuff
.DS_Store
# C extensions
*.so
# Testing
.pytest_cache/
# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# pyenv
.python-version
# celery beat schedule file
celerybeat-schedule
# SageMath parsed files
*.sage.py
# dotenv
.env
# virtualenv
.venv
venv/
ENV/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/

28
.travis.yml 100644
@@ -0,0 +1,28 @@
language: python
sudo: required

cache:
  apt: true
  directories:
    - $HOME/.cache/pip
    - $HOME/.ccache

before_install:
  - source build_tools/travis/before_install.sh

env:
  global:
    - TEST_DIR=/tmp/packtml

matrix:
  include:
    - os: linux
      dist: trusty
      env: PYTHON_VERSION="3.6"
    - os: linux
      dist: trusty
      env: PYTHON_VERSION="2.7"

install: source build_tools/travis/install.sh
before_script: bash build_tools/travis/before_script.sh
script: bash build_tools/travis/test_script.sh

1
MANIFEST.in 100644
@@ -0,0 +1 @@
recursive-include packtml *

140
README.md
@@ -1,2 +1,142 @@
# Hands-on-Supervised-Machine-Learning-with-Python
The code repository for *Hands-on Supervised Machine Learning with Python*, published by Packt.
### Learn the underpinnings of many supervised learning algorithms, and develop rich Python coding practices in the process.
*Supervised learning—help teach a machine to think for itself!*
## Overview
These days machine learning is everywhere, and it's here to stay. Understanding the core principles that drive how a machine “learns” is a critical skill for any would-be practitioner or consumer alike. This course will introduce you to supervised machine learning, guiding you through the implementation and nuances of many popular machine learning algorithms while facilitating a deep understanding along the way.
In this course, we'll cover parametric models such as linear and logistic regression, non-parametric methods such as decision trees and boosting, various clustering techniques, and we'll wrap up with a brief foray into neural networks.
This video course highlights clean coding techniques, object-oriented class design, and general best practices in machine learning.
## Target audience
This course is designed for those who would like to understand supervised machine learning algorithms at a deeper level. If you're interested in understanding how and why an algorithm works rather than simply how to call its API, this course might be for you. Intermediate Python knowledge and at least an intermediate understanding of mathematical concepts are assumed. While notions in this course will be broken down into bits as granular as absolutely possible, terms and ideas such as “matrix transposition,” “gradient,” “dot product,” and “time complexity” are assumed to be understood without further explanation.
## What you will learn
* Understand the fundamental and theoretical differences between parametric and non-parametric models, and why you might opt for one over the other.
* Discover how a machine can learn a concept and generalize its understanding to new data.
* Implement and grok several well-known supervised learning algorithms from scratch; build out your GitHub portfolio and show off what you're capable of!
* Learn about model families like recommender systems, which are immediately applicable in domains such as e-commerce and marketing.
* Become a much stronger Python developer.
### Project layout
All **[source code](packtml/)** is within the `packtml` folder, which serves as the Python
package for this course. Within the [examples](examples/) directory, you'll find a
number of short Python scripts that serve to demonstrate how various classes in the `packtml`
submodules work. Each respective folder inside the `examples/` directory corresponds to a
submodule inside the `packtml` Python package.
### Getting started
To get your environment set up, make sure you have Anaconda installed and on your path.
Then simply run the following:
```bash
$ conda env create -f environment.yml
```
To activate your environment on Unix platforms:
```bash
$ source activate packt-sml
```
In a Windows environment:
```
activate packt-sml
```
### Set up the Python package (in your activated environment):
```bash
(packt-sml) $ python setup.py install
```
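
Once that completes, you can sanity-check the install from a Python shell; the version string is read from [packtml/VERSION](packtml/VERSION):

```python
import packtml
print(packtml.__version__)  # e.g. "1.0.3"
```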
## What you'll learn
In this course and within this package, you'll learn to implement a number of
commonly-used supervised learning algorithms, and when best to use one type of
model over another. Below you'll find in-action examples of the various algorithms
we implement within this package.
### Regression
The classic introduction to machine learning: not only will we learn about linear regression,
we'll also code one from scratch so you really understand what's happening
[under the hood](packtml/regression/simple_regression.py). Then we'll
[apply one in practice](examples/regression/example_linear_regression.py) so you can see
how you might use it.
<img src="img/regression/example_linear_regression.png" alt="KNN" width="50%"/>
Next, we'll dive into logistic regression, which is linear regression's classification cousin. See
the full logistic regression example [here](examples/regression/example_logistic_regression.py)
or the algorithm's [source code](packtml/regression/simple_logistic.py) if you're interested.
<img src="img/regression/example_logistic_regression.png" alt="KNN" width="50%"/>
### KNN clustering
During our exploration of non-parametric models, we'll also touch on clustering.
The `packtml` package implements a simple but effective k-Nearest Neighbor classifier.
Here is its output on the iris dataset. For the full code example, head to the
[examples directory](examples/clustering/example_knn_classifier.py) and then to the
[source code](packtml/clustering/knn.py) to see how it's implemented.
<img src="img/clustering/example_knn_classifier.png" alt="KNN" width="50%"/>
### Decision trees
In this course, we'll also implement a CART decision tree from scratch (for both
regression and classification). Our classification tree's performance and potential
are shown at varying tree depths in the images below. The classification tree example
is located [here](examples/decision_tree/example_classification_decision_tree.py), and
the source code can be found [here](packtml/decision_tree/cart.py).
<img src="img/decision_tree/example_classification_decision_tree.png" alt="CART clf" width="75%"/>
In addition to classification, we can build a tree as a non-linear regression
model, as shown below. The regression tree example is located
[here](examples/decision_tree/example_regression_decision_tree.py). Check out the
[source code](packtml/decision_tree/cart.py) to understand how it works.
<img src="img/decision_tree/example_regression_decision_tree.png" alt="CART reg" width="75%"/>
### Deep learning
Among the hottest topics in machine learning right now are deep learning and neural
networks. In this course, we'll learn how to code a multi-layer perceptron classifier
from scratch. The full example code is located [here](examples/neural_net/example_mlp_classifier.py)
and this is the [source code](packtml/neural_net/mlp.py).
<img src="img/neural_net/example_mlp_classifier.png" alt="MLP" width="75%"/>
Next, we'll show how we can use the weights the MLP has learned on previous data to
learn new classification labels via transfer learning. For further implementation
details, check out the [example code](examples/neural_net/example_transfer_learning.py)
or the [source code](packtml/neural_net/transfer.py).
<img src="img/neural_net/example_transfer_learning.png" alt="MLP transfer" width="75%"/>
### Recommendation algorithms
These days, everything is available for purchase online. E-commerce sites have devoted
lots of research to algorithms that can learn your preferences. In this course, we'll
learn two such algorithms:
* [Item-to-item](packtml/recommendation/itemitem.py) collaborative filtering
* [Alternating least squares](packtml/recommendation/als.py) (matrix factorization)
The [example ALS code](examples/recommendation/example_als_recommender.py) shows how
the training error decreases with each iteration:
<img src="img/recommendation/example_als_recommender.png" alt="ALS" width="50%"/>

@@ -0,0 +1,3 @@
# CI/CD build tools
The scripts contained here are simply used for building the CI/CD pipeline.

@@ -0,0 +1,5 @@
#!/bin/bash
# only build on linux for travis, so this will work
set -e
sudo apt-get -qq update

@@ -0,0 +1,7 @@
#!/bin/bash
set -e
export DISPLAY=:99.0
sh -e /etc/init.d/xvfb start
sleep 5 # give xvfb some time to start by sleeping for 5 seconds

@@ -0,0 +1,41 @@
#!/bin/bash
# This script is meant to be called by the "install" step defined in
# .travis.yml. See http://docs.travis-ci.com/ for more details.
# The behavior of the script is controlled by environment variables defined
# in the .travis.yml in the top level folder of the project.
set -e
echo 'List files from cached directories'
echo 'pip:'
ls $HOME/.cache/pip
# for caching
export CC=/usr/lib/ccache/gcc
export CXX=/usr/lib/ccache/g++
# Useful for debugging how ccache is used
# export CCACHE_LOGFILE=/tmp/ccache.log
# ~60M is used by .ccache when compiling from scratch at the time of writing
ccache --max-size 100M --show-stats
# Deactivate the travis-provided virtual environment and setup a
# conda-based environment instead.
deactivate || echo "No virtualenv or condaenv to deactivate"
# install conda
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
MINICONDA_PATH=/home/travis/miniconda
# append the path, update conda
chmod +x miniconda.sh && ./miniconda.sh -b -p $MINICONDA_PATH
export PATH=$MINICONDA_PATH/bin:$PATH
conda update --yes conda
# Create the conda env and install the requirements
conda create -n testenv --yes python=${PYTHON_VERSION}
source activate testenv
pip install -r requirements.txt
pip install pytest pytest-cov
# set up the package
python setup.py install

@@ -0,0 +1,17 @@
#!/bin/bash
set -e
run_tests() {
    oldpwd=`pwd`

    # Move to another directory to test
    cd ..
    mkdir -p ${TEST_DIR} && cd ${TEST_DIR}

    pytest --cov packtml

    # move back to original dir
    cd ${oldpwd}
}
run_tests

BIN
curriculum.docx 100644

Binary file not shown.

9
environment.yml 100644
@@ -0,0 +1,9 @@
name: packt-sml
dependencies:
  - python=3.6
  - numpy
  - scipy
  - scikit-learn
  - pandas
  - matplotlib

File diff suppressed because one or more lines are too long

@@ -0,0 +1,53 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from packtml.clustering import KNNClassifier
from packtml.utils.plotting import add_decision_boundary_to_axis
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
from matplotlib import pyplot as plt
from matplotlib.colors import ListedColormap
import numpy as np
import sys
# #############################################################################
# Create a classification sub-dataset using iris
iris = load_iris()
X = iris.data[:, :2]
y = iris.target
# split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# #############################################################################
# Fit a k-nearest neighbor model and get predictions
k=10
clf = KNNClassifier(X_train, y_train, k=k)
pred = clf.predict(X_test)
clf_accuracy = accuracy_score(y_test, pred)
print("Test accuracy: %.3f" % clf_accuracy)
# #############################################################################
# Visualize difference in classes (this is from the scikit-learn KNN
# plotting example:
# http://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html#sphx-glr-auto-examples-neighbors-plot-classification-py)
xx, yy, _ = add_decision_boundary_to_axis(estimator=clf, axis=plt,
nclasses=3, X_data=X_test)
# Plot also the training points
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test,
cmap=ListedColormap(['#FF0000', '#00FF00', '#0000FF']),
edgecolor='k', s=20)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("3-Class classification (k=%i)" % k)
# if we're supposed to save it, do so INSTEAD OF showing it
if len(sys.argv) > 1:
    plt.savefig(sys.argv[1])
else:
    plt.show()

@@ -0,0 +1,3 @@
# Demo data
Cached data for the ML demo goes here.

File diff suppressed because it is too large

@@ -0,0 +1,63 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from packtml.decision_tree import CARTClassifier
from packtml.utils.plotting import add_decision_boundary_to_axis
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
import sys
# #############################################################################
# Create a classification dataset
rs = np.random.RandomState(42)
covariance = [[1, .75], [.75, 1]]
n_obs = 500
x1 = rs.multivariate_normal(mean=[0, 0], cov=covariance, size=n_obs)
x2 = rs.multivariate_normal(mean=[1, 3], cov=covariance, size=n_obs)
X = np.vstack((x1, x2)).astype(np.float32)
y = np.hstack((np.zeros(n_obs), np.ones(n_obs)))
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# #############################################################################
# Fit a simple decision tree classifier and get predictions
shallow_depth = 2
clf = CARTClassifier(X_train, y_train, max_depth=shallow_depth, criterion='gini',
random_state=42)
pred = clf.predict(X_test)
clf_accuracy = accuracy_score(y_test, pred)
print("Test accuracy (depth=%i): %.3f" % (shallow_depth, clf_accuracy))
# Fit a deeper tree and show accuracy increases
clf2 = CARTClassifier(X_train, y_train, max_depth=25, criterion='gini',
random_state=42)
pred2 = clf2.predict(X_test)
clf2_accuracy = accuracy_score(y_test, pred2)
print("Test accuracy (depth=25): %.3f" % clf2_accuracy)
# #############################################################################
# Visualize difference in classification ability
fig, axes = plt.subplots(1, 2, figsize=(12, 8))
add_decision_boundary_to_axis(estimator=clf, axis=axes[0],
nclasses=2, X_data=X_test)
axes[0].scatter(X_test[:, 0], X_test[:, 1], c=pred, alpha=0.4)
axes[0].set_title("Shallow tree (depth=%i) performance: %.3f"
% (shallow_depth, clf_accuracy))
add_decision_boundary_to_axis(estimator=clf2, axis=axes[1],
nclasses=2, X_data=X_test)
axes[1].scatter(X_test[:, 0], X_test[:, 1], c=pred2, alpha=0.4)
axes[1].set_title("Deep tree (depth=25) performance: %.3f" % clf2_accuracy)
# if we're supposed to save it, do so INSTEAD OF showing it
if len(sys.argv) > 1:
    plt.savefig(sys.argv[1])
else:
    plt.show()

@@ -0,0 +1,23 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from packtml.decision_tree.cart import RandomSplitter
from packtml.decision_tree.metrics import InformationGain
import numpy as np
# #############################################################################
# Build the example from the slides (3.3)
X = np.array([[21, 3], [ 4, 2], [37, 2]])
y = np.array([1, 0, 1])
# this is the splitting class; we'll use gini as the criterion
random_state = np.random.RandomState(42)
splitter = RandomSplitter(random_state=random_state,
criterion=InformationGain('gini'),
n_val_sample=3)
# find the best:
best_feature, best_value, best_gain = splitter.find_best(X, y)
print("Best feature=%i, best value=%r, information gain: %.3f"
% (best_feature, best_value, best_gain))

@@ -0,0 +1,19 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from packtml.decision_tree.metrics import gini_impurity, InformationGain
import numpy as np
# #############################################################################
# Build the example from the slides
y = np.array([0, 0, 0, 1, 1, 1, 1])
uncertainty = gini_impurity(y)
print("Initial gini impurity: %.4f" % uncertainty)
# now get the information gain of the split from the slides
directions = np.array(["right", "left", "left", "left",
"right", "right", "right"])
mask = directions == "left"
print("Information gain from the split we created: %.4f"
% InformationGain("gini")(target=y, mask=mask, uncertainty=uncertainty))

@@ -0,0 +1,53 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from packtml.decision_tree import CARTRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
import sys
# #############################################################################
# Create a regression dataset
rs = np.random.RandomState(42)
X = np.sort(5 * rs.rand(80, 1), axis=0)
y = np.sin(X).ravel()
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# #############################################################################
# Fit a simple decision tree regressor and get predictions
clf = CARTRegressor(X_train, y_train, max_depth=3, random_state=42)
pred = clf.predict(X_test)
clf_mse = mean_squared_error(y_test, pred)
print("Test MSE (depth=3): %.3f" % clf_mse)
# Fit a deeper tree and show the test MSE decreases
clf2 = CARTRegressor(X_train, y_train, max_depth=10, random_state=42)
pred2 = clf2.predict(X_test)
clf2_mse = mean_squared_error(y_test, pred2)
print("Test MSE (depth=10): %.3f" % clf2_mse)
# #############################################################################
# Visualize difference in learning ability
x = X_train.ravel()
xte = X_test.ravel()
fig, axes = plt.subplots(1, 2, figsize=(12, 8))
axes[0].scatter(x, y_train, alpha=0.25, c='r')
axes[0].scatter(xte, pred, alpha=1.)
axes[0].set_title("Shallow tree (depth=3) test MSE: %.3f" % clf_mse)
axes[1].scatter(x, y_train, alpha=0.4, c='r')
axes[1].scatter(xte, pred2, alpha=1.)
axes[1].set_title("Deeper tree (depth=10) test MSE: %.3f" % clf2_mse)
# if we're supposed to save it, do so INSTEAD OF showing it
if len(sys.argv) > 1:
    plt.savefig(sys.argv[1])
else:
    plt.show()

@@ -0,0 +1,78 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from packtml.neural_net import NeuralNetClassifier
from packtml.utils.plotting import add_decision_boundary_to_axis
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
import sys
# #############################################################################
# Create a classification dataset
rs = np.random.RandomState(42)
covariance = [[1, .75], [.75, 1]]
n_obs = 1000
x1 = rs.multivariate_normal(mean=[0, 0], cov=covariance, size=n_obs)
x2 = rs.multivariate_normal(mean=[1, 5], cov=covariance, size=n_obs)
X = np.vstack((x1, x2)).astype(np.float32)
y = np.hstack((np.zeros(n_obs), np.ones(n_obs))).astype(int)
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rs)
# #############################################################################
# Fit a simple neural network
n_iter = 4
hidden = (10,)
clf = NeuralNetClassifier(X_train, y_train, hidden=hidden, n_iter=n_iter,
learning_rate=0.001, random_state=42)
print("Loss per training iteration: %r" % clf.train_loss)
pred = clf.predict(X_test)
clf_accuracy = accuracy_score(y_test, pred)
print("Test accuracy (hidden=%s): %.3f" % (str(hidden), clf_accuracy))
# #############################################################################
# Fit a more complex neural network
n_iter2 = 150
hidden2 = (25, 25)
clf2 = NeuralNetClassifier(X_train, y_train, hidden=hidden2, n_iter=n_iter2,
learning_rate=0.001, random_state=42)
pred2 = clf2.predict(X_test)
clf_accuracy2 = accuracy_score(y_test, pred2)
print("Test accuracy (hidden=%s): %.3f" % (str(hidden2), clf_accuracy2))
# #############################################################################
# Visualize difference in classification ability
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
add_decision_boundary_to_axis(estimator=clf, axis=axes[0, 0],
nclasses=2, X_data=X_test)
axes[0, 0].scatter(X_test[:, 0], X_test[:, 1], c=pred, alpha=0.4)
axes[0, 0].set_title("Shallow (hidden=%s @ %i iter) test accuracy: %.3f"
% (str(hidden), n_iter, clf_accuracy))
add_decision_boundary_to_axis(estimator=clf2, axis=axes[0, 1],
nclasses=2, X_data=X_test)
axes[0, 1].scatter(X_test[:, 0], X_test[:, 1], c=pred2, alpha=0.4)
axes[0, 1].set_title("Deeper (hidden=%s @ %i iter): test accuracy: %.3f"
% (str(hidden2), n_iter2, clf_accuracy2))
# show the training loss for each
axes[1, 0].plot(np.arange(len(clf.train_loss)), clf.train_loss)
axes[1, 0].set_title("Training loss by iteration")
axes[1, 1].plot(np.arange(len(clf2.train_loss)), clf2.train_loss)
axes[1, 1].set_title("Training loss by iteration")
# if we're supposed to save it, do so INSTEAD OF showing it
if len(sys.argv) > 1:
    plt.savefig(sys.argv[1])
else:
    plt.show()

@@ -0,0 +1,104 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from packtml.neural_net import NeuralNetClassifier, TransferLearningClassifier
from packtml.utils.plotting import add_decision_boundary_to_axis
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import numpy as np
import sys
# #############################################################################
# Create a classification dataset. This dataset differs from other datasets
# we've created in that there are two majority classes, and a third (tiny)
# class that we'll train the transfer learner over
rs = np.random.RandomState(42)
covariance = [[1, .75], [.75, 1]]
# these are the majority classes
n_obs = 1250
x1 = rs.multivariate_normal(mean=[0, 0], cov=covariance, size=n_obs)
x2 = rs.multivariate_normal(mean=[1, 5], cov=covariance, size=n_obs)
# this is the minority class
x3 = rs.multivariate_normal(mean=[0.85, 3.25], cov=[[1., .5], [1.25, 0.85]],
size=n_obs // 3)
# this is what the FIRST network will be trained on
n_first = int(0.8 * n_obs)
X = np.vstack((x1[:n_first], x2[:n_first])).astype(np.float32)
y = np.hstack((np.zeros(n_first), np.ones(n_first))).astype(int)
# this is what the SECOND network will be trained on
X2 = np.vstack((x1[n_first:], x2[n_first:], x3)).astype(np.float32)
y2 = np.hstack((np.zeros(n_obs - n_first),
np.ones(n_obs - n_first),
np.ones(x3.shape[0]) * 2)).astype(int)
# split the data up
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rs)
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2,
random_state=rs)
# #############################################################################
# Fit the first neural network
hidden = (25, 25)
n_iter = 75
clf = NeuralNetClassifier(X_train, y_train, hidden=hidden, n_iter=n_iter,
learning_rate=0.001, random_state=42)
pred = clf.predict(X_test)
clf_accuracy = accuracy_score(y_test, pred)
print("Test accuracy (hidden=%s): %.3f" % (str(hidden), clf_accuracy))
# #############################################################################
# Fit the transfer network - train one more layer with a new class
t_hidden = (15,)
t_iter = 25
transfer = TransferLearningClassifier(X2_train, y2_train, pretrained=clf,
hidden=t_hidden, n_iter=t_iter,
random_state=42)
t_pred = transfer.predict(X2_test)
trans_accuracy = accuracy_score(y2_test, t_pred)
print("Test accuracy (hidden=%s): %.3f" % (str(hidden + t_hidden),
trans_accuracy))
# #############################################################################
# Visualize how the models learned the classes
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
add_decision_boundary_to_axis(estimator=clf, axis=axes[0, 0],
nclasses=2, X_data=X_test)
axes[0, 0].scatter(X_test[:, 0], X_test[:, 1], c=pred, alpha=0.4)
axes[0, 0].set_title("MLP network (hidden=%s @ %i iter): %.3f"
% (str(hidden), n_iter, clf_accuracy))
add_decision_boundary_to_axis(estimator=transfer, axis=axes[0, 1],
nclasses=3, X_data=X2_test)
axes[0, 1].scatter(X2_test[:, 0], X2_test[:, 1], c=t_pred, alpha=0.4)
axes[0, 1].set_title("Transfer network (hidden=%s @ %i iter): "
"%.3f" % (str(hidden + t_hidden), t_iter,
trans_accuracy))
# show the training loss for each
axes[1, 0].plot(np.arange(len(clf.train_loss)), clf.train_loss)
axes[1, 0].set_title("Training loss by iteration")
# concat the two training losses together for this plot
trans_train_loss = clf.train_loss + transfer.train_loss
axes[1, 1].plot(np.arange(len(trans_train_loss)), trans_train_loss)
axes[1, 1].set_title("Training loss by iteration")
# Add a vertical line for where the transfer learning begins
axes[1, 1].axvline(x=n_iter, ls="--")
# if we're supposed to save it, do so INSTEAD OF showing it
if len(sys.argv) > 1:
    plt.savefig(sys.argv[1])
else:
    plt.show()

@@ -0,0 +1,54 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from packtml.recommendation import ALS
from packtml.recommendation.data import get_completely_fabricated_ratings_data
from packtml.metrics.ranking import mean_average_precision
from matplotlib import pyplot as plt
import numpy as np
import sys
# #############################################################################
# Use our fabricated data set
R, titles = get_completely_fabricated_ratings_data()
# #############################################################################
# Fit an ALS recommender, predict for user 0
n_iter = 25
rec = ALS(R, factors=5, n_iter=n_iter, random_state=42, lam=0.01)
user0_rec, user_0_preds = rec.recommend_for_user(
R, user=0, filter_previously_seen=True,
return_scores=True)
# print some info about user 0
top_rated = np.argsort(-R[0, :])[:3]
print("User 0's top 3 rated movies are: %r" % titles[top_rated].tolist())
print("User 0's top 3 recommended movies are: %r"
% titles[user0_rec[:3]].tolist())
# #############################################################################
# We can score our recommender as well, to determine how well it actually did
# first, get all user recommendations (top 10, not filtered)
recommendations = list(rec.recommend_for_all_users(
R, n=10, filter_previously_seen=False,
return_scores=False))
# get the TRUE items they've rated (in order)
ground_truth = np.argsort(-R, axis=1)
mean_avg_prec = mean_average_precision(
predictions=recommendations, labels=ground_truth)
print("Mean average precision: %.3f" % mean_avg_prec)
# plot the error
plt.plot(np.arange(n_iter), rec.train_err)
plt.xlabel("Iteration")
plt.ylabel("MSE")
plt.title("Train error by iteration")
# if we're supposed to save it, do so INSTEAD OF showing it
if len(sys.argv) > 1:
    plt.savefig(sys.argv[1])
else:
    plt.show()

@@ -0,0 +1,39 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from packtml.recommendation import ItemItemRecommender
from packtml.recommendation.data import get_completely_fabricated_ratings_data
from packtml.metrics.ranking import mean_average_precision
import numpy as np
# #############################################################################
# Use our fabricated data set
R, titles = get_completely_fabricated_ratings_data()
# #############################################################################
# Fit an item-item recommender, predict for user 0
rec = ItemItemRecommender(R, k=3)
user0_rec, user_0_preds = rec.recommend_for_user(
R, user=0, filter_previously_seen=True,
return_scores=True)
# print some info about user 0
top_rated = np.argsort(-R[0, :])[:3]
print("User 0's top 3 rated movies are: %r" % titles[top_rated].tolist())
print("User 0's top 3 recommended movies are: %r"
% titles[user0_rec[:3]].tolist())
# #############################################################################
# We can score our recommender as well, to determine how well it actually did
# first, get all user recommendations (top 10, not filtered)
recommendations = list(rec.recommend_for_all_users(
R, n=10, filter_previously_seen=False,
return_scores=False))
# get the TRUE items they've rated (in order)
ground_truth = np.argsort(-R, axis=1)
mean_avg_prec = mean_average_precision(
predictions=recommendations, labels=ground_truth)
print("Mean average precision: %.3f" % mean_avg_prec)

@@ -0,0 +1,53 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from packtml.regression import SimpleLinearRegression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
import numpy as np
import sys
# #############################################################################
# Create a data-set that perfectly models the linear relationship:
# y = 2a + 1.5b + 0
random_state = np.random.RandomState(42)
X = random_state.rand(500, 2)
y = 2. * X[:, 0] + 1.5 * X[:, 1]
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y,
random_state=random_state)
# #############################################################################
# Fit a simple linear regression, produce predictions
lm = SimpleLinearRegression(X_train, y_train)
predictions = lm.predict(X_test)
print("Test sum of residuals: %.3f" % (y_test - predictions).sum())
assert np.allclose(lm.theta, [2., 1.5])
# #############################################################################
# Show that our solution is similar to scikit-learn's
lr = LinearRegression(fit_intercept=True)
lr.fit(X_train, y_train)
assert np.allclose(lm.theta, lr.coef_)
assert np.allclose(predictions, lr.predict(X_test))
# #############################################################################
# Fit another on ONE feature so we can show the plot
X_train = X_train[:, np.newaxis, 0]
X_test = X_test[:, np.newaxis, 0]
lm = SimpleLinearRegression(X_train, y_train)
# create the predictions & plot them as the line
preds = lm.predict(X_test)
plt.scatter(X_test[:, 0], y_test, color='black')
plt.plot(X_test[:, 0], preds, linewidth=3)
# if we're supposed to save it, do so INSTEAD OF showing it
if len(sys.argv) > 1:
    plt.savefig(sys.argv[1])
else:
    plt.show()

@@ -0,0 +1,57 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from packtml.regression import SimpleLogisticRegression
from packtml.utils.plotting import add_decision_boundary_to_axis
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from matplotlib import pyplot as plt
import sys
# #############################################################################
# Create an almost perfectly linearly-separable classification set
X, y = make_classification(n_samples=100, n_features=2, random_state=42,
n_redundant=0, n_repeated=0, n_classes=2,
class_sep=1.0)
# split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# #############################################################################
# Fit a simple logistic regression, produce predictions
lm = SimpleLogisticRegression(X_train, y_train, n_steps=50)
predictions = lm.predict(X_test)
acc = accuracy_score(y_test, predictions)
print("Test accuracy: %.3f" % acc)
# Show that our solution is similar to scikit-learn's
lr = LogisticRegression(fit_intercept=True, C=1e16) # almost no regularization
lr.fit(X_train, y_train)
print("Sklearn test accuracy: %.3f" % accuracy_score(y_test,
lr.predict(X_test)))
# #############################################################################
# Plot the data and the boundary we learned.
add_decision_boundary_to_axis(estimator=lm, axis=plt,
nclasses=2, X_data=X_test)
# We have to break this into two plot calls, one for each class to
# have different markers...
c0_mask = y_test == 0
plt.scatter(X_test[c0_mask, 0], X_test[c0_mask, 1],
c=~predictions[c0_mask], marker='o')
plt.scatter(X_test[~c0_mask, 0], X_test[~c0_mask, 1],
c=~predictions[~c0_mask], marker='x')
plt.title("Logistic test performance: %.4f (o=true 0, x=true 1)" % acc)
# if we're supposed to save it, do so INSTEAD OF showing it
if len(sys.argv) > 1:
    plt.savefig(sys.argv[1])
else:
    plt.show()

@@ -0,0 +1,56 @@
# -*- coding: utf-8 -*-
#
# This script is not intended to be run by students (or anyone, for that
# matter). It is intended to be run by me (Taylor) just to automate the
# population of the img/ directory with the output of the example plots.
# Hence its poor documentation and sheer hackiness.
from __future__ import absolute_import
import os
import sys
import subprocess
# determine where the user is calling this from...
here = os.listdir(".")
if "examples" in here:
    cwd = "examples"
    img_dir = "img"
elif "clustering" in here:
    cwd = "."
    img_dir = "../img"
else:
    raise ValueError("Call this from top-level or from within "
                     "the examples dir")

# iterate all py files
for root, dirs, files in os.walk(cwd, topdown=False):
    for fil in files:
        # Only run the ones with the appropriate prefix
        if not fil.startswith("example_"):
            continue

        # Get the module root
        module_root = root.split(os.sep)[1]

        # If it's "data" we don't want that! That's where we cache the data
        # for the demo
        if module_root in ("data", ".ipynb_checkpoints"):
            print("Skipping dir: %s" % module_root)
            continue

        # Otherwise create its corresponding path in ../img
        image_root = os.path.join(img_dir, module_root)  # ../img/clustering

        # create the directory in the image dir if it's not there
        if not os.path.exists(image_root):
            os.mkdir(image_root)

        # run it
        dest = os.path.join(image_root, fil[:-3] + ".png")
        filexec = os.path.join(root, fil)
        print("Running %s" % filexec)
        subprocess.Popen([sys.executable, filexec, dest])

sys.exit(0)

5
img/README.md 100644
@@ -0,0 +1,5 @@
# img
Within this directory, you'll find the output of the various example scripts.
The rendering of these images is automated by the
[examples/run_all_examples.py](../examples/run_all_examples.py) script.

8 binary image files added under img/ (the rendered example plots; 19-175 KiB each). Previews not shown.
1
packtml/VERSION 100644
@@ -0,0 +1 @@
1.0.3

@@ -0,0 +1,32 @@
# -*- coding: utf-8 -*-
import os
# global namespace:
from packtml import clustering
from packtml import decision_tree
from packtml import metrics
from packtml import neural_net
from packtml import recommendation
from packtml import regression
from packtml import utils
# set the version
packtml_location = os.path.abspath(os.path.dirname(__file__))
with open(os.path.join(packtml_location, "VERSION")) as vsn:
    __version__ = vsn.read().strip()
# remove from global namespace
del os
del packtml_location
del vsn
__all__ = [
'clustering',
'decision_tree',
'metrics',
'neural_net',
'recommendation',
'regression',
'utils'
]

42
packtml/base.py 100644
@@ -0,0 +1,42 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from abc import ABCMeta, abstractmethod
from sklearn.externals import six
__all__ = [
'BaseSimpleEstimator'
]
class BaseSimpleEstimator(six.with_metaclass(ABCMeta)):
    """Base class for packt estimators.

    The estimators in the Packt package do not behave exactly like
    scikit-learn estimators (by design). They are made to perform the model
    fit immediately upon class instantiation. Moreover, many of the
    hyper-parameter options are limited to promote readability and avoid
    confusion.

    The constructor of every Packt estimator should resemble the
    following::

        def __init__(self, X, y, *args, **kwargs):
            ...

    where ``X`` is the training matrix, ``y`` is the training target
    variable, and ``*args`` and ``**kwargs`` are varargs that will differ
    for each estimator.
    """
    @abstractmethod
    def predict(self, X):
        """Form predictions based on new data.

        This function must be implemented by subclasses to generate
        predictions given the model fit.

        Parameters
        ----------
        X : array-like, shape=(n_samples, n_features)
            The test array. Should be only finite values.
        """

@@ -0,0 +1,5 @@
# -*- coding: utf-8 -*-
from .knn import *
__all__ = [s for s in dir() if not s.startswith("_")]

@@ -0,0 +1,99 @@
# -*- coding: utf-8 -*-
#
# Author: Taylor Smith <taylor.smith@alkaline-ml.com>
#
# An implementation of k-nearest neighbors (kNN) classification. Note that
# this was written to maximize readability. To use kNN in a true project
# setting, you may wish to use a more highly optimized library, such as
# scikit-learn.
from __future__ import absolute_import
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.utils.validation import check_X_y
from sklearn.utils.multiclass import check_classification_targets
from scipy.stats import mode
import numpy as np
from ..base import BaseSimpleEstimator
__all__ = [
'KNNClassifier'
]
class KNNClassifier(BaseSimpleEstimator):
    """Classify points using k-Nearest Neighbors.

    The kNN algorithm computes the distances between points in a matrix and
    identifies the nearest "neighboring" points to each observation. The
    idea is that neighboring points share similar attributes. Therefore, if
    a neighbor is of some class, an unknown observation may likely belong
    to the same class.

    There are several caveats to kNN:

    * We have to retain all of the training data, which is expensive.
    * Computing the pairwise distance matrix is also expensive.
    * You should make sure you've standardized your data (mean 0, stddev 1)
      prior to fitting a kNN model.

    Parameters
    ----------
    X : array-like, shape=(n_samples, n_features)
        The training array. Should be a numpy array or array-like structure
        with only finite values.

    y : array-like, shape=(n_samples,)
        The target vector.

    k : int, optional (default=10)
        The number of neighbors to identify. The higher the ``k``
        parameter, the more likely you are to *under*-fit your data. The
        lower the ``k`` parameter, the more likely you are to *over*-fit
        your model.

    Notes
    -----
    This is a very rudimentary implementation of kNN. It does not permit
    tuning of distance metrics, optimization of the search algorithm or any
    other parameters. It is written to be as simple as possible to maximize
    readability. For a more optimal solution, see
    ``sklearn.neighbors.KNeighborsClassifier``.
    """
    def __init__(self, X, y, k=10):
        # check the input array
        X, y = check_X_y(X, y, accept_sparse=False, dtype=np.float32,
                         copy=True)

        # make sure we're performing classification here
        check_classification_targets(y)

        # Save the K hyper-parameter so we can use it later
        self.k = k

        # kNN is a special case where we have to save the training data in
        # order to make predictions in the future
        self.X = X
        self.y = y

    def predict(self, X):
        # Compute the pairwise distances between each observation in
        # the dataset and the training data. This can be relatively
        # expensive for very large datasets!!
        dists = euclidean_distances(X, self.X)

        # Arg sort to find the shortest distance for each row. This sorts
        # elements in each row (independent of other rows) to determine the
        # order required to sort the rows.
        # I.e:
        #   >>> P = np.array([[4, 5, 1], [3, 1, 6]])
        #   >>> np.argsort(P, axis=1)
        #   array([[2, 0, 1],
        #          [1, 0, 2]])
        nearest = np.argsort(dists, axis=1)

        # We only care about the top K, really, so get sorted and then
        # truncate
        predicted_labels = self.y[nearest][:, :self.k]

        # We want the most common along the rows as the predictions
        return mode(predicted_labels, axis=-1)[0].ravel()

@@ -0,0 +1,3 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import

@@ -0,0 +1,33 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from packtml.clustering import KNNClassifier
from sklearn.datasets import load_iris
from numpy.testing import assert_array_equal
import numpy as np
iris = load_iris()
X = iris.data[:, :2]
y = iris.target
def test_knn():
    # show we can fit
    knn = KNNClassifier(X, y)

    # show we can predict
    knn.predict(X)


def test_knn2():
    X2 = np.array([[0., 0., 0.5],
                   [0., 0.5, 0.],
                   [0.5, 0., 0.],
                   [5., 5., 6.],
                   [6., 5., 5.]])
    y2 = [0, 0, 0, 1, 1]

    knn = KNNClassifier(X2, y2, k=3)
    preds = knn.predict(X2)
    assert_array_equal(preds, y2)

@@ -0,0 +1,6 @@
# -*- coding: utf-8 -*-
from .cart import *
from .metrics import *
__all__ = [s for s in dir() if not s.startswith("_")]

@@ -0,0 +1,493 @@
# -*- coding: utf-8 -*-
#
# Author: Taylor G Smith <taylor.smith@alkaline-ml.com>
#
# A simplified version of Classification and Regression Trees. This file
# is intended to maximize readability and understanding of how CART trees work.
# For very fast or customizable decision tree solutions, use scikit-learn.
#
# The best order in which to read & understand the contents to best
# grok the entire concept:
#
# 1. metrics.InformationGain & metrics.VarianceReduction
# 2. RandomSplitter
# 3. LeafNode
# 4. BaseCART
from __future__ import absolute_import, division
from sklearn.utils.validation import check_X_y, check_random_state, check_array
from sklearn.utils.multiclass import check_classification_targets
from sklearn.base import ClassifierMixin, RegressorMixin, is_classifier
import numpy as np
from ..base import BaseSimpleEstimator
from .metrics import InformationGain, VarianceReduction
__all__ = [
'CARTRegressor',
'CARTClassifier'
]
try:
xrange
except NameError: # py3
xrange = range
class RandomSplitter(object):
"""Evaluate a split via random values in a feature.
Every feature in the dataset needs to be evaluated in a CART tree. Since
that in itself can be expensive, the random splitter allows us to look at
only a random number of row splits per feature in order to make the best
splitting decision.
Parameters
----------
random_state : np.random.RandomState
The random state for seeding the choices
criterion : callable
The metric used for evaluating the "goodness" of a split. Either
``InformationGain`` (with entropy or Gini) for classification, or
``VarianceReduction`` for regression.
n_val_sample : int, optional (default=25)
The number of values per feature to sample as a splitting point.
"""
def __init__(self, random_state, criterion, n_val_sample=25):
self.random_state = random_state
self.criterion = criterion # BaseCriterion from metrics
self.n_val_sample = n_val_sample
def find_best(self, X, y):
criterion = self.criterion
rs = self.random_state
# keep track of the best info gain
best_gain = 0.
# keep track of best feature and best value on which to split
best_feature = None
best_value = None
# get the current state of the uncertainty (gini or entropy)
uncertainty = criterion.compute_uncertainty(y)
# iterate over each feature
for col in xrange(X.shape[1]):
feature = X[:, col]
# get all values in the feature
# values = np.unique(feature)
seen_values = set()
# the number of values to sample. Should be defined as the min
# between the prescribed n_val_sample value and the number of
# unique values in the feature.
n_vals = min(self.n_val_sample, np.unique(feature).shape[0])
# For each of n_val_sample iterations, select a random value
# from the feature and create a split. We store whether we've seen
# the value before; if we have, continue. Continue until we've seen
# n_vals unique values. This allows us to more likely select values
# that are high frequency (retains distributional data implicitly)
for v in rs.permutation(feature):
# if we've hit the limit of the number of values we wanted to
# examine, break out
if len(seen_values) == n_vals:
break
# if we've already tried this value, continue
elif v in seen_values: # O(1) lookup
continue
# otherwise, it's a new value we've never tried splitting on.
# add it to the set.
seen_values.add(v)
# create the mask (these values "go left")
mask = feature >= v # type: np.ndarray
# skip this step if this doesn't divide the dataset
if np.unique(mask).shape[0] == 1: # all True or all False
continue
# compute how good this split was
gain = criterion(y, mask, uncertainty=uncertainty)
# if the gain is better, we keep this feature & value &
# update the best gain we've seen so far
if gain > best_gain:
best_feature = col
best_value = v
best_gain = gain
# if best feature is None, it means we never found a viable split...
# this is likely because all of our labels were perfect. In this case,
# we could select any feature and the first value and define that as
# our left split and nothing will go right.
if best_feature is None:
best_feature = 0
best_value = np.squeeze(X[:, best_feature])[0]
best_gain = 0.
# we need to know the best feature, the best value, and the best gain
return best_feature, best_value, best_gain
class LeafNode(object):
"""A tree node class.
Tree node that store the column on which to split and the value above
which to go left vs. right. Additionally, it stores the target statistic
related to this node. For instance, in a classification scenario:
>>> X = np.array([[ 1, 1.5 ],
... [ 2, 0.5 ],
... [ 3, 0.75]])
>>> y = np.array([0, 1, 1])
>>> node = LeafNode(split_col=0, split_val=2, split_gain=0.,
... class_statistic=_most_common(y))
This means if ``node`` were a terminal node, it would generate predictions
of 1, since that was the most common value in the pre-split ``y``. The
class statistic will differ for splits in the tree, where the most common
value in ``y`` for records in ``X`` that go left is 1, and 0 for that which
goes to the right.
The class statistic is computed for each split as the tree recurses.
Parameters
----------
split_col : int
The column on which to split.
split_val : float or int
The value above which to go left.
split_gain : float
The gain (information gain or variance reduction) achieved by this split.
class_statistic : float or int
The summary statistic of the target at this node: the mode for
classification, or the mean for regression.
"""
def __init__(self, split_col, split_val, split_gain, class_statistic):
self.split_col = split_col
self.split_val = split_val
self.split_gain = split_gain
# the class statistic is the mode or the mean of the targets for
# this split
self.class_statistic = class_statistic
# if these remain None, it's a terminal node
self.left = None
self.right = None
def create_split(self, X, y):
"""Split the next X, y.
Returns
-------
X_left : np.ndarray, shape=(n_samples, n_features)
Rows where ``split_col >= split_val``.
X_right : np.ndarray, shape=(n_samples, n_features)
Rows where ``split_col < split_val``.
y_left : np.ndarray, shape=(n_samples,)
Target where ``split_col >= split_val``.
y_right : np.ndarray, shape=(n_samples,)
Target where ``split_col < split_val``.
"""
# If values in the split column are greater than or equal to the
# split value, we go left.
left_mask = X[:, self.split_col] >= self.split_val
# Otherwise we go to the right
right_mask = ~left_mask # type: np.ndarray
# If the left mask is all False or all True, it means we've achieved
# a perfect split.
all_left = left_mask.all()
all_right = right_mask.all()
# create the left split. If it's all right side, we'll return None
X_left = X[left_mask, :] if not all_right else None
y_left = y[left_mask] if not all_right else None
# create the right split. If it's all left side, we'll return None.
X_right = X[right_mask, :] if not all_left else None
y_right = y[right_mask] if not all_left else None
return X_left, X_right, y_left, y_right
def is_terminal(self):
"""Determine whether the node is terminal.
If there is no left node and no right node, it's a terminal node.
If either is non-None, it is a parent to something.
"""
return self.left is None and self.right is None
def __repr__(self):
"""Get the string representation of the node."""
return "Rule: Go left if x%i >= %r else go right (gain=%.3f)" \
% (self.split_col, self.split_val, self.split_gain)
def predict_record(self, record):
"""Find the terminal node in the tree and return the class statistic"""
# First base case, this is a terminal node:
has_left = self.left is not None
has_right = self.right is not None
if not has_left and not has_right:
return self.class_statistic
# Otherwise, determine whether the record goes right or left
go_left = record[self.split_col] >= self.split_val
# if we go left and there is a left node, delegate the recursion to the
# left side
if go_left and has_left:
return self.left.predict_record(record)
# if we go right, delegate to the right
if not go_left and has_right:
return self.right.predict_record(record)
# if we get here, it means one of two things:
# 1. we were supposed to go left and didn't have a left
# 2. we were supposed to go right and didn't have a right
# for both of these, we return THIS class statistic
return self.class_statistic
def _most_common(y):
# This is essentially just a "mode" function to compute the most
# common value in a vector.
cls, cts = np.unique(y, return_counts=True)
order = np.argsort(-cts)
return cls[order][0]
class _BaseCART(BaseSimpleEstimator):
def __init__(self, X, y, criterion, min_samples_split, max_depth,
n_val_sample, random_state):
# make sure max_depth > 1
if max_depth < 2:
raise ValueError("max depth must be > 1")
# check the input arrays, and if it's classification validate the
# target values in y
X, y = check_X_y(X, y, accept_sparse=False, dtype=None, copy=True)
if is_classifier(self):
check_classification_targets(y)
# hyper parameters so we can later inspect attributes of the model
self.min_samples_split = min_samples_split
self.max_depth = max_depth
self.n_val_sample = n_val_sample
self.random_state = random_state
# create the splitting class
random_state = check_random_state(random_state)
self.splitter = RandomSplitter(random_state, criterion, n_val_sample)
# grow the tree depth first
self.tree = self._find_next_split(X, y, 0)
def _target_stat(self, y):
"""Given a vector, ``y``, decide what value to return as the leaf
node statistic (mean for regression, mode for classification)
"""
def _find_next_split(self, X, y, current_depth):
# base case 1: current depth is the limit, the parent node should
# be a terminal node (child = None)
# base case 2: n samples in X <= min_samples_split
if current_depth == self.max_depth or \
X.shape[0] <= self.min_samples_split:
return None
# create the next split
split_feature, split_value, gain = \
self.splitter.find_best(X, y)
# create the next node based on the best split feature and value
# that we just found. Also compute the "target stat" (mode of y for
# classification problems or mean of y for regression problems) and
# pass that to the node in case it is the terminal node (i.e., the
# decision maker)
node = LeafNode(split_feature, split_value, gain, self._target_stat(y))
# Create the splits based on the criteria we just determined, and then
# recurse down left, right sides
X_left, X_right, y_left, y_right = node.create_split(X, y)
# if either the left or right is None, it means we've achieved a
# perfect split. It is then a terminal node and will remain None.
if X_left is not None:
node.left = self._find_next_split(X_left, y_left,
current_depth + 1)
if X_right is not None:
node.right = self._find_next_split(X_right, y_right,
current_depth + 1)
return node
def predict(self, X):
# Check the array
X = check_array(X, dtype=np.float32) # type: np.ndarray
# For each record in X, find its leaf node in the tree (O(log N))
# to get the predictions. This makes the prediction operation
# O(N log N) runtime complexity
predictions = [self.tree.predict_record(row) for row in X]
return np.asarray(predictions)
class CARTRegressor(_BaseCART, RegressorMixin):
"""Decision tree regression.
Builds a decision tree to solve a regression problem using the CART
algorithm. The estimator builds a binary tree structure, evaluating each
feature at each iteration to recursively split along the best value and
progress down the tree until each leaf node reaches parsimony.
The regression tree uses "variance reduction" to assess the "goodness"
of a split, selecting the split and feature that maximizes the value.
To make predictions, each record is evaluated at each node of the tree
until it reaches a leaf node. For regression, predictions are made by
returning the training target's mean for the leaf node.
Parameters
----------
X : array-like, shape=(n_samples, n_features)
The training array. Should be a numpy array or array-like structure
with only finite values.
y : array-like, shape=(n_samples,)
The target vector.
max_depth : int, optional (default=5)
The maximum depth to which the tree will grow. Note that the tree is
not guaranteed to reach this depth and may stop growing early if the
``min_samples_split`` terminal criterion is met first.
min_samples_split : int, optional (default=1)
A terminal criterion used to halt the growth of a tree. If a leaf
node's split contains <= ``min_samples_split``, it will not grow
any further.
n_val_sample : int, optional (default=25)
The method by which we evaluate splits differs a bit from highly
optimized libraries like scikit-learn, which may evaluate for the
globally optimal split for each feature. We use random splitting
which evaluates a number of unique values for each feature at each
split. The ``n_val_sample`` is the maximum number of values per
feature that will be evaluated as a potential splitting point at
each iteration.
random_state : int, None or RandomState, optional (default=None)
The random state used to seed the RandomSplitter.
Attributes
----------
splitter : RandomSplitter
The feature splitting class. Used for determining optimal splits at
each node.
tree : LeafNode
The actual tree. Each node contains data on the class statistic (i.e.,
mode or mean of the training target at that split), best feature and
best value.
"""
def __init__(self, X, y, max_depth=5, min_samples_split=1,
n_val_sample=25, random_state=None):
super(CARTRegressor, self).__init__(
X, y, criterion=VarianceReduction(),
min_samples_split=min_samples_split, max_depth=max_depth,
n_val_sample=n_val_sample, random_state=random_state)
def _target_stat(self, y):
"""Given a vector, ``y``, get the mean"""
return y.mean()
class CARTClassifier(_BaseCART, ClassifierMixin):
"""Decision tree classication.
Builds a decision tree to solve a classification problem using the CART
algorithm. The estimator builds a binary tree structure, evaluating each
feature at each iteration to recursively split along the best value and
progress down the tree until each leaf node reaches parsimony.
The classification tree uses "information gain" to assess the "goodness"
of a split, selecting the split and feature that maximizes the value.
To make predictions, each record is evaluated at each node of the tree
until it reaches a leaf node. For classification, predictions are made by
returning the training target's mode for the leaf node.
Parameters
----------
X : array-like, shape=(n_samples, n_features)
The training array. Should be a numpy array or array-like structure
with only finite values.
y : array-like, shape=(n_samples,)
The target vector.
criterion : str or unicode, optional (default='gini')
The splitting criterion used for classification problems. CART trees
typically use "gini" but their cousins, C4.5 trees, use "entropy". Both
metrics are extremely similar and will likely not change your tree
structure by much.
max_depth : int, optional (default=5)
The maximum depth to which the tree will grow. Note that the tree is
not guaranteed to reach this depth and may stop growing early if the
``min_samples_split`` terminal criterion is met first.
min_samples_split : int, optional (default=1)
A terminal criterion used to halt the growth of a tree. If a node
contains <= ``min_samples_split`` samples, it will not be split
any further.
n_val_sample : int, optional (default=25)
The method by which we evaluate splits differs a bit from highly
optimized libraries like scikit-learn, which may search for the
globally optimal split for each feature. We use random splitting,
which evaluates a number of unique values for each feature at each
split. ``n_val_sample`` is the maximum number of values per
feature that will be evaluated as potential splitting points at
each iteration.
random_state : int, None or RandomState, optional (default=None)
The random state used to seed the RandomSplitter.
Attributes
----------
splitter : RandomSplitter
The feature splitting class. Used for determining optimal splits at
each node.
tree : LeafNode
The actual tree. Each node contains data on the class statistic (i.e.,
mode or mean of the training target at that split), best feature and
best value.
"""
def __init__(self, X, y, criterion='gini', max_depth=5,
min_samples_split=1, n_val_sample=25, random_state=None):
super(CARTClassifier, self).__init__(
X, y, criterion=InformationGain(criterion), max_depth=max_depth,
min_samples_split=min_samples_split,
n_val_sample=n_val_sample, random_state=random_state)
def _target_stat(self, y):
"""Given a vector, ``y``, get the mode"""
return _most_common(y)
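# A minimal usage sketch, assuming the toy arrays used in the unit tests
# further down: both estimators fit at instantiation, so construction and
# training happen in a single step.
#
#   >>> import numpy as np
#   >>> X = np.array([[0., 1., 2.], [1., 2., 3.], [2., 3., 4.]])
#   >>> y = np.array([0, 1, 1])
#   >>> clf = CARTClassifier(X, y, max_depth=3, random_state=42)
#   >>> class_preds = clf.predict(X)     # array of three class labels
#   >>> reg = CARTRegressor(X, y.astype(float), random_state=42)
#   >>> value_preds = reg.predict(X)     # array of three real values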


@ -0,0 +1,145 @@
# -*- coding: utf-8 -*-
#
# Author: Taylor Smith <taylor.smith@alkaline-ml.com>
#
# Metrics used for determining how to split a feature in a decision tree.
from __future__ import absolute_import
import numpy as np
__all__ = [
'entropy',
'gini_impurity',
'InformationGain',
'VarianceReduction'
]
def _clf_metric(y, metric):
"""Internal helper. Since this is internal, so no validation performed"""
# get unique classes in y
y = np.asarray(y)
C, cts = np.unique(y, return_counts=True)
# a base case is that there is only one class label
if C.shape[0] == 1:
return 0.
pr_C = cts.astype(float) / y.shape[0] # P(Ci)
# 1 - sum(P(Ci)^2)
if metric == 'gini':
return 1. - pr_C.dot(pr_C) # np.sum(pr_C ** 2)
elif metric == 'entropy':
return np.sum(-pr_C * np.log2(pr_C))
# shouldn't ever get to this point since it is internal
else:
raise ValueError("metric should be one of ('gini', 'entropy'), "
"but encountered %s" % metric)
def entropy(y):
"""Compute the entropy of class labels.
This computes the entropy of the training samples' class labels. High
entropy indicates a relatively uniform class distribution, while low
entropy indicates a distribution concentrated on just a few classes.
References
----------
.. [1] http://www.cs.csi.cuny.edu/~imberman/ai/Entropy%20and%20Information%20Gain.htm
"""
return _clf_metric(y, 'entropy')
def gini_impurity(y):
"""Compute the Gini index on a target variable.
The Gini index gives an idea of how mixed the classes are within a leaf
node. A perfect class separation will result in a Gini impurity of 0 (i.e.,
"perfectly pure").
"""
return _clf_metric(y, 'gini')
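# A quick sanity check on both metrics, assuming the functions above (the
# values mirror the unit tests further down):
#
#   >>> import numpy as np
#   >>> gini_impurity(np.array([0, 0]))   # a pure node
#   0.0
#   >>> gini_impurity(np.array([0, 1]))   # maximal two-class mixing
#   0.5
#   >>> round(entropy(np.asarray(9 * [0] + 5 * [1])), 2)   # 9/14 vs. 5/14
#   0.94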
class BaseCriterion(object):
"""Splitting criterion.
Base class for InformationGain and VarianceReduction. WARNING - do
not invoke this class directly. Use derived classes only! This is a
loosely-defined abstract class used to prescribe a common interface
for sub-classes.
"""
def compute_uncertainty(self, y):
"""Compute the uncertainty for a vector.
A subclass should override this function to compute the uncertainty
(i.e., entropy or gini) of a vector.
"""
class VarianceReduction(BaseCriterion):
"""Compute the variance reduction after a split.
Variance reduction is a splitting criterion used by CART trees in the
context of regression. It examines the variance in a target before and
after a split to determine whether we've reduced the variability in the
target.
"""
def compute_uncertainty(self, y):
"""Compute the variance of a target."""
return np.var(y)
def __call__(self, target, mask, uncertainty):
left, right = target[mask], target[~mask]
return uncertainty - (self.compute_uncertainty(left) +
self.compute_uncertainty(right))
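# A worked example, assuming only numpy: split a target with two well-
# separated groups. The parent variance of [1, 2, 3, 10, 11, 12] is ~20.92
# and each child's variance is ~0.67, so the reduction is ~19.58. (Note
# this simplified criterion sums the child variances unweighted, whereas
# many CART references weight each child by its share of the samples.)
#
#   >>> import numpy as np
#   >>> y = np.array([1., 2., 3., 10., 11., 12.])
#   >>> mask = np.array([True, True, True, False, False, False])
#   >>> vr = VarianceReduction()
#   >>> round(vr(y, mask, vr.compute_uncertainty(y)), 4)
#   19.5833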
class InformationGain(BaseCriterion):
"""Compute the information gain after a split.
The information gain metric is used by CART trees in a classification
context. It measures the difference in the gini or entropy before and
after a split to determine whether the split "taught" us anything.
Parameters
----------
metric : str or unicode
The name of the metric to use. Either "gini" (Gini impurity)
or "entropy".
"""
def __init__(self, metric):
# let fail out with a KeyError if an improper metric
self.crit = {'gini': gini_impurity,
'entropy': entropy}[metric]
def compute_uncertainty(self, y):
"""Compute the uncertainty for a vector.
This method computes either the Gini impurity or entropy of a target
vector using the prescribed method.
"""
return self.crit(y)
def __call__(self, target, mask, uncertainty):
"""Compute the information gain of a split.
Parameters
----------
target : np.ndarray
The target feature
mask : np.ndarray
The value mask
uncertainty : float
The gini or entropy of rows pre-split
"""
left, right = target[mask], target[~mask]
p = float(left.shape[0]) / float(target.shape[0])
crit = self.crit # type: callable
return uncertainty - p * crit(left) - (1 - p) * crit(right)
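# A worked example, assuming the classes above (the numbers match the unit
# tests further down): split y = [0, 0, 1, 1, 2] on a mask that selects
# only the first sample. The parent Gini is 0.64, the left child is pure
# (Gini 0.0) and the right child has Gini 0.625, so the gain is
# 0.64 - 0.2 * 0.0 - 0.8 * 0.625 = 0.14.
#
#   >>> import numpy as np
#   >>> y = np.array([0, 0, 1, 1, 2])
#   >>> mask = np.array([True, False, False, False, False])
#   >>> ig = InformationGain('gini')
#   >>> round(ig(y, mask, gini_impurity(y)), 2)
#   0.14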


@ -0,0 +1,3 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import


@ -0,0 +1,119 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from numpy.testing import assert_array_equal, assert_almost_equal
import numpy as np
from packtml.decision_tree.metrics import InformationGain
from packtml.decision_tree.cart import (CARTClassifier, CARTRegressor,
RandomSplitter, LeafNode, _most_common)
X = np.array([[0, 1, 2],
[1, 2, 3],
[2, 3, 4]])
y = np.array([0, 1, 1])
X2 = np.array([[0, 1, 2],
[1, 2, 3],
[2, 3, 4],
[3, 4, 5],
[4, 5, 6],
[5, 6, 7]])
y2 = np.array([0, 0, 1, 1, 1, 1])
# a regression dataset
rs = np.random.RandomState(42)
Xreg = np.sort(5 * rs.rand(100, 1), axis=0)
yreg = np.sin(Xreg).ravel()
def test_most_common():
assert _most_common(y) == 1
assert _most_common([1]) == 1
def test_terminal_leaf_node():
node = LeafNode(split_col=0, split_val=1.,
class_statistic=_most_common(y),
split_gain=np.inf)
# show that there are no children
assert node.is_terminal()
# show that the splitting works as expected
X_left, X_right, y_left, y_right = node.create_split(X, y)
assert_array_equal(X_left, X[1:, :])
assert_array_equal(X_right, X[:1, :])
assert_array_equal(y_left, [1, 1])
assert_array_equal(y_right, [0])
# show that predictions work as expected
assert [node.predict_record(r) for r in X] == [1, 1, 1]
def test_complex_leaf_node():
node = LeafNode(split_col=0, split_val=3.,
class_statistic=_most_common(y2),
split_gain=np.inf)
# create the split
X_left, X_right, y_left, y_right = node.create_split(X2, y2)
# show it worked as expected
assert_array_equal(X_left, X2[3:, :])
assert_array_equal(X_right, X2[:3, :])
assert_array_equal(y_left, [1, 1, 1])
assert_array_equal(y_right, [0, 0, 1])
# show that if we CURRENTLY predicted on the basis of node being the
# terminal leaf, we'd get all 1s.
get_preds = (lambda: [node.predict_record(r) for r in X2])
assert get_preds() == [1, 1, 1, 1, 1, 1]
# add a sub node to the right side
right_node = LeafNode(split_col=0, split_val=2.,
class_statistic=_most_common(y_right),
split_gain=np.inf)
assert right_node.class_statistic == 0.
# attach to the original node and assert it's not terminal anymore
node.right = right_node
assert not node.is_terminal()
# now our predictions should differ!
assert get_preds() == [0, 0, 0, 1, 1, 1]
def test_fit_classifier():
# show we can fit a classifier
clf = CARTClassifier(X, y)
# show we can predict
clf.predict(X)
def test_fit_regressor():
# show we can fit a regressor
reg = CARTRegressor(Xreg, yreg)
# show we can predict
reg.predict(Xreg)
def test_random_splitter():
pre_X = np.array([[21, 3], [4, 2], [37, 2]])
pre_y = np.array([1, 0, 1])
# this is the splitting class; we'll use gini as the criteria
random_state = np.random.RandomState(42)
splitter = RandomSplitter(random_state=random_state,
criterion=InformationGain('gini'),
n_val_sample=3)
# find the best:
best_feature, best_value, best_gain = splitter.find_best(pre_X, pre_y)
assert best_feature == 0
assert best_value == 21
assert_almost_equal(best_gain, 0.4444444444, decimal=8)


@ -0,0 +1,52 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from packtml.decision_tree.metrics import (entropy, gini_impurity,
InformationGain)
import numpy as np
from numpy.testing import assert_almost_equal
def test_entropy():
events = np.asarray(9 * [0] + 5 * [1]) # 9/14, 5/14
ent = entropy(events)
assert round(ent, 2) == 0.94, round(ent, 2)
def test_gini_impurity():
x = np.asarray([0] * 10 + [1] * 10)
assert gini_impurity(x) == 0.5
assert gini_impurity(x[:10]) == 0.
# show that no mixing of gini yields 0.0
assert gini_impurity(np.array([0, 0])) == 0.
# with SOME mixing we get 0.5
assert gini_impurity(np.array([0, 1])) == 0.5
# with a lot of mixing we get a number close to 0.8
gi = gini_impurity([0, 1, 2, 3, 4])
assert_almost_equal(gi, 0.8)
def test_information_gain():
X = np.array([
[0, 3],
[1, 3],
[2, 1],
[2, 1],
[1, 3]
])
y = np.array([0, 0, 1, 1, 2])
uncertainty = gini_impurity(y)
assert_almost_equal(uncertainty, 0.63999999)
mask = X[:, 0] == 0
# compute the info gain for this mask
infog = InformationGain("gini")
ig = infog(y, mask, uncertainty)
assert_almost_equal(ig, 0.1399999)


@ -0,0 +1,5 @@
# -*- coding: utf-8 -*-
from .ranking import *
__all__ = [s for s in dir() if not s.startswith("_")]


@ -0,0 +1,266 @@
# -*- coding: utf-8 -*-
#
# Author: Taylor G Smith
#
# Recommender system ranking metrics derived from Spark source for use with
# Python-based recommender systems. See the full gist here:
# https://gist.github.com/tgsmith61591/d8aa96ac7c74c24b33e4b0cb967ca519
from __future__ import absolute_import, division
import numpy as np
import warnings
__all__ = [
'mean_average_precision',
'ndcg_at',
'precision_at',
]
try:
xrange
except NameError: # python 3 does not have an 'xrange'
xrange = range
def _require_positive_k(k):
"""Helper function to avoid copy/pasted code for validating K"""
if k <= 0:
raise ValueError("ranking position k should be positive")
def _mean_ranking_metric(predictions, labels, metric):
"""Helper function for precision_at_k and mean_average_precision"""
# do not zip, as this will require an extra pass of O(N). Just assert
# equal length and index (compute in ONE pass of O(N)).
# if len(predictions) != len(labels):
# raise ValueError("dim mismatch in predictions and labels!")
# return np.mean([
# metric(np.asarray(predictions[i]), np.asarray(labels[i]))
# for i in xrange(len(predictions))
# ])
# Actually probably want lazy evaluation in case preds is a
# generator, since preds can be very dense and could blow up
# memory... but how to assert lengths equal? FIXME
return np.mean([
metric(np.asarray(prd), np.asarray(labels[i]))
for i, prd in enumerate(predictions) # lazy eval if generator
])
def _warn_for_empty_labels():
"""Helper for missing ground truth sets"""
warnings.warn("Empty ground truth set! Check input data")
return 0.
def precision_at(predictions, labels, k=10, assume_unique=True):
"""Compute the precision at K.
Compute the average precision of all the queries, truncated at
ranking position k. If for a query, the ranking algorithm returns
n (n is less than k) results, the precision value will be computed
as #(relevant items retrieved) / k. This formula also applies when
the size of the ground truth set is less than k.
If a query has an empty ground truth set, zero will be used as
precision together with a warning.
Parameters
----------
predictions : array-like, shape=(n_predictions,)
The prediction array. The items that were predicted, in descending
order of relevance.
labels : array-like, shape=(n_ratings,)
The labels (positively-rated items).
k : int, optional (default=10)
The rank at which to measure the precision.
assume_unique : bool, optional (default=True)
Whether to assume the items in the labels and predictions are each
unique. That is, the same item is not predicted multiple times or
rated multiple times.
Examples
--------
>>> # predictions for 3 users
>>> preds = [[1, 6, 2, 7, 8, 3, 9, 10, 4, 5],
... [4, 1, 5, 6, 2, 7, 3, 8, 9, 10],
... [1, 2, 3, 4, 5]]
>>> # labels for the 3 users
>>> labels = [[1, 2, 3, 4, 5], [1, 2, 3], []]
>>> precision_at(preds, labels, 1)
0.33333333333333331
>>> precision_at(preds, labels, 5)
0.26666666666666666
>>> precision_at(preds, labels, 15)
0.17777777777777778
"""
# validate K
_require_positive_k(k)
def _inner_pk(pred, lab):
# need to compute the count of the number of values in the predictions
# that are present in the labels. We'll use numpy in1d for this (set
# intersection in O(1))
if lab.shape[0] > 0:
n = min(pred.shape[0], k)
cnt = np.in1d(pred[:n], lab, assume_unique=assume_unique).sum()
return float(cnt) / k
else:
return _warn_for_empty_labels()
return _mean_ranking_metric(predictions, labels, _inner_pk)
def mean_average_precision(predictions, labels, assume_unique=True):
"""Compute the mean average precision on predictions and labels.
Returns the mean average precision (MAP) of all the queries. If a query
has an empty ground truth set, the average precision will be zero and a
warning is generated.
Parameters
----------
predictions : array-like, shape=(n_predictions,)
The prediction array. The items that were predicted, in descending
order of relevance.
labels : array-like, shape=(n_ratings,)
The labels (positively-rated items).
assume_unique : bool, optional (default=True)
Whether to assume the items in the labels and predictions are each
unique. That is, the same item is not predicted multiple times or
rated multiple times.
Examples
--------
>>> # predictions for 3 users
>>> preds = [[1, 6, 2, 7, 8, 3, 9, 10, 4, 5],
... [4, 1, 5, 6, 2, 7, 3, 8, 9, 10],
... [1, 2, 3, 4, 5]]
>>> # labels for the 3 users
>>> labels = [[1, 2, 3, 4, 5], [1, 2, 3], []]
>>> mean_average_precision(preds, labels)
0.35502645502645497
"""
def _inner_map(pred, lab):
if lab.shape[0]:
# compute the number of elements within the predictions that are
# present in the actual labels, and get the cumulative sum weighted
# by the index of the ranking
n = pred.shape[0]
# Scala code from Spark source:
# var i = 0
# var cnt = 0
# var precSum = 0.0
# val n = pred.length
# while (i < n) {
# if (labSet.contains(pred(i))) {
# cnt += 1
# precSum += cnt.toDouble / (i + 1)
# }
# i += 1
# }
# precSum / labSet.size
arange = np.arange(n, dtype=np.float32) + 1. # this is the denom
present = np.in1d(pred[:n], lab, assume_unique=assume_unique)
prec_sum = np.ones(present.sum()).cumsum()
denom = arange[present]
return (prec_sum / denom).sum() / lab.shape[0]
else:
return _warn_for_empty_labels()
return _mean_ranking_metric(predictions, labels, _inner_map)
def ndcg_at(predictions, labels, k=10, assume_unique=True):
"""Compute the normalized discounted cumulative gain at K.
Compute the average NDCG value of all the queries, truncated at ranking
position k. The discounted cumulative gain at position k is computed as:
DCG@k = sum_{i=1}^{k} (2^{rel_i} - 1) / log2(i + 1)
and the NDCG is obtained by dividing the DCG value by the ideal (maximum
attainable) DCG for the ground truth set.
In the current implementation, the relevance value is binary.
If a query has an empty ground truth set, zero will be used as
NDCG together with a warning.
Parameters
----------
predictions : array-like, shape=(n_predictions,)
The prediction array. The items that were predicted, in descending
order of relevance.
labels : array-like, shape=(n_ratings,)
The labels (positively-rated items).
k : int, optional (default=10)
The rank at which to measure the NDCG.
assume_unique : bool, optional (default=True)
Whether to assume the items in the labels and predictions are each
unique. That is, the same item is not predicted multiple times or
rated multiple times.
Examples
--------
>>> # predictions for 3 users
>>> preds = [[1, 6, 2, 7, 8, 3, 9, 10, 4, 5],
... [4, 1, 5, 6, 2, 7, 3, 8, 9, 10],
... [1, 2, 3, 4, 5]]
>>> # labels for the 3 users
>>> labels = [[1, 2, 3, 4, 5], [1, 2, 3], []]
>>> ndcg_at(preds, labels, 3)
0.3333333432674408
>>> ndcg_at(preds, labels, 10)
0.48791273434956867
References
----------
.. [1] K. Jarvelin and J. Kekalainen, "IR evaluation methods for
retrieving highly relevant documents."
"""
# validate K
_require_positive_k(k)
def _inner_ndcg(pred, lab):
if lab.shape[0]:
# if we do NOT assume uniqueness, the set is a bit different here
if not assume_unique:
lab = np.unique(lab)
n_lab = lab.shape[0]
n_pred = pred.shape[0]
n = min(max(n_pred, n_lab), k) # min(min(p, l), k)?
# similar to mean_avg_prcsn, we need an arange, but this time +2
# since python is zero-indexed, and the denom typically needs +1.
# Also need the log base2...
arange = np.arange(n, dtype=np.float32) # length n
# since we are only interested in the arange up to n_pred, truncate
# if necessary
arange = arange[:n_pred]
denom = np.log2(arange + 2.) # length n
gains = 1. / denom # length n
# compute the gains where the prediction is present in the labels
dcg_mask = np.in1d(pred[:n], lab, assume_unique=assume_unique)
dcg = gains[dcg_mask].sum()
# the max DCG is sum of gains where the index < the label set size
max_dcg = gains[arange < n_lab].sum()
return dcg / max_dcg
else:
return _warn_for_empty_labels()
return _mean_ranking_metric(predictions, labels, _inner_ndcg)
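# A worked check of the k=3 doctest above, assuming binary relevance:
# user 0 hits ranks 1 and 3 of [1, 6, 2], so DCG = 1/log2(2) + 1/log2(4)
# = 1.5 against an ideal DCG of 1 + 0.6309 + 0.5 ~ 2.1309, giving
# NDCG ~ 0.7039. User 1 hits only rank 2 (NDCG ~ 0.6309 / 2.1309 ~ 0.2961),
# and user 2 has an empty label set (0.0, with a warning). The mean is
# (0.7039 + 0.2961 + 0.0) / 3 = 1/3, which matches the doctest output.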


@ -0,0 +1,3 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import


@ -0,0 +1,45 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from packtml.metrics.ranking import (mean_average_precision, ndcg_at,
precision_at)
from numpy.testing import assert_almost_equal
import warnings
preds = [[1, 6, 2, 7, 8, 3, 9, 10, 4, 5],
[4, 1, 5, 6, 2, 7, 3, 8, 9, 10],
[1, 2, 3, 4, 5]]
labels = [[1, 2, 3, 4, 5], [1, 2, 3], []]
def assert_warning_caught(func):
def test_wrapper(*args, **kwargs):
with warnings.catch_warnings(record=True) as w:
warnings.simplefilter("always")
# execute the fxn
func(*args, **kwargs)
assert len(w) # assert there's something there...
return test_wrapper
@assert_warning_caught
def test_map():
assert_almost_equal(
mean_average_precision(preds, labels), 0.35502645502645497)
@assert_warning_caught
def test_pak():
assert_almost_equal(precision_at(preds, labels, 1), 0.33333333333333331)
assert_almost_equal(precision_at(preds, labels, 5), 0.26666666666666666)
assert_almost_equal(precision_at(preds, labels, 15), 0.17777777777777778)
@assert_warning_caught
def test_ndcg():
assert_almost_equal(ndcg_at(preds, labels, 3), 0.3333333432674408)
assert_almost_equal(ndcg_at(preds, labels, 10), 0.48791273434956867)


@ -0,0 +1,6 @@
# -*- coding: utf-8 -*-
from .mlp import *
from .transfer import *
__all__ = [s for s in dir() if not s.startswith("_")]


@ -0,0 +1,33 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from sklearn.externals import six
from abc import ABCMeta, abstractmethod
import numpy as np
__all__ = [
'tanh',
'NeuralMixin'
]
def tanh(X):
"""Hyperbolic tangent.
Compute the tan-h (Hyperbolic tangent) activation function.
This is a very easily-differentiable activation function.
Parameters
----------
X : np.ndarray, shape=(n_samples, n_features)
The transformed X array (X * W + b).
"""
return np.tanh(X)
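# What makes tanh "very easily-differentiable" is that its derivative can
# be computed from its own output: d/dx tanh(x) = 1 - tanh(x)^2. This is
# exactly the ``1. - np.power(layer_res, 2.)`` term that shows up in the
# MLP's back-propagation step. A quick numerical sketch, assuming only
# numpy:
#
#   >>> import numpy as np
#   >>> x, h = 0.5, 1e-6
#   >>> numeric = (np.tanh(x + h) - np.tanh(x - h)) / (2. * h)
#   >>> analytic = 1. - np.tanh(x) ** 2
#   >>> bool(np.isclose(numeric, analytic))
#   True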
class NeuralMixin(six.with_metaclass(ABCMeta)):
"""Abstract interface for neural network classes."""
@abstractmethod
def export_weights_and_biases(self, output_layer=True):
"""Return the weights and biases of the network"""


@ -0,0 +1,273 @@
# -*- coding: utf-8 -*-
#
# Author: Taylor G Smith <taylor.smith@alkaline-ml.com>
#
# A simple multilayer perceptron classifier. If you find yourself struggling
# to follow the derivation of the back-propagation, check out this great
# refresher on scalar & matrix calculus + differential equations.
# http://parrt.cs.usfca.edu/doc/matrix-calculus/index.html
from __future__ import absolute_import, division
from sklearn.utils.validation import check_X_y, check_random_state
from sklearn.utils.multiclass import check_classification_targets
import numpy as np
from ..base import BaseSimpleEstimator
from .base import NeuralMixin, tanh
__all__ = [
'NeuralNetClassifier'
]
try:
xrange
except NameError: # py3
xrange = range
def _calculate_loss(truth, preds, weights, l2):
"""Compute the log loss.
Calculate the log loss between the true class labels and the predictions
generated by the softmax layer in our neural network.
Parameters
----------
truth : np.ndarray, shape=(n_samples,)
The true labels
preds : np.ndarray, shape=(n_samples, n_classes)
The predicted class probabilities
weights : list
The list of weights matrices. Used for computing the loss
with the L2 regularization.
l2 : float
The regularization parameter
"""
# get the log probs of the prediction for the true class labels
n_samples = truth.shape[0]
logprobs = -np.log(preds[range(n_samples), truth])
# compute the sum of log probs
sum_logprobs = logprobs.sum()
# add the L2 regularization term
sum_logprobs += l2 / 2. * sum(np.square(W).sum() for W in weights)
return 1. / n_samples * sum_logprobs
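# In math terms, with N samples, predicted class probabilities p and true
# labels y, this computes the L2-regularized categorical cross-entropy:
#
#   L = (1 / N) * [ sum_i -log(p_{i, y_i}) + (l2 / 2) * sum_W ||W||_F^2 ]
#
# Note that the regularization term is scaled by 1 / N along with the sum
# of the log probabilities, matching the code above.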
def softmax(X):
"""Apply the softmax function.
The softmax function squashes an N-dimensional vector into a K-dimensional
vector whose elements add up to 1, and whose elements are bound in (0, 1).
Parameters
----------
X : np.ndarray, shape=(n_samples, n_features)
The matrix over which to apply softmax along the rows.
"""
# first compute the exponential. This is a step that would take place
# in the sigmoid (logistic) function as well. We can already begin to see
# where this is going to resemble logistic regression...
X_exp = np.exp(X)
return X_exp / np.sum(X_exp, axis=1, keepdims=True)
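# One caveat: np.exp can overflow for very large logits. A common
# numerically stable variant (a sketch, not used by the class below, which
# favors minimal code) subtracts the row max first; softmax is invariant
# to adding a constant to every element of a row:
#
#   def softmax_stable(X):
#       # shift each row so its max is 0 before exponentiating
#       X_shift = X - X.max(axis=1, keepdims=True)
#       X_exp = np.exp(X_shift)
#       return X_exp / X_exp.sum(axis=1, keepdims=True)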
class NeuralNetClassifier(BaseSimpleEstimator, NeuralMixin):
"""A neural network classifier.
Create a multi-layer perceptron classifier. Note that this is a very
simple implementation of an MLP with only fully-connected layers and
very few tunable parameters. It is designed for readability. For more
optimized neural network code, look into TensorFlow, Keras or other
libraries.
This implementation of a neural net uses the tanh activation function
*only*, and does not allow early stopping; it will always run for the
full ``n_iter`` iterations. Many other parameters that would typically
be tunable in a network (dropout, momentum, alternative activations,
etc.) are left out of this implementation to keep it simple.
Parameters
----------
X : array-like, shape=(n_samples, n_features)
The training array. Should be a numpy array or array-like structure
with only finite values.
y : array-like, shape=(n_samples,)
The target vector.
hidden : iterable, optional (default=(25,))
An iterable indicating the number of units per hidden layer.
n_iter : int, optional (default=10)
The default number of iterations to perform.
learning_rate : float, optional (default=0.001)
The rate at which we descend the gradient.
random_state : int, None or RandomState, optional (default=42)
The random state for initializing the weights matrices.
"""
def __init__(self, X, y, hidden=(25,), n_iter=10, learning_rate=0.001,
regularization=0.01, random_state=42):
self.hidden = hidden
self.random_state = random_state
self.n_iter = n_iter
self.learning_rate = learning_rate
self.regularization = regularization
# initialize weights, biases, etc.
X, y, weights, biases = self._init_weights_biases(
X, y, hidden, random_state, last_dim=None)
# we can keep track of the loss for each iter
train_loss = []
# for each iteration, feed X through the network, compute the loss,
# and back-propagate the error to correct the weights.
for _ in xrange(n_iter):
# compute the product of X on the hidden layers (the output of
# the network)
out, layer_results = self._forward_step(X, weights, biases)
# compute the loss on the output
loss = _calculate_loss(truth=y, preds=out, weights=weights,
l2=self.regularization)
train_loss.append(loss)
# now back-propagate to correct the weights and biases via
# gradient descent
self._back_propagate(y, out, layer_results, weights,
biases, learning_rate,
self.regularization)
# save the weights, biases and loss as instance attributes
self.weights = weights
self.biases = biases
self.train_loss = train_loss
@staticmethod
def _init_weights_biases(X, y, hidden, random_state, last_dim=None):
# make sure dims all match in X, y and that we have appropriate
# classification targets
X, y = check_X_y(X, y, copy=False)
check_classification_targets(y)
random_state = check_random_state(random_state)
# initialize the weights and biases. For each layer, we create a new
# matrix of dimensions [last_layer_col_dim, new_col_dim]. This ensures
# we can compute matrix products across the layers and that the
# dimensions all match up. The biases will each be a vector of ones
# in this example, though in other networks that can be initialized
# differently
weights = []
biases = []
# if last dim is undefined, use the column shape of the input data.
# this argument is used to simplify the initialization of weights/
# biases in the transfer learning class...
if last_dim is None:
last_dim = X.shape[1]
for layer_size in hidden:
# initialize to extremely small values
w = random_state.rand(last_dim, layer_size) * 0.01
b = np.ones(layer_size)
last_dim = layer_size
weights.append(w)
biases.append(b)
# we need to add one more layer (the output layer) that is the size of
# the expected output probabilities. We'll apply the softmax function
# to the output of this layer.
n_outputs = np.unique(y).shape[0]
weights.append(random_state.rand(last_dim, n_outputs))
biases.append(np.ones(n_outputs))
return X, y, weights, biases
@staticmethod
def _forward_step(X, weights, biases):
# track the intermediate products
intermediate_results = [X]
# progress through all the layers EXCEPT the very last one.
for w, b in zip(weights[:-1], biases[:-1]):
# apply the activation function to the product of X and the weights
# (after adding the bias vector)
X = tanh(X.dot(w) + b)
# append this layer result
intermediate_results.append(X)
# we handle the very last layer a bit differently, since it's our
# output layer. First compute the product...
X = X.dot(weights[-1]) + biases[-1]
# then rather than apply the activation function (tanh), we apply
# the softmax, which is essentially generalized logistic regression.
return softmax(X), intermediate_results
@staticmethod
def _back_propagate(truth, probas, layer_results, weights,
biases, learning_rate, l2):
# the probabilities are our first delta. Subtract 1 from the
# TRUE labels' probabilities in the predictions
n_samples = truth.shape[0]
# subtract 1 from true idcs. initial deltas are: (y_hat - y)
probas[range(n_samples), truth] -= 1.
# iterate back through the layers computing the deltas (derivatives)
last_delta = probas
for next_weights, next_biases, layer_res in \
zip(weights[::-1], biases[::-1], layer_results[::-1]):
# the gradient for this layer is equivalent to the previous delta
# multiplied by the intermediate layer result
d_W = layer_res.T.dot(last_delta)
# column sums of the (just-computed) delta is the derivative
# of the biases
d_b = np.sum(last_delta, axis=0)
# set the next delta for the next iter
last_delta = last_delta.dot(next_weights.T) * \
(1. - np.power(layer_res, 2.))
# update the weights gradient with the L2 regularization term
d_W += l2 * next_weights
# update the weights in this layer. The learning rate governs how
# quickly we descend the gradient
next_weights += -learning_rate * d_W
next_biases += -learning_rate * d_b
def predict(self, X):
# compute the probabilities and then get the argmax for each class
probas = self.predict_proba(X)
# we want the argmaxes of each row
return np.argmax(probas, axis=1)
def predict_proba(self, X):
# simply compute a forward step (we don't care about idx 1 of the
# tuple, which is just the intermediate products)
return self._forward_step(X, self.weights, self.biases)[0]
def export_weights_and_biases(self, output_layer=True):
w, b = self.weights, self.biases
if output_layer:
return w, b
return w[:-1], b[:-1]
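# A minimal usage sketch, assuming scikit-learn's iris data (the same data
# the unit tests further down use). One loss value is tracked per
# iteration, so ``train_loss`` has exactly ``n_iter`` entries:
#
#   >>> from sklearn.datasets import load_iris
#   >>> iris = load_iris()
#   >>> clf = NeuralNetClassifier(iris.data, iris.target, n_iter=25,
#   ...                           random_state=42)
#   >>> len(clf.train_loss)
#   25
#   >>> preds = clf.predict(iris.data)   # array of predicted class labels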


@ -0,0 +1,3 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import


@ -0,0 +1,15 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from packtml.neural_net import NeuralNetClassifier
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
def test_mlp():
# show we can fit and predict
clf = NeuralNetClassifier(X, y, random_state=42)
clf.predict(X)


@ -0,0 +1,52 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from packtml.neural_net import NeuralNetClassifier, TransferLearningClassifier
import numpy as np
def test_transfer_learner():
rs = np.random.RandomState(42)
covariance = [[1, .75], [.75, 1]]
# these are the majority classes
n_obs = 500
x1 = rs.multivariate_normal(mean=[0, 0], cov=covariance, size=n_obs)
x2 = rs.multivariate_normal(mean=[1, 5], cov=covariance, size=n_obs)
# this is the minority class
x3 = rs.multivariate_normal(mean=[0.85, 3.25],
cov=[[1., .5], [1.25, 0.85]],
size=150)
# this is what the FIRST network will be trained on
n_first = 400
X = np.vstack((x1[:n_first], x2[:n_first])).astype(np.float32)
y = np.hstack((np.zeros(n_first), np.ones(n_first))).astype(int)
# this is what the SECOND network will be trained on
X2 = np.vstack((x1[n_first:], x2[n_first:], x3)).astype(np.float32)
y2 = np.hstack((np.zeros(n_obs - n_first),
np.ones(n_obs - n_first),
np.ones(x3.shape[0]) * 2)).astype(int)
# Fit the first neural network
clf = NeuralNetClassifier(X, y, hidden=(25, 25), n_iter=50,
learning_rate=0.001, random_state=42)
# Fit the transfer network - train one more layer with a new class
transfer = TransferLearningClassifier(X2, y2, pretrained=clf, hidden=(15,),
n_iter=10, random_state=42)
# show we can predict
transfer.predict(X2)
# show we can use a transfer learner on an existing transfer learner
transfer2 = TransferLearningClassifier(X2, y2, pretrained=transfer,
hidden=(25,),
random_state=15)
# and show we can still predict
transfer2.predict(X2)


@ -0,0 +1,154 @@
# -*- coding: utf-8 -*-
#
# Author: Taylor G Smith <taylor.smith@alkaline-ml.com>
#
# A simple transfer learning classifier. If you find yourself struggling
# to follow the derivation of the back-propagation, check out this great
# refresher on scalar & matrix calculus + differential equations.
# http://parrt.cs.usfca.edu/doc/matrix-calculus/index.html
from __future__ import absolute_import
import numpy as np
from .base import NeuralMixin, tanh
from ..base import BaseSimpleEstimator
from .mlp import NeuralNetClassifier, _calculate_loss
__all__ = [
'TransferLearningClassifier'
]
try:
xrange
except NameError:
xrange = range
def _pretrained_forward_step(X, pt_weights, pt_biases):
"""Complete a forward step from the pre-trained model"""
# progress through all the layers (the output was already trimmed off)
for w, b in zip(pt_weights, pt_biases):
X = tanh(X.dot(w) + b)
return X
class TransferLearningClassifier(BaseSimpleEstimator, NeuralMixin):
"""A transfer learning classifier.
Create a multi-layer perceptron classifier that learned from a
previously-trained network. No fine-tuning is performed, and no
prior-trained layers can be retrained (i.e., they remain frozen).
Parameters
----------
X : array-like, shape=(n_samples, n_features)
The training array. Should be a numpy array or array-like structure
with only finite values.
y : array-like, shape=(n_samples,)
The target vector.
pretrained : NeuralNetClassifier, TransferLearningClassifier
The pre-trained MLP. The transfer learner leverages the features
extracted from the pre-trained network (the trained weights without
the output layer) and uses them to transform the input data before
training the new layers.
hidden : iterable, optional (default=(25,))
An iterable indicating the number of units per hidden layer.
n_iter : int, optional (default=10)
The default number of iterations to perform.
learning_rate : float, optional (default=0.001)
The rate at which we descend the gradient.
random_state : int, None or RandomState, optional (default=42)
The random state for initializing the weights matrices.
"""
def __init__(self, X, y, pretrained, hidden=(25,), n_iter=10,
regularization=0.01, learning_rate=0.001, random_state=42):
# initialize via the NN static method
self.hidden = hidden
self.random_state = random_state
self.n_iter = n_iter
self.learning_rate = learning_rate
self.regularization = regularization
# this is the previous model
self.model = pretrained
# assert that it's a neural net or we'll break down later
assert isinstance(pretrained, NeuralMixin), \
"Pre-trained model must be a neural network!"
# initialize weights, biases, etc. for THE TRAINABLE LAYERS ONLY!
pt_w, pt_b = pretrained.export_weights_and_biases(output_layer=False)
X, y, weights, biases = NeuralNetClassifier._init_weights_biases(
X, y, hidden, random_state,
# use as the last dim the column dimension of the last weights
# (the ones BEFORE the output layer, that is)
last_dim=pt_w[-1].shape[1])
# we can train this in a similar fashion to the plain MLP we designed:
# for each iteration, feed X through the network, compute the loss,
# and back-propagate the error to correct the weights.
train_loss = []
for _ in xrange(n_iter):
# first, pass the input data through the pre-trained model's
# hidden layers. Do not pass it through the last layer, however,
# since we don't want its output from the softmax layer.
X_transform = _pretrained_forward_step(X, pt_w, pt_b)
# NOW we complete a forward step on THIS model's
# untrained weights/biases
out, layer_results = NeuralNetClassifier._forward_step(
X_transform, weights, biases)
# compute the loss on the output
loss = _calculate_loss(truth=y, preds=out, weights=pt_w + weights,
l2=self.regularization)
train_loss.append(loss)
# now back-propagate to correct THIS MODEL's weights and biases via
# gradient descent. NOTE we do NOT adjust the pre-trained model's
# weights!!!
NeuralNetClassifier._back_propagate(
truth=y, probas=out, layer_results=layer_results,
weights=weights, biases=biases,
learning_rate=learning_rate,
l2=self.regularization)
# save the weights, biases
self.weights = weights
self.biases = biases
self.train_loss = train_loss
def predict(self, X):
# compute the probabilities and then get the argmax for each class
probas = self.predict_proba(X)
# we want the argmaxes of each row
return np.argmax(probas, axis=1)
def predict_proba(self, X):
# Compute a forward step with the pre-trained model first:
pt_w, pt_b = self.model.export_weights_and_biases(output_layer=False)
X_transform = _pretrained_forward_step(X, pt_w, pt_b)
# and then complete a forward step with the trained weights and biases
return NeuralNetClassifier._forward_step(
X_transform, self.weights, self.biases)[0]
def export_weights_and_biases(self, output_layer=True):
pt_weights, pt_biases = \
self.model.export_weights_and_biases(output_layer=False)
w = pt_weights + self.weights
b = pt_biases + self.biases
if output_layer:
return w, b
return w[:-1], b[:-1]


@ -0,0 +1,7 @@
# -*- coding: utf-8 -*-
from .als import *
from .data import *
from .itemitem import *
__all__ = [s for s in dir() if not s.startswith("_")]


@ -0,0 +1,202 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from sklearn.utils.validation import check_random_state, check_array
from numpy.linalg import solve
import numpy as np
from .base import RecommenderMixin
from ..base import BaseSimpleEstimator
__all__ = [
'ALS'
]
try:
xrange
except NameError: # py3 does not have xrange
xrange = range
def mse(R, X, Y, W):
"""Compute the reconstruction MSE. This is our loss function"""
return ((W * (R - X.dot(Y))) ** 2).sum()
class ALS(BaseSimpleEstimator, RecommenderMixin):
r"""Alternating Least Squares for explicit ratings matrices.
Computes the ALS user factors and item factors for explicit ratings
systems. This solves:
R' = XY
where ``X`` is an (m x f) matrix of user factors, and ``Y`` is an
(f x n) matrix of item factors. Note that for very large ratings matrices,
this can quickly grow outside the scope of what will fit into memory!
Parameters
----------
R : array-like, shape=(n_users, n_items)
The ratings matrix. This must be an explicit ratings matrix where
0 indicates an item that a user has not yet rated.
factors : int or float, optional (default=0.25)
The number of factors to learn. Default is ``0.25 * n_items``.
n_iter : int, optional (default=10)
The number of iterations to perform. The larger the number, the
smaller the train error, but the more likely to overfit.
lam : float, optional (default=0.001)
The L2 regularization parameter. The higher ``lam``, the more
regularization is performed, and the more robust the solution. However,
extra iterations are typically required.
random_state : int, None or RandomState, optional (default=None)
The random state for seeding the initial item factors matrix, ``Y``.
Attributes
----------
X : np.ndarray, shape=(n_users, factors)
The user factors
Y : np.ndarray, shape=(factors, n_items)
The item factors
train_err : list
The list of training MSE for each iteration performed
lam : float
The lambda (regularization) value.
Notes
-----
If you plan to use a very large matrix, consider using a sparse CSR matrix
to preserve memory, but you'll have to amend the ``recommend_for_user``
function, which expects dense output.
"""
def __init__(self, R, factors=0.25, n_iter=10, lam=0.001,
random_state=None):
# check the array
R = check_array(R, dtype=np.float32) # type: np.ndarray
n_users, n_items = R.shape
# get the random state
random_state = check_random_state(random_state)
# get the number of factors. If it's a float, compute it
if isinstance(factors, float):
factors = min(np.ceil(factors * n_items).astype(int), n_items)
# the weight matrix is used as a masking matrix when computing the MSE.
# it allows us to only compute the reconstruction MSE on the rated
# items, and not the unrated ones.
W = (R > 0.).astype(np.float32)
# initialize the first array, Y, and X to None
Y = random_state.rand(factors, n_items)
X = None
# the identity matrix (times lambda) is added to the X^T X or Y Y^T
# product at each iteration.
I = np.eye(factors) * lam
# this list will store all of the training errors
train_err = []
# for each iteration, iteratively solve for X, Y, and compute the
# updated MSE
for i in xrange(n_iter):
X = solve(Y.dot(Y.T) + I, Y.dot(R.T)).T
Y = solve(X.T.dot(X) + I, X.T.dot(R))
# update the training error
train_err.append(mse(R, X, Y, W))
# now we have X, Y, which are our user factors and item factors
self.X = X
self.Y = Y
self.train_err = train_err
self.n_factors = factors
self.lam = lam
def predict(self, R, recompute_users=False):
"""Generate predictions for the test set.
Computes the predicted product of ``XY`` given the fit factors.
If recomputing users, will learn the new user factors given the
existing item factors.
"""
R = check_array(R, dtype=np.float32, copy=False) # type: np.ndarray
Y = self.Y # item factors
n_factors, _ = Y.shape
# we can re-compute user factors on their updated ratings, if we want.
# (not always advisable, but can be useful for offline recommenders)
if recompute_users:
I = np.eye(n_factors) * self.lam
X = solve(Y.dot(Y.T) + I, Y.dot(R.T)).T
else:
X = self.X
return X.dot(Y)
def recommend_for_user(self, R, user, n=10, recompute_user=False,
filter_previously_seen=False,
return_scores=True):
"""Generate predictions for a single user.
Parameters
----------
R : array-like, shape=(n_users, n_items)
The test ratings matrix. This must be an explicit ratings matrix
where 0 indicates an item that a user has not yet rated.
user : int
The user index for whom to generate predictions.
n : int or None, optional (default=10)
The number of recommendations to return. Default is 10. For all,
set to None.
recompute_user : bool, optional (default=False)
Whether to recompute the user factors given the test set.
Not always advisable, as it can be considered leakage, but can
be useful in an offline recommender system where refits are
infrequent.
filter_previously_seen : bool, optional (default=False)
Whether to filter out previously-rated items.
return_scores : bool, optional (default=True)
Whether to return the computed scores for the recommended items.
Returns
-------
items : np.ndarray
The top ``n`` items recommended for the user.
scores (optional) : np.ndarray
The corresponding scores for the top ``n`` items for the user.
Only returned if ``return_scores`` is True.
"""
R = check_array(R, dtype=np.float32, copy=False)
# compute the new user vector. Squeeze to make sure it's a vector
user_vec = self.predict(R, recompute_users=recompute_user)[user, :]
item_indices = np.arange(user_vec.shape[0])
# if we are filtering previously seen, remove the prior-rated items
if filter_previously_seen:
rated_mask = R[user, :] != 0.
user_vec = user_vec[~rated_mask]
item_indices = item_indices[~rated_mask]
order = np.argsort(-user_vec)[:n] # descending order of computed scores
items = item_indices[order]
if return_scores:
return items, user_vec[order]
return items
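# The two ``solve`` calls in ``__init__`` are the alternating normal
# equations. Holding Y fixed, minimizing ||R - XY||^2 + lam * ||X||^2
# over X gives
#
#   X^T = (Y Y^T + lam * I)^{-1} Y R^T
#
# and, symmetrically, holding X fixed,
#
#   Y = (X^T X + lam * I)^{-1} X^T R
#
# Note that the loss tracked by ``mse`` is masked by W (rated cells only),
# while these closed-form updates solve the unmasked problem -- a
# simplification that keeps the implementation readable.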


@ -0,0 +1,42 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from sklearn.externals import six
from abc import ABCMeta, abstractmethod
__all__ = [
'RecommenderMixin'
]
try:
xrange
except NameError: # py3
xrange = range
class RecommenderMixin(six.with_metaclass(ABCMeta)):
"""Mixin interface for recommenders.
This class should be inherited by recommender algorithms. It provides an
abstract interface for generating recommendations for a user, and a
function for creating recommendations for all users.
"""
@abstractmethod
def recommend_for_user(self, R, user, n=10, filter_previously_seen=False,
return_scores=True, **kwargs):
"""Generate recommendations for a user.
A method that should be overridden by subclasses to create
recommendations via their own prediction strategy.
"""
def recommend_for_all_users(self, R, n=10,
filter_previously_seen=False,
return_scores=True, **kwargs):
"""Create recommendations for all users."""
return (
self.recommend_for_user(
R, user, n=n, filter_previously_seen=filter_previously_seen,
return_scores=return_scores, **kwargs)
for user in xrange(R.shape[0]))


@ -0,0 +1,77 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
import numpy as np
__all__ = [
'get_completely_fabricated_ratings_data'
]
def get_completely_fabricated_ratings_data():
"""Disclaimer: this is a made-up data set.
Get a ratings data set for use with one of the packtml recommenders.
This data set is a completely made-up ratings matrix consisting of
cult classics, all of which are awesome (seriously, if there are any
you haven't seen, you should).
(Please
don't
sue
me......)
The data contains 5 users and 15 items (movies). Movies:
0) Ghost Busters
1) Ghost Busters 2
2) The Goonies
3) Big Trouble in Little China
4) The Rocky Horror Picture Show
5) A Clockwork Orange
6) Pulp Fiction
7) Bill & Ted's Excellent Adventure
8) Weekend at Bernie's
9) Dumb and Dumber
10) Clerks
11) Jay & Silent Bob Strike Back
12) Tron
13) Total Recall
14) The Princess Bride
Notes
-----
Seriously, I fabricated all of these ratings semi-haphazardly. Don't
take this as me bashing any movies.
"""
return (np.array([
# user 0 is a classic 30-yo millennial who is nostalgic for the 90s
[5.0, 3.5, 5.0, 0.0, 0.0, 0.0, 4.5, 3.0,
0.0, 2.5, 4.0, 4.0, 0.0, 1.5, 3.0],
# user 1 is a 40-yo who only likes action
[1.5, 0.0, 0.0, 1.0, 0.0, 4.0, 5.0, 0.0,
2.0, 0.0, 3.0, 3.5, 0.0, 4.0, 0.0],
# user 2 is a 12-yo whose parents are strict about what she watches.
[4.5, 4.0, 5.0, 0.0, 0.0, 0.0, 0.0, 4.0,
3.5, 5.0, 0.0, 0.0, 0.0, 0.0, 5.0],
# user 3 has just about seen it all, and doesn't really care for
# the goofy stuff. (but seriously, who rates the Goonies 2/5???)
[2.0, 1.0, 2.0, 1.0, 2.5, 4.5, 4.5, 0.5,
1.5, 1.0, 2.0, 2.5, 3.5, 3.5, 2.0],
# user 4 has just opened a netflix account and hasn't had a chance
# to watch too much
[0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 1.5, 4.0, 0.0, 0.0],
]), np.array(["Ghost Busters", "Ghost Busters 2",
"The Goonies", "Big Trouble in Little China",
"The Rocky Horror Picture Show", "A Clockwork Orange",
"Pulp Fiction", "Bill & Ted's Excellent Adventure",
"Weekend at Bernie's", "Dumb and Dumber", "Clerks",
"Jay & Silent Bob Strike Back", "Tron", "Total Recall",
"The Princess Bride" ]))


@ -0,0 +1,140 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from sklearn.utils.validation import check_array
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from .base import RecommenderMixin
from ..base import BaseSimpleEstimator
__all__ = [
'ItemItemRecommender'
]
try:
xrange
except NameError: # py3
xrange = range
class ItemItemRecommender(BaseSimpleEstimator, RecommenderMixin):
"""Item-to-item recommendation system using cosine similarity.
A collaborative filtering recommender algorithm that computes the cosine
similarity between each item and generates recommendations for users'
highly rated items by returning similar items.
Parameters
----------
R : array-like, shape=(n_users, n_items)
The ratings matrix. This must be an explicit ratings matrix where
0 indicates an item that a user has not yet rated.
k : int, optional (default=10)
The number of most-similar items to retain for each item in the
similarity matrix; all other similarities are zeroed out.
Attributes
----------
similarity : np.ndarray, shape=(n_items, n_items)
The similarity matrix.
Notes
-----
This implementation is very rudimentary and does not allow tuning of
hyper-parameters apart from ``k``. No similarity metrics apart from cosine
similarity may be used. It is largely written to optimize readability. For
a very highly optimized version, try the "implicit" library.
"""
def __init__(self, R, k=10):
# check the array, but don't copy if not needed
R = check_array(R, dtype=np.float32, copy=False) # type: np.ndarray
# save the hyperparameter for later use
self.k = k
# compute the cosine similarity between all the ITEMS (note the
# transpose below: the rows of R.T are items)
sim = cosine_similarity(R.T)
# Only keep the similarities of the top K, setting all others to zero
# (negative since we want descending)
not_top_k = np.argsort(-sim, axis=1)[:, k:] # shape=(n_items, k)
if not_top_k.shape[1]: # only if there are cols (k < n_items)
# now we have to set these to zero in the similarity matrix
row_indices = np.repeat(range(not_top_k.shape[0]),
not_top_k.shape[1])
sim[row_indices, not_top_k.ravel()] = 0.
self.similarity = sim
def recommend_for_user(self, R, user, n=10,
filter_previously_seen=False,
return_scores=True, **kwargs):
"""Generate predictions for a single user.
Parameters
----------
R : array-like, shape=(n_users, n_items)
The test ratings matrix. This must be an explicit ratings matrix
where 0 indicates an item that a user has not yet rated.
user : int
The user index for whom to generate predictions.
n : int or None, optional (default=10)
The number of recommendations to return. Default is 10. For all,
set to None.
filter_previously_seen : bool, optional (default=False)
Whether to filter out previously-rated items.
return_scores : bool, optional (default=True)
Whether to return the computed scores for the recommended items.
**kwargs : keyword args
Ignored. Present to match super signature.
Returns
-------
items : np.ndarray
The top ``n`` items recommended for the user.
recommendations (optional) : np.ndarray
The corresponding scores for the top ``n`` items for the
user. Only returned if ``return_scores`` is True.
"""
# check the array and get the user vector
R = check_array(R, dtype=np.float32, copy=False)
user_vector = R[user, :]
# compute the dot product between the user vector and the similarity
# matrix
recommendations = user_vector.dot(self.similarity) # shape=(n_items,)
# if we're filtering previously-seen items, now is the time to do that
item_indices = np.arange(recommendations.shape[0])
if filter_previously_seen:
rated_mask = user_vector != 0.
recommendations = recommendations[~rated_mask]
item_indices = item_indices[~rated_mask]
# now arg sort descending (most similar items first)
order = np.argsort(-recommendations)[:n]
items = item_indices[order]
if return_scores:
return items, recommendations[order]
return items
def predict(self, R):
"""Generate predictions for the test set.
Computes the predicted product of users' rated vectors on the
pre-computed similarity matrix.
"""
R = check_array(R, dtype=np.float32, copy=False) # type: np.ndarray
# compute the product R*sim
return R.dot(self.similarity)
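# A minimal usage sketch, assuming the fabricated ratings helper defined in
# ``packtml.recommendation.data``: recommend a few unseen movies for user 0.
#
#   >>> from packtml.recommendation import (
#   ...     ItemItemRecommender, get_completely_fabricated_ratings_data)
#   >>> R, titles = get_completely_fabricated_ratings_data()
#   >>> rec = ItemItemRecommender(R, k=5)
#   >>> items, scores = rec.recommend_for_user(
#   ...     R, user=0, n=3, filter_previously_seen=True)
#   >>> unseen_picks = titles[items]   # three titles user 0 hasn't rated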


@ -0,0 +1,3 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import


@ -0,0 +1,44 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from packtml.recommendation import ALS
# make up a ratings matrix...
R = [[1., 0., 3.5, 2., 0., 0., 0., 1.5],
[0., 2., 3., 0., 0., 2.5, 0., 0. ],
[3.5, 4., 2., 0., 4.5, 3.5, 0., 2. ],
[3., 3.5, 0., 2.5, 3., 0., 0., 0. ]]
def test_als_simple_fit():
als = ALS(R, factors=3, n_iter=5, random_state=42)
assert len(als.train_err) == 5, als.train_err
assert als.n_factors == 3, als.n_factors
# assert all errors are decreasing over time
errs = list(zip(als.train_err[:-1], als.train_err[1:]))
assert all(new_err < last_err for last_err, new_err in errs), errs
def test_als_predict():
als = ALS(R, factors=4, n_iter=8, random_state=42)
user0, scr = als.recommend_for_user(R, 0, filter_previously_seen=True,
return_scores=True)
# assert previously-rated items not present
rated = (0, 2, 3, 7)
for r in rated: # previously-rated
assert r not in user0
# show the score lengths are the same
assert scr.shape[0] == user0.shape[0]
# now if we do NOT filter, assert those are present again (also, recompute)
user0, scr = als.recommend_for_user(R, 0, filter_previously_seen=False,
return_scores=True,
recompute_user=True)
for r in rated:
assert r in user0
assert user0.shape[0] == scr.shape[0]


@ -0,0 +1,67 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from packtml.recommendation import ItemItemRecommender
import numpy as np
from numpy.testing import assert_array_almost_equal
from types import GeneratorType
# make up a ratings matrix...
R = np.array([[1., 0., 3.5, 2., 0., 0., 0., 1.5],
[0., 2., 3., 0., 0., 2.5, 0., 0. ],
[3.5, 4., 2., 0., 4.5, 3.5, 0., 2. ],
[3., 3.5, 0., 2.5, 3., 0., 0., 0. ]])
def test_itemitem_simple():
rec = ItemItemRecommender(R, k=3)
# assert on the similarity
expected = np.array([
[ 1. , 0.91461057, 0. , 0. , 0.9701687 ,
0. , 0. , 0. ],
[ 0.91461057, 1. , 0. , 0. , 0.92793395,
0. , 0. , 0. ],
[ 0. , 0. , 1. , 0. , 0. ,
0.6708902 , 0. , 0.73632752],
[ 0.62906665, 0.48126166, 0. , 1. , 0. ,
0. , 0. , 0. ],
[ 0.9701687 , 0.92793395, 0. , 0. , 1. ,
0. , 0. , 0. ],
[ 0. , 0.77786258, 0. , 0. , 0.67706717,
1. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. ],
[ 0.72079856, 0. , 0.73632752, 0. , 0. ,
0. , 0. , 1. ]])
assert_array_almost_equal(expected, rec.similarity)
# show we can generate recommendations
rec0, scores0 = rec.recommend_for_user(R, 0)
# we didn't filter, so the rated items should still be present
assert np.in1d([0, 2, 3, 7], rec0).all()
# re-compute and show the previously-rated are not present
rec0_filtered, scores0_filtered = rec.recommend_for_user(
R, 0, filter_previously_seen=True)
assert len(rec0_filtered) == 4, rec0_filtered
assert rec0_filtered.tolist() == [5, 1, 4, 6]
# test the prediction, which is just a big product...
pred = rec.predict(R)
assert pred.shape == R.shape
# get recommendations for ALL users
recommendations = rec.recommend_for_all_users(R, return_scores=False,
filter_previously_seen=False)
assert isinstance(recommendations, GeneratorType)
recs = list(recommendations)
assert len(recs) == 4
assert all(len(x) == 8 for x in recs)


@ -0,0 +1,7 @@
# -*- coding: utf-8 -*-
from .simple_regression import *
from .simple_logistic import *
__all__ = [s for s in dir() if not s.startswith("_")]


@ -0,0 +1,123 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from sklearn.utils.validation import check_X_y, check_array
import numpy as np
from ..utils.extmath import log_likelihood, logistic_sigmoid
from ..utils.validation import assert_is_binary
from ..base import BaseSimpleEstimator
__all__ = [
'SimpleLogisticRegression'
]
try:
xrange
except NameError: # py 3 doesn't have an xrange
xrange = range
class SimpleLogisticRegression(BaseSimpleEstimator):
"""Simple logistic regression.
This class provides a very simple example of straightforward logistic
regression with an intercept. There are few tunable parameters aside from
the number of iterations and the learning rate, and the model is fit upon
class initialization.
Parameters
----------
X : array-like, shape=(n_samples, n_features)
The array of predictor variables. This is the array we will use
to regress on ``y``.
y : array-like, shape=(n_samples,)
This is the target array on which we will regress to build
our model. It should be binary (0, 1).
n_steps : int, optional (default=100)
The number of iterations to perform.
learning_rate : float, optional (default=0.001)
The learning rate.
loglik_interval : int, optional (default=5)
How frequently to compute the log likelihood. This is an expensive
operation, so computing it too frequently will slow down training.
Attributes
----------
theta : array-like, shape=(n_features,)
The coefficients
intercept : float
The intercept term
log_likelihood : list
A list of the iterations' log-likelihoods
"""
def __init__(self, X, y, n_steps=100, learning_rate=0.001,
loglik_interval=5):
X, y = check_X_y(X, y, accept_sparse=False, # keep dense for example
y_numeric=True)
# we want to make sure y is binary since that's all our example covers
assert_is_binary(y)
# X should be centered/scaled for logistic regression, much like
# with linear regression
means, stds = X.mean(axis=0), X.std(axis=0)
X = (X - means) / stds
# since we're going to learn an intercept, we can cheat and set the
# intercept to be a new feature that we'll learn with everything else
X_w_intercept = np.hstack((np.ones((X.shape[0], 1)), X))
# initialize the coefficients as zeros
theta = np.zeros(X_w_intercept.shape[1])
# now for each step, we compute the inner product of X and the
# coefficients, transform the predictions with the sigmoid function,
# and adjust the weights by the gradient
ll = []
for iteration in xrange(n_steps):
preds = logistic_sigmoid(X_w_intercept.dot(theta))
residuals = y - preds # The error term
gradient = X_w_intercept.T.dot(residuals)
# update the coefficients
theta += learning_rate * gradient
# you may not always want to do this, since it's expensive. Tune
# ``loglik_interval`` to compute it more or less frequently
if (iteration + 1) % loglik_interval == 0:
ll.append(log_likelihood(X_w_intercept, y, theta))
# recall that our theta includes the intercept, so we need to pop
# that off and store it
self.intercept = theta[0]
self.theta = theta[1:]
self.log_likelihood = ll
self.column_means = means
self.column_std = stds
def predict_proba(self, X):
"""Generate the probabilities that a sample belongs to class 1"""
X = check_array(X, accept_sparse=False, copy=False) # type: np.ndarray
# make sure dims match
theta = self.theta
if theta.shape[0] != X.shape[1]:
raise ValueError("Dim mismatch in predictors!")
# scale the data appropriately
X = (X - self.column_means) / self.column_std
# creates a copy
return logistic_sigmoid(np.dot(X, theta.T) + self.intercept)
def predict(self, X):
return np.round(self.predict_proba(X)).astype(int)


@ -0,0 +1,100 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from sklearn.utils.validation import check_X_y, check_array
import numpy as np
from numpy.linalg import lstsq
from ..base import BaseSimpleEstimator
__all__ = [
'SimpleLinearRegression'
]
class SimpleLinearRegression(BaseSimpleEstimator):
"""Simple linear regression.
This class provides a very simple example of straightforward OLS
regression with an intercept. There are no tunable parameters, and
the model fit happens directly on class instantiation.
Parameters
----------
X : array-like, shape=(n_samples, n_features)
The array of predictor variables. This is the array we will use
to regress on ``y``.
y : array-like, shape=(n_samples,)
This is the target array on which we will regress to build
our model.
Attributes
----------
theta : array-like, shape=(n_features,)
The least-squares solution (the coefficients)
rank : int
The rank of the predictor matrix, ``X``
singular_values : array-like, shape=(n_features,)
The singular values of ``X``
X_means : array-like, shape=(n_features,)
The column means of the predictor matrix, ``X``
y_mean : float
The mean of the target variable, ``y``
intercept : float
The intercept term
"""
def __init__(self, X, y):
# First check X, y and make sure they are of equal length, no NaNs
# and that they are numeric
X, y = check_X_y(X, y, y_numeric=True,
accept_sparse=False) # keep it simple
# Next, we want to scale all of our features so X is centered
# We will do the same with our target variable, y
X_means = np.average(X, axis=0)
y_mean = y.mean(axis=0)
# don't do in place, so we get a copy
X = X - X_means
y = y - y_mean
# Let's compute the least squares on X wrt y
# Least squares solves the equation `a x = b` by computing a
# vector `x` that minimizes the Euclidean 2-norm `|| b - a x ||^2`.
theta, _, rank, singular_values = lstsq(X, y)
# finally, we compute the intercept values as the mean of the target
# variable MINUS the inner product of the X_means and the coefficients
intercept = y_mean - np.dot(X_means, theta.T)
# ... and set everything as an instance attribute
self.theta = theta
self.rank = rank
self.singular_values = singular_values
# we have to retain some of the statistics around the data too
self.X_means = X_means
self.y_mean = y_mean
self.intercept = intercept
def predict(self, X):
"""Compute new predictions for X"""
# copy, make sure numeric, etc...
X = check_array(X, accept_sparse=False, copy=False) # type: np.ndarray
# make sure dims match
theta = self.theta
if theta.shape[0] != X.shape[1]:
raise ValueError("Dim mismatch in predictors!")
# creates a copy
return np.dot(X, theta.T) + self.intercept
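Because the fit happens on instantiation, typical usage is construct-then-predict. A quick sketch on noise-free synthetic data (the coefficients and intercept here are illustrative, but OLS should recover them almost exactly):

import numpy as np
from packtml.regression import SimpleLinearRegression

rs = np.random.RandomState(42)
X = rs.rand(100, 2)
y = 2. * X[:, 0] + 1.5 * X[:, 1] + 3.

lm = SimpleLinearRegression(X, y)
print(lm.theta)      # ~[2.0, 1.5]
print(lm.intercept)  # ~3.0
preds = lm.predict(X)  # predictions on the training data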

View File

@ -0,0 +1,3 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import

View File

@ -0,0 +1,27 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from packtml.regression import SimpleLogisticRegression
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
import numpy as np
X, y = make_classification(n_samples=100, n_features=2, random_state=42,
n_redundant=0, n_repeated=0, n_classes=2,
class_sep=1.0)
def test_simple_logistic():
lm = SimpleLogisticRegression(X, y, n_steps=50, loglik_interval=10)
assert np.allclose(lm.theta, np.array([ 1.32320936, -0.03926072]))
# test that we can predict
preds = lm.predict(X)
# show we're better than chance
assert accuracy_score(y, preds) > 0.5
# show that we only computed the log likelihood 5 times
assert len(lm.log_likelihood) == 5, lm.log_likelihood

View File

@ -0,0 +1,21 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from packtml.regression import SimpleLinearRegression
import numpy as np
from numpy.testing import assert_almost_equal
def test_simple_linear_regression():
# y = 2a + 1.5b + 0
random_state = np.random.RandomState(42)
X = random_state.rand(100, 2)
y = 2. * X[:, 0] + 1.5 * X[:, 1]
lm = SimpleLinearRegression(X, y)
predictions = lm.predict(X)
residuals = y - predictions
assert_almost_equal(residuals.sum(), 0.)
assert np.allclose(lm.theta, [2., 1.5])

View File

@ -0,0 +1,8 @@
# -*- coding: utf-8 -*-
from .extmath import *
from .linalg import *
from .plotting import *
from .validation import *
__all__ = [s for s in dir() if not s.startswith("_")]

View File

@ -0,0 +1,60 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
import numpy as np
__all__ = [
'log_likelihood',
'logistic_sigmoid'
]
def log_likelihood(X, y, w):
"""Compute the log-likelihood function.
Computes the log-likelihood function over the training data.
The key to the log-likelihood is that the log of the product of
likelihoods becomes the sum of logs. That is (in pseudo-code),
np.log(np.product([f(i) for i in range(N)]))
is equivalent to:
np.sum([np.log(f(i)) for i in range(N)])
The log-likelihood function is used in computing the gradient for
our loss function since the derivative of the sum (of logs) is equivalent
to the sum of derivatives, which simplifies all of our math.
Parameters
----------
X : np.ndarray, shape=(n_samples, n_features)
The training data.
y : np.ndarray, shape=(n_samples,)
The target vector of 1s or 0s.
w : np.ndarray, shape=(n_features,)
The vector of feature weights (coefficients)
References
----------
.. [1] For a very thorough explanation of the log-likelihood function, see
https://www.coursera.org/learn/ml-classification/lecture/1ZeTC/very-optional-expressing-the-log-likelihood
"""
weighted = X.dot(w)
return (y * weighted - np.log(1. + np.exp(weighted))).sum()
def logistic_sigmoid(x):
"""The logistic function.
Compute the logistic (sigmoid) function over a vector, ``x``.
Parameters
----------
x : np.ndarray, shape=(n_samples,)
A vector to transform.
"""
return 1. / (1. + np.exp(-x))
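Both helpers are easy to sanity-check numerically. A small sketch, assuming the module is importable as packtml.utils.extmath (per the utils ``__init__`` above):

import numpy as np
from packtml.utils.extmath import log_likelihood, logistic_sigmoid

# the sigmoid squashes the real line into (0, 1), centered at 0.5
print(logistic_sigmoid(np.array([-10., 0., 10.])))  # ~[0., 0.5, 1.]

# the docstring's identity: log of a product == sum of logs
vals = np.array([0.2, 0.5, 0.9])
print(np.log(np.prod(vals)), np.sum(np.log(vals)))  # equal

# weights that separate the classes yield a higher log likelihood
rs = np.random.RandomState(42)
X = rs.rand(20, 2) - 0.5
y = (X[:, 0] > 0).astype(int)
print(log_likelihood(X, y, np.array([5., 0.])) >
      log_likelihood(X, y, np.zeros(2)))  # True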

View File

@ -0,0 +1,28 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from numpy import linalg as la
__all__ = [
'l2_norm'
]
def l2_norm(X, axis=0):
"""Compute the L2 (Euclidean) norm of a matrix.
Computes the L2 norm along the specified axis. If axis is 0,
computes the norms along the columns. If 1, computes along the
rows.
Parameters
----------
X : array-like, shape=(n_samples, n_features)
The matrix on which to compute the norm.
axis : int, optional (default=0)
The axis along which to compute the norm. 0 is for columns,
1 is for rows.
"""
return la.norm(X, ord=None, axis=axis)
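For a concrete read on the ``axis`` argument, a tiny sketch (the import style mirrors the test below):

import numpy as np
from packtml.utils import linalg

X = np.array([[3., 0.],
              [4., 0.],
              [0., 5.]])
print(linalg.l2_norm(X, axis=0))  # [5. 5.]  column norms
print(linalg.l2_norm(X, axis=1))  # [3. 4. 5.]  row norms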

View File

@ -0,0 +1,160 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from matplotlib.colors import ListedColormap
from matplotlib import pyplot as plt
from .validation import learning_curve
import numpy as np
__all__ = [
'add_decision_boundary_to_axis',
'plot_learning_curve'
]
def add_decision_boundary_to_axis(estimator, axis, nclasses,
X_data, stepsize=0.02,
colors=('#FFAAAA', '#AAFFFA', '#AAAAFF')):
"""Plot a classification decision boundary on an axis.
Estimates lots of values from a classifier and adds the color map
mesh to an axis. WARNING - use PRIOR to applying scatter values on the
axis!
Parameters
----------
estimator : BaseSimpleEstimator
An estimator that implements ``predict``.
axis : matplotlib.Axis
The axis we're plotting on.
nclasses : int
The number of classes present in the data
X_data : np.ndarray, shape=(n_samples, n_features)
        The X data used to fit the estimator, and over which to plot.
        Only the first two features will be used for plotting.
stepsize : float, optional (default=0.02)
The size of the steps in the values on which to predict.
colors : tuple or iterable, optional
The color map
Returns
-------
    xx : np.ndarray
        The meshgrid of x coordinates
    yy : np.ndarray
        The meshgrid of y coordinates
    axis : matplotlib.Axis
        The axis, now with the color mesh added
"""
x_min, x_max = X_data[:, 0].min() - 1, X_data[:, 0].max() + 1
y_min, y_max = X_data[:, 1].min() - 1, X_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, stepsize),
np.arange(y_min, y_max, stepsize))
Z = estimator.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
axis.pcolormesh(xx, yy, Z, cmap=ListedColormap(list(colors[:nclasses])))
return xx, yy, axis
def plot_learning_curve(model, X, y, n_folds, metric, train_sizes,
seed=None, trace=False, y_lim=None, **kwargs):
"""Fit and plot a CV learning curve.
Fits the model with ``n_folds`` of cross-validation over various
training sizes and computes arrays of scores for the train samples
and the validation fold samples, then plots them.
Parameters
----------
model : BaseSimpleEstimator
The model class that should be fit.
X : array-like, shape=(n_samples, n_features)
The training matrix.
y : array-like, shape=(n_samples,)
The training labels/ground-truth.
metric : callable
The scoring metric
train_sizes : iterable
The size of the training set for each fold.
    n_folds : int
        The number of CV folds
seed : int or None, optional (default=None)
The random seed for cross validation.
trace : bool, optional (default=False)
Whether to print to stdout after each set of folds is fit
for a given train size.
y_lim : iterable or None, optional (default=None)
The y-axis limits
**kwargs : keyword args or dict
The keyword args to pass to the estimator.
Returns
-------
    plt : module
        The ``matplotlib.pyplot`` module itself, so the caller can
        show or further amend the plot
References
----------
.. [1] Based on the scikit-learn example:
http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html
"""
# delegate the model fits to the function in .validation
train_scores, val_scores = learning_curve(
model, X, y, train_sizes=train_sizes,
metric=metric, seed=seed, trace=trace,
n_folds=n_folds, **kwargs)
# compute the means/stds of each scores list
train_scores_mean = np.mean(train_scores, axis=1)
val_scores_mean = np.mean(val_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
val_scores_std = np.std(val_scores, axis=1)
# plot the learning curves
plt.figure()
plt.title("Learning curve (model=%s, train sizes=%s)"
% (model.__name__, str(train_sizes)))
plt.xlabel("Training sizes")
plt.ylabel("Score (%s)" % metric.__name__)
plt.grid()
# define the y-axis limit if necessary
if y_lim is not None:
plt.ylim(y_lim)
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color="r")
plt.fill_between(train_sizes, val_scores_mean - val_scores_std,
val_scores_mean + val_scores_std, alpha=0.1,
color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
label="Training score")
plt.plot(train_sizes, val_scores_mean, 'o-', color="g",
label="Validation score")
plt.legend(loc="best")
return plt
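Putting the learning-curve plot together with an estimator from this commit, a typical call might look like the sketch below (the dataset and hyperparameters are illustrative; the estimator and metric are the same ones exercised in the tests). Since the function returns pyplot itself, ``.show()`` can be chained directly; and per the warning above, call add_decision_boundary_to_axis before scattering points on the same axis.

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from packtml.regression import SimpleLogisticRegression
from packtml.utils.plotting import plot_learning_curve

bc = load_breast_cancer()
X, y = bc.data, bc.target

# fit 3 shuffle-split folds at each of three training sizes, then plot
plot_learning_curve(
    SimpleLogisticRegression, X, y, n_folds=3,
    metric=accuracy_score, train_sizes=(100, 250, 400),
    seed=42, n_steps=30, loglik_interval=30).show()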

View File

@ -0,0 +1,3 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import

View File

@ -0,0 +1,23 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from sklearn.datasets import load_iris
from packtml.utils import linalg
from numpy.testing import assert_array_almost_equal
import numpy as np
iris = load_iris()
X, y = iris.data, iris.target
def test_l2_norm():
means = np.average(X, axis=0)
X_centered = X - means
norms = linalg.l2_norm(X_centered, axis=0)
assert_array_almost_equal(
norms,
np.array([ 10.10783524, 5.29269308,
21.53749599, 9.31556404]))

View File

@ -0,0 +1,37 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from packtml.utils import validation as val
from packtml.regression import SimpleLogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer
bc = load_breast_cancer()
X, y = bc.data, bc.target
def test_is_iterable():
assert val.is_iterable([1, 2, 3])
assert val.is_iterable((1, 2, 3))
assert val.is_iterable({1, 2, 3})
assert val.is_iterable({1: 'a', 2: 'b'})
assert not val.is_iterable(123)
assert not val.is_iterable(None)
assert not val.is_iterable("a string")
def test_learning_curves():
train_scores, val_scores = \
val.learning_curve(
SimpleLogisticRegression, X, y,
metric=accuracy_score,
train_sizes=(100, 250, 400),
n_folds=3, seed=42, trace=True,
# kwargs:
n_steps=20, loglik_interval=20)
assert train_scores.shape == (3, 3)
assert val_scores.shape == (3, 3)

View File

@ -0,0 +1,169 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from sklearn.externals import six
from sklearn.utils.validation import check_random_state
from sklearn.model_selection import ShuffleSplit
import numpy as np
__all__ = [
'assert_is_binary',
'is_iterable',
'learning_curve'
]
def assert_is_binary(y):
"""Validate that a vector is binary.
Checks that a vector is binary. This utility is used by all of
the simple classifier estimators to validate the input target.
Parameters
----------
y : np.ndarray, shape=(n_samples,)
The target vector
"""
# validate that y is in (0, 1)
unique_y = np.unique(y) # type: np.ndarray
if unique_y.shape[0] != 2 or [0, 1] != unique_y.tolist():
raise ValueError("y must be binary, but got unique values of %s"
% str(unique_y))
def is_iterable(x):
"""Determine whether an item is iterable.
    In Python 3, strings implement ``__iter__``, so a naive
    ``hasattr(x, '__iter__')`` check would treat them as iterables.
    This function determines whether an object is a *non-string*
    iterable, given the presence of the ``__iter__`` method and that
    the object is not a string.
Parameters
----------
x : int, object, str, iterable, None
The object in question. Could feasibly be any type.
"""
if isinstance(x, six.string_types):
return False
return hasattr(x, "__iter__")
def learning_curve(model, X, y, metric, train_sizes, n_folds=3,
seed=None, trace=False, **kwargs):
"""Fit a CV learning curve.
Fits the model with ``n_folds`` of cross-validation over various
training sizes and returns arrays of scores for the train samples
and the validation fold samples.
Parameters
----------
model : BaseSimpleEstimator
The model class that should be fit.
X : array-like, shape=(n_samples, n_features)
The training matrix.
y : array-like, shape=(n_samples,)
The training labels/ground-truth.
metric : callable
The scoring metric
train_sizes : iterable
The size of the training set for each fold.
n_folds : int, optional (default=3)
The number of CV folds
seed : int or None, optional (default=None)
The random seed for cross validation.
trace : bool, optional (default=False)
Whether to print to stdout after each set of folds is fit
for a given train size.
**kwargs : keyword args or dict
The keyword args to pass to the estimator.
Returns
-------
train_scores : np.ndarray, shape=(n_trials, n_folds)
The scores for the train samples. Each row represents a
trial (new train size), and each column corresponds to the
fold of the trial, i.e., for ``n_folds=3``, there will be
3 columns.
val_scores : np.ndarray, shape=(n_trials, n_folds)
The scores for the validation folds. Each row represents a
trial (new train size), and each column corresponds to the
fold of the trial, i.e., for ``n_folds=3``, there will be
3 columns.
"""
# Each of these lists will be a 2d array. A row will represent a
# trial for a particular train size, and each column will
# correspond with a fold.
train_scores = []
val_scores = []
# The number of samples in the dataset
n_samples = X.shape[0]
# If the input is a pandas frame, make it a numpy array for indexing
if hasattr(X, "iloc"):
X = X.values
# We need to validate that all of the sizes within the train_sizes
# are less than the number of samples in the dataset!
assert all(s < n_samples for s in train_sizes), \
"All train sizes (%s) must be less than n_samples (%i)" \
% (str(train_sizes), n_samples)
# For each training size, we're going to initialize a new KFold
# cross validation instance and fit the K folds...
for train_size in train_sizes:
cv = ShuffleSplit(n_splits=n_folds,
train_size=train_size,
test_size=n_samples - train_size,
random_state=seed)
# This is the inner list (row) that will represent the
# scores for this train size
inner_train_scores = []
inner_val_scores = []
# get our splits
for train_indices, test_indices in cv.split(X, y):
# get the training samples
train_X = X[train_indices, :]
train_y = y.take(train_indices)
# fit the model
m = model(train_X, train_y, **kwargs)
# score the model on the train set
inner_train_scores.append(
metric(train_y, m.predict(train_X)))
# score the model on the validation set
inner_val_scores.append(
metric(y.take(test_indices),
m.predict(X[test_indices, :])))
# Now attach the inner lists to the outer lists
train_scores.append(inner_train_scores)
val_scores.append(inner_val_scores)
if trace:
print("Completed fitting %i folds for train size=%i"
% (n_folds, train_size))
# Make our train/val arrays into numpy arrays
train_scores = np.asarray(train_scores)
val_scores = np.asarray(val_scores)
return train_scores, val_scores
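The return shapes follow the docstring: one row per train size, one column per fold. A sketch mirroring the package's own test of this function:

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from packtml.regression import SimpleLogisticRegression
from packtml.utils.validation import learning_curve

bc = load_breast_cancer()
X, y = bc.data, bc.target

train_scores, val_scores = learning_curve(
    SimpleLogisticRegression, X, y, metric=accuracy_score,
    train_sizes=(100, 250, 400), n_folds=3, seed=42,
    n_steps=20, loglik_interval=20)  # kwargs forwarded to the estimator

print(train_scores.shape, val_scores.shape)  # (3, 3) (3, 3)
print(val_scores.mean(axis=1))  # mean validation accuracy per train size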

5
requirements.txt 100644
View File

@ -0,0 +1,5 @@
numpy>=1.11
scipy>=0.19
scikit-learn>=0.18
pandas
matplotlib

54
setup.py 100644
View File

@ -0,0 +1,54 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
import sys
import setuptools
with open("packtml/VERSION", 'r') as vsn:
VERSION = vsn.read().strip()
# Permitted args: "install" only, basically.
UNSUPPORTED_COMMANDS = { # this is a set literal, not a dict
'develop', 'release', 'bdist_egg', 'bdist_rpm',
'bdist_wininst', 'install_egg_info', 'build_sphinx',
'egg_info', 'easy_install', 'upload', 'bdist_wheel',
'--single-version-externally-managed', 'test', 'build_ext'
}
intersect = UNSUPPORTED_COMMANDS.intersection(set(sys.argv))
if intersect:
msg = "The following arguments are unsupported: %s. " \
"To install, please use `python setup.py install`." \
% str(list(intersect))
# if "test" is in the arguments, make sure the user knows how to test.
if "test" in intersect:
msg += " To test, make sure pytest is installed, and after " \
"installation run `pytest packtml`"
raise ValueError(msg)
# get requirements
with open("requirements.txt") as req:
REQUIREMENTS = req.read().strip().split("\n")
py_version_tag = '-%s.%s' % sys.version_info[:2]
setuptools.setup(name="packtml",
description="Hands-on Supervised Learning - teach a machine "
"to think for itself!",
author="Taylor G Smith",
author_email="taylor.smith@alkaline-ml.com",
                 packages=['packtml',
                           'packtml.clustering',
                           'packtml.decision_tree',
                           'packtml.metrics',
                           'packtml.neural_net',
                           'packtml.recommendation',
                           'packtml.regression',
                           'packtml.utils'],
zip_safe=False,
include_package_data=True,
install_requires=REQUIREMENTS,
package_data={"packtml": ["*"]},
version=VERSION)