nyaggle


nyaggle is a utility library for Kaggle and offline competitions, particularly focused on feature engineering, model validation, and experiment tracking. See the documentation for details.

  • Feature Engineering
    • K-Fold Target Encoding
    • BERT Sentence Vectorization
  • Model Validation
    • CV with OOF
    • Adversarial Validation
    • sklearn-compatible time series splitter
  • Experiment
    • Experiment logging
    • High-level API for logging gradient boosting experiment
  • Ensemble
    • Blending

Installation

You can install nyaggle via pip:

$ pip install nyaggle

Examples

Experiment Logging

experiment_gbdt() is a high-level API for cross validation with a gradient boosting algorithm. It outputs parameters, metrics, out-of-fold predictions, test predictions, feature importance, and submission.csv under the specified directory.

It can be combined with mlflow tracking.
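The artifact layout it produces can be illustrated with a plain-Python logger (a minimal stdlib sketch of the idea only; the helper and file names here are assumptions, not nyaggle's actual implementation):

```python
import json
from pathlib import Path

def log_experiment(directory, params, metrics, oof, test_pred):
    """Persist one experiment's artifacts under a directory,
    similar in spirit to what experiment_gbdt() writes out."""
    d = Path(directory)
    d.mkdir(parents=True, exist_ok=True)
    (d / "params.json").write_text(json.dumps(params, indent=2))
    (d / "metrics.json").write_text(json.dumps(metrics, indent=2))
    (d / "oof_prediction.json").write_text(json.dumps(list(oof)))
    (d / "test_prediction.json").write_text(json.dumps(list(test_pred)))

log_experiment("my_experiment",
               params={"max_depth": 8, "learning_rate": 0.1},
               metrics={"auc": 0.91},
               oof=[0.2, 0.8, 0.5],
               test_pred=[0.4, 0.6])
```

Keeping every run in its own directory like this is what makes later comparison (or syncing to an mlflow tracking server) straightforward.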

Feature Engineering

Target Encoding with K-Fold
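The technique itself can be sketched from scratch with pandas and scikit-learn (an illustration of K-fold target encoding, not nyaggle's implementation; the helper name is invented):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(train_col, target, n_splits=5, seed=0):
    """Encode each row with the target mean computed on the other folds,
    so a row never leaks its own target value into its feature."""
    encoded = pd.Series(np.nan, index=train_col.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(train_col):
        # per-category target means estimated on the fitting folds only
        means = target.iloc[fit_idx].groupby(train_col.iloc[fit_idx]).mean()
        encoded.iloc[enc_idx] = train_col.iloc[enc_idx].map(means).to_numpy()
    # categories unseen in a fitting fold fall back to the global mean
    return encoded.fillna(target.mean())

df = pd.DataFrame({'city': ['a', 'a', 'b', 'b', 'a', 'b'] * 5,
                   'y':    [1, 0, 1, 1, 0, 1] * 5})
df['city_te'] = kfold_target_encode(df['city'], df['y'])
```

The out-of-fold averaging is the point: encoding with the full-data target mean would leak the label into the feature.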

Text Vectorization using BERT

You need to install PyTorch in your virtual environment to use BertSentenceVectorizer. MeCab and mecab-python3 are also required if you use a Japanese BERT model.
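Usage looks roughly like the following (a hedged sketch; the import path and parameter names are assumptions based on the project's examples, so check the documentation for the exact signature):

```python
import pandas as pd
from nyaggle.feature.nlp import BertSentenceVectorizer

df = pd.DataFrame({'title': ['nyaggle is a utility library',
                             'BERT encodes whole sentences']})

# Turn each text column into a fixed-length BERT sentence embedding
bv = BertSentenceVectorizer(text_columns=['title'])
text_vector = bv.fit_transform(df)
```

Note that the first call downloads the pretrained BERT weights, so expect it to be slow and to require network access.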

Model Validation

cross_validate() is a handy API that computes K-fold CV scores, out-of-fold predictions, and test predictions in one call. You can pass LGBMClassifier/LGBMRegressor or any other sklearn-compatible model.
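The OOF/test-prediction loop that such an API wraps can be sketched with plain scikit-learn (an illustration of the pattern, not nyaggle's code; a linear model stands in for LightGBM to keep it self-contained):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test = X[:150], X[150:]
y_train = y[:150]

oof = np.zeros(len(X_train))       # out-of-fold predictions for the train set
test_pred = np.zeros(len(X_test))  # test predictions averaged over folds

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for trn_idx, val_idx in skf.split(X_train, y_train):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[trn_idx], y_train[trn_idx])
    # each train row is predicted by the one model that never saw it
    oof[val_idx] = model.predict_proba(X_train[val_idx])[:, 1]
    test_pred += model.predict_proba(X_test)[:, 1] / skf.n_splits

print('CV AUC:', roc_auc_score(y_train, oof))
```

The OOF array doubles as an unbiased CV score and as a ready-made input for stacking or blending.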