# nyaggle

![GitHub Actions CI Status](https://github.com/nyanp/nyaggle/workflows/Python%20package/badge.svg)
![GitHub Actions CI Status](https://github.com/nyanp/nyaggle/workflows/weekly_test/badge.svg)
![Python Versions](https://img.shields.io/pypi/pyversions/nyaggle.svg?logo=python&logoColor=white)
![Documentation Status](https://readthedocs.org/projects/nyaggle/badge/?version=latest)

[**Documentation**](https://nyaggle.readthedocs.io/en/latest/index.html) | [**Slide (Japanese)**](https://docs.google.com/presentation/d/1jv3J7DISw8phZT4z9rqjM-azdrQ4L4wWJN5P-gKL6fA/edit?usp=sharing)
**nyaggle** is a utility library for Kaggle and offline competitions.
It is particularly focused on experiment tracking, feature engineering, and validation.
- **nyaggle.ensemble** - Averaging & stacking
- **nyaggle.experiment** - Experiment tracking
- **nyaggle.feature_store** - Lightweight feature storage using feather-format
- **nyaggle.features** - sklearn-compatible features
- **nyaggle.hyper_parameters** - Collection of GBDT hyper-parameters used in past Kaggle competitions
- **nyaggle.validation** - Adversarial validation & sklearn-compatible CV splitters
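
As a quick illustration of the ensemble side, averaging simply blends predictions from several models. The sketch below shows the idea in plain NumPy; it is an illustration only, not the `nyaggle.ensemble` API itself, and the predictions are made up:

```python
import numpy as np

# Test-set predictions from three hypothetical models.
preds = [
    np.array([0.2, 0.8, 0.5]),
    np.array([0.3, 0.7, 0.6]),
    np.array([0.1, 0.9, 0.4]),
]

# Unweighted average -- the simplest blending ensemble.
blend = np.mean(preds, axis=0)

# Weighted variant; in practice the weights would be tuned
# against out-of-fold predictions.
weights = [0.5, 0.3, 0.2]
blend_w = np.average(preds, axis=0, weights=weights)
```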

## Installation

You can install nyaggle via pip:

```bash
pip install nyaggle
```

## Examples

### Experiment Tracking

`run_experiment()` is a high-level API for experiments with cross validation.
It outputs parameters, metrics, out-of-fold predictions, test predictions,
feature importance, and submission.csv under the specified directory.

To enable mlflow tracking, include the optional `with_mlflow=True` parameter.

```python
from sklearn.model_selection import train_test_split

from nyaggle.experiment import run_experiment
from nyaggle.testing import make_classification_df

X, y = make_classification_df()
X_train, X_test, y_train, y_test = train_test_split(X, y)

params = {
    'n_estimators': 1000,
    'max_depth': 8
}

result = run_experiment(params,
                        X_train,
                        y_train,
                        X_test)

# You can get outputs that are needed in data science competitions with a single API call
print(result.test_prediction)  # Test prediction in numpy array
print(result.oof_prediction)   # Out-of-fold prediction in numpy array
print(result.models)           # Trained models for each fold
print(result.importance)       # Feature importance for each fold
print(result.metrics)          # Evaluation metrics for each fold
print(result.time)             # Elapsed time
print(result.submission_df)    # The output dataframe saved as submission.csv

# ...and all outputs have been saved under the logging directory (default: output/yyyymmdd_HHMMSS).

# You can use it with mlflow and track your experiments through mlflow-ui
result = run_experiment(params,
                        X_train,
                        y_train,
                        X_test,
                        with_mlflow=True)
```

nyaggle also has a low-level API with an interface similar to
[mlflow tracking](https://www.mlflow.org/docs/latest/tracking.html) and [wandb](https://www.wandb.com/).

```python
from nyaggle.experiment import Experiment

with Experiment(logging_directory='./output/') as exp:
    # log key-value pair as a parameter
    exp.log_param('lr', 0.01)
    exp.log_param('optimizer', 'adam')

    # log text
    exp.log('blah blah blah')

    # log metric
    exp.log_metric('CV', 0.85)

    # log numpy ndarray, pandas dataframe and any artifacts
    exp.log_numpy('predicted', predicted)
    exp.log_dataframe('submission', sub, file_format='csv')
    exp.log_artifact('path-to-your-file')
```

### Feature Engineering

#### Target Encoding with K-Fold

```python
import pandas as pd
import numpy as np

from sklearn.model_selection import KFold
from nyaggle.feature.category_encoder import TargetEncoder


train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
all = pd.concat([train, test]).copy()

cat_cols = [c for c in train.columns if train[c].dtype == object]
target_col = 'y'

kf = KFold(5)

# Target encoding with K-fold
te = TargetEncoder(kf.split(train))

# use fit/fit_transform on the train data, then transform the test data
train.loc[:, cat_cols] = te.fit_transform(train[cat_cols], train[target_col])
test.loc[:, cat_cols] = te.transform(test[cat_cols])

# ...or just call fit_transform on the concatenated data
all.loc[:, cat_cols] = te.fit_transform(all[cat_cols], all[target_col])
```
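
For intuition, a target encoder replaces each category with a statistic of the target (typically its mean) within that category; the K-fold splitting above exists so that each row is encoded using folds it does not belong to, avoiding target leakage. A minimal pandas sketch of the underlying statistic, without the fold machinery and with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['tokyo', 'osaka', 'tokyo', 'osaka', 'tokyo'],
    'y':    [1, 0, 1, 1, 0],
})

# Mean of the target per category -- the value a target encoder assigns.
means = df.groupby('city')['y'].mean()
encoded = df['city'].map(means)
# tokyo -> 2/3, osaka -> 1/2
```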

#### Text Vectorization using BERT

You need to install PyTorch in your environment to use BertSentenceVectorizer.
MeCab and mecab-python3 are also required if you use the Japanese BERT model.

```python
import pandas as pd
from nyaggle.feature.nlp import BertSentenceVectorizer


train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
all = pd.concat([train, test]).copy()

text_cols = ['body']
target_col = 'y'
group_col = 'user_id'


# extract BERT-based sentence vector
bv = BertSentenceVectorizer(text_columns=text_cols)

text_vector = bv.fit_transform(train)


# BERT + SVD, with cuda
bv = BertSentenceVectorizer(text_columns=text_cols, use_cuda=True, n_components=40)

text_vector_svd = bv.fit_transform(train)

# Japanese BERT
bv = BertSentenceVectorizer(text_columns=text_cols, lang='jp')

japanese_text_vector = bv.fit_transform(train)
```
|
2020-02-13 19:16:09 +08:00
|
|
|
|
|
|
|
|
|
|
|
### Adversarial Validation
|
|
|
|
|
|
|
|

```python
import pandas as pd
from nyaggle.validation import adversarial_validate

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

auc, importance = adversarial_validate(train, test, importance_type='gain')
```

### Validation Splitters

nyaggle provides a set of validation splitters that are compatible with sklearn.

```python
import pandas as pd
from sklearn.model_selection import cross_validate, KFold
from nyaggle.validation import TimeSeriesSplit, Take, Skip, Nth

train = pd.read_csv('train.csv', parse_dates=['dt'])

# time-series split
ts = TimeSeriesSplit(train['dt'])
ts.add_fold(train_interval=('2019-01-01', '2019-01-10'), test_interval=('2019-01-10', '2019-01-20'))
ts.add_fold(train_interval=('2019-01-06', '2019-01-15'), test_interval=('2019-01-15', '2019-01-25'))

cross_validate(..., cv=ts)

# take the first 3 folds out of 10
cross_validate(..., cv=Take(3, KFold(10)))

# skip the first 3 folds, and evaluate the remaining 7 folds
cross_validate(..., cv=Skip(3, KFold(10)))

# evaluate the 1st fold
cross_validate(..., cv=Nth(1, ts))
```

### Other Awesome Repositories

Here is a list of awesome repositories that provide general utility functions for data science competitions.
Please let me know if you have another one :)

- [jeongyoonlee/Kaggler](https://github.com/jeongyoonlee/Kaggler)
- [mxbi/mlcrate](https://github.com/mxbi/mlcrate)
- [analokmaus/kuma_utils](https://github.com/analokmaus/kuma_utils)
- [Far0n/kaggletils](https://github.com/Far0n/kaggletils)
- [MLWave/Kaggle-Ensemble-Guide](https://github.com/MLWave/Kaggle-Ensemble-Guide)
- [rushter/heamy](https://github.com/rushter/heamy)