# nyaggle
![GitHub Actions CI Status](https://github.com/nyanp/nyaggle/workflows/Python%20package/badge.svg)
![GitHub Actions CI Status](https://github.com/nyanp/nyaggle/workflows/weekly_test/badge.svg)
![Python Versions](https://img.shields.io/pypi/pyversions/nyaggle.svg?logo=python&logoColor=white)
![Documentation Status](https://readthedocs.org/projects/nyaggle/badge/?version=latest)

[**Documentation**](https://nyaggle.readthedocs.io/en/latest/index.html)
| [**Slide (Japanese)**](https://docs.google.com/presentation/d/1jv3J7DISw8phZT4z9rqjM-azdrQ4L4wWJN5P-gKL6fA/edit?usp=sharing)
**nyaggle** is a utility library for Kaggle and offline competitions.
It is particularly focused on experiment tracking, feature engineering, and validation.

- **nyaggle.ensemble** - Averaging & stacking (see the sketch below)
- **nyaggle.experiment** - Experiment tracking
- **nyaggle.feature_store** - Lightweight feature storage using feather-format (see the sketch below)
- **nyaggle.features** - sklearn-compatible features
- **nyaggle.hyper_parameters** - Collection of GBDT hyper-parameters used in past Kaggle competitions
- **nyaggle.validation** - Adversarial validation & sklearn-compatible CV splitters
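
Most of these modules are demonstrated in the examples below; `nyaggle.feature_store` and `nyaggle.ensemble` are not, so here is a minimal sketch of both. The calls (`save_feature`, `load_features`, `averaging`) follow the documented API, but treat the exact signatures as assumptions and verify them against the documentation for your version.

```python
import numpy as np
import pandas as pd

from nyaggle import feature_store
from nyaggle.ensemble import averaging

# Save a dataframe as a feather-format feature under ./features/ (assumed signature)
feature_store.save_feature(pd.DataFrame({'f1': [1.0, 2.0, 3.0]}), 'f1', directory='./features/')

# Load stored features concatenated to a base dataframe (assumed signature)
base = pd.DataFrame({'id': [0, 1, 2]})
train = feature_store.load_features(base, ['f1'], directory='./features/')

# Simple averaging of test predictions from multiple models (assumed signature)
result = averaging([np.array([0.1, 0.9]), np.array([0.3, 0.7])])
print(result.test_prediction)
```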
## Installation
You can install nyaggle via pip:
```bash
pip install nyaggle
```
## Examples
### Experiment Tracking
`run_experiment()` is a high-level API for experiments with cross validation.
It outputs parameters, metrics, out-of-fold predictions, test predictions,
feature importance, and `submission.csv` under the specified directory.

To enable mlflow tracking, pass the optional `with_mlflow=True` parameter.
```python
from sklearn.model_selection import train_test_split

from nyaggle.experiment import run_experiment
from nyaggle.testing import make_classification_df

X, y = make_classification_df()
X_train, X_test, y_train, y_test = train_test_split(X, y)

params = {
    'n_estimators': 1000,
    'max_depth': 8
}

result = run_experiment(params,
                        X_train,
                        y_train,
                        X_test)

# You can get all the outputs needed in data science competitions with one API call
print(result.test_prediction)  # Test prediction as a numpy array
print(result.oof_prediction)   # Out-of-fold prediction as a numpy array
print(result.models)           # Trained models for each fold
print(result.importance)       # Feature importance for each fold
print(result.metrics)          # Evaluation metrics for each fold
print(result.time)             # Elapsed time
print(result.submission_df)    # The output dataframe saved as submission.csv

# ...and all outputs have been saved under the logging directory (default: output/yyyymmdd_HHMMSS).

# You can use it with mlflow and track your experiments through the mlflow UI
result = run_experiment(params,
                        X_train,
                        y_train,
                        X_test,
                        with_mlflow=True)
```
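
`run_experiment()` trains LightGBM by default, but the backend and CV strategy can be customized. The following continuation of the example above is a minimal sketch assuming the `algorithm_type` and `cv` parameters described in the documentation; verify them against your nyaggle version.

```python
from sklearn.model_selection import KFold

# algorithm_type switches the GBDT backend ('lgbm', 'cat' or 'xgb'),
# and cv accepts any sklearn-style splitter (assumed parameters)
result = run_experiment(params,
                        X_train,
                        y_train,
                        X_test,
                        algorithm_type='xgb',
                        cv=KFold(n_splits=5, shuffle=True, random_state=0))
```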
nyaggle also provides a low-level API with an interface similar to
[mlflow tracking](https://www.mlflow.org/docs/latest/tracking.html) and [wandb](https://www.wandb.com/).
```python
from nyaggle.experiment import Experiment

with Experiment(logging_directory='./output/') as exp:
    # log a key-value pair as a parameter
    exp.log_param('lr', 0.01)
    exp.log_param('optimizer', 'adam')

    # log text
    exp.log('blah blah blah')

    # log a metric
    exp.log_metric('CV', 0.85)

    # log a numpy ndarray, a pandas dataframe, and arbitrary artifacts
    exp.log_numpy('predicted', predicted)
    exp.log_dataframe('submission', sub, file_format='csv')
    exp.log_artifact('path-to-your-file')
```
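
The low-level API can also forward its logs to mlflow. A minimal sketch, assuming `Experiment` accepts the same `with_mlflow` flag as `run_experiment()` (check the documentation for your version):

```python
from nyaggle.experiment import Experiment

# with_mlflow is assumed to mirror run_experiment()'s mlflow integration
with Experiment(logging_directory='./output/', with_mlflow=True) as exp:
    exp.log_metric('CV', 0.85)  # also logged to the active mlflow run
```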
### Feature Engineering
#### Target Encoding with K-Fold
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold

from nyaggle.feature.category_encoder import TargetEncoder

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
all_df = pd.concat([train, test]).copy()

cat_cols = [c for c in train.columns if train[c].dtype == object]
target_col = 'y'

kf = KFold(5)

# Target encoding with K-fold
te = TargetEncoder(kf.split(train))

# use fit/fit_transform on train data, then apply transform to test data
train.loc[:, cat_cols] = te.fit_transform(train[cat_cols], train[target_col])
test.loc[:, cat_cols] = te.transform(test[cat_cols])

# ...or just call fit_transform on the concatenated data
all_df.loc[:, cat_cols] = te.fit_transform(all_df[cat_cols], all_df[target_col])
```
#### Text Vectorization using BERT
You need to install PyTorch in your environment to use `BertSentenceVectorizer`.
MeCab and mecab-python3 are also required if you use the Japanese BERT model.
```python
import pandas as pd

from nyaggle.feature.nlp import BertSentenceVectorizer

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
all_df = pd.concat([train, test]).copy()

text_cols = ['body']
target_col = 'y'
group_col = 'user_id'

# extract BERT-based sentence vectors
bv = BertSentenceVectorizer(text_columns=text_cols)
text_vector = bv.fit_transform(train)

# BERT + SVD, with cuda
bv = BertSentenceVectorizer(text_columns=text_cols, use_cuda=True, n_components=40)
text_vector_svd = bv.fit_transform(train)

# Japanese BERT
bv = BertSentenceVectorizer(text_columns=text_cols, lang='jp')
japanese_text_vector = bv.fit_transform(train)
```
### Adversarial Validation

`adversarial_validate()` trains a classifier to distinguish train rows from test rows; an AUC well above 0.5 indicates a distribution shift between the two datasets.

```python
import pandas as pd
from nyaggle.validation import adversarial_validate
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
auc, importance = adversarial_validate(train, test, importance_type='gain')
```
### Validation Splitters
nyaggle provides a set of validation splitters that are compatible with sklearn.
```python
import pandas as pd
from sklearn.model_selection import cross_validate, KFold

from nyaggle.validation import TimeSeriesSplit, Take, Skip, Nth

train = pd.read_csv('train.csv', parse_dates=['dt'])

# time-series split
ts = TimeSeriesSplit(train['dt'])
ts.add_fold(train_interval=('2019-01-01', '2019-01-10'), test_interval=('2019-01-10', '2019-01-20'))
ts.add_fold(train_interval=('2019-01-06', '2019-01-15'), test_interval=('2019-01-15', '2019-01-25'))
cross_validate(..., cv=ts)

# take the first 3 folds out of 10
cross_validate(..., cv=Take(3, KFold(10)))

# skip the first 3 folds and evaluate the remaining 7 folds
cross_validate(..., cv=Skip(3, KFold(10)))

# evaluate only the 1st fold
cross_validate(..., cv=Nth(1, ts))
```
### Other Awesome Repositories
Here is a list of awesome repositories that provide general utility functions for data science competitions.
Please let me know if you have another one :)
- [jeongyoonlee/Kaggler](https://github.com/jeongyoonlee/Kaggler)
- [mxbi/mlcrate](https://github.com/mxbi/mlcrate)
- [analokmaus/kuma_utils](https://github.com/analokmaus/kuma_utils)
- [Far0n/kaggletils](https://github.com/Far0n/kaggletils)
- [MLWave/Kaggle-Ensemble-Guide](https://github.com/MLWave/Kaggle-Ensemble-Guide)
- [rushter/heamy](https://github.com/rushter/heamy)