# nyaggle
![GitHub Actions CI Status](https://github.com/nyanp/nyaggle/workflows/Python%20package/badge.svg)
![GitHub Actions CI Status](https://github.com/nyanp/nyaggle/workflows/weekly_test/badge.svg)
![Python Versions](https://img.shields.io/pypi/pyversions/nyaggle.svg?logo=python&logoColor=white)
![Documentation Status](https://readthedocs.org/projects/nyaggle/badge/?version=latest)

[**Documentation**](https://nyaggle.readthedocs.io/en/latest/index.html)
| [**Slide (Japanese)**](https://docs.google.com/presentation/d/1jv3J7DISw8phZT4z9rqjM-azdrQ4L4wWJN5P-gKL6fA/edit?usp=sharing)
**nyaggle** is a utility library for Kaggle and offline competitions.
It is particularly focused on experiment tracking, feature engineering, and validation.

- **nyaggle.ensemble** - Averaging & stacking (see the sketch below)
- **nyaggle.experiment** - Experiment tracking
- **nyaggle.feature_store** - Lightweight feature storage using feather-format (see the sketch below)
- **nyaggle.features** - sklearn-compatible features
- **nyaggle.hyper_parameters** - Collection of GBDT hyper-parameters used in past Kaggle competitions
- **nyaggle.validation** - Adversarial validation & sklearn-compatible CV splitters
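
Most of these modules are demonstrated in the examples below; `nyaggle.feature_store` and `nyaggle.ensemble` are not, so here is a minimal sketch of both. The calls (`save_feature`, `load_features`, `averaging`) follow the documented API, but treat the exact signatures as assumptions and verify them against the documentation for your version.

```python
import numpy as np
import pandas as pd

from nyaggle import feature_store
from nyaggle.ensemble import averaging

# Save a dataframe as a feather-format feature under ./features/ (assumed signature)
feature_store.save_feature(pd.DataFrame({'f1': [1.0, 2.0, 3.0]}), 'f1', directory='./features/')

# Load stored features concatenated to a base dataframe (assumed signature)
base = pd.DataFrame({'id': [0, 1, 2]})
train = feature_store.load_features(base, ['f1'], directory='./features/')

# Simple averaging of test predictions from multiple models (assumed signature)
result = averaging([np.array([0.1, 0.9]), np.array([0.3, 0.7])])
print(result.test_prediction)
```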
## Installation
You can install nyaggle via pip:
```bash
pip install nyaggle
```
## Examples
### Experiment Tracking
`run_experiment()` is a high-level API for experiments with cross validation.
It outputs parameters, metrics, out-of-fold predictions, test predictions,
feature importance, and `submission.csv` under the specified directory.

To enable mlflow tracking, pass the optional `with_mlflow=True` parameter.
```python
from sklearn.model_selection import train_test_split

from nyaggle.experiment import run_experiment
from nyaggle.testing import make_classification_df

X, y = make_classification_df()
X_train, X_test, y_train, y_test = train_test_split(X, y)

params = {
    'n_estimators': 1000,
    'max_depth': 8
}

result = run_experiment(params,
                        X_train,
                        y_train,
                        X_test)

# You can get all the outputs needed in data science competitions with one API call
print(result.test_prediction)  # Test prediction as a numpy array
print(result.oof_prediction)   # Out-of-fold prediction as a numpy array
print(result.models)           # Trained models for each fold
print(result.importance)       # Feature importance for each fold
print(result.metrics)          # Evaluation metrics for each fold
print(result.time)             # Elapsed time
print(result.submission_df)    # The output dataframe saved as submission.csv

# ...and all outputs have been saved under the logging directory (default: output/yyyymmdd_HHMMSS).

# You can use it with mlflow and track your experiments through the mlflow UI
result = run_experiment(params,
                        X_train,
                        y_train,
                        X_test,
                        with_mlflow=True)
```
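
`run_experiment()` trains LightGBM by default, but the backend and CV strategy can be customized. The following continuation of the example above is a minimal sketch assuming the `algorithm_type` and `cv` parameters described in the documentation; verify them against your nyaggle version.

```python
from sklearn.model_selection import KFold

# algorithm_type switches the GBDT backend ('lgbm', 'cat' or 'xgb'),
# and cv accepts any sklearn-style splitter (assumed parameters)
result = run_experiment(params,
                        X_train,
                        y_train,
                        X_test,
                        algorithm_type='xgb',
                        cv=KFold(n_splits=5, shuffle=True, random_state=0))
```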
nyaggle also provides a low-level API with an interface similar to
[mlflow tracking](https://www.mlflow.org/docs/latest/tracking.html) and [wandb](https://www.wandb.com/).
```python
from nyaggle.experiment import Experiment

with Experiment(logging_directory='./output/') as exp:
    # log a key-value pair as a parameter
    exp.log_param('lr', 0.01)
    exp.log_param('optimizer', 'adam')

    # log text
    exp.log('blah blah blah')

    # log a metric
    exp.log_metric('CV', 0.85)

    # log a numpy ndarray, a pandas dataframe, and arbitrary artifacts
    exp.log_numpy('predicted', predicted)
    exp.log_dataframe('submission', sub, file_format='csv')
    exp.log_artifact('path-to-your-file')
```
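
The low-level API can also forward its logs to mlflow. A minimal sketch, assuming `Experiment` accepts the same `with_mlflow` flag as `run_experiment()` (check the documentation for your version):

```python
from nyaggle.experiment import Experiment

# with_mlflow is assumed to mirror run_experiment()'s mlflow integration
with Experiment(logging_directory='./output/', with_mlflow=True) as exp:
    exp.log_metric('CV', 0.85)  # also logged to the active mlflow run
```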
### Feature Engineering
#### Target Encoding with K-Fold
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold

from nyaggle.feature.category_encoder import TargetEncoder

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
all_df = pd.concat([train, test]).copy()

cat_cols = [c for c in train.columns if train[c].dtype == object]
target_col = 'y'

kf = KFold(5)

# Target encoding with K-fold
te = TargetEncoder(kf.split(train))

# use fit/fit_transform on train data, then apply transform to test data
train.loc[:, cat_cols] = te.fit_transform(train[cat_cols], train[target_col])
test.loc[:, cat_cols] = te.transform(test[cat_cols])

# ...or just call fit_transform on the concatenated data
all_df.loc[:, cat_cols] = te.fit_transform(all_df[cat_cols], all_df[target_col])
```
#### Text Vectorization using BERT
You need to install PyTorch in your environment to use `BertSentenceVectorizer`.
MeCab and mecab-python3 are also required if you use the Japanese BERT model.
```python
import pandas as pd

from nyaggle.feature.nlp import BertSentenceVectorizer

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
all_df = pd.concat([train, test]).copy()

text_cols = ['body']
target_col = 'y'
group_col = 'user_id'

# extract BERT-based sentence vectors
bv = BertSentenceVectorizer(text_columns=text_cols)
text_vector = bv.fit_transform(train)

# BERT + SVD, with cuda
bv = BertSentenceVectorizer(text_columns=text_cols, use_cuda=True, n_components=40)
text_vector_svd = bv.fit_transform(train)

# Japanese BERT
bv = BertSentenceVectorizer(text_columns=text_cols, lang='jp')
japanese_text_vector = bv.fit_transform(train)
```
### Adversarial Validation

`adversarial_validate()` trains a classifier to distinguish train rows from test rows; an AUC well above 0.5 indicates a distribution shift between the two datasets.

```python
import pandas as pd
from nyaggle.validation import adversarial_validate
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
auc, importance = adversarial_validate(train, test, importance_type='gain')
```
### Validation Splitters
nyaggle provides a set of validation splitters that are compatible with sklearn.
```python
import pandas as pd
from sklearn.model_selection import cross_validate, KFold

from nyaggle.validation import TimeSeriesSplit, Take, Skip, Nth

train = pd.read_csv('train.csv', parse_dates=['dt'])

# time-series split
ts = TimeSeriesSplit(train['dt'])
ts.add_fold(train_interval=('2019-01-01', '2019-01-10'), test_interval=('2019-01-10', '2019-01-20'))
ts.add_fold(train_interval=('2019-01-06', '2019-01-15'), test_interval=('2019-01-15', '2019-01-25'))
cross_validate(..., cv=ts)

# take the first 3 folds out of 10
cross_validate(..., cv=Take(3, KFold(10)))

# skip the first 3 folds and evaluate the remaining 7 folds
cross_validate(..., cv=Skip(3, KFold(10)))

# evaluate only the 1st fold
cross_validate(..., cv=Nth(1, ts))
```
### Other Awesome Repositories
Here is a list of awesome repositories that provide general utility functions for data science competitions.
Please let me know if you have another one :)
- [jeongyoonlee/Kaggler](https://github.com/jeongyoonlee/Kaggler)
- [mxbi/mlcrate](https://github.com/mxbi/mlcrate)
- [analokmaus/kuma_utils](https://github.com/analokmaus/kuma_utils)
- [Far0n/kaggletils](https://github.com/Far0n/kaggletils)
- [MLWave/Kaggle-Ensemble-Guide](https://github.com/MLWave/Kaggle-Ensemble-Guide)
- [rushter/heamy](https://github.com/rushter/heamy)