add tutorial docs

pull/32/head
Taiga Noumi 2020-02-10 20:54:50 +09:00
parent 694f51a028
commit b17324ca77
3 changed files with 152 additions and 0 deletions

Binary file not shown.


@@ -11,6 +11,7 @@ Welcome to nyaggle's documentation!
   :caption: Contents:

   installation
   source/tutorial/experiment
   source/nyaggle


@@ -0,0 +1,151 @@
Concept of nyaggle.experiment
-----------------------------

In a typical tabular data competition, you probably evaluate your ideas with cross validation repeatedly
and log their parameters and results to keep track of your experiments.
``nyaggle.experiment.run_experiment`` is an API for exactly this situation.
If you use LightGBM as your model, the code is quite simple:

.. code-block:: python

    import pandas as pd
    from nyaggle.experiment import run_experiment

    INPUT_DIR = '../input'
    target_column = 'target'

    X_train = pd.read_csv(f'{INPUT_DIR}/train.csv')
    X_test = pd.read_csv(f'{INPUT_DIR}/test.csv')
    sample_df = pd.read_csv(f'{INPUT_DIR}/sample_submission.csv')  # OPTIONAL

    y = X_train[target_column]
    X_train = X_train.drop(target_column, axis=1)

    lightgbm_params = {
        'max_depth': 8
    }

    result = run_experiment(lightgbm_params, X_train, y, X_test,
                            sample_submission=sample_df)
The ``run_experiment`` API performs cross validation and stores the artifacts in the logging directory.
You will see the output files stored as follows:

::

    output
    └── 20200130123456          # yyyymmddHHMMSS
        ├── params.txt          # Parameters
        ├── metrics.txt         # Metrics (single fold & overall CV score)
        ├── oof_prediction.npy  # Out-of-fold prediction
        ├── test_prediction.npy # Test prediction
        ├── 20200130123456.csv  # Submission csv file
        ├── importances.png     # Feature importance plot
        ├── log.txt             # Log file
        └── models              # Trained models for each fold
            ├── fold1
            ├── fold2
            ├── fold3
            ├── fold4
            └── fold5
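The same artifacts are also returned in memory, so you can keep working with them in the same session.
A minimal sketch, assuming the return value exposes the out-of-fold and test predictions under names
matching the files above (check the API reference for the exact attributes):

.. code-block:: python

    import numpy as np

    # assumed attributes, mirroring oof_prediction.npy / test_prediction.npy
    oof = result.oof_prediction
    test_pred = result.test_prediction

    # the saved arrays can also be reloaded in a later session
    oof_from_disk = np.load('output/20200130123456/oof_prediction.npy')
    assert oof.shape == oof_from_disk.shape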
.. hint::
    The default validation strategy is 5-fold CV. You can change this behavior by passing the ``cv`` parameter,
    as in the sketch below (see the API reference for details).
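For example, assuming ``cv`` accepts a scikit-learn splitter object as well as a fold count,
a stratified 10-fold run might look like this:

.. code-block:: python

    from sklearn.model_selection import StratifiedKFold

    # 10-fold stratified CV instead of the default 5-fold split
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    result = run_experiment(lightgbm_params, X_train, y, X_test,
                            sample_submission=sample_df, cv=skf)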
If you want to use XGBoost, CatBoost or another scikit-learn estimator, specify the type of the algorithm:

.. code-block:: python

    # CatBoost
    catboost_params = {
        'eval_metric': 'Logloss',
        'loss_function': 'Logloss',
        'depth': 8,
        'task_type': 'GPU'
    }
    result = run_experiment(catboost_params, X_train, y, X_test,
                            sample_submission=sample_df, algorithm_type='cat')

    # XGBoost
    xgboost_params = {
        'objective': 'reg:linear',
        'max_depth': 8
    }
    result = run_experiment(xgboost_params, X_train, y, X_test,
                            sample_submission=sample_df, algorithm_type='xgb')

    # scikit-learn estimator
    from sklearn.linear_model import Ridge
    ridge_params = {
        'alpha': 1.0
    }
    result = run_experiment(ridge_params, X_train, y, X_test,
                            sample_submission=sample_df, algorithm_type=Ridge)
.. hint::
    The parameters are passed to the constructor of the corresponding sklearn-API estimator (e.g. ``LGBMClassifier``).
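In other words, the dictionary is used roughly as the keyword arguments of the underlying estimator.
The sketch below only illustrates that correspondence; it is not what ``run_experiment`` literally executes:

.. code-block:: python

    from lightgbm import LGBMClassifier

    # per fold, run_experiment(lightgbm_params, ...) trains a model comparable to:
    model = LGBMClassifier(**lightgbm_params)  # i.e. LGBMClassifier(max_depth=8)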
Collaborating with mlflow
-------------------------

If you want a GUI dashboard to manage your experiments, you can use ``run_experiment`` with mlflow by simply
setting ``with_mlflow=True`` (you need to install mlflow beforehand).

.. code-block:: python

    result = run_experiment(params, X_train, y, X_test,
                            sample_submission=sample_df, with_mlflow=True)
In the same directory as the executed script, run

.. code-block:: bash

    mlflow ui

and view it at http://localhost:5000 . On this page, you can see the list of your experiments along with their
CV scores and parameters.

.. image:: ../../image/mlflow.png
If you want to customize the logging behavior, you can call ``run_experiment`` inside an mlflow run context.
If there is an active run, ``run_experiment`` uses the currently active run instead of creating a new one.

.. code-block:: python

    import mlflow

    mlflow.set_tracking_uri('gs://ok-i-want-to-use-gcs')

    with mlflow.start_run(run_name='your-favorite-run-name'):
        mlflow.log_param('something-you-want-to-log', 42)

        result = run_experiment(params, X_train, y, X_test,
                                sample_submission=sample_df, with_mlflow=True)
What does ``run_experiment`` not do?
------------------------------------

``run_experiment`` can be considered a plain cross-validation API with logging functionality.
Therefore, you still have to choose the model parameters and perform feature engineering yourself.
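A typical workflow is therefore to build your features first and only then hand the prepared frames to
``run_experiment``. A minimal sketch with hypothetical columns (``price`` and ``area`` are illustrative only,
not part of nyaggle or the example data):

.. code-block:: python

    # hypothetical feature engineering step; run_experiment does not do this for you
    for df in (X_train, X_test):
        df['price_per_area'] = df['price'] / df['area']

    result = run_experiment(lightgbm_params, X_train, y, X_test,
                            sample_submission=sample_df)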