add tutorial docs

pull/32/head
Taiga Noumi 2020-02-10 20:54:50 +09:00
parent 694f51a028
commit b17324ca77
3 changed files with 152 additions and 0 deletions

Binary file not shown.


@@ -11,6 +11,7 @@ Welcome to nyaggle's documentation!
   :caption: Contents:

   installation
   source/tutorial/experiment
   source/nyaggle


@@ -0,0 +1,151 @@
Concept of nyaggle.experiment
-----------------------------

In a typical tabular data competition, you probably evaluate your ideas with cross validation repeatedly
and log their parameters and results to keep track of your experiments.
``nyaggle.experiment.run_experiment`` is an API for exactly this situation.
If you use LightGBM as your model, the code is quite simple:

.. code-block:: python

    import pandas as pd
    from nyaggle.experiment import run_experiment

    INPUT_DIR = '../input'
    target_column = 'target'

    X_train = pd.read_csv(f'{INPUT_DIR}/train.csv')
    X_test = pd.read_csv(f'{INPUT_DIR}/test.csv')
    sample_df = pd.read_csv(f'{INPUT_DIR}/sample_submission.csv')  # OPTIONAL

    y = X_train[target_column]
    X_train = X_train.drop(target_column, axis=1)

    lightgbm_params = {
        'max_depth': 8
    }

    result = run_experiment(lightgbm_params, X_train, y, X_test,
                            sample_submission=sample_df)
The ``run_experiment`` API performs cross validation and stores the artifacts in the logging directory.
You will see the output files stored as follows:

::

    output
    └── 20200130123456          # yyyymmddHHMMSS
        ├── params.txt          # Parameters
        ├── metrics.txt         # Metrics (single fold & overall CV score)
        ├── oof_prediction.npy  # Out-of-fold prediction
        ├── test_prediction.npy # Test prediction
        ├── 20200130123456.csv  # Submission csv file
        ├── importances.png     # Feature importance plot
        ├── log.txt             # Log file
        └── models              # Trained models for each fold
            ├── fold1
            ├── fold2
            ├── fold3
            ├── fold4
            └── fold5
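The same artifacts are also returned in memory, so you can keep working with them in the same session.
A minimal sketch, assuming the return value exposes the out-of-fold and test predictions under names
matching the files above (check the API reference for the exact attributes):

.. code-block:: python

    import numpy as np

    # assumed attributes, mirroring oof_prediction.npy / test_prediction.npy
    oof = result.oof_prediction
    test_pred = result.test_prediction

    # the saved arrays can also be reloaded in a later session
    oof_from_disk = np.load('output/20200130123456/oof_prediction.npy')
    assert oof.shape == oof_from_disk.shape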
.. hint::
    The default validation strategy is 5-fold CV. You can change this behavior by passing the ``cv`` parameter,
    as in the sketch below (see the API reference for details).
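For example, assuming ``cv`` accepts a scikit-learn splitter object as well as a fold count,
a stratified 10-fold run might look like this:

.. code-block:: python

    from sklearn.model_selection import StratifiedKFold

    # 10-fold stratified CV instead of the default 5-fold split
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    result = run_experiment(lightgbm_params, X_train, y, X_test,
                            sample_submission=sample_df, cv=skf)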
If you want to use XGBoost, CatBoost or another scikit-learn estimator, specify the type of the algorithm:

.. code-block:: python

    # CatBoost
    catboost_params = {
        'eval_metric': 'Logloss',
        'loss_function': 'Logloss',
        'depth': 8,
        'task_type': 'GPU'
    }
    result = run_experiment(catboost_params, X_train, y, X_test,
                            sample_submission=sample_df, algorithm_type='cat')

    # XGBoost
    xgboost_params = {
        'objective': 'reg:linear',
        'max_depth': 8
    }
    result = run_experiment(xgboost_params, X_train, y, X_test,
                            sample_submission=sample_df, algorithm_type='xgb')

    # scikit-learn estimator
    from sklearn.linear_model import Ridge
    ridge_params = {
        'alpha': 1.0
    }
    result = run_experiment(ridge_params, X_train, y, X_test,
                            sample_submission=sample_df, algorithm_type=Ridge)
.. hint::
    The parameters are passed to the constructor of the corresponding sklearn-API estimator (e.g. ``LGBMClassifier``).
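In other words, the dictionary is used roughly as the keyword arguments of the underlying estimator.
The sketch below only illustrates that correspondence; it is not what ``run_experiment`` literally executes:

.. code-block:: python

    from lightgbm import LGBMClassifier

    # per fold, run_experiment(lightgbm_params, ...) trains a model comparable to:
    model = LGBMClassifier(**lightgbm_params)  # i.e. LGBMClassifier(max_depth=8)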
Collaborating with mlflow
-------------------------

If you want a GUI dashboard to manage your experiments, you can use ``run_experiment`` with mlflow by simply
setting ``with_mlflow=True`` (you need to install mlflow beforehand).

.. code-block:: python

    result = run_experiment(params, X_train, y, X_test,
                            sample_submission=sample_df, with_mlflow=True)
In the same directory as the executed script, run

.. code-block:: bash

    mlflow ui

and view it at http://localhost:5000 . On this page, you can see the list of your experiments along with their
CV scores and parameters.

.. image:: ../../image/mlflow.png
If you want to customize the logging behavior, you can call ``run_experiment`` inside an mlflow run context.
If there is an active run, ``run_experiment`` uses the currently active run instead of creating a new one.

.. code-block:: python

    import mlflow

    mlflow.set_tracking_uri('gs://ok-i-want-to-use-gcs')

    with mlflow.start_run(run_name='your-favorite-run-name'):
        mlflow.log_param('something-you-want-to-log', 42)

        result = run_experiment(params, X_train, y, X_test,
                                sample_submission=sample_df, with_mlflow=True)
What does ``run_experiment`` not do?
------------------------------------

``run_experiment`` can be considered a plain cross-validation API with logging functionality.
Therefore, you still have to choose the model parameters and perform feature engineering yourself.
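A typical workflow is therefore to build your features first and only then hand the prepared frames to
``run_experiment``. A minimal sketch with hypothetical columns (``price`` and ``area`` are illustrative only,
not part of nyaggle or the example data):

.. code-block:: python

    # hypothetical feature engineering step; run_experiment does not do this for you
    for df in (X_train, X_test):
        df['price_per_area'] = df['price'] / df['area']

    result = run_experiment(lightgbm_params, X_train, y, X_test,
                            sample_submission=sample_df)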