update docs

2020-02-23 22:57:26 +09:00 · 2020-02-23 22:57:26 +09:00 · e03e6bbbae
parent 83e319d8c0
commit e03e6bbbae
9 changed files with 85 additions and 20 deletions
--- a/README.md
+++ b/README.md
@@ -2,11 +2,17 @@
 ![GitHub Actions CI Status](https://github.com/nyanp/nyaggle/workflows/Python%20package/badge.svg)
 ![Python Versions](https://img.shields.io/pypi/pyversions/nyaggle.svg?logo=python&logoColor=white)

-**nyaggle** is a utility library for Kaggle and offline competitions, 
-particularly focused on experiment logging, feature engineering and validation. 
+[**Documentation**](https://nyaggle.readthedocs.io/en/latest/index.html)
+| [**Slide (Japanese)**](https://docs.google.com/presentation/d/1jv3J7DISw8phZT4z9rqjM-azdrQ4L4wWJN5P-gKL6fA/edit?usp=sharing)

- [documentation](https://nyaggle.readthedocs.io/en/latest/index.html)
- [slide (Japanese)](https://docs.google.com/presentation/d/1jv3J7DISw8phZT4z9rqjM-azdrQ4L4wWJN5P-gKL6fA/edit?usp=sharing)
+**nyaggle** is a utility library for Kaggle and offline competitions, 
+particularly focused on experiment tracking, feature engineering and validation.
+
+- **nyaggle.experiment** - Experiment tracking
+- **nyaggle.feature_store** - Lightweight feature storage using feather-format
+- **nyaggle.feature** - sklearn-compatible features
+- **nyaggle.hyper_parameters** - Collection of GBDT hyper-parameters used in past Kaggle competitions
+- **nyaggle.validation** - Adversarial validation & sklearn-compatible CV splitters

 ## Installation
 You can install nyaggle via pip:
@@ -16,7 +22,7 @@ $pip install nyaggle

 ## Examples

-### Experiment Logging
+### Experiment Tracking
 `run_experiment()` is a high-level API for running experiments with cross validation.
 It outputs parameters, metrics, out of fold predictions, test predictions,
 feature importance and submission.csv under the specified directory.
@@ -63,6 +69,28 @@ result = run_experiment(params,
                        with_mlflow=True)
 ```

+nyaggle also has a low-level API with an interface similar to
+[mlflow tracking](https://www.mlflow.org/docs/latest/tracking.html) and [wandb](https://www.wandb.com/).
+
+```python
+from nyaggle.experiment import Experiment
+
+with Experiment(logging_directory='./output/') as exp:
+    # log key-value pair as a parameter
+    exp.log_param('lr', 0.01)
+    exp.log_param('optimizer', 'adam')
+
+    # log text
+    exp.log('blah blah blah')
+
+    # log metric
+    exp.log_metric('CV', 0.85)
+
+    # log numpy ndarray, pandas DataFrame and any artifacts
+    exp.log_numpy('predicted', predicted)
+    exp.log_dataframe('submission', sub, file_format='csv')
+    exp.log_artifact('path-to-your-file')
+```

 ### Feature Engineering

--- a/docs/installation.rst
+++ b/docs/installation.rst
@@ -6,10 +6,10 @@ You can install nyaggle via pip:

 .. code-block:: bash

-    pip install nyaggle
+    pip install nyaggle   # Install core parts of nyaggle


-nyaggle does not install the following packages by pip:
+nyaggle does not install the following packages by default:

 - catboost
 - lightgbm
@@ -17,12 +17,17 @@ nyaggle does not install the following packages by pip:
 - mlflow
 - pytorch

-You need to install these packages if you want to use them through nyaggle API.
-For example, you need to install xgboost before calling ``run_experiment`` with ``algorithm_type='xgb'``.

-To use :code:`nyaggle.nlp.BertSentenceVectorizer`, you first need to install PyTorch.
-Please refer to `PyTorch installation page <https://pytorch.org/get-started/locally/#start-locally>`_
-to install Pytorch to your environment.
+Modules that depend on these packages won't work until you install them as well.
+For example, ``run_experiment`` with ``algorithm_type='xgb'``, ``'lgbm'`` and ``'cat'`` options won't work
+until you also install xgboost, lightgbm and catboost respectively.
+
+If you want to install everything required by nyaggle, use this command:
+
+.. code-block:: bash
+
+    pip install nyaggle[all]  # Install everything
+

 If you use the :code:`lang=ja` option in :code:`BertSentenceVectorizer`,
 you also need to install MeCab and the mecab-python3 package in your environment.
--- a/docs/source/reference/experiment.rst
+++ b/docs/source/reference/experiment.rst
@@ -1,4 +1,4 @@
-experiment
+nyaggle.experiment
 -----------------------

 .. automodule:: nyaggle.experiment
--- a/docs/source/reference/feature_store.rst
+++ b/docs/source/reference/feature_store.rst
@@ -1,4 +1,4 @@
-feature_store
+nyaggle.feature_store
 ---------------------------

 .. automodule:: nyaggle.feature_store
--- a/docs/source/reference/features.rst
+++ b/docs/source/reference/features.rst
@@ -1,4 +1,4 @@
-feature
+nyaggle.feature
 ----------------------------------------

 .. automodule:: nyaggle.feature.category_encoder
--- a/docs/source/reference/hyper_parameters.rst
+++ b/docs/source/reference/hyper_parameters.rst
@@ -1,4 +1,4 @@
-hyper_parameters
+nyaggle.hyper_parameters
 --------------------------

 .. automodule:: nyaggle.hyper_parameters
--- a/docs/source/reference/util.rst
+++ b/docs/source/reference/util.rst
@@ -1,4 +1,4 @@
-util
+nyaggle.util
 -----------------------

 .. automodule:: nyaggle.util
--- a/docs/source/reference/validation.rst
+++ b/docs/source/reference/validation.rst
@@ -1,4 +1,4 @@
-validation
+nyaggle.validation
 --------------------------

 .. automodule:: nyaggle.validation
--- a/docs/source/tutorial/experiment_advanced.rst
+++ b/docs/source/tutorial/experiment_advanced.rst
@@ -55,8 +55,8 @@ If you are familiar with mlflow tracking, you may notice that these APIs are sim



-Log extra parameters to run_experiment
---------------------------------------
+Logging extra parameters to run_experiment
+-------------------------------------------

 By using the ``inherit_experiment`` parameter, you can mix any additional logging with the results that ``run_experiment`` creates.
 In the following example, nyaggle records the result of ``run_experiment`` under the same experiment as
@@ -74,3 +74,35 @@ the parameter and metrics written outside of the function.

      exp.log_metrics('my extra metrics', 0.999)

+
+Tracking seed averaging experiments
+---------------------------------------
+
+If you train several models with different seeds in order to ensemble them, tracking each model separately with mlflow
+fills the GUI with individual results and makes them difficult to manage.
+The nested run feature of mlflow is useful for displaying multiple models together under one result.
+
+.. code-block:: python
+
+  import mlflow
+  from nyaggle.experiment import average_results, run_experiment
+
+  mlflow.start_run()
+  base_logging_dir = './seed-avg/'
+  results = []
+
+  for i in range(3):
+      mlflow.start_run(nested=True)  # use a nested run to place each experiment under the parent run
+      params['seed'] = i
+
+      result = run_experiment(params,
+                              X_train,
+                              y_train,
+                              X_test,
+                              logging_directory=base_logging_dir+f'seed_{i}',
+                              with_mlflow=True)
+      results.append(result)
+
+      mlflow.end_run()
+
+  average_results([base_logging_dir+f'seed_{i}' for i in range(3)], base_logging_dir+'sub.csv')
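
As a rough illustration of what seed averaging computes, here is a minimal sketch, assuming each seed's run yields a one-dimensional array of test predictions. The helper name `average_predictions` and the toy arrays are hypothetical and exist only for this example; this is not nyaggle's `average_results`, which in the snippet above takes logging directories and an output path.

```python
import numpy as np


def average_predictions(predictions):
    """Return the element-wise mean of per-seed prediction arrays.

    Hypothetical helper for illustration only; not part of nyaggle.
    """
    if len(predictions) == 0:
        raise ValueError("at least one prediction array is required")
    # Stack into shape (n_seeds, n_samples); mismatched lengths raise ValueError.
    stacked = np.stack([np.asarray(p, dtype=float) for p in predictions])
    return stacked.mean(axis=0)


# Three toy arrays standing in for models trained with seeds 0, 1 and 2.
per_seed = [
    np.array([0.2, 0.8, 0.5]),
    np.array([0.4, 0.6, 0.5]),
    np.array([0.3, 0.7, 0.5]),
]
averaged = average_predictions(per_seed)
```

Averaging in this way reduces the variance that comes from the random seed alone, which is why the individual runs are usually less interesting than the combined result and can be grouped under a single parent run.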