EDA and Feature Engineering in One Jupyter Notebook and Modeling in the Other

Once you have a Union account, install union:

pip install union

Export the following environment variable to build and push images to your own container registry:

# replace with your registry name
export IMAGE_SPEC_REGISTRY="<your-container-registry>"

Then clone the examples repository and run the workflow:

$ git clone https://github.com/unionai/unionai-examples
$ cd unionai-examples
$ union run --remote <path/to/file.py> <workflow_name> <params>

The source code for this example can be found in the unionai-examples repository.

In this example, we implement a simple two-step pipeline that takes model hyperparameters as input. Step 1 runs a notebook that performs exploratory data analysis (EDA) and feature engineering. Step 2 runs a second notebook that trains a Gradient Boosting model and measures its performance using mean absolute error (MAE).
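To make the metric concrete: MAE is the average of the absolute differences between the true values and the predictions. The following is a minimal sketch in plain Python, using made-up numbers rather than results from this example.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up targets and predictions, for illustration only.
actual = [10.0, 12.0, 15.0, 20.0]
predicted = [11.0, 12.0, 13.0, 24.0]

# Absolute errors are 1, 0, 2 and 4, so the mean is 7 / 4.
mae = mean_absolute_error(actual, predicted)
```

A lower MAE means the predictions are closer to the true values on average, and the score is expressed in the same units as the target.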

First, let’s import the libraries we will use in this example.

import pathlib

import pandas as pd
from flytekit import Resources, kwtypes, workflow
from flytekitplugins.papermill import NotebookTask

We define a NotebookTask to run the first Jupyter notebook (EDA and feature engineering). This notebook returns two outputs: dummified_data and dataset.

Only dataset is used in this example; dummified_data is used in the previous example. dataset carries the DataFrame as a JSON string, because Papermill cannot pass a DataFrame directly into a notebook as an input.
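The JSON hand-off described above can be sketched as follows. This is an illustration under assumptions, not the notebook's actual code: the column names and values are hypothetical, one-hot encoding with pd.get_dummies is only a guess at what "dummified" means here, and the choice of orient="split" is ours.

```python
import io

import pandas as pd

# Hypothetical stand-in for the engineered data; the real notebook's
# columns are not shown in this example.
df = pd.DataFrame(
    {
        "quantity": [7, 5, 8],
        "unit_price": [74.5, 15.25, 46.0],
        "gender": ["Female", "Male", "Female"],
    }
)

# Assumption: "dummified" refers to one-hot encoding categorical columns.
dummified = pd.get_dummies(df, columns=["gender"], dtype=int)

# First notebook: serialize the DataFrame into a plain string output.
dataset = dummified.to_json(orient="split")

# Second notebook: rebuild the DataFrame from the string parameter.
restored = pd.read_json(io.StringIO(dataset), orient="split")
```

Wrapping the string in io.StringIO avoids the deprecation warning that recent pandas versions raise when read_json is given a literal JSON string.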

nb_1 = NotebookTask(
    name="eda-featureeng-nb",
    notebook_path=str(pathlib.Path(__file__).parent.absolute() / "supermarket_regression_1.ipynb"),
    outputs=kwtypes(dummified_data=pd.DataFrame, dataset=str),
    requests=Resources(mem="500Mi"),
)

We define a second NotebookTask to run the modeling notebook. It takes the dataset string and the hyperparameters as inputs and returns mae_score as its output.

nb_2 = NotebookTask(
    name="regression-nb",
    notebook_path=str(pathlib.Path(__file__).parent.absolute() / "supermarket_regression_2.ipynb"),
    inputs=kwtypes(
        dataset=str,
        n_estimators=int,
        max_depth=int,
        max_features=str,
        min_samples_split=int,
        random_state=int,
    ),
    outputs=kwtypes(mae_score=float),
    requests=Resources(mem="500Mi"),
)
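The contents of the modeling notebook are not shown here. As a rough sketch of what such a notebook could do with these inputs, assuming scikit-learn's GradientBoostingRegressor, a synthetic dataset, and a hypothetical target column named "target", the inputs map onto the estimator's parameters like this:

```python
import io

import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Values that Papermill would inject as notebook parameters; these mirror
# the defaults used by the workflow in this example.
n_estimators = 150
max_depth = 3
max_features = "sqrt"
min_samples_split = 4
random_state = 2

# Synthetic stand-in for the JSON string produced by the first notebook.
dataset = pd.DataFrame(
    {
        "x1": list(range(40)),
        "x2": [i % 5 for i in range(40)],
        "target": [2.0 * i + (i % 5) for i in range(40)],
    }
).to_json(orient="split")

# Rebuild the DataFrame and split it into features and target.
data = pd.read_json(io.StringIO(dataset), orient="split")
X = data.drop(columns=["target"])
y = data["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=random_state
)

# Train the model with the injected hyperparameters.
model = GradientBoostingRegressor(
    n_estimators=n_estimators,
    max_depth=max_depth,
    max_features=max_features,
    min_samples_split=min_samples_split,
    random_state=random_state,
)
model.fit(X_train, y_train)

# Score on the held-out split; a plain float matches the declared output type.
mae_score = float(mean_absolute_error(y_test, model.predict(X_test)))
```

The train/test split ratio is an arbitrary choice for this sketch. How the notebook hands mae_score back to the task is handled by the Papermill plugin and is not shown.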

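We define a workflow that chains the two notebook tasks: the dataset output of the first notebook is passed to the second notebook along with the hyperparameters, and the workflow returns the resulting MAE score.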

@workflow
def notebook_wf(
    n_estimators: int = 150,
    max_depth: int = 3,
    max_features: str = "sqrt",
    min_samples_split: int = 4,
    random_state: int = 2,
) -> float:
    eda_output = nb_1()
    regression_output = nb_2(
        dataset=eda_output.dataset,
        n_estimators=n_estimators,
        max_depth=max_depth,
        max_features=max_features,
        min_samples_split=min_samples_split,
        random_state=random_state,
    )
    return regression_output.mae_score

We can now run the workflow locally, which executes both notebooks in order.

if __name__ == "__main__":
    print(notebook_wf())