EDA and Feature Engineering in One Jupyter Notebook and Modeling in the Other
Once you have a Union account, install union:
pip install union
Export the following environment variable to build and push images to your own container registry:
# replace with your registry name
export IMAGE_SPEC_REGISTRY="<your-container-registry>"
Then run the following commands to run the workflow:
git clone https://github.com/unionai/unionai-examples
cd unionai-examples
union run --remote tutorials/sentiment_classifier/sentiment_classifier.py main --model distilbert-base-uncased
The source code for this tutorial can be found on GitHub.
First, let’s import the libraries we will use in this example.
import pathlib
import pandas as pd
from flytekit import Resources, kwtypes, workflow
from flytekitplugins.papermill import NotebookTask
We define a NotebookTask to run the first Jupyter notebook (EDA and feature engineering).
This notebook returns dummified_data and dataset as its outputs.
dataset is used in this example, and dummified_data is used in the previous example.
dataset carries the DataFrame to the subsequent notebook as a JSON string, because Papermill does not support passing a DataFrame directly to a notebook as an input.
nb_1 = NotebookTask(
    name="eda-featureeng-nb",
    notebook_path=str(pathlib.Path(__file__).parent.absolute() / "supermarket_regression_1.ipynb"),
    outputs=kwtypes(dummified_data=pd.DataFrame, dataset=str),
    requests=Resources(mem="500Mi"),
)
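The JSON hand-off described above can be sketched with plain pandas. This is a minimal illustration, not the notebook's actual code: the column names and values are hypothetical, and it assumes the notebooks use `DataFrame.to_json` and `pd.read_json` for the conversion.

```python
import io

import pandas as pd

# Hypothetical stand-in for the DataFrame produced by the EDA notebook.
df = pd.DataFrame({"Quantity": [7, 5], "Total": [548.5, 80.25]})

# Serialize the DataFrame to a JSON string, which can be passed
# between notebooks as an ordinary str parameter.
dataset = df.to_json()

# In the receiving notebook, rebuild the DataFrame from the string.
# Wrapping the string in StringIO avoids pandas treating it as a path.
restored = pd.read_json(io.StringIO(dataset))
print(restored)
```

A string output is convenient here because it fits the `dataset=str` type declared on both tasks.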
We define a second NotebookTask to run the Jupyter notebook (Modeling).
This notebook returns mae_score as its output.
nb_2 = NotebookTask(
    name="regression-nb",
    notebook_path=str(pathlib.Path(__file__).parent.absolute() / "supermarket_regression_2.ipynb"),
    inputs=kwtypes(
        dataset=str,
        n_estimators=int,
        max_depth=int,
        max_features=str,
        min_samples_split=int,
        random_state=int,
    ),
    outputs=kwtypes(mae_score=float),
    requests=Resources(mem="500Mi"),
)
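The mae_score output is presumably a mean absolute error, that is, the average of the absolute differences between actual and predicted values. The source does not show how the notebook computes it, so the following is only a sketch of the standard definition, with a hypothetical helper name and made-up values.

```python
def mean_absolute_error(y_true, y_pred):
    """Return the average absolute difference between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one value is required")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values for illustration only.
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 32.0]

mae_score = mean_absolute_error(actual, predicted)
print(mae_score)  # 2.0
```

A lower score means the predictions are closer to the actual values on average, and the result is a plain float, matching the `mae_score=float` output type.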
We define a Workflow to run the notebook tasks.
@workflow
def notebook_wf(
    n_estimators: int = 150,
    max_depth: int = 3,
    max_features: str = "sqrt",
    min_samples_split: int = 4,
    random_state: int = 2,
) -> float:
    eda_output = nb_1()
    regression_output = nb_2(
        dataset=eda_output.dataset,
        n_estimators=n_estimators,
        max_depth=max_depth,
        max_features=max_features,
        min_samples_split=min_samples_split,
        random_state=random_state,
    )
    return regression_output.mae_score
We can now run the two notebooks locally.
if __name__ == "__main__":
    print(notebook_wf())