Local image building

With Flyte, every task in a workflow runs within its own dedicated container. Since a container requires a container image to run, every task in Flyte must have a container image associated with it. You can specify the container image to be used by a task by defining an ImageSpec object and passing it to the container_image parameter of the @fl.task decorator. When you register the workflow, the container image is built locally and pushed to the container registry that you specify. When the workflow is executed, the container image is pulled from that registry and used to run the task.

See the ImageSpec API documentation for a full description of the ImageSpec class parameters and methods.

To illustrate the process, we will walk through an example.

Project structure

├── requirements.txt
└── workflows
    ├── __init__.py
    └── imagespec-simple-example.py

requirements.txt

flytekit
pandas

imagespec-simple-example.py

import typing
import pandas as pd
import flytekit as fl

image_spec = fl.ImageSpec(
    registry="ghcr.io/<my-github-org>",
    name="simple-example-image",
    base_image="ghcr.io/flyteorg/flytekit:py3.11-latest",
    requirements="requirements.txt"
)

@fl.task(container_image=image_spec)
def get_pandas_dataframe() -> typing.Tuple[pd.DataFrame, pd.Series]:
    df = pd.read_csv("https://storage.googleapis.com/download.tensorflow.org/data/heart.csv")
    print(df.head())
    return df[["age", "thalach", "trestbps", "chol", "oldpeak"]], df.pop("target")

@fl.workflow()
def wf() -> typing.Tuple[pd.DataFrame, pd.Series]:
    return get_pandas_dataframe()

Install and configure pyflyte and Docker

To install Docker, see Setting up container image handling. To configure pyflyte to connect to your Flyte instance, see Quick start.
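By default, pyflyte reads its connection settings from a configuration file at ~/.flyte/config.yaml. A minimal sketch of such a file, assuming your Flyte instance is reachable at flyte.example.com (a placeholder endpoint):

admin:
  endpoint: dns:///flyte.example.com
  insecure: false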

Set up an image registry

You will need an image registry where the container image can be stored and pulled by Flyte when the task is executed. You can use any image registry that you have access to, including public registries like Docker Hub or GitHub Container Registry. Alternatively, you can use a registry that is part of your organization’s infrastructure such as AWS Elastic Container Registry (ECR) or Google Artifact Registry (GAR).

The registry that you choose must be one that is accessible to the Flyte instance where the workflow will be executed. Additionally, you will need to ensure that the specific image, once pushed to the registry, is itself publicly accessible.

In this example, we use GitHub’s ghcr.io container registry. See Working with the Container registry for more information.

For an example using Amazon ECR see ImageSpec with ECR. For an example using Google Artifact Registry see ImageSpec with GAR.

Authenticate to the registry

You will need to set up your local Docker client to authenticate with GHCR. This is needed so that the pyflyte CLI can push the image built according to the ImageSpec to GHCR.

Follow the directions in Working with the Container registry > Authenticating to the Container registry.
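For example, using a GitHub personal access token with the write:packages scope stored in the CR_PAT environment variable (replace <my-github-username> with your GitHub username):

$ echo $CR_PAT | docker login ghcr.io -u <my-github-username> --password-stdin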

Set up your project and domain on Flyte

You will need to set up a project on your Flyte instance to which you can register your workflow. See Setting up the project.
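As a sketch, if you have flytectl configured against your Flyte instance, you can create a project from the command line; the ID, name, and description below are placeholders:

$ flytectl create project --id my-project --name "My Project" --description "ImageSpec example"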

Understand the requirements

The requirements.txt file contains the flytekit package and the pandas package, both of which are needed by the task.

Set up a virtual Python environment

Set up a virtual Python environment and install the dependencies defined in the requirements.txt file from the local project root.
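For example, using Python's built-in venv module (the .venv directory name is arbitrary):

$ python -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt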

Run the workflow locally

You can now run the workflow locally. In the project root directory, run:

$ pyflyte run workflows/imagespec-simple-example.py wf

See Running your code for more details.

When you run the workflow in your local Python environment, the image is not built or pushed (in fact, no container image is used at all).

Register the workflow

To register the workflow to Flyte, in the local project root, run:

$ pyflyte register workflows/imagespec-simple-example.py

pyflyte will build the container image and push it to the registry that you specified in the ImageSpec object. It will then register the workflow to Flyte.
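By default, pyflyte registers to the project and domain in your configuration. To target the project and domain you set up above explicitly, you can pass them as flags; for example, assuming a project called my-project and the development domain:

$ pyflyte register --project my-project --domain development workflows/imagespec-simple-example.py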

To see the registered workflow, go to the UI and navigate to the project and domain that you created above.

Ensure that the image is publicly accessible

If you are using the ghcr.io image registry, you must switch the visibility of your container image to Public before you can run your workflow on Flyte. See Configuring a package’s access control and visibility.

Run the workflow on Flyte

Assuming your image is publicly accessible, you can now run the workflow on Flyte by clicking Launch Workflow.
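Alternatively, you can launch the execution from the command line with the --remote flag, for example:

$ pyflyte run --remote workflows/imagespec-simple-example.py wf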

Make sure your image is accessible

If you try to run a workflow that uses a private container image or an image that is inaccessible for some other reason, the system will return an error:

... Failed to pull image ...
... Error: ErrImagePull
... Back-off pulling image ...
... Error: ImagePullBackOff

Multi-image workflows

You can also specify a different image per task within the same workflow. This is particularly useful when a few tasks in your workflow have a different set of dependencies from the rest, so most tasks can share one image while the exceptions use another.

In this example we define two tasks: one that uses CPUs and another that uses GPUs. For the former, we use the default image that ships with flytekit, while for the latter, we specify a pre-built image that enables distributed training with the Kubeflow PyTorch integration.

from typing import Tuple

import numpy as np
import torch.nn as nn
from flytekit import Resources, task

@task(
    requests=Resources(cpu="2", mem="16Gi"),
    container_image="ghcr.io/flyteorg/flytekit:py3.9-latest",
)
def get_data() -> Tuple[np.ndarray, np.ndarray]:
    ...  # get dataset as numpy ndarrays


@task(
    requests=Resources(cpu="4", gpu="1", mem="16Gi"),
    container_image="ghcr.io/flyteorg/flytecookbook:kfpytorch-latest",
)
def train_model(features: np.ndarray, target: np.ndarray) -> nn.Module:
    ...  # train a model using gpus
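A workflow can then chain these tasks as usual, with each task pulling its own image at execution time. A minimal sketch (the workflow name and wiring are illustrative):

from flytekit import workflow

@workflow
def pipeline() -> nn.Module:
    # get_data runs on the default flytekit image;
    # train_model runs on the kfpytorch image specified in its @task decorator
    features, target = get_data()
    return train_model(features=features, target=target)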