Serve LLMs with Ollama

Once you have a Union account, install union:

pip install union

Export the following environment variable to build and push images to your own container registry:

# replace with your registry name
export IMAGE_SPEC_REGISTRY="<your-container-registry>"

Then clone the examples repository and run this tutorial's workflow (point union run at this tutorial's script and workflow name):

git clone https://github.com/unionai/unionai-examples
cd unionai-examples
union run --remote <path-to-this-tutorial's-script> <workflow-name>

The source code for this tutorial can be found in the unionai-examples repository on GitHub.

In this guide, you’ll learn how to locally serve Gemma2 and fine-tuned Llama3 models using Ollama within a Flyte task. Start by importing Ollama from the flytekitplugins.inference package and specifying the desired model name. Below is a straightforward example of serving a Gemma2 model:

from flytekit import ImageSpec, Resources, task
from flytekit.extras.accelerators import A10G
from flytekitplugins.inference import Model, Ollama
from openai import OpenAI

image = ImageSpec(
    name="ollama_serve",
    registry="ghcr.io/flyteorg",
    packages=["flytekitplugins-inference"],
    builder="default",
)

# Ollama sidecar configuration: serve the gemma2 model on one GPU.
ollama_instance = Ollama(model=Model(name="gemma2"), gpu="1")


@task(
    container_image=image,
    # The pod template injects the Ollama model server as a sidecar container.
    pod_template=ollama_instance.pod_template,
    accelerator=A10G,
    # The task container itself requests no GPU; the GPU is used by the Ollama sidecar.
    requests=Resources(gpu="0"),
)
def model_serving(user_prompt: str) -> str:
    client = OpenAI(base_url=f"{ollama_instance.base_url}/v1", api_key="ollama")  # api key required but ignored

    completion = client.chat.completions.create(
        model="gemma2",
        messages=[
            {
                "role": "user",
                "content": user_prompt,
            }
        ],
        temperature=0.5,
        top_p=1,
        max_tokens=1024,
    )

    return completion.choices[0].message.content

Replace ghcr.io/flyteorg with a container registry to which you can publish. To upload the image to the local registry in the demo cluster, set the registry to localhost:30000.

The model_serving task initiates a sidecar service to serve the model, making it accessible on localhost via the base_url property. You can use either the chat or the chat completion endpoint.

By default, Ollama initializes the server with cpu, gpu, and mem set to 1, 1, and 15Gi, respectively. You can adjust these settings to meet your requirements, as in the sketch below.
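
As a rough sketch, and assuming the cpu, gpu, and mem keyword arguments correspond to the defaults listed above (check them against your version of flytekitplugins-inference), the server resources could be raised like this:

# Hypothetical resource overrides for the Ollama server sidecar.
bigger_ollama_instance = Ollama(
    model=Model(name="gemma2"),
    cpu="2",
    gpu="1",
    mem="20Gi",
)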

To serve a fine-tuned model, provide the model configuration as modelfile within the Model dataclass. Below is an example of specifying a fine-tuned LoRA adapter for a Llama3 Mario model:

from flytekit.types.file import FlyteFile

finetuned_ollama_instance = Ollama(
    model=Model(
        name="llama3-mario",
        modelfile="FROM llama3\nADAPTER {inputs.ggml}\nPARAMETER temperature 1\nPARAMETER num_ctx 4096\nSYSTEM {inputs.system_prompt}",
    ),
    gpu="1",
)


@task(
    container_image=image,
    pod_template=finetuned_ollama_instance.pod_template,
    accelerator=A10G,
    requests=Resources(gpu="0"),
)
def finetuned_model_serving(ggml: FlyteFile, system_prompt: str):
    ...
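
The task body is elided above. Under the same @task decorator, a rough sketch mirroring the Gemma2 example might look like the following; the prompt text is illustrative, and ggml and system_prompt are consumed by the Ollama sidecar via the modelfile placeholders rather than used directly in the body:

def finetuned_model_serving(ggml: FlyteFile, system_prompt: str) -> str:
    client = OpenAI(
        base_url=f"{finetuned_ollama_instance.base_url}/v1", api_key="ollama"
    )
    completion = client.chat.completions.create(
        model="llama3-mario",  # the model name registered above
        messages=[{"role": "user", "content": "Who are you?"}],  # illustrative prompt
        temperature=1,
        max_tokens=1024,
    )
    return completion.choices[0].message.content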

{inputs.ggml} and {inputs.system_prompt} are materialized at run time, with ggml and system_prompt available as inputs to the task.

Ollama models can be integrated into different stages of your AI workflow, including data pre-processing, model inference, and post-processing. Flyte also allows serving multiple Ollama models simultaneously on various instances. This integration enables you to self-host and serve AI models on your own infrastructure, ensuring full control over costs and data security. A sketch of plugging the served model into a larger workflow follows below.

For more detailed information on the models natively supported by Ollama, visit the Ollama models library.
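
For instance, here is a minimal sketch of such a pipeline, assuming the tasks defined above; the post_process task, the workflow name, and the default prompt are illustrative and not part of the example code:

from flytekit import workflow


# Hypothetical post-processing step, included only to show task chaining.
@task(container_image=image)
def post_process(response: str) -> str:
    return response.strip()


@workflow
def serve_llm_wf(user_prompt: str = "Why is the sky blue?") -> str:
    return post_process(response=model_serving(user_prompt=user_prompt))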