Serve Generative AI Models with NIM

Once you have a Union account, install union:

```shell
pip install union
```

Export the following environment variable to build and push images to your own container registry:

```shell
# replace with your registry name
export IMAGE_SPEC_REGISTRY="<your-container-registry>"
```

Then run the following commands to execute the workflow:

```shell
git clone https://github.com/unionai/unionai-examples
cd unionai-examples
union run --remote tutorials/sentiment_classifier/sentiment_classifier.py main --model distilbert-base-uncased
```

The source code for this tutorial can be found in the [unionai-examples repository](https://github.com/unionai/unionai-examples) on GitHub.

This guide demonstrates how to serve a Llama 3 8B model locally with NIM within a Flyte task.

First, instantiate NIM by importing it from the `flytekitplugins.inference` package and specifying the image name along with the necessary secrets: `ngc_image_secret` is required to pull the image from NGC, `ngc_secret_key` is used to pull models from NGC once the container is up and running, and `secrets_prefix` is the environment variable prefix used to access secrets. Below is a simple task that serves a Llama NIM container:

```python
from flytekit import ImageSpec, Resources, Secret, task
from flytekit.extras.accelerators import A10G
from flytekitplugins.inference import NIM, NIMSecrets
from openai import OpenAI

image = ImageSpec(
    name="nim",
    registry="ghcr.io/flyteorg",
    packages=["flytekitplugins-inference"],
)

nim_instance = NIM(
    image="nvcr.io/nim/meta/llama3-8b-instruct:1.0.0",
    secrets=NIMSecrets(
        ngc_image_secret="nvcrio-cred",
        ngc_secret_key="ngc-api-key",
        ngc_secret_group="ngc",
        secrets_prefix="_FSEC_",
    ),
)


@task(
    container_image=image,
    pod_template=nim_instance.pod_template,
    accelerator=A10G,
    secret_requests=[
        Secret(
            group="ngc", key="ngc-api-key", mount_requirement=Secret.MountType.ENV_VAR
        )  # must be mounted as an env var
    ],
    requests=Resources(gpu="0"),
)
def model_serving() -> str:
    client = OpenAI(base_url=f"{nim_instance.base_url}/v1", api_key="nim")  # api key required but ignored

    completion = client.chat.completions.create(
        model="meta/llama3-8b-instruct",
        messages=[
            {
                "role": "user",
                "content": "Write a limerick about the wonders of GPU computing.",
            }
        ],
        temperature=0.5,
        top_p=1,
        max_tokens=1024,
    )

    return completion.choices[0].message.content
```

Replace `ghcr.io/flyteorg` with a container registry to which you can publish. To push the image to the local registry in the demo cluster, set the registry to `localhost:30000`.

The `model_serving` task starts a sidecar service that serves the model and makes it accessible on localhost through the `base_url` property. Both the completions and chat completions endpoints can be used. The secret must be mounted as an environment variable because the NIM container reads it from the `NGC_API_KEY` environment variable.

By default, the NIM instantiation sets `cpu`, `gpu`, and `mem` to 1, 1, and 20Gi, respectively. You can modify these settings as needed (a sketch of overriding them follows the fine-tuned example below).

To serve a fine-tuned Llama model, pass the HuggingFace repo ID in `hf_repo_ids` as `[<your-hf-repo-id>]` and set the LoRA adapter memory via `lora_adapter_mem`. Set the `NIM_PEFT_SOURCE` environment variable by passing `env={"NIM_PEFT_SOURCE": "..."}` when instantiating NIM. Here is an example initialization for a fine-tuned Llama model:

```python
nim_instance = NIM(
    image="nvcr.io/nim/meta/llama3-8b-instruct:1.0.0",
    secrets=NIMSecrets(
        ngc_image_secret="nvcrio-cred",
        ngc_secret_key="ngc-api-key",
        ngc_secret_group="ngc",
        secrets_prefix="_FSEC_",
        hf_token_key="hf-key",
        hf_token_group="hf",
    ),
    hf_repo_ids=["<your-hf-repo-id>"],
    lora_adapter_mem="500Mi",
    env={"NIM_PEFT_SOURCE": "/home/nvs/loras"},
)
```
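Once the adapter is loaded, it is served through the same OpenAI-compatible API: NIM registers LoRA adapters alongside the base model, so they appear in the models listing and are addressed by name in the `model` field. The sketch below assumes a task decorated like `model_serving` above (with the HF token secret also requested) and looks the adapter name up at runtime, since the exact name depends on how the adapter directory is registered under `NIM_PEFT_SOURCE`:

```python
@task(
    container_image=image,
    pod_template=nim_instance.pod_template,
    accelerator=A10G,
    secret_requests=[
        Secret(
            group="ngc", key="ngc-api-key", mount_requirement=Secret.MountType.ENV_VAR
        ),
        Secret(
            group="hf", key="hf-key", mount_requirement=Secret.MountType.ENV_VAR
        ),  # HF token used to pull the adapter
    ],
    requests=Resources(gpu="0"),
)
def lora_serving() -> str:
    client = OpenAI(base_url=f"{nim_instance.base_url}/v1", api_key="nim")

    # /v1/models lists the base model plus any loaded LoRA adapters.
    available_models = [m.id for m in client.models.list().data]

    completion = client.chat.completions.create(
        model=available_models[-1],  # placeholder: select your adapter's name explicitly
        messages=[{"role": "user", "content": "Write a haiku about fine-tuning."}],
        max_tokens=256,
    )
    return completion.choices[0].message.content
```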

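The resource defaults described earlier can also be adjusted when constructing the NIM instance. A minimal sketch, assuming the constructor exposes `cpu`, `gpu`, and `mem` keyword arguments matching those defaults (verify against your installed version of `flytekitplugins-inference`):

```python
nim_instance = NIM(
    image="nvcr.io/nim/meta/llama3-8b-instruct:1.0.0",
    secrets=NIMSecrets(
        ngc_image_secret="nvcrio-cred",
        ngc_secret_key="ngc-api-key",
        ngc_secret_group="ngc",
        secrets_prefix="_FSEC_",
    ),
    # Override the defaults of cpu=1, gpu=1, mem="20Gi" for the model-serving sidecar.
    cpu=2,
    mem="32Gi",
)
```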
Native directory and NGC support for LoRA adapters is coming soon.

NIM containers can be integrated into different stages of your AI workflow, including data pre-processing, model inference, and post-processing. Flyte also allows serving multiple NIM containers simultaneously, each with a different configuration and running on different instances. This integration lets you self-host and serve optimized AI models on your own infrastructure, ensuring full control over costs and data security. By eliminating dependence on third-party APIs for AI model access, you gain not only greater control but also potentially lower costs than traditional API services. For more detailed information, refer to NVIDIA's NIM documentation.
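As a closing illustration of the workflow integration described above, the `model_serving` task defined earlier can be called from an ordinary Flyte workflow, with pre- and post-processing tasks chained around it in the same way; a minimal sketch:

```python
from flytekit import workflow


@workflow
def nim_serving_wf() -> str:
    # Upstream tasks (for example, data pre-processing) would run before this call,
    # and downstream tasks could consume the completion returned by the serving task.
    return model_serving()
```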