Samhita Alla

Union Powers Faster End-to-End AI Application Deployment using NVIDIA NIM

At Union, we understand the complexities of deploying generative AI in production. Unlike local development, production deployment demands meticulous management of numerous configuration settings. Promoting AI models from development to production requires optimizing for latency, throughput, resources, observability, security, and other considerations. Beyond infrastructure, AI engineers benefit from a simpler deployment user experience (UX) that makes the process faster and more effective, reducing time to value.

What we need is a solution that alleviates the infrastructure burden and accelerates both development and deployment processes.

Union abstracts away the low-level details of production-grade AI orchestration. Complementing this, NVIDIA NIM microservices, part of the NVIDIA AI Enterprise software platform, accelerate the development and deployment of generative AI models. Union’s recent NIM integration enables serving AI models as sidecar services in Kubernetes pods, aka Flyte tasks, eliminating network overhead and the hassle of spinning up Docker containers. The interface facilitates the concurrent serving of multiple AI models and enables parallel inference execution. AI engineers can use NIM with Union to create end-to-end inference solutions as single, centralized pipelines. By simplifying the UX around pipeline management, Union enables AI practitioners to focus on extracting value from data:

  • Infrastructure as code (IaC): With Union, managing resources such as GPUs, CPUs, and memory within Python code is effortless (see the short sketch after this list). Adapt to changing computational needs seamlessly, without manual intervention.
  • Orchestrate your AI: From typing to caching, reproducibility, setting up secrets, and version control, Union handles it all. With the NIM integration, model serving has been made even more seamless. Focus solely on your business logic!
  • Build unified pipelines, not disjointed fragments: Union enables you to build unified pipelines that cover data processing, fine-tuning, data post-processing, notifications, and more. Easily incorporate NIM into your batch inference or RAG pipelines. 
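
The first two points translate into just a few lines of Python. Here is a minimal sketch (the task body, resource values, and cache settings are illustrative, not taken from the example later in this post):

from flytekit import Resources, task

# Compute requests and caching are declared next to the business logic.
@task(
    requests=Resources(cpu="2", mem="4Gi", gpu="1"),  # illustrative resource requests
    cache=True,
    cache_version="1.0",
)
def preprocess(dataset_uri: str) -> str:
    # Hypothetical pre-processing step; replace with your own logic.
    return dataset_uri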

With Union and NIM, you can rapidly promote your workflows from AI development to production, reducing both operational costs and time to market. This integration enables you to self-host and serve optimized AI models on your own infrastructure powered by Union, ensuring full control over costs and data security. By eliminating dependence on third-party APIs for AI model access, you gain not only enhanced control but also potentially lower expenses compared to traditional API services.  

Figure: Why NIM stands out as the optimal choice

Serve NVIDIA NIM on Union 

Suppose you have a batch of inputs for which you want the model to generate predictions; you can host the model locally with Union and invoke the endpoint without any network overhead. Use cases include text summarization, video transcription, and more. You can incorporate NIM microservices into different stages of your AI workflow: data pre-processing, model inference, and post-processing and analysis. 

Union also allows you to serve multiple NIM microservices simultaneously, each with different configurations on various instances. This flexibility enables:

  • A/B testing of different models
  • Parallel processing for improved throughput
  • A simpler, more flexible way to build specialized applications for varied tasks within a single pipeline
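
As a rough sketch of what this looks like, you can declare two independently configured NIM instances and attach each one to its own task (the second model image is hypothetical, and the secret configuration follows the same pattern as the setup shown later in this post):

from flytekitplugins.inference import NIM, NIMSecrets

# Shared NGC secret configuration, reused by both instances.
ngc_secrets = NIMSecrets(
    ngc_image_secret=NGC_IMAGE_SECRET,
    ngc_secret_key=NGC_KEY,
    secrets_prefix="_UNION_",
)

# Two NIM instances with different models, e.g. for A/B testing.
llama_nim = NIM(image="nvcr.io/nim/meta/llama3-8b-instruct:1.0.0", secrets=ngc_secrets)
other_nim = NIM(image="nvcr.io/nim/<org>/<another-model>:1.0.0", secrets=ngc_secrets)  # hypothetical second model

# Each instance exposes its own pod template, so separate tasks can serve and
# query the two models in parallel within the same workflow.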

Near real-time inference on Union is also on the horizon—coming soon!

Serving NIM on Union is straightforward. First, install the inference plugin with:

pip install flytekitplugins-inference

Then, instantiate the `NIM` class and provide the necessary configuration:

from flytekitplugins.inference import NIM, NIMSecrets

nim_instance = NIM(
    image="nvcr.io/nim/meta/llama3-8b-instruct:1.0.0",
    secrets=NIMSecrets(
        ngc_image_secret=NGC_IMAGE_SECRET,  # name of the secret for pulling the NIM image from NGC
        ngc_secret_key=NGC_KEY,  # key of the secret that stores your NGC API key
        secrets_prefix="_UNION_",  # prefix Union prepends to secrets mounted as environment variables
    )
)

Suppose you want to build a pipeline that:

  1. Fine-tunes a Llama 3 8B Instruct model with Parameter-Efficient Fine-Tuning (PEFT), specifically employing Low-Rank Adaptation (LoRA)
  2. Uploads the model to Hugging Face Hub
  3. Serves the fine-tuned Llama NIM by mounting the LoRA adapter

Union enables you to build a single, centralized workflow that handles all the aforementioned operations.

You can encapsulate the fine-tuning code in a single Flyte task. In the example workflow, I use ORPO with PEFT for fine-tuning, and the Weights & Biases integration to log the metrics.
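
A heavily condensed sketch of what such a task can look like (hyperparameters, dataset handling, and output paths below are illustrative, not the actual example code):

from flytekit import Resources, task
from flytekit.extras.accelerators import A10G


@task(requests=Resources(gpu="1", mem="32Gi", cpu="4"), accelerator=A10G)
def finetune(dataset: str) -> str:
    from datasets import load_dataset
    from peft import LoraConfig
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import ORPOConfig, ORPOTrainer

    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
    trainer = ORPOTrainer(
        model=AutoModelForCausalLM.from_pretrained(model_id),
        tokenizer=AutoTokenizer.from_pretrained(model_id),
        args=ORPOConfig(output_dir="llama-3-8b-orpo", max_steps=100, report_to="wandb"),
        train_dataset=load_dataset(dataset, split="train"),  # expects prompt/chosen/rejected columns
        peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    )
    trainer.train()
    trainer.save_model("llama-3-8b-orpo")  # saves only the LoRA adapter weights
    return "llama-3-8b-orpo"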

Here's how you can deploy the NIM Llama container in Union while also mounting the LoRA adapter:

from flytekit import ImageSpec, Resources, Secret, task
from flytekit.extras.accelerators import A10G
from flytekitplugins.inference import NIM, NIMSecrets
from openai import OpenAI

from constants import BUILDER, HF_KEY, HF_REPO_ID, NGC_IMAGE_SECRET, NGC_KEY, REGISTRY

image = ImageSpec(
    name="nim_serve",
    registry=REGISTRY,
    apt_packages=["git"],
    packages=["flytekitplugins-inference>=1.13.1"],
    builder=BUILDER,
)

nim_instance = NIM(
    image="nvcr.io/nim/meta/llama3-8b-instruct:1.0.0",
    secrets=NIMSecrets(
        ngc_image_secret=NGC_IMAGE_SECRET,
        ngc_secret_key=NGC_KEY,
        secrets_prefix="_UNION_",
        hf_token_key=HF_KEY,  # key of the secret that stores your Hugging Face token
    ),
    hf_repo_ids=[HF_REPO_ID],  # Hugging Face repositories containing the LoRA adapters to mount
    lora_adapter_mem="500Mi",  # memory allocated for downloading the LoRA adapters
    env={"NIM_PEFT_SOURCE": "/home/nvs/loras"},  # directory where NIM looks for the mounted adapters
)


@task(
    container_image=image,
    pod_template=nim_instance.pod_template,
    secret_requests=[
        Secret(key=HF_KEY, mount_requirement=Secret.MountType.ENV_VAR),
        Secret(
            key=NGC_KEY, mount_requirement=Secret.MountType.ENV_VAR
        ),  # must be mounted as env vars
    ],
    accelerator=A10G,
    requests=Resources(gpu="0"),
)
def model_serving(questions: list[str], repo_id: str) -> list[str]:
    responses = []
    client = OpenAI(
        base_url=f"{nim_instance.base_url}/v1", api_key="nim"
    )  # api key required but ignored

    for question in questions:
        completion = client.chat.completions.create(
            model=repo_id.split("/")[1],
            messages=[
                {"role": "system", "content": "You are a knowledgeable AI assistant."},
                {"role": "user", "content": question},
            ],
            max_tokens=256,
        )
        responses.append(completion.choices[0].message.content)

    return responses
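
To tie everything together into the single, centralized workflow described earlier, you can chain the tasks. A minimal sketch, assuming a `finetune` task along the lines of the earlier sketch and a hypothetical `upload_to_hub` task that pushes the LoRA adapter to the Hugging Face Hub:

from flytekit import workflow


@workflow
def nim_pipeline(dataset: str, questions: list[str]) -> list[str]:
    # Fine-tune, publish the LoRA adapter, then serve it with NIM.
    adapter_dir = finetune(dataset=dataset)
    repo_id = upload_to_hub(adapter_dir=adapter_dir)  # hypothetical task returning the Hugging Face repo ID
    return model_serving(questions=questions, repo_id=repo_id)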

You can directly invoke your NIM-powered model within a Flyte task. Union simplifies the UX for managing NIM microservices as follows:

  • Secrets management: Easily create secrets using `unionai create secret <secret-name>`, and once mounted as environment variables, they are integrated into NIM microservices. Deleting secrets is just as straightforward with `unionai delete secret <secret-name>`.
  • LoRA adapters: Union supports custom LoRA adapters from Hugging Face repositories. Specify the repository IDs (and a token for private repos), the memory allocation for adapter downloads, and the PEFT source environment variable; Union automatically fetches the adapters and mounts them in the containers. You can provide multiple repository IDs to use multiple LoRA adapters, as NIM supports multi-LoRA inference. Native directory and NGC support are coming soon.
  • Instance declaration: Setting up GPU instances is easy—just specify the accelerator type, like choosing an L4 instance.

The end-to-end example is available on GitHub.

Get Started

Union and NIM combined will take AI development and deployment up a notch. If you find yourself struggling with serving AI models and orchestrating AI pipelines, give this integration a try.

Don’t hesitate to reach out to the Union team if you're considering building production-grade AI pipelines. 
