Serve Generative AI Models with NIM
Once you have a Union account, install `union`:

```shell
pip install union
```
Export the following environment variable to build and push images to your own container registry:

```shell
# replace with your registry name
export IMAGE_SPEC_REGISTRY="<your-container-registry>"
```
Then run the following commands to execute the workflow:

```shell
git clone https://github.com/unionai/unionai-examples
cd unionai-examples
union run --remote tutorials/sentiment_classifier/sentiment_classifier.py main --model distilbert-base-uncased
```
The source code for this tutorial can be found here {octicon}`mark-github`.
First, instantiate NIM by importing it from the `flytekitplugins.inference` package and specifying the image name along with the necessary secrets. The `ngc_image_secret` is required to pull the image from NGC, the `ngc_secret_key` is used to pull models from NGC after the container is up and running, and `secrets_prefix` is the environment variable prefix to access {ref}`secrets <secrets>`.
Below is a simple task that serves a Llama NIM container:
```python
from flytekit import ImageSpec, Resources, Secret, task
from flytekit.extras.accelerators import A10G
from flytekitplugins.inference import NIM, NIMSecrets
from openai import OpenAI

image = ImageSpec(
    name="nim",
    registry="ghcr.io/flyteorg",
    packages=["flytekitplugins-inference"],
)

nim_instance = NIM(
    image="nvcr.io/nim/meta/llama3-8b-instruct:1.0.0",
    secrets=NIMSecrets(
        ngc_image_secret="nvcrio-cred",
        ngc_secret_key="ngc-api-key",
        ngc_secret_group="ngc",
        secrets_prefix="_FSEC_",
    ),
)


@task(
    container_image=image,
    pod_template=nim_instance.pod_template,
    accelerator=A10G,
    secret_requests=[
        Secret(
            group="ngc", key="ngc-api-key", mount_requirement=Secret.MountType.ENV_VAR
        )  # must be mounted as an env var
    ],
    requests=Resources(gpu="0"),
)
def model_serving() -> str:
    client = OpenAI(
        base_url=f"{nim_instance.base_url}/v1", api_key="nim"
    )  # api key required but ignored

    completion = client.chat.completions.create(
        model="meta/llama3-8b-instruct",
        messages=[
            {
                "role": "user",
                "content": "Write a limerick about the wonders of GPU computing.",
            }
        ],
        temperature=0.5,
        top_p=1,
        max_tokens=1024,
    )

    return completion.choices[0].message.content
```
Replace `ghcr.io/flyteorg` with a container registry to which you can publish. To upload the image to the local registry in the demo cluster, set the registry to `localhost:30000`.
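For example, the `ImageSpec` from the task above would then point at the demo cluster's local registry (only the registry value changes):

```python
from flytekit import ImageSpec

# Same ImageSpec as in the example above, but targeting the demo cluster's
# built-in registry so the image can be pulled locally.
image = ImageSpec(
    name="nim",
    registry="localhost:30000",
    packages=["flytekitplugins-inference"],
)
```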
The `model_serving` task initiates a sidecar service to serve the model, making it accessible on localhost via the `base_url` property. Both the chat completions and completions endpoints can be used.
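For instance, the plain completions endpoint can be reached with the same client. The snippet below is a minimal sketch that assumes it runs inside a task wired up like `model_serving` above; the prompt and token limit are illustrative:

```python
from openai import OpenAI

# Minimal sketch: calling the completions endpoint of the NIM sidecar.
# Assumes `nim_instance` is the NIM object defined above and that this code
# runs inside a task that uses its pod template.
client = OpenAI(base_url=f"{nim_instance.base_url}/v1", api_key="nim")  # api key required but ignored

completion = client.completions.create(
    model="meta/llama3-8b-instruct",
    prompt="Explain in one sentence why GPUs speed up matrix multiplication.",
    max_tokens=128,
)
print(completion.choices[0].text)
```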
The secret must be mounted as an environment variable because the NIM container reads it through the `NGC_API_KEY` environment variable.
By default, the NIM instantiation sets `cpu`, `gpu`, and `mem` to 1, 1, and 20Gi, respectively. You can modify these settings as needed.
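As a rough sketch, assuming the `NIM` constructor accepts `cpu`, `gpu`, and `mem` keyword arguments matching the defaults above, the sidecar resources could be raised like this (values are illustrative):

```python
from flytekitplugins.inference import NIM, NIMSecrets

# Hypothetical resource override for the NIM sidecar; adjust to your model and hardware.
nim_instance = NIM(
    image="nvcr.io/nim/meta/llama3-8b-instruct:1.0.0",
    secrets=NIMSecrets(
        ngc_image_secret="nvcrio-cred",
        ngc_secret_key="ngc-api-key",
        ngc_secret_group="ngc",
        secrets_prefix="_FSEC_",
    ),
    cpu=2,       # default: 1
    gpu=1,       # default: 1
    mem="32Gi",  # default: 20Gi
)
```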
To serve a fine-tuned Llama model, specify the HuggingFace repo ID in `hf_repo_ids` as `[<your-hf-repo-id>]` and the LoRA adapter memory as `lora_adapter_mem`. Set the `NIM_PEFT_SOURCE` environment variable by including `env={"NIM_PEFT_SOURCE": "..."}` in the task decorator.
Here is an example initialization for a fine-tuned Llama model:
```python
nim_instance = NIM(
    image="nvcr.io/nim/meta/llama3-8b-instruct:1.0.0",
    secrets=NIMSecrets(
        ngc_image_secret="nvcrio-cred",
        ngc_secret_key="ngc-api-key",
        ngc_secret_group="ngc",
        secrets_prefix="_FSEC_",
        hf_token_key="hf-key",
        hf_token_group="hf",
    ),
    hf_repo_ids=["<your-hf-repo-id>"],
    lora_adapter_mem="500Mi",
    env={"NIM_PEFT_SOURCE": "/home/nvs/loras"},
)
```
Native directory and NGC support for LoRA adapters is coming soon.

NIM containers can be integrated into different stages of your AI workflow, including data pre-processing, model inference, and post-processing. Flyte also allows serving multiple NIM containers simultaneously, each with a different configuration on different instances (see the sketch below).

This integration enables you to self-host and serve optimized AI models on your own infrastructure, ensuring full control over costs and data security. By eliminating dependence on third-party APIs for AI model access, you gain not only enhanced control but also potentially lower expenses compared to traditional API services.

For more detailed information, refer to the NIM documentation by NVIDIA.
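As an illustration of serving multiple NIM containers, here is a rough sketch of two differently configured sidecars backing two tasks in one workflow. It reuses the secret names and NIM settings from the examples above; the fine-tuned repo ID and the model name passed to the second task are placeholders:

```python
from typing import Tuple

from flytekit import ImageSpec, Resources, Secret, task, workflow
from flytekit.extras.accelerators import A10G
from flytekitplugins.inference import NIM, NIMSecrets
from openai import OpenAI

image = ImageSpec(name="nim", registry="ghcr.io/flyteorg", packages=["flytekitplugins-inference"])

# Reuses the secret names from the examples above.
secrets = NIMSecrets(
    ngc_image_secret="nvcrio-cred",
    ngc_secret_key="ngc-api-key",
    ngc_secret_group="ngc",
    secrets_prefix="_FSEC_",
)

# Two sidecar configurations: a base model and a LoRA-tuned variant.
base_nim = NIM(image="nvcr.io/nim/meta/llama3-8b-instruct:1.0.0", secrets=secrets)
tuned_nim = NIM(
    image="nvcr.io/nim/meta/llama3-8b-instruct:1.0.0",
    secrets=secrets,
    hf_repo_ids=["<your-hf-repo-id>"],  # placeholder fine-tuned adapter repo
    lora_adapter_mem="500Mi",
    env={"NIM_PEFT_SOURCE": "/home/nvs/loras"},
)

ngc_secret = Secret(group="ngc", key="ngc-api-key", mount_requirement=Secret.MountType.ENV_VAR)


def ask(base_url: str, model: str, prompt: str) -> str:
    # Helper that queries whichever NIM sidecar backs the current task.
    client = OpenAI(base_url=f"{base_url}/v1", api_key="nim")  # api key required but ignored
    out = client.chat.completions.create(model=model, messages=[{"role": "user", "content": prompt}])
    return out.choices[0].message.content


@task(container_image=image, pod_template=base_nim.pod_template, accelerator=A10G,
      secret_requests=[ngc_secret], requests=Resources(gpu="0"))
def serve_base(prompt: str) -> str:
    return ask(base_nim.base_url, "meta/llama3-8b-instruct", prompt)


@task(container_image=image, pod_template=tuned_nim.pod_template, accelerator=A10G,
      secret_requests=[ngc_secret], requests=Resources(gpu="0"))
def serve_tuned(prompt: str) -> str:
    return ask(tuned_nim.base_url, "<your-lora-model-name>", prompt)  # placeholder model name


@workflow
def multi_nim_wf(prompt: str) -> Tuple[str, str]:
    return serve_base(prompt=prompt), serve_tuned(prompt=prompt)
```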