Samhita Alla

Serve Fine-tuned LLMs with Ollama

Ollama is a platform designed to simplify running open-source large language models (LLMs) locally. Deploying LLMs is often challenging due to varying model configurations and the lack of a standardized serving method. Ollama takes the headache out of this process by packaging everything you need to run an LLM—from model weights to system prompts—into a single, easy-to-use modelfile. The best part? Serving these models is incredibly straightforward. You can spin up a server, pull or create a custom model, and instantly start generating predictions.

With the Ollama plugin, serving fine-tuned models becomes even more efficient. Whether you’re fine-tuning a model for further evaluation or deploying an LLM for downstream tasks, the plugin makes it easy to integrate Ollama into Flyte tasks without the usual complexities of orchestrating serving infrastructure. Just instantiate the plugin, and you’ll have direct access to your model right within your Flyte task. It’s really that simple!

Simplify and accelerate AI model fine-tuning and serving with full control over infrastructure, costs, and data security

To truly appreciate what the Ollama plugin offers, it's helpful to understand what challenges exist without it.

While Ollama is a powerful tool on its own—and, in my experience, incredibly easy to set up—there are specific scenarios where the plugin really shines:

  1. Serving models on general-purpose hardware: After fine-tuning your model on a GPU, you might need to see how it performs on other types of hardware, whether accelerated or general-purpose. You should be able to switch between these hardware options with ease.
  2. Running batch inference within your AI pipeline: Generate predictions over large batches of inputs as a step in a larger pipeline, rather than standing up and calling a separate real-time endpoint.

In the first scenario, fine-tuning on a GPU is often essential, but it can be cumbersome to handle fine-tuning and model serving in two separate environments. The Ollama plugin simplifies this by allowing you to preprocess data, fine-tune your model, and generate predictions all within a single, cohesive pipeline. This means you can serve your model right after fine-tuning, without the usual infrastructure hassles. Union abstracts away the complexity of fine-tuning, while the plugin allows you to serve models locally within a Flyte task, minimizing network overhead. Even better, you can easily configure different hardware for model serving, giving you full flexibility and control.

The second scenario highlights batch inference, which is becoming a staple in AI pipelines due to its cost efficiency compared to real-time inference. Real-time processing isn’t always necessary, and batch inference offers a practical alternative. With Union’s orchestration capabilities, you can easily integrate batch inference using the Ollama plugin, serving both pre-trained and custom LLMs. You can also build Retrieval Augmented Generation (RAG) batch inference pipelines with ease using this plugin.

This plugin enables you to self-host and serve optimized AI models on your own infrastructure powered by Union, ensuring full control over costs and data security. By eliminating dependence on third-party APIs for AI model access, you gain not only enhanced control but also potentially lower expenses compared to traditional API services. 

Union also gives you the flexibility to serve multiple Ollama models at the same time, each with its own configuration on different instances (see the sketch after this list). This setup allows you to:

  • Boost throughput with parallel processing: Handle multiple Ollama sidecar container tasks simultaneously for better performance.
  • Build specialized applications with ease: Create and manage a variety of tasks within a single pipeline, tailored to your specific needs.
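
Here’s a minimal sketch of what that can look like, assuming two off-the-shelf models that the plugin pulls by name when no modelfile is supplied; the task names, prompts, and model choices are illustrative and not part of the walkthrough that follows:

from flytekit import task, workflow
from flytekitplugins.inference import Model, Ollama

# Two independent Ollama sidecar configurations. The model names are
# illustrative; any model Ollama can pull by name should work here.
summarizer_instance = Ollama(model=Model(name="phi3:mini-4k"))
classifier_instance = Ollama(model=Model(name="llama3"))

@task(pod_template=summarizer_instance.pod_template)
def summarize(texts: list[str]) -> list[str]:
    from openai import OpenAI

    client = OpenAI(base_url=f"{summarizer_instance.base_url}/v1", api_key="ollama")
    return [
        client.chat.completions.create(
            model="phi3:mini-4k",
            messages=[{"role": "user", "content": f"Summarize: {text}"}],
        ).choices[0].message.content or ""
        for text in texts
    ]

@task(pod_template=classifier_instance.pod_template)
def classify(texts: list[str]) -> list[str]:
    from openai import OpenAI

    client = OpenAI(base_url=f"{classifier_instance.base_url}/v1", api_key="ollama")
    return [
        client.chat.completions.create(
            model="llama3",
            messages=[{"role": "user", "content": f"Label the sentiment of: {text}"}],
        ).choices[0].message.content or ""
        for text in texts
    ]

@workflow
def multi_model_pipeline(texts: list[str]) -> tuple[list[str], list[str]]:
    # The two tasks have no dependency on each other, so Flyte runs them in
    # parallel, each in its own pod with its own Ollama sidecar.
    return summarize(texts=texts), classify(texts=texts)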

Fine-tune and serve a Phi3 model

The plugin is best understood through a practical example. Let’s walk through the process of creating a PubMed dataset, fine-tuning a Phi3 model, and then serving it. The source code of this workflow can be found here.

The dataset creation can be handled in a single Flyte task. 

from pathlib import Path

from flytekit import current_context, task
from flytekit.types.directory import FlyteDirectory

@task(...)
def create_dataset(queries: list[str], top_n: int) -> FlyteDirectory:
    from ollama.pubmed_dataset import create_dataset

    working_dir = Path(current_context().working_directory)
    output_dir = working_dir / "dataset"

    create_dataset(
        output_dir,
        queries=queries,
        top_n=top_n,
    )
    return FlyteDirectory(path=str(output_dir))

Next, we’ll fine-tune the model on a GPU, allocating the necessary resources to ensure optimal performance. 

from flytekit import Resources, task
from flytekit.extras.accelerators import T4

# TrainingConfig, PEFTConfig, and the helper functions below come from the
# accompanying example code.
@task(
    ...
    accelerator=T4,
    requests=Resources(mem="10Gi", cpu="2", gpu="1"),
    environment={"TOKENIZERS_PARALLELISM": "false"},
)
def phi3_finetune(
    train_args: TrainingConfig, peft_args: PEFTConfig, dataset_dir: FlyteDirectory
) -> tuple[FlyteDirectory, FlyteDirectory]:
    model = load_model(train_args)
    tokenizer = initialize_tokenizer(train_args.model)
    dataset_splits = prepare_dataset(dataset_dir, train_args, tokenizer)
    trainer = create_trainer(model, train_args, peft_args, dataset_splits, tokenizer)
    save_model(trainer, train_args)

    return FlyteDirectory(train_args.adapter_dir), FlyteDirectory(train_args.output_dir)

Once fine-tuning is complete, the task returns two directories: one containing the LoRA adapter and the other containing the fine-tuned model.

To serve the model with Ollama, the LoRA adapter must be converted to the GGUF format. Alternatively, you can opt to convert it into the Safetensors format, depending on your deployment needs.

import subprocess
import sys

from flytekit.types.file import FlyteFile

@task(...)
def lora_to_gguf(adapter_dir: FlyteDirectory, model_dir: FlyteDirectory) -> FlyteFile:
    ...
    subprocess.run(
        [
            sys.executable,
            "/root/llama.cpp/convert_lora_to_gguf.py",
            adapter_dir.path,
            "--base",
            model_dir.path,
            "--outfile",
            str(output_dir / "model.gguf"),
            "--outtype",
            "q8_0",  # quantize the model to 8-bit float representation
        ],
        check=True,
    )

    return FlyteFile(str(output_dir / "model.gguf"))

We’ll then instantiate the `Ollama` class from `flytekitplugins.inference`. Make sure to install the plugin first with `pip install flytekitplugins-inference`.

from flytekitplugins.inference import Model, Ollama

ollama_instance = Ollama(
    model=Model(
        name="phi3-pubmed",
        modelfile='''
FROM phi3:mini-4k
ADAPTER {inputs.gguf}

TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>"""

PARAMETER stop "<|end|>"
PARAMETER stop "<|user|>"
PARAMETER stop "<|assistant|>"
PARAMETER stop "<|system|>"
PARAMETER num_predict 512
PARAMETER seed 42
PARAMETER temperature 0.05

SYSTEM """
You are a medical research assistant AI that has
been fine-tuned on the latest research. Use the latest knowledge beyond your
initial training data cutoff to provide the most up-to-date information.
"""
''',
    )
)

@task(pod_template=ollama_instance.pod_template, ...)
def model_serving(questions: list[str], gguf: FlyteFile) -> list[str]:
    from openai import OpenAI

    client = OpenAI(
        base_url=f"{ollama_instance.base_url}/v1", api_key="ollama"
    )  
    ...

In the code, we define a `modelfile` that references the base model and the fine-tuned LoRA adapter, along with a custom system prompt. Flyte materializes `{inputs.gguf}` at runtime, substituting it with the task’s `gguf` input. From the instantiated `Ollama` object, you can access the base URL and hit the local endpoint to generate predictions.
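
The body of `model_serving` is elided above. As a rough sketch, the predictions might be generated through Ollama’s OpenAI-compatible endpoint with a simple loop over the questions; the prompt format and the plain decorator below are assumptions, not the exact code from the example:

@task(pod_template=ollama_instance.pod_template)
def model_serving(questions: list[str], gguf: FlyteFile) -> list[str]:
    from openai import OpenAI

    # The sidecar exposes an OpenAI-compatible API; the API key is required by
    # the client but unused by Ollama.
    client = OpenAI(base_url=f"{ollama_instance.base_url}/v1", api_key="ollama")

    # `gguf` isn't referenced in the body, but declaring it as an input lets the
    # sidecar materialize `{inputs.gguf}` in the modelfile.
    responses = []
    for question in questions:
        completion = client.chat.completions.create(
            model="phi3-pubmed",  # the model name declared in the modelfile
            messages=[{"role": "user", "content": question}],
        )
        responses.append(completion.choices[0].message.content or "")
    return responses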

All these tasks can be encapsulated within a Flyte workflow, as demonstrated below:

from flytekit import workflow

@workflow
def phi3_ollama(...) -> list[str]:
    dataset_dir = create_dataset(queries=queries, top_n=top_n)
    adapter_dir, model_dir = phi3_finetune(
        train_args=train_args, peft_args=peft_args, dataset_dir=dataset_dir
    )
    gguf_file = lora_to_gguf(adapter_dir=adapter_dir, model_dir=model_dir)
    return model_serving(
        questions=model_queries,
        gguf=gguf_file,
    )

Union makes managing Ollama workflows a breeze by handling the entire lifecycle. Here’s how it benefits you:

  • Single, cohesive pipeline: Bring everything together in one Flyte workflow. From data processing to model serving, keep everything in one place without having to deal with disconnected pieces.
  • Per-task resource allocation: Customize how much CPU, GPU, and memory each task gets. You can allocate more resources for heavy lifting like fine-tuning and scale back for lighter tasks like data processing and model serving.
  • Caching: Save time and avoid redundant work by caching tasks with the same inputs, so unchanged steps aren’t recomputed (see the snippet after this list).
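
For instance, resource requests and caching are both ordinary `@task` arguments; the task, resource values, and cache version below are placeholders rather than the settings used in this walkthrough:

from flytekit import Resources, task

@task(
    requests=Resources(mem="2Gi", cpu="1"),  # modest resources for a light step
    cache=True,           # reuse results when the inputs haven't changed
    cache_version="1.0",  # bump this to invalidate previously cached results
)
def preprocess(queries: list[str]) -> list[str]:
    # trivial placeholder body
    return [query.strip().lower() for query in queries]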

Some sections of the code in this post were contributed by Niels Bantilan.

Start serving your models today

Ready to dive in? With the Ollama plugin, you can start serving your fine-tuned models right away. It’s easy to generate batch predictions by serving models locally within Flyte tasks. Plus, you can plug Ollama into various stages of your AI workflow, from data pre-processing to model inference and even post-processing and analysis.

Thinking about building production-grade AI pipelines? Feel free to reach out to the Union team—we’re here to help you make it happen.
