Samhita Alla

Building an iOS App to Serve a Fine-Tuned Llama Model with Union and MLC-LLM

So, I had this idea the other day: What if I could get ChatGPT-like AI to run on my iPhone and speak to me in my native language, Telugu? Not through some API or cloud service, but actually running directly on my phone. 

While existing foundation models work well to a certain extent, they aren't particularly fluent in languages other than English. To address this, I’d need to fine-tune an LLM with a dataset of Telugu prompts and responses, package it up, and turn it into an app.

We all know LLMs are demanding when it comes to computing power. They need a lot of resources—whether it’s GPUs, CPUs, or memory—to generate predictions. The reason is simple: even producing a single sentence often requires multiple forward passes through the model, and these models typically have a massive number of parameters.

And I want this app to run on what we call "edge devices." If you're not familiar, edge devices are hardware that operates closer to the end-user or the source of data—like smartphones, tablets, or IoT devices.

In this post, I’ll walk you through how I fine-tuned a Llama 3 8B Instruct model on an A100 using Union Serverless, and then got it to run as an iOS app with the help of MLC-LLM—all at no cost, as Union Serverless offers $30 in free credits!

Fine-tuning Llama 3 on Cohere Aya

First things first: we need to download the Llama 3 8B Instruct model from the Hugging Face hub, along with the Cohere Aya dataset, which contains a diverse collection of prompts and completions across multiple languages.

Dataset for model training

We set up caching on both of these tasks so that the model and dataset don't need to be downloaded again in future runs.

@task(
    cache=True,
    cache_version="0.1",
    ...,
)
def download_dataset(dataset: str, language: str) -> FlyteDirectory:
    from datasets import load_dataset
    ...
    load_dataset(dataset, language, cache_dir=cached_dataset_dir)
    return cached_dataset_dir


@task(
    cache=True,
    cache_version="0.1",
    ...
)
def download_model(model_name: str) -> FlyteDirectory:
    from huggingface_hub import login, snapshot_download
    ...
    snapshot_download(model_name, local_dir=cached_model_dir)
    return cached_model_dir

Next up, we’ll define a task to fine-tune our model. For this, we’ll leverage QLoRA (Quantized Low-Rank Adaptation), which makes fine-tuning feasible with far less memory: the base model is quantized to 4-bit, and only small low-rank adapters are trained on top of it. We’ll be using an A100 GPU, available on Union Serverless, to handle the heavy lifting.

To keep track of training, we’ll set up Weights & Biases to monitor the model’s performance throughout fine-tuning. Once we’re done, we’ll save the adapters and return them as a `FlyteDirectory`. For fine-tuning, we’re starting with 1,000 samples.

@task(
    cache=True,
    cache_version="0.2",
    accelerator=A100,
    ...
)
@wandb_init(project=WANDB_PROJECT, entity=WANDB_ENTITY, secret=WANDB_SECRET)
def train_model(
    train_args: TrainingArguments,
    dataset_dir: FlyteDirectory,
    model_dir: FlyteDirectory,
) -> FlyteDirectory:
    # QLoRA config
    bnb_config = BitsAndBytesConfig(...)

    # Load model
    model = AutoModelForCausalLM.from_pretrained(...)

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(...)

    # LoRA config
    peft_config = LoraConfig(...)
    model = get_peft_model(model, peft_config)

    ...

    # Trainer setup
    trainer = SFTTrainer(...)
 
    trainer.train()
    trainer.model.save_pretrained(train_args.output_dir)

    return FlyteDirectory(train_args.output_dir)
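
The elided QLoRA pieces above boil down to two configs: a `BitsAndBytesConfig` that quantizes the base model to 4-bit, and a `LoraConfig` that defines the low-rank adapters trained on top of it. Here’s a minimal sketch of what they typically look like; the exact values (rank, alpha, target modules, 4-bit settings) are illustrative rather than the ones used in this project:

import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA);
# the settings below are common defaults, not the project's exact values
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Low-rank adapters injected into the attention projections;
# rank, alpha, and target modules are illustrative
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)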

Next, we’ll merge the adapter with the original base model, which will allow us to serve the fully integrated model as an iOS app later on. We’re using an L4 GPU for this step, since an A100 isn’t necessary just for merging the models. One of the great features of Union Serverless is how easy it is to switch GPU types from one task to the next.

Once the task completes, it emits an artifact partitioned on the model and dataset names. This means you can easily retrieve the merged model artifact later in other workflows: simply query the artifact with the names of the dataset and the model.

@task(accelerator=L4, ...)
def merge_model(
    model_name: str,
    dataset: str,
    model_dir: FlyteDirectory,
    adapter_dir: FlyteDirectory,
) -> Annotated[
    FlyteDirectory, ModelArtifact(model=Inputs.model_name, dataset=Inputs.dataset)
]:
    # Reload tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(...)
    base_model_reload = AutoModelForCausalLM.from_pretrained(...)

    ...

    # Merge adapter with base model
    model = PeftModel.from_pretrained(base_model_reload, adapter_dir.path)
    model = model.merge_and_unload()
   
    ...

    model.save_pretrained(merged_model)
    tokenizer.save_pretrained(merged_model)

    return FlyteDirectory(merged_model)
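
For reference, the `ModelArtifact` used in the task signature above is just a flytekit `Artifact` declared with `model` and `dataset` as partition keys. A minimal sketch of that declaration, where the artifact name is an assumption for illustration:

from flytekit.core.artifact import Artifact

# Declaring the artifact with partition keys lets downstream workflows
# query it by model and dataset name, as the launch plan below does.
ModelArtifact = Artifact(name="fine-tuned-model", partition_keys=["model", "dataset"])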

Converting model weights for MLC compatibility

MLC-LLM is a powerful ML compiler and high-performance deployment engine designed specifically for LLMs. It allows you to deploy your models across various platforms, including iOS, Android, web browsers, and even as a Python or REST API.

To get our merged model up and running with MLC-LLM, we first need to convert the model weights into the MLC format. This is a straightforward process that involves running two commands: `mlc_llm convert_weight` to transform the weights into the MLC format, and `mlc_llm gen_config` to generate the chat configuration and process the tokenizers.

@task(accelerator=L4, ...)
def convert_model_weights_to_mlc(
    merged_model_dir: FlyteDirectory, conversion_template: str, quantization: str
) -> FlyteDirectory:
    ...

    subprocess.run(
        [
            "python",
            "-m",
            "mlc_llm",
            "convert_weight",
            merged_model_dir.path,
            "--quantization",
            quantization,
            "-o",
            str(output_dir / "finetuned-model-MLC"),
        ],
        check=True,
    )

    subprocess.run(
        [
            "python",
            "-m",
            "mlc_llm",
            "gen_config",
            merged_model_dir.path,
            "--quantization",
            quantization,
            "--conv-template",
            conversion_template,
            "-o",
            str(output_dir / "finetuned-model-MLC"),
        ],
        check=True,
    )
    return FlyteDirectory(str(output_dir / "finetuned-model-MLC"))

Next, we’ll push the converted model to a Hugging Face repository to make it easily accessible, and create a launch plan that kicks off the conversion workflow whenever a new fine-tuned model artifact is generated.

LaunchPlan.create(
    "finetuning_completion_trigger",
    convert_to_mlc_wf,
    trigger=OnArtifact(
        trigger_on=ModelArtifact,
        inputs={
            "merged_model_dir": ModelArtifact.query(
                model="meta-llama/Meta-Llama-3-8B-Instruct",
                dataset="CohereForAI/aya_collection_language_split",
            )
        },
    ),
)
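
The push to Hugging Face itself can be a small task that uploads the converted directory with the `huggingface_hub` client. Here’s a minimal sketch; the repository ID and token handling are assumptions, not the project’s exact setup:

from huggingface_hub import HfApi

def push_to_hub(mlc_model_dir: str, repo_id: str, token: str) -> str:
    # repo_id (e.g. "your-username/finetuned-llama-3-telugu-MLC") and token
    # are placeholders; wire them up to your own account and secrets management
    api = HfApi(token=token)
    api.create_repo(repo_id=repo_id, exist_ok=True)
    api.upload_folder(folder_path=mlc_model_dir, repo_id=repo_id)
    return repo_id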

To run the workflows on Union Serverless, you can find the necessary commands in the README file.

Building the iOS app

To build the iOS app, you’ll need to be working on macOS, since the build relies on a few key dependencies (Xcode among them) that need to be installed first.

Here’s a script that handles all the setup for you.

Once everything is in place, here’s the code that runs behind the scenes to generate the iOS app:

import json
import os
import subprocess
from pathlib import Path


def ios_local_deployment(
    model_hf_url: str, bundle_weight: bool, model_id: str, estimated_vram_bytes: int
):
    if not os.path.exists("./mlc-llm"):
        subprocess.run(
            ["git", "clone", "https://github.com/mlc-ai/mlc-llm.git", "./mlc-llm"],
            check=True,
        )

        ...

    # Update the JSON structure
    new_model_entry = {
        "device": "iphone",
        "model_list": [
            {
                "model": model_hf_url,
                "model_id": model_id,
                "estimated_vram_bytes": estimated_vram_bytes,
                "bundle_weight": bundle_weight,
            }
        ],
    }

    json_file_path = Path("./mlc-llm/ios/MLCChat/mlc-package-config.json")
    with open(json_file_path, "w") as f:
        json.dump(new_model_entry, f, indent=4)

    ...

    subprocess.run(
        ["python", "-m", "mlc_llm", "package"],
        cwd=mlc_chat_dir,
        # merge with the existing environment so PATH and friends are preserved
        env={**os.environ, "MLC_LLM_SOURCE_DIR": mlc_llm_source_dir},
        check=True,
    )

The `mlc_llm package` command compiles the model, builds the runtime and tokenizer, and creates a `dist/` directory inside the `MLCChat` folder.

At this stage, we’re also bundling the model weights directly into the app to avoid having to download them from Hugging Face every time the app runs—this speeds things up considerably.
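
Putting it together, a call to the helper might look like the following; the Hugging Face URL, model ID, and VRAM estimate are placeholders rather than the exact values from the project:

ios_local_deployment(
    model_hf_url="HF://your-username/finetuned-llama-3-telugu-MLC",  # placeholder repo
    bundle_weight=True,  # bundle the weights into the app instead of downloading them at runtime
    model_id="finetuned-llama-3-telugu",
    estimated_vram_bytes=4_000_000_000,  # rough budget for a 4-bit 8B model; tune for your device
)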

Next, we need to open `./ios/MLCChat/MLCChat.xcodeproj` using Xcode (make sure Xcode is installed and you’ve accepted its terms and conditions). Also, ensure you have an active Apple Developer account, as Xcode may prompt you to use your developer team credentials and set up a product bundle identifier.

If you’re just looking to test the app, follow these steps:

  1. Go to Product > Scheme > Edit Scheme and replace “Release” with “Debug” under “Run”.
  2. Skip adding developer certificates.
  3. Use this bundle identifier pattern: `com.yourname.MLCChat`.
  4. Remove the "Extended Virtual Addressing" capability under the Target section.

Check out the demo showcasing the iOS app I created during the 2024 Union Hackathon!

This is encouraging, and it likely means that fine-tuning on more data samples would yield even better results! You can check out the existing codebase on GitHub here.

Democratizing AI: Your turn to build

Model fine-tuning and deployment should be accessible to everyone, and Union combined with MLC-LLM makes this a reality. With Union, you can fine-tune your models, perform all necessary pre-processing, and generate MLC model artifacts ready for deployment on any platform. Union is an ideal choice for building end-to-end AI solutions, and we’re here to support you every step of the way!

To dive deeper into Union Serverless, check out the documentation. If you have any questions about Serverless, don’t hesitate to join our Slack community in the #union-serverless channel!

Can't wait to see what you'll build! 🚀

Serverless
LLMs
Model Training
Model Deployment
AI Orchestration
Artifacts
GPUs