Host: 
Sage Elliott
Location: 
Virtual

Fine-tune & Serve LLMs with LoRA & QLoRA for Production - LLMOps Workshop

Training complex AI models at scale requires orchestrating multiple steps into a reproducible workflow and understanding how to optimize resource utilization for efficient fine-tuning. Modern MLOps and LLMOps tools help streamline these processes, improving the efficiency and reliability of your AI pipelines. This workshop will introduce you to the basics of MLOps and best practices for building efficient AI pipelines for large language models (LLMs).

By completing this workshop, you'll gain hands-on experience structuring scalable and reproducible AI workflows for fine-tuning LLMs using best practices such as caching, versioning, containerized resource utilization, and parameter-efficient fine-tuning (PEFT). We'll use Hugging Face Transformers and Datasets, the PEFT library for implementing LoRA and QLoRA, bitsandbytes for quantization, and Union.ai for scalable workflows, GPUs, and serving our fine-tuned model.
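To see why quantization is so central to QLoRA, here is a back-of-envelope memory estimate, a rough illustration only (real memory use also includes activations, optimizer state, and framework overhead, and exact bitsandbytes figures differ slightly):

```python
# Back-of-envelope memory estimate for loading a 7B-parameter model.
# Illustrative figures only; not exact bitsandbytes numbers.

def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed just to hold the weights, in gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

params = 7e9  # a typical 7B open-weight LLM

fp16_gb = weight_memory_gb(params, 16)  # full half-precision weights
nf4_gb = weight_memory_gb(params, 4)    # 4-bit quantized, as in QLoRA

print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {nf4_gb:.1f} GB")
```

Dropping the base weights from 16 bits to 4 bits is what lets QLoRA fine-tune models on a single consumer GPU.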

This workshop will cover

  • MLOps / LLMOps pipeline basics
  • Fine-tune a Hugging Face LLM with LoRA & QLoRA
  • Build a scalable, reproducible, production-grade workflow
  • Deploy (serve) your fine-tuned LLM in a real-time Streamlit app
  • Concepts covered can transfer to more complex pipelines and models

What you'll need to follow along

  • A free Union.ai account (union.ai)
  • A GitHub account
  • A Google account for Colab

More Session Details

Part 1

MLOps/LLMOps & efficient fine-tuning overview

Get introduced to the concepts around reproducible workflows, best practices for implementing efficient AI pipelines, and why parameter-efficient fine-tuning techniques such as Low-Rank Adaptation (LoRA) and QLoRA (Quantized Low-Rank Adaptation) are widespread.
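To make the LoRA idea concrete before the hands-on section: instead of updating a full weight matrix W, LoRA trains two small low-rank factors B and A and applies the update W' = W + (α/r)·B·A. A minimal pure-Python sketch of the parameter savings (illustrative sizes, not a real model layer):

```python
# LoRA sketch: rather than train a full d_out x d_in weight update,
# train two small factors B (d_out x r) and A (r x d_in).
# Illustrative only; real implementations (e.g. the PEFT library)
# apply this inside a model's attention/linear layers.

d_out, d_in, r = 64, 64, 4  # illustrative layer sizes; r is the LoRA rank

full_update_params = d_out * d_in    # parameters to train W directly
lora_params = d_out * r + r * d_in   # parameters to train B and A instead

print(full_update_params, lora_params)
```

At these toy sizes the trainable parameters drop from 4096 to 512; at LLM scale the same factorization shrinks trainable parameters by orders of magnitude, which is why these techniques are so widespread.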

Part 2

Build a scalable workflow and implement parameter-efficient fine-tuning [hands-on]

In this hands-on section, we'll walk through and run all the code to create our parameter-efficient fine-tuning workflow. We'll implement the following tasks:

  • Download Dataset
  • Download Model
  • Visualize Dataset
  • Fine-tune Model (LoRA & QLoRA)
  • Evaluate Model Performance
  • Perform Batch Inference
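The tasks above can be sketched as a plain-Python pipeline. This is only a conceptual stand-in (a real workflow would use Union/Flyte task and workflow decorators, and the model and dataset names here are hypothetical), but it shows the ordering of steps and how caching avoids recomputing finished ones:

```python
import functools

# Conceptual pipeline: each step is a function. @lru_cache stands in
# for the durable result caching a real orchestrator (e.g. Flyte)
# provides, so repeated runs skip completed downloads.

@functools.lru_cache(maxsize=None)
def download_dataset(name: str) -> str:
    return f"dataset:{name}"

@functools.lru_cache(maxsize=None)
def download_model(name: str) -> str:
    return f"model:{name}"

def fine_tune(model: str, dataset: str, method: str = "qlora") -> str:
    # Placeholder for the LoRA/QLoRA training step.
    return f"{model}+{method}({dataset})"

def evaluate(model: str) -> dict:
    # Placeholder metric; a real step would run an eval harness.
    return {"model": model, "accuracy": 0.0}

def pipeline() -> dict:
    ds = download_dataset("guanaco")    # hypothetical dataset name
    base = download_model("llama-7b")   # hypothetical model name
    tuned = fine_tune(base, ds)
    return evaluate(tuned)

result = pipeline()
print(result["model"])
```

Structuring each step as an isolated, cacheable unit is what makes the workflow reproducible and cheap to re-run when only one step changes.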

Part 3

Serve your fine-tuned LLM for real-time inference with a Streamlit UI

We'll pass our fine-tuned model artifact into an application using Streamlit to create a user interface for interaction.
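Real-time chat UIs typically stream tokens as they are generated rather than waiting for the full response. Independent of Streamlit itself, the pattern can be sketched with a plain generator (`generate_stream` here is a hypothetical stand-in for the model's output, not a real API):

```python
from typing import Iterator

def generate_stream(prompt: str) -> Iterator[str]:
    """Stand-in for a model's token stream; a real app would yield
    tokens from the fine-tuned LLM instead of this canned reply."""
    for token in ["Hello", " from", " the", " fine-tuned", " model"]:
        yield token

def render_chat(prompt: str) -> str:
    # In a Streamlit app, this loop would update a UI placeholder
    # incrementally instead of building a string.
    reply = ""
    for token in generate_stream(prompt):
        reply += token
    return reply

print(render_chat("Hi!"))
```

Streaming keeps the interface responsive: users see the first tokens immediately instead of waiting for the entire completion.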

After this section, you'll have the skills to build an end-to-end production-grade AI pipeline for fine-tuning and serving large language models (LLMs).

About the Speaker

Sage Elliott is an AI Engineer with a background in computer vision, LLM evaluation, MLOps, IoT, and Robotics. He's taught thousands of people at live workshops. You can usually find him in Seattle biking around to parks or reading in cafes, catching up on the latest read for AI Book Club.

Connect with Sage: linkedin.com/in/sageelliott

About Union.ai

Our AI workflow and inference platform unifies data, models, and compute, letting you build, run, and monitor workflows from a single pane of glass.

We also maintain Flyte, an open-source orchestrator that facilitates building production-grade data and ML pipelines.

💬 Join our AI and MLOps Slack Community: slack.flyte.org

⭐ Check out Flyte on GitHub: github.com/flyteorg/flyte

🤝 Learn about everything else we’re doing at union.ai
