Samhita Alla
Zain Hasan
Kristy Cook

Building Production-Ready Compound AI Applications Just Got a Lot Easier: A RAG Example

Imagine if building compound AI apps were as easy as piecing together Legos. At Union, we're abstracting the infrastructure layer to make this a reality. In collaboration with Together AI, you can effortlessly build and deploy contextual RAG applications directly from your Jupyter notebook, with no context switching between development and deployment environments.


 

Building AI applications used to be a real headache. Trust me - I've spent countless nights debugging pipelines and wrestling with infrastructure instead of actually building cool stuff. And I know I'm not alone. While everyone's excited about the potential of generative AI, getting it to work reliably in production is a whole different story.

At Union, we've always focused on making AI infrastructure feel invisible - the kind of platform that just works so you can focus on building what matters. That's why we're particularly excited about our integration with Together AI. Their APIs are exceptional, and now they're woven into Union's orchestration capabilities.

The result? You can build and deploy contextual RAG applications right from your Jupyter notebook. No more context switching between development and deployment environments. No more wrestling with infrastructure when you should be innovating. 

Let me show you how we're making this possible, but first let’s look at why deploying RAG applications is such a challenge today. 

The RAG reality check: Why building production-ready AI applications is harder than it looks

Every breakthrough in AI starts with a promising prototype. But the gap between a notebook demo and a production-ready RAG application is where great ideas go to die. Let's break down the real-world challenges:

Infrastructure Overhead: More DevOps, Less Innovation 

What starts as an exciting AI project quickly turns into a computational Tetris game. Developers find themselves spending more time wrestling with compute instances, managing dependencies, and configuring environments than actually building intelligent systems. 

Scaling: The Performance Bottleneck 

Resource-intensive tasks like embedding generation don't just require computing power—they demand dynamic infrastructure. Most teams hit a wall when their prototype can't handle real-world data volumes or concurrent users. Scaling becomes a complex dance of resource allocation, cost management, and performance optimization.

Data Complexity: The Vector Database Maze 

Vector databases aren't just storage—they're sophisticated engines of semantic search and retrieval. Efficiently updating indices, managing data freshness, and maintaining low-latency retrieval is anything but trivial. What looks simple in a demo becomes a labyrinth of technical challenges in production.

Workflow Fragmentation: The Tool Switching Penalty 

Today's RAG workflows are a Frankenstein's monster of tools: one for embedding, another for vector search, yet another for model serving. Each context switch costs time, introduces potential errors, and fragments the developer's focus. 

Reproducibility: The Invisible Tax 

Sharing a RAG workflow across teams shouldn't feel like passing a complex secret handshake. Inconsistent environments, hidden dependencies, and non-deterministic behaviors make reproducing results feel like an academic research project—not a streamlined engineering process.

Understanding contextual RAG

“… 60% of LLM applications use some form of retrieval-augmented generation (RAG).” – The Shift from Models to Compound AI Systems

RAG (Retrieval-Augmented Generation) is a technique that enhances large language models (LLMs) by dynamically retrieving relevant information from an external knowledge base and passing it into the context of a language model before generating a response.

Traditional RAG works like this (a minimal sketch follows the list):

  1. A query comes in
  2. A retrieval system searches a document collection
  3. The most relevant documents are retrieved
  4. These documents are added to the model's context
  5. The model generates a response using both the original query and retrieved context
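
Concretely, the query-time loop fits in a few lines. The snippet below is a minimal, self-contained sketch rather than the code from this post's application: it assumes a tiny in-memory corpus and a TOGETHER_API_KEY environment variable, and it uses Together AI's embeddings and chat completions endpoints.

# Minimal traditional-RAG loop: embed the query, retrieve the closest chunks,
# and generate an answer with the retrieved text in the prompt.
import numpy as np
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

chunks = ["Essay chunk one ...", "Essay chunk two ..."]  # toy in-memory corpus
chunk_vectors = np.array([
    item.embedding
    for item in client.embeddings.create(
        model="BAAI/bge-large-en-v1.5", input=chunks
    ).data
])

def answer(query: str, top_k: int = 2) -> str:
    # 1-2. Embed the incoming query
    query_vec = np.array(
        client.embeddings.create(
            model="BAAI/bge-large-en-v1.5", input=query
        ).data[0].embedding
    )
    # 3. Retrieve the most relevant chunks by cosine similarity
    scores = chunk_vectors @ query_vec / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vec)
    )
    top = [chunks[i] for i in scores.argsort()[::-1][:top_k]]
    # 4-5. Add the retrieved chunks to the model's context and generate a response
    completion = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1",
        messages=[{
            "role": "user",
            "content": "Context:\n" + "\n\n".join(top) + f"\n\nQuestion: {query}",
        }],
    )
    return completion.choices[0].message.content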

Contextual RAG takes this further: before indexing, it captures the broader document context in which each chunk appears, and at query time it combines semantic (embedding-based) and keyword-based retrieval.

How contextual retrieval preprocessing works, explained (Source: Anthropic)
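
The preprocessing side can be sketched just as briefly. The example below is illustrative, not the exact prompt or helper used in the workflow later in this post: each chunk is sent to the LLM together with the full document, and the short situating context the model returns is prepended to the chunk before it is embedded and keyword-indexed.

# Sketch of contextual chunk preprocessing: ask the LLM to situate each chunk
# within the full document, then prepend that context to the chunk text.
from together import Together

client = Together()

def contextualize(document: str, chunk: str) -> str:
    prompt = (
        f"<document>\n{document}\n</document>\n"
        f"Here is a chunk from the document:\n<chunk>\n{chunk}\n</chunk>\n"
        "Write a short context that situates this chunk within the overall "
        "document, to improve search retrieval. Answer with the context only."
    )
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1",
        messages=[{"role": "user", "content": prompt}],
    )
    context = response.choices[0].message.content
    # The contextualized chunk is what gets embedded and added to the BM25 index.
    return f"{context}\n\n{chunk}"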

The Union + Together AI solution: Making contextual RAG work at scale

We've combined Union's orchestration capabilities with Together AI's powerful APIs to create something that just works. No more infrastructure headaches. No more complex deployments. Just clean, efficient RAG applications that scale.

Here's what this integration delivers:

  • Together AI brings industry-leading LLM, embedding, and reranker APIs to the table.
  • Union handles the messy parts: orchestration, scaling, and deployment, turning your notebook experiments into production applications.

Everything stays in your development flow - from prototyping to deployment. Here's an example using Paul Graham's essays:

  1. Web Scraping: Fetch and process essay content
  2. Text Processing: Split essays into chunks, preserving context
  3. Embeddings: Generate and store vectors using Together AI's API
  4. Search Enhancement: Build keyword indices for faster retrieval
  5. Deployment: Launch a FastAPI endpoint and Gradio interface

You can define the Union workflow like this:

import functools
from typing import Annotated

import union
from flytekit.types.directory import FlyteDirectory
from flytekit.types.file import FlyteFile

# BM25Index and ContextualChunksJSON are Union Artifact definitions, and the
# tasks referenced below (parse_main_page, scrape_pg_essays, create_chunks,
# generate_context, create_vector_index, create_bm25s_index) are defined
# elsewhere in the project.

@union.workflow
def build_indices_wf(
    base_url: str = "https://paulgraham.com/",
    articles_url: str = "articles.html",
    embedding_model: str = "BAAI/bge-large-en-v1.5",
    chunk_size: int = 250,
    overlap: int = 30,
    model: str = "deepseek-ai/DeepSeek-R1",
    local: bool = True,
) -> tuple[
    Annotated[FlyteDirectory, BM25Index], Annotated[FlyteFile, ContextualChunksJSON]
]:
    tocs = parse_main_page(base_url=base_url, articles_url=articles_url, local=local)
    scraped_content = union.map_task(scrape_pg_essays, concurrency=2)(document=tocs)
    chunks = union.map_task(
        functools.partial(create_chunks, chunk_size=chunk_size, overlap=overlap)
    )(document=scraped_content)
    contextual_chunks = union.map_task(functools.partial(generate_context, model=model))(
        document=chunks
    )
    union.map_task(
        functools.partial(
            create_vector_index, embedding_model=embedding_model, local=local
        ), 
        concurrency=2
    )(document=contextual_chunks)
    bm25s_index, contextual_chunks_json_file = create_bm25s_index(
        documents=contextual_chunks
    )
    return bm25s_index, contextual_chunks_json_file

We use Together AI APIs while generating the context and embeddings.

By using Union map tasks, we run operations in parallel while respecting the resource constraints of each task. This approach significantly improves execution speed. 
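
For illustration, here is roughly what a map-task fan-out with per-task resource requests looks like; the task below is a placeholder, not one of the tasks from the workflow above.

# Each mapped task runs in its own container with the requested resources,
# and `concurrency` caps how many copies run at once.
import union

@union.task(requests=union.Resources(cpu="2", mem="4Gi"))
def embed_document(document: str) -> list[float]:
    return [0.0] * 1024  # placeholder for a real embedding call

@union.workflow
def embed_all(documents: list[str]) -> list[list[float]]:
    return union.map_task(embed_document, concurrency=4)(document=documents)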

Union’s execution engine ensures that all artifacts, workflow executions, inputs, and outputs are versioned and reproducible. Teams can share workflows and results without worrying about the common “it works on my machine” scenario.

The final output of this workflow includes the BM25S keyword index and the contextual chunks mapping file, both returned as Union Artifacts. The Milvus vector database will be hosted in the cloud, ensuring easy access and scalability.
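
For reference, the two artifact handles can be declared along these lines; this is a sketch assuming Union's Artifact API, with illustrative names matching the annotations in the workflow signature above.

# Artifact declarations that the workflow's outputs are published under.
import union

BM25Index = union.Artifact(name="bm25s_index")
ContextualChunksJSON = union.Artifact(name="contextual_chunks_json")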

Your prototype becomes production-ready without leaving your notebook, running both locally and on a Union cluster. Teams get enterprise-grade AI infrastructure without the traditional overhead.

Smart cost management, built in

I used to think AI infrastructure costs were just a necessary tradeoff—you either paid for overprovisioned resources or dealt with slow, inefficient workloads. But with Union, that’s not the case.

Task-level monitoring gives you a clear view of where your resources are going, and real-time cost tracking helps you catch and optimize expensive operations before they get out of hand. Plus, built-in optimizations like map tasks, reusable containers called actors, and result caching make sure you’re not wasting compute time.
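
In code, those optimizations are mostly one-line changes. Here is a sketch assuming Union's caching and ActorEnvironment APIs; the task bodies are placeholders.

# Cached tasks skip recomputation when their inputs haven't changed; actor
# tasks reuse a warm container instead of starting a new one per invocation.
import union

@union.task(cache=True, cache_version="1.0")
def scrape_and_parse(url: str) -> str:
    return url  # placeholder for scraping / parsing work

actor = union.ActorEnvironment(
    name="embedding-actor",
    replica_count=1,
    ttl_seconds=300,  # keep the container warm for 5 minutes between tasks
)

@actor.task
def embed(text: str) -> list[float]:
    return [0.0] * 1024  # placeholder for an embedding call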

No more surprise bills. No more idle infrastructure eating into your budget. Just efficient, production-grade AI that scales the way you need it to.

Building and deploying apps on Union

Now that the “ingestion” workflow is set up to run locally and remotely on a specific cadence, it’s time to look at deploying apps with Union’s serving capabilities.
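
(As an aside, one way to put the ingestion workflow on that cadence is a scheduled launch plan. The sketch below uses the flytekit scheduling API that Union builds on; the cron expression and launch plan name are just examples.)

# Run the ingestion workflow every day at 06:00 UTC.
# Assumes this lives in the same module as build_indices_wf.
from flytekit import CronSchedule, LaunchPlan

daily_ingestion = LaunchPlan.get_or_create(
    workflow=build_indices_wf,
    name="daily_contextual_rag_ingestion",
    schedule=CronSchedule(schedule="0 6 * * *"),
)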

The RAG app is served by two applications deployed on Union: a FastAPI backend and a Gradio front end. Union's serving capabilities let you build and serve custom web apps alongside your workflows.

  • FastAPI App: Handles queries by retrieving relevant chunks from the vector database and the BM25 index, then reranking the combined results with reciprocal rank fusion (RRF); see the sketch after this list. For response generation, we use the DeepSeek-R1 model.
  • Gradio App: Provides a user-friendly interface for contextual RAG applications, powered by the FastAPI backend.
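
Reciprocal rank fusion itself is only a few lines. Here is a minimal sketch of the fusion step; the document IDs below are placeholders standing in for the vector-search and BM25 results.

# Reciprocal rank fusion: a document's fused score is the sum of 1 / (k + rank)
# over every ranked list it appears in.
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_results = ["chunk-3", "chunk-7", "chunk-1"]  # from the vector database
bm25_results = ["chunk-7", "chunk-2", "chunk-3"]    # from the BM25 index
fused = reciprocal_rank_fusion([vector_results, bm25_results])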

Both apps can be deployed on Union’s serverless platform, which scales automatically based on traffic, ensuring a smooth user experience.

While FastAPI and Gradio are used in this example, any Python-based front-end and API framework can be used to define apps. Here's how the FastAPI app is defined:

import union
from union.app import App  # Union's app-serving API

fastapi_app = App(
    name="contextual-rag-fastapi-app",
    inputs=[...],
    container_image=union.ImageSpec(...),
    limits=union.Resources(cpu="3", mem="10Gi"),
    port=8080,
    include=["./fastapi_app.py"],
    command=["fastapi", "dev", "--port", "8080"],
    min_replicas=1,
    max_replicas=1,
    secrets=[...],
)

You can find the contextual RAG application code here. Run this notebook directly to deploy your production-grade application on Union serverless!

Next steps

Building and deploying Contextual RAG applications is now simpler than ever with Union and Together AI. By removing infrastructure complexity, offering seamless compute scalability, and ensuring reproducible workflows, this integration enables teams to focus on achieving results faster and more efficiently.

Union.ai: Reach out to book a consultation to find out more and experience the future of scalable AI workflows today!

Together.ai: Try out the best Gen AI models on the Together AI platform, including DeepSeek-R1 and more!
