LLM Observability: A Fireside Chat with John from Arize

In this fireside chat, we talk with Arize’s Head of Developer Relations John Gilhuly about LLM Observability: Tracing, Evaluations, and Real-Time Insights.
This transcript has been lightly edited for readability. Listen to the recording for the full interview!
If you have questions about implementing reproducible workflows in your organization, schedule time with one of our AI Engineers!
<div class="button-group is-center"><a class="button" href="https://www.union.ai/consultation">Book a free consultation</a></div>
LLM Observability Fireside Chat Introduction
Sage Elliott:
Welcome to this Union Fireside Chat! I'll be the host for today's discussion. At Union, we build tools for AI orchestration, making compound AI systems more efficient and reliable. You might be familiar with our open-source project Flyte, used by thousands of companies to put AI into production. We’re also launching a model and app-serving solution (try it now on Union Serverless!), which ties into today’s discussion.
Today's chat is about LLM Observability—how to make LLM-based applications better through observability, tracing, and evaluations. We have John from Arize with us, who will talk about Phoenix, an open-source LLM observability tool.
Before diving in, John, would you like to introduce yourself and share a bit about your background?
Meet John Gilhuly from Arize

John Gilhuly:
Thank you Sage! I’m John, the Head of Developer Relations at Arize. I focus on go-to-market strategy, developer advocacy, events, and content creation. My background is in engineering, sales, and solutions engineering. Before Arize, I worked at a startup called Branch, which powered deep links for apps like Spotify. I later went back to school for my master’s degree and joined Arize in June last year to build out our open-source developer community.
At Arize, we work on helping teams make AI systems more observable, debuggable, and reliable. Our focus with Phoenix is to provide an open-source tool that enables seamless monitoring, analysis, and debugging of LLM-based applications, making it easier for teams to iterate and improve their AI models over time.
What is LLM Observability and Why Does it Matter?
Sage Elliott:
Let’s start with the basics: What is LLM Observability, and why is it crucial for AI applications going into production today?
John Gilhuly:
Observability itself is a common practice in software development: tracking logs, traces, and metrics to understand system performance. With LLMs, the output space is much broader, which creates new challenges in monitoring their behavior. LLM Observability ensures we can:
- Understand what happens inside an LLM-powered application. This includes tracking inputs, outputs, and intermediate processing steps to detect issues and improve interpretability.
- Capture metrics and evaluate performance at a granular level. This means collecting data on latency, response accuracy, hallucination rates, bias detection, and overall effectiveness of the model.
- Detect and mitigate model drift. LLM performance can degrade over time as data distributions change. Continuous monitoring ensures timely retraining and adaptation.
- Build trust in AI applications, preventing unexpected failures. For example, there have been cases where LLM-based chatbots made critical mistakes, such as an AI assistant accidentally listing a Chevy Tahoe for $1. Observability helps catch such errors before they impact real users.
- Improve AI performance iteratively using real-world insights. By monitoring application runs and user interactions, teams can refine prompts, fine-tune models, and create better user experiences over time.
- Ensure compliance and accountability. With growing regulatory scrutiny on AI applications, observability allows organizations to audit AI behavior, ensuring compliance with policies, ethical standards, and legal requirements.
With AI being deployed at scale across industries like healthcare, finance, and customer support, LLM Observability is no longer optional—it’s a necessity for ensuring reliable and responsible AI deployments.
How Phoenix Helps with LLM Observability
Sage Elliott:
How does Phoenix help teams monitor, evaluate, and debug LLM applications?
John Gilhuly:
Phoenix has four key features:
- Tracing: Captures traces of your LLM application to track performance at different steps, helping teams pinpoint where failures occur and optimize response times.
- Evaluation: Supports LLM-based evals, human evals, and code-based evals for assessing AI performance, allowing teams to detect bias, hallucinations, and performance regressions.
- Experimentation: Lets users run experiments across different prompts, test cases, and parameter variations to refine model responses and boost reliability.
- Prompt Optimization: Acts as a prompt hub, enabling A/B testing and optimization of prompts, making it easier to iterate on LLM behavior without requiring full model retraining.
It’s designed to be your second screen while developing—whether you're using VS Code, Cursor, or another IDE, you can keep Phoenix open alongside it for real-time insights.
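For readers who want to try that second-screen workflow, here is a minimal sketch of launching Phoenix locally and auto-tracing an OpenAI-based app. It assumes the arize-phoenix and openinference-instrumentation-openai packages; the project name is a placeholder, and exact APIs can vary between Phoenix versions, so check the Phoenix docs.

```python
# Minimal sketch: launch Phoenix locally and auto-trace OpenAI calls.
# Assumes: pip install arize-phoenix openinference-instrumentation-openai openai
import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Start the local Phoenix UI: the "second screen" next to your IDE.
session = px.launch_app()
print(f"Phoenix UI: {session.url}")

# Point an OpenTelemetry tracer provider at Phoenix ("my-llm-app" is a placeholder name).
tracer_provider = register(project_name="my-llm-app")

# Instrument the OpenAI client so every completion shows up as a trace in Phoenix.
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
```

From here, any OpenAI call your application makes appears in the Phoenix UI in real time while you iterate in your editor.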

What is LLM Tracing & Why is it Important?
Sage Elliott:
Let’s dive into tracing. What is LLM tracing, and can you share a real-world example where it helped debug an issue?
John Gilhuly:
Tracing involves tracking individual function calls, inputs, outputs, and processing times within an LLM pipeline. It provides a hierarchical view of all operations, helping developers understand performance bottlenecks.
Real-World Example
A company was building a coding agent that interacted with databases by generating SQL queries. Users reported getting empty responses even when relevant data existed. Using Phoenix’s tracing, they found:
- The SQL generation step was misformatting column names, causing queries to fail.
- Correcting the prompt responsible for query generation fixed the issue immediately.
- Tracing also revealed that certain API calls were taking too long, leading to timeout errors that hurt the user experience.
Without tracing, debugging would have been much harder—relying on scattered log statements, breakpoints, or manually reproducing errors. By integrating observability, teams can resolve such issues proactively, ensuring smoother AI performance.
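To make the hierarchy of spans concrete, here is a minimal sketch of hand-instrumenting a pipeline like the one above with the OpenTelemetry API that Phoenix’s tracing builds on. The function names are hypothetical, the SQL string stands in for a real LLM call, and the input.value/output.value attribute keys follow OpenInference conventions.

```python
# Minimal sketch: hand-instrumented spans for a SQL-generating pipeline.
# Assumes: pip install opentelemetry-api (Phoenix's register() would supply the exporter).
from opentelemetry import trace

tracer = trace.get_tracer("sql-agent")  # hypothetical component name


def generate_sql(question: str) -> str:
    # Child span: each pipeline step gets its own span, so a misformatted
    # query or a slow call shows up at exactly this step in the trace.
    with tracer.start_as_current_span("generate_sql") as span:
        span.set_attribute("input.value", question)
        sql = 'SELECT "order_total" FROM orders LIMIT 10'  # stand-in for an LLM call
        span.set_attribute("output.value", sql)
        return sql


def answer_question(question: str) -> str:
    # Parent span: nested child spans build the hierarchical view of the pipeline.
    with tracer.start_as_current_span("answer_question") as span:
        span.set_attribute("input.value", question)
        answer = generate_sql(question)
        span.set_attribute("output.value", answer)
        return answer


print(answer_question("What were yesterday's total sales?"))
```

With a Phoenix-backed tracer provider registered, the parent span and its nested children appear as a single trace, so a failing query points directly at the step that produced it.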
Common Failure Points in LLM Pipelines
Sage Elliott:
What are the most common failure points you see when people build LLM applications?
John Gilhuly:
- Thinking LLM applications are “set and forget” – In reality, AI models must be continuously improved through feedback loops and retraining.
- Jumping into complex evaluation systems too early – Start simple: look at your data, analyze errors, then refine the approach based on real-world performance.
- Lack of observability – Without tracing and evaluations, it’s impossible to diagnose problems effectively.
- Ignoring long-term performance degradation – Without monitoring for model drift and response degradation, teams risk deploying AI solutions that gradually lose accuracy over time.
Evaluations: Best Practices & Metrics
Sage Elliott:
Evaluating LLM performance is tricky. What are the best practices and key metrics for evaluation?
John Gilhuly:
- Use human-labeled datasets to establish ground truth and compare model outputs against known correct responses.
- Leverage code-based evaluations (unit tests, deterministic checks) when possible to automate validation.
- Use LLM-based evaluators to scale evaluations efficiently and analyze linguistic nuances beyond simple accuracy metrics.
Key Metrics:
- Context Adherence – Ensures responses stay relevant to retrieved data.
- Relevance of Retrieved Chunks – Important for RAG (Retrieval-Augmented Generation) systems.
- Bias/Fairness Metrics – Measures inconsistencies across user demographics.
- Toxicity and Safety Scores – Identifies harmful or inappropriate responses.
- Response Latency & Efficiency – Measures inference time and processing speed for real-time applications.
Phoenix provides tools for evaluating LLMs against ground truth datasets and experimenting with different prompts and models to drive continuous improvement.
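As an illustration, a minimal sketch of an LLM-as-judge hallucination check with Phoenix’s evals package might look like the following. It assumes arize-phoenix-evals, pandas, and an OpenAI API key; argument names have shifted between versions, so check the current docs for the exact signature.

```python
# Minimal sketch: score responses for hallucination with an LLM judge.
# Assumes: pip install arize-phoenix-evals pandas openai, and OPENAI_API_KEY set.
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Each row pairs a user question, the retrieved context, and the model's answer.
df = pd.DataFrame(
    {
        "input": ["What is the return policy?"],
        "reference": ["Items can be returned within 30 days with a receipt."],
        "output": ["You can return items within 30 days if you have a receipt."],
    }
)

evals_df = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o-mini"),  # judge model; a placeholder choice
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,  # ask the judge to explain its label
)
print(evals_df[["label", "explanation"]])
```

The same pattern extends to relevance, toxicity, and custom templates, and the resulting labels can be logged back to Phoenix alongside the traces they evaluate.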
Conclusion
Thank you, John, for sharing your insights! If you’re interested in Phoenix, check out their GitHub repo and their course on DeepLearning.AI. Thanks to everyone who joined the discussion!
If you have questions about implementing reproducible workflows in your organization, schedule time with one of our AI Engineers!