We’ve all visited websites with the chatbot icon on the lower right. Honestly, I think those can be pretty great solutions for finding answers to frequently asked questions. They tend to fall flat, however, when you ask more complicated questions, and that’s usually when you’d be routed to a human. But humans aren’t always available, so adding AI as a better-than-business-rules, more-available-than-humans layer in these chat boxes is pretty appealing.
Now you need to keep track not only of what your customers are asking for in the chat, but also of what your AI Agents are up to behind the scenes. What API calls are they making? Where are they getting the data they’re using to answer your questions?
There are vendors who will answer those questions for you, BUT let’s build our own so we can learn something.
The omnipresent lower-right chat window
AI Agents are quickly becoming the next exciting development in the AI world. Running one is easy; running thousands requires a bit more technology. How do you track agent successes and failures over time? How do you drill into failure reasons? Being able to answer these questions as you scale your AI Agent capabilities is crucial for long-term AI success.
There are three core aspects of agent observability which ring true for any system that requires monitoring at scale:
The agent (or source) needs to emit events as they happen, with associated metadata such as run IDs and timestamps (see the sketch after this list)
The observability system needs to process and handle those events efficiently while maintaining data resilience if the source or destination encounters issues
The destination (a datastore of some kind) needs to handle large volumes of events while stitching related events together to help tell a broader story.
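Concretely, a single emitted event can be as simple as a dictionary of metadata. Every field name in this sketch is illustrative, not from any particular framework:

import uuid
from datetime import datetime, timezone

# A hypothetical event payload; the field names are illustrative.
event = {
    "run_id": str(uuid.uuid4()),                       # unique ID for this run
    "parent_run_id": None,                             # links child runs to a session
    "event_type": "on_llm_start",                      # what just happened
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "payload": {"prompt": "tell me the weather"},      # event-specific data
}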
There are many technologies that can perform these tasks for us, including dedicated systems like LangSmith and Datadog. But if you want or need total control of the system, you’d need to implement each part yourself. It’s also a great way to learn how LLM frameworks work internally!
How AI imagines what “AI using tools” looks like.
Agents in AI operate a few levels above your basic chat or prompt-based LLM interaction. They can have access to “the real world” through Tools which typically interact with APIs on your behalf. By giving an Agent multiple tools, it can creatively solve problems within those realms.
For example, if you were to ask a non-agent LLM “tell me the weather”, it’d likely respond with “I’m sorry, I don’t have access to the current weather.” That’s because the base LLM is trained on historical data — up-to-date weather data can’t be included in an LLM’s training data set.
A single step (checking the weather) is relatively simple to understand and, from a human-validation perspective, simple to verify (look outside: is it raining?).
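To make that step concrete, here’s a minimal sketch of a weather tool using LangChain’s @tool decorator. The lookup itself is stubbed out; a real version would call a weather API:

from langchain_core.tools import tool

@tool
def get_weather(city: str) -> str:
    """Return the current weather for a given city."""
    # A real implementation would call a weather API here;
    # this stub is purely illustrative.
    return f"It is currently sunny in {city}."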
Things can quickly get more complicated with more tools being given to an Agent. Let’s expand this example and imagine our agent will help us plan a weekend trip — what tools would it need?
Weather tool — Will it be sunny, raining, or snowing?
Attendee tool — Who is available and what are their preferences?
Navigation tool — How can the attendees get to the weekend trip?
Booking tool — What reservations at hotels, restaurants, and museums are needed?
AI checking the weather
These broad tool descriptions greatly oversimplify how tools really work, but they illustrate the separation of duties between tools. Normally you’d want to be more specific so that each tool does one thing well. A real implementation of a weekend trip planning agent could have dozens of tools.
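As a sketch of that granularity, the broad “booking tool” above might really be several narrower tools. The names and bodies here are hypothetical stubs:

from langchain_core.tools import tool

@tool
def book_hotel(city: str, check_in: str, nights: int) -> str:
    """Reserve a hotel room for the trip."""
    return f"Booked {nights} nights in {city} from {check_in}."  # stub

@tool
def book_restaurant(city: str, date: str, party_size: int) -> str:
    """Reserve a restaurant table for the group."""
    return f"Table for {party_size} reserved in {city} on {date}."  # stub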
If we consider the high level steps to planning a trip, we might expect something like this:
Check the weather and come up with a few options
Poll the possible attendees and see who would prefer which locations
Decide on the location and tell the attendees
Based on who can attend, plan and present travel and lodging preferences and accommodations
Gather some initial activity ideas and poll the group of attendees what they’d prefer to do
Base activities on the weather forecast, prioritizing museum visits during non-sunny days
Monitor and adjust the schedule as situations change (late flights, last minute cancellations, dangerous weather)
While this is a toy example, a late flight could alter an entire day’s plans for certain members of the trip. How would the Agent respond to that? If you were providing this Agent as a service to your customers, how would you monitor not just this single trip but the thousands of other trips happening that weekend? How would you show a customer the Agent’s thought process as it modified plans?
All these questions are impossible to answer without a rigorous agent observability system. So let’s build one!
Many LLM frameworks have a capability to observe or alter data as it is sent or generated, using functions known as Callbacks. Generally, callbacks are triggered when various processes occur within a broader system. In the case of LLM frameworks, callbacks can be triggered when you prompt an LLM, when it responds, or when it begins and finishes processing data.
An actual image of AI emitting events.
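One way to tap into those callbacks in LangChain is to subclass BaseTracer and forward every run event to your event pipeline, which is exactly what the tracer below does with Rudderstack: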
from typing import Any
import warnings

from langchain_core.tracers.base import BaseTracer
from langchain_core.tracers.schemas import Run
from pydantic import PydanticDeprecationWarning

import rudderstack.analytics as rudder_analytics

rudder_analytics.write_key = "your_write_key"
rudder_analytics.dataPlaneUrl = "https://your_dataplane_url"
rudder_analytics.sync_mode = True  # this sends events as they occur, IMPORTANT!


class RudderstackTracer(BaseTracer):
    name: str = "RudderstackTracer"

    def __init__(self, **kwargs: Any) -> None:
        super().__init__(**kwargs)

    def _run_to_dict(self, run: Run) -> dict:
        """
        Use this method to tweak what you actually send to your downstream system.
        """
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", category=PydanticDeprecationWarning)
            run_dict = run.dict(exclude={"child_runs", "inputs", "outputs"})
        # run_dict.pop("my_sensitive_data_key", None)  # example key removal
        return {
            **run_dict,
            "inputs": run.inputs.copy() if run.inputs is not None else None,
            "outputs": run.outputs.copy() if run.outputs is not None else None,
        }

    def handle_event(self, event_type: str, run: Run) -> None:
        run_id = str(run.id)
        rudder_analytics.track(run_id, event_type, self._run_to_dict(run=run))

    # it's important to start these methods with an underscore '_' so that the
    # abstract class BaseTracer can use them to actually send events!
    def _on_llm_start(self, run: Run) -> None:
        self.handle_event('on_llm_start', run)

    def _on_llm_end(self, run: Run) -> None:
        self.handle_event('on_llm_end', run)

    # ...plus the rest of the "_on_*" callbacks listed at
    # https://github.com/langchain-ai/langchain/blob/3796e143f83a57b258f518efa13118be704e48c3/libs/core/langchain_core/tracers/base.py

    def _persist_run(self, run: Run) -> None:
        pass
To install the Rudderstack Python SDK:
pip install rudder-sdk-python
Connecting your sources to destinations is very easy — Rudderstack handles all the intricacies of ETL for you.
Once you have the Rudderstack SDK installed and the write key for your source, you can start sending events with the RudderstackTracer class from above and an existing LangChain agent:
from langchain.agents import AgentExecutor, create_structured_chat_agent
from langchain_core.prompts import ChatPromptTemplate

from rudderstack_tracer import RudderstackTracer

# `llm` and `tools` come from your existing agent setup
human = """{input}"""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You're a helpful AI assistant"),
        ("human", human),
    ]
)

agent = create_structured_chat_agent(llm, tools, prompt)
agent_executor = AgentExecutor(
    agent=agent, tools=tools, verbose=True, handle_parsing_errors=True
)

agent_executor.invoke(
    {"input": "Hello AI!"},
    {"callbacks": [RudderstackTracer()]},  # this is where you add the event callback!
)
Now execute the script you’ve written. If you look at your Rudderstack dashboard, you’ll be able to see the events trickling in.
AI events from my test script streaming in
You’ll see the various event types coming in, all depending on which methods you’ve configured and what your agents are doing.
Each of these event types will land in their own table in your destination!
Event Types and Counts
Just having a pile of data isn’t worth much. We can begin to understand what our agents are doing by grouping these events by run and by tool execution. I’ll show how to recreate a chain run by joining all these tables.
First, here’s an example schema from the on_chain_start table. There are a lot of useful columns:
CREATE TABLE default.on_chain_start
(
    `parent_run_id` Nullable(String),
    `serialized_kwargs_optional_variables` Nullable(String),
    `timestamp` Nullable(DateTime),
    `id` String,
    `serialized_lc` Nullable(Int64),
    `serialized_kwargs_messages` Nullable(String),
    `trace_id` Nullable(String),
    `serialized_kwargs_partial_variables_tool_names` Nullable(String),
    `original_timestamp` Nullable(DateTime),
    `serialized_id` Nullable(String),
    `serialized_type` Nullable(String),
    `context_source_id` Nullable(String),
    `context_request_ip` Nullable(String),
    `context_library_version` Nullable(String),
    `run_type` Nullable(String),
    `channel` Nullable(String),
    `session_name` Nullable(String),
    `context_destination_type` Nullable(String),
    `event` LowCardinality(String),
    `event_text` LowCardinality(String),
    `received_at` DateTime,
    `uuid_ts` Nullable(DateTime),
    `tags` Nullable(String),
    `events` Nullable(String),
    `context_ip` Nullable(String),
    `serialized_kwargs_input_variables` Nullable(String),
    `start_time` Nullable(DateTime),
    `inputs_input` Nullable(String),
    `context_library_name` Nullable(String),
    `user_id` Nullable(String),
    `dotted_order` Nullable(String),
    `context_source_type` Nullable(String),
    `sent_at` Nullable(DateTime),
    `serialized_kwargs_partial_variables_tools` Nullable(String),
    `context_destination_id` Nullable(String),
    `serialized_name` Nullable(String),
    `name` Nullable(String)
)
ENGINE = SharedReplacingMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}')
PARTITION BY toDate(received_at)
ORDER BY (received_at, id)
SETTINGS index_granularity = 8192
And here’s the union to start grouping these events together:
SELECT
    parent_run_id,
    original_timestamp,
    event,
    event_text
FROM
(
    SELECT
        parent_run_id,
        original_timestamp,
        event,
        inputs_input AS event_text
    FROM on_chain_end

    UNION ALL

    SELECT
        parent_run_id,
        original_timestamp,
        event,
        error AS event_text
    FROM on_chain_error

    UNION ALL

    SELECT
        parent_run_id,
        original_timestamp,
        event,
        inputs_input AS event_text
    FROM on_chain_start

    UNION ALL

    SELECT
        parent_run_id,
        original_timestamp,
        event,
        error AS event_text
    FROM on_llm_error

    UNION ALL

    SELECT
        parent_run_id,
        original_timestamp,
        event,
        inputs_prompts AS event_text
    FROM on_llm_start
)
WHERE parent_run_id = '{a_particular_run_id}'
ORDER BY original_timestamp
The parent_run_id column is what groups these events together. If you have more or fewer event tables, you’ll need to add or remove branches of the union as you see fit.
A single Agent session with an error captured!
Now that we have events coming into our database, we can easily create rules and metrics on top of these events! Errors could be routed to a Slack channel, metrics about a given type of Agent could be aggregated, or sentiment from the customer’s responses could be extracted as they interact with your Agents. Perhaps a particular Tool starts throwing more errors after an update — you could detect that before it becomes an issue.
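As a sketch of what that automation could look like, here’s a hypothetical monitoring job. It assumes a ClickHouse destination whose on_chain_error table mirrors the schema above; the host and Slack webhook URL are placeholders:

import clickhouse_connect
import requests

client = clickhouse_connect.get_client(host="your_clickhouse_host")

# Count chain errors per agent name over the last hour.
result = client.query("""
    SELECT name, count() AS errors
    FROM on_chain_error
    WHERE received_at > now() - INTERVAL 1 HOUR
    GROUP BY name
    HAVING errors > 10
""")

# Route anything noisy to a Slack channel via an incoming webhook.
for name, errors in result.result_rows:
    requests.post(
        "https://hooks.slack.com/services/your/webhook/url",
        json={"text": f"Agent chain '{name}' logged {errors} errors in the last hour"},
    )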
AI Agents will soon integrate deeply with many existing systems. We’ll need to track the actions those Agents take so if mistakes happen, we (or another error-fixing Agent!) can unwind them.
Writing custom callbacks is simple and lets you do some very interesting things with AI Agents.
Existing technologies can be leveraged to build your own AI observability capabilities. Or you can use purpose-built AI observability frameworks.
Collecting and understanding the workflow data that your AI agents generate, especially as your AI capability grows, is essential to scaling and monitoring your AI ecosystem. Without this data, you’ll be stuck sending screenshots, relying on whether prompt changes “feel more accurate”, or simply not knowing what your AI is doing.
You don’t have to build your own system, but understanding how the various pieces work together can give you appreciation and deeper insight into the complexities of running these AI workloads at scale.