We’ve all visited websites with the chatbot icon on the lower right. Honestly, I think those can be pretty great solutions for finding answers to frequently asked questions. They tend to fall flat, however, when you ask more complicated questions, and that’s usually when you’d be routed to a human. But humans aren’t always available, so adding AI as a better-than-business-rules, more-available-than-humans layer in these chat boxes is pretty appealing.
Now you need to keep track not only of what your customers are asking for in the chat, but also of what your AI Agents are up to behind the scenes. What API calls are they making? Where are they getting the data they’re using to answer your questions?
There are vendors who will answer those questions for you, BUT let’s build our own so we can learn something.
The omnipresent lower-right chat window
AI Agents are quickly becoming the next exciting development in the AI world. Running one is easy; running thousands requires a bit more technology. How do you track agent successes and failures over time? How do you drill into failure reasons? Being able to answer these questions as you scale your AI Agent capabilities is crucial for long-term AI success.
There are three core aspects of agent observability which ring true for any system that requires monitoring at scale:
The agent (or source) needs to emit events as they happen, with associated metadata such as run IDs and timestamps (see the sketch after this list)
The observability system needs to process and handle those events efficiently while maintaining data resilience if the source or destination encounters issues
The destination (a datastore of some kind) needs to handle large volumes of events while stitching related events together to help tell a broader story.
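Concretely, a single emitted event can be as simple as a dictionary of metadata. Every field name in this sketch is illustrative, not from any particular framework:

import uuid
from datetime import datetime, timezone

# A hypothetical event payload; the field names are illustrative.
event = {
    "run_id": str(uuid.uuid4()),                       # unique ID for this run
    "parent_run_id": None,                             # links child runs to a session
    "event_type": "on_llm_start",                      # what just happened
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "payload": {"prompt": "tell me the weather"},      # event-specific data
}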
There are many technologies that can perform these tasks for us, including dedicated systems like LangSmith and Datadog. But if you want or need total control of the system, you’d need to implement each part yourself. It’s also a great way to learn how LLM frameworks work internally!
How AI imagines what “AI using tools” looks like.
Agents in AI operate a few levels above your basic chat or prompt-based LLM interaction. They can have access to “the real world” through Tools which typically interact with APIs on your behalf. By giving an Agent multiple tools, it can creatively solve problems within those realms.
For example, if you were to ask a non-agent LLM “tell me the weather”, it’d likely respond with “I’m sorry, I don’t have access to the current weather.” That’s because the base LLM is trained on historical data — up-to-date weather data can’t be included in an LLM’s training data set.
A single step (checking the weather) is relatively simple to understand and, from a human-validation perspective, simple to verify (look outside: is it raining?).
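To make that step concrete, here’s a minimal sketch of a weather tool using LangChain’s @tool decorator. The lookup itself is stubbed out; a real version would call a weather API:

from langchain_core.tools import tool

@tool
def get_weather(city: str) -> str:
    """Return the current weather for a given city."""
    # A real implementation would call a weather API here;
    # this stub is purely illustrative.
    return f"It is currently sunny in {city}."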
Things can quickly get more complicated with more tools being given to an Agent. Let’s expand this example and imagine our agent will help us plan a weekend trip — what tools would it need?
Weather tool — Will it be sunny, raining, or snowing?
Attendee tool — Who is available and what are their preferences?
Navigation tool — How can the attendees get to the weekend trip?
Booking tool — What reservations at hotels, restaurants, and museums are needed?
AI checking the weather
These broad tool descriptions greatly oversimplify how tools really work, but they illustrate the separation of duties between tools. Normally you’d want to be more specific so that each tool does one thing well. A real implementation of a weekend trip planning agent could have dozens of tools.
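As a sketch of that granularity, the broad “booking tool” above might really be several narrower tools. The names and bodies here are hypothetical stubs:

from langchain_core.tools import tool

@tool
def book_hotel(city: str, check_in: str, nights: int) -> str:
    """Reserve a hotel room for the trip."""
    return f"Booked {nights} nights in {city} from {check_in}."  # stub

@tool
def book_restaurant(city: str, date: str, party_size: int) -> str:
    """Reserve a restaurant table for the group."""
    return f"Table for {party_size} reserved in {city} on {date}."  # stub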
If we consider the high level steps to planning a trip, we might expect something like this:
Check the weather and come up with a few options
Poll the possible attendees and see who would prefer which locations
Decide on the location and tell the attendees
Based on who can attend, plan and present travel and lodging preferences and accommodations
Gather some initial activity ideas and poll the group of attendees what they’d prefer to do
Base activities on the weather forecast, prioritizing museum visits during non-sunny days
Monitor and adjust the schedule as situations change (late flights, last minute cancellations, dangerous weather)
While this is a toy example, a late flight could alter an entire day’s plans for certain members of the trip. How would the Agent respond to that? If you were providing this Agent as a service to your customers, how would you monitor not just this single trip but the thousands of other trips happening that weekend? How would you show a customer the Agent’s thought process as it modified plans?
All these questions are impossible to answer without a rigorous agent observability system. So let’s build one!
Many LLM frameworks have a capability to observe or alter data as it is sent or generated, using functions known as Callbacks. Generally, callbacks are triggered when various processes occur within a broader system. In the case of LLM frameworks, callbacks can be triggered when you prompt an LLM, when it responds, or when it begins and finishes processing data.
An actual image of AI emitting events.
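One way to tap into those callbacks in LangChain is to subclass BaseTracer and forward every run event to your event pipeline, which is exactly what the tracer below does with Rudderstack: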
from typing import Any
import warnings

from langchain_core.tracers.base import BaseTracer
from langchain_core.tracers.schemas import Run
from pydantic import PydanticDeprecationWarning

import rudderstack.analytics as rudder_analytics

rudder_analytics.write_key = "your_write_key"
rudder_analytics.dataPlaneUrl = "https://your_dataplane_url"
rudder_analytics.sync_mode = True  # this sends events as they occur, IMPORTANT!


class RudderstackTracer(BaseTracer):
    name: str = "RudderstackTracer"

    def __init__(self, **kwargs: Any) -> None:
        super().__init__(**kwargs)

    def _run_to_dict(self, run: Run) -> dict:
        """
        Use this method to tweak what you actually send to your downstream system.
        """
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", category=PydanticDeprecationWarning)
            run_dict = run.dict(exclude={"child_runs", "inputs", "outputs"})
        # run_dict.pop("my_sensitive_data_key", None)  # example key removal
        return {
            **run_dict,
            "inputs": run.inputs.copy() if run.inputs is not None else None,
            "outputs": run.outputs.copy() if run.outputs is not None else None,
        }

    def handle_event(self, event_type: str, run: Run) -> None:
        run_id = str(run.id)
        rudder_analytics.track(run_id, event_type, self._run_to_dict(run=run))

    # it's important to start these methods with an underscore '_' so that the
    # abstract class BaseTracer can use them to actually send events!
    def _on_llm_start(self, run: Run) -> None:
        self.handle_event('on_llm_start', run)

    def _on_llm_end(self, run: Run) -> None:
        self.handle_event('on_llm_end', run)

    # ...plus the rest of the "_on_*" callbacks listed at
    # https://github.com/langchain-ai/langchain/blob/3796e143f83a57b258f518efa13118be704e48c3/libs/core/langchain_core/tracers/base.py

    def _persist_run(self, run: Run) -> None:
        pass
To install the Rudderstack Python SDK:
pip install rudder-sdk-python
Connecting your sources to destinations is very easy — Rudderstack handles all the intricacies of ETL for you.
Once you have the Rudderstack SDK installed and the write key for your source, you can start sending events with the RudderstackTracer class from above and an existing LangChain agent:
from langchain.agents import AgentExecutor, create_structured_chat_agent
from langchain_core.prompts import ChatPromptTemplate

from rudderstack_tracer import RudderstackTracer

# `llm` and `tools` come from your existing agent setup
human = """{input}"""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You're a helpful AI assistant"),
        ("human", human),
    ]
)

agent = create_structured_chat_agent(llm, tools, prompt)
agent_executor = AgentExecutor(
    agent=agent, tools=tools, verbose=True, handle_parsing_errors=True
)

agent_executor.invoke(
    {"input": "Hello AI!"},
    {"callbacks": [RudderstackTracer()]},  # this is where you add the event callback!
)
Now execute the script you’ve written. If you look at your Rudderstack dashboard, you’ll be able to see the events trickling in.
AI events from my test script streaming in
You’ll see the various event types coming in, all depending on which methods you’ve configured and what your agents are doing.
Each of these event types will land in their own table in your destination!
Event Types and Counts
Just having a pile of data isn’t worth much. We can begin to understand what our agents are doing by grouping these events by run and by tool execution. I’ll show how to recreate a chain run by joining all these tables.
First, here’s an example schema from the on_chain_start table. There are a lot of useful columns:
CREATE TABLE default.on_chain_start
(
    `parent_run_id` Nullable(String),
    `serialized_kwargs_optional_variables` Nullable(String),
    `timestamp` Nullable(DateTime),
    `id` String,
    `serialized_lc` Nullable(Int64),
    `serialized_kwargs_messages` Nullable(String),
    `trace_id` Nullable(String),
    `serialized_kwargs_partial_variables_tool_names` Nullable(String),
    `original_timestamp` Nullable(DateTime),
    `serialized_id` Nullable(String),
    `serialized_type` Nullable(String),
    `context_source_id` Nullable(String),
    `context_request_ip` Nullable(String),
    `context_library_version` Nullable(String),
    `run_type` Nullable(String),
    `channel` Nullable(String),
    `session_name` Nullable(String),
    `context_destination_type` Nullable(String),
    `event` LowCardinality(String),
    `event_text` LowCardinality(String),
    `received_at` DateTime,
    `uuid_ts` Nullable(DateTime),
    `tags` Nullable(String),
    `events` Nullable(String),
    `context_ip` Nullable(String),
    `serialized_kwargs_input_variables` Nullable(String),
    `start_time` Nullable(DateTime),
    `inputs_input` Nullable(String),
    `context_library_name` Nullable(String),
    `user_id` Nullable(String),
    `dotted_order` Nullable(String),
    `context_source_type` Nullable(String),
    `sent_at` Nullable(DateTime),
    `serialized_kwargs_partial_variables_tools` Nullable(String),
    `context_destination_id` Nullable(String),
    `serialized_name` Nullable(String),
    `name` Nullable(String)
)
ENGINE = SharedReplacingMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}')
PARTITION BY toDate(received_at)
ORDER BY (received_at, id)
SETTINGS index_granularity = 8192
And here’s the union to start grouping these events together:
SELECT
    parent_run_id,
    original_timestamp,
    event,
    event_text
FROM
(
    SELECT
        parent_run_id,
        original_timestamp,
        event,
        inputs_input AS event_text
    FROM on_chain_end

    UNION ALL

    SELECT
        parent_run_id,
        original_timestamp,
        event,
        error AS event_text
    FROM on_chain_error

    UNION ALL

    SELECT
        parent_run_id,
        original_timestamp,
        event,
        inputs_input AS event_text
    FROM on_chain_start

    UNION ALL

    SELECT
        parent_run_id,
        original_timestamp,
        event,
        error AS event_text
    FROM on_llm_error

    UNION ALL

    SELECT
        parent_run_id,
        original_timestamp,
        event,
        inputs_prompts AS event_text
    FROM on_llm_start
)
WHERE parent_run_id = '{a_particular_run_id}'
ORDER BY original_timestamp
The parent_run_id column is what groups these events together. If you have more or fewer event tables, you’ll need to add or remove branches of the union as you see fit.
A single Agent session with an error captured!
Now that we have events coming into our database, we can easily create rules and metrics on top of these events! Errors could be routed to a Slack channel, metrics about a given type of Agent could be aggregated, or sentiment from the customer’s responses could be extracted as they interact with your Agents. Perhaps a particular Tool starts throwing more errors after an update — you could detect that before it becomes an issue.
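As a sketch of what that automation could look like, here’s a hypothetical monitoring job. It assumes a ClickHouse destination whose on_chain_error table mirrors the schema above; the host and Slack webhook URL are placeholders:

import clickhouse_connect
import requests

client = clickhouse_connect.get_client(host="your_clickhouse_host")

# Count chain errors per agent name over the last hour.
result = client.query("""
    SELECT name, count() AS errors
    FROM on_chain_error
    WHERE received_at > now() - INTERVAL 1 HOUR
    GROUP BY name
    HAVING errors > 10
""")

# Route anything noisy to a Slack channel via an incoming webhook.
for name, errors in result.result_rows:
    requests.post(
        "https://hooks.slack.com/services/your/webhook/url",
        json={"text": f"Agent chain '{name}' logged {errors} errors in the last hour"},
    )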
AI Agents will soon integrate deeply with many existing systems. We’ll need to track the actions those Agents take so if mistakes happen, we (or another error-fixing Agent!) can unwind them.
Writing custom callbacks is simple and lets you do some very interesting things with AI Agents.
Existing technologies can be leveraged to build your own AI observability capabilities. Or you can use purpose-built AI observability frameworks.
Collecting and understanding the workflow data that your AI agents generate, especially as your AI capability grows, is essential to scaling and monitoring your AI ecosystem. Without this data, you’ll be stuck sending screenshots, relying on whether prompt changes “feel more accurate”, or simply not knowing what your AI is doing.
You don’t have to build your own system, but understanding how the various pieces work together can give you appreciation and deeper insight into the complexities of running these AI workloads at scale.