By the end of the tutorial we will have done the following:
Fetch and preprocess documents that will be used for retrieval.
Index those documents for semantic search and create a retriever tool for the agent.
Build an agentic RAG system that can decide when to use the retriever tool.
Let's download the required packages and set our API keys:
%%capture --no-stderr
%pip install -U --quiet langgraph "langchain[openai]" langchain-community langchain-text-splitters
import getpass
import os


def _set_env(key: str):
    if key not in os.environ:
        os.environ[key] = getpass.getpass(f"{key}:")


_set_env("OPENAI_API_KEY")
Fetch the documents to use in the RAG system with the WebBaseLoader utility:
from langchain_community.document_loaders import WebBaseLoader

urls = [
    "https://lilianweng.github.io/posts/2024-11-28-reward-hacking/",
    "https://lilianweng.github.io/posts/2024-07-07-hallucination/",
    "https://lilianweng.github.io/posts/2024-04-12-diffusion-video/",
]

docs = [WebBaseLoader(url).load() for url in urls]
docs[0][0].page_content.strip()[:1000]
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Flatten the per-URL lists into a single list of documents
docs_list = [item for sublist in docs for item in sublist]

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100, chunk_overlap=50
)
doc_splits = text_splitter.split_documents(docs_list)
doc_splits[0].page_content.strip()
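To make the effect of chunk_size and chunk_overlap concrete, here is a minimal sketch of overlapping chunking. It is not how RecursiveCharacterTextSplitter works internally: it counts whitespace-separated words rather than tiktoken tokens and ignores separators, and the helper name chunk_words is hypothetical.

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into word windows of chunk_size that share chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, chunk_size=4, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

With an overlap of half the chunk size, as in the tutorial's 100/50 setting, every word away from the edges of the text appears in two consecutive chunks, which reduces the chance that a relevant passage is cut in half at a chunk boundary.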
Now that we have our split documents, we can index them into a vector store that we'll use for semantic search.
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

vectorstore = InMemoryVectorStore.from_documents(
    documents=doc_splits, embedding=OpenAIEmbeddings()
)
retriever = vectorstore.as_retriever()
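Conceptually, a vector store embeds every chunk, embeds the query, and returns the chunks whose vectors lie closest to the query vector. The sketch below illustrates that idea with the standard library only. The bag-of-words embed function is a stand-in assumption for OpenAIEmbeddings, and TinyVectorStore is a hypothetical class, not the implementation of InMemoryVectorStore.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """Hypothetical in-memory store: keeps (text, vector) pairs and ranks by similarity."""

    def __init__(self, texts: list[str]):
        self.entries = [(text, embed(text)) for text in texts]

    def search(self, query: str, k: int = 2) -> list[str]:
        query_vec = embed(query)
        ranked = sorted(self.entries, key=lambda entry: cosine(query_vec, entry[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = TinyVectorStore([
    "reward hacking exploits flaws in the reward function",
    "diffusion models generate video frame by frame",
    "hallucination means fabricated content in model output",
])
top = store.search("types of reward hacking", k=1)
print(top)
```

A real embedding model places texts with similar meaning close together even when they share no words, which word counts cannot do; the ranking step, however, follows the same pattern.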
Create a retriever tool using create_retriever_tool:
from langchain.tools.retriever import create_retriever_tool

retriever_tool = create_retriever_tool(
    retriever,
    "retrieve_blog_posts",
    "Search and return information about Lilian Weng blog posts.",
)
retriever_tool.invoke({"query": "types of reward hacking"})
The components of the graph operate on MessagesState, a graph state that contains a messages key with a list of chat messages.
Build a generate_query_or_respond node. It calls an LLM to generate a response based on the current graph state (the list of messages). Given the input messages, it decides either to retrieve using the retriever tool or to respond directly to the user. Note that we give the chat model access to the retriever_tool we created earlier via .bind_tools:
from langgraph.graph import MessagesState
from langchain.chat_models import init_chat_model

response_model = init_chat_model("openai:gpt-4.1", temperature=0)


def generate_query_or_respond(state: MessagesState):
    """Call the model to generate a response based on the current state. Given
    the question, it will decide to retrieve using the retriever tool, or simply respond to the user.
    """
    response = (
        response_model
        .bind_tools([retriever_tool]).invoke(state["messages"])
    )
    return {"messages": [response]}
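Each node returns a partial update such as {"messages": [response]} rather than the full state. MessagesState merges that update by appending to the existing list (its reducer also replaces messages that share an ID, which this sketch omits). A simplified, standard-library illustration, where apply_update is a hypothetical helper and plain dicts stand in for message objects:

```python
def apply_update(state: dict, update: dict) -> dict:
    """Return a new state with the update's messages appended to the existing ones."""
    return {"messages": state["messages"] + update.get("messages", [])}


state = {"messages": [{"role": "user", "content": "hello!"}]}
update = {"messages": [{"role": "assistant", "content": "Hello! How can I help you today?"}]}
state = apply_update(state, update)
print(len(state["messages"]))
```

This is why later nodes can read the original question from state["messages"][0] and the most recent tool result from state["messages"][-1].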
input = {"messages": [{"role": "user", "content": "hello!"}]}
generate_query_or_respond(input)["messages"][-1].pretty_print()
Output:
================================== Ai Message ==================================
Hello! How can I help you today?
input = {
    "messages": [
        {
            "role": "user",
            "content": "What does Lilian Weng say about types of reward hacking?",
        }
    ]
}
generate_query_or_respond(input)["messages"][-1].pretty_print()
Output:
================================== Ai Message ==================================
Tool Calls:
retrieve_blog_posts (call_tYQxgfIlnQUDMdtAhdbXNwIM)
Call ID: call_tYQxgfIlnQUDMdtAhdbXNwIM
Args:
query: types of reward hacking
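Whether to retrieve or to respond can be read off the returned message: it either carries tool calls or plain content. Here is a minimal sketch of that check, using plain dicts in place of LangChain message objects. The route_on_tool_calls helper, the dict layout, and the "end" label are assumptions for illustration; the tutorial itself relies on the prebuilt tools_condition for this decision.

```python
def route_on_tool_calls(message: dict) -> str:
    """Return "tools" when the message requests a tool call, otherwise "end"."""
    return "tools" if message.get("tool_calls") else "end"


greeting_reply = {"role": "assistant", "content": "Hello! How can I help you today?"}
retrieval_request = {
    "role": "assistant",
    "content": "",
    "tool_calls": [
        {"id": "1", "name": "retrieve_blog_posts", "args": {"query": "types of reward hacking"}}
    ],
}

print(route_on_tool_calls(greeting_reply))
print(route_on_tool_calls(retrieval_request))
```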
Add a conditional edge, grade_documents, to determine whether the retrieved documents are relevant to the question. We will use a model with a structured output schema, GradeDocuments, for document grading. The grade_documents function returns the name of the node to go to next based on the grading decision (generate_answer or rewrite_question):
from pydantic import BaseModel, Field
from typing import Literal

GRADE_PROMPT = (
    "You are a grader assessing relevance of a retrieved document to a user question. \n"
    "Here is the retrieved document: \n\n {context} \n\n"
    "Here is the user question: {question} \n"
    "If the document contains keyword(s) or semantic meaning related to the user question, grade it as relevant. \n"
    "Give a binary score 'yes' or 'no' to indicate whether the document is relevant to the question."
)


class GradeDocuments(BaseModel):
    """Grade documents using a binary score for relevance check."""

    binary_score: str = Field(
        description="Relevance score: 'yes' if relevant, or 'no' if not relevant"
    )


grader_model = init_chat_model("openai:gpt-4.1", temperature=0)


def grade_documents(
    state: MessagesState,
) -> Literal["generate_answer", "rewrite_question"]:
    """Determine whether the retrieved documents are relevant to the question."""
    # The first message is the user question; the last is the tool result
    question = state["messages"][0].content
    context = state["messages"][-1].content

    prompt = GRADE_PROMPT.format(question=question, context=context)
    response = (
        grader_model
        .with_structured_output(GradeDocuments).invoke(
            [{"role": "user", "content": prompt}]
        )
    )
    score = response.binary_score

    if score == "yes":
        return "generate_answer"
    else:
        return "rewrite_question"
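Because binary_score is typed as a plain str, the comparison score == "yes" is sensitive to casing and stray whitespace in the model's reply. One possible hardening, sketched under the assumption that anything other than a clean "yes" should trigger a rewrite, is to normalize the score before routing. The route_from_score helper is hypothetical and not part of the tutorial.

```python
from typing import Literal


def route_from_score(score: str) -> Literal["generate_answer", "rewrite_question"]:
    """Map a grader score to the next node, tolerating case and whitespace."""
    if score.strip().lower() == "yes":
        return "generate_answer"
    return "rewrite_question"


print(route_from_score(" Yes "))
print(route_from_score("no"))
```

Defaulting to rewrite_question on unexpected values is a design choice: it costs an extra retrieval round but avoids answering from context the grader did not clearly approve.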
from langchain_core.messages import convert_to_messages

# The tool response ("meow") is irrelevant to the question, so the grader
# is expected to route to "rewrite_question".
input = {
    "messages": convert_to_messages(
        [
            {
                "role": "user",
                "content": "What does Lilian Weng say about types of reward hacking?",
            },
            {
                "role": "assistant",
                "content": "",
                "tool_calls": [
                    {
                        "id": "1",
                        "name": "retrieve_blog_posts",
                        "args": {"query": "types of reward hacking"},
                    }
                ],
            },
            {"role": "tool", "content": "meow", "tool_call_id": "1"},
        ]
    )
}
grade_documents(input)
# The tool response is relevant to the question, so the grader
# is expected to route to "generate_answer".
input = {
    "messages": convert_to_messages(
        [
            {
                "role": "user",
                "content": "What does Lilian Weng say about types of reward hacking?",
            },
            {
                "role": "assistant",
                "content": "",
                "tool_calls": [
                    {
                        "id": "1",
                        "name": "retrieve_blog_posts",
                        "args": {"query": "types of reward hacking"},
                    }
                ],
            },
            {
                "role": "tool",
                "content": "reward hacking can be categorized into two types: environment or goal misspecification, and reward tampering",
                "tool_call_id": "1",
            },
        ]
    )
}
grade_documents(input)
Build the rewrite_question node. The retriever tool can return potentially irrelevant documents, which indicates a need to improve the original user question. To do so, we call the rewrite_question node:
REWRITE_PROMPT = (
    "Look at the input and try to reason about the underlying semantic intent / meaning.\n"
    "Here is the initial question:"
    "\n ------- \n"
    "{question}"
    "\n ------- \n"
    "Formulate an improved question:"
)


def rewrite_question(state: MessagesState):
    """Rewrite the original user question."""
    messages = state["messages"]
    question = messages[0].content
    prompt = REWRITE_PROMPT.format(question=question)
    response = response_model.invoke([{"role": "user", "content": prompt}])
    return {"messages": [{"role": "user", "content": response.content}]}
input = {
    "messages": convert_to_messages(
        [
            {
                "role": "user",
                "content": "What does Lilian Weng say about types of reward hacking?",
            },
            {
                "role": "assistant",
                "content": "",
                "tool_calls": [
                    {
                        "id": "1",
                        "name": "retrieve_blog_posts",
                        "args": {"query": "types of reward hacking"},
                    }
                ],
            },
            {"role": "tool", "content": "meow", "tool_call_id": "1"},
        ]
    )
}

response = rewrite_question(input)
print(response["messages"][-1]["content"])
Output:
What are the different types of reward hacking described by Lilian Weng, and how does she explain them?
Build the generate_answer node: if we pass the grader checks, we can generate the final answer based on the original question and the retrieved context:
GENERATE_PROMPT = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know. "
    "Use three sentences maximum and keep the answer concise.\n"
    "Question: {question} \n"
    "Context: {context}"
)


def generate_answer(state: MessagesState):
    """Generate an answer."""
    question = state["messages"][0].content
    context = state["messages"][-1].content
    prompt = GENERATE_PROMPT.format(question=question, context=context)
    response = response_model.invoke([{"role": "user", "content": prompt}])
    return {"messages": [response]}
input = {
    "messages": convert_to_messages(
        [
            {
                "role": "user",
                "content": "What does Lilian Weng say about types of reward hacking?",
            },
            {
                "role": "assistant",
                "content": "",
                "tool_calls": [
                    {
                        "id": "1",
                        "name": "retrieve_blog_posts",
                        "args": {"query": "types of reward hacking"},
                    }
                ],
            },
            {
                "role": "tool",
                "content": "reward hacking can be categorized into two types: environment or goal misspecification, and reward tampering",
                "tool_call_id": "1",
            },
        ]
    )
}

response = generate_answer(input)
response["messages"][-1].pretty_print()
Output:
================================== Ai Message ==================================
Lilian Weng categorizes reward hacking into two types: environment or goal misspecification, and reward tampering. She considers reward hacking as a broad concept that includes both of these categories. Reward hacking occurs when an agent exploits flaws or ambiguities in the reward function to achieve high rewards without performing the intended behaviors.
Assemble the graph:
Start with generate_query_or_respond and determine if we need to call retriever_tool.
Route to the next step using tools_condition: if generate_query_or_respond returned tool_calls, call retriever_tool to retrieve context; otherwise, respond directly to the user.
Grade the retrieved document content for relevance to the question (grade_documents) and route to the next step: if the content is not relevant, rewrite the question using rewrite_question and then call generate_query_or_respond again; if it is relevant, proceed to generate_answer and generate the final response using the ToolMessage with the retrieved document context.
from langgraph.graph import StateGraph, START, END
from langgraph.prebuilt import ToolNode
from langgraph.prebuilt import tools_condition

workflow = StateGraph(MessagesState)

# Define the nodes we will cycle between
workflow.add_node(generate_query_or_respond)
workflow.add_node("retrieve", ToolNode([retriever_tool]))
workflow.add_node(rewrite_question)
workflow.add_node(generate_answer)

workflow.add_edge(START, "generate_query_or_respond")

# Decide whether to retrieve
workflow.add_conditional_edges(
    "generate_query_or_respond",
    # Assess LLM decision (call `retriever_tool` tool or respond to the user)
    tools_condition,
    {
        # Translate the condition outputs to nodes in our graph
        "tools": "retrieve",
        END: END,
    },
)

# Edges taken after the `retrieve` node is called.
workflow.add_conditional_edges(
    "retrieve",
    # Assess agent decision
    grade_documents,
)
workflow.add_edge("generate_answer", END)
workflow.add_edge("rewrite_question", "generate_query_or_respond")

# Compile
graph = workflow.compile()
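The control flow that these edges encode can be traced without an LLM. The sketch below is a plain-Python simulation, not LangGraph: the grader is a stub that checks for shared words, the retrieved argument supplies what each successive retrieval would return, and the max_rewrites cap is an added assumption to guarantee termination (the graph above defines no such cap). It also assumes the model always chooses to retrieve, so the direct-response branch is not modeled.

```python
def run_flow(question: str, retrieved: list[str], max_rewrites: int = 2) -> list[str]:
    """Trace which nodes would run, given the context each retrieval returns in turn."""
    trace = []
    rewrites = 0
    for context in retrieved:
        trace += ["generate_query_or_respond", "retrieve"]
        # Stub grader: "relevant" means the context shares a word with the question.
        relevant = bool(set(question.lower().split()) & set(context.lower().split()))
        if relevant:
            trace.append("generate_answer")
            return trace
        if rewrites == max_rewrites:
            break  # give up instead of looping forever
        trace.append("rewrite_question")
        rewrites += 1
    return trace


trace = run_flow("types of reward hacking", ["meow", "reward hacking has two types"])
print(trace)
```

In this run the first retrieval is graded irrelevant, so the flow passes through rewrite_question and back to generate_query_or_respond before the second retrieval leads to generate_answer.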
Visualize the graph:
from IPython.display import Image, display

display(Image(graph.get_graph().draw_mermaid_png()))
for chunk in graph.stream(
    {
        "messages": [
            {
                "role": "user",
                "content": "What does Lilian Weng say about types of reward hacking?",
            }
        ]
    }
):
    for node, update in chunk.items():
        print("Update from node", node)
        update["messages"][-1].pretty_print()
        print("\n\n")
Output:
Update from node generate_query_or_respond
================================== Ai Message ==================================
Tool Calls:
retrieve_blog_posts (call_NYu2vq4km9nNNEFqJwefWKu1)
Call ID: call_NYu2vq4km9nNNEFqJwefWKu1
Args:
query: types of reward hacking
Update from node retrieve
================================= Tool Message ==================================
Name: retrieve_blog_posts
(Note: Some work defines reward tampering as a distinct category of misalignment behavior from reward hacking. But I consider reward hacking as a broader concept here.)
At a high level, reward hacking can be categorized into two types: environment or goal misspecification, and reward tampering.
Why does Reward Hacking Exist?
#
Pan et al. (2022) investigated reward hacking as a function of agent capabilities, including (1) model size, (2) action space resolution, (3) observation space noise, and (4) training time. They also proposed a taxonomy of three types of misspecified proxy rewards:
Let's Define Reward Hacking
#
Reward shaping in RL is challenging. Reward hacking occurs when an RL agent exploits flaws or ambiguities in the reward function to obtain high rewards without genuinely learning the intended behaviors or completing the task as designed. In recent years, several related concepts have been proposed, all referring to some form of reward hacking:
Update from node generate_answer
================================== Ai Message ==================================
Lilian Weng categorizes reward hacking into two types: environment or goal misspecification, and reward tampering. She considers reward hacking as a broad concept that includes both of these categories. Reward hacking occurs when an agent exploits flaws or ambiguities in the reward function to achieve high rewards without performing the intended behaviors.