ChainForge is a data flow prompt engineering environment for analyzing and evaluating LLM responses. It enables rapid-fire, quick-and-dirty comparison of prompts, models, and response quality that goes beyond ad-hoc chatting with individual LLMs. With ChainForge, you can:
Query multiple LLMs at once to test prompt ideas and variations quickly and effectively.
Compare response quality across prompt permutations, across models, and across model settings to choose the best prompt and model for your use case.
Set up evaluation metrics (scoring functions) and immediately visualize results across prompts, prompt parameters, models, and model settings.
Use AI to streamline this entire process: Create synthetic tables and input examples with built-in genAI features, or supercharge writing evals by prompting a model to give you starter code.
For user-curated resources and learning materials, check out the Awesome ChainForge repo!
You can install ChainForge locally, or try it out on the web at https://chainforge.ai/play/. The web version of ChainForge has a limited feature set. In a locally installed version you can load API keys automatically from environment variables, write Python code to evaluate LLM responses, or query locally-run models hosted via Ollama.
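For example, a local install can read provider keys from your shell environment at launch. Below is a minimal Python sketch for sanity-checking your environment; the variable names (e.g., OPENAI_API_KEY) are the providers' conventional names and are assumptions here, so confirm the exact names ChainForge reads in its documentation:

```python
import os

# Conventional provider key names; these are assumptions for illustration,
# not an authoritative list of what ChainForge actually reads.
expected_keys = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "HUGGINGFACE_API_KEY"]

missing = [key for key in expected_keys if not os.environ.get(key)]
if missing:
    print("Consider setting before `chainforge serve`:", ", ".join(missing))
else:
    print("All expected provider keys are present in the environment.")
```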
To install ChainForge on your machine, make sure you have Python 3.8 or higher, then run:

```shell
pip install chainforge
```

Once installed, run:

```shell
chainforge serve
```
Alternatively, you can build and run ChainForge with Docker. Build the image from the provided Dockerfile:

```shell
docker build -t chainforge .
```

Then run the container:

```shell
docker run -p 8000:8000 chainforge
```

and open http://127.0.0.1:8000 in your browser.
We currently support the following model providers:
OpenAI
Anthropic
Google (Gemini, PaLM2)
DeepSeek
HuggingFace (Inference and Endpoints)
Together.ai
Microsoft Azure OpenAI Endpoints
Amazon Bedrock-hosted on-demand inference, including Anthropic Claude 3
You can also conduct ground truth evaluations using Tabular Data nodes. For instance, we can compare each LLM's ability to answer math problems by comparing each response to the expected answer:
Just import a dataset, hook it up to a template variable in a Prompt Node, and press run.
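To give a concrete sense of what such a ground truth check can look like in a Python evaluator, here is a minimal sketch. It assumes an `evaluate(response)` entry point where `response.text` holds the LLM's reply and `response.var` holds the row's columns (a hypothetical `answer` column here); check the evaluator node's built-in example for the exact interface:

```python
def evaluate(response):
    """Hypothetical ground-truth check: does the reply contain the expected answer?

    Assumes `response.text` is the LLM's reply and `response.var` is a dict of
    template variables, including an 'answer' column from a Tabular Data node.
    """
    expected = str(response.var.get("answer", "")).strip().lower()
    return expected != "" and expected in response.text.lower()
```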
Compare across models and prompt variables with an interactive response inspector, including a formatted table and exportable data.
The key power of ChainForge is combinatorial: it takes the cross product of inputs to prompt templates, meaning you can produce every combination of input values. This is incredibly effective for sending off hundreds of queries at once and verifying model behavior more robustly than one-off prompting.
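To illustrate what this cross product means, here is a minimal Python sketch. The template, variable names, and values below are hypothetical; ChainForge performs this expansion for you inside the Prompt Node:

```python
from itertools import product

# Hypothetical prompt template and input variables.
template = "What is the best strategy for {game} when playing as {player}?"
variables = {
    "game": ["chess", "poker", "Go"],
    "player": ["a beginner", "an expert"],
}

# Every combination of input values -> 3 x 2 = 6 distinct prompts.
names = list(variables)
prompts = [
    template.format(**dict(zip(names, combo)))
    for combo in product(*(variables[n] for n in names))
]

for p in prompts:
    print(p)
```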
Simply click Share to generate a unique link for your flow and copy it to your clipboard:
Note: To prevent abuse, you can only share up to 10 flows at a time, and each flow must be under 5 MB after compression. If you share more than 10 flows, the oldest link will break, so make sure to always Export important flows to `cforge` files, and use Share only to pass data ephemerally.
A key goal of ChainForge is facilitating comparison and evaluation of prompts and models. Overall, you can:
Compare across prompts and prompt parameters: Find the best set of prompts that maximizes your eval target metrics (e.g., lowest code error rate). Or, see how changing parameters in a prompt template affects the quality of responses.
Compare across models: Compare responses for every prompt across models and different model settings, to find the best model for your use case.
The features that enable this are:
Prompt permutations: Set up a prompt template and feed it variations of input variables. ChainForge will prompt all selected LLMs with all possible permutations of the input prompt, so that you can get a better sense of prompt quality. You can also chain prompt templates at arbitrary depth (e.g., to compare templates).
Model settings: Change the settings of supported models, and compare across settings. For instance, you can measure the impact of a system message on ChatGPT by adding several ChatGPT models, changing individual settings, and nicknaming each one. ChainForge will send out queries to each version of the model.
Evaluation nodes: Probe LLM responses in a chain and test them (classically) for some desired behavior. At a basic level, this is Python-script based (see the sketch after this list). We plan to add preset evaluator nodes for common use cases in the near future (e.g., named-entity recognition). Note that you can also chain LLM responses into prompt templates to help evaluate outputs cheaply before applying more extensive evaluation methods.
Visualization nodes: Visualize evaluation results on plots like grouped box-and-whisker (for numeric metrics) and histograms (for boolean metrics). Currently we only support numeric and boolean metrics. We aim to provide users more control and options for plotting in the future.
Chat turns: Go beyond single prompts and template follow-up chat messages, just like prompts. You can test how the wording of the user's query might change an LLM's output, or compare the quality of later responses across multiple chat models (or the same chat model with different settings!).
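For example, the "lowest code error rate" target mentioned above could be approximated by an evaluator that checks whether each response parses as valid Python. This is a rough sketch under the same assumed `evaluate(response)` interface described earlier, and the code-extraction heuristic is purely illustrative:

```python
import ast
import re

def evaluate(response):
    """Return True if the Python code in the reply parses without a SyntaxError.

    Assumes `response.text` is the LLM's reply; code is taken from the first
    fenced code block if present, otherwise the whole reply is treated as code.
    """
    match = re.search(r"```(?:python)?\n(.*?)```", response.text, re.DOTALL)
    code = match.group(1) if match else response.text
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False
```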
You can also export responses to spreadsheets (Excel `xlsx`). To do this, attach an Inspect node to the output of a Prompt node and click Export Data.
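Once exported, the spreadsheet can be analyzed with ordinary tooling. Below is a small pandas sketch; the filename and column names are placeholders, since the actual headers depend on your flow and on ChainForge's export format:

```python
import pandas as pd  # reading .xlsx files also requires openpyxl

# Load the file produced by "Export Data" in the Inspect node.
# "responses.xlsx" is a placeholder filename for illustration.
df = pd.read_excel("responses.xlsx")
print(df.head())

# Count rows per model, assuming the export includes an "LLM" column.
if "LLM" in df.columns:
    print(df["LLM"].value_counts())
```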
This work was partially funded by the NSF grants IIS-2107391, IIS-2040880, and IIS-1
We provide ongoing releases of this tool in the hopes that others find it useful for their projects.
"AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts" (Wu et al., CHI â22)
Unlike these projects, we are focusing on supporting evaluation across prompts, prompt parameters, and models.
If you use ChainForge in academic work, please cite our CHI 2024 paper:

```bibtex
@inproceedings{arawjo2024chainforge,
  title={ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis Testing},
  author={Arawjo, Ian and Swoopes, Chelse and Vaithilingam, Priyan and Wattenberg, Martin and Glassman, Elena L},
  booktitle={Proceedings of the CHI Conference on Human Factors in Computing Systems},
  pages={1--18},
  year={2024}
}
```
ChainForge is released under the MIT License.