- Debug failed runs by inspecting the exact inputs, outputs, and errors at every step
- Evaluate output quality at scale with calibrated judges that score your traces automatically
- Optimize prompts and models by comparing versions against real production data
- Monitor cost, latency, and error rates across sessions, traces, and spans
How it works
Instrument your code
Add a few lines to your application. The SDK automatically captures LLM
calls, or you can create custom spans for any operation.
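To make the idea concrete, here is an illustrative sketch of how a decorator-based SDK can capture a function's inputs, outputs, errors, and timing as a span. The `span` decorator and the in-memory `SPANS` store are hypothetical stand-ins, not the actual ZeroEval API; see the SDK pages for the real calls.

```python
import functools
import time

SPANS = []  # hypothetical in-memory store; a real SDK would export to a backend


def span(name):
    """Record a function call's inputs, output, error, and duration."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            record = {"name": name,
                      "inputs": {"args": args, "kwargs": kwargs},
                      "start": time.time()}
            try:
                record["output"] = fn(*args, **kwargs)
                return record["output"]
            except Exception as exc:
                record["error"] = repr(exc)
                raise
            finally:
                record["duration_s"] = time.time() - record["start"]
                SPANS.append(record)
        return wrapper
    return decorator


@span("summarize")
def summarize(text):
    return text[:10]


summarize("hello world, this is a long document")
print(SPANS[0]["name"])
```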
Traces flow into ZeroEval
Every agent run becomes a trace — a tree of spans showing what happened, in
what order, with full inputs and outputs.
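Conceptually, a trace is a tree of spans walked in execution order. A minimal sketch of that structure (field names here are illustrative, not ZeroEval's actual schema):

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


# One agent run becomes a root span with ordered child spans.
trace = Span("agent_run", children=[
    Span("retrieve_docs", inputs={"query": "refund policy"}),
    Span("llm_call", inputs={"model": "gpt-4o"}, outputs={"text": "..."}),
])


def walk(span, depth=0):
    """Yield (depth, name) pairs in execution order."""
    yield depth, span.name
    for child in span.children:
        yield from walk(child, depth + 1)


print(list(walk(trace)))
```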
Organize with sessions and tags
Group related traces into sessions and tag them with metadata for filtering.
Attach human feedback or let judges evaluate outputs automatically.
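To show what session- and tag-based filtering buys you, here is a hypothetical sketch of selecting traces by session and tag values; the record shape and `filter_traces` helper are illustrative, not the SDK's query API:

```python
traces = [
    {"id": "t1", "session": "s-checkout", "tags": {"env": "prod"}},
    {"id": "t2", "session": "s-checkout", "tags": {"env": "staging"}},
    {"id": "t3", "session": "s-search", "tags": {"env": "prod"}},
]


def filter_traces(traces, session=None, **tags):
    """Select traces matching a session and all given tag values."""
    return [t for t in traces
            if (session is None or t["session"] == session)
            and all(t["tags"].get(k) == v for k, v in tags.items())]


print([t["id"] for t in filter_traces(traces, session="s-checkout", env="prod")])
```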
Get started
Create an API key from Settings → API Keys, then pick your integration path:
Python SDK
Decorators and context managers for Python apps. Auto-instruments OpenAI,
LangChain, Gemini, and more.
TypeScript SDK
Wrapper functions for Node.js and Bun. Auto-instruments OpenAI and Vercel AI
SDK.
REST API
Send spans, traces, and sessions directly over HTTP from any language.
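As a sketch, a span payload could be assembled and POSTed like this. The endpoint URL, header, and field names below are assumptions for illustration, not the documented schema; consult the API reference for the real request format.

```python
import json
import uuid
from datetime import datetime, timezone


def build_span(name, trace_id, inputs, outputs):
    """Assemble a span record; field names are illustrative."""
    return {
        "id": str(uuid.uuid4()),
        "trace_id": trace_id,
        "name": name,
        "inputs": inputs,
        "outputs": outputs,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


span = build_span("llm_call", "trace-123", {"prompt": "hi"}, {"text": "hello"})
body = json.dumps(span)

# To send (endpoint and auth header are hypothetical placeholders):
# requests.post("https://api.zeroeval.com/spans", data=body,
#               headers={"Authorization": "Bearer <API_KEY>"})
print(span["name"])
```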
OpenTelemetry
Route OTLP traces from any OpenTelemetry-instrumented app to ZeroEval.
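Routing typically means pointing the standard OTLP exporter environment variables at ZeroEval. The endpoint URL and header value below are placeholders, not documented values; substitute the endpoint and API key from your ZeroEval settings.

```shell
# Standard OpenTelemetry exporter variables (endpoint and key are placeholders)
export OTEL_EXPORTER_OTLP_ENDPOINT="https://<your-zeroeval-otlp-endpoint>"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <API_KEY>"
export OTEL_SERVICE_NAME="my-agent"
```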