Calibrated LLM judges are AI evaluators that watch your traces, sessions, or spans and score behavior according to criteria you define. They improve over time as you review and correct their evaluations: each correction feeds back into how the judge scores future runs.
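
To make the loop concrete, here is a minimal sketch of a judge that scores a trace against a criterion and folds human corrections back in as few-shot examples. It assumes an OpenAI-style chat client; the class names, model, prompt format, and in-memory correction store are illustrative assumptions, not a specific product API.

```python
from dataclasses import dataclass, field
from openai import OpenAI  # assumed SDK; swap in whichever LLM client you use

client = OpenAI()


@dataclass
class Correction:
    """A human-reviewed disagreement used to calibrate the judge."""
    trace_summary: str
    judge_score: str
    corrected_score: str
    note: str


@dataclass
class CalibratedJudge:
    criteria: str  # e.g. "Flag responses containing unsupported factual claims."
    corrections: list[Correction] = field(default_factory=list)

    def add_correction(self, correction: Correction) -> None:
        # Each reviewed disagreement becomes a calibration example in future prompts.
        self.corrections.append(correction)

    def score(self, trace_summary: str) -> str:
        examples = "\n".join(
            f"- Trace: {c.trace_summary}\n  Correct score: {c.corrected_score} ({c.note})"
            for c in self.corrections
        )
        prompt = (
            f"You are an evaluator. Criteria: {self.criteria}\n"
            f"Calibration examples from human review:\n{examples or 'none yet'}\n\n"
            f"Trace to evaluate:\n{trace_summary}\n\n"
            "Respond with PASS or FAIL and a one-sentence reason."
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
```

Folding corrections into the prompt is one simple way to realize the "improves as you correct it" behavior; production systems may instead retrain a rubric or retrieve the most relevant past corrections per trace.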

When to use

Use a calibrated judge when you want consistent, scalable evaluation of:
  • Hallucinations and safety/policy violations
  • Response quality (helpfulness, tone, structure)
  • Latency, cost, and error patterns tied to behaviors