Judges get better the more you correct them. Each time you mark an evaluation as right or wrong, that correction is stored and used to refine future scoring. This is calibration.
## Calibrating in the dashboard
For each evaluated item in the console, you can mark the judge’s assessment as correct or incorrect and optionally provide the expected answer.
## Calibrating programmatically
Submit corrections via the SDK or REST API. This is useful for bulk calibration from automated pipelines, custom review workflows, or external labeling tools.
### Finding the right IDs
Judge evaluations involve two related spans:
| ID | Description |
|---|---|
| Source Span ID | The original LLM call that was evaluated |
| Judge Call Span ID | The span created when the judge ran its evaluation |
Where to find each ID:

| ID | Where to Find It |
|---|---|
| Task Slug | In the judge settings, or the URL when editing the judge’s prompt |
| Span ID | In the evaluation modal, or via `get_judge_evaluations()` |
| Judge ID | In the URL when viewing a judge (`/judges/{judge_id}`) |
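If you are scripting against dashboard URLs, the judge ID can be sliced out with plain string handling. A minimal sketch; only the `/judges/{judge_id}` path shape comes from the table above, and the example hostname is a placeholder:

```python
from urllib.parse import urlparse

def judge_id_from_url(url: str) -> str:
    """Extract the judge ID from a dashboard URL ending in /judges/{judge_id}."""
    parts = urlparse(url).path.rstrip("/").split("/")
    if "judges" not in parts or parts.index("judges") + 1 >= len(parts):
        raise ValueError(f"no judge ID found in {url!r}")
    return parts[parts.index("judges") + 1]

print(judge_id_from_url("https://example.com/judges/aut_123"))  # aut_123
```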
The easiest way to get the correct IDs: open a judge evaluation in the
dashboard, expand “SDK Integration”, and click “Copy” to get pre-filled code.
### Binary judges
Mark a judge evaluation as correct or incorrect:
```python
import zeroeval as ze

ze.send_feedback(
    prompt_slug="your-judge-task-slug",
    completion_id="span-id-here",
    thumbs_up=True,
    reason="Judge correctly identified the issue",
    judge_id="automation-id-here",
)
```
### Scored judges
For judges using scored rubrics, provide the expected score and direction:
```python
ze.send_feedback(
    prompt_slug="quality-scorer",
    completion_id="span-id-here",
    thumbs_up=False,
    judge_id="automation-id-here",
    expected_score=3.5,
    score_direction="too_high",
    reason="Score should have been lower due to grammar issues",
)
```
### Per-criterion feedback
For scored judges with multiple criteria, correct individual criterion scores:
```python
ze.send_feedback(
    prompt_slug="quality-scorer",
    completion_id="span-id-here",
    thumbs_up=False,
    judge_id="automation-id-here",
    reason="Criterion-level score adjustments",
    criteria_feedback={
        "CTA_text": {
            "expected_score": 4.0,
            "reason": "CTA is clear and prominent",
        },
        "CX-004": {
            "expected_score": 1.0,
            "reason": "Required phone number is missing",
        },
    },
)
```
To discover valid criterion keys before sending per-criterion feedback:
```python
criteria = ze.get_judge_criteria(
    project_id="your-project-id",
    judge_id="automation-id-here",
)
for c in criteria["criteria"]:
    print(c["key"], c.get("description"))
```
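One way to use the discovered keys is to pre-build a `criteria_feedback` skeleton, so corrections always reference valid criteria. A sketch that assumes only the `criteria`/`key` response shape shown above:

```python
def feedback_skeleton(criteria_response: dict) -> dict:
    """Map each criterion key to an empty correction entry to be filled in by a reviewer."""
    return {
        c["key"]: {"expected_score": None, "reason": ""}
        for c in criteria_response["criteria"]
    }

# Example response shape, mirroring the keys used earlier on this page.
sample = {"criteria": [{"key": "CTA_text"}, {"key": "CX-004"}]}
print(feedback_skeleton(sample))
```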
### Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `prompt_slug` | `str` | Yes | Task slug associated with the judge |
| `completion_id` | `str` | Yes | Span ID being evaluated |
| `thumbs_up` | `bool` | Yes | `True` if the judge was correct, `False` if wrong |
| `reason` | `str` | No | Explanation of the correction |
| `judge_id` | `str` | Yes | Judge automation ID |
| `expected_score` | `float` | No | Expected score (scored judges only) |
| `score_direction` | `str` | No | `"too_high"` or `"too_low"` (scored judges only) |
| `criteria_feedback` | `dict` | No | Per-criterion corrections (scored judges only) |
### Bulk calibration
Iterate through evaluations and submit corrections programmatically:
```python
evaluations = ze.get_judge_evaluations(
    project_id="your-project-id",
    judge_id="your-judge-id",
    limit=100,
)

# Use a name other than `eval`, which shadows the Python built-in.
for evaluation in evaluations["evaluations"]:
    is_correct = your_review_logic(evaluation)  # your own review criteria
    ze.send_feedback(
        prompt_slug="your-judge-task-slug",
        completion_id=evaluation["span_id"],
        thumbs_up=is_correct,
        reason="Automated review",
        judge_id="your-judge-id",
    )
```
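The `your_review_logic` placeholder above might, for example, compare each evaluation against a hand-labeled answer key. A hypothetical sketch; the `passed` field is an assumption about the evaluation shape, while `span_id` appears in the loop above:

```python
# Hypothetical ground-truth labels keyed by span ID, e.g. from a CSV export.
GROUND_TRUTH = {"span-abc": True, "span-def": False}

def your_review_logic(evaluation: dict) -> bool:
    """The judge is 'correct' when its pass/fail verdict matches our label."""
    label = GROUND_TRUTH.get(evaluation["span_id"])
    if label is None:
        return True  # unlabeled span: don't submit a miscalibrating correction
    return evaluation["passed"] == label

print(your_review_logic({"span_id": "span-abc", "passed": True}))  # True
```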