Judges get better the more you correct them. Each time you mark an evaluation as right or wrong, that correction is stored and used to refine future scoring. This is calibration.

Calibrating in the dashboard

For each evaluated item in the console, you can mark the judge’s assessment as correct or incorrect and optionally provide the expected answer.

Calibrating programmatically

Submit corrections via the SDK or REST API. This is useful for bulk calibration from automated pipelines, custom review workflows, or external labeling tools.

Finding the right IDs

Judge evaluations involve two related spans:

| ID | Description |
| --- | --- |
| Source Span ID | The original LLM call that was evaluated |
| Judge Call Span ID | The span created when the judge ran its evaluation |

Where to find each ID:

| ID | Where to Find It |
| --- | --- |
| Task Slug | In the judge settings, or in the URL when editing the judge’s prompt |
| Span ID | In the evaluation modal, or via `get_judge_evaluations()` |
| Judge ID | In the URL when viewing a judge (`/judges/{judge_id}`) |
The easiest way to get the correct IDs: open a judge evaluation in the dashboard, expand “SDK Integration”, and click “Copy” to get pre-filled code.

Binary judges

Mark a judge evaluation as correct or incorrect:
```python
import zeroeval as ze

ze.send_feedback(
    prompt_slug="your-judge-task-slug",
    completion_id="span-id-here",
    thumbs_up=True,
    reason="Judge correctly identified the issue",
    judge_id="automation-id-here",
)
```

Scored judges

For judges using scored rubrics, provide the expected score and direction:
```python
ze.send_feedback(
    prompt_slug="quality-scorer",
    completion_id="span-id-here",
    thumbs_up=False,
    judge_id="automation-id-here",
    expected_score=3.5,
    score_direction="too_high",
    reason="Score should have been lower due to grammar issues",
)
```

Per-criterion feedback

For scored judges with multiple criteria, correct individual criterion scores:
```python
ze.send_feedback(
    prompt_slug="quality-scorer",
    completion_id="span-id-here",
    thumbs_up=False,
    judge_id="automation-id-here",
    reason="Criterion-level score adjustments",
    criteria_feedback={
        "CTA_text": {
            "expected_score": 4.0,
            "reason": "CTA is clear and prominent",
        },
        "CX-004": {
            "expected_score": 1.0,
            "reason": "Required phone number is missing",
        },
    },
)
```
To discover valid criterion keys before sending per-criterion feedback:
```python
criteria = ze.get_judge_criteria(
    project_id="your-project-id",
    judge_id="automation-id-here",
)

for c in criteria["criteria"]:
    print(c["key"], c.get("description"))
```
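A typo in a criterion key means the correction silently targets nothing, so it can be worth filtering your corrections against the discovered keys before calling `send_feedback`. The helper below is an illustrative sketch, not part of the SDK; it assumes only the `criteria` response shape shown above.

```python
def filter_to_known_criteria(corrections, criteria):
    """Keep only corrections whose keys match a discovered criterion.

    `corrections` maps criterion key -> {"expected_score": ..., "reason": ...};
    `criteria` is the response returned by ze.get_judge_criteria().
    """
    known = {c["key"] for c in criteria["criteria"]}
    return {key: value for key, value in corrections.items() if key in known}
```

Pass the filtered dict as `criteria_feedback`, and log any keys that were dropped so the typo can be fixed at the source.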

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `prompt_slug` | str | Yes | Task slug associated with the judge |
| `completion_id` | str | Yes | Span ID being evaluated |
| `thumbs_up` | bool | Yes | `True` if the judge was correct, `False` if it was wrong |
| `reason` | str | No | Explanation of the correction |
| `judge_id` | str | Yes | Judge automation ID |
| `expected_score` | float | No | Expected score (scored judges only) |
| `score_direction` | str | No | `"too_high"` or `"too_low"` (scored judges only) |
| `criteria_feedback` | dict | No | Per-criterion corrections (scored judges only) |
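If you build feedback payloads in an automated pipeline, a small client-side check can catch invalid values before they reach the API. This helper is illustrative, not part of the SDK; it only encodes the constraints listed in the table above.

```python
VALID_DIRECTIONS = {"too_high", "too_low"}

def validate_scored_feedback(expected_score=None, score_direction=None):
    """Check scored-judge parameters before calling ze.send_feedback().

    Raises ValueError/TypeError on invalid input; returns None on success.
    """
    if score_direction is not None and score_direction not in VALID_DIRECTIONS:
        raise ValueError(
            f"score_direction must be one of {sorted(VALID_DIRECTIONS)}, "
            f"got {score_direction!r}"
        )
    if expected_score is not None and not isinstance(expected_score, (int, float)):
        raise TypeError("expected_score must be a number")
```

Call it just before `ze.send_feedback()` so a bad value fails fast in your pipeline instead of producing a rejected or misinterpreted correction.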

Bulk calibration

Iterate through evaluations and submit corrections programmatically:
```python
evaluations = ze.get_judge_evaluations(
    project_id="your-project-id",
    judge_id="your-judge-id",
    limit=100,
)

for evaluation in evaluations["evaluations"]:
    is_correct = your_review_logic(evaluation)

    ze.send_feedback(
        prompt_slug="your-judge-task-slug",
        completion_id=evaluation["span_id"],
        thumbs_up=is_correct,
        reason="Automated review",
        judge_id="your-judge-id",
    )
```
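One simple way to implement `your_review_logic` is to look evaluations up in a mapping of span IDs to human verdicts, e.g. exported from a labeling tool. This is a minimal sketch under that assumption; `human_labels` is hypothetical and not part of the SDK, and the only evaluation field it relies on is `span_id`, which `get_judge_evaluations()` returns as shown above.

```python
def make_review_logic(human_labels):
    """Build a review function from a {span_id: bool} mapping.

    `human_labels` maps a span ID to True (judge was correct) or
    False (judge was wrong). Unlabeled spans default to False here;
    in practice you would likely skip them instead.
    """
    def review(evaluation):
        return human_labels.get(evaluation["span_id"], False)
    return review

your_review_logic = make_review_logic({"span-123": True, "span-456": False})
```

Defaulting unlabeled spans to `False` would submit spurious corrections, so in a real pipeline you would filter the loop to labeled spans first.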