Judges get better the more you correct them. Each time you mark an evaluation as right or wrong, that correction is stored and used to refine future scoring. This is calibration.
## Calibrating in the dashboard
For each evaluated item in the console, you can mark the judge’s assessment as correct or incorrect and optionally provide the expected answer.
## Calibrating programmatically
Submit corrections via the SDK or REST API. This is useful for bulk calibration from automated pipelines, custom review workflows, or external labeling tools.
### Finding the right IDs
Judge evaluations involve two related spans:
| ID | Description |
|---|---|
| Source Span ID | The original LLM call that was evaluated |
| Judge Call Span ID | The span created when the judge ran its evaluation |
Where to find each ID:

| ID | Where to Find It |
|---|---|
| Task Slug | In the judge settings, or the URL when editing the judge’s prompt |
| Span ID | In the evaluation modal, or via `get_judge_evaluations()` |
| Judge ID | In the URL when viewing a judge (`/judges/{judge_id}`) |
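If you are scripting against dashboard URLs, the judge ID can be sliced out with plain string handling. A minimal sketch; only the `/judges/{judge_id}` path shape comes from the table above, and the example hostname is a placeholder:

```python
from urllib.parse import urlparse

def judge_id_from_url(url: str) -> str:
    """Extract the judge ID from a dashboard URL ending in /judges/{judge_id}."""
    parts = urlparse(url).path.rstrip("/").split("/")
    if "judges" not in parts or parts.index("judges") + 1 >= len(parts):
        raise ValueError(f"no judge ID found in {url!r}")
    return parts[parts.index("judges") + 1]

print(judge_id_from_url("https://example.com/judges/aut_123"))  # aut_123
```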
The easiest way to get the correct IDs: open a judge evaluation in the
dashboard, expand “SDK Integration”, and click “Copy” to get pre-filled code.
### Binary judges
Mark a judge evaluation as correct or incorrect:
```python
import zeroeval as ze

ze.send_feedback(
    prompt_slug="your-judge-task-slug",
    completion_id="span-id-here",
    thumbs_up=True,
    reason="Judge correctly identified the issue",
    judge_id="automation-id-here",
)
```
### Scored judges
For judges using scored rubrics, provide the expected score and direction:
```python
ze.send_feedback(
    prompt_slug="quality-scorer",
    completion_id="span-id-here",
    thumbs_up=False,
    judge_id="automation-id-here",
    expected_score=3.5,
    score_direction="too_high",
    reason="Score should have been lower due to grammar issues",
)
```
### Per-criterion feedback
For scored judges with multiple criteria, correct individual criterion scores:
```python
ze.send_feedback(
    prompt_slug="quality-scorer",
    completion_id="span-id-here",
    thumbs_up=False,
    judge_id="automation-id-here",
    reason="Criterion-level score adjustments",
    criteria_feedback={
        "CTA_text": {
            "expected_score": 4.0,
            "reason": "CTA is clear and prominent",
        },
        "CX-004": {
            "expected_score": 1.0,
            "reason": "Required phone number is missing",
        },
    },
)
```
To discover valid criterion keys before sending per-criterion feedback:
```python
criteria = ze.get_judge_criteria(
    project_id="your-project-id",
    judge_id="automation-id-here",
)
for c in criteria["criteria"]:
    print(c["key"], c.get("description"))
```
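One way to use the discovered keys is to pre-build a `criteria_feedback` skeleton, so corrections always reference valid criteria. A sketch that assumes only the `criteria`/`key` response shape shown above:

```python
def feedback_skeleton(criteria_response: dict) -> dict:
    """Map each criterion key to an empty correction entry to be filled in by a reviewer."""
    return {
        c["key"]: {"expected_score": None, "reason": ""}
        for c in criteria_response["criteria"]
    }

# Example response shape, mirroring the keys used earlier on this page.
sample = {"criteria": [{"key": "CTA_text"}, {"key": "CX-004"}]}
print(feedback_skeleton(sample))
```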
### Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `prompt_slug` | `str` | Yes | Task slug associated with the judge |
| `completion_id` | `str` | Yes | Span ID being evaluated |
| `thumbs_up` | `bool` | Yes | `True` if the judge was correct, `False` if wrong |
| `reason` | `str` | No | Explanation of the correction |
| `judge_id` | `str` | Yes | Judge automation ID |
| `expected_score` | `float` | No | Expected score (scored judges only) |
| `score_direction` | `str` | No | `"too_high"` or `"too_low"` (scored judges only) |
| `criteria_feedback` | `dict` | No | Per-criterion corrections (scored judges only) |
### Bulk calibration
Iterate through evaluations and submit corrections programmatically:
```python
evaluations = ze.get_judge_evaluations(
    project_id="your-project-id",
    judge_id="your-judge-id",
    limit=100,
)

# Use a name other than `eval`, which shadows the Python built-in.
for evaluation in evaluations["evaluations"]:
    is_correct = your_review_logic(evaluation)  # your own review criteria
    ze.send_feedback(
        prompt_slug="your-judge-task-slug",
        completion_id=evaluation["span_id"],
        thumbs_up=is_correct,
        reason="Automated review",
        judge_id="your-judge-id",
    )
```
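The `your_review_logic` placeholder above might, for example, compare each evaluation against a hand-labeled answer key. A hypothetical sketch; the `passed` field is an assumption about the evaluation shape, while `span_id` appears in the loop above:

```python
# Hypothetical ground-truth labels keyed by span ID, e.g. from a CSV export.
GROUND_TRUTH = {"span-abc": True, "span-def": False}

def your_review_logic(evaluation: dict) -> bool:
    """The judge is 'correct' when its pass/fail verdict matches our label."""
    label = GROUND_TRUTH.get(evaluation["span_id"])
    if label is None:
        return True  # unlabeled span: don't submit a miscalibrating correction
    return evaluation["passed"] == label

print(your_review_logic({"span_id": "span-abc", "passed": True}))  # True
```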