A/B Tests

TL;DR

1. Pull (or create) a dataset
2. Write a task – any Python function that maps a row ➜ model output
3. Optionally write evaluators – functions that score (row, output)
4. Wrap them in ze.Experiment and call .run()

Minimal example

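import zeroeval as ze

ze.init()
# Pull an existing dataset by name
dataset = ze.Dataset.pull("Capitals")

def task(row):
    # Maps one dataset row to a model output (imagine calling an LLM here)
    return row["input"].upper()

def exact_match(row, output):
    # Evaluator: scores a (row, output) pair against the expected value
    return row["output"].upper() == output

exp = ze.Experiment(
    dataset=dataset,
    task=task,
    evaluators=[exact_match],
    name="Capitals-baseline"
)

# Run the experiment (step 4 of the TL;DR)
exp.run()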
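
To make the row ➜ output ➜ score flow concrete, here is a rough plain-Python sketch of what comparing two task variants amounts to. It does not use the zeroeval package and is not how ze.Experiment is implemented; the rows, task_a, task_b, and run_local are all hypothetical names invented for illustration.

```python
from statistics import mean

# Hypothetical stand-in rows; a real run would use the pulled dataset.
rows = [
    {"input": "paris", "output": "Paris"},
    {"input": "rome", "output": "Rome"},
]

def task_a(row):
    # Variant A: same transformation as the minimal example.
    return row["input"].upper()

def task_b(row):
    # Variant B: a deliberately different transformation, for comparison.
    return row["input"].title()

def exact_match(row, output):
    # Same evaluator as the minimal example.
    return row["output"].upper() == output

def run_local(rows, task, evaluators):
    """Apply the task to every row, score each (row, output) pair,
    and return the mean score per evaluator."""
    scores = {ev.__name__: [] for ev in evaluators}
    for row in rows:
        output = task(row)
        for ev in evaluators:
            scores[ev.__name__].append(float(ev(row, output)))
    return {name: mean(vals) for name, vals in scores.items()}

result_a = run_local(rows, task_a, [exact_match])
result_b = run_local(rows, task_b, [exact_match])
print("A:", result_a)
print("B:", result_b)
```

Running both variants over the same rows with the same evaluators is what makes the scores comparable; only the task differs between the two runs.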
