TL;DR

  1. Pull (or create) a dataset
  2. Write a task – any Python function that maps a row ➜ model output
  3. Optionally write evaluators – functions that score (row, output)
  4. Wrap them in ze.Experiment and call .run()

Minimal example

import zeroeval as ze

ze.init()
dataset = ze.Dataset.pull("Capitals")

def task(row):
    # imagine calling an LLM here
    return row["input"].upper()

def exact_match(row, output):
    return row["output"].upper() == output

exp = ze.Experiment(
    dataset=dataset,
    task=task,
    evaluators=[exact_match],
    name="Capitals-baseline"
)