TL;DR

  1. Pull (or create) a dataset
  2. Write a task – any Python function that maps a row ➜ model output
  3. Optionally write evaluators – functions that score (row, output)
  4. Wrap them in ze.Experiment and call .run()

Minimal example

import zeroeval as ze

ze.init()
dataset = ze.Dataset.pull("Capitals")

def task(row):
    # imagine calling an LLM here
    return row["input"].upper()

def exact_match(row, output):
    # compare the task's output against the expected answer stored in the row
    return row["output"].upper() == output

exp = ze.Experiment(
    dataset=dataset,
    task=task,
    evaluators=[exact_match],
    name="Capitals-baseline"
)

results = exp.run()     # uploads outputs + scores

results is a list of ExperimentResult objects; the run also appears in the ZeroEval web console.
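
For a quick local sanity check you can inspect the returned list directly. A minimal sketch that only assumes results behaves like an ordinary Python list (one ExperimentResult per row is an assumption, not a documented guarantee):

print(len(results))    # how many rows were run
print(results[0])      # inspect the first ExperimentResult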

Running just the task

Need to iterate fast? Run the task first, add evaluators later:
results = exp.run_task()                       # only model calls
exp.run_evaluators([exact_match], results)     # score later (or change evaluator)
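
This makes it cheap to tweak metrics without repeating the model calls. A small sketch that reuses the (row, output) evaluator signature from above; the length_check evaluator is purely illustrative:

def length_check(row, output):
    # illustrative second metric: did the model return a non-empty string?
    return len(output) > 0

exp.run_evaluators([exact_match, length_check], results)   # re-score the same outputs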

Subsets & slices

exp.run(dataset[:100])   # only the first 100 rows
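
Any subset you can express as a slice works the same way. A sketch, assuming Dataset supports standard Python slicing (as the prefix slice above suggests):

exp.run(dataset[:10])       # tiny smoke test while iterating
exp.run(dataset[50:100])    # a specific window of rows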

Tracing and spans

If you have the tracing decorators installed, any spans inside your task are automatically picked up:
import zeroeval as ze

@ze.span(name="model_call")
def task(row):
    # call_llm stands in for your own model call; everything inside this
    # function is traced under the "model_call" span
    return call_llm(row["input"])
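
Spans on helper functions called from the task are picked up as well, since they are created inside the task. A sketch where retrieve_context and call_llm stand in for your own code:

@ze.span(name="retrieval")
def retrieve_context(query):
    # placeholder retrieval step; runs as its own span inside the task
    return f"context for {query}"

@ze.span(name="model_call")
def task(row):
    context = retrieve_context(row["input"])
    return call_llm(row["input"], context)   # call_llm is a placeholder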

Multimodal experiments

Experiments work the same way with datasets that contain images, audio, or video: pass those fields to your model API and write an evaluator that fits. A full multimodal example lives in zeroeval-sdk/examples/experiments.2.multimodal.py.
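
A rough sketch of the shape of such an experiment (the image_url and expected_caption field names and the describe_image helper are illustrative assumptions, not part of the SDK):

def task(row):
    # pass the media field straight to your model API
    return describe_image(row["image_url"])

def caption_match(row, output):
    # score however makes sense for the modality
    return row["expected_caption"].lower() in output.lower()

ze.Experiment(dataset=dataset, task=task, evaluators=[caption_match], name="captions-baseline").run()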