Installation

pip install zeroeval

Core Functions

init()

Initializes the ZeroEval SDK. Must be called before using any other SDK features.
def init(
    api_key: str = None,
    workspace_name: str = "Personal Organization",
    organization_name: str = None,
    debug: bool = False,
    api_url: str = None,
    disabled_integrations: list[str] = None,
    enabled_integrations: list[str] = None,
    setup_otlp: bool = True,
    service_name: str = "zeroeval-app",
    tags: dict[str, str] = None,
    sampling_rate: float = None
) -> None
Parameters:
  • api_key (str, optional): API key. Falls back to the ZEROEVAL_API_KEY env var. Defaults to None
  • workspace_name (str): Deprecated; use organization_name instead. Defaults to "Personal Organization"
  • organization_name (str, optional): Organization name
  • debug (bool): Enable debug logging with colors. Defaults to False
  • api_url (str, optional): API endpoint URL. Defaults to "https://api.zeroeval.com"
  • disabled_integrations (list[str], optional): Integrations to disable (e.g. ["langchain"])
  • enabled_integrations (list[str], optional): If set, only these integrations are enabled
  • setup_otlp (bool): Configure OpenTelemetry OTLP export. Defaults to True
  • service_name (str): OTLP service name. Defaults to "zeroeval-app"
  • tags (dict[str, str], optional): Global tags applied to all spans
  • sampling_rate (float, optional): Sampling rate from 0.0 to 1.0 (1.0 = sample all)
Example:
import zeroeval as ze

ze.init(
    api_key="your-api-key",
    sampling_rate=0.1,
    disabled_integrations=["langchain"],
    debug=True
)
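The sketch below illustrates the semantics of a 0.0-1.0 sampling rate; it is an illustrative sketch, not the SDK's internal sampler:

```python
import random

def should_sample(rate: float) -> bool:
    # rate=1.0 keeps every trace; rate=0.0 keeps none; rate=0.1 keeps ~10%
    return random.random() < rate
```

With sampling_rate=0.1 as in the example above, roughly one in ten traces is exported.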

Decorators

@span

Decorator and context manager for creating spans around code blocks.
@span(
    name: str,
    session_id: Optional[str] = None,
    session: Optional[Union[str, dict[str, str]]] = None,
    attributes: Optional[dict[str, Any]] = None,
    input_data: Optional[str] = None,
    output_data: Optional[str] = None,
    tags: Optional[dict[str, str]] = None
)
Parameters:
  • name (str): Name of the span
  • session_id (str, optional): Deprecated; use the session parameter instead
  • session (Union[str, dict], optional): Session information. Can be:
    • A string containing the session ID
    • A dict with {"id": "...", "name": "..."}
  • attributes (dict, optional): Additional attributes to attach to the span
  • input_data (str, optional): Manual input data override
  • output_data (str, optional): Manual output data override
  • tags (dict, optional): Tags to attach to the span
Usage as Decorator:
import zeroeval as ze

@ze.span(name="calculate_sum")
def add_numbers(a: int, b: int) -> int:
    return a + b  # Parameters and return value automatically captured

# With manual I/O
@ze.span(name="process_data", input_data="manual input", output_data="manual output")
def process():
    # Process logic here
    pass

# With session
@ze.span(name="user_action", session={"id": "123", "name": "John's Session"})
def user_action():
    pass
Usage as Context Manager:
import zeroeval as ze

with ze.span(name="data_processing") as current_span:
    result = process_data()
    current_span.set_io(input_data="input", output_data=str(result))

artifact_span

Ergonomic wrapper for creating artifact-bearing spans. Produces a span with kind="llm" and the completion_artifact_* attributes pre-filled so the prompt completions page surfaces it as a first-class artifact.
artifact_span(
    name: str,
    *,
    artifact_type: str,
    role: str = "primary",
    label: Optional[str] = None,
    kind: str = "llm",
    session_id: Optional[str] = None,
    session: Optional[Union[str, dict[str, str]]] = None,
    attributes: Optional[dict[str, Any]] = None,
    input_data: Optional[str] = None,
    output_data: Optional[str] = None,
    tags: Optional[dict[str, str]] = None,
)
Parameters:
  • name (str): Name of the span
  • artifact_type (str): Artifact type identifier (e.g. "final_decision", "customer_card")
  • role (str): "primary" (used for row preview) or "secondary". Defaults to "primary"
  • label (str, optional): Human-friendly label shown in the artifact switcher. Defaults to the span name
  • kind (str): Span kind. Defaults to "llm"
  • session (Union[str, dict], optional): Session information
  • attributes (dict, optional): Additional attributes merged with artifact metadata. Artifact keys take precedence
  • input_data (str, optional): Manual input data override
  • output_data (str, optional): Manual output data override
  • tags (dict, optional): Tags to attach to the span
Usage as Context Manager:
import zeroeval as ze

with ze.artifact_span(
    name="final-decision",
    artifact_type="final_decision",
    role="primary",
    label="Final Decision",
    tags={"judge_target": "support_ops_final_decision"},
) as s:
    s.set_io(input_data="ticket text", output_data=decision_json)
Usage as Decorator:
@ze.artifact_span(
    name="generate-card",
    artifact_type="customer_card",
    role="secondary",
    label="Customer Card",
)
def generate_card(ticket):
    return render_card(ticket)
ze.artifact_span is currently available in the Python SDK only.

@experiment

Decorator that attaches dataset and model information to a function.
@experiment(
    dataset: Optional[Dataset] = None,
    model: Optional[str] = None
)
Parameters:
  • dataset (Dataset, optional): Dataset to use for the experiment
  • model (str, optional): Model identifier
Example:
import zeroeval as ze

dataset = ze.Dataset.pull("my-dataset")

@ze.experiment(dataset=dataset, model="gpt-4")
def my_experiment():
    # Experiment logic
    pass

Classes

Dataset

A class to represent a named collection of dictionary records.

Constructor

Dataset(
    name: str,
    data: list[dict[str, Any]],
    description: Optional[str] = None
)
Parameters:
  • name (str): The name of the dataset
  • data (list[dict]): A list of dictionaries containing the data
  • description (str, optional): A description of the dataset
Example:
dataset = Dataset(
    name="Capitals",
    description="Country to capital mapping",
    data=[
        {"input": "France", "output": "Paris"},
        {"input": "Germany", "output": "Berlin"}
    ]
)

Methods

push()
Push the dataset to the backend, creating a new version if it already exists.
def push(self, create_new_version: bool = False) -> Dataset
Parameters:
  • self: The Dataset instance
  • create_new_version (bool, optional): Retained for backward compatibility; new versions are created automatically when the dataset name already exists. Defaults to False
Returns: self, for method chaining
pull()
Class method to pull a dataset from the backend.
@classmethod
def pull(
    cls,
    dataset_name: str,
    version_number: Optional[int] = None
) -> Dataset
Parameters:
  • cls: The Dataset class itself (automatically provided when using @classmethod)
  • dataset_name (str): The name of the dataset to pull from the backend
  • version_number (int, optional): Specific version number to pull. If not provided, pulls the latest version
Returns: A new Dataset instance populated with data from the backend
add_rows()
Add new rows to the dataset.
def add_rows(self, new_rows: list[dict[str, Any]]) -> None
Parameters:
  • self: The Dataset instance
  • new_rows (list[dict]): A list of dictionaries representing the rows to add
add_image()
Add an image to a specific row.
def add_image(
    self,
    row_index: int,
    column_name: str,
    image_path: str
) -> None
Parameters:
  • self: The Dataset instance
  • row_index (int): Index of the row to update (0-based)
  • column_name (str): Name of the column to add the image to
  • image_path (str): Path to the image file to add
add_audio()
Add audio to a specific row.
def add_audio(
    self,
    row_index: int,
    column_name: str,
    audio_path: str
) -> None
Parameters:
  • self: The Dataset instance
  • row_index (int): Index of the row to update (0-based)
  • column_name (str): Name of the column to add the audio to
  • audio_path (str): Path to the audio file to add
add_media_url()
Add a media URL to a specific row.
def add_media_url(
    self,
    row_index: int,
    column_name: str,
    media_url: str,
    media_type: str = "image"
) -> None
Parameters:
  • self: The Dataset instance
  • row_index (int): Index of the row to update (0-based)
  • column_name (str): Name of the column to add the media URL to
  • media_url (str): URL pointing to the media file
  • media_type (str, optional): Type of media ("image", "audio", or "video"). Defaults to "image"

Properties

  • name (str): The name of the dataset
  • description (str): The description of the dataset
  • columns (list[str]): List of all unique column names
  • data (list[dict]): List of the data portion for each row
  • backend_id (str): The ID in the backend (after pushing)
  • version_id (str): The version ID in the backend
  • version_number (int): The version number in the backend

Example

import zeroeval as ze

# Create a dataset
dataset = ze.Dataset(
    name="Capitals",
    description="Country to capital mapping",
    data=[
        {"input": "France", "output": "Paris"},
        {"input": "Germany", "output": "Berlin"}
    ]
)

# Push to backend
dataset.push()

# Pull from backend
dataset = ze.Dataset.pull("Capitals", version_number=1)

# Add rows
dataset.add_rows([{"input": "Italy", "output": "Rome"}])

# Add multimodal data
dataset.add_image(0, "flag", "flags/france.png")
dataset.add_audio(0, "anthem", "anthems/france.mp3")
dataset.add_media_url(0, "video_url", "https://example.com/video.mp4", "video")

Experiment

Represents an experiment that runs a task on a dataset with optional evaluators.

Constructor

Experiment(
    dataset: Dataset,
    task: Callable[[Any], Any],
    evaluators: Optional[list[Callable[[Any, Any], Any]]] = None,
    name: Optional[str] = None,
    description: Optional[str] = None
)
Parameters:
  • dataset (Dataset): The dataset to run the experiment on
  • task (Callable): Function that processes each row and returns output
  • evaluators (list[Callable], optional): List of evaluator functions that take (row, output) and return an evaluation result
  • name (str, optional): Name of the experiment. Defaults to the task function's name
  • description (str, optional): Description of the experiment. Defaults to the task function's docstring
Example:
import zeroeval as ze

ze.init()

# Pull dataset
dataset = ze.Dataset.pull("Capitals")

# Define task
def capitalize_task(row):
    return row["input"].upper()

# Define evaluator
def exact_match(row, output):
    return row["output"].upper() == output

# Create and run experiment
exp = ze.Experiment(
    dataset=dataset,
    task=capitalize_task,
    evaluators=[exact_match],
    name="Capital Uppercase Test"
)

results = exp.run()

# Or run task and evaluators separately
results = exp.run_task()
exp.run_evaluators([exact_match], results)

Methods

run()
Run the complete experiment (task + evaluators).
def run(
    self,
    subset: Optional[list[dict]] = None
) -> list[ExperimentResult]
Parameters:
  • self: The Experiment instance
  • subset (list[dict], optional): Subset of dataset rows to run the experiment on. If None, runs on entire dataset
Returns: List of experiment results for each row
run_task()
Run only the task without evaluators.
def run_task(
    self,
    subset: Optional[list[dict]] = None,
    raise_on_error: bool = False
) -> list[ExperimentResult]
Parameters:
  • self: The Experiment instance
  • subset (list[dict], optional): Subset of dataset rows to run the task on. If None, runs on entire dataset
  • raise_on_error (bool, optional): If True, raises exceptions encountered during task execution. If False, captures errors. Defaults to False
Returns: List of experiment results for each row
run_evaluators()
Run evaluators on existing results.
def run_evaluators(
    self,
    evaluators: Optional[list[Callable[[Any, Any], Any]]] = None,
    results: Optional[list[ExperimentResult]] = None
) -> list[ExperimentResult]
Parameters:
  • self: The Experiment instance
  • evaluators (list[Callable], optional): List of evaluator functions to run. If None, uses evaluators from the Experiment instance
  • results (list[ExperimentResult], optional): List of results to evaluate. If None, uses results from the Experiment instance
Returns: The evaluated results
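Evaluators are plain functions of (row, output) and may return booleans or numeric scores. A sketch of a float-returning evaluator (length_match is an illustrative name, not an SDK helper):

```python
def length_match(row: dict, output: str) -> float:
    # Evaluators receive the dataset row and the task's output;
    # returning a float records a numeric score instead of pass/fail
    return float(len(output) == len(row["output"]))
```

Pass it alongside others, e.g. exp.run_evaluators([exact_match, length_match], results).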

Span

Represents a span in the tracing system. Usually created via the @span decorator.

Methods

set_io()
Set input and output data for the span.
def set_io(
    self,
    input_data: Optional[str] = None,
    output_data: Optional[str] = None
) -> None
Parameters:
  • self: The Span instance
  • input_data (str, optional): Input data to attach to the span. Will be converted to string if not already
  • output_data (str, optional): Output data to attach to the span. Will be converted to string if not already
set_tags()
Set tags on the span.
def set_tags(self, tags: dict[str, str]) -> None
Parameters:
  • self: The Span instance
  • tags (dict[str, str]): Dictionary of tags to set on the span
set_attributes()
Set attributes on the span.
def set_attributes(self, attributes: dict[str, Any]) -> None
Parameters:
  • self: The Span instance
  • attributes (dict[str, Any]): Dictionary of attributes to set on the span
set_error()
Set error information for the span.
def set_error(
    self,
    code: str,
    message: str,
    stack: Optional[str] = None
) -> None
Parameters:
  • self: The Span instance
  • code (str): Error code or exception class name
  • message (str): Error message
  • stack (str, optional): Stack trace information
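A common pattern is to populate set_error from a caught exception. The helper below is a hypothetical convenience, not part of the SDK:

```python
import traceback

def error_payload(exc: Exception) -> dict:
    # Map a caught exception onto set_error's (code, message, stack) arguments
    return {
        "code": type(exc).__name__,
        "message": str(exc),
        "stack": traceback.format_exc(),
    }
```

Inside an except block, call span.set_error(**error_payload(exc)).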
add_screenshot()
Attach a screenshot to the span for visual evaluation by LLM judges. Screenshots are uploaded during ingestion and can be evaluated alongside text data.
def add_screenshot(
    self,
    base64_data: str,
    viewport: str = "desktop",
    width: Optional[int] = None,
    height: Optional[int] = None,
    label: Optional[str] = None
) -> None
Parameters:
  • self: The Span instance
  • base64_data (str): Base64 encoded image data. Accepts raw base64 or data URL format (data:image/png;base64,...)
  • viewport (str, optional): Viewport type - "desktop", "mobile", or "tablet". Defaults to "desktop"
  • width (int, optional): Image width in pixels
  • height (int, optional): Image height in pixels
  • label (str, optional): Human-readable description of the screenshot
Example:
import zeroeval as ze

with ze.span(name="browser_test", tags={"test": "visual"}) as span:
    # Capture and attach a desktop screenshot
    span.add_screenshot(
        base64_data=desktop_screenshot_base64,
        viewport="desktop",
        width=1920,
        height=1080,
        label="Homepage - Desktop"
    )

    # Also capture mobile view
    span.add_screenshot(
        base64_data=mobile_screenshot_base64,
        viewport="mobile",
        width=375,
        height=812,
        label="Homepage - iPhone"
    )

    span.set_io(
        input_data="Navigate to homepage",
        output_data="Captured viewport screenshots"
    )
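add_screenshot (and add_image below) expect base64-encoded image data, in either raw or data-URL form. A minimal encoding helper, assuming you already have the raw image bytes (the function name is illustrative):

```python
import base64

def to_base64(image_bytes: bytes, as_data_url: bool = False) -> str:
    # Both the raw base64 string and the data-URL form are accepted
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:image/png;base64,{b64}" if as_data_url else b64
```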
add_image()
Attach a generic image to the span for visual evaluation. Use this for non-screenshot images like charts, diagrams, or UI component states.
def add_image(
    self,
    base64_data: str,
    label: Optional[str] = None,
    metadata: Optional[dict[str, Any]] = None
) -> None
Parameters:
  • self: The Span instance
  • base64_data (str): Base64 encoded image data. Accepts raw base64 or data URL format
  • label (str, optional): Human-readable description of the image
  • metadata (dict, optional): Additional metadata to store with the image
Example:
import zeroeval as ze

with ze.span(name="chart_generation") as span:
    # Generate a chart and attach it
    chart_base64 = generate_chart(data)

    span.add_image(
        base64_data=chart_base64,
        label="Monthly Revenue Chart",
        metadata={"chart_type": "bar", "data_points": 12}
    )

    span.set_io(
        input_data="Generate revenue chart for Q4",
        output_data="Chart generated with 12 data points"
    )
Attaching images via URL (S3 presigned or CDN)
If your images are already hosted externally, you can pass an HTTPS URL instead of base64 data. ZeroEval will download, validate, and copy the image into its own storage during ingestion. Supported URL sources:
  • S3 presigned URLs (*.amazonaws.com with valid authentication parameters)
  • CDN URLs from trusted domains
Attach URLs directly via attributes.attachments using the url key:
import boto3
import zeroeval as ze

# Option A: Presigned S3 URL
s3 = boto3.client("s3")
presigned_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-bucket", "Key": "images/chart.png"},
    ExpiresIn=300,
)

with ze.span(name="chart_generation") as span:
    span.attributes["attachments"] = [
        {
            "type": "image",
            "url": presigned_url,
            "label": "Monthly Revenue Chart",
        }
    ]

    span.set_io(
        input_data="Generate revenue chart for Q4",
        output_data="Chart generated"
    )
import zeroeval as ze

# Option B: CDN URL
cdn_url = "https://cdn.example.com/images/product-photo.png"

with ze.span(name="product_image_check") as span:
    span.attributes["attachments"] = [
        {
            "type": "image",
            "url": cdn_url,
            "label": "Product listing photo",
        }
    ]

    span.set_io(
        input_data="Check product image quality",
        output_data="Image attached for evaluation"
    )
Images attached to spans can be evaluated by LLM judges configured for multimodal evaluation. See the Multimodal Evaluation guide for setup instructions.

Context Functions

get_current_span()

Returns the currently active span, if any.
def get_current_span() -> Optional[Span]
Returns: The currently active Span instance, or None if no span is active

get_current_trace()

Returns the current trace ID.
def get_current_trace() -> Optional[str]
Returns: The current trace ID, or None if no trace is active

get_current_session()

Returns the current session ID.
def get_current_session() -> Optional[str]
Returns: The current session ID, or None if no session is active

set_tag()

Sets tags on a span, trace, or session.
def set_tag(
    target: Union[Span, str],
    tags: dict[str, str]
) -> None
Parameters:
  • target: The target to set tags on
    • Span: Sets tags on the specific span
    • str: Sets tags on the trace (if valid trace ID) or session (if valid session ID)
  • tags (dict[str, str]): Dictionary of tags to set
Example:
import zeroeval as ze

# Set tags on current span
current_span = ze.get_current_span()
if current_span:
    ze.set_tag(current_span, {"user_id": "12345", "environment": "production"})

# Set tags on trace
trace_id = ze.get_current_trace()
if trace_id:
    ze.set_tag(trace_id, {"version": "1.5"})

Judge Feedback APIs

send_feedback()

Programmatically submit user feedback for a completion or judge evaluation.
def send_feedback(
    *,
    prompt_slug: str,
    completion_id: str,
    thumbs_up: bool,
    reason: Optional[str] = None,
    expected_output: Optional[str] = None,
    metadata: Optional[dict] = None,
    judge_id: Optional[str] = None,
    expected_score: Optional[float] = None,
    score_direction: Optional[str] = None,
    criteria_feedback: Optional[dict] = None
) -> dict
Notes:
  • Existing usage without criteria_feedback is unchanged.
  • criteria_feedback is optional and supported for scored judges.
  • judge_id is required when sending expected_score, score_direction, or criteria_feedback.
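The judge_id requirement can be checked client-side before calling send_feedback. A hedged sketch mirroring the documented rule (not an SDK function):

```python
def validate_scored_feedback(judge_id=None, expected_score=None,
                             score_direction=None, criteria_feedback=None):
    # Mirrors the rule above: judge_id must accompany any scored-judge field
    scored = (expected_score, score_direction, criteria_feedback)
    if any(v is not None for v in scored) and judge_id is None:
        raise ValueError(
            "judge_id is required when sending expected_score, "
            "score_direction, or criteria_feedback"
        )
```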

get_judge_criteria()

Fetch normalized criteria metadata for a judge (useful before criterion-level feedback).
def get_judge_criteria(
    project_id: str,
    judge_id: str
) -> dict
Returns:
  • judge_id
  • evaluation_type
  • score_min, score_max, pass_threshold
  • criteria (list of {key, label, description})
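A sketch of consuming the returned dict; only the field names come from the list above, and the literal values are hypothetical:

```python
# Hypothetical response shaped like the fields documented above
criteria_meta = {
    "judge_id": "judge_abc",
    "evaluation_type": "scored",
    "score_min": 0.0,
    "score_max": 1.0,
    "pass_threshold": 0.7,
    "criteria": [
        {"key": "accuracy", "label": "Accuracy",
         "description": "Answer matches the source material"},
    ],
}

# Collect criterion keys before building criterion-level feedback
criterion_keys = [c["key"] for c in criteria_meta["criteria"]]
```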

CLI Commands

The ZeroEval SDK includes a CLI tool for running experiments and setup.

zeroeval run

Run a Python script containing ZeroEval experiments.
zeroeval run script.py

zeroeval setup

Interactive setup to configure API credentials.
zeroeval setup

Environment Variables

Set before importing ZeroEval to configure default behavior.
  • ZEROEVAL_API_KEY (string): API key for authentication. Defaults to ""
  • ZEROEVAL_API_URL (string): API endpoint URL. Defaults to "https://api.zeroeval.com"
  • ZEROEVAL_WORKSPACE_NAME (string): Workspace name. Defaults to "Personal Workspace"
  • ZEROEVAL_SESSION_ID (string): Session ID for grouping traces. Auto-generated by default
  • ZEROEVAL_SESSION_NAME (string): Human-readable session name. Defaults to ""
  • ZEROEVAL_SAMPLING_RATE (float): Sampling rate (0.0-1.0). Defaults to "1.0"
  • ZEROEVAL_DISABLED_INTEGRATIONS (string): Comma-separated list of integrations to disable. Defaults to ""
  • ZEROEVAL_DEBUG (boolean): Enable debug logging. Defaults to "false"
export ZEROEVAL_API_KEY="ze_1234567890abcdef"
export ZEROEVAL_SAMPLING_RATE="0.1"
export ZEROEVAL_DEBUG="true"

Runtime Configuration

Configure after initialization via ze.tracer.configure().
  • flush_interval (float): Flush frequency in seconds. Defaults to 1.0
  • max_spans (int): Buffer size before a forced flush. Defaults to 20
  • collect_code_details (bool): Capture code details in spans. Defaults to True
  • integrations (dict[str, bool]): Enable/disable specific integrations. Defaults to {}
  • sampling_rate (float, optional): Sampling rate (0.0-1.0)
ze.tracer.configure(
    flush_interval=0.5,
    max_spans=100,
    sampling_rate=0.05,
    integrations={"openai": True, "langchain": False}
)

Available Integrations

  • OpenAIIntegration ("openai"): OpenAI client calls
  • GeminiIntegration ("gemini"): Google Gemini calls
  • LangChainIntegration ("langchain"): LangChain components
  • LangGraphIntegration ("langgraph"): LangGraph workflows
  • HttpxIntegration ("httpx"): HTTPX requests
  • VocodeIntegration ("vocode"): Vocode voice SDK
Control integrations via:
  • Environment: ZEROEVAL_DISABLED_INTEGRATIONS="langchain,langgraph"
  • Init: disabled_integrations=["langchain"] or enabled_integrations=["openai"]
  • Runtime: ze.tracer.configure(integrations={"langchain": False})

Configuration Examples

Production

ze.init(
    api_key="your_key",
    sampling_rate=0.05,
    debug=False,
    disabled_integrations=["langchain"]
)

ze.tracer.configure(
    flush_interval=0.5,
    max_spans=100
)

Development

ze.init(
    api_key="your_key",
    debug=True,
    sampling_rate=1.0
)

Memory-Optimized

ze.tracer.configure(
    max_spans=5,
    collect_code_details=False,
    flush_interval=2.0
)