LLM judges can evaluate spans that contain images alongside text. This is useful for browser agents, UI testing, visual QA, and any workflow where you need to assess visual output.
How it works
- Attach images to spans using SDK methods or structured output data
- Images are uploaded during span ingestion (base64 data is stripped from the span)
- Judges fetch images when evaluating the span and send them to a vision-capable LLM
- Evaluation results appear in the dashboard like any other judge evaluation
The LLM sees both the span’s text data (input/output) and any attached images, giving it full context for evaluation.
Attaching images to spans
There are two ways to attach images to spans, depending on your workflow.
Option 1: SDK helper methods
The SDK provides add_screenshot() and add_image() methods for attaching images with metadata.
Screenshots with viewport context
For browser agents or responsive testing, use add_screenshot() to capture different viewports:
import zeroeval as ze
with ze.span(name="homepage_test", tags={"has_screenshots": "true"}) as span:
# Desktop viewport
span.add_screenshot(
base64_data=desktop_base64,
viewport="desktop",
width=1920,
height=1080,
label="Homepage - Desktop"
)
# Mobile viewport
span.add_screenshot(
base64_data=mobile_base64,
viewport="mobile",
width=375,
height=812,
label="Homepage - Mobile"
)
span.set_io(
input_data="Load homepage and capture screenshots",
output_data="Captured 2 viewport screenshots"
)
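The desktop_base64 and mobile_base64 values above are ordinary base64-encoded image strings; how you produce them depends on your tooling. As a hedged sketch, if you drive the browser with Playwright (an assumption here, not something ZeroEval requires), you could capture them like this:

import base64
from playwright.sync_api import sync_playwright

def capture_base64(url: str, width: int, height: int) -> str:
    # Capture a screenshot at the given viewport size and return it as a base64 string.
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": width, "height": height})
        page.goto(url)
        png_bytes = page.screenshot()
        browser.close()
    return base64.b64encode(png_bytes).decode("ascii")

desktop_base64 = capture_base64("https://example.com", 1920, 1080)
mobile_base64 = capture_base64("https://example.com", 375, 812)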
Generic images
For charts, diagrams, or UI component states, use add_image():
with ze.span(name="button_hover_test") as span:
span.add_image(
base64_data=before_hover_base64,
label="Button - Default State"
)
span.add_image(
base64_data=after_hover_base64,
label="Button - Hover State"
)
span.set_io(
input_data="Test button hover interaction",
output_data="Button changes color on hover"
)
Option 2: Structured output_data
If your workflow already produces screenshot data as structured output (common with browser automation agents), you can include images directly in the span’s output_data. ZeroEval automatically detects and extracts images from JSON arrays containing base64 fields.
import zeroeval as ze
with ze.span(
    name="screenshot_capture",
    kind="llm",
    tags={"has_screenshots": "true", "screenshot_count": "2"}
) as span:
    # Set input as conversation messages
    span.input_data = [
        {
            "role": "system",
            "content": "You are a screenshot capture service."
        },
        {
            "role": "user",
            "content": "Navigate to the homepage and capture screenshots"
        }
    ]

    # Set output as array of screenshot objects with base64 data
    span.output_data = [
        {
            "viewport": "mobile",
            "width": 768,
            "height": 1024,
            "base64": mobile_screenshot_base64
        },
        {
            "viewport": "desktop",
            "width": 1920,
            "height": 1080,
            "base64": desktop_screenshot_base64
        }
    ]
When ZeroEval ingests this span, it:
- Extracts each object with a base64 field as an attachment
- Uploads the images to storage
- Strips the base64 data from output_data to keep the database lean
- Preserves the metadata (viewport, width, height) for display
This approach works well when your browser agent or automation tool already produces structured screenshot output.
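To make the effect concrete, here is a rough sketch of what the span's stored output_data might look like after ingestion. The exact stored schema is an assumption, but per the steps above the base64 payloads are removed and the metadata remains:

# Illustrative sketch only; the exact stored shape may differ.
# The base64 fields were extracted and uploaded as attachments,
# while the viewport metadata is kept for display.
stored_output_data = [
    {"viewport": "mobile", "width": 768, "height": 1024},
    {"viewport": "desktop", "width": 1920, "height": 1080}
]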
Both methods produce the same result: images stored and available for multimodal judge evaluation. Choose whichever fits your workflow better.
Creating a multimodal judge
Multimodal judges work like regular judges, but with criteria that reference attached images. The judge prompt should describe what to look for in the visual content.
Example: UI consistency judge
Evaluate whether the UI renders correctly across viewports.
Check for:
- Layout breaks or overlapping elements
- Text that's too small to read on mobile
- Missing or broken images
- Inconsistent spacing between viewports
Score 1 if all viewports render correctly, 0 if there are visual issues.
Example: Brand compliance judge
Check if the page follows brand guidelines.
Look for:
- Correct logo placement and sizing
- Brand colors used consistently
- Proper typography hierarchy
- Appropriate whitespace
Score 1 for full compliance, 0 for violations.
Example: Accessibility judge
Evaluate visual accessibility of the interface.
Check:
- Sufficient color contrast
- Text size readability
- Clear visual hierarchy
- Button/link affordances
Score 1 if accessible, 0 if there are issues. Include specific problems in the reasoning.
Filtering spans for multimodal evaluation
Use tags to identify which spans should be evaluated by your multimodal judge:
# Tag spans that have screenshots
with ze.span(name="browser_test", tags={"has_screenshots": "true"}) as span:
    span.add_screenshot(...)
Then configure your judge to only evaluate spans matching that tag. This prevents the judge from running on text-only spans where multimodal evaluation doesn’t apply.
Images are validated during ingestion. The maximum size is 10MB per image, with up to 5 images per span.
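If you generate images programmatically, a quick client-side check can catch oversized payloads before ingestion rejects them. This helper is not part of the ZeroEval SDK, just a sketch based on the limits above:

import base64

MAX_IMAGE_BYTES = 10 * 1024 * 1024  # 10MB per image
MAX_IMAGES_PER_SPAN = 5

def within_limits(base64_images: list[str]) -> bool:
    # Check the decoded size, since base64 inflates payloads by roughly a third.
    if len(base64_images) > MAX_IMAGES_PER_SPAN:
        return False
    return all(len(base64.b64decode(img)) <= MAX_IMAGE_BYTES for img in base64_images)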
Viewing images in the dashboard
Screenshots appear in two places:
- Span details view - Images show in the Data tab with viewport labels and dimensions
- Judge evaluation modal - When reviewing an evaluation, you’ll see the images the judge analyzed
Images display with their labels, viewport type (for screenshots), and dimensions when available.
Model support
Multimodal evaluation currently uses Gemini models, which support image inputs. When you create a judge, ZeroEval automatically handles the image formatting for the model.
Multimodal evaluation works best with specific, measurable criteria. Vague prompts like “does this look good?” will produce inconsistent results. Be explicit about what visual properties to check.