LLM judges can evaluate spans that contain images alongside text. This is useful for browser agents, UI testing, visual QA, and any workflow where you need to assess visual output.
## How it works
- Attach images to spans using SDK methods or structured output data
- Images are uploaded during span ingestion (image data is stripped from the span and stored separately)
- Judges fetch images when evaluating the span and send them to a vision-capable LLM
- Evaluation results appear in the dashboard like any other judge evaluation
The LLM sees both the span’s text data (input/output) and any attached images, giving it full context for evaluation.
Images can be provided as base64-encoded strings, presigned S3 URLs, or CDN image URLs. In all cases, ZeroEval copies the image into its own storage during ingestion, so the source URL only needs to remain valid long enough for the ingest request to complete.
## Attaching images to spans

There are three ways to attach images to spans, depending on your workflow.
### Option 1: SDK helper methods

The SDK provides `add_screenshot()` and `add_image()` methods for attaching images with metadata.
#### Screenshots with viewport context

For browser agents or responsive testing, use `add_screenshot()` to capture different viewports:
```python
import zeroeval as ze

with ze.span(name="homepage_test", tags={"has_screenshots": "true"}) as span:
    # Desktop viewport
    span.add_screenshot(
        base64_data=desktop_base64,
        viewport="desktop",
        width=1920,
        height=1080,
        label="Homepage - Desktop"
    )

    # Mobile viewport
    span.add_screenshot(
        base64_data=mobile_base64,
        viewport="mobile",
        width=375,
        height=812,
        label="Homepage - Mobile"
    )

    span.set_io(
        input_data="Load homepage and capture screenshots",
        output_data="Captured 2 viewport screenshots"
    )
```
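The `desktop_base64` and `mobile_base64` values above are plain base64 strings of the image bytes. A minimal sketch of producing one; the helper name is illustrative and not part of the SDK, and the PNG magic bytes stand in for a real screenshot:

```python
import base64

def to_base64(image_bytes: bytes) -> str:
    """Encode raw image bytes as a base64 string for span attachment."""
    return base64.b64encode(image_bytes).decode("ascii")

# In practice the bytes would come from a screenshot file or a browser
# automation tool; the PNG header is used here as placeholder data.
desktop_base64 = to_base64(b"\x89PNG\r\n\x1a\n")
```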
#### Generic images

For charts, diagrams, or UI component states, use `add_image()`:
```python
with ze.span(name="button_hover_test") as span:
    span.add_image(
        base64_data=before_hover_base64,
        label="Button - Default State"
    )
    span.add_image(
        base64_data=after_hover_base64,
        label="Button - Hover State"
    )

    span.set_io(
        input_data="Test button hover interaction",
        output_data="Button changes color on hover"
    )
```
### Option 2: Image URLs (S3 presigned or CDN)

If your images are already hosted externally, you can pass an HTTPS URL instead of base64 data. ZeroEval downloads the image, validates it, and copies it into its own storage.

Attach URLs via `attributes["attachments"]`, using a `url` key instead of `base64`:
#### Presigned S3 URL
```python
import boto3
import zeroeval as ze

s3 = boto3.client("s3")
presigned_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-bucket", "Key": "images/chart.png"},
    ExpiresIn=300,
)

with ze.span(name="chart_generation") as span:
    span.attributes["attachments"] = [
        {
            "type": "image",
            "url": presigned_url,
            "label": "Monthly Revenue Chart",
        }
    ]
    span.set_io(
        input_data="Generate revenue chart",
        output_data="Chart generated"
    )
```
#### CDN URL
```python
import zeroeval as ze

cdn_url = "https://cdn.example.com/images/product-photo.png"

with ze.span(name="product_image_check") as span:
    span.attributes["attachments"] = [
        {
            "type": "image",
            "url": cdn_url,
            "label": "Product listing photo",
        }
    ]
    span.set_io(
        input_data="Check product image quality",
        output_data="Image attached for evaluation"
    )
```
The URL only needs to stay valid long enough for ZeroEval to download the image during ingestion (typically a few seconds). After that, ZeroEval serves the image from its own storage. CDN URLs must be from a trusted domain configured in the backend.
### Option 3: Structured `output_data`

If your workflow already produces screenshot data as structured output (common with browser automation agents), you can include images directly in the span's `output_data`. ZeroEval automatically detects and extracts images from JSON arrays containing `base64` or `url` fields.
```python
import zeroeval as ze

with ze.span(
    name="screenshot_capture",
    kind="llm",
    tags={"has_screenshots": "true", "screenshot_count": "2"}
) as span:
    # Set input as conversation messages
    span.input_data = [
        {
            "role": "system",
            "content": "You are a screenshot capture service."
        },
        {
            "role": "user",
            "content": "Navigate to the homepage and capture screenshots"
        }
    ]

    # Set output as an array of screenshot objects with base64 data
    span.output_data = [
        {
            "viewport": "mobile",
            "width": 768,
            "height": 1024,
            "base64": mobile_screenshot_base64
        },
        {
            "viewport": "desktop",
            "width": 1920,
            "height": 1080,
            "base64": desktop_screenshot_base64
        }
    ]
```
You can also use URLs (presigned S3 or CDN) in `output_data` by replacing the `base64` key with `url`:
```python
span.output_data = [
    {
        "viewport": "mobile",
        "width": 768,
        "height": 1024,
        "url": mobile_presigned_url
    },
    {
        "viewport": "desktop",
        "width": 1920,
        "height": 1080,
        "url": desktop_presigned_url
    }
]
```
When ZeroEval ingests this span, it:

- Extracts each object with a `base64` or `url` field as an attachment
- Downloads (for URLs) and uploads the images to storage
- Strips the image data from `output_data` to keep the database lean
- Preserves the metadata (`viewport`, `width`, `height`) for display
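The extract-and-strip step can be sketched as follows. This is a conceptual illustration of the documented behavior, not ZeroEval's actual implementation:

```python
from typing import Any

def split_attachments(output_data: list[dict[str, Any]]):
    """Separate image payloads from metadata in structured output_data.

    Objects carrying a 'base64' or 'url' field become attachments;
    only their metadata (viewport, width, height, ...) stays in the
    lean output. Illustrative sketch, not ZeroEval's real code.
    """
    attachments, lean_output = [], []
    for item in output_data:
        if "base64" in item or "url" in item:
            payload = {k: item[k] for k in ("base64", "url") if k in item}
            meta = {k: v for k, v in item.items() if k not in ("base64", "url")}
            attachments.append({"type": "image", **payload, **meta})
            lean_output.append(meta)  # image data stripped, metadata preserved
        else:
            lean_output.append(item)  # non-image objects pass through untouched
    return attachments, lean_output
```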
This approach works well when your browser agent or automation tool already produces structured screenshot output.
All three methods produce the same result: images stored and available for multimodal judge evaluation. Choose whichever fits your workflow best.
## Creating a multimodal judge
Multimodal judges work like regular judges, but with criteria that reference attached images. The judge prompt should describe what to look for in the visual content.
### Example: UI consistency judge

```text
Evaluate whether the UI renders correctly across viewports.

Check for:
- Layout breaks or overlapping elements
- Text that's too small to read on mobile
- Missing or broken images
- Inconsistent spacing between viewports

Score 1 if all viewports render correctly, 0 if there are visual issues.
```
### Example: Brand compliance judge

```text
Check if the page follows brand guidelines.

Look for:
- Correct logo placement and sizing
- Brand colors used consistently
- Proper typography hierarchy
- Appropriate whitespace

Score 1 for full compliance, 0 for violations.
```
### Example: Accessibility judge

```text
Evaluate visual accessibility of the interface.

Check:
- Sufficient color contrast
- Text size readability
- Clear visual hierarchy
- Button/link affordances

Score 1 if accessible, 0 if there are issues. Include specific problems in the reasoning.
```
## Filtering spans for multimodal evaluation

Use tags to identify which spans should be evaluated by your multimodal judge:
```python
# Tag spans that have screenshots
with ze.span(name="browser_test", tags={"has_screenshots": "true"}) as span:
    span.add_screenshot(...)
Then configure your judge to only evaluate spans matching that tag. This prevents the judge from running on text-only spans where multimodal evaluation doesn’t apply.
Whatever the source (base64, presigned S3, or CDN from a trusted domain), images are validated by magic bytes during ingestion. The maximum size is 10MB per image, with up to 5 images per span.
## Viewing images in the dashboard

Screenshots appear in two places:

- Span details view: images show in the Data tab with viewport labels and dimensions
- Judge evaluation modal: when reviewing an evaluation, you'll see the images the judge analyzed
Images display with their labels, viewport type (for screenshots), and dimensions when available.
## Model support
Multimodal evaluation currently uses Gemini models, which support image inputs. When you create a judge, ZeroEval automatically handles the image formatting for the model.
Multimodal evaluation works best with specific, measurable criteria. Vague prompts like “does this look good?” will produce inconsistent results. Be explicit about what visual properties to check.