LLM judges can evaluate spans that contain images alongside text. This is useful for browser agents, UI testing, visual QA, and any workflow where you need to assess visual output.
How it works
- Attach images to spans using SDK methods or structured output data
- Images are uploaded during span ingestion (base64 data is stripped from the span)
- Judges fetch images when evaluating the span and send them to a vision-capable LLM
- Evaluation results appear in the dashboard like any other judge evaluation
The LLM sees both the span’s text data (input/output) and any attached images, giving it full context for evaluation.
Attaching images to spans
There are two ways to attach images to spans, depending on your workflow.
Option 1: SDK helper methods
The SDK provides add_screenshot() and add_image() methods for attaching images with metadata.
Screenshots with viewport context
For browser agents or responsive testing, use add_screenshot() to capture different viewports:
import zeroeval as ze
with ze.span(name="homepage_test", tags={"has_screenshots": "true"}) as span:
# Desktop viewport
span.add_screenshot(
base64_data=desktop_base64,
viewport="desktop",
width=1920,
height=1080,
label="Homepage - Desktop"
)
# Mobile viewport
span.add_screenshot(
base64_data=mobile_base64,
viewport="mobile",
width=375,
height=812,
label="Homepage - Mobile"
)
span.set_io(
input_data="Load homepage and capture screenshots",
output_data="Captured 2 viewport screenshots"
)
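The desktop_base64 and mobile_base64 values above are ordinary base64-encoded image strings; how you produce them depends on your tooling. As a hedged sketch, if you drive the browser with Playwright (an assumption here, not something ZeroEval requires), you could capture them like this:

import base64
from playwright.sync_api import sync_playwright

def capture_base64(url: str, width: int, height: int) -> str:
    # Capture a screenshot at the given viewport size and return it as a base64 string.
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": width, "height": height})
        page.goto(url)
        png_bytes = page.screenshot()
        browser.close()
    return base64.b64encode(png_bytes).decode("ascii")

desktop_base64 = capture_base64("https://example.com", 1920, 1080)
mobile_base64 = capture_base64("https://example.com", 375, 812)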
Generic images
For charts, diagrams, or UI component states, use add_image():
with ze.span(name="button_hover_test") as span:
span.add_image(
base64_data=before_hover_base64,
label="Button - Default State"
)
span.add_image(
base64_data=after_hover_base64,
label="Button - Hover State"
)
span.set_io(
input_data="Test button hover interaction",
output_data="Button changes color on hover"
)
Option 2: Structured output_data
If your workflow already produces screenshot data as structured output (common with browser automation agents), you can include images directly in the span’s output_data. ZeroEval automatically detects and extracts images from JSON arrays containing base64 fields.
import zeroeval as ze
with ze.span(
    name="screenshot_capture",
    kind="llm",
    tags={"has_screenshots": "true", "screenshot_count": "2"}
) as span:
    # Set input as conversation messages
    span.input_data = [
        {
            "role": "system",
            "content": "You are a screenshot capture service."
        },
        {
            "role": "user",
            "content": "Navigate to the homepage and capture screenshots"
        }
    ]

    # Set output as array of screenshot objects with base64 data
    span.output_data = [
        {
            "viewport": "mobile",
            "width": 768,
            "height": 1024,
            "base64": mobile_screenshot_base64
        },
        {
            "viewport": "desktop",
            "width": 1920,
            "height": 1080,
            "base64": desktop_screenshot_base64
        }
    ]
When ZeroEval ingests this span, it:
- Extracts each object with a base64 field as an attachment
- Uploads the images to storage
- Strips the base64 data from output_data to keep the database lean
- Preserves the metadata (viewport, width, height) for display
This approach works well when your browser agent or automation tool already produces structured screenshot output.
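To make the effect concrete, here is a rough sketch of what the span's stored output_data might look like after ingestion. The exact stored schema is an assumption, but per the steps above the base64 payloads are removed and the metadata remains:

# Illustrative sketch only; the exact stored shape may differ.
# The base64 fields were extracted and uploaded as attachments,
# while the viewport metadata is kept for display.
stored_output_data = [
    {"viewport": "mobile", "width": 768, "height": 1024},
    {"viewport": "desktop", "width": 1920, "height": 1080}
]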
Both methods produce the same result: images stored and available for multimodal judge evaluation. Choose whichever fits your workflow better.
Creating a multimodal judge
Multimodal judges work like regular judges, but with criteria that reference attached images. The judge prompt should describe what to look for in the visual content.
Example: UI consistency judge
Evaluate whether the UI renders correctly across viewports.
Check for:
- Layout breaks or overlapping elements
- Text that's too small to read on mobile
- Missing or broken images
- Inconsistent spacing between viewports
Score 1 if all viewports render correctly, 0 if there are visual issues.
Example: Brand compliance judge
Check if the page follows brand guidelines.
Look for:
- Correct logo placement and sizing
- Brand colors used consistently
- Proper typography hierarchy
- Appropriate whitespace
Score 1 for full compliance, 0 for violations.
Example: Accessibility judge
Evaluate visual accessibility of the interface.
Check:
- Sufficient color contrast
- Text size readability
- Clear visual hierarchy
- Button/link affordances
Score 1 if accessible, 0 if there are issues. Include specific problems in the reasoning.
Filtering spans for multimodal evaluation
Use tags to identify which spans should be evaluated by your multimodal judge:
# Tag spans that have screenshots
with ze.span(name="browser_test", tags={"has_screenshots": "true"}) as span:
    span.add_screenshot(...)
Then configure your judge to only evaluate spans matching that tag. This prevents the judge from running on text-only spans where multimodal evaluation doesn’t apply.
Images are validated during ingestion. The maximum size is 10MB per image, with up to 5 images per span.
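If you generate images programmatically, a quick client-side check can catch oversized payloads before ingestion rejects them. This helper is not part of the ZeroEval SDK, just a sketch based on the limits above:

import base64

MAX_IMAGE_BYTES = 10 * 1024 * 1024  # 10MB per image
MAX_IMAGES_PER_SPAN = 5

def within_limits(base64_images: list[str]) -> bool:
    # Check the decoded size, since base64 inflates payloads by roughly a third.
    if len(base64_images) > MAX_IMAGES_PER_SPAN:
        return False
    return all(len(base64.b64decode(img)) <= MAX_IMAGE_BYTES for img in base64_images)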
Viewing images in the dashboard
Screenshots appear in two places:
- Span details view - Images show in the Data tab with viewport labels and dimensions
- Judge evaluation modal - When reviewing an evaluation, you’ll see the images the judge analyzed
Images display with their labels, viewport type (for screenshots), and dimensions when available.
Model support
Multimodal evaluation currently uses Gemini models, which support image inputs. When you create a judge, ZeroEval automatically handles the image formatting for the model.
Multimodal evaluation works best with specific, measurable criteria. Vague prompts like “does this look good?” will produce inconsistent results. Be explicit about what visual properties to check.