Evaluation and Testing

This page documents ADK's evaluation and testing infrastructure for assessing agent quality and performance.

Overview

ADK provides a comprehensive evaluation and testing system with three main components:

  1. Evaluation Framework: A two-phase system (inference generation + metric evaluation) for assessing agent quality against test cases
  2. Conformance Testing: Record/replay capabilities for deterministic testing of agent behavior
  3. Testing Utilities: Helper classes and patterns for writing unit and integration tests

The evaluation framework separates inference generation from metric evaluation, enabling reuse of inference results across different metric configurations and supporting both synchronous and asynchronous metrics.

Installation Requirements

Evaluation features require optional dependencies. Install them with:
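
Assuming the optional extra is named eval (matching the pyproject.toml group referenced below), the command is:

```bash
pip install "google-adk[eval]"
```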

This installs:

  • google-cloud-aiplatform[evaluation] - For Vertex AI evaluation metrics
  • pandas - For data manipulation
  • rouge-score - For text similarity metrics
  • tabulate - For formatted result display

Sources: pyproject.toml106-113


Evaluation Framework Architecture

The evaluation system implements a two-phase pipeline that decouples inference generation from metric evaluation.

Two-Phase Evaluation Pipeline

Sources: src/google/adk/evaluation/local_eval_service.py65-400 src/google/adk/evaluation/base_eval_service.py33-202

LocalEvalService

The LocalEvalService class orchestrates both phases of evaluation. It runs agents against test cases and evaluates the results against specified metrics.

| Method | Purpose | Phase |
|---|---|---|
| perform_inference() | Generates agent responses for eval cases | Phase 1 |
| evaluate() | Compares actual vs expected responses using metrics | Phase 2 |
| _perform_inference_sigle_eval_item() | Runs inference for a single eval case | Phase 1 (internal) |
| _evaluate_single_inference_result() | Evaluates a single inference result | Phase 2 (internal) |
| _evaluate_metric() | Evaluates a specific metric | Phase 2 (internal) |

Key Features:

  • Parallelism Control: Configurable via InferenceConfig.parallelism and EvaluateConfig.parallelism (default: 4)
  • Session Isolation: Creates fresh sessions with prefix ___eval___session___ for each eval case
  • Error Handling: Captures failures in InferenceResult.status and error_message without affecting other evaluations
  • Result Persistence: Optionally saves results via EvalSetResultsManager
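
Putting the two phases together, a minimal sketch of driving LocalEvalService directly is shown below. The constructor arguments and the request/config field names are assumptions drawn from base_eval_service.py and eval_metrics.py and may differ between releases.

```python
from google.adk.evaluation.base_eval_service import EvaluateConfig
from google.adk.evaluation.base_eval_service import EvaluateRequest
from google.adk.evaluation.base_eval_service import InferenceConfig
from google.adk.evaluation.base_eval_service import InferenceRequest
from google.adk.evaluation.eval_metrics import EvalMetric
from google.adk.evaluation.local_eval_service import LocalEvalService


async def run_eval(root_agent, eval_sets_manager):
    service = LocalEvalService(
        root_agent=root_agent,
        eval_sets_manager=eval_sets_manager,
    )

    # Phase 1: generate agent responses for every eval case in the eval set.
    inference_request = InferenceRequest(
        app_name="my_agent",
        eval_set_id="my_eval_set",
        inference_config=InferenceConfig(parallelism=4),
    )
    inference_results = [
        result async for result in service.perform_inference(inference_request)
    ]

    # Phase 2: score the collected inference results with the chosen metrics.
    evaluate_request = EvaluateRequest(
        inference_results=inference_results,
        evaluate_config=EvaluateConfig(
            eval_metrics=[EvalMetric(metric_name="response_match_score", threshold=0.8)],
            parallelism=4,
        ),
    )
    async for eval_case_result in service.evaluate(evaluate_request):
        print(eval_case_result)


# Run with: asyncio.run(run_eval(my_root_agent, my_eval_sets_manager))
```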

Sources: src/google/adk/evaluation/local_eval_service.py64-90 src/google/adk/evaluation/local_eval_service.py91-139 src/google/adk/evaluation/local_eval_service.py140-178

Data Models

Sources: src/google/adk/evaluation/base_eval_service.py85-160 src/google/adk/cli/adk_web_server.py196-277


Evaluation Storage

Storage Managers

ADK provides manager classes for persisting evaluation data:

| Manager Class | Purpose | Storage Options |
|---|---|---|
| EvalSetsManager | Manages evaluation test sets | Local filesystem, GCS |
| EvalSetResultsManager | Manages evaluation results | Local filesystem, GCS |

Implementations:

  • Local: LocalEvalSetsManager, LocalEvalSetResultsManager (file-based JSON storage)
  • Cloud: GCS-backed implementations for team collaboration
  • In-Memory: InMemoryEvalSetsManager for testing

Sources: src/google/adk/cli/adk_web_server.py410-413

Eval Set File Structure

Eval sets are stored as JSON files with extension .evalset.json:

agents_dir/
  agent_name/
    eval_set_1.evalset.json
    eval_set_2.evalset.json
    ...
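
Each file contains a JSON-serialized eval set. The skeleton below is illustrative and abbreviates the real schema; field names should be checked against the EvalSet, EvalCase, and Invocation models.

```json
{
  "eval_set_id": "eval_set_1",
  "name": "eval_set_1",
  "eval_cases": [
    {
      "eval_case_id": "case_1",
      "conversation": [
        {
          "invocation_id": "inv-1",
          "user_content": {"role": "user", "parts": [{"text": "What is the weather today?"}]},
          "final_response": {"role": "model", "parts": [{"text": "It is sunny."}]}
        }
      ],
      "session_input": {"app_name": "agent_name", "user_id": "test_user", "state": {}}
    }
  ]
}
```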

Sources: src/google/adk/cli/adk_web_server.py99


Running Evaluations

Via CLI Commands

ADK provides CLI commands for managing evaluations:

Creating Evaluation Sets

Sources: src/google/adk/cli/cli_tools_click.py714-838

Running Evaluations

The CLI supports:

  • Multiple eval sets: Space-separated list of eval set files or IDs
  • Selective execution: Use : syntax to specify specific eval cases
  • No mixed sources: File paths and eval set IDs cannot be mixed in the same command
  • Parallelism: Configurable via InferenceConfig.parallelism and EvaluateConfig.parallelism
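
For example (run adk eval --help for the full set of options):

```bash
# Run every eval case in two eval sets against an agent
adk eval path/to/my_agent first.evalset.json second.evalset.json

# Run only selected eval cases from one eval set, using the ":" syntax
adk eval path/to/my_agent first.evalset.json:case_1,case_2
```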

Sources: src/google/adk/cli/cli_tools_click.py466-706 src/google/adk/cli/cli_eval.py87-158

Via REST API

The FastAPI server exposes evaluation endpoints under the /eval-sets and /eval-results paths.

Sources: src/google/adk/cli/adk_web_server.py837-1052 src/google/adk/cli/adk_web_server.py1069-1154

REST API Endpoints

| Endpoint | Method | Purpose |
|---|---|---|
| /apps/{app}/eval-sets | POST | Create evaluation set |
| /apps/{app}/eval-sets | GET | List evaluation sets |
| /apps/{app}/eval-sets/{id}/add-session | POST | Add session as eval case |
| /apps/{app}/eval-sets/{id}/eval-cases/{case_id} | GET | Get eval case details |
| /apps/{app}/eval-sets/{id}/eval-cases/{case_id} | PUT | Update eval case |
| /apps/{app}/eval-sets/{id}/eval-cases/{case_id} | DELETE | Delete eval case |
| /apps/{app}/eval-sets/{id}/run | POST | Run evaluation |
| /apps/{app}/eval_results | GET | List evaluation results |
| /apps/{app}/eval_results/{id} | GET | Get evaluation result details |
| /apps/{app}/metrics-info | GET | List available metrics |
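
A minimal sketch of calling two of these endpoints from Python with httpx, assuming a server started locally (for example via adk api_server) on port 8000:

```python
import httpx

BASE_URL = "http://localhost:8000"  # local server address; adjust as needed
APP_NAME = "my_agent"

with httpx.Client(base_url=BASE_URL) as client:
    # List the evaluation sets defined for the app.
    eval_sets = client.get(f"/apps/{APP_NAME}/eval-sets").json()

    # Discover which metrics are available for evaluation runs.
    metrics_info = client.get(f"/apps/{APP_NAME}/metrics-info").json()

    print(eval_sets, metrics_info)
```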

Sources: src/google/adk/cli/adk_web_server.py837-1193

Creating Eval Cases from Sessions

A common workflow is to convert interactive sessions into eval cases:

  1. User creates a session and interacts with the agent
  2. User calls POST /apps/{app}/eval-sets/{id}/add-session with session_id
  3. System converts session events into Invocation objects
  4. System extracts initial state from agent configuration
  5. New EvalCase is added to the eval set

Sources: src/google/adk/cli/adk_web_server.py905-950


Evaluation Metrics

Metric System Architecture

Sources: src/google/adk/evaluation/eval_metrics.py1-350 src/google/adk/evaluation/local_eval_service.py308-334

Metric Evaluator Registry

The MetricEvaluatorRegistry maps metric names to evaluator implementations. ADK uses DEFAULT_METRIC_EVALUATOR_REGISTRY which includes built-in metrics.

Registration Process: Built-in evaluators are registered in DEFAULT_METRIC_EVALUATOR_REGISTRY under their metric names; custom evaluators are added to the same registry (see Custom Metrics below).

Evaluator Interface:

  • BaseMetricEvaluator.evaluate_invocations(actual, expected, criterion) -> EvalMetricResult
  • Synchronous metrics: Regular method
  • Asynchronous metrics: Async method for LLM-based evaluation

Sources: src/google/adk/evaluation/local_eval_service.py80-86 src/google/adk/evaluation/local_eval_service.py308-334

Built-in Metrics

| Metric Name | Type | Description | Evaluator Class |
|---|---|---|---|
| tool_trajectory_avg_score | Sync | Compares tool selection accuracy using rubric | RubricBasedToolUseEvaluator |
| hallucinations_v1 | Sync | Checks response grounding against context | HallucinationsEvaluator |
| final_response_quality | Async | LLM-as-judge evaluation of response quality | ResponseQualityEvaluator |
| safety_v1 | Sync | Evaluates content safety | SafetyMetric |
| response_match_score | Sync | ROUGE-based text similarity | ResponseMatchMetric |

Default Evaluation Criteria:
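
The defaults defined in cli_eval.py are commonly cited as the following; treat the exact values as release-dependent:

```json
{
  "tool_trajectory_avg_score": 1.0,
  "response_match_score": 0.8
}
```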

Sources: src/google/adk/cli/cli_eval.py44-56 src/google/adk/evaluation/eval_metrics.py1-350

Metric Results Structure

EvalMetricResult:

  • metric_name: Name of the evaluated metric
  • score: Numeric score (0.0-1.0 typical range)
  • eval_status: EvalStatus.PASSED, EvalStatus.FAILED, or EvalStatus.NOT_EVALUATED
  • threshold: Threshold used to determine pass/fail
  • criterion: Optional criterion used for evaluation
  • details: Additional details like rubric scores

Per-Invocation vs Overall Results:

  • Per-Invocation: EvalMetricResultPerInvocation contains scores for each conversation turn
  • Overall: EvalCaseResult.overall_eval_metric_results aggregates scores across all invocations
  • Final Status: Determined by comparing overall score to threshold

Sources: src/google/adk/evaluation/eval_metrics.py1-350 src/google/adk/cli/adk_web_server.py232-253

Custom Metrics

To implement a custom metric:

  1. Create an evaluator class implementing BaseMetricEvaluator
  2. Implement evaluate_invocations() method
  3. Register with the metric registry
  4. Use in eval config
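
A hedged sketch of steps 1 and 2; the import paths, the EvalMetricResult constructor arguments, and the registry call used in step 3 are assumptions to verify against your ADK version.

```python
from google.adk.evaluation.eval_metrics import EvalMetricResult  # path: assumption
from google.adk.evaluation.eval_metrics import EvalStatus        # path: assumption


def _response_text(invocation):
    # Hypothetical accessor: pull text parts out of an invocation's final response.
    response = getattr(invocation, "final_response", None)
    if response is None:
        return ""
    parts = getattr(response, "parts", None) or []
    return " ".join(getattr(part, "text", "") or "" for part in parts)


class KeywordPresenceEvaluator:
    """Toy metric: fraction of expected keywords present in the final responses.

    In practice this would subclass the evaluator base class named above
    (BaseMetricEvaluator); here it simply provides the same method shape.
    """

    def __init__(self, keywords, threshold=0.5):
        self._keywords = keywords
        self._threshold = threshold

    def evaluate_invocations(self, actual, expected, criterion=None):
        text = " ".join(_response_text(inv) for inv in actual).lower()
        hits = sum(1 for kw in self._keywords if kw.lower() in text)
        score = hits / len(self._keywords) if self._keywords else 0.0
        threshold = criterion if criterion is not None else self._threshold
        return EvalMetricResult(
            metric_name="keyword_presence",  # hypothetical custom metric name
            score=score,
            threshold=threshold,
            eval_status=EvalStatus.PASSED if score >= threshold else EvalStatus.FAILED,
        )


# Step 3: register the evaluator under its metric name with
# DEFAULT_METRIC_EVALUATOR_REGISTRY; the registration method's exact name and
# signature should be taken from the registry module.
```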

Sources: src/google/adk/evaluation/eval_metrics.py1-350

Listing Available Metrics

The /apps/{app}/metrics-info endpoint returns metadata about all registered metrics:

Sources: src/google/adk/cli/adk_web_server.py1175-1193


Conformance Testing

The conformance testing system provides deterministic testing through record/replay capabilities. This enables regression testing and consistent test execution across environments.

CLI Commands

Directory Structure:

tests/
  category/
    test_name/
      spec.yaml                    # Test specification (TestCaseInput or TestCase)
      generated-recordings.yaml    # Recorded interactions (replay mode)
      generated-session.yaml       # Session data (replay mode)

Test Modes:

  • replay: Verifies agent interactions match previously recorded behaviors exactly
  • live: Runs evaluation-based verification (compares actual vs expected using metrics)

Sources: src/google/adk/cli/cli_tools_click.py124-275

AdkWebServerClient

AdkWebServerClient is an HTTP client for interacting with the ADK web server in conformance tests. It supports both manual lifecycle management and automatic cleanup via async context manager.

Sources: src/google/adk/cli/conformance/adk_web_server_client.py37-268

Record and Replay Modes

The conformance system injects configuration into session state to control agent behavior:

Record Mode:

The client adds record-mode configuration to the session's state_delta so that the agent's interactions are captured for later replay.

Replay Mode:

The client adds replay-mode configuration to state_delta so that previously recorded interactions are replayed instead of new live calls being made.

Sources: src/google/adk/cli/conformance/adk_web_server_client.py209-254

Context Manager Usage

The recommended pattern is to use AdkWebServerClient as an async context manager:
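
A minimal sketch, assuming a locally running ADK web server; the constructor argument shown here is an assumption, and the calls to make inside the block depend on which client methods your test needs (see adk_web_server_client.py):

```python
import asyncio

from google.adk.cli.conformance.adk_web_server_client import AdkWebServerClient


async def main():
    # Entering the context opens the underlying HTTP client; leaving it
    # performs the automatic cleanup described above.
    async with AdkWebServerClient(base_url="http://localhost:8000") as client:
        ...  # create sessions, run the agent, fetch recordings, etc.


asyncio.run(main())
```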

Sources: src/google/adk/cli/conformance/adk_web_server_client.py93-103


Testing Best Practices

Mock Patterns

The test suite demonstrates effective mocking patterns for testing ADK components:

Mocking the Runner:
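
A sketch of the pattern, assuming the Runner lives at google.adk.runners.Runner; run_async is patched so no real model call happens:

```python
from unittest import mock

import pytest


@pytest.fixture
def runner_events():
    """Events the stubbed Runner will emit; tests append pre-built Event objects."""
    return []


@pytest.fixture
def mock_runner(runner_events):
    async def _stub_run_async(self, *args, **kwargs):
        # Yield canned events instead of invoking a real model.
        for event in runner_events:
            yield event

    with mock.patch("google.adk.runners.Runner.run_async", _stub_run_async):
        yield
```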

Mocking Services:
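
For services, the in-memory implementations shipped with ADK are usually sufficient as drop-in test doubles:

```python
import pytest

from google.adk.artifacts import InMemoryArtifactService
from google.adk.sessions import InMemorySessionService


@pytest.fixture
def mock_session_service():
    return InMemorySessionService()


@pytest.fixture
def mock_artifact_service():
    return InMemoryArtifactService()
```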

Sources: tests/unittests/cli/test_fast_api.py162-289

Test Fixture Strategies

The test suite uses pytest fixtures to create reusable test infrastructure:

| Fixture | Purpose |
|---|---|
| mock_agent_loader | Loads test agents without filesystem access |
| mock_session_service | In-memory session storage for tests |
| mock_artifact_service | In-memory artifact storage for tests |
| mock_eval_sets_manager | In-memory eval set management |
| test_app | Configured TestClient for FastAPI app |
| create_test_session | Creates a test session with known IDs |
| create_test_eval_set | Creates a test eval set with sample data |

Fixture Composition:
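
A schematic example of composition: the test_app fixture depends on the mock fixtures so they are in place before the app is built. The get_fast_api_app arguments are assumptions, and wiring the in-memory services into the app is elided here:

```python
import pytest
from fastapi.testclient import TestClient


@pytest.fixture
def test_app(mock_runner, mock_session_service, mock_artifact_service):
    from google.adk.cli.fast_api import get_fast_api_app

    # mock_runner patches Runner.run_async for the lifetime of the test;
    # injecting the in-memory services depends on the server's constructor.
    app = get_fast_api_app(agents_dir="tests/fixtures/agents", web=False)
    return TestClient(app)
```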

Sources: tests/unittests/cli/test_fast_api.py161-452

TestClient Usage

FastAPI's TestClient enables synchronous testing of async endpoints without running a server:
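
For example, a plain request against one of the eval endpoints listed above:

```python
def test_list_eval_sets(test_app):
    # TestClient drives the async FastAPI app synchronously, no server needed.
    response = test_app.get("/apps/my_agent/eval-sets")
    assert response.status_code == 200
```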

Testing Streaming Endpoints:
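
A hedged sketch of streaming over the server-sent-events run endpoint; the endpoint path and payload field names are assumptions, and the session is assumed to exist already:

```python
def test_run_sse_streaming(test_app):
    payload = {
        "appName": "my_agent",
        "userId": "test_user",
        "sessionId": "test_session",
        "newMessage": {"role": "user", "parts": [{"text": "hello"}]},
    }
    with test_app.stream("POST", "/run_sse", json=payload) as response:
        assert response.status_code == 200
        for line in response.iter_lines():
            if line.startswith("data:"):
                ...  # parse each server-sent event as it arrives
```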

Sources: tests/unittests/cli/test_fast_api.py630-824

Parametrized Testing

Use pytest's parametrize decorator for comprehensive coverage:
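
A self-contained illustration using the score/threshold comparison described in the metrics section:

```python
import pytest


@pytest.mark.parametrize(
    "score, threshold, expected_pass",
    [
        (0.9, 0.8, True),
        (0.8, 0.8, True),
        (0.5, 0.8, False),
    ],
)
def test_threshold_comparison(score, threshold, expected_pass):
    # pytest reports each tuple as its own test case.
    assert (score >= threshold) == expected_pass
```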

Sources: tests/unittests/telemetry/test_google_cloud.py24-61


Telemetry Integration

The evaluation and testing infrastructure integrates with ADK's telemetry system to provide observability.

Trace Collection During Tests

The AdkWebServer sets up internal span exporters to capture traces during evaluation:

Span Processors:

  • ApiServerSpanExporter: Captures call_llm, send_data, and execute_tool spans, indexed by event ID
  • InMemoryExporter: Captures all spans for a session, indexed by session ID

Debug Endpoints:

  • GET /debug/trace/{event_id}: Returns trace attributes for a specific event
  • GET /debug/trace/session/{session_id}: Returns all spans for a session
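
These endpoints can be exercised from tests through the same TestClient, for example:

```python
def test_session_trace_available(test_app):
    session_id = "test_session"  # a session created earlier in the test
    response = test_app.get(f"/debug/trace/session/{session_id}")
    assert response.status_code == 200
```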

Sources: src/google/adk/cli/adk_web_server.py105-163 src/google/adk/cli/adk_web_server.py282-301 src/google/adk/cli/adk_web_server.py668-691

Telemetry Configuration

The _setup_telemetry function supports multiple telemetry backends:

| Configuration | Behavior |
|---|---|
| otel_to_cloud=True | Enables Cloud Trace, Cloud Monitoring, Cloud Logging |
| OTEL env vars set | Uses generic OTLP exporters to configured endpoints |
| Default | Local-only telemetry with internal exporters |

Environment Variables:

  • OTEL_EXPORTER_OTLP_ENDPOINT
  • OTEL_EXPORTER_OTLP_TRACES_ENDPOINT
  • OTEL_EXPORTER_OTLP_METRICS_ENDPOINT
  • OTEL_EXPORTER_OTLP_LOGS_ENDPOINT

Sources: src/google/adk/cli/adk_web_server.py282-387 src/google/adk/telemetry/setup.py50-126 src/google/adk/telemetry/google_cloud.py45-96


Complete Testing Example

Here's a complete example demonstrating evaluation creation, execution, and result retrieval:
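
The sketch below exercises the REST endpoints from the table above through the test_app fixture. Request bodies flagged as assumptions are illustrative only and should be checked against the current API:

```python
def test_eval_roundtrip(test_app):
    app = "my_agent"

    # 1. Create an eval set (request body shape: assumption).
    resp = test_app.post(f"/apps/{app}/eval-sets", json={"eval_set_id": "smoke_set"})
    assert resp.status_code == 200

    # 2. Convert an existing session into an eval case (eval_id field: assumption).
    resp = test_app.post(
        f"/apps/{app}/eval-sets/smoke_set/add-session",
        json={"eval_id": "case_1", "session_id": "existing_session"},
    )
    assert resp.status_code == 200

    # 3. Run the evaluation (metric selection body: assumption).
    resp = test_app.post(
        f"/apps/{app}/eval-sets/smoke_set/run",
        json={"eval_metrics": [{"metric_name": "response_match_score", "threshold": 0.8}]},
    )
    assert resp.status_code == 200

    # 4. List results and confirm at least one was persisted.
    results = test_app.get(f"/apps/{app}/eval_results").json()
    assert results
```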

Sources: tests/unittests/cli/conformance/test_adk_web_server_client.py47-249 tests/unittests/cli/test_fast_api.py862-949