Evaluation and Testing

This page documents ADK's evaluation and testing infrastructure for assessing agent quality and performance.

Overview

ADK provides a comprehensive evaluation and testing system with three main components:

  1. Evaluation Framework: A two-phase system (inference generation + metric evaluation) for assessing agent quality against test cases
  2. Conformance Testing: Record/replay capabilities for deterministic testing of agent behavior
  3. Testing Utilities: Helper classes and patterns for writing unit and integration tests

The evaluation framework separates inference generation from metric evaluation, enabling reuse of inference results across different metric configurations and supporting both synchronous and asynchronous metrics.

Installation Requirements

Evaluation features require optional dependencies. Install them with:
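
Assuming the optional extra is named eval (matching the pyproject.toml group referenced below), the command is:

```bash
pip install "google-adk[eval]"
```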

This installs:

  • google-cloud-aiplatform[evaluation] - For Vertex AI evaluation metrics
  • pandas - For data manipulation
  • rouge-score - For text similarity metrics
  • tabulate - For formatted result display

Sources: pyproject.toml106-113


Evaluation Framework Architecture

The evaluation system implements a two-phase pipeline that decouples inference generation from metric evaluation.

Two-Phase Evaluation Pipeline

Sources: src/google/adk/evaluation/local_eval_service.py65-400 src/google/adk/evaluation/base_eval_service.py33-202

LocalEvalService

The LocalEvalService class orchestrates both phases of evaluation. It runs agents against test cases and evaluates the results against specified metrics.

| Method | Purpose | Phase |
|---|---|---|
| perform_inference() | Generates agent responses for eval cases | Phase 1 |
| evaluate() | Compares actual vs expected responses using metrics | Phase 2 |
| _perform_inference_sigle_eval_item() | Runs inference for a single eval case | Phase 1 (internal) |
| _evaluate_single_inference_result() | Evaluates a single inference result | Phase 2 (internal) |
| _evaluate_metric() | Evaluates a specific metric | Phase 2 (internal) |

Key Features:

  • Parallelism Control: Configurable via InferenceConfig.parallelism and EvaluateConfig.parallelism (default: 4)
  • Session Isolation: Creates fresh sessions with prefix ___eval___session___ for each eval case
  • Error Handling: Captures failures in InferenceResult.status and error_message without affecting other evaluations
  • Result Persistence: Optionally saves results via EvalSetResultsManager
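
Putting the two phases together, a minimal sketch of driving LocalEvalService directly is shown below. The constructor arguments and the request/config field names are assumptions drawn from base_eval_service.py and eval_metrics.py and may differ between releases.

```python
from google.adk.evaluation.base_eval_service import EvaluateConfig
from google.adk.evaluation.base_eval_service import EvaluateRequest
from google.adk.evaluation.base_eval_service import InferenceConfig
from google.adk.evaluation.base_eval_service import InferenceRequest
from google.adk.evaluation.eval_metrics import EvalMetric
from google.adk.evaluation.local_eval_service import LocalEvalService


async def run_eval(root_agent, eval_sets_manager):
    service = LocalEvalService(
        root_agent=root_agent,
        eval_sets_manager=eval_sets_manager,
    )

    # Phase 1: generate agent responses for every eval case in the eval set.
    inference_request = InferenceRequest(
        app_name="my_agent",
        eval_set_id="my_eval_set",
        inference_config=InferenceConfig(parallelism=4),
    )
    inference_results = [
        result async for result in service.perform_inference(inference_request)
    ]

    # Phase 2: score the collected inference results with the chosen metrics.
    evaluate_request = EvaluateRequest(
        inference_results=inference_results,
        evaluate_config=EvaluateConfig(
            eval_metrics=[EvalMetric(metric_name="response_match_score", threshold=0.8)],
            parallelism=4,
        ),
    )
    async for eval_case_result in service.evaluate(evaluate_request):
        print(eval_case_result)


# Run with: asyncio.run(run_eval(my_root_agent, my_eval_sets_manager))
```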

Sources: src/google/adk/evaluation/local_eval_service.py64-90 src/google/adk/evaluation/local_eval_service.py91-139 src/google/adk/evaluation/local_eval_service.py140-178

Data Models

Sources: src/google/adk/evaluation/base_eval_service.py85-160 src/google/adk/cli/adk_web_server.py196-277


Evaluation Storage

Storage Managers

ADK provides manager classes for persisting evaluation data:

| Manager Class | Purpose | Storage Options |
|---|---|---|
| EvalSetsManager | Manages evaluation test sets | Local filesystem, GCS |
| EvalSetResultsManager | Manages evaluation results | Local filesystem, GCS |

Implementations:

  • Local: LocalEvalSetsManager, LocalEvalSetResultsManager (file-based JSON storage)
  • Cloud: GCS-backed implementations for team collaboration
  • In-Memory: InMemoryEvalSetsManager for testing

Sources: src/google/adk/cli/adk_web_server.py410-413

Eval Set File Structure

Eval sets are stored as JSON files with extension .evalset.json:

agents_dir/
  agent_name/
    eval_set_1.evalset.json
    eval_set_2.evalset.json
    ...
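
Each file contains a JSON-serialized eval set. The skeleton below is illustrative and abbreviates the real schema; field names should be checked against the EvalSet, EvalCase, and Invocation models.

```json
{
  "eval_set_id": "eval_set_1",
  "name": "eval_set_1",
  "eval_cases": [
    {
      "eval_case_id": "case_1",
      "conversation": [
        {
          "invocation_id": "inv-1",
          "user_content": {"role": "user", "parts": [{"text": "What is the weather today?"}]},
          "final_response": {"role": "model", "parts": [{"text": "It is sunny."}]}
        }
      ],
      "session_input": {"app_name": "agent_name", "user_id": "test_user", "state": {}}
    }
  ]
}
```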

Sources: src/google/adk/cli/adk_web_server.py99


Running Evaluations

Via CLI Commands

ADK provides CLI commands for managing evaluations:

Creating Evaluation Sets

Sources: src/google/adk/cli/cli_tools_click.py714-838

Running Evaluations

The CLI supports:

  • Multiple eval sets: Space-separated list of eval set files or IDs
  • Selective execution: Use : syntax to specify specific eval cases
  • No mixed sources: File paths and eval set IDs cannot be mixed in the same command
  • Parallelism: Configurable via InferenceConfig.parallelism and EvaluateConfig.parallelism
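
For example (run adk eval --help for the full set of options):

```bash
# Run every eval case in two eval sets against an agent
adk eval path/to/my_agent first.evalset.json second.evalset.json

# Run only selected eval cases from one eval set, using the ":" syntax
adk eval path/to/my_agent first.evalset.json:case_1,case_2
```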

Sources: src/google/adk/cli/cli_tools_click.py466-706 src/google/adk/cli/cli_eval.py87-158

Via REST API

The FastAPI server exposes evaluation endpoints under the /eval-sets and /eval-results paths.

Sources: src/google/adk/cli/adk_web_server.py837-1052 src/google/adk/cli/adk_web_server.py1069-1154

REST API Endpoints

| Endpoint | Method | Purpose |
|---|---|---|
| /apps/{app}/eval-sets | POST | Create evaluation set |
| /apps/{app}/eval-sets | GET | List evaluation sets |
| /apps/{app}/eval-sets/{id}/add-session | POST | Add session as eval case |
| /apps/{app}/eval-sets/{id}/eval-cases/{case_id} | GET | Get eval case details |
| /apps/{app}/eval-sets/{id}/eval-cases/{case_id} | PUT | Update eval case |
| /apps/{app}/eval-sets/{id}/eval-cases/{case_id} | DELETE | Delete eval case |
| /apps/{app}/eval-sets/{id}/run | POST | Run evaluation |
| /apps/{app}/eval_results | GET | List evaluation results |
| /apps/{app}/eval_results/{id} | GET | Get evaluation result details |
| /apps/{app}/metrics-info | GET | List available metrics |
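
A minimal sketch of calling two of these endpoints from Python with httpx, assuming a server started locally (for example via adk api_server) on port 8000:

```python
import httpx

BASE_URL = "http://localhost:8000"  # local server address; adjust as needed
APP_NAME = "my_agent"

with httpx.Client(base_url=BASE_URL) as client:
    # List the evaluation sets defined for the app.
    eval_sets = client.get(f"/apps/{APP_NAME}/eval-sets").json()

    # Discover which metrics are available for evaluation runs.
    metrics_info = client.get(f"/apps/{APP_NAME}/metrics-info").json()

    print(eval_sets, metrics_info)
```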

Sources: src/google/adk/cli/adk_web_server.py837-1193

Creating Eval Cases from Sessions

A common workflow is to convert interactive sessions into eval cases:

  1. User creates a session and interacts with the agent
  2. User calls POST /apps/{app}/eval-sets/{id}/add-session with session_id
  3. System converts session events into Invocation objects
  4. System extracts initial state from agent configuration
  5. New EvalCase is added to the eval set

Sources: src/google/adk/cli/adk_web_server.py905-950


Evaluation Metrics

Metric System Architecture

Sources: src/google/adk/evaluation/eval_metrics.py1-350 src/google/adk/evaluation/local_eval_service.py308-334

Metric Evaluator Registry

The MetricEvaluatorRegistry maps metric names to evaluator implementations. ADK uses DEFAULT_METRIC_EVALUATOR_REGISTRY which includes built-in metrics.

Registration Process: Built-in evaluators are registered in DEFAULT_METRIC_EVALUATOR_REGISTRY under their metric names; custom evaluators are added to the same registry (see Custom Metrics below).

Evaluator Interface:

  • BaseMetricEvaluator.evaluate_invocations(actual, expected, criterion) -> EvalMetricResult
  • Synchronous metrics: Regular method
  • Asynchronous metrics: Async method for LLM-based evaluation

Sources: src/google/adk/evaluation/local_eval_service.py80-86 src/google/adk/evaluation/local_eval_service.py308-334

Built-in Metrics

| Metric Name | Type | Description | Evaluator Class |
|---|---|---|---|
| tool_trajectory_avg_score | Sync | Compares tool selection accuracy using rubric | RubricBasedToolUseEvaluator |
| hallucinations_v1 | Sync | Checks response grounding against context | HallucinationsEvaluator |
| final_response_quality | Async | LLM-as-judge evaluation of response quality | ResponseQualityEvaluator |
| safety_v1 | Sync | Evaluates content safety | SafetyMetric |
| response_match_score | Sync | ROUGE-based text similarity | ResponseMatchMetric |

Default Evaluation Criteria:
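
The defaults defined in cli_eval.py are commonly cited as the following; treat the exact values as release-dependent:

```json
{
  "tool_trajectory_avg_score": 1.0,
  "response_match_score": 0.8
}
```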

Sources: src/google/adk/cli/cli_eval.py44-56 src/google/adk/evaluation/eval_metrics.py1-350

Metric Results Structure

EvalMetricResult:

  • metric_name: Name of the evaluated metric
  • score: Numeric score (0.0-1.0 typical range)
  • eval_status: EvalStatus.PASSED, EvalStatus.FAILED, or EvalStatus.NOT_EVALUATED
  • threshold: Threshold used to determine pass/fail
  • criterion: Optional criterion used for evaluation
  • details: Additional details like rubric scores

Per-Invocation vs Overall Results:

  • Per-Invocation: EvalMetricResultPerInvocation contains scores for each conversation turn
  • Overall: EvalCaseResult.overall_eval_metric_results aggregates scores across all invocations
  • Final Status: Determined by comparing overall score to threshold

Sources: src/google/adk/evaluation/eval_metrics.py1-350 src/google/adk/cli/adk_web_server.py232-253

Custom Metrics

To implement a custom metric:

  1. Create an evaluator class implementing BaseMetricEvaluator
  2. Implement evaluate_invocations() method
  3. Register with the metric registry
  4. Use in eval config
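
A hedged sketch of steps 1 and 2; the import paths, the EvalMetricResult constructor arguments, and the registry call used in step 3 are assumptions to verify against your ADK version.

```python
from google.adk.evaluation.eval_metrics import EvalMetricResult  # path: assumption
from google.adk.evaluation.eval_metrics import EvalStatus        # path: assumption


def _response_text(invocation):
    # Hypothetical accessor: pull text parts out of an invocation's final response.
    response = getattr(invocation, "final_response", None)
    if response is None:
        return ""
    parts = getattr(response, "parts", None) or []
    return " ".join(getattr(part, "text", "") or "" for part in parts)


class KeywordPresenceEvaluator:
    """Toy metric: fraction of expected keywords present in the final responses.

    In practice this would subclass the evaluator base class named above
    (BaseMetricEvaluator); here it simply provides the same method shape.
    """

    def __init__(self, keywords, threshold=0.5):
        self._keywords = keywords
        self._threshold = threshold

    def evaluate_invocations(self, actual, expected, criterion=None):
        text = " ".join(_response_text(inv) for inv in actual).lower()
        hits = sum(1 for kw in self._keywords if kw.lower() in text)
        score = hits / len(self._keywords) if self._keywords else 0.0
        threshold = criterion if criterion is not None else self._threshold
        return EvalMetricResult(
            metric_name="keyword_presence",  # hypothetical custom metric name
            score=score,
            threshold=threshold,
            eval_status=EvalStatus.PASSED if score >= threshold else EvalStatus.FAILED,
        )


# Step 3: register the evaluator under its metric name with
# DEFAULT_METRIC_EVALUATOR_REGISTRY; the registration method's exact name and
# signature should be taken from the registry module.
```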

Sources: src/google/adk/evaluation/eval_metrics.py1-350

Listing Available Metrics

The /apps/{app}/metrics-info endpoint returns metadata about all registered metrics:

Sources: src/google/adk/cli/adk_web_server.py1175-1193


Conformance Testing

The conformance testing system provides deterministic testing through record/replay capabilities. This enables regression testing and consistent test execution across environments.

CLI Commands

Directory Structure:

tests/
  category/
    test_name/
      spec.yaml                    # Test specification (TestCaseInput or TestCase)
      generated-recordings.yaml    # Recorded interactions (replay mode)
      generated-session.yaml       # Session data (replay mode)

Test Modes:

  • replay: Verifies agent interactions match previously recorded behaviors exactly
  • live: Runs evaluation-based verification (compares actual vs expected using metrics)

Sources: src/google/adk/cli/cli_tools_click.py124-275

AdkWebServerClient

AdkWebServerClient is an HTTP client for interacting with the ADK web server in conformance tests. It supports both manual lifecycle management and automatic cleanup via async context manager.

Sources: src/google/adk/cli/conformance/adk_web_server_client.py37-268

Record and Replay Modes

The conformance system injects configuration into session state to control agent behavior:

Record Mode:

The client adds record-mode configuration to the session's state_delta so that the agent's interactions are captured for later replay.

Replay Mode:

The client adds replay-mode configuration to state_delta so that previously recorded interactions are replayed instead of new live calls being made.

Sources: src/google/adk/cli/conformance/adk_web_server_client.py209-254

Context Manager Usage

The recommended pattern is to use AdkWebServerClient as an async context manager:
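
A minimal sketch, assuming a locally running ADK web server; the constructor argument shown here is an assumption, and the calls to make inside the block depend on which client methods your test needs (see adk_web_server_client.py):

```python
import asyncio

from google.adk.cli.conformance.adk_web_server_client import AdkWebServerClient


async def main():
    # Entering the context opens the underlying HTTP client; leaving it
    # performs the automatic cleanup described above.
    async with AdkWebServerClient(base_url="http://localhost:8000") as client:
        ...  # create sessions, run the agent, fetch recordings, etc.


asyncio.run(main())
```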

Sources: src/google/adk/cli/conformance/adk_web_server_client.py93-103


Testing Best Practices

Mock Patterns

The test suite demonstrates effective mocking patterns for testing ADK components:

Mocking the Runner:
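
A sketch of the pattern, assuming the Runner lives at google.adk.runners.Runner; run_async is patched so no real model call happens:

```python
from unittest import mock

import pytest


@pytest.fixture
def runner_events():
    """Events the stubbed Runner will emit; tests append pre-built Event objects."""
    return []


@pytest.fixture
def mock_runner(runner_events):
    async def _stub_run_async(self, *args, **kwargs):
        # Yield canned events instead of invoking a real model.
        for event in runner_events:
            yield event

    with mock.patch("google.adk.runners.Runner.run_async", _stub_run_async):
        yield
```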

Mocking Services:
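
For services, the in-memory implementations shipped with ADK are usually sufficient as drop-in test doubles:

```python
import pytest

from google.adk.artifacts import InMemoryArtifactService
from google.adk.sessions import InMemorySessionService


@pytest.fixture
def mock_session_service():
    return InMemorySessionService()


@pytest.fixture
def mock_artifact_service():
    return InMemoryArtifactService()
```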

Sources: tests/unittests/cli/test_fast_api.py162-289

Test Fixture Strategies

The test suite uses pytest fixtures to create reusable test infrastructure:

| Fixture | Purpose |
|---|---|
| mock_agent_loader | Loads test agents without filesystem access |
| mock_session_service | In-memory session storage for tests |
| mock_artifact_service | In-memory artifact storage for tests |
| mock_eval_sets_manager | In-memory eval set management |
| test_app | Configured TestClient for FastAPI app |
| create_test_session | Creates a test session with known IDs |
| create_test_eval_set | Creates a test eval set with sample data |

Fixture Composition:
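
A schematic example of composition: the test_app fixture depends on the mock fixtures so they are in place before the app is built. The get_fast_api_app arguments are assumptions, and wiring the in-memory services into the app is elided here:

```python
import pytest
from fastapi.testclient import TestClient


@pytest.fixture
def test_app(mock_runner, mock_session_service, mock_artifact_service):
    from google.adk.cli.fast_api import get_fast_api_app

    # mock_runner patches Runner.run_async for the lifetime of the test;
    # injecting the in-memory services depends on the server's constructor.
    app = get_fast_api_app(agents_dir="tests/fixtures/agents", web=False)
    return TestClient(app)
```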

Sources: tests/unittests/cli/test_fast_api.py161-452

TestClient Usage

FastAPI's TestClient enables synchronous testing of async endpoints without running a server:
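
For example, a plain request against one of the eval endpoints listed above:

```python
def test_list_eval_sets(test_app):
    # TestClient drives the async FastAPI app synchronously, no server needed.
    response = test_app.get("/apps/my_agent/eval-sets")
    assert response.status_code == 200
```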

Testing Streaming Endpoints:
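
A hedged sketch of streaming over the server-sent-events run endpoint; the endpoint path and payload field names are assumptions, and the session is assumed to exist already:

```python
def test_run_sse_streaming(test_app):
    payload = {
        "appName": "my_agent",
        "userId": "test_user",
        "sessionId": "test_session",
        "newMessage": {"role": "user", "parts": [{"text": "hello"}]},
    }
    with test_app.stream("POST", "/run_sse", json=payload) as response:
        assert response.status_code == 200
        for line in response.iter_lines():
            if line.startswith("data:"):
                ...  # parse each server-sent event as it arrives
```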

Sources: tests/unittests/cli/test_fast_api.py630-824

Parametrized Testing

Use pytest's parametrize decorator for comprehensive coverage:
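
A self-contained illustration using the score/threshold comparison described in the metrics section:

```python
import pytest


@pytest.mark.parametrize(
    "score, threshold, expected_pass",
    [
        (0.9, 0.8, True),
        (0.8, 0.8, True),
        (0.5, 0.8, False),
    ],
)
def test_threshold_comparison(score, threshold, expected_pass):
    # pytest reports each tuple as its own test case.
    assert (score >= threshold) == expected_pass
```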

Sources: tests/unittests/telemetry/test_google_cloud.py24-61


Telemetry Integration

The evaluation and testing infrastructure integrates with ADK's telemetry system to provide observability.

Trace Collection During Tests

The AdkWebServer sets up internal span exporters to capture traces during evaluation:

Span Processors:

  • ApiServerSpanExporter: Captures call_llm, send_data, and execute_tool spans, indexed by event ID
  • InMemoryExporter: Captures all spans for a session, indexed by session ID

Debug Endpoints:

  • GET /debug/trace/{event_id}: Returns trace attributes for a specific event
  • GET /debug/trace/session/{session_id}: Returns all spans for a session
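
These endpoints can be exercised from tests through the same TestClient, for example:

```python
def test_session_trace_available(test_app):
    session_id = "test_session"  # a session created earlier in the test
    response = test_app.get(f"/debug/trace/session/{session_id}")
    assert response.status_code == 200
```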

Sources: src/google/adk/cli/adk_web_server.py105-163 src/google/adk/cli/adk_web_server.py282-301 src/google/adk/cli/adk_web_server.py668-691

Telemetry Configuration

The _setup_telemetry function supports multiple telemetry backends:

| Configuration | Behavior |
|---|---|
| otel_to_cloud=True | Enables Cloud Trace, Cloud Monitoring, Cloud Logging |
| OTEL env vars set | Uses generic OTLP exporters to configured endpoints |
| Default | Local-only telemetry with internal exporters |

Environment Variables:

  • OTEL_EXPORTER_OTLP_ENDPOINT
  • OTEL_EXPORTER_OTLP_TRACES_ENDPOINT
  • OTEL_EXPORTER_OTLP_METRICS_ENDPOINT
  • OTEL_EXPORTER_OTLP_LOGS_ENDPOINT

Sources: src/google/adk/cli/adk_web_server.py282-387 src/google/adk/telemetry/setup.py50-126 src/google/adk/telemetry/google_cloud.py45-96


Complete Testing Example

Here's a complete example demonstrating evaluation creation, execution, and result retrieval:
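
The sketch below exercises the REST endpoints from the table above through the test_app fixture. Request bodies flagged as assumptions are illustrative only and should be checked against the current API:

```python
def test_eval_roundtrip(test_app):
    app = "my_agent"

    # 1. Create an eval set (request body shape: assumption).
    resp = test_app.post(f"/apps/{app}/eval-sets", json={"eval_set_id": "smoke_set"})
    assert resp.status_code == 200

    # 2. Convert an existing session into an eval case (eval_id field: assumption).
    resp = test_app.post(
        f"/apps/{app}/eval-sets/smoke_set/add-session",
        json={"eval_id": "case_1", "session_id": "existing_session"},
    )
    assert resp.status_code == 200

    # 3. Run the evaluation (metric selection body: assumption).
    resp = test_app.post(
        f"/apps/{app}/eval-sets/smoke_set/run",
        json={"eval_metrics": [{"metric_name": "response_match_score", "threshold": 0.8}]},
    )
    assert resp.status_code == 200

    # 4. List results and confirm at least one was persisted.
    results = test_app.get(f"/apps/{app}/eval_results").json()
    assert results
```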

Sources: tests/unittests/cli/conformance/test_adk_web_server_client.py47-249 tests/unittests/cli/test_fast_api.py862-949