Evals
Evals let you define test cases for your agents and verify that they respond correctly by measuring response quality, token usage, execution time, and the sequence of tools called. Run evals manually to validate behavior before deploying changes.
Table of Contents
- Overview
- Core Concepts
- Creating an Eval
- Eval Properties
- Criteria Types
- Running an Eval
- Execution Results
- Structure Summary
Overview
An eval is a saved test case tied to a specific agent. Each eval defines an input prompt and one or more criteria the agent's response must satisfy. When you run an eval, Evolvable.ai sends the input to the agent, collects the response and execution metrics, and checks each criterion automatically.
With evals you can:
- Verify that an agent's response meets quality or content requirements
- Assert that the agent calls the right tools in the right order
- Enforce limits on token usage and response time
Note: Evals can only be created for top-level agents (agents that users can start conversations with directly).
Core Concepts
Eval
A saved test case for an agent. It contains an input, optional criteria, and optional constraint limits. Evals are reusable: you can run the same eval multiple times and compare results.
Execution
A single run of an eval. Each execution records whether the agent passed or failed each criterion, along with actual token counts and response time.
Trajectory
The sequence of actions the agent took during an execution, specifically the tools it called and the knowledge chunks it fetched. Trajectory criteria let you assert that the agent used the right tools.
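To make these concepts concrete, the sketch below models them as TypeScript types. The field names are assumptions chosen for illustration; they are not Evolvable.ai's actual schema.

```ts
// Illustrative shapes only; field names are assumptions, not the real schema.
type TrajectoryCriterion =
  | { kind: "toolCall"; toolNames: string[]; exact: boolean; ordered: boolean }
  | { kind: "chunkFetch"; chunkIds: string[] };

type Eval = {
  title: string;
  description?: string;
  input: string;                      // prompt sent to the agent
  streaming: boolean;
  maxPromptTokens?: number;           // constraint limits are all optional
  maxResponseTokens?: number;
  maxTimeMs?: number;
  responseCriteria?: string;          // natural-language, LLM-evaluated
  trajectoryCriteria?: TrajectoryCriterion[];
};

type Execution = {
  overallPassed: boolean;
  promptTokens: number;               // raw metrics, recorded on every run
  responseTokens: number;
  executionTimeMs: number;
  toolCalls: string[];                // the trajectory: tools called, in order
};
```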
Creating an Eval
Navigate to an agent's Evals tab and click New Eval. Fill in the title, input, and any criteria you want to check, then save.
Eval Properties
Identity
| Property | Description |
|---|---|
| Title | A name for this test case |
| Description | Optional notes about what this eval is testing |
Input
| Property | Description |
|---|---|
| Input | The prompt sent to the agent during the eval |
| Streaming | Whether the agent runs in streaming mode for this eval |
Constraint Limits
These limits define hard boundaries on token usage and execution time. If the agent exceeds any of these, the eval fails that check.
| Property | Description |
|---|---|
| Max Prompt Tokens | Maximum number of tokens allowed in the prompt |
| Max Response Tokens | Maximum number of tokens allowed in the response |
| Max Time (ms) | Maximum allowed execution time in milliseconds |
All constraint limits are optional. If a limit is not set, that check is skipped.
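As a sketch of these semantics, each limit behaves like a three-valued check: skipped when no limit is configured, otherwise pass or fail. The helper below is illustrative, not product code.

```ts
// Illustrative: a constraint check passes, fails, or is skipped (null).
function checkLimit(actual: number, limit?: number): boolean | null {
  if (limit === undefined) return null; // no limit configured: check skipped
  return actual <= limit;               // at or under the limit: pass
}

// checkLimit(1200, 1000) === false; checkLimit(1200) === null (skipped)
```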
Response Criteria
A free-text description of what the agent's response should contain or achieve. Evolvable.ai uses an LLM to evaluate whether the actual response meets this description, and provides a failure reason if it does not.
Example: βThe response must recommend at least one product and include a price.β
Criteria Types
Response Criteria
A natural-language description evaluated by an LLM. The model reads the agent's response alongside your criteria and determines whether the response satisfies them. If it does not, a failure reason is provided.
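Conceptually, this works like an LLM-as-judge call, as in the sketch below. The `complete` helper and the prompt wording are hypothetical stand-ins, not Evolvable.ai's actual evaluation prompt or API.

```ts
// Hypothetical chat-completion helper standing in for the judge model call.
declare function complete(prompt: string): Promise<string>;

async function judgeResponse(
  criteria: string,
  response: string,
): Promise<{ passed: boolean; failureReason: string | null }> {
  const verdict = await complete(
    `Criteria: ${criteria}\n` +
      `Response: ${response}\n` +
      `Does the response satisfy the criteria? Answer as JSON: ` +
      `{"passed": boolean, "failureReason": string | null}`,
  );
  // The judge returns a pass/fail verdict plus a reason when it fails.
  return JSON.parse(verdict);
}
```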
Trajectory Criteria
Trajectory criteria validate the sequence of actions the agent took: not what it said, but what it did. You can define multiple trajectory criteria on a single eval; all of them must pass for the trajectory check to succeed.
Tool Call Criterion
Asserts that the agent called specific tools during the execution.
| Property | Description |
|---|---|
| Tool Names | The list of expected tool names |
| Exact | If enabled, the agent must call exactly these tools, no more and no fewer |
| Ordered | If enabled, the tools must be called in the specified order |
Matching behavior:
| Exact | Ordered | Behavior |
|---|---|---|
| false | false | All expected tools must be called (subset match) |
| true | false | The agent must call exactly these tools, in any order |
| true | true | The agent must call exactly these tools, in this order |
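The three documented combinations translate to roughly the following logic. This is an illustrative sketch, not the platform's implementation; the table does not define exact = false with ordered = true, so that combination is not modeled.

```ts
// Illustrative sketch of the matching table. `actual` is the ordered list
// of tool names the agent called during the execution.
function toolCallsPass(
  actual: string[],
  expected: string[],
  exact: boolean,
  ordered: boolean,
): boolean {
  if (exact && ordered) {
    // Exactly these tools, in exactly this order.
    return (
      actual.length === expected.length &&
      actual.every((name, i) => name === expected[i])
    );
  }
  if (exact) {
    // Exactly these tools, in any order: compare as sorted multisets.
    return (
      JSON.stringify([...actual].sort()) ===
      JSON.stringify([...expected].sort())
    );
  }
  // Subset match: every expected tool must be called at least once.
  return expected.every((name) => actual.includes(name));
}
```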
Chunk Fetch Criterion
Asserts that the agent fetched specific knowledge chunks during the execution. Specify the expected chunk IDs that should have been retrieved.
This criterion type is available but not yet fully evaluated; it currently does not affect the overall pass/fail result.
Running an Eval
Open an eval and click Run. Evolvable.ai will:
- Send the evalβs input to the agent
- Collect the response, token counts, and execution time
- Evaluate each criterion in sequence
- Record the overall pass/fail result
Executions run synchronously and the result is shown immediately.
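Tying the steps together, a run conceptually looks like the sketch below, reusing the illustrative `checkLimit` and `judgeResponse` helpers from earlier sections. The real pipeline is internal to Evolvable.ai, and the `sendToAgent` signature is hypothetical.

```ts
// Hypothetical transport: sends the input and returns response plus metrics.
declare function sendToAgent(
  input: string,
  streaming: boolean,
): Promise<{
  response: string;
  promptTokens: number;
  responseTokens: number;
  executionTimeMs: number;
  toolCalls: string[];
}>;

async function runEval(testCase: Eval): Promise<boolean> {
  // 1. Send the eval's input to the agent and collect metrics.
  const run = await sendToAgent(testCase.input, testCase.streaming);

  // 2. Evaluate each criterion in sequence (trajectory checks omitted here).
  const checks: (boolean | null)[] = [
    checkLimit(run.promptTokens, testCase.maxPromptTokens),
    checkLimit(run.responseTokens, testCase.maxResponseTokens),
    checkLimit(run.executionTimeMs, testCase.maxTimeMs),
    testCase.responseCriteria
      ? (await judgeResponse(testCase.responseCriteria, run.response)).passed
      : null,
  ];

  // 3. Overall pass: every applicable (non-null) check must pass.
  return checks.every((check) => check === null || check === true);
}
```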
Execution Results
After an eval runs, the result shows a breakdown of every check performed.
Result Fields
| Field | Description |
|---|---|
| Overall Passed | true only if every applicable check passed |
| Prompt Tokens Passed | Whether prompt token count was within the limit (null if no limit set) |
| Response Tokens Passed | Whether response token count was within the limit (null if no limit set) |
| Time Passed | Whether execution time was within the limit (null if no limit set) |
| Response Criteria Passed | Whether the response met the content criteria (null if no criteria set) |
| Trajectory Criteria Passed | Whether all trajectory criteria passed (null if no criteria defined) |
| Response Criteria Failure Reason | Explanation from the LLM when response criteria fail |
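Put differently, Overall Passed is derived from the per-check fields, where null marks a skipped check. A minimal sketch, assuming these field names:

```ts
// Hypothetical result shape; null means the check was skipped because no
// limit or criteria were configured for it.
type ExecutionResult = {
  promptTokensPassed: boolean | null;
  responseTokensPassed: boolean | null;
  timePassed: boolean | null;
  responseCriteriaPassed: boolean | null;
  trajectoryCriteriaPassed: boolean | null;
};

// Overall Passed is true only if every applicable (non-null) check passed.
function overallPassed(result: ExecutionResult): boolean {
  return Object.values(result).every(
    (check) => check === null || check === true,
  );
}
```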
Metrics
Each execution also records raw metrics regardless of whether limits were set:
| Metric | Description |
|---|---|
| Prompt Tokens | Actual number of tokens in the prompt |
| Response Tokens | Actual number of tokens in the response |
| Execution Time (ms) | Total time from input to response |
| Tool Calls | List of tools called during the execution, in order |
Structure Summary
```
Agent Eval
├── Input
├── Streaming flag
├── Constraint Limits
│   ├── Max Prompt Tokens
│   ├── Max Response Tokens
│   └── Max Time (ms)
├── Response Criteria (LLM-evaluated)
└── Trajectory Criteria
    ├── Tool Call Criteria
    └── Chunk Fetch Criteria
```