Evals
Evals let you define test cases for your agents and verify that they respond correctly by measuring response quality, token usage, execution time, and the sequence of tools called. Run evals manually to validate behavior before deploying changes.
Table of Contents
- Overview
- Core Concepts
- Creating an Eval
- Eval Properties
- Criteria Types
- Running an Eval
- Execution Results
- Structure Summary
Overview
An eval is a saved test case tied to a specific agent. Each eval defines an input prompt and one or more criteria the agent's response must satisfy. When you run an eval, Evolvable.ai sends the input to the agent, collects the response and execution metrics, and checks each criterion automatically.
With evals you can:
- Verify that an agent's response meets quality or content requirements
- Assert that the agent calls the right tools in the right order
- Enforce limits on token usage and response time
Note: Evals can only be created for top-level agents (agents that users can start conversations with directly).
Core Concepts
Eval
A saved test case for an agent. It contains an input, optional criteria, and optional constraint limits. Evals are reusable: you can run the same eval multiple times and compare results.
Execution
A single run of an eval. Each execution records whether the agent passed or failed each criterion, along with actual token counts and response time.
Trajectory
The sequence of actions the agent took during an execution, specifically the tools it called and the knowledge chunks it fetched. Trajectory criteria let you assert that the agent used the right tools.
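To make these concepts concrete, the sketch below models them as TypeScript types. The field names are assumptions chosen for illustration; they are not Evolvable.ai's actual schema.

```ts
// Illustrative shapes only; field names are assumptions, not the real schema.
type TrajectoryCriterion =
  | { kind: "toolCall"; toolNames: string[]; exact: boolean; ordered: boolean }
  | { kind: "chunkFetch"; chunkIds: string[] };

type Eval = {
  title: string;
  description?: string;
  input: string;                      // prompt sent to the agent
  streaming: boolean;
  maxPromptTokens?: number;           // constraint limits are all optional
  maxResponseTokens?: number;
  maxTimeMs?: number;
  responseCriteria?: string;          // natural-language, LLM-evaluated
  trajectoryCriteria?: TrajectoryCriterion[];
};

type Execution = {
  overallPassed: boolean;
  promptTokens: number;               // raw metrics, recorded on every run
  responseTokens: number;
  executionTimeMs: number;
  toolCalls: string[];                // the trajectory: tools called, in order
};
```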
Creating an Eval
Navigate to an agent's Evals tab and click New Eval. Fill in the title, input, and any criteria you want to check, then save.
Eval Properties
Identity
| Property | Description |
|---|---|
| Title | A name for this test case |
| Description | Optional notes about what this eval is testing |
Input
| Property | Description |
|---|---|
| Input | The prompt sent to the agent during the eval |
| Streaming | Whether the agent runs in streaming mode for this eval |
Constraint Limits
These limits define hard boundaries on token usage and execution time. If the agent exceeds any of these, the eval fails that check.
| Property | Description |
|---|---|
| Max Prompt Tokens | Maximum number of tokens allowed in the prompt |
| Max Response Tokens | Maximum number of tokens allowed in the response |
| Max Time (ms) | Maximum allowed execution time in milliseconds |
All constraint limits are optional. If a limit is not set, that check is skipped.
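As a sketch of these semantics, each limit behaves like a three-valued check: skipped when no limit is configured, otherwise pass or fail. The helper below is illustrative, not product code.

```ts
// Illustrative: a constraint check passes, fails, or is skipped (null).
function checkLimit(actual: number, limit?: number): boolean | null {
  if (limit === undefined) return null; // no limit configured: check skipped
  return actual <= limit;               // at or under the limit: pass
}

// checkLimit(1200, 1000) === false; checkLimit(1200) === null (skipped)
```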
Response Criteria
A free-text description of what the agent's response should contain or achieve. Evolvable.ai uses an LLM to evaluate whether the actual response meets this description, and provides a failure reason if it does not.
Example: βThe response must recommend at least one product and include a price.β
Criteria Types
Response Criteria
A natural-language description evaluated by an LLM. The model reads the agent's response alongside your criteria and determines whether the response satisfies them. If it does not, a failure reason is provided.
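Conceptually, this works like an LLM-as-judge call, as in the sketch below. The `complete` helper and the prompt wording are hypothetical stand-ins, not Evolvable.ai's actual evaluation prompt or API.

```ts
// Hypothetical chat-completion helper standing in for the judge model call.
declare function complete(prompt: string): Promise<string>;

async function judgeResponse(
  criteria: string,
  response: string,
): Promise<{ passed: boolean; failureReason: string | null }> {
  const verdict = await complete(
    `Criteria: ${criteria}\n` +
      `Response: ${response}\n` +
      `Does the response satisfy the criteria? Answer as JSON: ` +
      `{"passed": boolean, "failureReason": string | null}`,
  );
  // The judge returns a pass/fail verdict plus a reason when it fails.
  return JSON.parse(verdict);
}
```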
Trajectory Criteria
Trajectory criteria validate the sequence of actions the agent took: not what it said, but what it did. You can define multiple trajectory criteria on a single eval; all of them must pass for the trajectory check to succeed.
Tool Call Criterion
Asserts that the agent called specific tools during the execution.
| Property | Description |
|---|---|
| Tool Names | The list of expected tool names |
| Exact | If enabled, the agent must call exactly these tools, no more and no fewer |
| Ordered | If enabled, the tools must be called in the specified order |
Matching behavior:
| Exact | Ordered | Behavior |
|---|---|---|
| false | false | All expected tools must be called (subset match) |
| true | false | The agent must call exactly these tools, in any order |
| true | true | The agent must call exactly these tools, in this order |
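The three documented combinations translate to roughly the following logic. This is an illustrative sketch, not the platform's implementation; the table does not define exact = false with ordered = true, so that combination is not modeled.

```ts
// Illustrative sketch of the matching table. `actual` is the ordered list
// of tool names the agent called during the execution.
function toolCallsPass(
  actual: string[],
  expected: string[],
  exact: boolean,
  ordered: boolean,
): boolean {
  if (exact && ordered) {
    // Exactly these tools, in exactly this order.
    return (
      actual.length === expected.length &&
      actual.every((name, i) => name === expected[i])
    );
  }
  if (exact) {
    // Exactly these tools, in any order: compare as sorted multisets.
    return (
      JSON.stringify([...actual].sort()) ===
      JSON.stringify([...expected].sort())
    );
  }
  // Subset match: every expected tool must be called at least once.
  return expected.every((name) => actual.includes(name));
}
```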
Chunk Fetch Criterion
Asserts that the agent fetched specific knowledge chunks during the execution. Specify the expected chunk IDs that should have been retrieved.
This criterion type is available but not yet fully evaluated; it currently does not affect the overall pass/fail result.
Running an Eval
Open an eval and click Run. Evolvable.ai will:
- Send the evalβs input to the agent
- Collect the response, token counts, and execution time
- Evaluate each criterion in sequence
- Record the overall pass/fail result
Executions run synchronously and the result is shown immediately.
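Tying the steps together, a run conceptually looks like the sketch below, reusing the illustrative `checkLimit` and `judgeResponse` helpers from earlier sections. The real pipeline is internal to Evolvable.ai, and the `sendToAgent` signature is hypothetical.

```ts
// Hypothetical transport: sends the input and returns response plus metrics.
declare function sendToAgent(
  input: string,
  streaming: boolean,
): Promise<{
  response: string;
  promptTokens: number;
  responseTokens: number;
  executionTimeMs: number;
  toolCalls: string[];
}>;

async function runEval(testCase: Eval): Promise<boolean> {
  // 1. Send the eval's input to the agent and collect metrics.
  const run = await sendToAgent(testCase.input, testCase.streaming);

  // 2. Evaluate each criterion in sequence (trajectory checks omitted here).
  const checks: (boolean | null)[] = [
    checkLimit(run.promptTokens, testCase.maxPromptTokens),
    checkLimit(run.responseTokens, testCase.maxResponseTokens),
    checkLimit(run.executionTimeMs, testCase.maxTimeMs),
    testCase.responseCriteria
      ? (await judgeResponse(testCase.responseCriteria, run.response)).passed
      : null,
  ];

  // 3. Overall pass: every applicable (non-null) check must pass.
  return checks.every((check) => check === null || check === true);
}
```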
Execution Results
After an eval runs, the result shows a breakdown of every check performed.
Result Fields
| Field | Description |
|---|---|
| Overall Passed | true only if every applicable check passed |
| Prompt Tokens Passed | Whether prompt token count was within the limit (null if no limit set) |
| Response Tokens Passed | Whether response token count was within the limit (null if no limit set) |
| Time Passed | Whether execution time was within the limit (null if no limit set) |
| Response Criteria Passed | Whether the response met the content criteria (null if no criteria set) |
| Trajectory Criteria Passed | Whether all trajectory criteria passed (null if no criteria defined) |
| Response Criteria Failure Reason | Explanation from the LLM when response criteria fail |
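Put differently, Overall Passed is derived from the per-check fields, where null marks a skipped check. A minimal sketch, assuming these field names:

```ts
// Hypothetical result shape; null means the check was skipped because no
// limit or criteria were configured for it.
type ExecutionResult = {
  promptTokensPassed: boolean | null;
  responseTokensPassed: boolean | null;
  timePassed: boolean | null;
  responseCriteriaPassed: boolean | null;
  trajectoryCriteriaPassed: boolean | null;
};

// Overall Passed is true only if every applicable (non-null) check passed.
function overallPassed(result: ExecutionResult): boolean {
  return Object.values(result).every(
    (check) => check === null || check === true,
  );
}
```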
Metrics
Each execution also records raw metrics regardless of whether limits were set:
| Metric | Description |
|---|---|
| Prompt Tokens | Actual number of tokens in the prompt |
| Response Tokens | Actual number of tokens in the response |
| Execution Time (ms) | Total time from input to response |
| Tool Calls | List of tools called during the execution, in order |
Structure Summary
```
Agent Eval
├── Input
├── Streaming flag
├── Constraint Limits
│   ├── Max Prompt Tokens
│   ├── Max Response Tokens
│   └── Max Time (ms)
├── Response Criteria (LLM-evaluated)
└── Trajectory Criteria
    ├── Tool Call Criteria
    └── Chunk Fetch Criteria
```