
Evals

Evals let you define test cases for your agents and verify that they respond correctly β€” measuring response quality, token usage, execution time, and the sequence of tools called. Run evals manually to validate behavior before deploying changes.


Table of Contents

  1. Overview
  2. Core Concepts
  3. Creating an Eval
  4. Eval Properties
  5. Criteria Types
  6. Running an Eval
  7. Execution Results

Overview

An eval is a saved test case tied to a specific agent. Each eval defines an input prompt and one or more criteria the agent’s response must satisfy. When you run an eval, Evolvable.ai sends the input to the agent, collects the response and execution metrics, and checks each criterion automatically.

With evals you can:

  • Verify that an agent’s response meets quality or content requirements
  • Assert that the agent calls the right tools in the right order
  • Enforce limits on token usage and response time

Note: Evals can only be created for top-level agents β€” agents that users can start conversations with directly.


Core Concepts

Eval

A saved test case for an agent. It contains an input, optional criteria, and optional constraint limits. Evals are reusable β€” you can run the same eval multiple times and compare results.

Execution

A single run of an eval. Each execution records whether the agent passed or failed each criterion, along with actual token counts and response time.

Trajectory

The sequence of actions the agent took during an execution β€” specifically, the tools it called and the knowledge chunks it fetched. Trajectory criteria let you assert that the agent used the right tools.
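
A trajectory can be pictured as a small record of what the agent did. Below is a minimal Python sketch; the type and field names (`Trajectory`, `tool_calls`, `chunk_fetches`) are illustrative, not the platform's API.

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    """What one execution did (illustrative field names, not the platform's API)."""
    tool_calls: list[str] = field(default_factory=list)     # tool names, in call order
    chunk_fetches: list[str] = field(default_factory=list)  # IDs of knowledge chunks retrieved
```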


Creating an Eval

Navigate to an agent’s Evals tab and click New Eval. Fill in the title, input, and any criteria you want to check, then save.


Eval Properties

Identity

Property      Description
Title         A name for this test case
Description   Optional notes about what this eval is testing

Input

Property      Description
Input         The prompt sent to the agent during the eval
Streaming     Whether the agent runs in streaming mode for this eval

Constraint Limits

These limits define hard boundaries on token usage and execution time. If the agent exceeds any of these, the eval fails that check.

Property             Description
Max Prompt Tokens    Maximum number of tokens allowed in the prompt
Max Response Tokens  Maximum number of tokens allowed in the response
Max Time (ms)        Maximum allowed execution time in milliseconds

All constraint limits are optional. If a limit is not set, that check is skipped.
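
To make the skip-if-unset behavior concrete, here is a minimal sketch of how such a check could work; the function is hypothetical, not Evolvable.ai's implementation.

```python
from typing import Optional

def check_limit(actual: int, limit: Optional[int]) -> Optional[bool]:
    """Compare a measured value against a limit; None means no limit was set, so skip."""
    if limit is None:
        return None          # check skipped, reported as null in the results
    return actual <= limit

print(check_limit(actual=1850, limit=None))  # None: no Max Prompt Tokens limit set
print(check_limit(actual=420, limit=500))    # True: within the configured limit
```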

Response Criteria

A free-text description of what the agent’s response should contain or achieve. Evolvable.ai uses an LLM to evaluate whether the actual response meets this description, and provides a failure reason if it does not.

Example: β€œThe response must recommend at least one product and include a price.”


Criteria Types

Response Criteria

A natural-language description evaluated by an LLM. The model reads the agent’s response alongside your criteria and determines whether the response satisfies them. If it does not, a failure reason is provided.
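
This is an LLM-as-judge pattern. The sketch below shows the general shape, assuming a hypothetical `llm_complete` callable; Evolvable.ai's actual judge prompt and model are not documented here.

```python
import json

JUDGE_TEMPLATE = """You are grading an AI agent's response.
Criteria: {criteria}
Response: {response}
Answer with JSON: {{"passed": true or false, "reason": "why it failed, if it did"}}"""

def judge_response(llm_complete, criteria: str, response: str) -> tuple[bool, str]:
    """Ask a judge model whether the response satisfies the criteria (illustrative)."""
    raw = llm_complete(JUDGE_TEMPLATE.format(criteria=criteria, response=response))
    verdict = json.loads(raw)
    return verdict["passed"], verdict.get("reason", "")
```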

Trajectory Criteria

Trajectory criteria validate the sequence of actions the agent took β€” not what it said, but what it did. You can define multiple trajectory criteria on a single eval; all of them must pass for the trajectory check to succeed.

Tool Call Criterion

Asserts that the agent called specific tools during the execution.

Property    Description
Tool Names  The list of expected tool names
Exact       If enabled, the agent must call exactly these tools — no more, no fewer
Ordered     If enabled, the tools must be called in the specified order

Matching behavior:

Exact  Ordered  Behavior
false  false    All expected tools must be called (subset match)
true   false    The agent must call exactly these tools, in any order
true   true     The agent must call exactly these tools, in this order
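
The matching table maps onto a small predicate. Here is a sketch of the three documented combinations; the platform's real matcher may handle edge cases such as duplicate tool calls differently.

```python
def tool_calls_pass(expected: list[str], actual: list[str],
                    exact: bool, ordered: bool) -> bool:
    """Evaluate a Tool Call Criterion against the matching table (illustrative)."""
    if exact and ordered:
        return actual == expected                   # exactly these tools, in this order
    if exact:
        return sorted(actual) == sorted(expected)   # exactly these tools, in any order
    # Default: subset match; every expected tool was called at least once.
    return set(expected).issubset(actual)

# Subset match passes even when extra tools were called:
assert tool_calls_pass(["search", "checkout"],
                       ["search", "recommend", "checkout"],
                       exact=False, ordered=False)
# Exact + ordered fails on a reordered sequence:
assert not tool_calls_pass(["search", "checkout"],
                           ["checkout", "search"],
                           exact=True, ordered=True)
```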

Chunk Fetch Criterion

Asserts that the agent fetched specific knowledge chunks during the execution. Specify the expected chunk IDs that should have been retrieved.

This criterion type is available but not yet fully evaluated β€” it currently does not affect the overall pass/fail result.


Running an Eval

Open an eval and click Run. Evolvable.ai will:

  1. Send the eval’s input to the agent
  2. Collect the response, token counts, and execution time
  3. Evaluate each criterion in sequence
  4. Record the overall pass/fail result

Executions run synchronously and the result is shown immediately.
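
In pseudocode terms, a synchronous run follows the four steps above. The toy sketch below illustrates the flow, with `agent_fn` standing in for the agent; it is not the platform's API.

```python
import time

def run_eval(agent_fn, eval_input: str) -> dict:
    """One synchronous eval run (toy sketch; agent_fn stands in for the agent)."""
    start = time.monotonic()
    response = agent_fn(eval_input)                      # 1. send the input to the agent
    elapsed_ms = int((time.monotonic() - start) * 1000)  # 2. collect execution time
    # 3. and 4. criteria evaluation and the overall pass/fail roll-up would
    # happen here; see the sketches under "Criteria Types" and "Execution Results".
    return {"response": response, "execution_time_ms": elapsed_ms}

print(run_eval(lambda prompt: f"echo: {prompt}", "Recommend a budget laptop."))
```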


Execution Results

After an eval runs, the result shows a breakdown of every check performed.

Result Fields

Field                             Description
Overall Passed                    true only if every applicable check passed
Prompt Tokens Passed              Whether prompt token count was within the limit (null if no limit set)
Response Tokens Passed            Whether response token count was within the limit (null if no limit set)
Time Passed                       Whether execution time was within the limit (null if no limit set)
Response Criteria Passed          Whether the response met the content criteria (null if no criteria set)
Trajectory Criteria Passed        Whether all trajectory criteria passed (null if no criteria defined)
Response Criteria Failure Reason  Explanation from the LLM when response criteria fail
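
The null convention makes the overall roll-up simple: skipped checks are ignored rather than counted as failures. A minimal sketch, assuming the individual checks are collected into a dict:

```python
from typing import Optional

def overall_passed(checks: dict[str, Optional[bool]]) -> bool:
    """True only if every applicable check passed; None marks a skipped check."""
    return all(passed for passed in checks.values() if passed is not None)

print(overall_passed({
    "prompt_tokens": None,        # no limit set, so skipped
    "response_tokens": True,
    "time": True,
    "response_criteria": True,
    "trajectory_criteria": None,  # no trajectory criteria defined, so skipped
}))  # True
```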

Metrics

Each execution also records raw metrics regardless of whether limits were set:

Metric               Description
Prompt Tokens        Actual number of tokens in the prompt
Response Tokens      Actual number of tokens in the response
Execution Time (ms)  Total time from input to response
Tool Calls           List of tools called during the execution, in order

Structure Summary

Agent Eval
├── Input
├── Streaming flag
├── Constraint Limits
│   ├── Max Prompt Tokens
│   ├── Max Response Tokens
│   └── Max Time (ms)
├── Response Criteria (LLM-evaluated)
└── Trajectory Criteria
    ├── Tool Call Criteria
    └── Chunk Fetch Criteria