# Evals: Testing AI Systems

"It seems to work" isn't good enough for production. Evals give you confidence that your AI actually does what you need.

Evals (evaluations) are how we measure AI system quality—like unit tests, but for LLMs.

# Why Evaluate?

LLMs are non-deterministic. The same prompt can produce different outputs. Small prompt changes can have big effects. How do you know if your changes made things better or worse?

Evals give you:

  • Confidence to deploy changes
  • Early warning when quality degrades
  • Metrics to compare different approaches
  • Documentation of expected behavior

Watch an evaluation suite run through a series of test cases:

Evaluation Suite Runner (interactive demo; reports Avg Score and Passed / Warning / Failed counts)

| Test | Input | Expected |
| --- | --- | --- |
| 1 | What is the capital of France? | Paris |
| 2 | Calculate 15 + 27 | 42 |
| 3 | Summarize: AI helps automate tasks | AI enables automation |
| 4 | Is the sky blue? | Yes |
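
In code, a suite like this is just data. Here's a minimal sketch in TypeScript; the `TestCase` shape and the `testCases` name are illustrative, not part of any library:

```ts
// Illustrative test-case shape: each case pairs an input prompt
// with the output we expect (or a key phrase it should contain).
interface TestCase {
  input: string;
  expected: string;
}

const testCases: TestCase[] = [
  { input: 'What is the capital of France?', expected: 'Paris' },
  { input: 'Calculate 15 + 27', expected: '42' },
  { input: 'Summarize: AI helps automate tasks', expected: 'AI enables automation' },
  { input: 'Is the sky blue?', expected: 'Yes' }
];
```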

# Evaluation Metrics

There's no single "quality" metric for AI. Different use cases need different measurements. Explore the common metrics:

Common Evaluation Metrics (interactive explorer)

  • Accuracy: How factually correct is the response? Example: Q: What year was JavaScript created? A: 1995 ✓ | 1999 ✗

Most production systems track multiple metrics, weighting them based on what matters most for the use case.
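
As a rough sketch of what that weighting can look like (the metric names, weights, and `weightedScore` helper are all illustrative assumptions, not a standard API):

```ts
// Illustrative: combine per-metric scores (each 0-1) into a single
// weighted quality score. Metric names and weights are examples only.
type MetricScores = Record<string, number>;

function weightedScore(scores: MetricScores, weights: MetricScores): number {
  let total = 0;
  let weightSum = 0;
  for (const [metric, weight] of Object.entries(weights)) {
    total += (scores[metric] ?? 0) * weight;
    weightSum += weight;
  }
  return weightSum > 0 ? total / weightSum : 0;
}

// e.g. a support bot might weight accuracy and helpfulness over conciseness
const overall = weightedScore(
  { accuracy: 0.9, helpfulness: 0.8, conciseness: 0.6 },
  { accuracy: 0.5, helpfulness: 0.3, conciseness: 0.2 }
);
// overall ≈ 0.81
```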

# Types of Evals

Evaluations fall into a few categories:

  • Exact Match — Output must match expected value exactly
  • Contains/Pattern — Output must contain specific content
  • LLM-as-Judge — Use another LLM to score the output
  • Human Evaluation — Manual review for nuanced quality
  • Behavioral — Test specific behaviors (safety, refusals)

```ts
import { generateText, generateObject } from 'ai';
import { z } from 'zod';

// Exact match eval: the output must equal the expected value (case-insensitive)
function exactMatchEval(output: string, expected: string) {
  return output.trim().toLowerCase() === expected.trim().toLowerCase();
}

// LLM-as-Judge eval: ask another model to score the answer
async function llmJudgeEval(question: string, answer: string) {
  const { object } = await generateObject({
    model: 'openai/gpt-5.2',
    schema: z.object({
      score: z.number().min(0).max(1),
      reasoning: z.string()
    }),
    prompt: `Rate this answer from 0-1 based on accuracy and helpfulness.

Question: ${question}
Answer: ${answer}

Be strict but fair. Explain your reasoning.`
  });

  return object;
}

// Run eval suite: generate an answer for each test case, then score it
async function runEvals(testCases: Array<{ input: string; expected?: string }>) {
  const results = [];

  for (const test of testCases) {
    const { text } = await generateText({
      model: 'openai/gpt-5.2',
      prompt: test.input
    });

    const score = await llmJudgeEval(test.input, text);
    results.push({ ...test, output: text, ...score });
  }

  return results;
}
```
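
The Contains/Pattern and Behavioral categories from the list above can be scored the same way. A hedged sketch follows; the helper names and the refusal-phrase heuristic are assumptions, not an established API:

```ts
// Contains/Pattern eval: pass if the output includes a phrase or matches a regex.
function containsEval(output: string, pattern: string | RegExp): boolean {
  return typeof pattern === 'string'
    ? output.toLowerCase().includes(pattern.toLowerCase())
    : pattern.test(output);
}

// Behavioral eval (illustrative): a safety test passes if the model refuses.
// The phrase list is a naive heuristic; real checks are usually stricter.
function refusalEval(output: string): boolean {
  const refusals = ["i can't", 'i cannot', "i won't", 'i am unable to'];
  const lower = output.toLowerCase();
  return refusals.some((phrase) => lower.includes(phrase));
}
```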

# Eval Best Practices

  • Start with real examples — Use actual user queries, not synthetic data
  • Include edge cases — Test adversarial inputs and failure modes
  • Version your eval sets — Track changes to test cases over time
  • Run automatically — Integrate evals into your CI/CD pipeline
  • Set thresholds — Define minimum acceptable scores for deployment, as sketched below
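
To make the last two practices concrete, here's a minimal sketch of a CI gate over the results returned by `runEvals` above; the 0.8 threshold and the exit-code behavior are example choices, not a standard:

```ts
// Illustrative CI gate: fail the build when the average judge score
// from an eval run drops below a chosen threshold.
const MIN_AVG_SCORE = 0.8; // example threshold; tune it for your use case

function checkThreshold(results: Array<{ score: number }>): void {
  const avg = results.reduce((sum, r) => sum + r.score, 0) / results.length;
  console.log(`Average eval score: ${avg.toFixed(2)} (minimum ${MIN_AVG_SCORE})`);

  if (avg < MIN_AVG_SCORE) {
    process.exit(1); // a non-zero exit fails the CI job
  }
}
```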

Good evals are an investment. They take time to build, but they pay dividends in reliability and confidence. Start small and expand your eval suite as you learn what matters.