# Evals: Testing AI Systems

"It seems to work" isn't good enough for production. Evals give you confidence that your AI actually does what you need.

Evals (evaluations) are how we measure AI system quality—like unit tests, but for LLMs.

# Why Evaluate?

LLMs are non-deterministic. The same prompt can produce different outputs. Small prompt changes can have big effects. How do you know if your changes made things better or worse?

Evals give you:

  • Confidence to deploy changes
  • Early warning when quality degrades
  • Metrics to compare different approaches
  • Documentation of expected behavior

Watch an evaluation suite run through a series of test cases:

Evaluation Suite Runner (interactive demo; reports Avg Score and Passed / Warning / Failed counts)

| Test | Input | Expected |
| --- | --- | --- |
| 1 | What is the capital of France? | Paris |
| 2 | Calculate 15 + 27 | 42 |
| 3 | Summarize: AI helps automate tasks | AI enables automation |
| 4 | Is the sky blue? | Yes |
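
In code, a suite like this is just data. Here's a minimal sketch in TypeScript; the `TestCase` shape and the `testCases` name are illustrative, not part of any library:

```ts
// Illustrative test-case shape: each case pairs an input prompt
// with the output we expect (or a key phrase it should contain).
interface TestCase {
  input: string;
  expected: string;
}

const testCases: TestCase[] = [
  { input: 'What is the capital of France?', expected: 'Paris' },
  { input: 'Calculate 15 + 27', expected: '42' },
  { input: 'Summarize: AI helps automate tasks', expected: 'AI enables automation' },
  { input: 'Is the sky blue?', expected: 'Yes' }
];
```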

# Evaluation Metrics

There's no single "quality" metric for AI. Different use cases need different measurements. Explore the common metrics:

Common Evaluation Metrics (interactive explorer)

  • Accuracy: How factually correct is the response? Example: Q: What year was JavaScript created? A: 1995 ✓ | 1999 ✗

Most production systems track multiple metrics, weighting them based on what matters most for the use case.
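
As a rough sketch of what that weighting can look like (the metric names, weights, and `weightedScore` helper are all illustrative assumptions, not a standard API):

```ts
// Illustrative: combine per-metric scores (each 0-1) into a single
// weighted quality score. Metric names and weights are examples only.
type MetricScores = Record<string, number>;

function weightedScore(scores: MetricScores, weights: MetricScores): number {
  let total = 0;
  let weightSum = 0;
  for (const [metric, weight] of Object.entries(weights)) {
    total += (scores[metric] ?? 0) * weight;
    weightSum += weight;
  }
  return weightSum > 0 ? total / weightSum : 0;
}

// e.g. a support bot might weight accuracy and helpfulness over conciseness
const overall = weightedScore(
  { accuracy: 0.9, helpfulness: 0.8, conciseness: 0.6 },
  { accuracy: 0.5, helpfulness: 0.3, conciseness: 0.2 }
);
// overall ≈ 0.81
```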

# Types of Evals

Evaluations fall into a few categories:

  • Exact Match — Output must match expected value exactly
  • Contains/Pattern — Output must contain specific content
  • LLM-as-Judge — Use another LLM to score the output
  • Human Evaluation — Manual review for nuanced quality
  • Behavioral — Test specific behaviors (safety, refusals)

```ts
import { generateText, generateObject } from 'ai';
import { z } from 'zod';

// Exact match eval: the output must equal the expected value (case-insensitive)
function exactMatchEval(output: string, expected: string) {
  return output.trim().toLowerCase() === expected.trim().toLowerCase();
}

// LLM-as-Judge eval: ask another model to score the answer
async function llmJudgeEval(question: string, answer: string) {
  const { object } = await generateObject({
    model: 'openai/gpt-5.2',
    schema: z.object({
      score: z.number().min(0).max(1),
      reasoning: z.string()
    }),
    prompt: `Rate this answer from 0-1 based on accuracy and helpfulness.

Question: ${question}
Answer: ${answer}

Be strict but fair. Explain your reasoning.`
  });

  return object;
}

// Run eval suite: generate an answer for each test case, then score it
async function runEvals(testCases: Array<{ input: string; expected?: string }>) {
  const results = [];

  for (const test of testCases) {
    const { text } = await generateText({
      model: 'openai/gpt-5.2',
      prompt: test.input
    });

    const score = await llmJudgeEval(test.input, text);
    results.push({ ...test, output: text, ...score });
  }

  return results;
}
```
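
The Contains/Pattern and Behavioral categories from the list above can be scored the same way. A hedged sketch follows; the helper names and the refusal-phrase heuristic are assumptions, not an established API:

```ts
// Contains/Pattern eval: pass if the output includes a phrase or matches a regex.
function containsEval(output: string, pattern: string | RegExp): boolean {
  return typeof pattern === 'string'
    ? output.toLowerCase().includes(pattern.toLowerCase())
    : pattern.test(output);
}

// Behavioral eval (illustrative): a safety test passes if the model refuses.
// The phrase list is a naive heuristic; real checks are usually stricter.
function refusalEval(output: string): boolean {
  const refusals = ["i can't", 'i cannot', "i won't", 'i am unable to'];
  const lower = output.toLowerCase();
  return refusals.some((phrase) => lower.includes(phrase));
}
```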

# Eval Best Practices

  • Start with real examples — Use actual user queries, not synthetic data
  • Include edge cases — Test adversarial inputs and failure modes
  • Version your eval sets — Track changes to test cases over time
  • Run automatically — Integrate evals into your CI/CD pipeline
  • Set thresholds — Define minimum acceptable scores for deployment, as sketched below
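
To make the last two practices concrete, here's a minimal sketch of a CI gate over the results returned by `runEvals` above; the 0.8 threshold and the exit-code behavior are example choices, not a standard:

```ts
// Illustrative CI gate: fail the build when the average judge score
// from an eval run drops below a chosen threshold.
const MIN_AVG_SCORE = 0.8; // example threshold; tune it for your use case

function checkThreshold(results: Array<{ score: number }>): void {
  const avg = results.reduce((sum, r) => sum + r.score, 0) / results.length;
  console.log(`Average eval score: ${avg.toFixed(2)} (minimum ${MIN_AVG_SCORE})`);

  if (avg < MIN_AVG_SCORE) {
    process.exit(1); // a non-zero exit fails the CI job
  }
}
```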

Good evals are an investment. They take time to build, but they pay dividends in reliability and confidence. Start small and expand your eval suite as you learn what matters.