# Evals: Testing AI Systems
"It seems to work" isn't good enough for production. Evals give you confidence that your AI actually does what you need.
Evals (evaluations) are how we measure AI system quality—like unit tests, but for LLMs.
## Why Evaluate?
LLMs are non-deterministic. The same prompt can produce different outputs. Small prompt changes can have big effects. How do you know if your changes made things better or worse?
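As a quick illustration of that non-determinism, here is a minimal sketch that sends the same prompt twice with `generateText`, the same call and model string used in the examples further down; the two outputs often differ:

```ts
import { generateText } from 'ai';

// Send the identical prompt twice; with a non-deterministic model the
// outputs frequently differ between runs.
async function compareRuns() {
  const prompt = 'Summarize: AI helps automate tasks';

  const first = await generateText({ model: 'openai/gpt-5.2', prompt });
  const second = await generateText({ model: 'openai/gpt-5.2', prompt });

  console.log('Run 1:', first.text);
  console.log('Run 2:', second.text);
  console.log('Identical?', first.text === second.text); // often false
}
```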
Evals give you:
- Confidence to deploy changes
- Early warning when quality degrades
- Metrics to compare different approaches
- Documentation of expected behavior
An evaluation suite runs each test case through the model, compares the output against the expected answer, and aggregates the results into pass/warning/fail counts and an average score. A small example set:

| Input | Expected |
| --- | --- |
| What is the capital of France? | Paris |
| Calculate 15 + 27 | 42 |
| Summarize: AI helps automate tasks | AI enables automation |
| Is the sky blue? | Yes |
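In code, an eval set like this is just data. Here is a minimal sketch of the same four cases, using a hypothetical `TestCase` shape that lines up with what the `runEvals` runner further down expects:

```ts
// Hypothetical shape for a single test case; adjust the fields to your runner.
type TestCase = {
  input: string;    // prompt sent to the model
  expected: string; // reference answer used when scoring
};

const testCases: TestCase[] = [
  { input: 'What is the capital of France?', expected: 'Paris' },
  { input: 'Calculate 15 + 27', expected: '42' },
  { input: 'Summarize: AI helps automate tasks', expected: 'AI enables automation' },
  { input: 'Is the sky blue?', expected: 'Yes' },
];
```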
## Evaluation Metrics
There's no single "quality" metric for AI. Different use cases need different measurements. Factual accuracy, for example, asks: how factually correct is the response? Other common dimensions include helpfulness and safety.
Most production systems track multiple metrics, weighting them based on what matters most for the use case.
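A rough sketch of that weighting, assuming three illustrative metrics (accuracy, helpfulness, safety) each scored between 0 and 1; the weights are placeholders you would tune per use case:

```ts
// Illustrative per-response metric scores, each in the range [0, 1].
type MetricScores = {
  accuracy: number;
  helpfulness: number;
  safety: number;
};

// Weighted average of the individual metrics. The weights are assumptions,
// not a standard: a customer-facing bot might weight safety far higher.
function overallScore(scores: MetricScores): number {
  const weights = { accuracy: 0.5, helpfulness: 0.3, safety: 0.2 };
  return (
    scores.accuracy * weights.accuracy +
    scores.helpfulness * weights.helpfulness +
    scores.safety * weights.safety
  );
}

// Example: a factually strong but only moderately helpful response.
console.log(overallScore({ accuracy: 0.9, helpfulness: 0.6, safety: 1.0 })); // ≈ 0.83
```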
## Types of Evals
Evaluations fall into a few categories:
- Exact Match — Output must match expected value exactly
- Contains/Pattern — Output must contain specific content
- LLM-as-Judge — Use another LLM to score the output
- Human Evaluation — Manual review for nuanced quality
- Behavioral — Test specific behaviors (safety, refusals)
```ts
import { generateText, generateObject } from 'ai';
import { z } from 'zod';

// Exact match eval: normalized string equality
function exactMatchEval(output: string, expected: string) {
  return output.trim().toLowerCase() === expected.trim().toLowerCase();
}

// LLM-as-Judge eval: ask another model for a structured score
async function llmJudgeEval(question: string, answer: string) {
  const { object } = await generateObject({
    model: 'openai/gpt-5.2',
    schema: z.object({
      score: z.number().min(0).max(1),
      reasoning: z.string()
    }),
    prompt: `Rate this answer from 0-1 based on accuracy and helpfulness.

Question: ${question}
Answer: ${answer}

Be strict but fair. Explain your reasoning.`
  });

  return object;
}

// Run eval suite: generate an answer for each test case, then judge it
async function runEvals(testCases: { input: string; expected: string }[]) {
  const results = [];

  for (const test of testCases) {
    const { text } = await generateText({
      model: 'openai/gpt-5.2',
      prompt: test.input
    });

    const score = await llmJudgeEval(test.input, text);
    results.push({ ...test, output: text, ...score });
  }

  return results;
}
```

## Eval Best Practices
- Start with real examples — Use actual user queries, not synthetic data
- Include edge cases — Test adversarial inputs and failure modes
- Version your eval sets — Track changes to test cases over time
- Run automatically — Integrate evals into your CI/CD pipeline
- Set thresholds — Define minimum acceptable scores for deployment (see the sketch after this list)
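The last two practices are simple to wire together. A minimal sketch of a CI gate, assuming each result from `runEvals` above carries a 0-1 `score` and that 0.8 is an arbitrary example threshold:

```ts
// Fail the CI job when the average eval score drops below the threshold.
// Assumes each result carries a numeric `score` in [0, 1], as returned by the judge above.
const THRESHOLD = 0.8; // example value; define what "good enough to deploy" means for you

function ciGate(results: Array<{ score: number }>) {
  const avg = results.reduce((sum, r) => sum + r.score, 0) / results.length;
  console.log(`Average eval score: ${avg.toFixed(2)} (threshold ${THRESHOLD})`);

  if (avg < THRESHOLD) {
    console.error('Eval score below threshold; blocking deployment.');
    process.exit(1); // non-zero exit fails the pipeline step
  }
}
```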
Good evals are an investment. They take time to build, but they pay dividends in reliability and confidence. Start small and expand your eval suite as you learn what matters.