# RAG: Teaching LLMs Your Data

LLMs know a lot, but they don't know about your company's docs, your codebase, or yesterday's news. RAG fixes that.

RAG stands for Retrieval-Augmented Generation. It's how you give LLMs access to custom knowledge.

## The RAG Pipeline

RAG works by finding relevant information first, then including it in the prompt. The full pipeline looks like this:

Query → Embed → Search → Retrieve → Generate

The key insight: instead of fine-tuning a model (expensive, slow), we just inject the right context at query time. The model generates responses based on information it "retrieves" rather than information it "memorized".
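
In code, the pattern is simply "retrieve, then prompt." Here's a minimal sketch of that shape. The `retrieveRelevantDocs` helper is a hypothetical placeholder for the vector search we'll build later in this post:

```ts
import { generateText } from 'ai';

// Hypothetical placeholder: a real implementation would run a vector
// search over your documents (shown in the full example below).
async function retrieveRelevantDocs(question: string): Promise<string[]> {
  return [];
}

async function answer(question: string) {
  const docs = await retrieveRelevantDocs(question);

  // The "augmentation" step: retrieved text is injected into the prompt,
  // so the model answers from your data instead of its training data.
  const { text } = await generateText({
    model: 'openai/gpt-5.2',
    prompt: `Answer the question using the context below.

Context:
${docs.join('\n\n')}

Question: ${question}`
  });

  return text;
}
```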

## Chunking: Breaking Down Documents

Before we can search documents, we need to break them into smaller pieces. This is called chunking. Why? Because:

  • Embeddings work better on focused, specific text
  • We only want to retrieve relevant parts, not entire documents
  • Smaller chunks mean more precise matches

Here's a small example with a chunk size of three sentences and an overlap of one sentence:

Original document

The quick brown fox jumps over the lazy dog. It was a sunny day in the forest. The fox was looking for food. She found some berries near the stream.

Chunk 1 (3 sentences)

The quick brown fox jumps over the lazy dog. It was a sunny day in the forest. The fox was looking for food.

Chunk 2 (2 sentences)

The fox was looking for food. She found some berries near the stream.

Chunking breaks documents into smaller pieces for embedding. Overlap helps preserve context at chunk boundaries.
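
If you want to see the mechanics, here's a minimal sentence-based chunker that reproduces the chunks above. Splitting on sentence boundaries and measuring size and overlap in sentences are simplifying assumptions for this sketch; production chunkers often work on tokens or characters instead:

```ts
// Split text into overlapping chunks of `chunkSize` sentences.
// The last `overlap` sentences of each chunk are repeated at the start
// of the next one, so context at the boundary isn't lost.
function chunkBySentences(text: string, chunkSize = 3, overlap = 1): string[] {
  const sentences = text
    .split(/(?<=[.!?])\s+/)
    .map(s => s.trim())
    .filter(Boolean);

  const chunks: string[] = [];
  const step = Math.max(1, chunkSize - overlap);

  for (let i = 0; i < sentences.length; i += step) {
    chunks.push(sentences.slice(i, i + chunkSize).join(' '));
    if (i + chunkSize >= sentences.length) break;
  }

  return chunks;
}
```

Running it on the example document with a chunk size of 3 and an overlap of 1 yields the two chunks shown above.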

## Building a RAG System

Here's a complete RAG implementation using the AI SDK and Upstash Vector:

```ts
import { embed, generateText } from 'ai';
import { Index } from '@upstash/vector';

const index = new Index();

// Step 1: Index your documents (run once)
async function indexDocuments(documents: string[]) {
  for (const doc of documents) {
    const { embedding } = await embed({
      model: 'openai/text-embedding-5',
      value: doc
    });

    await index.upsert({
      id: crypto.randomUUID(),
      vector: embedding,
      metadata: { content: doc }
    });
  }
}

// Step 2: Query with RAG
async function askWithRAG(question: string) {
  // Embed the question
  const { embedding } = await embed({
    model: 'openai/text-embedding-5',
    value: question
  });

  // Find relevant documents
  const results = await index.query({
    vector: embedding,
    topK: 3,
    includeMetadata: true
  });

  // Build context from results
  const context = results
    .map(r => r.metadata?.content)
    .join('\n\n');

  // Generate answer with context
  const { text } = await generateText({
    model: 'openai/gpt-5.2',
    prompt: `Use this context to answer the question.

Context:
${context}

Question: ${question}`
  });

  return text;
}
```
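
Wiring the two functions together might look like this. The documents are made-up placeholders, and I'm assuming the Upstash index credentials are provided via environment variables:

```ts
const documents = [
  'Our refund policy allows returns within 30 days of purchase.',
  'Support is available Monday through Friday, 9am to 5pm.',
];

// Index once, then query as many times as you like.
await indexDocuments(documents);

const reply = await askWithRAG('How long do I have to return an item?');
console.log(reply);
```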

## RAG Best Practices

  • Chunk thoughtfully — Too small loses context, too big adds noise
  • Add overlap — Prevents cutting off in the middle of important info
  • Include metadata — Source, date, author help with filtering
  • Rerank results — Use a reranker model for better precision
  • Handle "no results" — What if nothing relevant is found? One approach is sketched after this list.
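
For that last point, one simple option is to check the similarity scores on the retrieved matches and answer honestly when nothing clears a threshold. Here's a sketch of a variant of `askWithRAG` that does this; the 0.75 cutoff is an arbitrary assumption you'd tune for your own data and embedding model:

```ts
import { embed, generateText } from 'ai';
import { Index } from '@upstash/vector';

const index = new Index();

// Like askWithRAG, but bails out when no match clears a minimum
// similarity score instead of forcing the model to answer from thin air.
async function askWithRAGSafe(question: string, minScore = 0.75) {
  const { embedding } = await embed({
    model: 'openai/text-embedding-5',
    value: question
  });

  const results = await index.query({
    vector: embedding,
    topK: 3,
    includeMetadata: true
  });

  // Handle "no results": drop weak matches before building the prompt.
  const relevant = results.filter(r => r.score >= minScore);
  if (relevant.length === 0) {
    return "I couldn't find anything relevant to that question.";
  }

  const context = relevant.map(r => r.metadata?.content).join('\n\n');

  const { text } = await generateText({
    model: 'openai/gpt-5.2',
    prompt: `Use this context to answer the question. If the context doesn't contain the answer, say so.

Context:
${context}

Question: ${question}`
  });

  return text;
}
```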

RAG is powerful but not magic. The quality of your chunking, embeddings, and retrieval strategy directly impacts response quality.