# RAG: Teaching LLMs Your Data
LLMs know a lot, but they don't know about your company's docs, your codebase, or yesterday's news. RAG fixes that.
RAG stands for Retrieval Augmented Generation. It's how you give LLMs access to custom knowledge.
# The RAG Pipeline
RAG works by finding relevant information first, then including it in the prompt. The full pipeline looks like this: split your documents into chunks, embed and index them, embed the user's query, retrieve the most similar chunks, and generate a response with those chunks as context.
The key insight: instead of fine-tuning a model (expensive, slow), we just inject the right context at query time. The model generates responses based on information it "retrieves" rather than information it "memorized".
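Under the hood, "finding relevant information" usually means embedding both the documents and the query, then ranking by vector similarity. The vector database in the implementation below handles this for you, but the core idea fits in a few lines. This is just a sketch; the helper names are illustrative, not from the AI SDK or Upstash:

```ts
// Illustrative helpers showing how a vector store ranks chunks internally.
// Cosine similarity is close to 1 for vectors pointing the same way and
// close to 0 for unrelated ones.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank stored chunks by similarity to the query embedding and keep the top k.
function retrieve(
  queryEmbedding: number[],
  chunks: { embedding: number[]; content: string }[],
  k = 3
) {
  return [...chunks]
    .sort(
      (a, b) =>
        cosineSimilarity(queryEmbedding, b.embedding) -
        cosineSimilarity(queryEmbedding, a.embedding)
    )
    .slice(0, k)
    .map(c => c.content);
}
```

The top-ranked chunks then get pasted into the prompt, which is exactly what the full implementation later in this post does.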
# Chunking: Breaking Down Documents
Before we can search documents, we need to break them into smaller pieces. This is called chunking. Why? Because:
- Embeddings work better on focused, specific text
- We only want to retrieve relevant parts, not entire documents
- Smaller chunks mean more precise matches
For example, take this original document:

> The quick brown fox jumps over the lazy dog. It was a sunny day in the forest. The fox was looking for food. She found some berries near the stream.

With a modest chunk size and a bit of overlap, it splits into two chunks:

> The quick brown fox jumps over the lazy dog. It was a sunny day in the forest. The fox was looking for food.

> The fox was looking for food. She found some berries near the stream.

Notice that "The fox was looking for food." appears in both chunks.
Chunking breaks documents into smaller pieces for embedding. Overlap helps preserve context at chunk boundaries.
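Here's a minimal sketch of a character-based chunker with overlap. The function name and default sizes are illustrative, and it's deliberately naive; production chunkers usually split on sentences, paragraphs, or tokens instead of raw characters:

```ts
// Split text into fixed-size character chunks. Each new chunk starts
// `chunkSize - overlap` characters after the previous one, so the last
// `overlap` characters of a chunk repeat at the start of the next.
// chunkSize must be larger than overlap.
function chunkText(text: string, chunkSize = 100, overlap = 20): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
    start += chunkSize - overlap;
  }
  return chunks;
}

// Usage: the fox document from above splits into two overlapping chunks.
const doc =
  'The quick brown fox jumps over the lazy dog. It was a sunny day in the forest. ' +
  'The fox was looking for food. She found some berries near the stream.';
console.log(chunkText(doc, 100, 20));
```

In a real system you'd index these chunks, rather than whole documents, in the indexing step shown in the next section.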
# Building a RAG System
Here's a complete RAG implementation with the AI SDK:
```ts
import { embed, generateText } from 'ai';
import { Index } from '@upstash/vector';

const index = new Index();

// Step 1: Index your documents (run once)
async function indexDocuments(documents: string[]) {
  for (const doc of documents) {
    const { embedding } = await embed({
      model: 'openai/text-embedding-5',
      value: doc
    });

    await index.upsert({
      id: crypto.randomUUID(),
      vector: embedding,
      metadata: { content: doc }
    });
  }
}

// Step 2: Query with RAG
async function askWithRAG(question: string) {
  // Embed the question
  const { embedding } = await embed({
    model: 'openai/text-embedding-5',
    value: question
  });

  // Find relevant documents
  const results = await index.query({
    vector: embedding,
    topK: 3,
    includeMetadata: true
  });

  // Build context from results
  const context = results
    .map(r => r.metadata?.content)
    .join('\n\n');

  // Generate answer with context
  const { text } = await generateText({
    model: 'openai/gpt-5.2',
    prompt: `Use this context to answer the question.

Context:
${context}

Question: ${question}`
  });

  return text;
}
```

# RAG Best Practices
- Chunk thoughtfully — Too small loses context, too big adds noise
- Add overlap — Prevents cutting off in the middle of important info
- Include metadata — Source, date, author help with filtering
- Rerank results — Use a reranker model for better precision
- Handle "no results" — What if nothing relevant is found? (One option, a score threshold with a fallback answer, is sketched after this list.)
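For that last point, a minimal sketch of a minimum-similarity guard. It assumes the vector store returns a relevance score per match (Upstash Vector does), and the 0.7 cutoff, function name, and fallback wording are illustrative choices you'd tune for your own data:

```ts
import { embed, generateText } from 'ai';
import { Index } from '@upstash/vector';

const index = new Index();

// Illustrative cutoff: below this, the match probably isn't relevant.
const MIN_SCORE = 0.7;

async function askWithFallback(question: string) {
  const { embedding } = await embed({
    model: 'openai/text-embedding-5',
    value: question
  });

  const results = await index.query({
    vector: embedding,
    topK: 3,
    includeMetadata: true
  });

  // Keep only strong matches instead of stuffing noise into the prompt.
  const relevant = results.filter(r => r.score >= MIN_SCORE);

  if (relevant.length === 0) {
    // Fallback is up to you: refuse, answer without context, or ask for clarification.
    return "I couldn't find anything relevant in the indexed documents.";
  }

  const context = relevant.map(r => r.metadata?.content).join('\n\n');

  const { text } = await generateText({
    model: 'openai/gpt-5.2',
    prompt: `Use this context to answer the question. If the context doesn't cover it, say so.

Context:
${context}

Question: ${question}`
  });

  return text;
}
```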
RAG is powerful but not magic. The quality of your chunking, embeddings, and retrieval strategy directly impacts response quality.