# RAG: Teaching LLMs Your Data

LLMs know a lot, but they don't know about your company's docs, your codebase, or yesterday's news. RAG fixes that.

RAG stands for Retrieval-Augmented Generation. It's how you give LLMs access to custom knowledge.

## The RAG Pipeline

RAG works by finding relevant information first, then including it in the prompt. The full pipeline looks like this:

Query → Embed → Search → Retrieve → Generate

The key insight: instead of fine-tuning a model (expensive, slow), we just inject the right context at query time. The model generates responses based on information it "retrieves" rather than information it "memorized".
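
In code, the pattern is simply "retrieve, then prompt." Here's a minimal sketch of that shape. The `retrieveRelevantDocs` helper is a hypothetical placeholder for the vector search we'll build later in this post:

```ts
import { generateText } from 'ai';

// Hypothetical placeholder: a real implementation would run a vector
// search over your documents (shown in the full example below).
async function retrieveRelevantDocs(question: string): Promise<string[]> {
  return [];
}

async function answer(question: string) {
  const docs = await retrieveRelevantDocs(question);

  // The "augmentation" step: retrieved text is injected into the prompt,
  // so the model answers from your data instead of its training data.
  const { text } = await generateText({
    model: 'openai/gpt-5.2',
    prompt: `Answer the question using the context below.

Context:
${docs.join('\n\n')}

Question: ${question}`
  });

  return text;
}
```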

## Chunking: Breaking Down Documents

Before we can search documents, we need to break them into smaller pieces. This is called chunking. Why? Because:

  • Embeddings work better on focused, specific text
  • We only want to retrieve relevant parts, not entire documents
  • Smaller chunks mean more precise matches

Here's a small example with a chunk size of three sentences and an overlap of one sentence:

Original document

The quick brown fox jumps over the lazy dog. It was a sunny day in the forest. The fox was looking for food. She found some berries near the stream.

Chunk 1 (3 sentences)

The quick brown fox jumps over the lazy dog. It was a sunny day in the forest. The fox was looking for food.

Chunk 2 (2 sentences)

The fox was looking for food. She found some berries near the stream.

Chunking breaks documents into smaller pieces for embedding. Overlap helps preserve context at chunk boundaries.
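
If you want to see the mechanics, here's a minimal sentence-based chunker that reproduces the chunks above. Splitting on sentence boundaries and measuring size and overlap in sentences are simplifying assumptions for this sketch; production chunkers often work on tokens or characters instead:

```ts
// Split text into overlapping chunks of `chunkSize` sentences.
// The last `overlap` sentences of each chunk are repeated at the start
// of the next one, so context at the boundary isn't lost.
function chunkBySentences(text: string, chunkSize = 3, overlap = 1): string[] {
  const sentences = text
    .split(/(?<=[.!?])\s+/)
    .map(s => s.trim())
    .filter(Boolean);

  const chunks: string[] = [];
  const step = Math.max(1, chunkSize - overlap);

  for (let i = 0; i < sentences.length; i += step) {
    chunks.push(sentences.slice(i, i + chunkSize).join(' '));
    if (i + chunkSize >= sentences.length) break;
  }

  return chunks;
}
```

Running it on the example document with a chunk size of 3 and an overlap of 1 yields the two chunks shown above.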

## Building a RAG System

Here's a complete RAG implementation using the AI SDK and Upstash Vector:

```ts
import { embed, generateText } from 'ai';
import { Index } from '@upstash/vector';

const index = new Index();

// Step 1: Index your documents (run once)
async function indexDocuments(documents: string[]) {
  for (const doc of documents) {
    const { embedding } = await embed({
      model: 'openai/text-embedding-5',
      value: doc
    });

    await index.upsert({
      id: crypto.randomUUID(),
      vector: embedding,
      metadata: { content: doc }
    });
  }
}

// Step 2: Query with RAG
async function askWithRAG(question: string) {
  // Embed the question
  const { embedding } = await embed({
    model: 'openai/text-embedding-5',
    value: question
  });

  // Find relevant documents
  const results = await index.query({
    vector: embedding,
    topK: 3,
    includeMetadata: true
  });

  // Build context from results
  const context = results
    .map(r => r.metadata?.content)
    .join('\n\n');

  // Generate answer with context
  const { text } = await generateText({
    model: 'openai/gpt-5.2',
    prompt: `Use this context to answer the question.

Context:
${context}

Question: ${question}`
  });

  return text;
}
```
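
Wiring the two functions together might look like this. The documents are made-up placeholders, and I'm assuming the Upstash index credentials are provided via environment variables:

```ts
const documents = [
  'Our refund policy allows returns within 30 days of purchase.',
  'Support is available Monday through Friday, 9am to 5pm.',
];

// Index once, then query as many times as you like.
await indexDocuments(documents);

const reply = await askWithRAG('How long do I have to return an item?');
console.log(reply);
```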

## RAG Best Practices

  • Chunk thoughtfully — Too small loses context, too big adds noise
  • Add overlap — Prevents cutting off in the middle of important info
  • Include metadata — Source, date, author help with filtering
  • Rerank results — Use a reranker model for better precision
  • Handle "no results" — What if nothing relevant is found? One approach is sketched after this list.
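
For that last point, one simple option is to check the similarity scores on the retrieved matches and answer honestly when nothing clears a threshold. Here's a sketch of a variant of `askWithRAG` that does this; the 0.75 cutoff is an arbitrary assumption you'd tune for your own data and embedding model:

```ts
import { embed, generateText } from 'ai';
import { Index } from '@upstash/vector';

const index = new Index();

// Like askWithRAG, but bails out when no match clears a minimum
// similarity score instead of forcing the model to answer from thin air.
async function askWithRAGSafe(question: string, minScore = 0.75) {
  const { embedding } = await embed({
    model: 'openai/text-embedding-5',
    value: question
  });

  const results = await index.query({
    vector: embedding,
    topK: 3,
    includeMetadata: true
  });

  // Handle "no results": drop weak matches before building the prompt.
  const relevant = results.filter(r => r.score >= minScore);
  if (relevant.length === 0) {
    return "I couldn't find anything relevant to that question.";
  }

  const context = relevant.map(r => r.metadata?.content).join('\n\n');

  const { text } = await generateText({
    model: 'openai/gpt-5.2',
    prompt: `Use this context to answer the question. If the context doesn't contain the answer, say so.

Context:
${context}

Question: ${question}`
  });

  return text;
}
```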

RAG is powerful but not magic. The quality of your chunking, embeddings, and retrieval strategy directly impacts response quality.