Module 7 · 32 min read · Agentic AI and Autonomous Systems

Retrieval-Augmented Generation

Every language model has a knowledge cutoff — a point in time beyond which it knows nothing. Every model also lacks access to your private data, internal documents, and proprietary knowledge. And every model hallucinates, sometimes confidently producing plausible-sounding but false information. Retrieval-Augmented Generation (RAG) is the most widely deployed technique for addressing all three of these limitations simultaneously. Understanding RAG deeply — from the indexing pipeline through advanced retrieval techniques to evaluation — is essential knowledge for anyone building production AI systems.

What RAG solves

RAG addresses three fundamental limitations of language models operating from training knowledge alone.

The knowledge cutoff problem. Models are trained on data collected up to a certain date. Events, research, regulations, and facts that emerged after that date are simply unknown to the model. RAG solves this by retrieving current information at inference time, allowing the model to answer questions about recent events by reading retrieved content rather than relying on training knowledge.

The private data problem. Your company's internal documents, customer records, proprietary research, and institutional knowledge were not in the model's training data. Without RAG, the model has no access to these sources. With RAG, you index private data and retrieve relevant sections on demand, allowing the model to reason over your specific knowledge.

Hallucination reduction. When a model must answer from training knowledge alone, it can produce confident but fabricated responses — especially for specific facts, citations, statistics, and niche domain knowledge. When the model is given retrieved documents containing the actual answer, it can ground its response in real content rather than generating plausible-sounding substitutes. RAG doesn't eliminate hallucination, but it reduces the rate significantly for factual questions by giving the model a reliable reference to cite.

The indexing pipeline

Before RAG can work, your documents must be indexed — converted from raw text into a format that supports fast semantic retrieval. The indexing pipeline consists of four stages.

1. Document loading

Raw documents must be loaded from wherever they live: PDFs, Word documents, web pages, databases, APIs, email archives, code repositories. Each source type requires a different loader. LangChain and LlamaIndex both provide extensive libraries of document loaders. The key challenges at this stage are handling diverse file formats, preserving document structure (headings, tables, code blocks), and dealing with encoding issues in older or non-English documents.

2. Chunking

Documents must be split into chunks before embedding, because embedding models have input length limits (typically 512–8192 tokens) and you need chunks small enough that retrieved content fits in the context window without crowding out other information. The chunking strategy dramatically affects retrieval quality. Poor chunking — cutting in the middle of a sentence, separating a question from its answer — produces chunks that are semantically incomplete and retrieve poorly.

3. Embedding

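Each chunk is passed through an embedding model to produce a dense vector representation. All chunks in the knowledge base must be embedded with the same model, and queries must later be embedded with that model too, because vectors from different models are not comparable. Chunk embedding happens once at index time rather than on every query (only the query itself is embedded at query time), and it is typically the most computationally expensive part of the indexing pipeline for large knowledge bases.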

4. Storage

Vectors are stored in a vector database alongside the source chunk content and metadata (document ID, title, date, section heading, etc.). Good metadata enables metadata filtering during retrieval — restricting searches to a specific document type, date range, or author without changing the semantic search itself.
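The four stages above can be sketched end to end in a few lines. This is a minimal in-memory sketch under stated assumptions, not a production pipeline: `toy_embed` is a hypothetical stand-in for a real embedding model, `Record`, `index_documents`, and `filter_by_metadata` are illustrative names, and a plain Python list stands in for the vector database.

```python
from dataclasses import dataclass


@dataclass
class Record:
    vector: list[float]   # embedding of the chunk
    text: str             # source chunk content
    metadata: dict        # e.g. document ID and offset


def toy_embed(text: str, dims: int = 8) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag of words."""
    vec = [0.0] * dims
    for word in text.lower().split():
        vec[sum(ord(c) for c in word) % dims] += 1.0
    return vec


def index_documents(docs: dict[str, str], chunk_chars: int = 200) -> list[Record]:
    store: list[Record] = []
    for doc_id, text in docs.items():                    # 1. loading (already in memory here)
        for start in range(0, len(text), chunk_chars):   # 2. chunking (naive fixed-size)
            chunk = text[start:start + chunk_chars]
            vector = toy_embed(chunk)                    # 3. embedding
            store.append(Record(vector, chunk, {"doc_id": doc_id, "offset": start}))  # 4. storage
    return store


def filter_by_metadata(store: list[Record], **conditions) -> list[Record]:
    """Keep only records whose metadata matches every condition."""
    return [r for r in store if all(r.metadata.get(k) == v for k, v in conditions.items())]
```

A metadata filter such as `filter_by_metadata(store, doc_id="faq")` narrows the candidate set before any similarity search runs, which is why it does not change the semantic search itself.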

Chunking strategies

Chunking is where most RAG systems fail in practice. Getting chunking right has more impact on retrieval quality than almost any other design decision.

Fixed-size chunking

Split every document into chunks of N characters or tokens with an overlap of M. Simple to implement, consistent, and works surprisingly well for many use cases. The limitation is that it ignores document structure entirely — a 512-token chunk might start mid-paragraph and end mid-sentence, producing semantically incoherent fragments. The overlap parameter (typically 10–20% of chunk size) helps by ensuring context from the end of one chunk appears at the start of the next.
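As a rough illustration of fixed-size chunking with overlap, the sketch below works on a list of tokens. The function name `chunk_fixed` and the whitespace tokenization in the usage note are assumptions made for the example; a real pipeline would count tokens with the embedding model's own tokenizer.

```python
def chunk_fixed(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks
```

For example, `chunk_fixed(text.split(), size=512, overlap=64)` would give 512-word windows in which the last 64 words of one chunk reappear at the start of the next.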

Semantic chunking

Rather than splitting at fixed sizes, semantic chunking splits at semantic boundaries: paragraph breaks, section changes, or detected topic shifts. This preserves the natural flow of ideas within each chunk and produces more coherent retrievable units. It is more complex to implement, but it generally produces higher-quality chunks for structured documents like research papers, legal texts, and technical documentation.

Recursive character splitting

LangChain's RecursiveCharacterTextSplitter implements a practical middle ground: it tries to split on paragraph breaks first, then sentence breaks, then word breaks, falling back to character splits only when necessary to meet the target chunk size. This preserves structure when possible while guaranteeing a maximum chunk size.

Contextual chunk enrichment

A newer technique (popularized by Anthropic's "contextual retrieval" research) involves prepending a brief context summary to each chunk before embedding: "This excerpt is from Chapter 3 of [Document Name], which discusses [topic]. The preceding section covered [topic]. This chunk explains..." This enriches the semantic content of the chunk with context that would otherwise be lost, significantly improving retrieval accuracy for documents where individual chunks lack sufficient context to be retrieved correctly in isolation.
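A minimal sketch of the enrichment step is shown below. It builds the context line from a fixed template and stored metadata, which is an assumption made to keep the example self-contained; in the approach described above, the context summary would typically be written by a language model. The `Chunk` fields and the `enrich` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    doc_title: str
    section: str


def enrich(chunk: Chunk) -> str:
    """Prepend a short context line so the chunk can be understood in isolation."""
    context = f"This excerpt is from the section '{chunk.section}' of '{chunk.doc_title}'."
    return f"{context}\n\n{chunk.text}"
```

The enriched string, not the bare chunk text, is what gets passed to the embedding model; the original text can still be stored alongside it for display.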

Vector search and approximate nearest neighbor

At query time, the search process converts the query to a vector using the same embedding model, then finds the stored vectors most similar to the query vector. For small knowledge bases (under ~100k vectors), exact nearest neighbor search is fast enough. For large knowledge bases, approximate nearest neighbor (ANN) algorithms are used — they trade a small accuracy loss for massive speed improvements.
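For a small knowledge base, exact search is just a scan over every stored vector. The sketch below assumes cosine similarity as the metric and an in-memory dictionary of vectors; `cosine` and `top_k` are illustrative names, not a vector database API.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Exact nearest neighbor search: score every vector, keep the k best."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The cost grows linearly with the number of stored vectors, which is the reason approximate methods take over at larger scales.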

The leading ANN algorithms used in vector databases include HNSW (Hierarchical Navigable Small World, a layered proximity graph that is the default in most modern vector databases and offers an excellent speed/accuracy tradeoff), IVF (Inverted File Index, which partitions the space into clusters and searches only the clusters nearest the query), and ScaNN (Scalable Nearest Neighbors, Google's library built around quantization-based scoring). The choice is usually handled by the vector database, not the application developer.
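To make the IVF idea concrete, here is a toy sketch of cluster-restricted search. It assumes the centroids are already given (in practice they are usually learned with k-means) and uses Euclidean distance; `build_ivf`, `ivf_search`, and `nprobe` are illustrative names for this example rather than any particular database's API.

```python
import math


def build_ivf(vectors: dict[str, list[float]], centroids: list[list[float]]) -> list[list[str]]:
    """Assign each vector ID to the inverted list of its nearest centroid."""
    lists: list[list[str]] = [[] for _ in centroids]
    for vec_id, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))
        lists[nearest].append(vec_id)
    return lists


def ivf_search(query, vectors, centroids, lists, nprobe: int = 1, k: int = 1) -> list[str]:
    """Search only the nprobe clusters whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))[:nprobe]
    candidates = [vec_id for i in probe for vec_id in lists[i]]
    return sorted(candidates, key=lambda vec_id: math.dist(query, vectors[vec_id]))[:k]
```

Raising `nprobe` scans more clusters, trading speed for a lower chance of missing a true neighbor that sits just across a cluster boundary.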

Re-ranking for precision

Vector similarity search is fast but not always precise. The top-k results by cosine similarity often include somewhat irrelevant items that happened to have high embedding similarity but are not what the user actually needs. Re-ranking adds a second pass that reorders the retrieved candidates using a more computationally expensive but more accurate relevance model.

Re-rankers (like Cohere Rerank, or cross-encoder models like ms-marco-MiniLM-L-6-v2) take the original query and each retrieved chunk as a pair and produce a relevance score for that specific query-document combination — a more nuanced assessment than pure embedding similarity. A typical RAG pipeline: retrieve top-50 by vector similarity, re-rank to top-5, inject those 5 into the context. The re-ranking step typically adds 100–300ms but substantially improves precision.
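The two-stage shape of that pipeline can be sketched with pluggable scoring functions. Both scorers here are assumptions: `first_stage` stands in for vector similarity and `reranker` for a cross-encoder or hosted re-ranking model, each taking a query and a document and returning a relevance score.

```python
from typing import Callable

Scorer = Callable[[str, str], float]  # (query, document) -> relevance score


def retrieve_then_rerank(
    query: str,
    corpus: list[str],
    first_stage: Scorer,
    reranker: Scorer,
    n_candidates: int = 50,
    k: int = 5,
) -> list[str]:
    """Cheap scorer over the whole corpus, expensive scorer over the shortlist only."""
    candidates = sorted(corpus, key=lambda doc: first_stage(query, doc), reverse=True)[:n_candidates]
    return sorted(candidates, key=lambda doc: reranker(query, doc), reverse=True)[:k]
```

The expensive model only ever sees `n_candidates` documents, which is what keeps the added latency bounded. The re-ranker also cannot recover a document the first stage failed to shortlist, so the candidate count should be generous.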

Query transformation techniques

Sometimes the user's raw query is not the best input for vector search. Query transformation techniques modify or expand the query before retrieval to improve the quality of what gets retrieved.

HyDE: Hypothetical Document Embeddings

HyDE is one of the most effective query transformation techniques. Instead of embedding the user's query and searching for similar documents, HyDE first uses a language model to generate a hypothetical document that would answer the query — essentially, it writes an answer to the question (without actual knowledge, just based on what such an answer would look like). Then it embeds that hypothetical document and searches for real documents similar to it.

The intuition is that the hypothetical answer is in "document space" rather than "question space" — it uses the vocabulary, phrasing, and structure of answers, which tends to be much closer in embedding space to actual answer documents than the question itself is. HyDE can dramatically improve retrieval for complex or technical queries where the question phrasing differs significantly from how answers are written in the knowledge base.

HyDE in Practice

A query like "why does my payment fail at checkout?" might retrieve different documents than a hypothetical answer like "Payment failures at checkout are commonly caused by expired cards, insufficient funds, mismatched billing addresses, or temporary gateway outages. The error code returned indicates..." — because the hypothetical answer uses the terminology and structure found in actual troubleshooting documentation. HyDE is particularly valuable when the knowledge base was written for experts and the queries come from non-experts using different vocabulary.
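The HyDE flow can be sketched as follows. Everything model-related is stubbed: `generate` is a caller-supplied function standing in for a language model call, and `embed` is a toy bag-of-words embedding used only so the example runs without external services. These names are assumptions for the illustration.

```python
import math
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hyde_search(query: str, docs: list[str], generate: Callable[[str], str], k: int = 1) -> list[str]:
    hypothetical = generate(query)   # a language model call in a real system
    q_vec = embed(hypothetical)      # embed the hypothetical answer, not the query
    ranked = sorted(docs, key=lambda d: cosine(q_vec, embed(d)), reverse=True)
    return ranked[:k]
```

With a stub such as `generate = lambda q: "payment failures are caused by expired cards or insufficient funds"`, the search vector shares vocabulary with troubleshooting documentation even though the user's question does not.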

Query expansion

Generate multiple variations of the user's query — different phrasings, synonyms, related questions — and retrieve for each variation, then merge and de-duplicate the results. This improves recall by covering more of the semantic space around the original query. Useful when the knowledge base uses inconsistent terminology or when the user's query might be expressed in various ways.
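The merge-and-de-duplicate step might look like the sketch below. It assumes the query variants have already been generated (for instance by a language model) and that `retrieve` is any function returning document IDs for one query; both are placeholders for this example.

```python
from typing import Callable


def expanded_retrieve(variants: list[str], retrieve: Callable[[str], list[str]]) -> list[str]:
    """Run retrieval for every query variant and merge results, keeping first-seen order."""
    seen: set[str] = set()
    merged: list[str] = []
    for variant in variants:
        for doc_id in retrieve(variant):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged
```

First-seen order is the simplest merge policy; a rank fusion method can be substituted when the per-variant rankings should influence the final order.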

Step-back prompting

Transform a specific question into a more general one before retrieval. "What's the FDA approval status of Drug X for pediatric use?" becomes "What are the FDA approval requirements for pediatric drug use?" The more general question may retrieve foundational context that helps answer the specific question — context that wouldn't match the specific query at all.

Hybrid search: semantic plus keyword

Pure semantic search excels at meaning-based queries but can miss items that are only retrievable by exact keyword match. Pure keyword search (BM25) is precise for known terms but misses semantic matches. Hybrid search combines both, running both semantic and keyword search and merging the results using a rank fusion algorithm like Reciprocal Rank Fusion (RRF).

Hybrid search is particularly valuable for: queries containing specific names, codes, model numbers, or technical terms that are better matched by exact text; knowledge bases with precise technical vocabulary; and situations where both "what does this term mean" and "find this specific term" queries need to be handled by the same system. Weaviate and many other vector databases support hybrid search natively.
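Reciprocal Rank Fusion, the merging step named above, scores each document by summing 1 / (k + rank) over every ranked list it appears in. The sketch below assumes the commonly used constant k = 60 and breaks ties alphabetically by ID; the function name `rrf` is illustrative.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Because only ranks are used, the semantic and BM25 scores never need to be put on a common scale, which is a large part of the method's appeal.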

Advanced RAG patterns

Parent document retrieval: Store small chunks for precise retrieval but return the parent document or a larger chunk when the small chunk is retrieved. This solves the tension between small chunks (better retrieval precision) and large chunks (more context for the model).
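A sketch of the lookup step in parent document retrieval, assuming the small chunks have already been searched and each carries a pointer to its parent. The dictionaries and the name `parents_for_hits` are hypothetical.

```python
def parents_for_hits(
    child_hits: list[str],
    child_to_parent: dict[str, str],
    parents: dict[str, str],
) -> list[str]:
    """Map retrieved child chunk IDs to their parent texts, without duplicates."""
    seen: set[str] = set()
    results: list[str] = []
    for child_id in child_hits:
        parent_id = child_to_parent[child_id]
        if parent_id not in seen:   # several children may share one parent
            seen.add(parent_id)
            results.append(parents[parent_id])
    return results
```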

Agentic RAG: Rather than a single fixed retrieval step, an agent decides when to retrieve, what to query, and whether to retrieve again after reading initial results. If the first retrieval doesn't answer the question, the agent reformulates the query and tries again. This iterative, adaptive retrieval substantially improves quality for complex multi-part questions.
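The retrieve, check, and reformulate loop can be sketched with three caller-supplied functions. In a real agent, `is_sufficient` and `reformulate` would be language model calls; here they are assumptions left as plain callables, and `max_rounds` is an illustrative safety bound.

```python
from typing import Callable


def agentic_retrieve(
    question: str,
    retrieve: Callable[[str], list[str]],
    is_sufficient: Callable[[str, list[str]], bool],
    reformulate: Callable[[str, list[str]], str],
    max_rounds: int = 3,
) -> list[str]:
    """Retrieve repeatedly, rewriting the query until the evidence is judged sufficient."""
    query = question
    gathered: list[str] = []
    for _ in range(max_rounds):
        for chunk in retrieve(query):
            if chunk not in gathered:
                gathered.append(chunk)
        if is_sufficient(question, gathered):
            break
        query = reformulate(question, gathered)  # try a different angle next round
    return gathered
```

The round limit matters in practice: without it, a question the knowledge base cannot answer would keep the agent retrieving indefinitely.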

Multi-vector retrieval: Index each document with multiple vectors — a summary vector, a question vector (what questions does this document answer?), and a content vector. Different query types route to different index layers, improving coverage.

Evaluating RAG: the RAGAS framework

RAGAS (Retrieval Augmented Generation Assessment) is the leading open-source framework for evaluating RAG pipelines. It measures four dimensions:

Metric	What It Measures	Ideal Value
Faithfulness	Does the generated answer contain only claims that are supported by the retrieved context? (Hallucination detection)	1.0 — every claim is grounded
Answer Relevancy	Does the generated answer address the actual question asked?	1.0 — directly answers the question
Context Precision	Are the retrieved chunks actually relevant to the question? (Retrieval quality)	1.0 — all retrieved chunks are relevant
Context Recall	Do the retrieved chunks contain all the information needed to answer the question?	1.0 — nothing important was missed

RAGAS computes these metrics automatically using LLM judges, making it practical to run on large test sets with little manual annotation. Note that context recall is measured against a reference answer, so the test set needs ground-truth answers for that metric. A well-functioning RAG pipeline should score above 0.8 on all four metrics; scores below 0.6 on any metric indicate a problem worth diagnosing.
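Once scores are available, the thresholds above can be applied mechanically. The helper below is a small sketch that does not call the RAGAS library itself: it takes a dictionary of already-computed scores, and the "borderline" label for the band between 0.6 and 0.8 is an assumption added for the example.

```python
def diagnose(scores: dict[str, float], healthy: float = 0.8, failing: float = 0.6) -> dict[str, str]:
    """Label each metric as ok, borderline, or problem using the thresholds given."""
    report: dict[str, str] = {}
    for metric, score in scores.items():
        if score > healthy:
            report[metric] = "ok"
        elif score < failing:
            report[metric] = "problem"
        else:
            report[metric] = "borderline"  # assumed label for the middle band
    return report
```

A low context precision or context recall points at the retrieval side (chunking, embeddings, re-ranking), while low faithfulness or answer relevancy points at the generation side (prompting, model choice).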

RAG vs fine-tuning decision matrix

A common architectural decision: should new knowledge be added to a model through RAG or through fine-tuning? Both can work, but they are better suited to different use cases.

Dimension	RAG	Fine-tuning
Knowledge updates	Instant — update the index, no retraining	Requires retraining cycle (hours to days)
Knowledge type	Factual, retrievable content; documents; Q&A pairs	Style, tone, behavior, implicit know-how
Cost	Storage + retrieval at inference time	Training compute upfront, cheaper inference
Interpretability	High — can see exactly what was retrieved and why	Low — knowledge is implicit in weights
Privacy	Data stays outside the model; can be deleted	Data is baked into model weights
Best for	Dynamic knowledge, private data, large knowledge bases	Consistent output format, domain-specific behavior, specialized vocabulary

The Most Common Answer

In the majority of production use cases, RAG is the right choice over fine-tuning for adding factual knowledge. It is faster to update, more interpretable, more privacy-preserving, and often achieves higher accuracy because the model can read the exact source rather than relying on vaguely remembered fine-tuning examples. Fine-tuning is most valuable for changing how the model behaves (its tone, output format, and domain-specific reasoning style) rather than what it knows. Many of the most effective systems combine both: a fine-tuned base model with a RAG layer for dynamic factual grounding.