Module 728 min read · Building with AI APIs

Embeddings and Semantic Search

Every API call so far in this course has sent text in and received text back. Embeddings are different: you send text in and receive a dense numerical vector back — a list of floating-point numbers, typically 1,536 or 3,072 of them, that encodes the meaning of that text in a high-dimensional geometric space. The power of this representation is that meaning becomes distance: semantically similar texts map to vectors that are close together, semantically different texts map to vectors that are far apart. This enables a fundamentally new kind of search — one that finds relevant content by meaning rather than by keyword matching.

What an embedding is, precisely

An embedding model is a neural network trained to convert variable-length text into a fixed-length vector of floats. The training objective rewards the model for producing vectors where semantically related texts cluster together and unrelated texts spread apart. The result is a learned geometric representation of meaning: "dog" and "puppy" end up nearby, "bank" (financial) and "bank" (riverbank) end up in different regions depending on context, and "Paris is the capital of France" and "France's capital city is Paris" end up nearly identical despite using different words.

The dimensionality of the vector (1,536 for OpenAI's text-embedding-3-small, 3,072 for text-embedding-3-large) determines the resolution of the semantic space — higher dimensions can encode more fine-grained distinctions but require more storage and compute. The specific numbers in the vector have no human-interpretable meaning individually; the meaning is encoded in the relationships between vectors, not in any single dimension.

Critically, embedding models are distinct from completion models. You cannot use gpt-4o to generate embeddings — you need a model specifically trained for embedding (like text-embedding-3-small from OpenAI or embed-english-v3.0 from Cohere). Each provider's embeddings live in their own geometric space and are not interoperable: you cannot compare an OpenAI embedding to a Cohere embedding using cosine similarity and get a meaningful result.

Generating embeddings via API

The API call is simple — simpler than a chat completion. You provide an input string (or list of strings for batch embedding) and the model ID, and receive a list of floats.

With the OpenAI Python SDK:

from openai import OpenAI
client = OpenAI()

response = client.embeddings.create(
    input="The mitochondria is the powerhouse of the cell",
    model="text-embedding-3-small"
)

vector = response.data[0].embedding  # list of 1536 floats
print(len(vector))  # 1536

For batch embedding — embedding many texts in a single API call — pass a list as input. Batch embedding is significantly more cost-efficient than making one API call per text: the same token count processed in batch costs the same as individual calls, but you save on per-request overhead and rate limit headroom.

Measuring similarity: cosine similarity

The standard way to measure how similar two embedding vectors are is cosine similarity: the cosine of the angle between the two vectors. Cosine similarity ranges from -1 (opposite directions, maximally dissimilar) to 1 (same direction, maximally similar), with 0 representing orthogonality (unrelated). For text embeddings from modern models, scores above 0.85 typically indicate strong semantic similarity, scores between 0.7 and 0.85 indicate topical relatedness, and scores below 0.5 indicate little semantic connection.

Computing cosine similarity between two numpy arrays:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# If embeddings are already L2-normalized (most modern models normalize),
# cosine similarity reduces to a simple dot product:
def cosine_sim_normalized(a, b):
    return np.dot(a, b)

Building a semantic search system

Semantic search has three components: an index (a stored collection of embeddings for your documents), a query (an embedding of the user's question), and a retrieval step (finding the stored embeddings closest to the query embedding). The simplest possible implementation stores embeddings in memory and uses brute-force nearest-neighbor search.

The full pipeline

At index time: for each document in your corpus, call the embeddings API and store the resulting vector alongside the document text and any metadata. Do this once, offline, and cache the results. Re-embed only when documents change.

At query time: embed the user's query using the same model (critical — you must use the same model for queries and documents). Compute cosine similarity between the query vector and every stored document vector. Return the top-k documents by similarity score, along with their text.

Scaling with vector databases

Brute-force search over stored embeddings works fine up to tens of thousands of documents. Above that scale, you need a vector database — a specialized data store designed to efficiently find the nearest neighbors of a query vector in a large collection. Popular vector databases include Pinecone (managed cloud service), Weaviate (open-source, can self-host), Qdrant (open-source, Rust-based, very fast), and pgvector (a PostgreSQL extension that adds vector similarity search to a standard relational database). If you already use PostgreSQL, pgvector is often the pragmatic choice — you get vector search without adding another infrastructure component.

Vector databases use approximate nearest neighbor (ANN) algorithms — typically HNSW (Hierarchical Navigable Small World graphs) — that trade a small amount of recall accuracy for dramatically faster search. At one million documents, an ANN index finds the 10 most similar documents in milliseconds; brute-force search would take seconds. At ten million documents, only ANN search is practical.

Retrieval-Augmented Generation (RAG)

Embeddings and semantic search are the foundation of RAG — the pattern where you retrieve relevant documents from a knowledge base and inject them into the context of an LLM completion call. This is how you give an LLM access to information that is too large to fit in a context window, updated after the model's training cutoff, or proprietary to your organization.

The basic RAG loop: user asks a question → embed the question → find the top-k most relevant documents → concatenate those documents into the LLM's context → ask the LLM to answer the question based on the provided context. The LLM's answer is grounded in your retrieved documents rather than solely in its training data, which reduces hallucination and enables factual answers about private or recent information.

When RAG outperforms fine-tuning

For knowledge that changes frequently or is specific to your organization, RAG almost always outperforms fine-tuning. Fine-tuning bakes knowledge into model weights, which means the knowledge is static after training and updating requires an expensive retraining run. RAG updates the knowledge base by updating the document index — no model training required. If your knowledge base changes daily or weekly, RAG is the right architecture.

Chunking strategies

Documents are rarely embedded as a whole. A 10-page PDF embedded as a single vector loses all internal structure — the retrieval system cannot surface the specific paragraph that answers a question, only indicate that the document is broadly relevant. Chunking — splitting documents into smaller pieces before embedding — gives the retrieval system finer granularity.

The main chunking strategies:

Fixed-size chunking: split every N tokens or characters, optionally with overlap between chunks. Simple to implement, works reasonably well for uniform text. Overlap (e.g., 50 token overlap between 200-token chunks) reduces the chance of splitting a sentence or concept across chunk boundaries.
Semantic chunking: split at natural boundaries — paragraph breaks, section headings, sentence boundaries. Preserves the semantic coherence of each chunk at the cost of variable chunk sizes.
Recursive chunking: split first by large delimiters (double newlines), then by smaller ones (single newlines), then by sentences, until all chunks are under the target size. This is LangChain's RecursiveCharacterTextSplitter approach and works well for most document types.

Chunk size is a hyperparameter to tune. Smaller chunks (100–200 tokens) give more precise retrieval but may lack context for the LLM to answer. Larger chunks (500–1,000 tokens) provide more context but may include irrelevant content that dilutes the LLM's focus. A common practical approach: retrieve small chunks for relevance, then expand to the surrounding paragraphs for LLM context.

Embedding model consistency

If you change your embedding model — for instance, upgrading from text-embedding-3-small to text-embedding-3-large for better quality — you must re-embed your entire document corpus. The new model's vectors live in a different geometric space, and comparing old-model document vectors to new-model query vectors will produce meaningless similarity scores. This re-embedding cost is worth budgeting for in your architecture — it is a predictable maintenance event, not an edge case.

In Module 8, we take the API capabilities further: function calling and tool use, which let LLMs take actions rather than just generate text. This is the foundation of agent systems and opens up applications that go far beyond question-answering.