Module 6 · Expert Track · 25 min read · Machine Learning Explained

NLP and Transformers

Language is among the hardest things for computers to handle. It's ambiguous, context-dependent, culturally loaded, and infinite in its possible combinations. "I saw the man with the telescope" has two completely different meanings depending on who held the telescope. Computers need to turn language into numbers before they can process it at all, and the way we do that encoding has changed dramatically, culminating in the transformer architecture that powers nearly every major language model today.

The first challenge: making text into numbers

Neural networks operate on numbers. Text is not numbers. The first step in any natural language processing pipeline is converting text into a numerical representation — a process that seems trivial but is actually full of consequential decisions.

Tokenization is the process of breaking text into units that the model can work with. Early systems used words as tokens. "The cat sat on the mat" becomes six tokens: [The, cat, sat, on, the, mat]. But this creates problems. What about "don't" — is it one token or two? What about "unhappiness," "unhappy," "happier" — three separate tokens, or should the system recognize they share a root? What about Chinese, which has no spaces?

Modern systems use subword tokenization, which breaks words into smaller pieces. "Unhappiness" might become ["un", "happi", "ness"]. This lets the vocabulary stay manageable (many systems use roughly 30,000–100,000 tokens, and some use more) while still handling rare words, new words, and multiple languages. The model can learn that "un-" tends to negate what follows and that "-ness" tends to create nouns from adjectives, even though it was never explicitly taught these rules.
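The splitting step can be sketched in a few lines. This is a minimal illustration, assuming a tiny hand-written vocabulary and a greedy longest-match rule (similar in spirit to WordPiece-style tokenizers); real tokenizers learn their vocabulary from a corpus, and the names VOCAB and subword_tokenize are invented for this example.

```python
# Toy subword tokenizer: greedy longest-match against a fixed vocabulary.
# ASSUMPTION: the vocabulary is hand-written for illustration; real systems learn it from data.
VOCAB = {"un", "happi", "happy", "ness", "er", "cat", "s"}

def subword_tokenize(word, vocab=VOCAB, unk="[UNK]"):
    """Split a lowercase word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No vocabulary piece matches here, so give up on the whole word.
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces

print(subword_tokenize("unhappiness"))  # ['un', 'happi', 'ness']
print(subword_tokenize("happier"))      # ['happi', 'er']
```

Note that the pieces are literal substrings of the word, which is why the middle piece is "happi" rather than "happy". A word with no matching pieces falls back to a single unknown token in this sketch.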

Once text is tokenized, each token gets converted to a number (its ID in the vocabulary). But a single number carries no information about meaning. The word "cat" getting ID 4,291 and "dog" getting ID 7,832 tells the model nothing about how similar they are. We need richer representations.
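As a rough sketch of this lookup step, the snippet below assigns IDs in order of first appearance. The function names are illustrative, and real vocabularies are built once from a large corpus rather than from a single sentence.

```python
def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Replace each token with its vocabulary ID."""
    return [vocab[tok] for tok in tokens]

tokens = "the cat sat on the mat".split()
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)
print(ids)  # [0, 1, 2, 3, 0, 4]
```

The IDs are arbitrary labels: "cat" being 1 and "sat" being 2 says nothing about whether the two words are related.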

Word embeddings: words in geometric space

The breakthrough insight of word embeddings, popularized by Google's Word2Vec in 2013, was this: give each word a position in a high-dimensional space — typically 100 to 300 dimensions — and arrange those positions so that words with similar meanings end up near each other.

How are these positions learned? By training a model to predict words from their context. If you take a huge corpus of text and train a model to predict "the blank sat on the mat" → "cat," then "the blank chased the mouse" → "cat," then do the same for "dog" across thousands of contexts, the model learns that "cat" and "dog" appear in similar contexts. It represents them with similar vectors. Meaning emerges from distribution — from what company a word keeps.
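The "company a word keeps" idea can be illustrated without any training at all. The sketch below swaps in a simpler technique than Word2Vec: it counts which words appear near each target word and compares those count vectors with cosine similarity. The five-sentence corpus is made up for illustration, and Word2Vec itself learns dense vectors by prediction rather than by counting.

```python
import math
from collections import Counter, defaultdict

# ASSUMPTION: a tiny invented corpus, only large enough to show the effect.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the mouse",
    "the dog chased the ball",
    "stocks fell on the news",
]

def context_vectors(sentences, window=2):
    """For each word, count the words that occur within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][words[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

vecs = context_vectors(corpus)
sim_cat_dog = cosine(vecs["cat"], vecs["dog"])
sim_cat_stocks = cosine(vecs["cat"], vecs["stocks"])
print(sim_cat_dog > sim_cat_stocks)  # True
```

In this toy corpus "cat" and "dog" share the same neighbors, so their vectors point the same way, while "stocks" shares almost none.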

The results were remarkable. When researchers did arithmetic on these word vectors, "king" minus "man" plus "woman" landed closest to "queen." Country names clustered near their capitals. Adjectives and their superlatives formed consistent directional patterns. The model had learned semantic relationships nobody programmed; they emerged from pure statistical patterns in text.
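The analogy arithmetic is easy to sketch. The vectors below are hand-picked two-dimensional stand-ins (one axis for royalty, one for gender), not learned embeddings, and the helper names are invented for this example.

```python
import math

# ASSUMPTION: hand-picked 2-D toy vectors (royalty, gender).
# Real embeddings are learned from text and have hundreds of dimensions.
EMB = {
    "king": (0.9, 0.9), "queen": (0.9, -0.9),
    "man": (0.1, 0.9), "woman": (0.1, -0.9),
    "prince": (0.7, 0.9), "princess": (0.7, -0.9),
    "apple": (-0.5, 0.1),
}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c, emb=EMB):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, and c."""
    target = tuple(x - y + z for x, y, z in zip(emb[a], emb[b], emb[c]))
    candidates = [w for w in emb if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(emb[w], target))

print(analogy("king", "man", "woman"))  # queen
```

Excluding the three input words from the candidates mirrors how such analogy queries are usually evaluated; without that step the nearest vector is often one of the inputs.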

The city map analogy for semantic space

Imagine a city where every word has an address. Similar words live in similar neighborhoods: "cat," "dog," "hamster," and "parrot" all live in the pet district. "Run," "sprint," "jog," and "dash" are in the motion quarter. "Paris," "London," "Berlin" are in the European capital zone. The distance between two words' addresses captures how semantically related they are. Directions in this city have meaning: moving in the "capital of" direction tends to take you from a country to its capital city. Moving in the "plural" direction tends to take you from a singular noun to its plural. The geometry of the space encodes much of the grammar and semantics of the language.

The problem word embeddings couldn't solve

Classic word embeddings gave every word a single fixed vector. But "bank" in "I went to the bank to deposit money" and "bank" in "we fished from the river bank" are completely different concepts that happen to share a spelling. A fixed vector cannot be both at once.

Language is deeply context-dependent. The meaning of any word depends on what surrounds it — often on what came many sentences earlier, or what the whole document is about. Early systems tried to handle this with Recurrent Neural Networks (RNNs) — architectures designed to process text sequentially, maintaining a running "memory" of what they'd seen. RNNs worked, but they struggled with long-range dependencies. By the time an RNN got to the end of a long document, it had largely forgotten the beginning. And they were slow to train because each word had to be processed in sequence — you couldn't parallelize the computation.
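Both weaknesses show up even in a one-number toy RNN. In the sketch below, the weights w_h and w_x are hand-chosen rather than learned, and the hidden state is a single scalar instead of a vector; it only illustrates the sequential loop and how the first input's influence fades.

```python
import math

def run_rnn(inputs, w_h=0.5, w_x=1.0):
    """Scalar RNN: each hidden state depends on the previous one, so steps must run in order."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)
    return h

def first_input_influence(length):
    """How much the final state changes when only the first input changes from 0 to 1."""
    base = [0.0] * length
    changed = [1.0] + [0.0] * (length - 1)
    return abs(run_rnn(changed) - run_rnn(base))

for n in (5, 20):
    print(n, first_input_influence(n))
```

With these assumed weights the effect of the first input shrinks by roughly half at every step, so after 20 steps almost nothing of it is left in the final state. The loop also cannot be split across positions, because step t needs the result of step t-1.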

The attention mechanism: looking everywhere at once

In 2017, a team at Google published a paper titled "Attention Is All You Need." It introduced the transformer architecture, and it changed everything.

The key innovation is the attention mechanism. Instead of processing words in sequence and maintaining a compressed memory, attention lets every word in a sequence look at every other word simultaneously and decide how much to "attend to" each one when figuring out its own meaning.

The mechanism works through three concepts: queries, keys, and values. Each word generates all three. The query is like a question: "what information do I need from the rest of the sentence?" The key is like an answer label: "here's what I contain." The value is the actual content. To figure out what "bank" means, the model compares the query for "bank" against the key of every word in the sentence. Words with matching keys (like "river" or "deposit") get high attention scores, which are normalized into weights that sum to one. The word then collects the values of the other words, weighted by those scores, to build a contextual representation that captures its meaning in this specific sentence.
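The query, key, and value steps can be sketched as scaled dot-product attention, the formulation used in the original transformer paper: score each key against the query, divide by the square root of the key dimension, apply softmax, and take a weighted sum of the values. The vectors below are hand-written for illustration; in a real transformer they come from learned projections of the token embeddings.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights

# ASSUMPTION: toy 2-D keys, values, and query, chosen so "bank" matches "river".
words = ["the", "river", "bank"]
keys = [[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
values = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
bank_query = [2.0, 0.0]

output, weights = attend(bank_query, keys, values)
print(words[weights.index(max(weights))])  # river
```

Every query can be processed this way independently of the others, which is what makes the computation easy to run in parallel.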

The library analogy for attention

Imagine you're researching a topic (your query). You walk through a library where every book has a catalog card on the cover (the key) describing what's inside. You scan the cards and pull the most relevant books (high attention scores). You then read those books' content (the values) and synthesize them into your understanding. You do this for every word in the sentence simultaneously. The insight is that you don't have to read every book cover-to-cover in order — you can jump directly to what's relevant. This is why transformers handle long-range dependencies so much better than RNNs: "bank" can attend directly to "river" even if they're 200 words apart.

Why transformers replaced RNNs so completely

Transformers had three advantages that compounded dramatically with scale. First, parallelism: because every position in the sequence attends to every other position simultaneously, the entire computation can be run in parallel on GPU hardware. Training runs that were impractically slow with RNNs became feasible with transformers. Second, long-range dependencies: attention's ability to connect any two positions directly, regardless of distance, solved the fundamental limitation of sequential processing. Third, scalability: transformers kept getting better with more data, more parameters, and more compute in a way that RNNs did not. Additional training compute kept yielding measurable improvements.

Within a few years of the original paper, virtually every state-of-the-art NLP system had moved to transformer-based architectures. BERT, GPT, T5, XLNet, RoBERTa — the entire family of modern language models is built on the transformer foundation.

BERT vs GPT: two ways to use the same architecture

The transformer architecture can be configured in different ways depending on what you want the model to do. Two approaches dominate, and they lead to fundamentally different capabilities.

BERT (Google, 2018) uses only the encoder portion of the transformer. It's trained to "understand" text by reading it in both directions simultaneously — it sees the whole sentence at once, context from both left and right. The training task: predict randomly masked words given all the surrounding context ("The cat [MASK] on the mat" → "sat"). BERT excels at understanding tasks — sentiment analysis, question answering, named entity recognition, classification. It reads text and produces a rich representation of what it means. It's terrible at generating new text, because it's built to read, not write.
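Building a masked-word training example can be sketched as follows. This is a simplification: it always substitutes the [MASK] token, whereas BERT's actual procedure masks about 15% of tokens and sometimes substitutes a random token or leaves the original in place. The function name and seed parameter are invented for this example.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0, mask_token="[MASK]"):
    """Hide some tokens and record the originals the model must predict."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok  # position -> hidden word
        else:
            masked.append(tok)
    return masked, targets

tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens, mask_prob=0.5)
print(masked)
print(targets)
```

The model sees the masked sequence in full, both left and right of each gap, and is scored only on the positions listed in targets.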

GPT (OpenAI, 2018 onward) uses only the decoder portion. It's trained to predict the next word given only what came before; it only sees context from the left. The training task: given a very large corpus of books and web pages, predict each next word. This is a generation task. GPT learned to write by practicing writing across billions of documents, in many styles and on many topics. Decoder-only models in this style are what power ChatGPT, Claude, and most large language models. They're generators, not understanders, though at sufficient scale, generation requires such deep understanding that the distinction blurs.
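The left-to-right setup can be sketched in two small helpers: one turns a sentence into (context, next word) training pairs, and one builds the causal mask that decoder-style attention commonly uses so that a position cannot look at later positions. Both function names are illustrative.

```python
def next_token_examples(tokens):
    """Pair each left-only context with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

for context, target in next_token_examples("the cat sat".split()):
    print(context, "->", target)
# ['the'] -> cat
# ['the', 'cat'] -> sat

for row in causal_mask(3):
    print(row)
# [True, False, False]
# [True, True, False]
# [True, True, True]
```

A single sentence yields one training example per position, which is part of why next-word prediction makes such efficient use of raw text.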

BERT-style (encoder-only): for understanding
Reads text bidirectionally. Used for classification, search, question answering, named entity recognition. Produces embeddings that capture meaning. Does not generate new text.
GPT-style (decoder-only): for generation
Predicts text left to right. Used for writing, summarization, translation, conversation. Produces fluent coherent text. Foundation for ChatGPT, Claude, Gemini, and most LLMs.
T5-style (encoder-decoder): for transformation
Reads input with encoder, generates output with decoder. Used for translation, summarization (input a document, output a summary), question answering with generation. Google Translate uses a variant of this approach.

What fine-tuning actually means

Pre-training a large language model can cost tens of millions of dollars and months of compute time. The result is a model that has internalized the statistical patterns of a large share of the public internet. But "predict the next word" is not the same as "be helpful, harmless, and honest" or "classify customer support tickets into these 12 categories."

Fine-tuning takes a pre-trained model and continues training it on a smaller, task-specific dataset. The model's vast general knowledge is largely preserved; its existing weights are nudged toward the specific behavior. A hospital takes a general language model and fine-tunes it on anonymized clinical notes and medical literature, and the model now speaks medicine. A customer service platform fine-tunes on its ticket history, and the model now knows the specific products, policies, and tone of that company.

Fine-tuning is far more efficient than training from scratch because the model doesn't need to re-learn language, grammar, world knowledge, or common sense. It already knows all of that. The fine-tuning process just specializes the existing knowledge for a specific task. This is why fine-tuning a capable base model on 10,000 examples can produce a specialized system that outperforms training a dedicated model from scratch on 10 million examples.
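The mechanics of "continue training from existing weights" can be shown with a deliberately tiny stand-in. The sketch below uses a three-weight logistic regression, not a language model; the starting weights and the four task examples are invented, and the only point is that fine-tuning starts gradient descent from weights that already exist instead of from scratch.

```python
import math

def predict(weights, features):
    """Logistic model: probability that the label is 1."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(weights, data):
    """Average cross-entropy loss over (features, label) pairs."""
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for x, y in data:
        p = predict(weights, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)

def fine_tune(weights, data, lr=0.5, steps=50):
    """Continue gradient descent from the given weights on a small task dataset."""
    weights = list(weights)  # copy, so the starting weights are left untouched
    for _ in range(steps):
        grads = [0.0] * len(weights)
        for x, y in data:
            err = predict(weights, x) - y
            for i, xi in enumerate(x):
                grads[i] += err * xi / len(data)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights

# ASSUMPTION: stand-in "pre-trained" weights and a hypothetical four-example task.
pretrained = [1.0, -1.0, 0.0]
task_data = [([1.0, 0.0, 1.0], 1), ([0.0, 1.0, 1.0], 0),
             ([1.0, 1.0, 0.0], 1), ([0.0, 0.0, 1.0], 0)]

tuned = fine_tune(pretrained, task_data)
print(log_loss(pretrained, task_data), log_loss(tuned, task_data))
```

In practice the same loop runs over billions of weights with a much smaller learning rate, which helps keep the general knowledge intact while the task behavior is learned.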

What LLMs are actually doing when they respond

When you ask a large language model a question, what is actually happening? The honest and demystifying answer: the model is predicting, token by token, the most likely continuation of the conversation so far.

Every response is built one token at a time. The model sees the entire conversation (your question, its previous responses, everything) and produces a probability for every possible next token given all of that context. One token is chosen, either the single most likely one or a sample from among the likeliest candidates. That token is appended to the context, and the process repeats. "The capital of France is" → most likely next token: "Paris." The model doesn't know what it's going to say at the end of the response when it starts; it discovers the answer one token at a time, the same way you discover the end of a sentence as you read it.
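The generation loop itself is short. The sketch below assumes a hypothetical table of next-token probabilities keyed only by the previous token, which is far cruder than a real LLM that conditions on the whole context, and it always picks the most likely token (greedy decoding) rather than sampling.

```python
# ASSUMPTION: invented next-token probabilities, keyed by the previous token only.
# A real LLM conditions on the entire context; a lookup table keeps the loop readable.
NEXT = {
    "the": {"capital": 0.6, "cat": 0.4},
    "capital": {"of": 1.0},
    "of": {"france": 0.7, "spain": 0.3},
    "france": {"is": 1.0},
    "is": {"paris": 0.8, "lyon": 0.2},
    "paris": {"</s>": 1.0},
}

def generate(prompt, table=NEXT, max_new_tokens=10):
    """Greedy decoding: repeatedly append the most likely next token."""
    tokens = list(prompt)  # assumes a non-empty prompt
    for _ in range(max_new_tokens):
        dist = table.get(tokens[-1])
        if not dist:
            break  # the toy table has no entry for this token
        next_token = max(dist, key=dist.get)
        if next_token == "</s>":
            break  # end-of-sequence marker
        tokens.append(next_token)
    return tokens

print(generate(["the", "capital", "of", "france", "is"]))
# ['the', 'capital', 'of', 'france', 'is', 'paris']
```

Each pass through the loop sees only what has been produced so far, and the token it picks becomes part of the input for the next pass.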

This sounds mechanistic, but the implications are profound. To predict the next word reliably across every possible topic, the model had to internalize enormous amounts of knowledge and reasoning ability. Predicting "The mitochondria is the" → "powerhouse" is easy. Predicting coherent multi-step reasoning across a complex analytical problem requires something that, functionally, looks a lot like understanding — even if the underlying mechanism is "just" next-token prediction.

The hallucination problem

LLMs predict likely continuations, not necessarily true ones. A model trained on text where "The capital of Australia is Sydney" appears many times (because people often confuse it) may confidently predict "Sydney" even though the correct answer is Canberra. The model has no ground truth oracle to check against — it only has the patterns it learned from training data. This is why LLMs sometimes "hallucinate" confident, fluent, plausible-sounding falsehoods. The generation mechanism and the knowledge mechanism are the same: next-token prediction. When they conflict, prediction wins.
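A frequency count is enough to reproduce the effect in miniature. The training sentences below are fabricated for illustration and do not reflect real corpus statistics, and the "model" is just a counter over continuations rather than a neural network.

```python
from collections import Counter

# ASSUMPTION: fabricated training text in which the wrong answer is more common.
training_text = [
    "the capital of australia is sydney",
    "the capital of australia is sydney",
    "the capital of australia is sydney",
    "the capital of australia is canberra",
    "the capital of australia is canberra",
]

def most_likely_continuation(prefix, corpus):
    """Return the word that most often follows `prefix` in the corpus, or None."""
    counts = Counter()
    for sentence in corpus:
        if sentence.startswith(prefix + " "):
            rest = sentence[len(prefix) + 1:].split()
            if rest:
                counts[rest[0]] += 1
    return counts.most_common(1)[0][0] if counts else None

print(most_likely_continuation("the capital of australia is", training_text))  # sydney
```

Nothing in the procedure checks the answer against the world; it only reports what followed the prefix most often.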