Module 10 · Expert Track26 min read · Machine Learning Explained

Foundation Models and the Frontier

Something qualitatively different happened in AI around 2020. It wasn't that the technology got faster, or that the data got cleaner, or that the algorithms got cleverer in the traditional sense. A new category of system emerged — one that previous ML frameworks simply don't capture well. Understanding what changed, why it matters, and what it means for practitioners is perhaps the most important thing anyone working in technology can grasp right now.

What changed with scale: the bitter lesson

Rich Sutton, one of the foundational figures in reinforcement learning, wrote an essay in 2019 called "The Bitter Lesson." Its thesis: over the entire history of AI, the approaches that consistently won in the long run were not those that encoded the most human knowledge but those that scaled best with computation. Search, learning, and general methods beat hand-crafted human insight, every time, once enough compute was applied.

The lesson is "bitter" because it's humbling for researchers who spend years engineering clever domain-specific solutions. But it has profound implications. It means the primary driver of AI capability progress is not algorithmic sophistication — it's scale. More data, more parameters, more compute. The transformer architecture happened to be particularly well-suited to scaling along all three dimensions simultaneously.

When models crossed certain scale thresholds, something unexpected happened: emergent capabilities. Abilities that were completely absent in smaller models appeared suddenly in larger ones — not as a gradual improvement, but as a threshold effect. Models that couldn't do multi-step arithmetic at one scale could do it reliably at ten times the scale. Models that produced word salad in certain languages became fluent in those languages at sufficient scale, even without any targeted training on those languages. Nobody programmed these capabilities. They emerged from scale applied to a simple objective — predict the next token.

The emergence phenomenon

Scientists at Google and other labs documented over 100 capabilities that appear suddenly as language models scale, including multi-step reasoning, translation to low-resource languages, analogical reasoning, and basic mathematical problem solving. The phenomenon is not fully understood. The leading hypothesis is that some capabilities require a minimum level of internal model complexity to represent the relevant patterns — once the model is large enough, the patterns are discoverable; below that threshold, they're not. This is unlike traditional engineering where performance improves smoothly with resources. In foundation models, sometimes nothing works, then something suddenly works very well.

Why foundation models are different infrastructure

All previous ML was task-specific. A spam classifier is a spam classifier. A medical image reader reads medical images. A recommendation system recommends. To apply ML to a new task, you collected task-specific data, designed task-specific features, trained a task-specific model, and built task-specific infrastructure. Each new application was an independent engineering project of comparable difficulty.

Foundation models broke this paradigm completely. A single model trained on general data at scale can perform thousands of tasks — not through separate training for each, but because the capabilities needed for most tasks were implicitly learned during general training. The same GPT-4 that writes a cover letter can also debug code, translate legal contracts, summarize medical literature, explain quantum physics to a teenager, and compose a sonnet. Without being retrained for any of these tasks.

This is the infrastructure shift. Foundation models are not task-specific tools. They're general-purpose cognitive infrastructure — more analogous to the internet or the operating system than to a conventional software application. The economic and organizational implications are enormous: the bulk of a company's ML investment can now go into using and adapting a single shared foundation, rather than building and maintaining a portfolio of specialized models.

Fine-tuning vs prompting vs RAG: the practitioner's decision framework

If foundation models are the platform, practitioners need a principled way to choose how to adapt them for specific needs. Three major approaches exist, with meaningfully different tradeoffs.

Prompting (in-context learning)

Write instructions, examples, and context in the prompt itself. No additional training. Fastest to deploy, most flexible, easiest to iterate on. Limits: constrained by context window size, no persistent learning, and the model's underlying knowledge is fixed at its training cutoff. Best for: tasks where the base model already has the necessary knowledge, prototyping new capabilities, and flexible ad-hoc use cases.

RAG (Retrieval-Augmented Generation)

Retrieve relevant documents from a knowledge base and inject them into the prompt at query time. The model has access to current, specific, and proprietary information beyond its training cutoff. No model weights are modified. Best for: questions requiring up-to-date or organization-specific knowledge — legal document Q&A, internal knowledge base search, product documentation assistance.

Fine-tuning

Continued training on task-specific data modifies the model's weights. The model learns task-specific style, format, and specialized knowledge that becomes part of its default behavior. Higher cost and slower iteration than prompting. Best for: consistent formatting requirements, highly specialized domains where the base model lacks key knowledge, applications requiring very specific persona or style, and production systems where prompt cost at scale is prohibitive.

The decision is not binary — most production systems combine approaches. A customer service bot might be fine-tuned on the company's tone and product knowledge, augmented by RAG for real-time inventory and order status, and further guided by system prompts for each interaction type. The framework: use prompting first (cheapest and fastest), add RAG when knowledge currency matters, add fine-tuning when consistent behavior or specialized performance is required despite the cost.

Multimodal models: intelligence across perception types

Early language models handled only text. The frontier has moved decisively toward multimodal systems — models that process and generate across text, images, audio, video, and even structured data simultaneously.

GPT-4V can analyze photographs and answer questions about them. Gemini Ultra processes video natively, understanding sequences of events across frames. Claude can read and reason about charts, diagrams, and documents. Models now being developed process protein structures, satellite imagery, sensor streams, and musical audio as naturally as text.

The significance is not just expanded input capability. When a model can see and read simultaneously, it can solve problems that are genuinely multimodal in nature — analyzing an X-ray while referencing the patient's clinical notes, understanding a diagram while reading its caption, processing a video call while hearing the audio. These are the natural forms in which information presents itself to humans. Multimodal models close the gap between how AI systems receive information and how humans experience the world.

AI agents and agentic behavior

The most significant frontier of the current moment is not a better language model — it's AI agents: systems that take sequences of actions in the world over extended time horizons to accomplish goals, rather than simply responding to single queries.

An agent receives a goal ("research and draft a competitive analysis of these five companies"), selects actions to achieve it (search the web, read documents, extract data, structure findings, draft text), executes those actions using available tools, observes the results, and adjusts its approach based on what it learns. It loops through this process over many steps, potentially hours, with limited human involvement.

This is qualitatively different from chatbot interactions. The model is not responding — it's planning and acting. Mistakes compound. The model might search for wrong information, misinterpret results, and base subsequent actions on the error, propagating it through the whole workflow. The reliability requirements are completely different from single-turn Q&A.

Current production examples of agentic AI include GitHub Copilot Workspace (writes, tests, and iterates on code autonomously), Devin (end-to-end software engineering agent), and enterprise systems where AI agents process invoices, triage support tickets, and update CRM records without human intervention. The pattern is clear: AI systems are beginning to do multi-step knowledge work, not just answer questions about it.

The junior analyst analogy for agents

Think of a traditional LLM as a very knowledgeable reference book — you ask it a question, it answers. Think of an AI agent as a very capable junior analyst you've hired. You give them a project goal. They go off, research it, make decisions about what to look up and how to organize findings, draft materials, revise them, and come back with finished work. They can use tools — a web browser, a spreadsheet, an email client. They take actions, not just produce text. The difference between a reference book and an analyst is the difference between answering questions and accomplishing tasks.

What ML practitioners need to understand about the paradigm shift

The transition to foundation models changes what ML practitioners need to know, what skills matter, and how systems are built. Several shifts are worth internalizing explicitly.

The feature engineering bottleneck is gone for text and vision. Traditional ML required careful manual feature engineering — deciding what the model should pay attention to. Foundation models ingest raw data and figure out what's relevant. The work moves upstream to data curation and downstream to evaluation.

Evaluation is harder and more important. When a model can do anything, evaluating whether it does your specific thing well requires genuine effort. Benchmark contamination (the model saw the test questions during training), capability elicitation (how do you know the model can do something if you don't know how to ask?), and alignment evaluation (does the model do what you actually want, not just what you asked for?) are live research problems that every practitioner deploying foundation models needs to grapple with.

Prompting is an engineering discipline. The quality of a foundation model's output is dramatically sensitive to how it's instructed. Prompt engineering — designing instructions, examples, and framing to reliably elicit desired behavior — is a real skill with real impact on production system performance. It's not magic, but it's not trivial either.

The cost structure of ML changed. Traditional ML: most cost is in training. Foundation models: most cost is in inference. A company using GPT-4 for customer service pays per token generated. At scale, this can be enormous. Optimizing for inference efficiency — using smaller models where possible, caching common responses, batching requests — becomes a major engineering concern it wasn't in the traditional ML world.

The practitioner's north star in a fast-moving landscape

The specific models, benchmarks, and architectures dominating discussion today will be different in two years. What won't change: the importance of clear problem definition, principled evaluation, and understanding what a system is actually doing versus what you hope it's doing. The practitioners who navigate the frontier best are not those who know the most about current models — they're those who can ask the right questions about any system: What is it optimizing for? What are its failure modes? How would I know if it was wrong? How would I measure if it's actually helping?

Final Assessment · Expert Track

You've completed all 10 modules

Test your understanding across the entire Machine Learning Explained course. 25 randomized questions. You need 70% to earn your certificate. Questions cover concepts, applications, and scenario-based reasoning — not just recall.

Take the Final Assessment →