Module 9 · Expert Track · 24 min read · Machine Learning Explained

ML in Production

There's a famous saying in ML engineering: a model in a notebook is not a product. The gap between a model that works in a controlled research environment and one that reliably serves real users is enormous, arguably larger than the gap between having no model and having a notebook model. Most ML projects that reach production run into problems that have nothing to do with the model itself. This module is about everything that happens after the training run finishes.

The notebook-to-production gap

When data scientists train a model, they typically work with a clean, static dataset. The features are already computed, the schema is consistent, the labels are verified. Training runs on a powerful machine or cloud cluster. Evaluation happens on a held-out test set drawn from the same distribution as the training data. Everything is controlled.

Production is the opposite of controlled. Real data arrives from multiple upstream systems that evolve independently. Schemas change without notice — a field gets renamed, a service starts sending null values where it used to send numbers, a third-party data provider changes their API. Users behave in ways the training data didn't capture. The world changes. Time passes.

A model trained on last year's user behavior predicts this year's user behavior with decreasing accuracy as the two diverge. A fraud detection model trained on fraud patterns from before a new attack technique was invented doesn't detect the new technique. A recommendation system trained on pre-pandemic viewing habits doesn't serve users well when their habits changed permanently during lockdown. None of this is a model failure in the traditional sense — the model learned correctly from the data it had. The data itself became stale.

The hidden technical debt problem

Google's landmark 2015 paper "Hidden Technical Debt in Machine Learning Systems" made an observation that surprised many: in mature ML systems, the model code itself can be as little as 5% of the total codebase. The other 95% is infrastructure: data ingestion pipelines, feature computation, monitoring, serving, logging, retraining triggers, validation checks, and shadow-mode testing. This ratio means that most ML engineering effort is not about machine learning at all; it is about software reliability engineering. Teams that ignore this debt pay it back, with interest, in production incidents.

ML system architecture

A production ML system is not a model — it's a pipeline with a model at the center. Understanding the full architecture helps identify where things break.

Data ingestion layer: Raw data arrives from databases, event streams, APIs, and file uploads. This layer must handle late-arriving data, duplicate events, schema changes, and partial failures. It must be monitored — if upstream data quality degrades, the model's input quality degrades with it, silently.
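The ingestion concerns above (nulls where numbers used to be, type changes, duplicate events) can be caught at the boundary, before records reach feature computation. The following is a minimal sketch, assuming events arrive as dictionaries. The field names and the EXPECTED_SCHEMA mapping are hypothetical, chosen only for illustration.

```python
from typing import Any

# Hypothetical schema for illustration: field name -> required Python type.
EXPECTED_SCHEMA: dict[str, type] = {
    "event_id": str,
    "user_id": str,
    "amount": float,
}


def validate_batch(events: list[dict[str, Any]]) -> tuple[list[dict[str, Any]], list[str]]:
    """Split a batch into clean events and human-readable rejection reasons."""
    clean: list[dict[str, Any]] = []
    problems: list[str] = []
    seen_ids: set[str] = set()

    for index, event in enumerate(events):
        errors = []
        for field_name, expected_type in EXPECTED_SCHEMA.items():
            value = event.get(field_name)
            if value is None:
                errors.append(f"missing or null field '{field_name}'")
            elif not isinstance(value, expected_type):
                errors.append(
                    f"field '{field_name}' is {type(value).__name__}, "
                    f"expected {expected_type.__name__}"
                )
        event_id = event.get("event_id")
        # Only check for duplicates once the record is otherwise well-formed.
        if not errors and event_id in seen_ids:
            errors.append(f"duplicate event_id '{event_id}'")

        if errors:
            problems.append(f"event {index}: " + "; ".join(errors))
        else:
            seen_ids.add(event_id)
            clean.append(event)
    return clean, problems
```

In practice the length of the problems list is itself worth monitoring: a sudden rise in rejections is often the first visible sign that an upstream system changed.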

Feature engineering layer: Raw data is transformed into model features. This is where most bugs live. A feature computed slightly differently between training and serving produces silent failures that are very hard to debug. Did you normalize that feature using the training set mean? You must use that same training set mean at serving time, not the current mean — otherwise the model receives inputs it has never seen.
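The normalization rule above can be sketched in a few lines: fit the statistics once on the training set, persist them with the model, and apply the same frozen values at serving time. This is an illustrative sketch, not a prescribed implementation. The function names, the JSON persistence, and the sample values are assumptions.

```python
import json
from statistics import fmean, pstdev


def fit_scaler(training_values: list[float]) -> dict[str, float]:
    """Compute normalization statistics once, from the training set only."""
    std = pstdev(training_values)
    # Fall back to 1.0 so a constant feature does not cause division by zero.
    return {"mean": fmean(training_values), "std": std if std > 0 else 1.0}


def normalize(value: float, scaler: dict[str, float]) -> float:
    """Apply the frozen training statistics; shared by training and serving."""
    return (value - scaler["mean"]) / scaler["std"]


# Training time: fit on training data and persist next to the model artifact.
training_amounts = [10.0, 20.0, 30.0, 40.0]  # made-up sample values
scaler = fit_scaler(training_amounts)
serialized = json.dumps(scaler)

# Serving time: load the saved statistics; never recompute them from live traffic.
serving_scaler = json.loads(serialized)
live_value = 100.0
feature = normalize(live_value, serving_scaler)
```

The design point is that normalize is the only code path that touches the statistics, so training and serving cannot disagree about them.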

Model serving layer: The trained model receives feature vectors and returns predictions. This must be fast enough for the application's latency budget. As illustrative targets, a real-time recommender typically aims for under 100 ms, and fraud scoring on a payment has to return before the transaction times out, often in under 200 ms.

Monitoring and observability layer: Everything that happens in the system is logged and tracked. Prediction distributions, feature distributions, latency, error rates — all monitored for anomalies. This is how you detect problems before they cascade into business impact.

Feature stores: solving the training-serving skew problem

One of the most insidious problems in ML systems is training-serving skew: the features used during training are computed differently from the features computed at serving time, so the model effectively receives different data in production than it was trained on. Even a small discrepancy can degrade performance significantly, and diagnosing it is notoriously difficult.

A feature store addresses this by providing a single, centralized repository where features are computed once, stored, and consumed by both training pipelines and serving pipelines. When you train, you read historical feature values from the feature store. When you serve, you read current feature values from the same feature store, computed by the same logic. The main source of skew disappears because there is only one feature computation path.
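To make the "one write path, two read paths" idea concrete, here is a toy in-memory sketch. It does not reproduce the API of Feast or any other real feature store. The class name, method names, and integer timestamps are all assumptions made for illustration.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Optional


class InMemoryFeatureStore:
    """Toy store: one write path, two read paths (historical and latest)."""

    def __init__(self) -> None:
        # (entity_id, feature_name) -> parallel lists, kept sorted by timestamp.
        self._timestamps = defaultdict(list)
        self._values = defaultdict(list)

    def write(self, entity_id: str, feature: str, timestamp: int, value: float) -> None:
        """Single write path: every feature value enters the store here."""
        key = (entity_id, feature)
        position = bisect_right(self._timestamps[key], timestamp)
        self._timestamps[key].insert(position, timestamp)
        self._values[key].insert(position, value)

    def read_as_of(self, entity_id: str, feature: str, timestamp: int) -> Optional[float]:
        """Training path: the latest value known at `timestamp`, never a later one."""
        key = (entity_id, feature)
        position = bisect_right(self._timestamps.get(key, []), timestamp)
        return self._values[key][position - 1] if position > 0 else None

    def read_latest(self, entity_id: str, feature: str) -> Optional[float]:
        """Serving path: the most recent value for this entity."""
        values = self._values.get((entity_id, feature), [])
        return values[-1] if values else None
```

The read_as_of lookup matters for training: building a training row from values that were only written after the label's timestamp would leak future information into the model.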

Companies like Uber (Michelangelo), Airbnb (Zipline), and Spotify built proprietary feature stores before open-source options like Feast and managed platforms like Tecton became widely available. The pattern is now considered infrastructure hygiene for production ML teams.

Model serving: REST API vs batch inference

How predictions are consumed drives the architecture of the serving layer. Two dominant patterns exist:

Online serving (REST API): The model is deployed as a web service. Applications send feature vectors via HTTP requests and receive predictions in real time. This is required for any use case where predictions must happen at request time — fraud detection, real-time recommendations, content ranking. The challenges: latency requirements are strict, the service must be highly available, and it must handle traffic spikes gracefully. Most production ML serving infrastructure (TensorFlow Serving, Triton Inference Server, Seldon, BentoML) targets this pattern.

Batch inference: The model runs on a large dataset periodically — hourly, daily, weekly. Predictions are stored in a database or file system and consumed later. This is appropriate when predictions don't need to be instantaneous: "generate tomorrow's recommendations for all users overnight," "score all loan applications received this week." The challenges are different — efficiency at scale, storage and retrieval of predictions, and staleness (a prediction generated yesterday may be wrong today).
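The two patterns can share the same model and differ only in how it is invoked. The sketch below assumes a stand-in predict function with made-up weights. handle_request shows the shape of an online endpoint's core logic without tying it to any particular web framework, and score_batch shows chunked batch scoring. All names are hypothetical.

```python
import json
from typing import Iterable, Iterator


def predict(features: list[float]) -> float:
    """Stand-in for a trained model: a fixed linear score (hypothetical weights)."""
    weights = [0.5, -0.25]
    return sum(w * x for w, x in zip(weights, features))


def handle_request(body: bytes) -> tuple[int, bytes]:
    """Online path: one JSON request in, one (status, JSON response) out."""
    try:
        features = json.loads(body)["features"]
        score = predict([float(x) for x in features])
    except (ValueError, KeyError, TypeError):
        # Malformed input must not crash the service; reject it explicitly.
        return 400, json.dumps({"error": "invalid request"}).encode()
    return 200, json.dumps({"score": score}).encode()


def score_batch(
    rows: Iterable[tuple[str, list[float]]], chunk_size: int = 1000
) -> Iterator[list[tuple[str, float]]]:
    """Batch path: yield scored chunks so results can be written incrementally."""
    chunk: list[tuple[str, float]] = []
    for row_id, features in rows:
        chunk.append((row_id, predict(features)))
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk
```

Yielding chunks rather than one large list keeps memory bounded and lets a failed job resume from the last chunk that was written.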

Data drift and concept drift: the twin degradation mechanisms

Models degrade over time. The question is why and how fast. Two mechanisms cause this, and they require different responses.

Data drift occurs when the statistical distribution of input features changes. The model hasn't changed; the inputs it receives have. A product recommendation model trained when most users were on mobile phones performs differently after the user base shifts toward tablets, because it saw few such usage patterns during training. Detecting data drift means continuously monitoring the statistical properties of incoming feature distributions and comparing them to the training distributions. When they diverge significantly, a drop in performance is likely to follow.
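One common way to compare an incoming feature distribution with its training distribution is the Population Stability Index (PSI). The sketch below is one possible implementation under simple assumptions: equal-width bins derived from the training sample, and a small floor for empty bins. The function name, bin count, and floor value are choices made here, not a standard.

```python
import math


def population_stability_index(
    expected: list[float], actual: list[float], bins: int = 10
) -> float:
    """PSI between a training sample (`expected`) and a production sample (`actual`)."""
    low, high = min(expected), max(expected)
    width = (high - low) / bins or 1.0  # guard against a constant feature

    def bin_fractions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so production values outside the training range land in edge bins.
            index = min(max(int((v - low) / width), 0), bins - 1)
            counts[index] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))
```

A commonly cited rule of thumb treats a PSI below 0.1 as stable, 0.1 to 0.25 as a moderate shift, and above 0.25 as a significant shift. These thresholds are conventions and should be tuned per feature.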

Concept drift occurs when the relationship between inputs and the correct output changes. The patterns the model learned are no longer true. A price prediction model trained before a market disruption event predicts based on relationships that no longer hold. A spam filter trained before a new spamming technique became prevalent doesn't recognize the new patterns. Data drift is about the inputs changing; concept drift is about the underlying world changing in a way that makes the model's learned patterns wrong.
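Because concept drift changes the relationship between inputs and outputs, input statistics alone may not reveal it. A typical signal is a drop in accuracy once ground-truth labels arrive. The sketch below tracks accuracy over a sliding window of labeled predictions. The class name, window size, baseline, and tolerance are illustrative assumptions.

```python
from collections import deque
from typing import Optional


class RollingAccuracyMonitor:
    """Tracks accuracy over the most recent labeled predictions."""

    def __init__(self, window: int, baseline_accuracy: float, tolerance: float) -> None:
        self._outcomes = deque(maxlen=window)  # True where prediction matched label
        self._baseline = baseline_accuracy
        self._tolerance = tolerance

    def record(self, predicted: int, actual: int) -> None:
        """Call when the true label for an earlier prediction becomes known."""
        self._outcomes.append(predicted == actual)

    def accuracy(self) -> Optional[float]:
        if len(self._outcomes) < self._outcomes.maxlen:
            return None  # not enough labeled feedback yet
        return sum(self._outcomes) / len(self._outcomes)

    def drift_suspected(self) -> bool:
        """True when windowed accuracy falls below baseline minus tolerance."""
        current = self.accuracy()
        return current is not None and current < self._baseline - self._tolerance
```

This only works when labels eventually arrive. Where they are delayed or unavailable, teams usually fall back on proxy signals such as shifts in the prediction distribution.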

The weather forecaster analogy

Data drift is like a weather forecaster who was trained in California being deployed in England. The inputs — temperature, humidity, pressure patterns — follow different distributions. The forecaster's rules still work mechanically, but the inputs they were calibrated for rarely appear. Concept drift is like climate change gradually shifting the patterns the forecaster learned, so even the relationships they understood become less reliable over decades. Both require updating the model — but for different reasons and by different means.

A/B testing ML models

When you train a new model, you believe it's better. But "better on the test set" is not the same as "better in production." Users behave differently than test data suggests. The system has complex feedback loops. What you need is evidence from production — and A/B testing provides it.

In an ML A/B test, traffic is split between the old model (control) and the new model (treatment). Both serve real users simultaneously. You measure the business metrics that actually matter — conversion rate, revenue per user, churn, engagement — not just model accuracy. After sufficient time to gather statistically significant results, you compare the groups and decide whether to roll out the new model.
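For a binary business metric such as conversion, "statistically significant" is often assessed with a two-proportion z-test. The sketch below is a minimal version using only the standard library. The function name and argument names are assumptions, and real experiments also need a pre-planned sample size and a correction if many metrics are checked.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(
    conv_a: int, n_a: int, conv_b: int, n_b: int
) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the conversion-rate difference B minus A.

    A is the control group (old model), B is the treatment group (new model).
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both models convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value
```

For example, calling two_proportion_z_test(1000, 10000, 1100, 10000) compares a 10% control conversion rate with an 11% treatment rate; the returned p-value is then compared against the significance level chosen before the test started.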

This discipline prevents a common mistake: deploying a model with better offline metrics that makes things worse in production. Netflix A/B tests every algorithm change that touches recommendations. Google A/B tests every ranking change. The discipline scales: at large organizations, hundreds of A/B tests run simultaneously, each a controlled experiment to validate a proposed ML or product change before full deployment.

MLOps and the CI/CD pipeline for ML

Software engineering developed DevOps to make software deployment reliable, repeatable, and fast. ML engineering adapted these practices into MLOps — the operational discipline for building reliable ML systems at scale.

A mature MLOps pipeline includes: automated data validation (does incoming training data meet quality requirements?), automated training on a schedule or triggered by data drift alerts, automated evaluation against the current production model before any deployment, automated deployment via canary release (gradually shifting traffic rather than switching all at once), and automated rollback if production metrics degrade after a deployment.
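The canary release and automated rollback steps can be sketched as two small functions: one that deterministically assigns a stable slice of users to the new model, and one that decides whether to widen the rollout or roll back. The rollout schedule, the error-rate metric, and the 10% regression threshold are hypothetical values chosen for illustration.

```python
import hashlib


def route_to_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically assign a user to the canary using a hash bucket in 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < canary_percent


def next_canary_percent(
    current_percent: int,
    canary_error_rate: float,
    baseline_error_rate: float,
    max_relative_increase: float = 0.10,
) -> int:
    """Advance the rollout one step, or return 0 (rollback) if the canary regresses."""
    steps = [1, 5, 25, 50, 100]  # hypothetical rollout schedule
    if canary_error_rate > baseline_error_rate * (1 + max_relative_increase):
        return 0  # automated rollback: send all traffic to the old model
    later_steps = [s for s in steps if s > current_percent]
    return later_steps[0] if later_steps else 100
```

Hashing the user ID, rather than choosing randomly per request, keeps each user on the same model for the whole rollout, which makes the canary's metrics easier to interpret.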

The CI/CD (Continuous Integration / Continuous Delivery) paradigm from software engineering applies: every change to model code, training data, or feature logic triggers a full automated pipeline — training, evaluation, and deployment decision — without requiring manual intervention. Teams like those at Spotify and LinkedIn report deploying ML model updates multiple times per day using such pipelines, with far fewer production incidents than teams that deploy manually.

Model cards and documentation

A model in production without documentation is a liability. Who trained this model, and when? What data was it trained on? What are its known limitations and failure modes? What populations was it tested on? On what subgroups does performance degrade?

Model cards, proposed by researchers at Google in 2019, answer these questions in a structured format. A model card documents the intended use cases, out-of-scope uses, training data description, evaluation results disaggregated by important subgroups, and known ethical considerations. Enterprise procurement processes and emerging AI regulations increasingly ask for this kind of documentation.
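A model card can live in version control next to the model as structured data. The sketch below mirrors the sections listed above as a Python dataclass. It is not an official model card schema: the field names and the helper method are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ModelCard:
    """Illustrative structure only; not an official model card schema."""

    model_name: str
    version: str
    trained_on: str  # date of the training run
    owners: list[str]
    intended_uses: list[str]
    out_of_scope_uses: list[str]
    training_data: str
    # subgroup name -> metric name -> value
    metrics_by_subgroup: dict[str, dict[str, float]] = field(default_factory=dict)
    known_limitations: list[str] = field(default_factory=list)
    ethical_considerations: list[str] = field(default_factory=list)

    def underperforming_subgroups(self, metric: str, threshold: float) -> list[str]:
        """Subgroups whose metric is below threshold; a missing metric counts as failing."""
        return sorted(
            group
            for group, metrics in self.metrics_by_subgroup.items()
            if metrics.get(metric, 0.0) < threshold
        )

    def to_json(self) -> str:
        """Serialize the card so it can be stored and reviewed with the model."""
        return json.dumps(asdict(self), indent=2)
```

Treating a subgroup with no recorded metric as failing is a deliberate choice in this sketch: it surfaces untested populations instead of hiding them.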

The discipline of writing a model card forces teams to confront uncomfortable questions they might otherwise skip: Did you test performance on users with disabilities? On speakers of non-standard dialects? On examples from demographics underrepresented in the training data? Models deployed without this analysis frequently fail on exactly those groups — and the failure is often discovered only after it's caused real harm.

What actually goes wrong in real deployments

Across publicly documented production incidents, the same failures recur. The most common is not a bad model but a broken data pipeline that silently delivers wrong inputs to a correct model. The second is training-serving skew: features computed differently in training and in serving. Third is concept drift discovered too late because monitoring was never set up. Fourth is a model that performs well on average but catastrophically on a specific subgroup that was missing from the test data. None of these is about model architecture. They are about engineering discipline, monitoring, and documentation.