Model Evaluation
A model that works in your notebook and fails in production is worse than no model at all — it generates false confidence. Evaluation is the discipline of measuring whether a model is actually solving your problem, which is surprisingly different from measuring whether it gets high scores on standard benchmarks. The metrics you choose encode your values and your business priorities. Choose wrong, and you'll optimize for the wrong thing with impressive numbers to show for it.
Why accuracy is almost always the wrong metric
Accuracy seems intuitive: what fraction of my predictions were correct? If your model is right 95% of the time, it must be good. This reasoning fails in nearly every real-world application that matters, due to a fundamental statistical reality: class imbalance.
Consider a fraud detection system. Suppose fraud affects roughly 1% of transactions (real-world rates are often well below that, which only makes the problem worse). A model that predicts "not fraud" for every single transaction, a model that has learned absolutely nothing and ignores all the features you carefully engineered, achieves 99% accuracy. It passes the accuracy test while being completely useless: it catches zero fraud.
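The arithmetic is easy to check. The following is a minimal sketch on a made-up dataset of 10,000 transactions with a 1% fraud rate; the numbers and variable names are illustrative assumptions, not real data.

```python
# Hypothetical dataset: 10,000 transactions, 1% of them fraud (label 1 = fraud).
n_total = 10_000
n_fraud = 100
labels = [1] * n_fraud + [0] * (n_total - n_fraud)

# A "model" that ignores its input and always predicts "not fraud".
predictions = [0] * n_total

# Accuracy: fraction of predictions that match the true label.
correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / n_total

# Fraction of the actual fraud cases that the model flagged.
caught = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
fraud_caught_fraction = caught / n_fraud

print(f"accuracy: {accuracy:.2%}")
print(f"fraud caught: {fraud_caught_fraction:.2%}")
```

The do-nothing model scores 99% on accuracy while catching none of the 100 fraud cases.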
The same problem appears everywhere: cancer screening (most patients don't have cancer), network intrusion detection (most traffic is legitimate), structural defect detection (most parts are fine), rare disease diagnosis (most patients are healthy). In any domain where positive cases are rare, accuracy is actively misleading — it rewards models that learn to ignore the minority class.
Picture a bank that deploys a new ML fraud detection system. The vendor reports 99% accuracy. The compliance team is impressed. Six months later, fraud losses have doubled. Investigation reveals that the model flags almost nothing as fraud. It learned that "not fraud" is almost always right and never bothers to detect the rare cases. The metric they chose actively rewarded the wrong behavior. The scenario is illustrative, but the failure mode is a well-known one. Accuracy on imbalanced datasets is not just uninformative; it's dangerous.
The confusion matrix: four outcomes that tell the real story
Every binary classification prediction falls into one of four categories. The confusion matrix makes these explicit:
True Positive (TP): The model predicted positive, and it was actually positive. The fraud detector flagged a transaction, and it really was fraud. Correct catch.
True Negative (TN): The model predicted negative, and it was actually negative. The detector cleared a transaction, and it was legitimate. Correct clear.
False Positive (FP): The model predicted positive, but it was actually negative. The detector flagged a legitimate transaction as fraud. Wrong alarm — the customer's card gets declined at a restaurant.
False Negative (FN): The model predicted negative, but it was actually positive. The detector cleared a fraudulent transaction. The fraud goes through — money is lost.
Accuracy is (TP + TN) divided by the total number of predictions, so it lumps both kinds of error together and treats them as equally costly. But the costs of false positives and false negatives are usually different, often dramatically so. In cancer screening, a false negative (missed cancer) is catastrophically worse than a false positive (unnecessary follow-up biopsy). In spam filtering, a false positive (legitimate email sent to spam) might be worse than a false negative (spam gets through). The confusion matrix forces you to see all four outcomes clearly.
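As a rough sketch, the four outcomes can be counted directly from paired lists of true labels and predictions. The function name confusion_counts and the ten-example dataset below are hypothetical, chosen only to illustrate the definitions.

```python
def confusion_counts(y_true, y_pred):
    """Count the four outcomes for binary labels (1 = positive, 0 = negative)."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            counts["tp"] += 1  # correct catch
        elif predicted == 0 and actual == 0:
            counts["tn"] += 1  # correct clear
        elif predicted == 1 and actual == 0:
            counts["fp"] += 1  # wrong alarm
        else:
            counts["fn"] += 1  # missed positive
    return counts


# Made-up example: 3 actual positives, 7 actual negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
accuracy = (counts["tp"] + counts["tn"]) / len(y_true)
print(counts, accuracy)
```

Here accuracy comes out at 70%, which hides the fact that two of the three positives were missed. The four counts make that visible.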
Precision and recall: the fundamental tradeoff
Precision asks: of all the cases the model flagged as positive, what fraction actually were positive? A model with perfect precision never generates a false alarm. But a model can achieve perfect precision by being very conservative — only flagging the most egregiously obvious cases, missing many real positives.
Recall (also called sensitivity) asks: of all the actual positive cases, what fraction did the model catch? A model with perfect recall never misses a positive case. But a model can achieve perfect recall by flagging everything as positive, generating endless false alarms.
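Both definitions reduce to simple ratios of confusion-matrix counts: precision is TP / (TP + FP) and recall is TP / (TP + FN). The sketch below uses a hypothetical helper, precision_recall, and a made-up ten-example dataset to show the two degenerate strategies described above. Returning None when a ratio is undefined is a choice made for this sketch, not a universal convention.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels; None where undefined."""
    tp = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 0)
    # Precision is undefined if nothing was flagged; recall if there are no positives.
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


# Made-up data: 4 actual positives out of 10.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Strategy A: flag everything. Nothing is missed, but most flags are false alarms.
flag_everything = [1] * 10

# Strategy B: flag only the single most obvious case. No false alarms, many misses.
very_conservative = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print(precision_recall(y_true, flag_everything))
print(precision_recall(y_true, very_conservative))
```

Flagging everything gives recall 1.0 with precision 0.4. The conservative strategy gives precision 1.0 with recall 0.25.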
These two metrics are in tension, and this tension is fundamental: it's not a flaw in your model, it's a structural feature of classification under uncertainty. Making your model more aggressive (flagging more things as positive) increases recall but typically decreases precision. Making it more conservative usually does the opposite. Unless the classes can be separated perfectly, you cannot simultaneously maximize both.
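The tension shows up as soon as you vary how aggressive the model is. The sketch below assumes a classifier that outputs a score per example and flags everything at or above a cutoff. The scores, labels, and the helper precision_recall_at are all invented for illustration.

```python
# Made-up classifier scores (higher = more likely positive) and true labels.
scores = [0.95, 0.90, 0.85, 0.70, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]


def precision_recall_at(threshold, labels, scores):
    """Flag every example whose score is >= threshold, then compute both metrics."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for y, f in zip(labels, flagged) if f and y == 1)
    fp = sum(1 for y, f in zip(labels, flagged) if f and y == 0)
    fn = sum(1 for y, f in zip(labels, flagged) if not f and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


# A low cutoff is aggressive; a high cutoff is conservative.
results = {t: precision_recall_at(t, labels, scores) for t in (0.25, 0.5, 0.8)}
for threshold, (precision, recall) in results.items():
    print(f"threshold={threshold}: precision={precision:.3f}, recall={recall:.3f}")
```

On this toy data, raising the cutoff from 0.25 to 0.8 moves precision from 0.625 up to 1.0 while recall falls from 1.0 to 0.6.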
Airport security faces the precision-recall tradeoff daily. High recall: search every single passenger exhaustively. You'll catch every threat (perfect recall). You'll also flag thousands of innocent people per day as suspicious (terrible precision). High precision: only search passengers who match an extremely specific profile. Almost everyone you pull aside will actually be a threat (great precision). But most actual threats won't match that narrow profile, so they walk through (poor recall). Every security checkpoint sets a threshold on this tradeoff based on the relative costs of missing a threat versus inconveniencing innocent passengers. ML classifiers face exactly the same choice.
F1 score: one number that balances both
If you need a single number that captures the balance between precision and recall, the F1 score provides it. It's the harmonic mean of precision and recall — a mean that penalizes extreme imbalance between the two. A model with 100% precision and 0% recall would average to 50% under regular arithmetic but gets an F1 of 0. A model with 90% precision and 90% recall gets an F1 of 90%.
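The harmonic mean is F1 = 2 * precision * recall / (precision + recall). A minimal sketch follows. The function name f1_score is used here for a hand-written helper, not a library call, and returning 0 when both inputs are 0 is an assumed convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (defined as 0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Extreme imbalance: the arithmetic mean says 0.5, the harmonic mean says 0.
lopsided_arithmetic = (1.0 + 0.0) / 2
lopsided_f1 = f1_score(1.0, 0.0)

# Balanced case: both means agree.
balanced_f1 = f1_score(0.9, 0.9)

print(lopsided_arithmetic, lopsided_f1, balanced_f1)
```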
The F1 score is much more useful than accuracy on imbalanced datasets, but it still embeds an implicit assumption: that precision and recall matter equally. In many real problems, they don't. In cancer screening, a false negative (missing cancer) is much worse than a false positive (unnecessary biopsy). The F-beta score generalizes F1 to weight precision and recall differently — you can explicitly encode "recall matters twice as much as precision" into your metric. Your choice of evaluation metric is a business decision, not a technical one.
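The usual definition is F-beta = (1 + beta^2) * P * R / (beta^2 * P + R), where beta > 1 favors recall and beta < 1 favors precision. The sketch below uses a hand-written helper, f_beta_score, and made-up precision and recall values to show how the same model scores differently depending on which error you care about more.

```python
def f_beta_score(precision, recall, beta):
    """Weighted harmonic mean; beta=2 treats recall as twice as important as precision."""
    denominator = beta**2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / denominator


# Made-up model: catches most positives (recall 0.9) with many false alarms (precision 0.5).
precision, recall = 0.5, 0.9

f_half = f_beta_score(precision, recall, beta=0.5)  # precision-weighted
f_one = f_beta_score(precision, recall, beta=1.0)   # ordinary F1
f_two = f_beta_score(precision, recall, beta=2.0)   # recall-weighted

print(f"F0.5={f_half:.3f}  F1={f_one:.3f}  F2={f_two:.3f}")
```

Because this hypothetical model is stronger on recall than on precision, its score rises as beta shifts weight toward recall.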
ROC curves and AUC: evaluating threshold-free performance
Most classifiers don't produce a hard "yes/no" — they produce a probability score. "This transaction has a 73% probability of being fraud." You then set a threshold: flag everything above 50%? Above 70%? Above 90%? Changing the threshold changes the precision-recall tradeoff.
An ROC (Receiver Operating Characteristic) curve shows you all possible tradeoffs at once. On the x-axis: false positive rate (what fraction of legitimate cases get flagged). On the y-axis: true positive rate, i.e., recall (what fraction of actual fraud gets caught). Each point on the curve represents a different threshold setting. A perfect classifier's curve bends sharply to the upper-left corner — you can catch everything with no false alarms. A random classifier's curve is a diagonal line.
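One way to trace the curve is to use each distinct score as a threshold and record the resulting (false positive rate, true positive rate) pair. This is a simple sketch on invented scores and labels; the helper roc_points is hypothetical, and its quadratic loop favors clarity over efficiency.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per threshold, from strictest to most lenient."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Made-up classifier scores and true labels (1 = fraud).
scores = [0.95, 0.90, 0.85, 0.70, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

points = roc_points(labels, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.1f}  TPR={tpr:.1f}")
```

The curve starts at (0, 0), where nothing is flagged, and ends at (1, 1), where everything is. How quickly it climbs in between is what distinguishes a good classifier from a random one.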
The AUC (Area Under the Curve) summarizes the ROC curve as a single number between 0 and 1. AUC of 1.0 is a perfect classifier. AUC of 0.5 is random. AUC of 0.85 means: pick a random fraud case and a random legitimate case; there's an 85% chance the model assigns a higher fraud score to the actual fraud. AUC is threshold-independent — it tells you about the classifier's overall discriminative ability, not its performance at any specific threshold setting.
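The pairwise interpretation can be computed literally: compare every positive example with every negative example and count how often the positive one gets the higher score. The sketch below does exactly that on invented data. The helper auc_pairwise is hypothetical, and counting ties as half a win is the convention assumed here.

```python
def auc_pairwise(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Made-up classifier scores and true labels (1 = fraud).
scores = [0.95, 0.90, 0.85, 0.70, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

auc = auc_pairwise(labels, scores)
print(f"AUC = {auc:.2f}")
```

On this toy data, 22 of the 25 positive-negative pairs are ranked correctly, giving an AUC of 0.88.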
The bias-variance tradeoff: underfitting and overfitting
Every ML model sits somewhere on a spectrum between two failure modes. Understanding this spectrum is essential for diagnosing what's wrong with a model and how to fix it.
Underfitting (high bias): The model is too simple to capture the real patterns in the data. It performs poorly on both training data and new data. A linear model trying to fit a clearly curved relationship will underfit — it can't represent the curvature at all, so its predictions are systematically wrong regardless of which data you show it. The signal is there; the model is too rigid to find it.
Overfitting (high variance): The model is too complex — it has memorized the training data, including its noise and quirks. It performs spectacularly on training data but poorly on new data. If you show it a training example it has seen, it's perfect. If you show it a slightly different example it hasn't seen, it may fail completely. The model has learned the exam questions, not the underlying concepts.
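Both failure modes can be seen by comparing training error with error on fresh data. The sketch below is a toy setup under stated assumptions: the data follows an invented curve, y = x^2 plus noise; the underfit model is a straight line; and the overfit model is a pure memorizer (a 1-nearest-neighbor lookup) standing in for any overly flexible model. All names are hypothetical.

```python
import random

random.seed(0)


def make_data(n):
    """Sample n points from a made-up curved relationship: y = x^2 + noise."""
    xs = [random.uniform(-1, 1) for _ in range(n)]
    ys = [x * x + random.gauss(0, 0.1) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Least-squares straight line: too rigid to represent the curvature."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(xs, ys):
    """1-nearest-neighbor: returns the stored y of the closest training x, noise included."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda pair: abs(pair[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error of a model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


train_x, train_y = make_data(50)
test_x, test_y = make_data(200)

line_model = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

for name, model in [("line", line_model), ("memorizer", memorizer)]:
    print(f"{name}: train MSE={mse(model, train_x, train_y):.4f}, "
          f"test MSE={mse(model, test_x, test_y):.4f}")
```

The line should show a similar, high error on both sets, which is the signature of underfitting. The memorizer has zero training error by construction but a nonzero test error. That gap between training and test performance is the signature of overfitting.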
Underfitting is a student who follows one rule for all cooking: "add salt." The rule is too simple. They over-season some dishes, under-season others, and never adapt to what is in front of them. High bias: wrong in a systematic, predictable way. Overfitting is a student who memorized every recipe in the textbook word for word. Give them the exact recipe from class and you get perfect execution. Change one ingredient and you get complete confusion. They learned the specific instructions, not the underlying principles of flavor balance and technique. High variance: right on the familiar, wrong on the new.
Cross-validation: the honest performance estimate
The fundamental rule of evaluation: never measure performance on the data you trained on. A model measured on its own training data is like a student grading their own exam. The score will be flattering regardless of actual understanding, and a model that has memorized its training set can score perfectly while knowing nothing useful.
The simple solution is to hold out a test set — data the model never saw during training. But if you then make decisions based on test set performance (choosing between models, tuning parameters), you've implicitly trained on the test set too. You need a third set: train on the training set, tune on the validation set, and report final performance on the test set — once, never again.
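A three-way split can be sketched in a few lines. The function name train_val_test_split, the 60/20/20 proportions, and the fixed seed below are assumptions for illustration. The right proportions depend on how much data you have.

```python
import random


def train_val_test_split(items, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle once, then carve out disjoint test, validation, and training sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Stand-in for a real dataset: 100 example IDs.
examples = list(range(100))
train, val, test = train_val_test_split(examples)

# Fit on `train`, compare models and tune on `val`, and touch `test` only once at the end.
print(len(train), len(val), len(test))
```

The shuffle matters: if the data is ordered (by date, by customer, by label), slicing without shuffling gives splits that do not resemble each other. For time-dependent data a chronological split may be more appropriate than a random one.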
Cross-validation makes this more robust. Instead of a single train/validation split, you divide the non-test data into k groups (folds) of roughly equal size. Train on k-1 folds and evaluate on the remaining fold. Repeat k times, each time holding out a different fold. Average the k performance estimates. This gives you a more stable estimate of how the model will perform on new data, and every example gets used for both training and evaluation, just never in the same round. The final test set still stays untouched until the end. Five-fold and ten-fold cross-validation are standard in practice.
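The procedure can be sketched without any libraries. In the sketch below, k_fold_indices and cross_validate are hypothetical helpers, and the "model" is a deliberately trivial one that predicts the mean of its training targets. It is there only to make the loop runnable, not as a recommendation.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices and deal them into k folds whose sizes differ by at most one."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, fit, score, k=5):
    """Train on k-1 folds, score on the held-out fold, repeat k times, and average."""
    fold_scores = []
    for held_out in k_fold_indices(len(xs), k):
        held_out_set = set(held_out)
        train_idx = [i for i in range(len(xs)) if i not in held_out_set]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        fold_scores.append(
            score(model, [xs[i] for i in held_out], [ys[i] for i in held_out])
        )
    return sum(fold_scores) / k, fold_scores


def fit_mean(train_xs, train_ys):
    """Trivial stand-in model: always predicts the mean training target."""
    mean_y = sum(train_ys) / len(train_ys)
    return lambda x: mean_y


def mse(model, xs, ys):
    """Mean squared error of a model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data: 20 points on a noiseless line.
xs = list(range(20))
ys = [2 * x + 1 for x in xs]

mean_score, fold_scores = cross_validate(xs, ys, fit_mean, mse, k=5)
print(f"mean MSE over folds: {mean_score:.2f}")
print("per-fold MSE:", [round(s, 2) for s in fold_scores])
```

The spread of the per-fold scores is informative as well as their mean: a wide spread suggests the estimate depends heavily on which examples happened to be held out.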
Choosing the right metric for your actual business problem
The most important evaluation decision is the choice of metric itself. This is a business decision that should be made before any model is built, by people who understand the consequences of different types of errors.
Just as important is committing to that metric before you see any results. Researchers and practitioners who choose metrics after seeing model performance are unconsciously (or consciously) selecting metrics that make their model look good. This is the machine learning analogue of p-hacking. Lock in your metric, justify it in terms of business consequences, then train and evaluate. The metric should not change based on what the model achieves.