Module 7 · Expert Track · 22 min read · Machine Learning Explained

Computer Vision

Vision is the sense humans trust most — we process an enormous amount of information visually, and we do it with seemingly effortless ease. Recognizing a face in a crowd, reading a street sign while driving, spotting a tumor on an X-ray — these feel natural. For machines, each of these tasks requires solving a genuinely hard problem: how do you turn a grid of colored numbers into meaningful understanding of the world? Computer vision is the field that answers that question.

Images as numerical tensors

Before a model can "see" anything, we need to understand what an image actually is in computational terms. A digital image is not a picture — it's a three-dimensional array of numbers. Each pixel has three values representing the intensity of red, green, and blue light, each ranging from 0 to 255. A 1920×1080 image is a grid of 1,920 columns and 1,080 rows, with 3 color channels per pixel: roughly 6.2 million numbers.
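As a minimal sketch of the tensor described above, the snippet below builds a blank 1920×1080 RGB image with NumPy and counts its values. The (height, width, channels) layout and the variable names are assumptions for illustration; some libraries order the axes differently.

```python
import numpy as np

# Assumed layout: (rows, columns, color channels), 8-bit values from 0 to 255.
height, width, channels = 1080, 1920, 3
image = np.zeros((height, width, channels), dtype=np.uint8)

# Set the top-left pixel to pure red: full red, no green, no blue.
image[0, 0] = [255, 0, 0]

total_values = image.size  # rows * columns * channels
print(image.shape)
print(total_values)        # 6220800, the "roughly 6.2 million numbers"
```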

When a computer vision model receives an image, it receives this numerical tensor. The model has no built-in concept of "edge," "color," "object," or "face." Every piece of visual understanding must be learned from data. The question is: how do you design a system that learns to extract meaningful patterns from a tensor of millions of numbers?

Early approaches tried to hand-engineer features: programmers wrote algorithms to detect edges, corners, and color histograms, then fed those features into classifiers. This worked for narrow tasks but required years of expert effort per domain and broke down whenever images were taken from a slightly different angle or lighting condition. The breakthrough came when we let the model learn features directly from pixel data.

Convolutional neural networks: filters as feature detectors

A Convolutional Neural Network (CNN) is a neural network architecture designed specifically for grid-structured data like images. Its key innovation is the convolutional filter — a small pattern-detector that slides across the entire image, looking for a specific feature wherever it appears.

Think of a filter as a small template, perhaps 5×5 pixels in size, that represents a particular pattern, like a vertical edge or a diagonal stripe. The filter slides across every position in the image (left to right, top to bottom), and at each position it multiplies its weights by the pixel values underneath and sums the results, which measures how closely that patch matches the template. Where the patch closely matches the filter, the output value is high. Where it doesn't, the output is low. The result is a new "activation map": a grid showing where in the image that particular pattern was detected.
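The sliding-template idea can be sketched in a few lines of NumPy. This is an illustrative, unoptimized version that assumes a single-channel image, no padding, and a stride of one; the 3×3 vertical-edge filter is written by hand here, whereas a CNN would learn its filter values during training. The function name `slide_filter` is hypothetical.

```python
import numpy as np

def slide_filter(image, kernel):
    """Slide `kernel` over a 2D image and return the activation map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    activation = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            patch = image[r:r + kh, c:c + kw]
            # Weighted sum: high when the patch resembles the template.
            activation[r, c] = np.sum(patch * kernel)
    return activation

# A 6x6 image that is dark (0) on the left half and bright (1) on the right.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-written template for "dark on the left, bright on the right".
vertical_edge = np.array([[-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0]])

activation_map = slide_filter(image, vertical_edge)
print(activation_map)
```

Each row of the resulting 4×4 map is high only in the two columns where the filter straddles the dark-to-bright boundary, and zero in the flat regions on either side.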

A CNN doesn't use one filter — it uses hundreds or thousands simultaneously. Each filter learns to detect a different elementary pattern. Crucially, the filters learn their patterns during training from data. You don't specify what patterns to look for; the network discovers which patterns are useful for the task.

The rubber stamp analogy

Imagine you have a rubber stamp with a particular design, say, a horizontal line. You stamp it across every position on a piece of paper and record where the ink sticks. The resulting pattern shows you everywhere that horizontal feature appeared. Now imagine having dozens of different rubber stamps (for vertical lines, curves, spots, textures) and stamping all of them across the same image. You get dozens of maps, each highlighting where a different elementary visual feature lives. That collection of maps is what the first layer of a CNN produces from a raw image.

Pooling and spatial hierarchy

After a convolutional layer produces its activation maps, a pooling layer typically follows. Pooling reduces the spatial resolution of those maps — essentially shrinking them down. The most common approach, max pooling, divides the map into small regions and keeps only the maximum value from each region.

Why throw away information? Because we want the detected features to be location-tolerant. If a filter detected a vertical edge at position (42, 73) in the image, it should still be detected if the image is slightly shifted or the edge is at position (44, 71). By taking the maximum over small regions, pooling makes the detection robust to small spatial shifts. It also reduces the number of computations as you go deeper.
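A small sketch of max pooling shows the shift tolerance described above. It assumes non-overlapping 2×2 regions and a single activation map; `max_pool` is an illustrative helper, not a library function.

```python
import numpy as np

def max_pool(activation, size=2):
    """Keep the maximum of each non-overlapping size x size region."""
    h, w = activation.shape
    h_trim, w_trim = h - h % size, w - w % size  # drop rows/cols that don't fit
    blocks = activation[:h_trim, :w_trim].reshape(
        h_trim // size, size, w_trim // size, size)
    return blocks.max(axis=(1, 3))

# A strong detection at row 0, column 0 ...
original = np.zeros((4, 4))
original[0, 0] = 9.0

# ... and the same detection shifted one pixel down and to the right.
shifted = np.zeros((4, 4))
shifted[1, 1] = 9.0

pooled_original = max_pool(original)
pooled_shifted = max_pool(shifted)
print(pooled_original)
print(pooled_shifted)
```

Both inputs pool to the same 2×2 map because the shift stays inside one pooling region. A shift that crosses a region boundary would still change the output, so the tolerance is limited to small displacements.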

The alternating pattern of convolution and pooling creates spatial hierarchy — the network's signature capability. First-layer filters detect edges and textures. Second-layer filters, operating on the output of the first layer, detect combinations of edges — corners, curves, simple shapes. Third-layer filters detect combinations of shapes — eyes, wheels, leaves. By the final layers, the network recognizes high-level objects: faces, cars, dogs. The hierarchy builds automatically from training data, without anyone telling the network what intermediate features to look for.
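One way to see why deeper layers can respond to larger structures is to track how much of the original image each activation value depends on (its receptive field). The sketch below assumes a hypothetical stack of 3×3 convolutions (stride 1, padded to preserve size) alternating with 2×2 pooling (stride 2) on a 224×224 input; the layer names and sizes are illustrative, not taken from a specific architecture.

```python
def trace_layers(input_size, layers):
    """Return (name, map size, receptive field) after each layer.

    `layers` is a list of (name, kernel_size, stride) tuples.
    """
    size, receptive_field, jump = input_size, 1, 1
    rows = []
    for name, kernel, stride in layers:
        # Each layer widens the field by (kernel - 1) steps of the current jump.
        receptive_field += (kernel - 1) * jump
        jump *= stride            # distance in input pixels between neighbors
        size = size // stride     # stride-2 pooling halves the map
        rows.append((name, size, receptive_field))
    return rows

layers = [
    ("conv1", 3, 1), ("pool1", 2, 2),
    ("conv2", 3, 1), ("pool2", 2, 2),
    ("conv3", 3, 1), ("pool3", 2, 2),
]

trace = trace_layers(224, layers)
for name, size, rf in trace:
    print(f"{name}: {size}x{size} map, each value sees a {rf}x{rf} input patch")
```

The maps shrink while the patch of the original image behind each value grows, which is what lets later filters combine edges into shapes and shapes into object parts.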

Classification vs detection vs segmentation

Computer vision is not a single task — it's a family of increasingly precise challenges. Understanding the distinctions matters for matching the right tool to the right problem.

Image Classification

What is the primary subject of this image? Answer: one label. "This is a dog." Used by Instagram to auto-categorize photo content, by Pinterest to cluster visual pins, by medical systems to identify the primary pathology in a scan.

Object Detection

Where are all the relevant objects in this image, and what are they? Answer: bounding boxes with labels and confidence scores. "Dog at position [234,156,412,380], confidence 94%. Cat at position [88,44,200,190], confidence 87%." Used in autonomous vehicles to detect pedestrians, cyclists, and other cars in real time.

Semantic Segmentation

Classify every single pixel in the image. Answer: a pixel-level map. "These 48,221 pixels are road. These 7,432 pixels are sidewalk. These 1,840 pixels are pedestrian." Used in surgical robotics to distinguish tissue types, in satellite imagery to map land use at field level.

Instance Segmentation

Like semantic segmentation, but distinguishes between separate instances of the same class. Not just "these pixels are person" but "these pixels are Person A, those are Person B." Used in crowd analysis, sports tracking, and medical cell counting.
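The four tasks differ mainly in the shape of their answer. The sketch below shows one plausible representation of each output on a toy 4×6 image. The box coordinate order `[x_min, y_min, x_max, y_max]`, the class ids, and all variable names are assumptions for illustration.

```python
import numpy as np

# Classification: a single label for the whole image.
classification = "dog"

# Detection: one entry per object, each with a box, a label, and a confidence.
detections = [
    {"label": "dog", "box": [234, 156, 412, 380], "confidence": 0.94},
    {"label": "cat", "box": [88, 44, 200, 190], "confidence": 0.87},
]

# Semantic segmentation: one class id per pixel.
CLASS_NAMES = {0: "road", 1: "sidewalk", 2: "pedestrian"}
semantic_map = np.array([
    [1, 0, 0, 0, 0, 1],
    [1, 2, 0, 0, 2, 1],
    [1, 2, 0, 0, 2, 1],
    [1, 0, 0, 0, 0, 1],
])
ids, counts = np.unique(semantic_map, return_counts=True)
pixels_per_class = {CLASS_NAMES[int(i)]: int(n) for i, n in zip(ids, counts)}

# Instance segmentation: pedestrian pixels also carry an instance id
# (0 means "not part of any pedestrian").
instance_map = np.zeros_like(semantic_map)
instance_map[1:3, 1] = 1   # Person A
instance_map[1:3, 4] = 2   # Person B
num_pedestrians = len(np.unique(instance_map[instance_map > 0]))

print(pixels_per_class)
print(num_pedestrians)
```

The semantic map alone says that four pixels are "pedestrian"; only the instance map reveals that they belong to two separate people.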

Transfer learning: standing on the shoulders of ImageNet

Training a CNN from scratch requires millions of labeled images and weeks of compute. Most organizations don't have either. Transfer learning solves this by letting you start with a model that someone else already trained on a massive dataset.

ImageNet is a dataset of more than 14 million labeled images spanning over 20,000 categories. The subset used in the well-known ImageNet competition (ILSVRC) contains about 1.2 million training images across 1,000 categories, roughly a thousand labeled examples per category. Training a large CNN on it can take hundreds to thousands of GPU-hours, depending on the architecture and hardware. But the resulting model has learned extraordinarily rich visual features: not just "what a dog looks like," but a general understanding of textures, shapes, edges, colors, and how they combine into objects.

Transfer learning takes such a pre-trained model and adapts it to your specific task. You freeze the early layers (they already know how to detect edges and shapes, universal features useful everywhere), and retrain only the later layers on your specific dataset. A radiology group wanting to detect a specific lung pathology doesn't need a million chest X-rays; a few thousand, combined with a pre-trained model's general visual knowledge, can often be enough to reach strong performance. Tesla's vision system, Google Photos' categorization, Apple's Face ID recognition, and medical imaging diagnostic tools all build on transfer learning from models trained on large general datasets.
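The freeze-then-retrain pattern can be sketched without a deep learning framework. In the toy example below, a fixed random projection with a ReLU stands in for the pre-trained early layers (a loud assumption: a real workflow would load weights from an actual pre-trained network), and only a small logistic-regression "head" is trained on synthetic task data. All names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for pre-trained early layers. These weights are never updated.
frozen_weights = rng.normal(size=(16, 8)) / 4.0
frozen_snapshot = frozen_weights.copy()

def extract_features(x):
    """Frozen backbone: linear map followed by ReLU."""
    return np.maximum(x @ frozen_weights, 0.0)

def train_head(features, labels, steps=500, lr=0.1):
    """Train only the final layer (logistic regression) by gradient descent."""
    w, b = np.zeros(features.shape[1]), 0.0

    def loss(w, b):
        p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
        p = np.clip(p, 1e-7, 1 - 1e-7)
        return float(-np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p)))

    initial = loss(w, b)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
        error = p - labels
        w -= lr * (features.T @ error) / len(labels)
        b -= lr * error.mean()
    return w, b, initial, loss(w, b)

# Small synthetic "task dataset": 200 examples with binary labels.
x = rng.normal(size=(200, 16))
y = (x[:, 0] + x[:, 1] > 0).astype(float)

# Features are computed once, because the backbone never changes.
features = extract_features(x)
head_w, head_b, initial_loss, final_loss = train_head(features, y)
print(initial_loss, final_loss)
```

Because gradients only ever touch `head_w` and `head_b`, the task needs far fewer examples and far less compute than training every layer from scratch.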

The 2012 inflection point

In 2012, Alex Krizhevsky's AlexNet won the ImageNet competition with a top-5 error rate of 15.3%, compared to the runner-up's 26.2%. (Top-5 error is the fraction of images for which the correct label is not among the model's five highest-ranked guesses.) The gap was so large that the entire computer vision community pivoted to deep learning within months. By 2015, deep learning models had surpassed the estimated human-level performance (around 5% top-5 error) on ImageNet. What changed was not the algorithm, since CNNs had existed since the 1980s, but the combination of large labeled datasets, GPU hardware, and architectural refinements that made training feasible at scale.
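The top-5 error metric quoted above counts a prediction as correct when the true label appears among the five highest-scoring classes. A minimal sketch, using made-up scores for four images and ten classes (the function name and data are illustrative):

```python
import numpy as np

def top_k_error(scores, true_labels, k=5):
    """Fraction of rows whose true label is not among the k best scores."""
    top_k = np.argsort(scores, axis=1)[:, -k:]          # k highest-scoring classes
    hits = (top_k == true_labels[:, None]).any(axis=1)  # is the truth among them?
    return 1.0 - hits.mean()

# Every image gets scores 0..9, so classes 5-9 are always the top five
# and class 9 is always the single best guess.
scores = np.tile(np.arange(10.0), (4, 1))
true_labels = np.array([9, 5, 0, 7])

top5 = top_k_error(scores, true_labels, k=5)
top1 = top_k_error(scores, true_labels, k=1)
print(top5, top1)
```

Here three of the four true labels fall inside the top five but only one is the top guess, so top-5 error is much lower than top-1 error on the same predictions.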

Generative models: creating images from scratch

So far we've discussed vision models that analyze images. An equally important branch creates images. Two families of generative models have transformed creative and industrial applications.

GANs: the forger and the detective

Generative Adversarial Networks (GANs), introduced by Ian Goodfellow in 2014, use two neural networks in competition. The generator starts from random noise and tries to produce realistic images. The discriminator receives both real images and the generator's fakes, and tries to distinguish them. The generator's goal is to fool the discriminator. The discriminator's goal is to catch every fake.

Over thousands of training rounds, both networks improve. The discriminator gets better at spotting fakes, which forces the generator to produce more convincing images, which pushes the discriminator to look harder, and so on. The result, after sufficient training, is a generator that produces images so realistic humans cannot reliably tell them from photographs. This is how deepfakes are made, how StyleGAN generates synthetic human faces, and how drug companies generate synthetic molecular images for drug discovery research.
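The tug-of-war between the two networks is usually expressed through their loss functions. The sketch below uses binary cross-entropy with the commonly used "non-saturating" generator loss, which is an assumption here since the text does not specify a formulation. The discriminator outputs are made-up probabilities that an image is real.

```python
import numpy as np

def binary_cross_entropy(predictions, targets):
    p = np.clip(predictions, 1e-7, 1 - 1e-7)
    return float(-np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p)))

def discriminator_loss(d_real, d_fake):
    """The detective wants 1 for real images and 0 for fakes."""
    return (binary_cross_entropy(d_real, np.ones_like(d_real))
            + binary_cross_entropy(d_fake, np.zeros_like(d_fake)))

def generator_loss(d_fake):
    """The forger wants the detective to output 1 for its fakes."""
    return binary_cross_entropy(d_fake, np.ones_like(d_fake))

d_real = np.array([0.90, 0.80, 0.95])

# Early in training: fakes are easy to spot.
d_fake_early = np.array([0.10, 0.05, 0.20])
# Later: the detective is close to guessing on fakes.
d_fake_late = np.array([0.50, 0.45, 0.55])

g_early, g_late = generator_loss(d_fake_early), generator_loss(d_fake_late)
d_early = discriminator_loss(d_real, d_fake_early)
d_late = discriminator_loss(d_real, d_fake_late)
print(g_early, g_late, d_early, d_late)
```

As the fakes improve, the generator's loss falls and the discriminator's loss rises. In a full training loop each network takes gradient steps on its own loss in alternation.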

Diffusion models: sculpting signal from noise

Diffusion models, the technology behind Stable Diffusion, DALL-E 2, and Midjourney, work differently. They are trained by taking real images and progressively adding random noise until the image is completely destroyed, leaving just static. The model learns to reverse this process: given a slightly noisy image, predict what the slightly-less-noisy version looked like.
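The "progressively adding noise" half of training has a simple closed form in the widely used DDPM-style setup. The sketch below assumes that setup, with a linear noise schedule over 1,000 steps; the schedule values, image size, and names are assumptions for illustration, and the neural network that learns the reverse direction is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

num_steps = 1000
betas = np.linspace(1e-4, 0.02, num_steps)  # assumed per-step noise amounts
alpha_bar = np.cumprod(1.0 - betas)         # fraction of signal left at step t

def add_noise(clean, t):
    """Jump straight to noise level t by mixing the image with Gaussian noise."""
    noise = rng.normal(size=clean.shape)
    noisy = np.sqrt(alpha_bar[t]) * clean + np.sqrt(1.0 - alpha_bar[t]) * noise
    return noisy, noise

def correlation(a, b):
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

# Toy 8x8 RGB "image" with values scaled to [-1, 1].
clean_image = rng.uniform(-1.0, 1.0, size=(8, 8, 3))

slightly_noisy, _ = add_noise(clean_image, t=10)
mostly_noise, _ = add_noise(clean_image, t=num_steps - 1)

early_similarity = correlation(clean_image, slightly_noisy)
late_similarity = correlation(clean_image, mostly_noise)
print(early_similarity, late_similarity)
```

Early steps stay almost identical to the original, while the final step retains essentially none of it. In many implementations the network is trained to predict the `noise` array returned here, which is equivalent to recovering a less noisy image.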

At inference time, you start with pure noise and apply the learned denoising process hundreds of times, each step removing a little noise. What emerges from the noise, guided by a text prompt, is a generated image that matches the description. The text prompt guides which direction in "image space" the denoising moves — "a photorealistic mountain landscape at sunset" steers the noise removal toward images that look like mountain landscapes at sunset. The results are astonishing: photorealistic scenes, painterly artwork, product visualizations, and architectural renders that didn't exist before the prompt was typed.

Real-world computer vision applications

Medical imaging

CV models now match or exceed radiologist accuracy on specific diagnostic tasks. Google's DeepMind developed a system that detects over 50 types of eye disease from retinal scans with accuracy comparable to world-leading ophthalmologists. Zebra Medical Vision's models screen chest X-rays for pneumonia, fractures, and cardiovascular abnormalities in seconds. PathAI's pathology models analyze biopsy slides to identify cancer margins with precision that reduces diagnostic variability between pathologists. The economic and human impact is significant: in regions with radiologist shortages, CV can provide diagnostic screening that simply wasn't available before.

Autonomous vehicles

Autonomous vehicle perception stacks process inputs from cameras, LiDAR, and radar simultaneously — running multiple CV models to detect lanes, pedestrians, cyclists, traffic signs, and obstacles in real time. Tesla's Full Self-Driving system processes eight camera feeds simultaneously using a custom neural network accelerator chip designed specifically for vision inference. Waymo's system runs over a trillion operations per second of vision inference. The challenge is not just accuracy but robustness: a medical image classifier that fails 1% of the time is a problem; an autonomous vehicle vision system that fails 0.001% of the time while driving at highway speed may still cause accidents.
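A back-of-the-envelope calculation shows why even a tiny failure rate matters at this scale. The eight cameras come from the text above; the 30 frames per second figure and the reading of "0.001%" as an independent per-frame failure probability are assumptions made only for this illustration.

```python
import math

cameras = 8                  # from the text
frames_per_second = 30       # assumed frame rate per camera
seconds_per_hour = 3600
failure_rate = 0.001 / 100   # 0.001%, treated here as a per-frame probability

frames_per_hour = cameras * frames_per_second * seconds_per_hour
expected_failures_per_hour = frames_per_hour * failure_rate

print(frames_per_hour)
print(expected_failures_per_hour)
```

Under these assumptions the system processes 864,000 frames per hour and would mishandle several of them every hour of driving, which is why perception stacks combine multiple sensors and consecutive frames rather than trusting any single prediction.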

Manufacturing quality control

Traditional visual inspection relied on human inspectors who fatigue, have bad days, and cannot examine every product at production line speed. CV inspection systems run 24/7 at millisecond latency, examining every unit. Landing AI deploys visual inspection systems on semiconductor wafer fabrication lines to detect defects smaller than a micron. BMW uses CV to inspect welds, paint quality, and component fit across vehicle assembly. The economics are compelling: one CV system can replace dozens of human inspectors while catching defects earlier, reducing waste, and producing a digital record of every inspection.

The distribution shift problem in production vision systems

A CV model trained on images from one factory, one camera model, and one lighting configuration may perform poorly when deployed in a facility with different equipment. The model learned to recognize defects given the specific visual signature of its training environment — not the abstract concept of "defect." When that environment changes (camera replaced, lighting upgraded, product line modified), performance degrades. This is distribution shift, and it's the central operational challenge of deployed vision systems. Monitoring for it, collecting new training data continuously, and retraining regularly are not optional — they're core operational requirements.
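Monitoring for distribution shift can start with very simple statistics. The sketch below compares the average brightness of incoming images against a reference set from training time and flags large deviations. The brightness statistic, the threshold of 3.0, and the synthetic data are illustrative assumptions; production monitoring would typically track richer signals such as model confidence or learned feature distributions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_images(mean_brightness, count=200):
    """Synthetic grayscale images whose overall brightness varies per image."""
    level = rng.normal(mean_brightness, 0.05, size=(count, 1, 1))
    return level + rng.normal(0.0, 0.1, size=(count, 16, 16))

def brightness_drift(reference_images, incoming_images):
    """How many reference standard deviations the mean brightness has moved."""
    reference = reference_images.mean(axis=(1, 2))  # one value per image
    incoming = incoming_images.mean(axis=(1, 2))
    return float(abs(incoming.mean() - reference.mean()) / (reference.std() + 1e-12))

DRIFT_THRESHOLD = 3.0  # hypothetical alert level

training_images = make_images(0.50)
same_conditions = make_images(0.50)
new_lighting = make_images(0.75)    # e.g. after a lighting upgrade

score_same = brightness_drift(training_images, same_conditions)
score_shifted = brightness_drift(training_images, new_lighting)

print(score_same, score_same > DRIFT_THRESHOLD)
print(score_shifted, score_shifted > DRIFT_THRESHOLD)
```

A score above the threshold does not prove the model is failing, but it is a cheap signal that the inputs no longer look like the training data and that fresh labeled samples should be collected and evaluated.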