
Computer Vision Fundamentals: CNNs, ResNet, ViT, and Production Transfer Learning

The core computer vision concepts every ML engineer needs: convolution mechanics, why ResNet's skip connections solved deep network training, the inductive bias tradeoffs between CNNs and Vision Transformers (ViT), and a production-grade transfer learning guide. Covers CNN architectures from LeNet to EfficientNet, object detection (YOLO, R-CNN family), and when CNNs still beat ViTs.

Tags: Computer Vision, CNN, Convolution, ResNet, EfficientNet, ViT, Vision Transformer, Transfer Learning, Object Detection, YOLO, Image Classification, Feature Maps, Receptive Field, Batch Normalization

Convolution: The Core Operation of All CNN Architectures

A convolution applies a learnable filter (kernel) across an image to produce a feature map. For a 3×3 filter applied to a 32×32 input with 3 channels:

  • The filter has 3×3×3 = 27 weights plus 1 bias = 28 parameters
  • Output size: (32 - 3 + 2×padding)/stride + 1 = 32 with padding=1, stride=1 (same convolution)
  • Each output pixel is the dot product of the filter weights with the local image patch — detecting a specific local pattern (edge, texture, corner)
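A minimal PyTorch sketch to verify these numbers (the layer sizes are just the 3-channel, 32×32 example above):

import torch
import torch.nn as nn

# 3x3 convolution over a 3-channel input producing 1 feature map;
# padding=1, stride=1 gives a "same" convolution.
conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3, stride=1, padding=1)

n_params = sum(p.numel() for p in conv.parameters())
print(n_params)                      # 28 = 3*3*3 weights + 1 bias

x = torch.randn(1, 3, 32, 32)        # (batch, channels, height, width)
print(conv(x).shape)                 # torch.Size([1, 1, 32, 32]) — spatial size preserved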

Weight sharing: The same filter slides across the entire image. A 3×3 filter detects the same feature (e.g., a horizontal edge) everywhere in the image. This is the translation equivariance property: if you shift the input, the feature map shifts by the same amount. This inductive bias is extremely useful for images where the same object can appear anywhere in the frame.

Receptive field: How much of the input each output unit 'sees'. After 1 convolutional layer with 3×3 filters: 3×3. After 2 layers: 5×5. After 10 layers: 21×21. Deeper networks see larger context. Modern architectures use dilated convolutions (skip pixels in the filter) to rapidly expand the receptive field without losing resolution.
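For stride-1 convolutions, the receptive field grows by (kernel_size − 1) × dilation per layer. A tiny illustrative helper (plain Python, assuming stride 1 throughout):

def receptive_field(layers):
    """layers: list of (kernel_size, dilation) tuples for stride-1 convs."""
    rf = 1
    for k, d in layers:
        rf += (k - 1) * d
    return rf

print(receptive_field([(3, 1)] * 2))              # 5  -> two 3x3 layers see 5x5
print(receptive_field([(3, 1)] * 10))             # 21 -> ten 3x3 layers see 21x21
print(receptive_field([(3, 1), (3, 2), (3, 4)]))  # 15 -> dilation expands the RF quickly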

1×1 convolutions: Project the channel dimension without any spatial mixing. With C_in input channels and C_out filters, a 1×1 conv has C_in × C_out weights and costs C_in × C_out multiply-adds per spatial position, 9× cheaper than a 3×3 conv's 9 × C_in × C_out. Used in: bottleneck blocks (ResNet), channel mixing in depthwise separable convolutions (MobileNet), and dimensionality reduction on projection shortcuts (ResNet).

Pooling vs strided convolution: Max pooling takes the maximum activation in a 2×2 region → downsample 2×. A strided convolution (stride=2) achieves the same downsampling. Modern architectures such as ResNet favor strided convolutions for in-network downsampling because pooling discards information by a fixed rule, while a strided conv can learn what to keep. Global Average Pooling (GAP) at the final layer averages each feature map to a scalar, replacing large fully connected classifiers. ResNet and EfficientNet use GAP before the final linear layer; ViT typically uses a learned class token instead, though GAP variants exist.
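A short PyTorch comparison of the three options just discussed (shapes are illustrative):

import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)                        # (batch, channels, H, W)

# Max pooling: fixed rule, no parameters, halves spatial resolution.
pooled = nn.MaxPool2d(kernel_size=2, stride=2)(x)     # -> (1, 64, 16, 16)

# Strided convolution: learned downsampling (what ResNet stages use).
strided = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)(x)  # -> (1, 128, 16, 16)

# Global Average Pooling: each feature map collapses to one scalar,
# feeding a small linear classifier instead of large FC layers.
gap = nn.AdaptiveAvgPool2d(1)(x).flatten(1)           # -> (1, 64)
logits = nn.Linear(64, 10)(gap)                       # -> (1, 10)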

CNN Architecture Evolution: Key Milestones

[Diagram: CNN architecture milestones from LeNet to EfficientNet]

ResNet's Skip Connections: Why They Actually Work

Before ResNet (He et al., 2015), training networks deeper than 20 layers would degrade — not just overfit, but perform worse on training data than shallower models. This was the degradation problem: deeper networks couldn't even learn the identity function.

The skip connection solution: Instead of learning the desired mapping H(x), force the block to learn the residual F(x) = H(x) - x. The block output becomes F(x) + x.

Why does this work?

  1. Gradient highway: During backpropagation, the skip connection provides a direct gradient path from the loss to early layers: ∂L/∂x = ∂L/∂H · (1 + ∂F/∂x). The '1' term means gradients never fully vanish — they can flow straight back through the skip connections, bypassing layers that may have near-zero gradients.

  2. Easier optimization for identity: If the optimal transformation for a block is the identity (no-op), it's much easier to push weights toward zero (making F(x) → 0) than to learn an explicit identity mapping. Residual connections are an inductive bias toward identity mappings when blocks aren't needed.

  3. Ensemble interpretation: A ResNet with N blocks can be viewed as an ensemble of 2^N paths of varying depths (Veit et al., 2016). Most paths are short. This distributes the learning burden and makes the network robust to removing individual blocks.

ResNet bottleneck block (ResNet-50 and deeper): 1×1 conv (reduce channels: 256→64) → 3×3 conv (64→64) → 1×1 conv (expand: 64→256). This costs far fewer FLOPs than stacking two 3×3 convolutions at the full 256 channels (roughly an order of magnitude fewer multiply-adds), which is what makes 50+ layer ResNets affordable. The projection shortcut (1×1 conv on the skip path) is used when input and output dimensions differ.

BatchNorm position: Original ResNet: Conv → BN → ReLU → Conv → BN → Add → ReLU. Pre-activation ResNet (He et al., 2016): BN → ReLU → Conv → BN → ReLU → Conv → Add. Pre-activation trains slightly better in practice, especially for very deep networks, and the BN → ReLU → Conv ordering was adopted by later designs such as Wide ResNet and DenseNet.
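Putting the pieces together, a minimal sketch of a post-activation bottleneck block (channel sizes follow the 256→64→256 example above; this is an illustrative reduction, not torchvision's ResNet code):

import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 expand, with an identity skip connection."""
    def __init__(self, channels=256, bottleneck=64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, bottleneck, 1, bias=False),               # reduce: 256 -> 64
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, bottleneck, 3, padding=1, bias=False),  # 3x3 at 64 channels
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, 1, bias=False),               # expand: 64 -> 256
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.block(x) + x)   # F(x) + x: the residual connection

y = Bottleneck()(torch.randn(1, 256, 56, 56))  # shape preserved: (1, 256, 56, 56)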

CNNs vs Vision Transformers: The Inductive Bias Tradeoff

The central question in modern computer vision: CNN or ViT?

CNNs have 3 strong inductive biases built into the architecture:

  1. Locality: Each neuron sees only a local neighborhood (3×3, 5×5). Appropriate because nearby pixels are more correlated than distant ones.
  2. Weight sharing: The same filter is applied everywhere. Makes CNNs translation-equivariant and extremely data-efficient.
  3. Hierarchical feature learning: Pooling creates a coarse-to-fine pyramid: edges → textures → parts → objects.

ViT has almost no image-specific inductive biases: apart from the initial patch embedding, every patch attends to every other patch globally from layer 1. The model must learn locality, translation equivariance, and hierarchical features entirely from data. Consequences:

  • ViT underperforms CNNs on small datasets (< 1M images). On ImageNet-1K alone, ViT-B/16 underperforms ResNet-50 without additional pretraining.
  • ViT outperforms CNNs at scale: pretrained on ImageNet-21K (14M images) or JFT-300M (300M images), ViT exceeds EfficientNet on most benchmarks.
  • ViT is more robust: more resistant to adversarial perturbations, natural distribution shifts, and occlusions than CNNs.

Swin Transformer (Liu et al., 2021) bridges the gap: introduces local window attention (W×W patch neighborhoods) with shifted windows for cross-window connectivity. Achieves O(N) complexity vs ViT's O(N²). Produces hierarchical feature maps compatible with FPN-based detection heads. State-of-the-art on COCO object detection (58.7 mAP) at time of release.

Practical decision guide:

  • Dataset < 100K images, no large pretrained model available → CNN (EfficientNet-B0 to B4)
  • Dataset 100K–1M images → CNN fine-tuned from ImageNet, or ViT fine-tuned from ImageNet-21K
  • Dataset > 1M images or using large pretrained model → ViT (ViT-L, Swin-L, DINOv2)
  • Real-time mobile/edge inference → MobileNetV3, EfficientNet-Lite, or MobileViT
  • Object detection with FPN → Swin Transformer backbone (hierarchical feature maps required)

CNN vs ViT — Detailed Comparison for Production Decisions

| Property | ResNet-50 / CNN | ViT-B/16 (standard) | Swin-B (Transformer hybrid) |
| --- | --- | --- | --- |
| Architecture | Convolutional + skip connections | Patch embeddings + global self-attention | Local window attention + shifted windows |
| Inductive biases | Locality, weight sharing, translation equivariance | None — all learned from data | Weak locality (window attention) |
| Data requirement | Low — works well on 10K+ images | High — needs 1M+ or strong pretraining | Medium — pretraining recommended |
| Attention complexity | N/A (convolution is O(N×k²×C)) | O(N²) — global, expensive for large inputs | O(N×W²) — local windows, linear in image size |
| Hierarchical features | Yes — natural from pooling layers | No — single scale (problematic for detection) | Yes — shifted windows create multi-scale |
| Transfer learning | Excellent — strong ImageNet pretrain | Excellent — DINOv2, MAE, SigLIP pretrains | Excellent — MS COCO pretrain available |
| Inference latency | EfficientNet-B0: ~6ms/img on V100 (typical) | ViT-B/16: ~18ms/img on V100 (typical) | Swin-B: ~22ms/img on V100 (typical) |
| Best for | Small data, edge deployment, real-time | Large-scale classification, VLMs, foundation models | Object detection, segmentation, dense prediction |

Transfer Learning: The Practical Guide

Transfer learning is the single most important practical technique in CV: start from a model pretrained on a large dataset (ImageNet, JFT, LAION), fine-tune on your task. Training from scratch on < 1M images almost never beats fine-tuning from a pretrained model.

When to freeze vs fine-tune:

  • Linear probe first: Always start by training only the final classification head with the backbone frozen. This establishes a performance floor and takes minutes. If accuracy is acceptable, stop here.
  • Fine-tune top layers: Freeze early layers (basic edges/textures transfer universally), train later layers + head. Good when domain shift is moderate.
  • Full fine-tuning: Train all layers with a low learning rate (10–100× lower than scratch training). Use when the target domain is very different from ImageNet (medical imaging, satellite imagery, microscopy).

Layer freezing strategy by domain distance:

  • Target data similar to ImageNet (natural photos, web images): freeze all, train head only
  • Moderate domain shift (food, fashion, automotive): freeze layers 1–3, fine-tune layers 4–last + head
  • Large domain shift (chest X-rays, aerial imagery, microscopy): full fine-tuning with small lr (1e-5)

Self-supervised pretrained models are superior for domain shift: DINOv2 (Meta, 2023) — ViT trained with self-supervised DINO objective on 142M images — produces features that transfer better than supervised ImageNet features for most downstream tasks, especially low-data regimes. Use dinov2-vit-base-patch14 as the default backbone for new CV projects.
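If you adopt DINOv2 as the backbone, a linear probe is the natural first step. A hedged sketch using timm (the model identifier and the 518×518 default input size should be verified against your timm version, e.g. via timm.list_models('*dinov2*') and backbone.default_cfg):

import timm
import torch

# Frozen DINOv2 backbone + trainable linear head (linear probe).
backbone = timm.create_model(
    "vit_base_patch14_dinov2.lvd142m",   # assumed timm name for the LVD-142M checkpoint
    pretrained=True,
    num_classes=0,                        # num_classes=0 -> return pooled features, no head
)
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

head = torch.nn.Linear(backbone.num_features, 10)     # 10 = your number of classes

with torch.no_grad():
    feats = backbone(torch.randn(2, 3, 518, 518))     # 518x518 is this checkpoint's default
logits = head(feats)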

Data augmentation: The #2 lever after architecture. For ImageNet-scale training: RandAugment (a pool of 14 operations, 2 applied per image) improves top-1 accuracy by 1–2%. Mixup (blend two images and their labels) and CutMix (cut-and-paste image patches) each add roughly 0.5% on top. Use all three together: torchvision.transforms + timm.data.Mixup. For small datasets, use stronger augmentation (wider crop ratios, heavier color jitter, more aggressive cutout).
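A sketch of that combined pipeline (hyperparameters are common defaults, not tuned values; Mixup/CutMix produce soft labels, so pair them with a soft-target cross-entropy loss):

import torch
from torchvision import transforms
from timm.data import Mixup

# Per-image augmentation: RandAugment with the standard N=2 operations.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=9),
    transforms.ToTensor(),
])

# Batch-level augmentation: Mixup and CutMix, applied inside the training loop.
mixup_fn = Mixup(
    mixup_alpha=0.2, cutmix_alpha=1.0,
    label_smoothing=0.1, num_classes=1000,   # set num_classes to your class count
)

# images, labels = next(iter(train_loader))
# images, soft_labels = mixup_fn(images, labels)   # labels become soft targets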

Transfer Learning with EfficientNet — Production Pattern

transfer_learning.py (Python)
import timm
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR

def create_cv_model(num_classes: int, pretrained: bool = True,
                    model_name: str = "efficientnet_b0"):
    """
    Production pattern: timm covers 900+ pretrained models.
    efficientnet_b0: 5.3M params, ~6ms/image on V100 — production real-time default
    efficientnet_b4: 19M params, ~14ms/image — accuracy-latency sweet spot
    vit_base_patch16_224: 86M params — high accuracy, not real-time
    swin_base_patch4_window7_224: 88M params — best for detection backbones
    """
    model = timm.create_model(
        model_name,
        pretrained=pretrained,
        num_classes=num_classes,
    )
    return model

def get_optimizer_for_finetuning(model, base_lr=1e-4, head_lr_multiplier=10.0):
    """
    Layer-wise learning rates: head trains 10× faster than backbone.
    Critical for transfer learning — head needs to adapt quickly while
    backbone features are preserved.
    """
    backbone_params = []
    head_params = []
    for name, param in model.named_parameters():
        if 'classifier' in name or 'head' in name or 'fc' in name:
            head_params.append(param)
        else:
            backbone_params.append(param)

    return AdamW([
        {'params': backbone_params, 'lr': base_lr},
        {'params': head_params, 'lr': base_lr * head_lr_multiplier},
    ], weight_decay=1e-4)

# Fine-tuning in 3 phases:
# Phase 1: Freeze backbone, train head (5 epochs, high lr)
# Phase 2: Unfreeze top layers, lower lr (10 epochs)
# Phase 3: Full fine-tuning, very low lr (5 epochs)
def freeze_backbone(model):
    # Match the head-detection logic used in get_optimizer_for_finetuning.
    for name, param in model.named_parameters():
        if 'classifier' not in name and 'head' not in name and 'fc' not in name:
            param.requires_grad = False

def unfreeze_top_k_layers(model, k=3):
    """Roughly unfreeze the top of the backbone.

    Uses the last k*50 parameter tensors as a crude proxy for the last k
    stages; for timm EfficientNets, iterating model.blocks[-k:] gives exact
    stage boundaries instead.
    """
    layers = list(model.named_parameters())
    for name, param in layers[-k * 50:]:
        param.requires_grad = True

Object Detection: YOLO vs R-CNN Family — When to Use Each

R-CNN family (two-stage): Propose regions → classify each region. Accurate but slow (Faster R-CNN: ~5 FPS). Best for: high-accuracy tasks where latency is not critical, such as medical imaging or offline quality inspection where a few frames per second is sufficient.

YOLO family (one-stage): Directly predict bounding boxes on a grid. Fast (YOLOv8: ~100 FPS on A10G), slightly less accurate than Faster R-CNN on small objects. Best for: real-time detection (cameras, autonomous driving, retail). YOLOv8 is the current production default for new object detection projects.

Key metric: COCO mAP, the mean Average Precision averaged over IoU thresholds 0.50–0.95 (mAP@50 is the looser variant that uses only the 0.5 threshold). YOLOv8-x reports ~53.9 mAP@[0.5:0.95] on COCO; torchvision's Faster R-CNN (ResNet-50 FPN) reports ~37.0. YOLOv8 wins on both speed and accuracy for most practical tasks.

Non-Maximum Suppression (NMS): Any detector produces many overlapping boxes. NMS keeps the box with highest confidence and removes all boxes with IoU > threshold (typically 0.5). Production pitfall: NMS is sequential and hard to parallelize — it can be a latency bottleneck at > 100 detections per frame. Use batched NMS (torchvision.ops.batched_nms) or limit max detections per class.
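A minimal example of the torchvision.ops.batched_nms call mentioned above (toy boxes and an illustrative threshold):

import torch
from torchvision.ops import batched_nms

# boxes: (N, 4) in (x1, y1, x2, y2); scores: (N,); class_ids: (N,)
boxes = torch.tensor([[0., 0., 100., 100.],
                      [5., 5., 105., 105.],
                      [200., 200., 300., 300.]])
scores = torch.tensor([0.9, 0.8, 0.7])
class_ids = torch.tensor([0, 0, 1])

# batched_nms suppresses overlaps only within the same class, so boxes of
# different classes never suppress each other.
keep = batched_nms(boxes, scores, class_ids, iou_threshold=0.5)
print(keep)  # tensor([0, 2]) — box 1 overlaps box 0 with IoU > 0.5 and is removed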

Sign in to take quiz →