Deep Residual Network Architectural Design

Deep Residual Networks (ResNets)

  • Award-winning, simple, and clean architecture for training “very” deep nets for computer vision

  • Invented by Microsoft Research in 2015

  • Won 1st place in all five main tracks of the ImageNet and COCO Large Scale Visual Recognition Competitions in 2015

    • ImageNet Classification: “Ultra-deep” 152-layer nets

    • ImageNet Detection: 16% better than 2nd

    • ImageNet Localization: 27% better than 2nd

    • COCO Detection: 11% better than 2nd

    • COCO Segmentation: 12% better than 2nd

Motivation

  • Network depth is of crucial importance in neural network architectures

  • Deeper networks are more difficult to train.

  • Residual learning eases training

  • Enables networks to be substantially deeper with improved performance

Training Increasingly Deep Networks

Common Practices

Initialization

  • Careful initialization of model weights

    • Avoids exploding or vanishing gradients during training (see the sketch below)
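A minimal PyTorch sketch of this kind of careful initialization, using the Kaiming (He) scheme proposed by the same authors; the helper name `init_weights` is illustrative:

```python
import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    """Kaiming (He) initialization, suited to ReLU networks."""
    if isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, mode="fan_out",
                                nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Usage: model.apply(init_weights) visits every submodule.
```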

Batch Normalization

  • Normalize each layer’s activations over each training mini-batch

    • Accelerates training

    • Reduces sensitivity to initialization, for more stable training

    • Adds a regularizing effect (better generalization); see the sketch below
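A minimal PyTorch sketch of the Conv → BatchNorm → ReLU ordering used throughout the original ResNet; the helper name `conv_bn_relu` is illustrative:

```python
import torch.nn as nn

# Conv -> BatchNorm -> ReLU. BatchNorm normalizes each channel over
# the current mini-batch, then applies a learned scale and shift.
def conv_bn_relu(in_ch: int, out_ch: int, stride: int = 1) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride,
                  padding=1, bias=False),  # bias is redundant before BN
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```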

Simply stacking more layers does not improve performance

  • A 56-layer plain CNN has higher training error and test error than a 20-layer CNN

  • Accuracy saturates with depth and then degrades; since training error also rises, this is not overfitting

ResNet Introduces Residual Learning

Original Paper: Deep Residual Learning for Image Recognition (He et al., 2015)

Plain Layer Stacking

Stacking with Residual Connection

  • add an identity skip connection (see the sketch after this list)

  • the layer learns a residual mapping instead

  • makes optimization easier, especially for deeper networks

  • helps propagate the signal deep into the network with less degradation
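A minimal PyTorch sketch of a basic residual block with an identity skip connection, assuming input and output shapes match (so no projection is needed):

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Minimal residual block: output = ReLU(F(x) + x).

    Assumes input and output have the same shape, so the skip
    connection is a pure identity.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(residual + x)  # identity skip connection
```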

New assumption:

  • the optimal function is closer to an identity mapping than to a zero mapping

  • it is easier for the network to learn the residual error (formalized below)

  • each layer is responsible for fine-tuning the output of the previous block (instead of having to generate the desired output from scratch)
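In the notation of the original paper, where H(x) is the desired mapping and F(x) the learned residual:

```latex
% A residual block adds its input back to the learned transform:
y = \mathcal{F}(x, \{W_i\}) + x,
\qquad \mathcal{F}(x) = \mathcal{H}(x) - x
% If the optimal H is near the identity, the residual F is near zero,
% which is easier for stacked nonlinear layers to learn.
```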

Bottleneck Design

Practical design for going deeper

  • sandwich a 3x3 conv layer between two 1x1 conv layers

  • similar computational complexity

  • better representation

  • the 1x1 convs reduce (and then restore) tensor dimensionality around the 3x3 conv layer (sketched below)
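A minimal PyTorch sketch of a bottleneck block, assuming (as in ResNet-50/101/152) an internal width of one quarter of the block width:

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """Bottleneck residual block: 1x1 reduce -> 3x3 -> 1x1 expand."""
    def __init__(self, channels: int):
        super().__init__()
        mid = channels // 4  # the "bottleneck" width
        self.branch = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False),   # 1x1: reduce
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, bias=False),  # 3x3: transform
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1, bias=False),   # 1x1: expand
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.branch(x) + x)  # identity skip
```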

ResNet Provides Improvements in 3 Key Concepts

Representation

  • training of much deeper networks

  • larger feature space allows for better representation

Optimization

  • Enables smooth forward propagation of the signal and backward propagation of the error (see the gradient identity below)

  • Greatly eases the optimization of deeper models
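A sketch of why the error signal survives, following the analysis in the Identity Mappings paper for the idealized case of pure identity skips (E is the loss, x_l the output of block l):

```latex
% With identity skips, a deep feature is the shallow feature plus
% a sum of residuals:
x_L = x_l + \sum_{i=l}^{L-1} \mathcal{F}(x_i, W_i)
% Differentiating the loss E w.r.t. a shallow feature x_l keeps a
% direct "1" term, so the gradient never has to vanish through a
% long product of layer Jacobians:
\frac{\partial E}{\partial x_l} =
  \frac{\partial E}{\partial x_L}
  \left( 1 + \frac{\partial}{\partial x_l}
         \sum_{i=l}^{L-1} \mathcal{F}(x_i, W_i) \right)
```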

Generalization

  • Deeper ResNets do not overfit the training data

  • They maintain good performance on test data

ResNet Improvement: Pre-Activation

Improvements to the ResNet design: Identity Mappings in Deep Residual Networks (He et al., 2016)

  • a ReLU after the addition can block signal propagation when there are on the order of 1000 layers

  • the pre-activation design (BN and ReLU before each conv) eases optimization and improves generalization (sketched below)
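A minimal PyTorch sketch of a pre-activation residual block (BN → ReLU → Conv ordering, with nothing applied after the addition), assuming matching input and output shapes:

```python
import torch
import torch.nn as nn

class PreActResidualBlock(nn.Module):
    """Pre-activation residual block: BN -> ReLU -> Conv, twice.

    The skip path carries x unchanged; no activation is applied
    after the addition, keeping the identity path clean.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.branch = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.branch(x)  # nothing after the addition
```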

Next Lesson

Implementation of a Deep Residual Neural Network

  • residual skip connections

  • bottleneck blocks
