PyTorch implementation: classic network ResNet


1. Brief description

Networks such as GoogLeNet and VGG demonstrated that deeper networks can extract more abstract, more expressive features and therefore achieve stronger classification performance. As the depth of such a network increases, the feature maps produced by successive layers shrink in height and width while the number of channels gradually grows.
In theory, increasing the depth improves the expressive power of the network, but it makes optimization harder, most notably because of vanishing gradients. During backpropagation the gradient is propagated layer by layer from the output back to the input; the repeated multiplication of small factors along the way drives the gradients of the layers near the input toward zero, so those layers effectively stop training.
To mitigate vanishing gradients, BatchNorm layers can be added to the network and the activation function can be replaced with ReLU; both alleviate the problem to a certain extent.
Another problem with increasing depth is the degradation of deep networks. Suppose we take an existing network and stack more layers on top of it. In theory, once trained to the optimum, the deeper network should perform no worse than the shallow one, because the newly added layers only need to learn an identity mapping; in other words, the solution space of the shallow network is a subset of the solution space of the deep network. In practice, however, deeper networks are not necessarily better than shallow ones, and this is the degradation problem.
The idea of the residual module is to make it easy for the network to realize this identity mapping. As shown in the figure, the residual structure adds a parallel shortcut branch alongside two convolutional layers, and the input is added to the convolutional output just before the final ReLU activation. If the two convolutional layers change the resolution or the number of channels of the input, a 1×1 convolution can be applied on the shortcut branch to match the dimensions so that the addition is possible.
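
In symbols, when the input and output shapes already match, the residual block computes y = g(x) + x; when the main branch changes the resolution or the number of channels, it computes y = g(x) + W_s x, where g(x) denotes the stacked convolution/BatchNorm/ReLU branch and W_s is the 1×1 convolution on the shortcut that matches the dimensions.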

2. Residual structure

The ResNet network uses two types of residual blocks: one consists of two 3×3 convolutions; the other connects 1×1, 3×3, and 1×1 convolutions in series (the bottleneck block).


PyTorch implementation:

import torch
from torch import nn
from torch.nn import functional as F


class Residual_1(nn.Module):
    r"""
    Residual block used in the 18-layer and 34-layer ResNets.
    1. Uses a 3×3 convolutional layer design similar to VGG;
    2. Two 3×3 convolutional layers with the same number of output channels, each followed by batch normalization; a ReLU activation follows the first;
    3. A shortcut path skips the convolutional layers and is added to the output just before the final ReLU activation;
    4. If the convolutions change the resolution or the number of channels, a 1×1 convolution on the shortcut path matches the dimensions.
    """
    def __init__(self, input_channels, num_channels, use_1x1conv=False, strides=1):
        r"""
        parameters:
            input_channels: number of input channels
            num_channels: Number of output channels
            use_1x1conv: Is it necessary to use 1 x1 convolution control size
            stride: stride of the first convolution
        """
        super().__init__()
        # 3×3 convolution, strides controls whether the resolution is reduced
        self.conv1 = nn.Conv2d(input_channels, 
                               num_channels,
                               kernel_size=3, 
                               padding=1, 
                               stride=strides)
        # 3×3 convolution without changing resolution
        self.conv2 = nn.Conv2d(num_channels,
                               num_channels, 
                               kernel_size=3, 
                               padding=1)
        
        # Transform input resolution and channels using 1x1 convolution
        if use_1x1conv:
            self.conv3 = nn.Conv2d(input_channels, 
                                   num_channels, 
                                   kernel_size=1, 
                                   stride=strides)
        else:
            self.conv3 = None
            
        # batch normalization layer
        self.bn1 = nn.BatchNorm2d(num_channels)
        self.bn2 = nn.BatchNorm2d(num_channels)
        
    def forward(self, X):
        
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        # print(X.shape)
        Y += X
        return F.relu(Y)
        
class Residual_2(nn.Module):
    r""" 
    Bottleneck residual block used in the 50-layer, 101-layer and 152-layer ResNets.
    1. First a 1×1 convolution, batch normalization and ReLU activation;
    2. Then a 3×3 convolutional layer, batch normalization and ReLU activation;
    3. Then another 1×1 convolutional layer followed by batch normalization;
    4. A shortcut path skips the convolutional layers and is added to the output just before the final ReLU activation;
    5. If the convolutions change the resolution or the number of channels, a 1×1 convolution on the shortcut path matches the dimensions.
    """
    def __init__(self, input_channels, num_channels, use_1x1conv=False, strides=1):
        r"""
        Parameters:
            input_channels: number of input channels
            num_channels: number of output channels
            use_1x1conv: whether to use a 1x1 convolution on the shortcut to match the output shape
            strides: stride of the first convolution (and of the shortcut convolution)
        """
        super().__init__()
        # 1×1 convolution, strides controls whether the resolution is reduced
        self.conv1 = nn.Conv2d(input_channels, 
                               num_channels,
                               kernel_size=1, 
                               stride=strides)
        # 3×3 convolution without changing resolution
        self.conv2 = nn.Conv2d(num_channels,
                               num_channels, 
                               kernel_size=3, 
                               padding=1)
        # 1×1 convolution without changing resolution
        self.conv3 = nn.Conv2d(num_channels, 
                               num_channels,
                               kernel_size=1)
        
        # Transform input resolution and channels on the shortcut using a 1x1 convolution
        if use_1x1conv:
            self.conv4 = nn.Conv2d(input_channels, 
                                   num_channels, 
                                   kernel_size=1, 
                                   stride=strides)
        else:
            self.conv4 = None
            
        # batch normalization layers
        self.bn1 = nn.BatchNorm2d(num_channels)
        self.bn2 = nn.BatchNorm2d(num_channels)
        self.bn3 = nn.BatchNorm2d(num_channels)
        
    def forward(self, X):
        
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = F.relu(self.bn2(self.conv2(Y)))
        Y = self.bn3(self.conv3(Y))
        if self.conv4:
            X = self.conv4(X)
        Y += X
        return F.relu(Y)
        

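The two blocks can be checked quickly on a dummy input to confirm the output shapes; this assumes the classes above have already been defined:

# Dummy input: batch of 1, 4 channels, 32x32 resolution
X = torch.rand(1, 4, 32, 32)

# Keep the resolution and number of channels unchanged
blk1 = Residual_1(4, 4)
print(blk1(X).shape)    # torch.Size([1, 4, 32, 32])

# Halve the resolution and increase the channels on both paths
blk2 = Residual_2(4, 8, use_1x1conv=True, strides=2)
print(blk2(X).shape)    # torch.Size([1, 8, 16, 16])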
ResNet comes in several depths; the most commonly used variants have 50, 101, or 152 layers. They are all built by stacking the residual modules described above.

3. 18-layer implementation

First define the module consisting of the residual structure:

# ResNet module
def resnet_block(input_channels, num_channels, num_residuals, first_block=False):
    r"""A module composed of residual blocks"""
    blk = []
    for i in range(num_residuals):
        if i == 0 and not first_block:
            blk.append(Residual_1(input_channels, 
                                num_channels, 
                                use_1x1conv=True, 
                                strides=2))
        else:
            blk.append(Residual_1(num_channels, num_channels))
    return blk

Define the first layers of the 18-layer network:

# The first layers of ResNet:
#    1. 7x7 convolutional layer with 64 output channels and stride 2 (followed by BatchNorm and ReLU)
#    2. 3x3 max pooling layer with stride 2
# The input has 1 channel because the grayscale FashionMNIST dataset is used below.
conv_1 = nn.Sequential(nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
                   nn.BatchNorm2d(64), 
                   nn.ReLU(), 
                   nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

Define the residual group module:

# ResNet module
conv_2 = nn.Sequential(*resnet_block(64, 64, 2, first_block=True))
conv_3 = nn.Sequential(*resnet_block(64, 128, 2))
conv_4 = nn.Sequential(*resnet_block(128, 256, 2))
conv_5 = nn.Sequential(*resnet_block(256, 512, 2))

ResNet 18-layer model:

net = nn.Sequential(conv_1, conv_2, conv_3, conv_4, conv_5, 
                    nn.AdaptiveAvgPool2d((1, 1)), 
                    nn.Flatten(), 
                    nn.Linear(512, 10))

# Observe the output size of each layer of the model
X = torch.rand(size=(1, 1, 224, 224))
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__,'output shape:\t', X.shape)

output:

Sequential output shape:	 torch.Size([1, 64, 56, 56])
Sequential output shape:	 torch.Size([1, 64, 56, 56])
Sequential output shape:	 torch.Size([1, 128, 28, 28])
Sequential output shape:	 torch.Size([1, 256, 14, 14])
Sequential output shape:	 torch.Size([1, 512, 7, 7])
AdaptiveAvgPool2d output shape:	 torch.Size([1, 512, 1, 1])
Flatten output shape:	 torch.Size([1, 512])
Linear output shape:	 torch.Size([1, 10])
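
As an additional quick check, the total number of trainable parameters of the network can be printed with a short snippet like this:

# Count the trainable parameters of the 18-layer model defined above
num_params = sum(p.numel() for p in net.parameters() if p.requires_grad)
print(f"trainable parameters: {num_params:,}")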

4. Train on the dataset

import torchvision
from torchvision import transforms


def load_datasets_Cifar10(batch_size, resize=None):
    trans = [transforms.ToTensor()]
    if resize:
        trans.insert(0, transforms.Resize(resize))
    trans = transforms.Compose(trans)
    train_data = torchvision.datasets.CIFAR10(root="../data", train=True, transform=trans, download=True)
    test_data = torchvision.datasets.CIFAR10(root="../data", train=False, transform=trans, download=True)
    
    print("Cifar10 Download completed...")
    return (torch.utils.data.DataLoader(train_data, batch_size, shuffle=True),
            torch.utils.data.DataLoader(test_data, batch_size, shuffle=False))

def load_datasets_FashionMNIST(batch_size, resize=None):
    trans = [transforms.ToTensor()]
    if resize:
        trans.insert(0, transforms.Resize(resize))
    trans = transforms.Compose(trans)
    train_data = torchvision.datasets.FashionMNIST(root="../data", train=True, transform=trans, download=True)
    test_data = torchvision.datasets.FashionMNIST(root="../data", train=False, transform=trans, download=True)
    
    print("FashionMNIST Download completed...")
    return (torch.utils.data.DataLoader(train_data, batch_size, shuffle=True),
            torch.utils.data.DataLoader(test_data, batch_size, shuffle=False))

def load_datasets(dataset, batch_size, resize):
    if dataset == "Cifar10":
        return load_datasets_Cifar10(batch_size, resize=resize)
    else:
        return load_datasets_FashionMNIST(batch_size, resize=resize)

# An empty dataset name selects FashionMNIST, whose grayscale images match the 1-channel input of conv_1
train_iter, test_iter = load_datasets("", 128, 224)
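
A minimal training loop for the model and data iterators defined above could look like the following sketch; the optimizer, learning rate, and number of epochs are illustrative assumptions rather than tuned values:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net.to(device)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.05)  # assumed hyperparameters
num_epochs = 10                                         # assumed number of epochs

for epoch in range(num_epochs):
    # training phase
    net.train()
    for X, y in train_iter:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()
        loss = loss_fn(net(X), y)
        loss.backward()
        optimizer.step()

    # evaluation phase: accuracy on the test set
    net.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for X, y in test_iter:
            X, y = X.to(device), y.to(device)
            correct += (net(X).argmax(dim=1) == y).sum().item()
            total += y.numel()
    print(f"epoch {epoch + 1}, test accuracy {correct / total:.3f}")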

  • Simply stacking more layers changes the function class; what we want is for the new class to contain the old one, so that adding layers cannot make the model worse
  • The residual block adds a shortcut (fast) channel, giving the structure f(x) = x + g(x)

ResNet

Two kinds of ResNet blocks:

  • One halves the height and width (stride 2) and increases the number of channels
  • It is followed by multiple ResNet blocks that keep the height and width constant

The overall architecture is similar to that of VGG and GoogLeNet, but with the stages built from ResNet blocks.

  • Residual blocks make training very deep networks easier
    • You can even train a thousand layers
  • Residual networks have had a profound impact on subsequent neural network designs, whether convolutional or fully connected.

Why ResNet alleviates vanishing gradients
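
A sketch of the usual argument: a residual block computes y = x + g(x), so by the chain rule the gradient of the loss L with respect to the block input is

    dL/dx = dL/dy · (I + dg(x)/dx)

The identity term dL/dy reaches the earlier layers directly through the shortcut, no matter how small the gradient through the convolutional branch dg(x)/dx becomes. When blocks are stacked, the shortcut paths form a direct route from the loss back to the early layers, so their gradients do not shrink toward zero the way they do in a plain deep network.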
