Pytorch Essay (Lightspeed Introduction)

This article is only for novices like me; experienced readers can safely skip it!

While flipping through my notes I found this blog post:
Aha~ Take a day to quickly get started with Pytorch (probably the most complete walkthrough on the whole internet, from 0 to deployment)

But I found that the part about gradients is not explained clearly. Unlike TensorFlow, PyTorch does not declare a computation graph up front, so it is easy to lose track of how gradients are computed and where they go, and this trips people up all the time; it bit me when I wrote my first GAN, my first RL project, and a recent object detection framework. The main source of confusion is how the gradients are recorded, so this post is written as a supplement to that article.

That blog post covers a lot of content, so if you want to get started at "lightspeed" you should read it first; this post picks up roughly from its discussion of gradients.

Pytorch Gradient

If you already understand the basics of neural networks or linear regression (we will again use linear regression as the example here), you know that computing derivatives and gradients is essential, and writing the partial derivatives by hand is actually a bit complicated, especially for deep networks. Fortunately PyTorch ships with its own automatic differentiation, and it can compute the gradients for us directly.

One difference between numpy arrays and the tensors we use here is that a tensor can carry a gradient (although it does not by default).

But we can make a tensor track gradients like this:

a = torch.tensor(5.0,requires_grad=True)

Then we do this:

a = torch.tensor(5.0,requires_grad=True)
b = a*5
c = b/2
print(b)
print(c)
with torch.no_grad(): # operations inside this block are not tracked by autograd
    d = b*c
print(d)
c.backward()

Then we can see the result:

tensor(25., grad_fn=<MulBackward0>)
tensor(12.5000, grad_fn=<DivBackward0>)
tensor(312.5000)

Let's analyze this first.
We need the concept of a computational graph, which is front and center in TensorFlow; PyTorch builds and handles it automatically to lower the barrier to entry.

When we created the variable a, PyTorch also started recording the chain of operations applied to it, and put them into a graph.

When c back-propagates, autograd finds the corresponding chain and keeps walking backwards until it reaches a, and finally produces the partial derivative with respect to a.

So you will find that no matter what operations come later, autograd can still locate the original a, because the memory behind a has not been replaced; in other words, a has not been modified by an in-place operation, and in that case everything behaves normally.
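To see why the in-place caveat matters, here is a small illustration I added (the exact error text can vary between PyTorch versions): if a tensor that autograd saved for the backward pass is modified in place, backward() fails.

import torch

a = torch.tensor(5.0, requires_grad=True)
b = a * 5
c = b ** 2        # the backward of ** needs the original value of b
b.add_(1.0)       # in-place modification changes b's version counter
try:
    c.backward()  # autograd notices b was changed and raises a RuntimeError
except RuntimeError as e:
    print(e)      # "...has been modified by an inplace operation..."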

Now let's simulate something closer to a weight update by introducing another variable a2. The code becomes the following.

import torch

a = torch.tensor(5.0,requires_grad=True)
a2 = torch.tensor(5.0,requires_grad=True)
b = a*5
c = b/2+a2*5
print(b)
print(c)
with torch.no_grad():
    d = b*c
print(d)
c.backward()

So we have added a variable to our graph, and the two leaf variables feed into the same final expression; if we trace backwards from the result we can see that a and a2 live in the same graph.

When c back-propagates again, the derivative with respect to each leaf is computed separately by tracing back along its chain.

We can see the corresponding gradient:

print(a.grad)
print(b.grad)
print(a2.grad)
tensor(2.5000)
None
tensor(5.)

Also notice that b has no gradient: b is only an intermediate (non-leaf) tensor, so PyTorch does not keep a gradient for it by default.
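If you do want the gradient of an intermediate tensor like b, PyTorch offers retain_grad() for that; a minimal sketch:

import torch

a = torch.tensor(5.0, requires_grad=True)
b = a * 5
b.retain_grad()   # ask autograd to also keep the gradient of this non-leaf tensor
c = b / 2
c.backward()
print(a.grad)     # tensor(2.5000)
print(b.grad)     # tensor(0.5000), no longer None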

Linear regression

original

Now let's look at linear regression again. The original article used the high-level API, so here, now that we understand gradients, let's look at the most primitive way of writing it in PyTorch (even more primitive would be deriving the gradients by hand on top of numpy; if you are interested, see the article "Logistic regression demo". Although that is logistic regression, compared with linear regression it only adds an activation function: the steps are similar, the theory is a little more involved, and the code in it is actually simple).

import torch
import numpy as np
from matplotlib import pyplot as plt
#1, prepare data y=3X+0.8, prepare parameters
x = torch.rand([500,1])
# This is the objective function we need to fit
y=3*x+0.8
w=torch.rand([1,1],requires_grad=True)
b=torch.tensor(0,requires_grad=True,dtype=torch.float32)

learning_rate = 0.01

for i in range(100):
    y_predict = torch.matmul(x,w)+b
    loss = (y - y_predict).pow(2).mean()
    if(w.grad is not None):
        w.grad.data.zero_()
    if(b.grad is not None):
        b.grad.data.zero_()
        
    loss.backward()
    w.data = w.data - learning_rate*w.grad
    b.data = b.data - learning_rate*b.grad
    # In a real network the optimizer does this update for us; here we do it by hand
    if((1+i)%100==0):
        print("w,b,loss",w.item(),b.item(),loss.item())        
fig, ax = plt.subplots() # Create a figure instance
ax.plot(x.numpy().reshape(-1),y.numpy().reshape(-1),label='y_true')
y_predict = torch.matmul(x,w)+b
ax.plot(x.numpy().reshape(-1),y_predict.detach().numpy().reshape(-1),label='y_pred')
ax.legend()
plt.show()

The result is as follows:

Advanced API version

Now let's use the higher-level API and build a network. Yes, we can use a neural network to do linear regression, because a layer of a neural network is essentially a linear transformation plus an activation function; in other words, do the linear regression, apply a function on top, and let autograd take the partial derivatives.

import torch
import torch.nn as nn
from torch.optim import SGD
x=torch.rand([500,1])
y_true=3*x+0.8
#1. Define the model
class MyLinear(nn.Module):
    def __init__(self):

        super(MyLinear,self).__init__()
        self.linear=nn.Linear(1,1)
    def forward(self,x):
        out=self.linear(x)
        return out
#2. Instantiate the model, the optimizer, and the loss
my_linear=MyLinear()
optimizer=SGD(my_linear.parameters(),0.001) 
#The optimizer updates the weights; it is equivalent to the manual version above:
#w.data = w.data - learning_rate*w.grad
#b.data = b.data - learning_rate*b.grad
loss_fn=nn.MSELoss()
#3. Loop, gradient descent, parameter update
for i in range(2000):

    y_predict=my_linear(x)
    loss=loss_fn(y_predict,y_true)

    optimizer.zero_grad()
    
    loss.backward()
    optimizer.step()
    if ((i+1)%100==0):
        print(loss.item(),list(my_linear.parameters()))
    

optimizer

This is also something I want to add. As you can see, the optimizer is just the thing that updates our weights w.

But why talk about it separately? The reason is simple: basically no optimization algorithm can avoid convergence problems, and, most importantly, the original blog post did not cover this.
In a neural network, then, a lot hinges on the optimizer, the thing that updates the weights. In the simplest case it has only one hyperparameter, the learning rate, but that learning rate has a very large impact on whether the whole network converges. And since the optimizer works by following derivatives, if the objective function has multiple peaks the algorithm itself can land in a local optimum, and the resulting model is not as good as we would like. Of course many factors determine the quality of a model; honestly, even after working through the detailed derivations of several machine learning algorithms, my picture of how the complete model behaves is still fuzzy. I also wondered whether combining these optimizers with heuristic algorithms would give better optimization, and then I was pinned down by a law: there is no free lunch. With a reasonable amount of compute, imperfect data and "cheap" algorithms, it is perfectly acceptable to end up with an algorithm that does not look perfect but is reasonably reliable within its operating range, because there is simply too much uncertainty; in plain terms, stacking things that are individually unreliable can still turn out to be reliable.... Of course, that is also the charm of connectionism; at least compared with the statistical school I lose less hair, because those formulas are really hard to derive and really tiring to push through.

In practice there is actually quite a bit of nuance to choosing one.
Let's talk about a few common ones.

We will only skim them here; the main thing to know is that some optimizers are more complicated because they keep extra internal state (hyperparameters plus running statistics) during training, and that state can be saved. Why can it be saved? The algorithms below are the reason.
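For example, an optimizer's internal state can be inspected and saved through state_dict(); a quick sketch (the momentum value is only an illustrative choice):

import torch
import torch.nn as nn
from torch.optim import SGD

layer = nn.Linear(1, 1)
optimizer = SGD(layer.parameters(), lr=0.001, momentum=0.9)
# 'param_groups' holds hyperparameters such as lr and momentum;
# 'state' holds per-parameter buffers once optimization steps have been taken
print(optimizer.state_dict())
torch.save(optimizer.state_dict(), "./optimizer_demo.pt")  # restore later with load_state_dict()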


Gradient descent algorithm (batch gradient descent BGD)

Every iteration feeds in all of the samples. The advantage is that each iteration considers the entire data set, so the update moves toward the global optimum.
This is the most primitive version, the one we used above; we only used linear regression as the example there because its loss function is unimodal and differentiable.

Stochastic gradient descent SGD

Stochastic gradient descent was proposed to address the slow training speed of batch gradient descent. It randomly picks a sample (or a small group) from the data set, performs one gradient update with it, then picks another group and updates again.
This way the loss can often reach an acceptable level without training on all of the samples for every update.

Mini-batch gradient descent MBGD

SGD is much faster, but it has problems of its own. Because a single sample brings a lot of noise, SGD does not move toward the overall optimum at every iteration; it may converge quickly at the start of training and then become very slow after a while. Mini-batch gradient descent builds on this by randomly selecting a small batch of samples each time rather than a single one, so that both the quality of the updates and the speed are preserved.
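To make the difference concrete, here is a small sketch I added that reuses the earlier linear-regression setup but takes each gradient step on a random mini-batch instead of the full data set (batch_size = 32 is just an illustrative choice):

import torch

x = torch.rand([500, 1])
y = 3 * x + 0.8
w = torch.rand([1, 1], requires_grad=True)
b = torch.zeros(1, requires_grad=True)
learning_rate = 0.01
batch_size = 32

for i in range(100):
    idx = torch.randint(0, x.shape[0], (batch_size,))  # pick a random mini-batch
    y_predict = torch.matmul(x[idx], w) + b
    loss = (y[idx] - y_predict).pow(2).mean()
    if w.grad is not None:
        w.grad.data.zero_()
    if b.grad is not None:
        b.grad.data.zero_()
    loss.backward()
    w.data = w.data - learning_rate * w.grad
    b.data = b.data - learning_rate * b.grad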

Momentum

Although mini-batch SGD gives good training speed, it does not always truly reach the optimum when it gets close to it; instead it can end up hovering around the optimum point.
Another drawback is that mini-batch SGD requires us to choose a suitable learning rate. If the learning rate is too small the network converges too slowly during training; if it is too large the update steps overshoot, possibly skipping right over the optimum. What we want is for the network's loss to converge quickly as training proceeds, without swinging too much.
The Momentum optimizer solves exactly this problem. It applies an exponentially weighted moving average to the gradients, which smooths the parameter updates so that the oscillation of the gradient becomes smaller.

\begin{align*}
v &= 0.8\,v + 0.2\,\nabla w, &&\text{where } \nabla w \text{ is the gradient of } w \text{ at the current step}\\
w &= w - \alpha v, &&\text{where } \alpha \text{ is the learning rate}
\end{align*}
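Written out as code, the update rule looks roughly like this (a sketch of the formula above, not PyTorch's internal implementation):

def momentum_step(w, grad, v, lr=0.01, beta=0.8):
    # v is the exponentially weighted moving average of the gradients
    v = beta * v + (1 - beta) * grad
    w = w - lr * v
    return w, v

In torch.optim this corresponds roughly to SGD(params, lr=0.01, momentum=0.8), although PyTorch's momentum formula scales the gradient term slightly differently.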

AdaGrad

The AdaGrad algorithm accumulates the square of each parameter's gradient over the iterations and divides the global learning rate by (the square root of) this accumulated value, dynamically scaling the learning rate and thereby achieving an adaptive learning rate.

\begin{align*}
gradient &= gradient + (\nabla w)^2\\
w &= w - \frac{\alpha}{\sqrt{gradient} + \delta}\,\nabla w, &&\text{where } \delta \approx 10^{-7}
\end{align*}
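A sketch of the same update rule in code:

def adagrad_step(w, grad, accum, lr=0.01, eps=1e-7):
    # accum accumulates the squared gradients over all iterations
    accum = accum + grad ** 2
    w = w - lr / (accum ** 0.5 + eps) * grad
    return w, accum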

RMSProp

The Momentum method only partially solves the problem of large oscillations during optimization. To further reduce how much the loss swings while also speeding up convergence, the RMSProp algorithm keeps an exponentially weighted average of the squared gradients of the parameters.

\begin{align*}
gradient &= 0.8\,gradient_{history} + 0.2\,(\nabla w)^2\\
w &= w - \alpha\,\frac{\nabla w}{\sqrt{gradient} + \delta}
\end{align*}
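And the corresponding sketch; the accumulator now decays instead of growing forever:

def rmsprop_step(w, grad, accum, lr=0.01, beta=0.8, eps=1e-7):
    # accum is an exponentially weighted average of the squared gradients
    accum = beta * accum + (1 - beta) * grad ** 2
    w = w - lr * grad / (accum ** 0.5 + eps)
    return w, accum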

Adam

The Adam (Adaptive Moment Estimation) algorithm combines the Momentum algorithm with the RMSProp algorithm: it damps the oscillation of the gradient while also increasing the speed of convergence.

\begin{align*}
&1.\ \text{Initialize the gradient accumulator and the squared-gradient accumulator: } v_w = 0,\ s_w = 0\\
&2.\ \text{In the } t\text{-th round of training, compute the Momentum and RMSProp quantities:}\\
&\qquad v_w = 0.8\,v_w + 0.2\,\nabla w \quad \text{(the Momentum part)}\\
&\qquad s_w = 0.8\,s_w + 0.2\,(\nabla w)^2 \quad \text{(the RMSProp part)}\\
&3.\ \text{Combine them for the parameter update: } w = w - \alpha\,\frac{v_w}{\sqrt{s_w + \delta}}
\end{align*}
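A sketch following the simplified formulas above (the real Adam, e.g. torch.optim.Adam, uses two separate decay rates, 0.9 and 0.999 by default, plus a bias correction step that is omitted here):

def adam_step(w, grad, v, s, lr=0.01, beta=0.8, eps=1e-7):
    # v: Momentum-style average of the gradients
    # s: RMSProp-style average of the squared gradients
    v = beta * v + (1 - beta) * grad
    s = beta * s + (1 - beta) * grad ** 2
    w = w - lr * v / (s + eps) ** 0.5
    return w, v, s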

Handwritten digits case

This case is here because the example in the previous article was a bit more involved, especially the deployment at the end, and the convolutions in it add complexity too, so we use a simpler case here instead.

data set

For this dataset, we will directly use the one provided by Pytorch.

from torchvision.datasets import MNIST
mnist = MNIST(root="./data",train=True,download=True)
print(mnist)

MNIST with download=True fetches the data automatically if it is not already present in the directory you specify, and it provides both a training split and a test split.

The training set has 60,000 images and the test (validation) set has 10,000, all of them tiny, so the download is small and fast. Each sample is one image plus the digit it shows; an image is 28x28 with a single channel, because it is grayscale.
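Continuing from the mnist object created above, a quick sketch to peek at one sample (without the ToTensor transform each sample is a PIL image plus an integer label):

print(len(mnist))        # 60000 training samples
img, label = mnist[0]
print(img.size, label)   # (28, 28) and the digit shown in this image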

data partition

So after understanding this, we can divide the dataset.
Let's write a loading function first:

from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision.transforms import Compose,ToTensor,Normalize

BATCH_SIZE = 128

def get_dataloader(train=True):
    transform_fn =Compose([
            ToTensor(),
            Normalize(mean=(0.1307,),std=(0.3081,))
            #one mean and one std value per channel
    ])

    dataset=MNIST(root="./data",train=train,transform=transform_fn)
    data_loader=DataLoader(dataset,batch_size=BATCH_SIZE,shuffle=True)
    return data_loader

The image transforms involved here are covered in that earlier article.

train_loader = get_dataloader() #training dataset
test_loader=get_dataloader(train=False) # validation dataset

network construction

We build a very simple network here; because the images are so small we do not need convolutions, and we go straight to fully connected layers for this part.

class MnistNet(nn.Module):
    def __init__(self):
        super(MnistNet,self).__init__()
        self.fc1=nn.Linear(28*28,28)#Define the input and output shapes of the Linear layer
        self.fc2=nn.Linear(28,10)#Define the shape of Linear's input and output
    def forward(self,x):
        x=x.view(-1,28*28)#Reshape the data; -1 means this dimension is inferred automatically from the rest of the shape
        x=self.fc1(x)#[batch_size,28]
        x=F.relu(x)#[batch_size,28]
        x=self.fc2(x)#[batch_size,10]
        return x
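A quick sanity check of the shapes with a fake batch, using the class defined above (the batch size of 4 is arbitrary):

import torch

net = MnistNet()
dummy = torch.rand(4, 1, 28, 28)   # a fake batch of 4 single-channel 28x28 images
out = net(dummy)
print(out.shape)                   # torch.Size([4, 10]), one score per digit class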

training and validation

This training part is similar to the linear regression we just did.

def train(epochs,test_times = 1):
    for epoch in range(epochs):
        for i, data in enumerate(train_loader):
            inputs, labels = data
            outputs = mnistNet(inputs)
            optimizer.zero_grad()
            loss = criterion(outputs,labels)
            loss.backward()# Calculate the gradient after backward
            optimizer.step()
            if((i+1)%100==0):
                print(epoch,i,loss.item())
        if((epoch+1)%20==0):
            #Save a checkpoint every 20 epochs
            torch.save(mnistNet.state_dict(),"./model_temp.pt")
            torch.save(optimizer.state_dict(),'./optimizer_temp.pt')
        if((epoch+1)%test_times==0):
            test()
    torch.save(mnistNet.state_dict(),"./model_last.pt")    
    torch.save(optimizer.state_dict(),'./optimizer_last.pt')

But what does validation mean? It is really just evaluation: we train a model, feed it new data, and check how many of its outputs are correct in order to judge the quality of the model.

def test():
    test_loss =0
    correct =  0
    mnistNet.eval()    
    with torch.no_grad():
        #Do not track gradients during evaluation
        for data,target in test_loader:
            output=mnistNet(data)
            loss = criterion(output,target)
            test_loss+=loss.item()
            pred=output.data.max(1,keepdim=True)[1]#Index of the max value along dim 1, shape [batch_size,1]
            correct+=pred.eq(target.data.view_as(pred)).sum()#Count how many predictions in this batch match the labels
        test_loss/=len(test_loader)

        print('\nTest set:Avg.loss:{:.4f},Accuracy:{}/{}({:.2f}%)'.format(
                test_loss,correct,len(test_loader.dataset),100.0*(correct/len(test_loader.dataset))
            )
              )

full code

from torch.utils.data import DataLoader
import torch.nn as nn
from torch.optim import Adam
import torch.nn.functional as F
from torchvision.datasets import MNIST
from torchvision.transforms import Compose,ToTensor,Normalize
import torch
import os

BATCH_SIZE =128
#1. Prepare the dataset
def get_dataloader(train=True):
    transform_fn =Compose([
            ToTensor(),
            Normalize(mean=(0.1307,),std=(0.3081,))
            #mean and std have the same shape and number of channels
    ])
        
    dataset=MNIST(root="./data",train=train,transform=transform_fn)
    data_loader=DataLoader(dataset,batch_size=BATCH_SIZE,shuffle=True)
    return data_loader

class MnistNet(nn.Module):
    def __init__(self):
        super(MnistNet,self).__init__()
        self.fc1=nn.Linear(28*28,28)#Define the input and output shapes of the Linear layer
        self.fc2=nn.Linear(28,10)#Define the shape of Linear's input and output
    def forward(self,x):
        x=x.view(-1,28*28)#Reshape the data; -1 means this dimension is inferred automatically from the rest of the shape
        x=self.fc1(x)#[batch_size,28]
        x=F.relu(x)#[batch_size,28]
        x=self.fc2(x)#[batch_size,10]
        return x

#pytorch's cross entropy comes with softmax
criterion=nn.CrossEntropyLoss()

mnistNet = MnistNet()
optimizer = Adam(mnistNet.parameters(),lr=0.001)
train_loader = get_dataloader() #training dataset
test_loader=get_dataloader(train=False) # validation dataset

if(os.path.exists("./model_last.pt")):
    mnistNet.load_state_dict(torch.load("./model_last.pt"))
if(os.path.exists("./optimizer_last.pt")):
    optimizer.load_state_dict(torch.load("./optimizer_last.pt"))

def test():
    test_loss =0
    correct =  0
    mnistNet.eval()    
    with torch.no_grad():
        #Do not track gradients during evaluation
        for data,target in test_loader:
            output=mnistNet(data)
            loss = criterion(output,target)
            test_loss+=loss.item()
            pred=output.data.max(1,keepdim=True)[1]#Index of the max value along dim 1, shape [batch_size,1]
            correct+=pred.eq(target.data.view_as(pred)).sum()#Count how many predictions in this batch match the labels
        test_loss/=len(test_loader)

        print('\nTest set:Avg.loss:{:.4f},Accuracy:{}/{}({:.2f}%)'.format(
                test_loss,correct,len(test_loader.dataset),100.0*(correct/len(test_loader.dataset))
            )
              )
    
def train(epochs,test_times = 1):
    for epoch in range(epochs):
        for i, data in enumerate(train_loader):
            inputs, labels = data
            outputs = mnistNet(inputs)
            optimizer.zero_grad()
            loss = criterion(outputs,labels)
            loss.backward()# Calculate the gradient after backward
            optimizer.step()
            if((i+1)%100==0):
                print(epoch,i,loss.item())
        if((epoch+1)%20==0):
            #Save a checkpoint every 20 epochs
            torch.save(mnistNet.state_dict(),"./model_temp.pt")
            torch.save(optimizer.state_dict(),'./optimizer_temp.pt')
        if((epoch+1)%test_times==0):
            test()
    torch.save(mnistNet.state_dict(),"./model_last.pt")    
    torch.save(optimizer.state_dict(),'./optimizer_last.pt')

train(2)

Summary

At this point, congratulations on completing the Lightspeed Primer!
