PyTorch on the GPU - training neural networks with CUDA (PyTorch series-30)

Running PyTorch code on GPU - neural network programming guide

In this episode, we will learn how to use the GPU with PyTorch. We will see how to use the GPU in general, and then how to apply these general techniques to training our neural networks.

Using a GPU for deep learning

If you haven't seen the episode about why deep learning and neural networks use GPUs, be sure to review that episode alongside this one to get the best understanding of these concepts.

For now, we will start with a PyTorch GPU example to lay the foundation.

PyTorch GPU example

PyTorch allows us to seamlessly move data to and from the GPU as we perform computations inside our programs.

To move to the GPU, we can use the cuda() method. To move to the CPU, we can use the cpu() method.

We can also use the to() method. To go to the GPU, we write to('cuda'). To go to the CPU, we write to('cpu'). The to() method is the preferred one, mainly because it is more flexible. We will see one example using the first two methods, and then we will default to the to() variant from there on.
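
As a minimal sketch of the to() variant, the snippet below picks a device string and passes it to to(). The fallback to 'cpu' when CUDA is unavailable is an assumption added here so that the snippet runs on any machine; it is not part of the original example.

```python
import torch

# Assumption: fall back to the CPU when no CUDA device is present,
# so this sketch also runs on machines without a GPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

t = torch.ones(1, 1, 28, 28)   # created on the CPU by default
t = t.to(device)               # the same call works for either device

print(t.device.type)           # 'cuda' or 'cpu', depending on the machine
```

Because the device is just a parameter, switching devices means changing one string rather than swapping one method call for another.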

During the training process, there are two essential requirements for using our GPU. These requirements are as follows:
1. The data must be moved to the GPU.
2. The network must be moved to the GPU.
By default, when a PyTorch tensor or a PyTorch neural network module is created, the corresponding data is initialized on the CPU. Specifically, the data lives in the CPU's memory.
Now, let's create a tensor and a network and see how we move them from the CPU to the GPU.
Here, we create a tensor and a network:

t = torch.ones(1,1,28,28)
network = Network()

Now, we call the cuda() method and reassign the tensor and the network to the returned values, which have been copied onto the GPU:

t = t.cuda()
network = network.cuda()

Next, we can get a prediction from the network and see that the prediction tensor's device attribute confirms that the data is on cuda, which is the GPU:

> gpu_pred = network(t)
> gpu_pred.device

device(type='cuda', index=0)

Similarly, we can do the opposite:

> t = t.cpu()
> network = network.cpu()

> cpu_pred = network(t)
> cpu_pred.device

device(type='cpu')

In a nutshell, this is how we take advantage of PyTorch's GPU capabilities. We should now look at some important details that are hiding beneath the surface of the code we just saw.
For example, although we used the cuda() and cpu() methods, they aren't actually our best options. In addition, what is the difference between these methods on a network instance and on a tensor instance? After all, these are different object types, which means the two methods are different. Finally, we will integrate this code into a working example and do a performance test.

General idea of using GPU

The main takeaway at this point is that our network and our data must both exist on the GPU in order to perform computations using the GPU, and this applies to any programming language or framework.

As we will see in the next demonstration, this is also true for the CPU. GPUs and CPUs are compute devices that compute on data, so any two values that are used directly with one another in a computation must exist on the same device.

PyTorch tensor computations on a GPU

Let's take a closer look by demonstrating some tensor calculations.
We first create two tensors:

t1 = torch.tensor([
    [1,2],
    [3,4]
])

t2 = torch.tensor([
    [5,6],
    [7,8]
])

Now, we check which device these tensors were initialized on by inspecting the device attribute:

> t1.device, t2.device

(device(type='cpu'), device(type='cpu'))

As we would expect, both tensors are on the same device, which is the CPU. Let's move the first tensor, t1, to the GPU.

> t1 = t1.to('cuda')
> t1.device

device(type='cuda', index=0)

We can see that this tensor's device has been changed to cuda, the GPU. Note the use of the to() method here. Instead of calling a particular method to move to a device, we call the same method and pass an argument that specifies the device. Using the to() method is the preferred way of moving data to and from devices.
Also, note the reassignment. The operation is not performed in place, so the result must be assigned back to the variable.
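
To make the reassignment point concrete, here is a small sketch. The CPU fallback is an assumption added so the snippet runs without a GPU. On a machine without CUDA, to('cpu') simply returns the original tensor, so the difference is only visible when a GPU is present.

```python
import torch

# Assumption: fall back to the CPU when CUDA is not available.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

t = torch.tensor([[1, 2], [3, 4]])

t.to(device)          # return value discarded: t itself is unchanged
moved = t.to(device)  # keep the returned tensor instead

print(t.device)       # still cpu
print(moved.device)   # the selected device
```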
Let's try an experiment. What happens if we attempt a computation on t1 and t2, which are now on different devices?
Since we expect an error, we wrap the call in a try block and catch the exception:

try: 
    t1 + t2
except Exception as e:
    print(e)

expected device cuda:0 but got device cpu

By reversing the order of the operands, we can see that the error changes as well:

try: 
    t2 + t1
except Exception as e: 
    print(e)

expected device cpu but got device cuda:0

Both of these errors tell us that the binary plus operation expects the second argument to have the same device as the first argument. The exact wording of the message varies between PyTorch versions, but understanding what it means helps when debugging these kinds of device mismatches.
Finally, for completeness, let's move the second tensor to the cuda device and see the operation succeed.

> t2 = t2.to('cuda')
> t1 + t2

tensor([[ 6,  8],
        [10, 12]], device='cuda:0')
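
When debugging a mismatch like the ones above, one option is to print both devices and then move one operand onto the device of the other rather than hard-coding 'cuda'. The following is a sketch of that idea; the CPU fallback is an assumption added so the snippet runs on any machine.

```python
import torch

# Assumption: fall back to the CPU when CUDA is not available.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

t1 = torch.tensor([[1, 2], [3, 4]]).to(device)
t2 = torch.tensor([[5, 6], [7, 8]])      # left on the CPU

# Inspect the devices first, then move t2 onto whatever device t1 uses.
print(t1.device, t2.device)
result = t1 + t2.to(t1.device)

print(result)
```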

PyTorch nn.Module computations on a GPU

We just saw how tensors can be moved to and from devices. Now, let's see how this is done with PyTorch nn.Module instances.
More generally, we are interested in understanding what it means for a network to be on a device like a GPU or CPU. PyTorch aside, this is the essential question.
We put a network on a device by moving the network's parameters to that device. Let's create a network:

network = Network()

Now let's look at the parameters of the network:

for name, param in network.named_parameters():
    print(name, '\t\t', param.shape)

conv1.weight        torch.Size([6, 1, 5, 5])
conv1.bias          torch.Size([6])
conv2.weight        torch.Size([12, 6, 5, 5])
conv2.bias          torch.Size([12])
fc1.weight          torch.Size([120, 192])
fc1.bias            torch.Size([120])
fc2.weight          torch.Size([60, 120])
fc2.bias            torch.Size([60])
out.weight          torch.Size([10, 60])
out.bias            torch.Size([10])

Here, we create a PyTorch network and iterate over its parameters. As we can see, the network's parameters are the weights and biases inside the network.
In other words, these are simply tensors that live on a device, like the ones we have already seen. Let's verify this by checking the device of each parameter.

for n, p in network.named_parameters():
    print(p.device, '', n)

cpu  conv1.weight
cpu  conv1.bias
cpu  conv2.weight
cpu  conv2.bias
cpu  fc1.weight
cpu  fc1.bias
cpu  fc2.weight
cpu  fc2.bias
cpu  out.weight
cpu  out.bias

This shows us that all of the parameters inside the network are initialized on the CPU by default.
This is an important point because it explains why nn.Module instances like networks don't actually have a device. It is not the network that lives on a device, but the tensors inside the network that live on a device.
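
Since a module has no device attribute of its own, one common way to ask "where is this network?" is to look at one of its parameters. The helper name module_device below is hypothetical, and nn.Linear is only a stand-in for the Network class from this series. The sketch assumes the module has at least one parameter and that all parameters share a device.

```python
import torch
import torch.nn as nn

def module_device(module: nn.Module) -> torch.device:
    """Hypothetical helper: report the device of a module's first parameter."""
    return next(module.parameters()).device

layer = nn.Linear(in_features=4, out_features=2)  # stand-in for Network
print(module_device(layer))   # cpu, since parameters are created on the CPU
```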
Let's see what happens when we ask a network to move to the GPU:

network.to('cuda')

Network(
    (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
    (conv2): Conv2d(6, 12, kernel_size=(5, 5), stride=(1, 1))
    (fc1): Linear(in_features=192, out_features=120, bias=True)
    (fc2): Linear(in_features=120, out_features=60, bias=True)
    (out): Linear(in_features=60, out_features=10, bias=True)
)

Note that a reassignment was not required here. This is because, for network instances, the operation is performed in place. However, the call can also be written with a reassignment, and doing so is preferred for consistency between nn.Module instances and PyTorch tensors.
Here, we can see that all of the network parameters now have a device of cuda:

for n, p in network.named_parameters():
    print(p.device, '', n)

cuda:0  conv1.weight
cuda:0  conv1.bias
cuda:0  conv2.weight
cuda:0  conv2.bias
cuda:0  fc1.weight
cuda:0  fc1.bias
cuda:0  fc2.weight
cuda:0  fc2.bias
cuda:0  out.weight
cuda:0  out.bias
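
The in-place behavior of to() on a module, described above, can be checked with a short sketch. Here nn.Linear is only a stand-in for the Network class, and the CPU fallback is an assumption added so the snippet runs without a GPU.

```python
import torch
import torch.nn as nn

# Assumption: fall back to the CPU when CUDA is not available.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

layer = nn.Linear(4, 2)        # stand-in for Network
returned = layer.to(device)    # moves the parameters and returns the module

print(returned is layer)       # True: the very same module object
print(next(layer.parameters()).device)
```

This is why writing network = network.to(device) is harmless for modules while being required for tensors, so the same pattern can be used for both.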

Passing a sample to the network

Let's round off this demonstration by passing a sample to the network.

sample = torch.ones(1,1,28,28)
sample.shape

torch.Size([1, 1, 28, 28])

This gives us a sample tensor that we can pass to the network like so:

try:
    network(sample)
except Exception as e: 
    print(e)

Expected object of device type cuda but got device type cpu for argument #1 'self' in call to _thnn_conv2d_forward

Since our network is on the GPU and the newly created sample is on the CPU by default, we get an error. The error tells us that a CPU tensor was passed where a GPU tensor was expected when calling the forward method of the first convolutional layer. This is precisely what we saw before when adding two tensors directly.
We can fix this issue by sending our sample to the GPU like so:

try:
    pred = network(sample.to('cuda'))
    print(pred)
except Exception as e:
    print(e)

tensor([[-0.0685,  0.0201,  0.1223,  0.1075,  0.0810,  0.0686, -0.0336, -0.1088, -0.0995,  0.0639]],
       device='cuda:0', grad_fn=<AddmmBackward>)

Finally, everything went as expected and we got a prediction.

Writing device-agnostic PyTorch code

Before we wrap up, we need to talk about writing device-agnostic code. The term device agnostic means that our code doesn't depend on the underlying device. You may come across this terminology when reading the PyTorch documentation.
For example, suppose we write code that uses the cuda() method everywhere, and then we give the code to a user who doesn't have a GPU. This won't work. Don't worry, though. We have options!
Remember the cuda() and cpu() methods we saw earlier?
One of the reasons we prefer the to() method is that it is parameterized, which makes it easier to change the device we are choosing. In other words, it is flexible!
For example, a user could pass in cpu or cuda as an argument to a deep learning program, and this would allow the program to be device agnostic.
Allowing the user of a program to pass an argument that determines the program's behavior is perhaps the best way to make a program device agnostic. However, we can also use PyTorch to check for a supported GPU and set our device that way.

torch.cuda.is_available()
True

If cuda is available, use it!
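
The two approaches described above, a user-supplied argument and an availability check, can be combined. The following is a sketch only: the --device flag and the resolve_device helper are hypothetical names introduced for illustration, not part of the series' code.

```python
import argparse
import torch

def resolve_device(requested=None):
    """Hypothetical helper: honor the user's choice, else prefer CUDA if available."""
    if requested is None:
        requested = 'cuda' if torch.cuda.is_available() else 'cpu'
    return torch.device(requested)

parser = argparse.ArgumentParser()
parser.add_argument('--device', choices=['cpu', 'cuda'], default=None)

# An explicit argument list is passed here so the sketch runs anywhere;
# a real program would call parser.parse_args() with no arguments.
args = parser.parse_args(['--device', 'cpu'])
device = resolve_device(args.device)

t = torch.ones(2, 2).to(device)
print(t.device)   # cpu
```

When no --device flag is given, the helper falls back to the is_available() check, so the same script runs on machines with or without a GPU.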

PyTorch GPU training performance test

Now let's see how to add the use of a GPU to the training loop. We will make this addition to the code we have been developing so far in this series.
This will allow us to easily compare times, CPU vs. GPU.

Refactoring the RunManager class

Before we update the training loop, we need to update the RunManager class. Inside the begin_run() method, we need to change the device of the images tensor that is passed to the add_graph method.

It should look like this:

def begin_run(self, run, network, loader):

    self.run_start_time = time.time()

    self.run_params = run
    self.run_count += 1

    self.network = network
    self.loader = loader
    self.tb = SummaryWriter(comment=f'-{run}')

    images, labels = next(iter(self.loader))
    grid = torchvision.utils.make_grid(images)

    self.tb.add_image('images', grid)
    self.tb.add_graph(
        self.network,
        images.to(getattr(run, 'device', 'cpu'))
    )

Here, we use the built-in getattr() function to get the value of device from the run object. If the run object doesn't have a device, then cpu is returned. This makes the code backward compatible: it will still work if we don't specify a device for our run.
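
As a small illustration of this fallback, the sketch below uses two hypothetical named tuples standing in for run objects, one without a device field and one with it.

```python
from collections import namedtuple

# Hypothetical run objects standing in for the ones used in the training loop.
OldRun = namedtuple('OldRun', ['lr', 'batch_size'])
NewRun = namedtuple('NewRun', ['lr', 'batch_size', 'device'])

old_run = OldRun(lr=0.01, batch_size=1000)
new_run = NewRun(lr=0.01, batch_size=1000, device='cuda')

print(getattr(old_run, 'device', 'cpu'))   # cpu  (attribute missing, default used)
print(getattr(new_run, 'device', 'cpu'))   # cuda (attribute present)
```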
Note that the network doesn't need to be moved to a device here because its device was set before it was passed in. The images tensor, however, is obtained from the loader, so it has to be moved here.

Refactoring the training loop

We will set our configuration parameters to have a device. The two logical options here are cuda and cpu.

from collections import OrderedDict

params = OrderedDict(
    lr = [.01],
    batch_size = [1000, 10000, 20000],
    num_workers = [0, 1],
    device = ['cuda', 'cpu']
)

With these device values added to our configuration, they are now accessible inside our training loop.
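
To show how a configuration like this could become individual runs with a device attribute, here is a sketch that expands the parameter lists into their Cartesian product. That each run is a named tuple built this way is an assumption made for illustration; the run-building code itself is not shown in this episode.

```python
from collections import OrderedDict, namedtuple
from itertools import product

params = OrderedDict(
    lr=[.01],
    batch_size=[1000, 10000, 20000],
    num_workers=[0, 1],
    device=['cuda', 'cpu'],
)

# Assumption: each run is a named tuple holding one value per parameter.
Run = namedtuple('Run', params.keys())
runs = [Run(*values) for values in product(*params.values())]

print(len(runs))        # 12 = 1 * 3 * 2 * 2
print(runs[0])          # Run(lr=0.01, batch_size=1000, num_workers=0, device='cuda')
print(runs[0].device)   # cuda
```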
At the top of each run, we create a device that will be passed around inside the run and inside the training loop.

device = torch.device(run.device)

The first place we use this device is when initializing our network.

network = Network().to(device)

This ensures that the network is moved to the appropriate device. Finally, we update our images and labels tensors by unpacking them separately from the batch and sending each to the device like so:

images = batch[0].to(device)
labels = batch[1].to(device)
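
Put together, the pieces above fit into a training loop roughly as follows. This is a self-contained sketch, not the series' actual loop: the tiny nn.Linear model, the synthetic data, and the list used as a loader are all stand-ins, and the device is chosen with an availability check so the snippet runs without a GPU.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Tiny stand-in model and synthetic data (assumptions, not the series' Network or dataset).
network = nn.Linear(28 * 28, 10).to(device)
optimizer = torch.optim.SGD(network.parameters(), lr=0.01)

images_cpu = torch.randn(64, 28 * 28)
labels_cpu = torch.randint(0, 10, (64,))
loader = [(images_cpu[i:i + 16], labels_cpu[i:i + 16]) for i in range(0, 64, 16)]

for batch in loader:
    images = batch[0].to(device)   # inputs must be on the network's device
    labels = batch[1].to(device)   # labels must match the predictions' device

    preds = network(images)
    loss = F.cross_entropy(preds, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(preds.device, loss.item())
```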

That's it. We're ready to run this code and see the results.

Here, we can see that the cuda device is about two to three times faster than the CPU. Your results may vary.
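
For a quick comparison outside the full training loop, a rough timing sketch like the one below can be used. It is a micro-benchmark of matrix multiplication, not the RunManager-based measurement from this series, and the time_matmul helper is a hypothetical name. Because CUDA kernels run asynchronously, the sketch calls torch.cuda.synchronize() before reading the clock.

```python
import time
import torch

def time_matmul(device, size=512, repeats=10):
    """Hypothetical helper: average seconds per matrix multiply on `device`."""
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    if device.type == 'cuda':
        torch.cuda.synchronize()   # finish pending GPU work before timing
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    if device.type == 'cuda':
        torch.cuda.synchronize()   # wait for the queued kernels to complete
    return (time.perf_counter() - start) / repeats

devices = [torch.device('cpu')]
if torch.cuda.is_available():
    devices.append(torch.device('cuda'))

for d in devices:
    print(d, f'{time_matmul(d):.6f} s per multiply')
```

Small workloads can favor the CPU because of transfer and launch overhead, so numbers from a sketch like this should be read as indicative only.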

Posted by bmbc on Mon, 23 May 2022 12:42:05 +0300