Before taking this course, I had already heard of YOLO. During the pandemic, a video uploader trained YOLO to detect whether pedestrians were wearing masks, with great success. Without further ado, let's start learning.

# Basic Concepts in Object Detection

Before learning YOLO, let's cover some basic concepts of object detection. After all, YOLO is an object detection model.

## Bounding box

The bounding box (bbox) is the rectangle on the image that tightly frames the detected object.

For example, the green rectangular box in the figure just frames the detected object.

There are two ways to represent the position of a bounding box in the image:

- The xyxy format gives the upper-left corner (x1, y1) and the lower-right corner (x2, y2) of the bounding box, which uniquely determines it.
- The xywh format gives the center of the bounding box (x, y) together with its width (w) and height (h), which also uniquely determines it.

The two representations convert as follows:

$$
(x_1, y_1) = \left(x - \frac{w}{2},\; y - \frac{h}{2}\right), \qquad (x_2, y_2) = \left(x + \frac{w}{2},\; y + \frac{h}{2}\right)
$$

When writing code, always check which representation a box uses.
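The conversion can be sketched in a few lines of Python. The function names `xywh_to_xyxy` and `xyxy_to_xywh` are made up for this illustration and do not come from the course code.

```python
def xywh_to_xyxy(box):
    """Convert (center_x, center_y, width, height) to (x1, y1, x2, y2)."""
    x, y, w, h = box
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)


def xyxy_to_xywh(box):
    """Convert (x1, y1, x2, y2) to (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


corners = xywh_to_xyxy((50.0, 40.0, 20.0, 10.0))
print(corners)                 # (40.0, 35.0, 60.0, 45.0)
print(xyxy_to_xywh(corners))   # (50.0, 40.0, 20.0, 10.0)
```

Applying one conversion and then the other returns the original box, which is a handy sanity check when mixing code that uses different formats.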

## Anchor box

An anchor box is a hypothetical box that we define in advance. We first fix its size and shape, then draw it centered on some point of the image. As the name suggests, the box is anchored at a location on the picture. What is the point of doing this? The model can look for the target among the many anchor boxes we laid out, and learn how to fine-tune an anchor box so that it just frames the target.

Each model has its own way of generating anchor boxes, and YOLO-v3 is no exception.

## Intersection over Union (IoU)

This is a very important concept in object detection. We judge the quality of a bounding box predicted by the model by how well it overlaps the ground-truth box, and Intersection over Union (IoU) is the measure of that overlap.

The concept comes from set theory, where it describes the relationship between two sets A and B: the number of elements in their intersection divided by the number of elements in their union. The formula is:

$$
\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}
$$
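As a concrete instance of the set definition (intersection size divided by union size), here is a tiny sketch using Python sets. The helper name `set_iou` is hypothetical.

```python
def set_iou(a, b):
    """Intersection over union of two finite sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty sets: define the IoU as 0 to avoid dividing by zero.
        return 0.0
    return len(a & b) / len(union)


# The sets share {3, 4}; their union has 6 elements, so the IoU is 2 / 6.
print(set_iou({1, 2, 3, 4}, {3, 4, 5, 6}))
```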

We will use this concept to describe the degree of overlap between two boxes.

Let's deepen our understanding with a piece of code:

```python
import numpy as np


# Calculate IoU for boxes in xyxy format.
# This function will be saved in the box_utils.py file.
def box_iou_xyxy(box1, box2):
    # Get the coordinates of the upper-left and lower-right corners of box1
    x1min, y1min, x1max, y1max = box1[0], box1[1], box1[2], box1[3]
    # Calculate the area of box1
    s1 = (y1max - y1min + 1.) * (x1max - x1min + 1.)
    # Get the coordinates of the upper-left and lower-right corners of box2
    x2min, y2min, x2max, y2max = box2[0], box2[1], box2[2], box2[3]
    # Calculate the area of box2
    s2 = (y2max - y2min + 1.) * (x2max - x2min + 1.)

    # Calculate the coordinates of the intersection rectangle
    xmin = np.maximum(x1min, x2min)
    ymin = np.maximum(y1min, y2min)
    xmax = np.minimum(x1max, x2max)
    ymax = np.minimum(y1max, y2max)
    # Calculate the height, width, and area of the intersection rectangle
    inter_h = np.maximum(ymax - ymin + 1., 0.)
    inter_w = np.maximum(xmax - xmin + 1., 0.)
    intersection = inter_h * inter_w
    # Calculate the area of the union
    union = s1 + s2 - intersection
    # Calculate the IoU
    iou = intersection / union
    return iou
```

Here are a few things to note:

1. When calculating areas, 1 is added to each side length. This comes from treating the coordinates as pixel indices: a box spanning pixels x1min through x1max covers x1max - x1min + 1 pixels, so the area counts pixels instead of measuring a continuous length. With continuous coordinates the +1 is dropped. In principle either convention works, as long as it is applied consistently.

2. Don't forget to take the maximum with 0 when calculating the width and height of the intersection rectangle. After all, the two rectangles do not necessarily intersect.
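The labeling code later in this post calls `box_iou_xywh`, which is not shown. Below is a minimal sketch of what such a function might look like, assuming boxes are given as (center x, center y, width, height) and using continuous coordinates (no +1), which suits relative coordinates in the range 0 to 1. The zero-area guard is an addition of this sketch.

```python
import numpy as np


def box_iou_xywh(box1, box2):
    """IoU of two boxes in (center_x, center_y, width, height) format."""
    # Convert box1 from center format to corner coordinates
    x1min, y1min = box1[0] - box1[2] / 2.0, box1[1] - box1[3] / 2.0
    x1max, y1max = box1[0] + box1[2] / 2.0, box1[1] + box1[3] / 2.0
    s1 = box1[2] * box1[3]
    # Convert box2 from center format to corner coordinates
    x2min, y2min = box2[0] - box2[2] / 2.0, box2[1] - box2[3] / 2.0
    x2max, y2max = box2[0] + box2[2] / 2.0, box2[1] + box2[3] / 2.0
    s2 = box2[2] * box2[3]

    # Corners of the intersection rectangle
    xmin = np.maximum(x1min, x2min)
    ymin = np.maximum(y1min, y2min)
    xmax = np.minimum(x1max, x2max)
    ymax = np.minimum(y1max, y2max)
    # Clip at 0 because the boxes may not overlap at all
    inter_h = np.maximum(ymax - ymin, 0.)
    inter_w = np.maximum(xmax - xmin, 0.)
    intersection = inter_h * inter_w

    union = s1 + s2 - intersection
    if union <= 0:
        # Both boxes have zero area; treat the IoU as 0 (assumption of this sketch)
        return 0.0
    return float(intersection / union)


# Two 2x2 boxes shifted by 1 along x: intersection 2, union 6
print(box_iou_xywh([0., 0., 2., 2.], [1., 0., 2., 2.]))
```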

Now that we understand the basic concepts, let's unravel the mystery of YOLO!

# YOLO-V3

## YOLO-V3 contents

In 2015, Joseph Redmon and others proposed the YOLO (You Only Look Once) algorithm, now commonly referred to as YOLO-V1. In 2016, they improved the algorithm and proposed YOLO-V2, and in 2018 they developed the V3 version.

This part mainly covers the following:

- YOLO-V3 model design idea
- Generate candidate regions
  - Generate anchor boxes
  - Generate prediction boxes
  - Label candidate regions
- Extract features with a convolutional neural network
- Build a loss function
  - Get sample labels
  - Create the various loss functions
- Multi-level detection
- Predict output
  - Calculate prediction box scores and positions
  - Non-maximum suppression

## YOLO-V3 model design idea

The basic idea of the YOLO-V3 algorithm can be divided into two parts:

- Generate a series of candidate regions on the image according to fixed rules, then label them according to their position relative to the ground-truth boxes of the objects in the image. Candidate regions that are close enough to a ground-truth box are labeled as positive samples, and the position of that ground-truth box becomes the position target of the positive sample. Candidate regions that deviate too far from the ground-truth boxes are labeled as negative samples; negative samples do not need position or category predictions.

- Use a convolutional neural network to extract image features and predict the position and category of each candidate region. Each prediction box can then be treated as a sample: its label is obtained by annotating it with the position and category of the ground-truth box, the network predicts its position and category, and comparing the predictions with the labels gives a loss function.

Let's follow the order of the YOLO-V3 design ideas.

## Generate candidate regions

Machines are different from us. When we want to find an object in a picture, our eyes simply catch it; a machine cannot do that at first. Instead, it pre-generates a large number of fixed candidate regions that cover the entire image, and checks each one for whether it contains an object and which category the object belongs to. Through repeated learning, the fixed boxes are fine-tuned to fit the target objects better and better.

Therefore, how to generate candidate regions is an important step in object detection.

### 1. Generate anchor boxes

The anchor box is the pre-generated fixed box mentioned above. YOLO-V3 generates anchor boxes by dividing the original image into m×n small regions and generating a series of anchor boxes at the center of each region. An example is given below.

The original image has height H=640 and width W=480. If each small region is 32×32, then m and n are 20 and 15, respectively. The resulting grid is shown as the black boxes in the figure.

Anchor boxes are then generated at the center of each small region, as shown in the following figure:

The blue boxes in the figure are the anchor boxes. There are quite a lot of them, and they cover the whole image.
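The grid-and-anchor construction can be sketched as follows for the 640×480 example. The function name `generate_anchor_boxes` is hypothetical, and the three anchor sizes are borrowed from the default `anchors` argument of the labeling code later in this post; treat this as an illustration, not the course implementation.

```python
import numpy as np


def generate_anchor_boxes(input_h, input_w, stride, anchor_sizes):
    """Place every anchor size at the center of every grid cell.

    anchor_sizes is a flat list [w0, h0, w1, h1, ...] in pixels.
    Returns an array of shape [num_rows, num_cols, num_anchors, 4]
    holding boxes in xyxy pixel coordinates.
    """
    num_rows = input_h // stride
    num_cols = input_w // stride
    # Pixel coordinates of the center of each grid cell
    center_y = (np.arange(num_rows) + 0.5) * stride
    center_x = (np.arange(num_cols) + 0.5) * stride
    sizes = np.asarray(anchor_sizes, dtype=np.float64).reshape(-1, 2)

    boxes = np.zeros((num_rows, num_cols, len(sizes), 4))
    for k, (w, h) in enumerate(sizes):
        boxes[:, :, k, 0] = center_x[None, :] - w / 2   # x1
        boxes[:, :, k, 1] = center_y[:, None] - h / 2   # y1
        boxes[:, :, k, 2] = center_x[None, :] + w / 2   # x2
        boxes[:, :, k, 3] = center_y[:, None] + h / 2   # y2
    return boxes


boxes = generate_anchor_boxes(640, 480, 32, [116, 90, 156, 198, 373, 326])
print(boxes.shape)   # (20, 15, 3, 4): 20 rows, 15 columns, 3 anchors per cell
```

Anchor boxes near the border extend past the image edges; this sketch does not clip them.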

So how are the sizes of these anchor boxes determined? They are obtained in advance by clustering the sizes of the target objects in the training set. In plain terms, we look at the sizes of the bounding boxes in the training set and use representative values from them as the anchor box sizes. The benefit is that the prediction boxes generated later need only small adjustments, which makes learning easier for the model. Clustering on the COCO dataset gives YOLO-V3 a total of 9 anchor box sizes.
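The clustering step is not shown in this post. The sketch below illustrates one way it might be done: k-means over (width, height) pairs, assigning each box to the centroid with the highest shape IoU (equivalently, the smallest 1 − IoU distance). The function names, the deterministic initialization, and the synthetic box sizes are all assumptions made for this illustration; they are not real COCO data.

```python
import numpy as np


def shape_iou(boxes, centroids):
    """IoU between (w, h) pairs, treating all boxes as sharing one center.

    boxes has shape [N, 2], centroids has shape [K, 2]; returns [N, K].
    """
    inter_w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    inter_h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = inter_w * inter_h
    area_boxes = boxes[:, 0] * boxes[:, 1]
    area_centroids = centroids[:, 0] * centroids[:, 1]
    return inter / (area_boxes[:, None] + area_centroids[None, :] - inter)


def kmeans_anchors(boxes, k, num_iters=100):
    """Cluster (w, h) pairs into k anchor sizes, sorted by area."""
    boxes = np.asarray(boxes, dtype=np.float64)
    # Deterministic initialization (a choice of this sketch): sort the boxes
    # by area and pick k evenly spaced ones as the starting centroids.
    order = np.argsort(boxes[:, 0] * boxes[:, 1])
    picks = ((np.arange(k) + 0.5) * len(boxes) / k).astype(int)
    centroids = boxes[order[picks]].copy()

    for _ in range(num_iters):
        # Assign each box to the centroid it overlaps most
        assign = np.argmax(shape_iou(boxes, centroids), axis=1)
        new_centroids = centroids.copy()
        for c in range(k):
            members = boxes[assign == c]
            if len(members) > 0:
                new_centroids[c] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids[np.argsort(centroids[:, 0] * centroids[:, 1])]


# Synthetic box sizes scattered around three made-up (w, h) groups
rng = np.random.default_rng(0)
group_sizes = [(30, 60), (100, 80), (300, 250)]
boxes = np.concatenate(
    [np.array(size) + rng.normal(0.0, 3.0, size=(50, 2)) for size in group_sizes]
)
anchors = kmeans_anchors(boxes, k=3)
print(np.round(anchors, 1))   # one (w, h) row per cluster, close to group_sizes
```

Using IoU instead of Euclidean distance keeps large boxes from dominating the clustering, since IoU depends on relative overlap and not on absolute size.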

### 2. Generate prediction boxes

Since the size and center of each anchor box are fixed, it will inevitably not fit the target object well. Because we divided the image into many small regions, each with its own set of anchor boxes, we fine-tune the center coordinates (x, y) and the width and height (w, h) of each anchor box. The method is:

Anchor box center fine-tuning:

$$
b_x = c_x + \sigma(t_x), \qquad b_y = c_y + \sigma(t_y)
$$

where $c_x, c_y$ are the coordinates of the upper-left corner of the small region, and $\sigma$ is the sigmoid function:

$$
\sigma(x) = \frac{1}{1 + \exp(-x)}
$$

Anchor box resizing:

$$
b_h = p_h e^{t_h}, \qquad b_w = p_w e^{t_w}
$$

where $p_h, p_w$ are the height and width of the original anchor box.
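These formulas translate directly into code. The following is a minimal sketch; the function name `decode_box` is hypothetical, the center is expressed in grid-cell units, and the size is in whatever units the anchor size uses.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Turn raw adjustments (tx, ty, tw, th) into a prediction box.

    (cx, cy) is the upper-left corner of the grid cell, in cell units.
    (pw, ph) is the width and height of the anchor box.
    """
    bx = cx + sigmoid(tx)    # center x stays inside the cell
    by = cy + sigmoid(ty)    # center y stays inside the cell
    bw = pw * math.exp(tw)   # width is a positive multiple of the anchor width
    bh = ph * math.exp(th)   # height is a positive multiple of the anchor height
    return bx, by, bw, bh


# With all adjustments at 0, the box sits at the cell center with the anchor size.
print(decode_box(0.0, 0.0, 0.0, 0.0, cx=4, cy=10, pw=116, ph=90))
# (4.5, 10.5, 116.0, 90.0)
```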

You may wonder why it has to be this complicated. After the lecture, I understood that this makes training easier. Each small region has its upper-left corner (cx, cy) as its origin and a side length of one unit, so the adjusted center must stay within one unit of that origin. Suppose we used this instead:

$$
b_x = c_x + t_x, \qquad b_y = c_y + t_y, \qquad b_h = p_h t_h, \qquad b_w = p_w t_w
$$

It looks much simpler, but then the adjustments tx, ty, th, and tw must all be positive, and tx and ty must also be less than 1. The model does not know these constraints at the start and would have to learn the valid range of each parameter, which makes learning harder. With the sigmoid and exponential forms, any real-valued output is automatically valid.
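A quick numerical check makes the point: whatever real value the network outputs, the sigmoid maps it into the interval (0, 1) and the exponential maps it to a positive number, so no range constraint has to be learned. The sample values below are arbitrary.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Arbitrary raw network outputs, negative and positive
raw_outputs = [-10.0, -1.0, 0.0, 1.0, 10.0]
for t in raw_outputs:
    offset = sigmoid(t)   # always strictly between 0 and 1
    scale = math.exp(t)   # always strictly positive
    print(f"t={t:6.1f}  sigmoid(t)={offset:.5f}  exp(t)={scale:.5f}")
```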

### 3. Label the candidate regions

The dataset usually gives us the absolute positions of the ground-truth boxes, and we need to convert them into the label format we want.

For each generated candidate box, we need to know:

a. Whether the candidate box contains a target object, indicated by the label objectness. objectness=1 means the candidate box contains an object; objectness=0 means it does not.

b. If it contains an object, how to adjust the anchor box so that it frames the target better, which is what the prediction box step above does. The parameters here are tx, ty, tw, th.

c. If it contains an object, which category the object belongs to. General object detection tasks involve many kinds of objects, so classification is essential. We use the variable label for the category.

To sum up, each anchor box needs the labels [objectness, (tx, ty, tw, th), label], which we derive from the ground-truth boxes. The method is as follows:

```python
import numpy as np

# box_iou_xywh (not shown in this section) is assumed to compute the IoU of
# two boxes in xywh format, analogous to box_iou_xyxy above.


# Annotate the objectness of the prediction boxes
def get_objectness_label(img, gt_boxes, gt_labels, iou_threshold=0.7,
                         anchors=[116, 90, 156, 198, 373, 326],
                         num_classes=7, downsample=32):
    """
    img: the input image data, with shape [N, C, H, W]
    gt_boxes: the ground-truth boxes, with shape [N, 50, 4], where 50 is the
        upper limit on the number of ground-truth boxes. When an image has
        fewer than 50, the remaining coordinates are all 0.
        The box format is xywh, using relative values.
    gt_labels: the categories of the ground-truth boxes, with shape [N, 50]
    iou_threshold: when the IoU between a prediction box and a ground-truth
        box exceeds iou_threshold, the box is not treated as a negative sample
    anchors: the candidate anchor box sizes
    anchor_masks: used together with anchors to determine which anchor sizes
        the feature map at this level should use (not a parameter of this
        function as shown)
    num_classes: the number of categories
    downsample: the downsampling ratio from the input image to the feature map
    """
    img_shape = img.shape
    batchsize = img_shape[0]
    num_anchors = len(anchors) // 2
    input_h = img_shape[2]
    input_w = img_shape[3]
    # Divide the input image into num_rows x num_cols small square regions,
    # each with side length downsample
    # Number of rows of small squares
    num_rows = input_h // downsample
    # Number of columns of small squares
    num_cols = input_w // downsample

    label_objectness = np.zeros([batchsize, num_anchors, num_rows, num_cols])
    label_classification = np.zeros([batchsize, num_anchors, num_classes, num_rows, num_cols])
    label_location = np.zeros([batchsize, num_anchors, 4, num_rows, num_cols])

    scale_location = np.ones([batchsize, num_anchors, num_rows, num_cols])

    # Loop over the batch and process each image in turn
    for n in range(batchsize):
        # Loop over the ground-truth boxes of the image and, for each one,
        # find the anchor box whose shape matches it best
        for n_gt in range(len(gt_boxes[n])):
            gt = gt_boxes[n][n_gt]
            gt_cls = gt_labels[n][n_gt]
            gt_center_x = gt[0]
            gt_center_y = gt[1]
            gt_width = gt[2]
            gt_height = gt[3]
            # Skip padded (empty) boxes; the original checked gt_height twice
            if (gt_width < 1e-3) or (gt_height < 1e-3):
                continue
            i = int(gt_center_y * num_rows)
            j = int(gt_center_x * num_cols)
            ious = []
            for ka in range(num_anchors):
                bbox1 = [0., 0., float(gt_width), float(gt_height)]
                anchor_w = anchors[ka * 2]
                anchor_h = anchors[ka * 2 + 1]
                bbox2 = [0., 0., anchor_w / float(input_w), anchor_h / float(input_h)]
                # Calculate the IoU
                iou = box_iou_xywh(bbox1, bbox2)
                ious.append(iou)
            ious = np.array(ious)
            inds = np.argsort(ious)
            k = inds[-1]
            label_objectness[n, k, i, j] = 1
            c = gt_cls
            label_classification[n, k, c, i, j] = 1.

            # For prediction boxes with objectness = 1, set the location labels
            dx_label = gt_center_x * num_cols - j
            dy_label = gt_center_y * num_rows - i
            dw_label = np.log(gt_width * input_w / anchors[k * 2])
            dh_label = np.log(gt_height * input_h / anchors[k * 2 + 1])
            label_location[n, k, 0, i, j] = dx_label
            label_location[n, k, 1, i, j] = dy_label
            label_location[n, k, 2, i, j] = dw_label
            label_location[n, k, 3, i, j] = dh_label
            # scale_location adjusts the contribution of anchor boxes of
            # different sizes to the loss; it is a weighting coefficient
            # multiplied with the position loss
            scale_location[n, k, i, j] = 2.0 - gt_width * gt_height

    # At this point, the prediction boxes with positive objectness have been
    # marked from all ground-truth boxes on each image; the remaining
    # prediction boxes keep the default objectness of 0.
    # For prediction boxes with objectness of 1, the object category and the
    # position regression targets have been marked.
    return label_objectness.astype('float32'), label_location.astype('float32'), \
        label_classification.astype('float32'), scale_location.astype('float32')
```

The code is easy to follow with the comments. For each image in the batch, we take the center point of each ground-truth box to find the small region (grid cell) it falls in. We then go through the anchor box sizes, compute the IoU of each with the ground-truth box, and mark the objectness of the anchor box with the largest IoU as 1. Note that this IoU considers only the shapes of the anchor box and the ground-truth box, not the offset between their centers, so both boxes are placed at (0, 0).
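The shape-only matching described above can be isolated in a small sketch. When two boxes share the same center, their intersection is simply the smaller width times the smaller height. The helper names `shape_iou` and `best_anchor` are hypothetical, the anchor sizes are the defaults from `get_objectness_label`, and the 150×200 ground-truth size is a made-up example.

```python
def shape_iou(w1, h1, w2, h2):
    """IoU of two boxes that share the same center, so only shape matters."""
    inter = min(w1, w2) * min(h1, h2)
    union = w1 * h1 + w2 * h2 - inter
    return inter / union


def best_anchor(gt_w, gt_h, anchors):
    """Return (index, ious) for the anchor whose shape best matches the box.

    anchors is a flat list [w0, h0, w1, h1, ...].
    """
    num_anchors = len(anchors) // 2
    ious = [
        shape_iou(gt_w, gt_h, anchors[2 * k], anchors[2 * k + 1])
        for k in range(num_anchors)
    ]
    best = max(range(num_anchors), key=lambda k: ious[k])
    return best, ious


anchors = [116, 90, 156, 198, 373, 326]
k, ious = best_anchor(150, 200, anchors)
print(k)                                  # 1: the 156x198 anchor fits best
print([round(v, 3) for v in ious])        # [0.348, 0.952, 0.247]
```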

Then we mark the category (in one-hot form) and the prediction box offsets [dx, dy, tw, th]. Since recovering tx and ty would require the inverse of the sigmoid function, which is inconvenient, the labels store dx = σ(tx) and dy = σ(ty) directly.
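To check that the location labels carry all the information needed, the sketch below encodes one ground-truth box the same way `get_objectness_label` does and then decodes it back. The function names and the example box are hypothetical; the box uses relative xywh coordinates on a 480×640 (width × height) input.

```python
import math


def encode_location(gt_xywh, anchor_wh, input_wh, downsample=32):
    """Compute the grid cell and the [dx, dy, tw, th] label for one box."""
    gx, gy, gw, gh = gt_xywh
    input_w, input_h = input_wh
    num_cols = input_w // downsample
    num_rows = input_h // downsample
    j = int(gx * num_cols)            # column of the cell containing the center
    i = int(gy * num_rows)            # row of the cell containing the center
    dx = gx * num_cols - j            # offset inside the cell, in [0, 1)
    dy = gy * num_rows - i
    tw = math.log(gw * input_w / anchor_wh[0])
    th = math.log(gh * input_h / anchor_wh[1])
    return (i, j), (dx, dy, tw, th)


def decode_location(cell, label, anchor_wh, input_wh, downsample=32):
    """Invert encode_location and recover the relative xywh box."""
    i, j = cell
    dx, dy, tw, th = label
    input_w, input_h = input_wh
    num_cols = input_w // downsample
    num_rows = input_h // downsample
    gx = (j + dx) / num_cols
    gy = (i + dy) / num_rows
    gw = anchor_wh[0] * math.exp(tw) / input_w
    gh = anchor_wh[1] * math.exp(th) / input_h
    return gx, gy, gw, gh


gt = (0.5, 0.375, 0.4, 0.3)           # made-up box: center x, center y, w, h
cell, label = encode_location(gt, anchor_wh=(156, 198), input_wh=(480, 640))
print(cell)                            # (7, 7)
print(decode_location(cell, label, anchor_wh=(156, 198), input_wh=(480, 640)))
```

The decoded box matches the original up to floating-point rounding.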

At this point, the labeling task is done and we have the labels. So how do we connect them with the image? For reasons of space, we will look at that next time.