Using Mask R-CNN in OpenCV

This article is translated from: https://www.pyimagesearch.com/2018/11/19/mask-r-cnn-with-opencv/


In this tutorial, you will learn how to use Mask R-CNN in OpenCV.

Using Mask R-CNN, you can automatically segment and construct pixel-level masks for every object in an image. We'll apply Mask R-CNN to both images and video streams.

In last week's blog post, you learned how to use the YOLO object detector to detect objects in images. Object detectors such as YOLO, Faster R-CNN, and Single Shot Detectors (SSDs) generate the (x, y)-coordinates of a bounding box for each object in an image.

Obtaining the bounding boxes of an object is a good start, but the bounding box itself doesn't tell us anything about (1) which pixels belong to the foreground object and (2) which pixels belong to the background.

This raises a question:

Can we generate a mask for each object in the image to allow us to segment foreground objects from the background?

Is this possible?

The answer is yes - we only need to use the Mask R-CNN architecture to perform instance segmentation.

To learn how to apply Mask R-CNN with OpenCV to images and video streams, just keep reading!

Using Mask R-CNN in OpenCV

In the first part of this tutorial, we will discuss the differences between image classification, object detection, instance segmentation and semantic segmentation.

From there, we'll briefly review the Mask R-CNN architecture and its connections to Faster R-CNN.

Then I'll show you how to apply Mask R-CNN to image and video streams in OpenCV.

Let's start!

Instance segmentation and semantic segmentation

Figure 1: Image classification (top-left), object detection (top-right), semantic segmentation (bottom-left), and instance segmentation (bottom-right). We'll be performing instance segmentation with Mask R-CNN in this tutorial. (source)
 

It is best to visually explain the differences between traditional image classification, object detection, semantic segmentation and instance segmentation.

When performing traditional image classification, our goal is to predict a set of labels to characterize the contents of an input image (top-left).

Object detection builds on image classification, but this time allows us to localize each object in an image. The image is now characterized by:

  1. Bounding box (x, y) coordinates of each object
  2. Relevant class labels for each bounding box

An example of semantic segmentation can be seen at the bottom left. The semantic segmentation algorithm requires us to associate each pixel in the input image with the category label (including the category label of the background).

Pay close attention to our visualization of the semantic segmentation - notice how each object is indeed segmented, but each "cube" object has the same color.

Although semantic segmentation algorithms can label each object in the image, they cannot distinguish two objects of the same class.

This behavior is especially problematic if two objects of the same class are partially occluding each other - we have no idea where the boundaries of one object end and the next one begins (as demonstrated by the two purple cubes; we cannot tell where one cube starts and the other ends).

On the other hand, the instance segmentation algorithm calculates a pixel level mask for each object in the image, even if these objects have the same class label (lower right corner). Here, you can see that each cube has its own unique color, which means that our instance segmentation algorithm not only locates each individual cube, but also predicts their boundaries.

Mask R-CNN, the algorithm we're discussing in this tutorial, is an example of an instance segmentation algorithm.

What is Mask R-CNN?

The Mask R-CNN algorithm was introduced by He et al. in their 2017 paper, Mask R-CNN.

Mask R-CNN builds on the previous work of Girshick et al.: R-CNN (2013), Fast R-CNN (2015), and Faster R-CNN (2015).

In order to understand Mask R-CNN, let's briefly review the R-CNN variants, starting with the original R-CNN:

Figure 2: original R-CNN architecture (source: Girshick et al., 2013)

The original R-CNN algorithm is divided into four steps:

  • Step 1: Input an image to the network.
  • Step 2: Extract region proposals (i.e., regions of the image that potentially contain objects) using an algorithm such as Selective Search.
  • Step 3: Use transfer learning, specifically feature extraction, to compute features for each proposal (effectively an ROI) using a pre-trained CNN.
  • Step 4: Classify each proposal using the extracted features with a Support Vector Machine (SVM).

This approach works because the CNN has learned rich, discriminative features.

However, the problem with the R-CNN method is that it's incredibly slow. And furthermore, we're not actually learning to localize via a deep neural network - we're effectively just building a fancier HOG + Linear SVM detector.

To improve upon the original R-CNN, Girshick et al. published the Fast R-CNN algorithm:

Figure 3: Fast R-CNN architecture (source: Girshick et al., 2015).

 

Similar to the original R-CNN, Fast R-CNN still utilizes Selective Search to obtain region proposals; however, the novel contribution of the paper was the Region of Interest (ROI) Pooling module.

ROI Pooling works by extracting a fixed-size window from the feature map and using these features to obtain the final class label and bounding box (a minimal sketch of the idea follows the list below). The primary benefit here is that the network is now, effectively, end-to-end trainable:

  1. We input an image and its corresponding ground-truth bounding boxes
  2. Extract the feature map
  3. Apply ROI pooling and obtain the ROI feature vector
  4. Finally, use two fully connected layers to obtain (1) the class label predictions and (2) the bounding box locations for each proposal.
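To make ROI Pooling more concrete, here is a minimal NumPy sketch that max-pools one ROI of a single-channel feature map down to a fixed 7 x 7 output. This is my own simplification for intuition only (real implementations operate on multi-channel feature maps and handle coordinate quantization more carefully), and none of the names below come from this tutorial's download:

import numpy as np

def roi_max_pool(feature_map, roi, output_size=(7, 7)):
	# roi = (x0, y0, x1, y1), given in feature-map coordinates
	x0, y0, x1, y1 = roi
	window = feature_map[y0:y1, x0:x1]
	out_h, out_w = output_size
	# divide the window into an out_h x out_w grid of bins, then max-pool each bin
	row_bins = np.array_split(np.arange(window.shape[0]), out_h)
	col_bins = np.array_split(np.arange(window.shape[1]), out_w)
	pooled = np.zeros(output_size, dtype=feature_map.dtype)
	for i, rows in enumerate(row_bins):
		for j, cols in enumerate(col_bins):
			pooled[i, j] = window[np.ix_(rows, cols)].max()
	return pooled

feature_map = np.random.rand(32, 32).astype("float32")
print(roi_max_pool(feature_map, (4, 4, 25, 25)).shape)  # (7, 7)

Regardless of the ROI's size, the pooled output always has the same fixed dimensions - which is what lets the fully connected layers that follow be trained end-to-end.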

While the network is now end-to-end trainable, performance suffers dramatically at inference (i.e., prediction) time because it is dependent on Selective Search.

To make the R-CNN architecture even faster, we need to incorporate the region proposals directly into the R-CNN:

Figure 4: The Faster R-CNN architecture (source: Girshick et al., 2015)

 

Girshick et al.'s Faster R-CNN paper introduces the Region Proposal Network (RPN), which bakes region proposal directly into the architecture, alleviating the need for the Selective Search algorithm.

As a whole, the Faster R-CNN architecture is capable of running at approximately 7-10 FPS, a huge step towards making real-time object detection with deep learning a reality.

The Mask R-CNN algorithm builds on the Faster R-CNN architecture with two major contributions:

  1. Replacing the ROI Pooling module with a more accurate ROI Align module
  2. Inserting an additional branch out of the ROI Align module

This additional branch accepts the output of the ROI Align module and then feeds it into two CONV layers.

The output of the CONV layers is the mask itself.

We can visualize the Mask R-CNN architecture in the following figure:

Figure 5: The Mask R-CNN work of He et al. replaces the ROI Pooling module with a more accurate ROI Align module. The output of the ROI Align module is then fed into two CONV layers. The output of the CONV layers is the mask itself.

 

Notice the branch of two CONV layers coming out of the ROI Align module - this is where our mask is actually generated.

Keep in mind that the Faster R-CNN / Mask R-CNN architectures leverage a Region Proposal Network (RPN) to generate regions of an image that potentially contain an object.

Each of these regions is ranked based on its "objectness score" (i.e., how likely it is that a given region could potentially contain an object), and then the top N most confident object regions are kept.

In the original Faster R-CNN publication, Girshick et al. set N = 2,000, but in practice we can get away with a much smaller N, such as N = {10, 100, 200, 300}, and still obtain good results.

He et al. set N = 300 in their publication, which is the value we'll use here as well.

Each of the 300 selected ROIs goes through three parallel branches of the network:

  1. Label prediction
  2. Bounding box prediction
  3. Mask prediction

Figure 5 above visualizes these branches.

During prediction, each of the 300 ROIs goes through non-maxima suppression and the top 100 detection boxes are kept, resulting in a 4D tensor of 100 x L x 15 x 15, where L is the number of class labels in the dataset and 15 x 15 is the size of each of the L masks.

The Mask R-CNN we're using here today was trained on the COCO dataset, which has L = 90 classes, so the resulting volume size from the mask module of Mask R-CNN is 100 x 90 x 15 x 15.
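As a quick sanity check on those dimensions, the short NumPy snippet below (purely illustrative, with made-up values) shows how a single mask is pulled out of that volume; this mirrors the masks[i, classID] indexing used by the script later in this post:

import numpy as np

# 100 detections, 90 COCO classes, one 15 x 15 mask per (detection, class) pair
masks = np.random.rand(100, 90, 15, 15).astype("float32")
i, classID = 0, 2          # hypothetical detection index and class ID
mask = masks[i, classID]   # the 15 x 15 mask for that detection and class
print(mask.shape)          # (15, 15)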

To visualize the Mask R-CNN process, see the following figure:

Figure 6: A visualization of the Mask R-CNN process, used to produce a 15 x 15 mask, resize the mask to the original dimensions of the image, and then finally overlay the mask on the original image. (source: Deep Learning for Computer Vision with Python, ImageNet Bundle)

 

Here, you can see that we start with our input image and feed it through our Mask R-CNN network to obtain our mask prediction.

The predicted mask is only 15 x 15 pixels, so we resize it back to the original input image dimensions.

Finally, the resized mask can be overlaid on the original input image. For a more thorough discussion of how Mask R-CNN works, be sure to refer to:

  1. Original Mask R-CNN publication by He et al.
  2. My book, Deep Learning for Computer Vision with Python, where I discuss Mask R-CNN in more detail, including how to train your own Mask R-CNN from scratch on your own data.

Project structure

Our project today consists of two scripts, along with several other important files.

I organized the project in the following way (as shown in the tree command output directly in the terminal):

$ tree
.
├── mask-rcnn-coco
│   ├── colors.txt
│   ├── frozen_inference_graph.pb
│   ├── mask_rcnn_inception_v2_coco_2018_01_28.pbtxt
│   └── object_detection_classes_coco.txt
├── images
│   ├── example_01.jpg
│   ├── example_02.jpg
│   └── example_03.jpg
├── videos
│   ├── 
├── output
│   ├──  
├── mask_rcnn.py
└── mask_rcnn_video.py
4 directories, 9 files

Our project consists of four directories:

mask-rcnn-coco/: The Mask R-CNN model files. There are four files:

  • frozen_inference_graph.pb: The Mask R-CNN model weights. The weights were pre-trained on the COCO dataset.
  • mask_rcnn_inception_v2_coco_2018_01_28.pbtxt: The Mask R-CNN model configuration. If you'd like to build + train your own model on your own annotated data, see Deep Learning for Computer Vision with Python.
  • object_detection_classes_coco.txt: All 90 classes are listed in this text file, one per line. Open it in a text editor to see which objects our model can recognize.
  • colors.txt: This text file contains six colors to randomly assign to objects found in the image.

images/: I have provided three test images in the "Downloads". Feel free to add your own images to test with.

videos/: This is an empty directory. I actually tested with large videos that I scraped from YouTube (credits are below, right above the "Summary" section). Rather than providing a really big zip, my suggestion is that you find a few videos on YouTube to download and test with. Or shoot some videos with your phone and come back to your computer to use them!

output/: Another empty directory that will hold the processed videos (assuming you set the command line argument flag to output to this directory).

We will review two scripts today:

mask_rcnn.py: This script will perform instance segmentation and apply a mask to the image so you can see where, down to the pixel, Mask R-CNN thinks an object is.

mask_rcnn_video.py: This video processing script uses the same Mask R-CNN and applies the model to every frame of a video file. The script then writes the output frames back to a video file on disk.

Using OpenCV and Mask R-CNN in images

Now that we've reviewed how Mask R-CNN works, let's try some Python code.

Before you begin, make sure your Python environment has OpenCV 3.4.2/3.4.3 or higher installed. You can follow one of my OpenCV installation tutorials to upgrade/install OpenCV. If you want to be up and running in 5 minutes or less, you can consider installing OpenCV with pip. If you have some other requirements, you might want to compile OpenCV from source.
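If you take the pip route, a minimal setup inside a fresh virtual environment might look like the command below (the standard opencv-python wheel already includes the dnn module used in this tutorial, and imutils is needed for the video script later in this post):

$ pip install opencv-python imutils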

Make sure you have used the "download" section of this blog post to download the source code, trained Mask R-CNN and sample images.

From there, open up the mask_rcnn.py file and insert the following code:

# import the necessary packages
import numpy as np
import argparse
import random
import time
import cv2
import os

First, we will import the required packages in lines 2-7. It is worth noting that we are importing NumPy and OpenCV. Most Python installations come with everything else.

From there, we'll parse our command line arguments:

# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
	help="path to input image")
ap.add_argument("-m", "--mask-rcnn", required=True,
	help="base path to mask-rcnn directory")
ap.add_argument("-v", "--visualize", type=int, default=0,
	help="whether or not we are going to visualize each instance")
ap.add_argument("-c", "--confidence", type=float, default=0.5,
	help="minimum probability to filter weak detections")
ap.add_argument("-t", "--threshold", type=float, default=0.3,
	help="minimum threshold for pixel-wise mask segmentation")
args = vars(ap.parse_args())

Our script requires command line argument flags and parameters to be passed in our terminal at runtime. Our arguments are parsed on lines 10-21, where the first two of the following are required and the rest are optional:

  • --image: The path to our input image.
  • --mask-rcnn: The base path to the Mask R-CNN files.
  • --visualize (optional): A positive value indicates that we want to visualize how we extracted the masked region on our screen. Either way, we'll display the final output on the screen.
  • --confidence (optional): You can override the default probability value of 0.5 used to filter out weak detections.
  • --threshold (optional): We'll be creating a binary mask for each object in the image and this threshold value will help us filter out weak mask predictions. I found that the default value of 0.3 works well.

Now that our command line parameters are stored in the args dictionary, let's load the label and color:

# load the COCO class labels our Mask R-CNN was trained on
labelsPath = os.path.sep.join([args["mask_rcnn"],
	"object_detection_classes_coco.txt"])

LABELS = open(labelsPath).read().strip().split("\n")
# load the set of colors that will be used when visualizing a given
# instance segmentation
colorsPath = os.path.sep.join([args["mask_rcnn"], "colors.txt"])
COLORS = open(colorsPath).read().strip().split("\n")
COLORS = [np.array(c.split(",")).astype("int") for c in COLORS]
COLORS = np.array(COLORS, dtype="uint8")

Lines 24-26 load our COCO object class LABELS. Today's Mask R-CNN is capable of recognizing 90 classes, including people, vehicles, signs, animals, everyday items, sports gear, kitchen items, food, and more! I encourage you to look through object_detection_classes_coco.txt to see the available classes.

From there, we load the COLORS from the path, performing a couple of array conversion operations (lines 30-33).
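For reference, the parsing above expects each line of colors.txt to hold one comma-separated color triple. The six values below are made up for illustration and are not necessarily the exact colors shipped in the download:

255,0,0
0,255,0
0,0,255
255,255,0
255,0,255
0,255,255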

Let's load the model:

# derive the paths to the Mask R-CNN weights and model configuration
weightsPath = os.path.sep.join([args["mask_rcnn"],
	"frozen_inference_graph.pb"])
configPath = os.path.sep.join([args["mask_rcnn"],
	"mask_rcnn_inception_v2_coco_2018_01_28.pbtxt"])

# load our Mask R-CNN trained on the COCO dataset (90 classes)
# from disk
print("[INFO] loading Mask R-CNN from disk...")
net = cv2.dnn.readNetFromTensorflow(weightsPath, configPath)

First, we establish weights and configuration paths (lines 36-39), and then load the model through these paths (line 44).

Next, we'll load our image and pass it through the Mask R-CNN neural network:

# load our input image and grab its spatial dimensions
image = cv2.imread(args["image"])
(H, W) = image.shape[:2]

# construct a blob from the input image and then perform a forward
# pass of the Mask R-CNN, giving us (1) the bounding box  coordinates
# of the objects in the image along with (2) the pixel-wise segmentation
# for each specific object
blob = cv2.dnn.blobFromImage(image, swapRB=True, crop=False)
net.setInput(blob)
start = time.time()
(boxes, masks) = net.forward(["detection_out_final", "detection_masks"])
end = time.time()

# show timing information and volume information on Mask R-CNN
print("[INFO] Mask R-CNN took {:.6f} seconds".format(end - start))
print("[INFO] boxes shape: {}".format(boxes.shape))
print("[INFO] masks shape: {}".format(masks.shape))

Here, we:

  • Load our input image and extract its dimensions for scaling purposes later (lines 47 and 48).
  • Construct a blob via cv2.dnn.blobFromImage (line 54). You can learn why and how to use this function in my previous tutorial.
  • Perform a forward pass of the blob through the net while collecting timestamps (lines 55-58). The results are contained in two important variables: boxes and masks (see the layout sketch just after this list).
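As a reference for the loop that follows, here is how the two output volumes are laid out. This is inferred from how the script indexes them below, not from any formal specification:

# boxes has shape (1, 1, num_detections, 7); for detection i:
#   boxes[0, 0, i, 1]   -> class ID
#   boxes[0, 0, i, 2]   -> confidence (probability)
#   boxes[0, 0, i, 3:7] -> bounding box coordinates as fractions of the
#                          image width/height (hence the scaling by W and H)
# masks has shape (100, 90, 15, 15); masks[i, classID] is the 15 x 15
# mask predicted for detection i under class classID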

Now that we've performed a forward pass of the Mask R-CNN on the image, we'll want to filter + visualize our results. That's exactly what this next for loop accomplishes. It is quite long, so I've broken it into five code blocks, beginning here:

# loop over the number of detected objects
for i in range(0, boxes.shape[2]):
	# extract the class ID of the detection along with the confidence
	# (i.e., probability) associated with the prediction
	classID = int(boxes[0, 0, i, 1])
	confidence = boxes[0, 0, i, 2]

	# filter out weak predictions by ensuring the detected probability
	# is greater than the minimum probability
	if confidence > args["confidence"]:
		# clone our original image so we can draw on it
		clone = image.copy()

		# scale the bounding box coordinates back relative to the
		# size of the image and then compute the width and the height
		# of the bounding box
		box = boxes[0, 0, i, 3:7] * np.array([W, H, W, H])
		(startX, startY, endX, endY) = box.astype("int")
		boxW = endX - startX
		boxH = endY - startY

In this block, we begin our filter/visualization loop (line 66).

We continue to extract the classID and confidence of specific detected objects (lines 69 and 70).

From there, we filter out weak predictions by comparing the confidence against the command line argument's confidence value, ensuring we exceed it (line 74).

Assuming this is the case, we will continue to clone the image (line 76). We will need this image later.

Then, we scale the bounding box of the object and calculate the box size (lines 81-84).

Image segmentation requires us to find all pixels where an object is present. Therefore, we're going to place a transparent overlay on top of the object to see how well our algorithm is performing. In order to do so, we'll compute a mask:

		# extract the pixel-wise segmentation for the object, resize
		# the mask such that it's the same dimensions of the bounding
		# box, and then finally threshold to create a *binary* mask
		mask = masks[i, classID]
		mask = cv2.resize(mask, (boxW, boxH),
			interpolation=cv2.INTER_NEAREST)
		mask = (mask > args["threshold"])

		# extract the ROI of the image
		roi = clone[startY:endY, startX:endX]

On lines 89-91, we extract the pixel-wise segmentation for the object and resize it to the dimensions of the bounding box. Finally, we threshold the mask to turn it into a binary array/image (line 92).

We also extract the region of interest where the object is located (line 95).

Later, the mask and roi can be seen visually in Figure 8.

For convenience, this next block will visualize the mask, roi, and segmented instance if the --visualize flag is set via the command line arguments:

		# check to see if are going to visualize how to extract the
		# masked region itself
		if args["visualize"] > 0:
			# convert the mask from a boolean to an integer mask with
			# two values: 0 or 255, then apply the mask
			visMask = (mask * 255).astype("uint8")
			instance = cv2.bitwise_and(roi, roi, mask=visMask)

			# show the extracted ROI, the mask, along with the
			# segmented instance
			cv2.imshow("ROI", roi)
			cv2.imshow("Mask", visMask)
			cv2.imshow("Segmented", instance)

In this block, we:

  • Check to see if we should visualize the ROI, mask, and segmented instance (line 99).
  • Convert our mask from boolean to integer, where a value of "0" indicates background and a value of "255" indicates foreground (line 102).
  • Perform bitwise masking to visualize just the instance itself (line 103).
  • Show all three images (lines 107-109).

Again, these visualization images will only be shown if the --visualize flag is set via the optional command line argument (by default these images won't be shown).

Now let's continue to visualize:

		# now, extract *only* the masked region of the ROI by passing
		# in the boolean mask array as our slice condition
		roi = roi[mask]

		# randomly select a color that will be used to visualize this
		# particular instance segmentation then create a transparent
		# overlay by blending the randomly selected color with the ROI
		color = random.choice(COLORS)
		blended = ((0.4 * color) + (0.6 * roi)).astype("uint8")

		# store the blended ROI in the original image
		clone[startY:endY, startX:endX][mask] = blended

Line 113 extracts only the masked region of the ROI by passing in the boolean mask array as our slice condition.

Then we'll randomly select one of our six COLORS with which to apply our transparent overlay on the object (line 118).

Subsequently, we'll blend our masked region with the roi (line 119) and then place this blended region into the clone image (line 122).

Finally, we will draw rectangle and text class label + confidence value on the image and display the results!

		# draw the bounding box of the instance on the image
		color = [int(c) for c in color]
		cv2.rectangle(clone, (startX, startY), (endX, endY), color, 2)

		# draw the predicted label and associated probability of the
		# instance segmentation on the image
		text = "{}: {:.4f}".format(LABELS[classID], confidence)
		cv2.putText(clone, text, (startX, startY - 5),
			cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)

		# show the output image
		cv2.imshow("Output", clone)
		cv2.waitKey(0)

Finally, we:

  • Draw a colored border around the object (lines 125 and 126).
  • Build our class label + confidence text, and draw the text above the bounding box (lines 130-132).
  • Display the image until any key is pressed (lines 135 and 136).

Let's try our Mask R-CNN code!

Make sure you have used the download section of this tutorial to download the source code, trained Mask R-CNN and sample images. From there, open your terminal and execute the following command:

$ python mask_rcnn.py --mask-rcnn mask-rcnn-coco --image images/example_01.jpg
[INFO] loading Mask R-CNN from disk...
[INFO] Mask R-CNN took 0.761193 seconds
[INFO] boxes shape: (1, 1, 3, 7)
[INFO] masks shape: (100, 90, 15, 15)
Figure 7: Mask R-CNN applied to a scene of cars. The masks were generated with Python and OpenCV.

 

In the figure above, you can see that our Mask R-CNN has not only localized each of the cars in the image, but has also constructed a pixel-wise mask, allowing us to segment each car from the image.

If we run the same command, this time supplying the --visualize flag, we can visualize the ROI, mask, and instance as well:
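For example, the command might look like the following (any positive value works for --visualize, per the argument description above):

$ python mask_rcnn.py --mask-rcnn mask-rcnn-coco --image images/example_01.jpg \
	--visualize 1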

Figure 8: Using the --visualize flag, we can view the ROI, mask, and segmentation intermediate steps of our Mask R-CNN pipeline built with Python and OpenCV.

 

Let's try another example image:

$ python mask_rcnn.py --mask-rcnn mask-rcnn-coco --image images/example_02.jpg \
	--confidence 0.6
[INFO] loading Mask R-CNN from disk...
[INFO] Mask R-CNN took 0.676008 seconds
[INFO] boxes shape: (1, 1, 8, 7)
[INFO] masks shape: (100, 90, 15, 15)
Figure 9: using Python and OpenCV, we can use Mask R-CNN to perform instance segmentation.

 

Our Mask R-CNN has correctly detected and segmented people, dogs, horses and trucks from the image.

Here's one final example before we move on to using Mask R-CNN in videos:

$ python mask_rcnn.py --mask-rcnn mask-rcnn-coco --image images/example_03.jpg 
[INFO] loading Mask R-CNN from disk...
[INFO] Mask R-CNN took 0.680739 seconds
[INFO] boxes shape: (1, 1, 3, 7)
[INFO] masks shape: (100, 90, 15, 15)
Figure 10: Here you can see me feeding Jemma, the family beagle. The pixel-wise map of each object identified is masked and transparently overlaid on the objects. This image was generated with OpenCV and Python using a pre-trained Mask R-CNN model.

 

In this image, you can see a photo of myself and Jemma, the family beagle.

Our Mask R-CNN has detected and localized me, Jemma, and the chair with high confidence.

Using OpenCV and Mask R-CNN in video streams

Now that we have studied how to apply Mask R-CNN to images, let's explore how to apply them to video as well.

Open up the mask_rcnn_video.py file and insert the following code:

# import the necessary packages
import numpy as np
import argparse
import imutils
import time
import cv2
import os

# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--input", required=True,
	help="path to input video file")
ap.add_argument("-o", "--output", required=True,
	help="path to output video file")
ap.add_argument("-m", "--mask-rcnn", required=True,
	help="base path to mask-rcnn directory")
ap.add_argument("-c", "--confidence", type=float, default=0.5,
	help="minimum probability to filter weak detections")
ap.add_argument("-t", "--threshold", type=float, default=0.3,
	help="minimum threshold for pixel-wise mask segmentation")
args = vars(ap.parse_args())

First, we import the necessary packages and parse the command line parameters.

There are two new command-line parameters (which will replace -- image in the previous script):

  • --input: The path to our input video file.
  • --output: The path to our output video file (since we'll be writing our results to a video file on disk).

Now let's load our LABELS, COLORS, and Mask R-CNN neural network:

# load the COCO class labels our Mask R-CNN was trained on
labelsPath = os.path.sep.join([args["mask_rcnn"],
	"object_detection_classes_coco.txt"])
LABELS = open(labelsPath).read().strip().split("\n")

# initialize a list of colors to represent each possible class label
np.random.seed(42)
COLORS = np.random.randint(0, 255, size=(len(LABELS), 3),
	dtype="uint8")

# derive the paths to the Mask R-CNN weights and model configuration
weightsPath = os.path.sep.join([args["mask_rcnn"],
	"frozen_inference_graph.pb"])
configPath = os.path.sep.join([args["mask_rcnn"],
	"mask_rcnn_inception_v2_coco_2018_01_28.pbtxt"])

# load our Mask R-CNN trained on the COCO dataset (90 classes)
# from disk
print("[INFO] loading Mask R-CNN from disk...")
net = cv2.dnn.readNetFromTensorflow(weightsPath, configPath)

Our labels and colors are loaded on lines 24-31.

From there, we define our weightsPath and configPath before loading our Mask R-CNN neural network (lines 34-42).

Now let's initialize our video stream and video writer:

# initialize the video stream and pointer to output video file
vs = cv2.VideoCapture(args["input"])
writer = None

# try to determine the total number of frames in the video file
try:
	prop = cv2.cv.CV_CAP_PROP_FRAME_COUNT if imutils.is_cv2() \
		else cv2.CAP_PROP_FRAME_COUNT
	total = int(vs.get(prop))
	print("[INFO] {} total frames in video".format(total))

# an error occurred while trying to determine the total
# number of frames in the video file
except:
	print("[INFO] could not determine # of frames in video")
	total = -1

Our video stream (vs) and video writer are initialized on lines 45 and 46.

We attempt to determine the number of frames in the video file and display the total (lines 49-53). If we're unsuccessful, we catch the exception and print a status message, setting total to -1 (lines 57-59). We'll use this value to approximate how long it will take to process an entire video file.

Let's start the frame processing cycle:

# loop over frames from the video file stream
while True:
	# read the next frame from the file
	(grabbed, frame) = vs.read()

	# if the frame was not grabbed, then we have reached the end
	# of the stream
	if not grabbed:
		break

	# construct a blob from the input frame and then perform a
	# forward pass of the Mask R-CNN, giving us (1) the bounding box
	# coordinates of the objects in the image along with (2) the
	# pixel-wise segmentation for each specific object
	blob = cv2.dnn.blobFromImage(frame, swapRB=True, crop=False)
	net.setInput(blob)
	start = time.time()
	(boxes, masks) = net.forward(["detection_out_final",
		"detection_masks"])
	end = time.time()

We begin looping over the frames by defining an infinite while loop and grabbing our first frame (lines 62-64). The loop will process the video until completion, which is handled by the exit condition on lines 68 and 69.

We then construct a blob from the frame and pass it through the neural network, grabbing the elapsed time so we can calculate the estimated completion time (lines 75-80). The results are contained in both boxes and masks.

Now let's start traversing the detected objects:

	# loop over the number of detected objects
	for i in range(0, boxes.shape[2]):
		# extract the class ID of the detection along with the
		# confidence (i.e., probability) associated with the
		# prediction
		classID = int(boxes[0, 0, i, 1])
		confidence = boxes[0, 0, i, 2]

		# filter out weak predictions by ensuring the detected
		# probability is greater than the minimum probability
		if confidence > args["confidence"]:
			# scale the bounding box coordinates back relative to the
			# size of the frame and then compute the width and the
			# height of the bounding box
			(H, W) = frame.shape[:2]
			box = boxes[0, 0, i, 3:7] * np.array([W, H, W, H])
			(startX, startY, endX, endY) = box.astype("int")
			boxW = endX - startX
			boxH = endY - startY

			# extract the pixel-wise segmentation for the object,
			# resize the mask such that it's the same dimensions of
			# the bounding box, and then finally threshold to create
			# a *binary* mask
			mask = masks[i, classID]
			mask = cv2.resize(mask, (boxW, boxH),
				interpolation=cv2.INTER_NEAREST)
			mask = (mask > args["threshold"])

			# extract the ROI of the image but *only* extracted the
			# masked region of the ROI
			roi = frame[startY:endY, startX:endX][mask]

First, we filter out weak detections with a low confidence value. Then we determine the bounding box coordinates and obtain the mask and roi.

Now let's draw the object's transparent overlay, bounding rectangle, and label + confidence:

			# grab the color used to visualize this particular class,
			# then create a transparent overlay by blending the color
			# with the ROI
			color = COLORS[classID]
			blended = ((0.4 * color) + (0.6 * roi)).astype("uint8")

			# store the blended ROI in the original frame
			frame[startY:endY, startX:endX][mask] = blended

			# draw the bounding box of the instance on the frame
			color = [int(c) for c in color]
			cv2.rectangle(frame, (startX, startY), (endX, endY),
				color, 2)

			# draw the predicted label and associated probability of
			# the instance segmentation on the frame
			text = "{}: {:.4f}".format(LABELS[classID], confidence)
			cv2.putText(frame, text, (startX, startY - 5),
				cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)

Here we blend our roi with the color and store it in the original frame, effectively creating a colored, transparent overlay (lines 118-122).

We then draw a rectangle around the object and display the class label + confidence just above it (lines 125-133).

Finally, let's write the video file and clean it up:

	# check if the video writer is None
	if writer is None:
		# initialize our video writer
		fourcc = cv2.VideoWriter_fourcc(*"MJPG")
		writer = cv2.VideoWriter(args["output"], fourcc, 30,
			(frame.shape[1], frame.shape[0]), True)

		# some information on processing single frame
		if total > 0:
			elap = (end - start)
			print("[INFO] single frame took {:.4f} seconds".format(elap))
			print("[INFO] estimated total time to finish: {:.4f}".format(
				elap * total))

	# write the output frame to disk
	writer.write(frame)

# release the file pointers
print("[INFO] cleaning up...")
writer.release()
vs.release()

In the first iteration of the loop, our video writer is initialized.

The estimated processing time will be printed to the terminal on lines 143-147.

The last operation of the loop is to write the frame to disk through our writer object (line 150).

You'll notice that I've decided not to display each frame on the screen. The display operation is time-consuming, and you'll be able to view the output video with any media player once the script has finished processing anyway.

Note: Additionally, OpenCV's dnn module did not support NVIDIA GPUs at the time this post was written. Only a limited number of GPUs were supported, mainly Intel GPUs. NVIDIA GPU support was on its way, but at the time we could not easily use a GPU with OpenCV's dnn module.
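As an aside beyond the original post: newer OpenCV releases (4.2+, when built with CUDA support) do allow dnn inference to run on an NVIDIA GPU. A minimal sketch, assuming such a CUDA-enabled build is installed:

# requires an OpenCV build compiled with CUDA support (OpenCV 4.2+)
net = cv2.dnn.readNetFromTensorflow(weightsPath, configPath)
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)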

Finally, we release the video input and output file pointers (lines 154 and 155).

Now that we've coded up our Mask R-CNN + OpenCV script for video streams, let's give it a try!

Make sure to use the download section of this tutorial to download the source code and the Mask R-CNN model.

From there, you'll need to collect your own videos with your smartphone or another recording device. Alternatively, you can download videos from YouTube as I did.

Note: I intentionally did not include the videos in today's download because they are rather large (400MB+). If you choose to use the same videos as I did, the credits and links are at the bottom of this section.

Open a terminal from there and execute the following command:

$ python mask_rcnn_video.py --input videos/cats_and_dogs.mp4 \
	--output output/cats_and_dogs_output.avi --mask-rcnn mask-rcnn-coco
[INFO] loading Mask R-CNN from disk...
[INFO] 19312 total frames in video
[INFO] single frame took 0.8585 seconds
[INFO] estimated total time to finish: 16579.2047
Figure 11: Mask R-CNN applied to video through Python and OpenCV.

In the video above, you can find funny video clips of dogs and cats with a Mask R-CNN applied to them!

Here is a second example, this one of applying OpenCV and a Mask R-CNN to video clips of cars "slipping and sliding" in wintry conditions:

$ python mask_rcnn_video.py --input videos/slip_and_slide.mp4 \
	--output output/slip_and_slide_output.avi --mask-rcnn mask-rcnn-coco
[INFO] loading Mask R-CNN from disk...
[INFO] 17421 total frames in video
[INFO] single frame took 0.9341 seconds
[INFO] estimated total time to finish: 16272.9920
Figure 12: Mask R-CNN object detection applied to a car video scene using Python and OpenCV.

 

You can imagine Mask R-CNN being applied to highly trafficked roads, checking for traffic congestion, car accidents, or travelers in need of immediate help and attention.

 

How do you train your own Mask R-CNN model?

Figure 13: In my book, Deep Learning for Computer Vision with Python, you will learn how to annotate your own training data, train a custom Mask R-CNN, and apply it to your own images. I also provide two case studies on (1) skin lesion/cancer segmentation and (2) prescription pill segmentation, a first step in pill identification.

 

The Mask R-CNN model we used in this tutorial was pre-trained on the COCO dataset...

... but what if you want to train Mask R-CNN on your own custom dataset?

Inside my book, Deep Learning for Computer Vision with Python, I:

  1. Teach you how to train a Mask R-CNN to automatically detect and segment cancerous skin lesions - a first step in building an automatic cancer risk factor classification system.
  2. Provide you with my favorite image annotation tools, enabling you to create masks for your input images.
  3. Show you how to train a Mask R-CNN on your own custom dataset.
  4. Provide you with my best practices, tips, and suggestions when training your own Mask R-CNN.

All chapters of Mask R-CNN contain detailed descriptions of algorithms and codes to ensure that you can successfully train your Mask R-CNN.

Summary

In this tutorial, you learned how to use the Mask R-CNN architecture with OpenCV and Python to segment objects in images and video streams.

Object detectors such as YOLO, SSDs, and Faster R-CNN are only capable of producing bounding box coordinates of an object in an image - they tell us nothing about the actual shape of the object itself.

Using Mask R-CNN, we can generate pixel level masks for each object in the image, so that we can segment foreground objects from the background.

In addition, Mask R-CNN enables us to segment complex objects and shapes from images, which cannot be achieved by traditional computer vision algorithms.

 

 

 

 
