OpenVINO Series 15: OpenVINO OCR
This post explains how to use the OpenVINO OCR models for text detection and recognition. Overall, the OCR models provided by OpenVINO are only moderately effective: they can only recognize digits and Latin letters, special characters hurt recognition accuracy, and there are certain requirements on the angle and resolution of the text.
- Text detection model: horizontal-text-detection-0001.
- Text recognition model: text-recognition-0014.
Environment Description:
- Operating environment: Windows 10, 10th-gen Intel i5 laptop
- IDE: VSCode
- openvino version: 2022.1
- Code link: 11-OCR
1. Using the models
OpenVINO's Open Model Zoo provides many pre-trained models.
1.1 Text detection pre-trained models
For text detection, the Model Zoo provides the following models:
| | horizontal-text-detection-0001 | text-detection-0003 | text-detection-0004 |
|---|---|---|---|
| Description | Based on the FCOS architecture with a MobileNetV2-like backbone | Based on the PixelLink architecture with a MobileNetV2-like backbone | Based on the PixelLink architecture with a MobileNetV2 (depth_multiplier=1.4) backbone |
| Input | [1,3,704,704], corresponding to [B,C,H,W] | [1,768,1280,3], corresponding to [B,H,W,C] | [1,768,1280,3], corresponding to [B,H,W,C] |
| Output 1 | boxes: [N,5], where N is the number of detected bounding boxes. Each box has the format [x_min, y_min, x_max, y_max, conf] | model/link_logits_/add: [1,192,320,16], logits related to linkage between pixels and their neighbors | model/link_logits_/add: [1,192,320,16], logits related to linkage between pixels and their neighbors |
| Output 2 | labels: [N], where N is the number of detected bounding boxes. For text detection, the value of each detected box is 0 | model/segm_logits/add: [1,192,320,2], logits related to text/no-text classification for each pixel | model/segm_logits/add: [1,192,320,2], logits related to text/no-text classification for each pixel |

B: batch size; H: image height; W: image width; C: number of channels.
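As a quick illustration of the horizontal-text-detection-0001 output format, here is a minimal sketch with made-up values showing how each row of the `boxes` blob splits into coordinates and a confidence score:

```python
import numpy as np

# A hypothetical [N, 5] output from horizontal-text-detection-0001:
# each row is [x_min, y_min, x_max, y_max, conf] in the 704x704 input space.
boxes = np.array([
    [120.0,  48.0, 310.0,  96.0, 0.93],
    [ 40.0, 200.0, 180.0, 250.0, 0.08],
])

for x_min, y_min, x_max, y_max, conf in boxes:
    print(f"box=({x_min:.0f},{y_min:.0f})-({x_max:.0f},{y_max:.0f}), conf={conf:.2f}")
```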
1.2 FCOS review
The horizontal-text-detection-0001 model is based on FCOS (Fully Convolutional One-Stage Object Detection). Here we give a brief review of FCOS.
FCOS is an end-to-end, anchor-free, one-stage object detection algorithm. The network structure is shown in the figure below and consists of three parts:
- backbone network;
- feature pyramid structure;
- output heads (classification / regression / center-ness).
Following FPN, objects of different sizes are detected on feature maps at different levels. Specifically, five feature maps are used, denoted $\{P_3, P_4, P_5, P_6, P_7\}$. $P_3$, $P_4$, $P_5$ are obtained from the backbone CNN feature maps $C_3$, $C_4$, $C_5$ through 1x1 convolution lateral connections. $P_6$ and $P_7$ are obtained from $P_5$ and $P_6$ respectively through a convolution layer with stride 2. As a result, $P_3$, $P_4$, $P_5$, $P_6$, $P_7$ correspond to strides 8, 16, 32, 64, 128 respectively.
In each FCOS head, the upper branch is used for classification and the lower branch is used for bounding-box regression. The classification branch also carries a center-ness branch that predicts how close a location is to the center of its object. Unlike the traditional center-point + width/height or corner-coordinate representations, FCOS predicts the object box from a location through a 4D vector (l, t, r, b): the distances from that location to the left, top, right, and bottom sides of the box.
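To make the (l, t, r, b) representation concrete, here is a minimal sketch with made-up numbers showing how the regression target is computed for one feature-map location that falls inside a ground-truth box:

```python
# Ground-truth box (x0, y0, x1, y1) and a feature-map location (x, y) mapped
# back to the input image; the values are made up for illustration.
x0, y0, x1, y1 = 100, 50, 300, 200
x, y = 180, 120  # a location inside the box

# FCOS regresses the distances from the location to the four box sides
l, t = x - x0, y - y0   # left, top
r, b = x1 - x, y1 - y   # right, bottom
print(l, t, r, b)        # 80 70 120 80
```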
Finally, note that in FCOS, any feature-map location that falls inside a ground-truth bounding box is treated as a positive sample, so the number of positive samples used for training is very large.
The loss function is not repeated here; we only review the overall logic of the FCOS algorithm.
1.3 PixelLink algorithm review
The algorithm behind text-detection-0003 and text-detection-0004 is based on PixelLink: Detecting Scene Text via Instance Segmentation. Here, let's make a brief review of PixelLink.
For typical deep-learning-based text detection models, the main steps are to decide whether a region is text and to output the position and angle of the text box, as shown in the following figure:
Although the FCOS model in the previous section is not specialized for text detection, the overall logic is similar: there is a regression head and a classification head at the end.
PixelLink has two main components: pixels and links. Based on a CNN, it predicts, for each pixel, a text/non-text classification and, for each of the pixel's eight neighborhood directions, whether a link exists (the eight heat maps in the dotted box of the figure above represent the link predictions in the eight directions).
The backbone of the PixelLink network uses VGG16 as the feature extractor, replacing the last fully connected layers fc6 and fc7 with convolution layers. Feature fusion and pixel prediction follow the FPN (feature pyramid network) idea: the spatial size of the convolution layers is halved in turn while the number of convolution kernels is doubled. The model has two independent heads, one for text/non-text prediction and one for link prediction. Both use softmax, outputting 1x2=2 channels (text/non-text classification) and 8x2=16 channels (whether links exist in the 8 neighborhood directions), respectively.
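As a rough sketch of how these two heads could be decoded (assuming the 16 link channels come as 8 neighbor/class pairs, and with arbitrarily chosen thresholds), softmax turns the raw logits into a text mask and eight link masks:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Dummy logits with the shapes listed in the table above (batch, H, W, channels)
segm_logits = np.random.randn(1, 192, 320, 2)    # text / non-text per pixel
link_logits = np.random.randn(1, 192, 320, 16)   # 8 neighbors x 2 classes

# Pixel head: softmax over the 2 classes, keep the "text" probability
pixel_prob = softmax(segm_logits)[0, :, :, 1]
pixel_mask = pixel_prob > 0.7                    # arbitrary threshold

# Link head: reshape to (H, W, 8 neighbors, 2 classes) and take the
# "linked" probability for each of the 8 neighbor directions
link_prob = softmax(link_logits.reshape(1, 192, 320, 8, 2))[0, ..., 1]
link_mask = link_prob > 0.5                      # arbitrary threshold

print(pixel_mask.shape, link_mask.shape)         # (192, 320) (192, 320, 8)
```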
1.4 Text recognition pre-trained models
For text recognition, the Model Zoo provides the following models:
| | text-recognition-0012 | text-recognition-0014 | text-recognition-resnet-fc |
|---|---|---|---|
| Description | VGG16-like backbone and bidirectional LSTM encoder-decoder | ResNext101-like backbone (stage 1-2) and bidirectional LSTM encoder-decoder | Model based on ResNet with a fully connected text recognition head |
| Accuracy on the ICDAR13 dataset | 88.18% | 88.87% | 92.96% |
| Input | [1,32,120,1], corresponding to [B,H,W,C] | [1,1,32,128], corresponding to [B,C,H,W] | [1,1,32,100], corresponding to [B,C,H,W] |
| Note | The source image should be a tightly aligned crop of the detected text, converted to grayscale. | The source image should be a tightly aligned crop of the detected text, converted to grayscale. | The source image should be a tightly aligned crop of the detected text, converted to grayscale. Mean values: [127.5, 127.5, 127.5]; scale factor for each channel: 127.5. |
| Output | [30,1,37], corresponding to [W,B,L]; order of L: 0123456789abcdefghijklmnopqrstuvwxyz# | [16,1,37], corresponding to [W,B,L]; order of L: #0123456789abcdefghijklmnopqrstuvwxyz | [1,26,37], corresponding to [B,W,L]; order of L: [s]0123456789abcdefghijklmnopqrstuvwxyz |

B: batch size; H: image height; W: image width (for inputs) / output sequence length (for outputs); C: number of channels; L: confidence distribution across alphanumeric symbols.
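For intuition about the [W, B, L] output of text-recognition-0014, here is a minimal decoding sketch on dummy data: take the argmax over the 37 symbols at each of the 16 output steps and drop the `#` placeholder (simplified greedy decoding, no repeat merging):

```python
import numpy as np

# Alphabet for text-recognition-0014, taken from the "order of L" row above
alphabet = "#0123456789abcdefghijklmnopqrstuvwxyz"

# Dummy output with shape [W, B, L] = [16, 1, 37]
output = np.random.randn(16, 1, 37)

# Per output step, take the most likely symbol and skip the '#' placeholder
indices = output[:, 0, :].argmax(axis=1)
decoded = "".join(alphabet[i] for i in indices if alphabet[i] != "#")
print(decoded)
```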
1.5 Final selection
Finally, we choose:
- Text detection model: horizontal-text-detection-0001.
- Text recognition model: text-recognition-0014.
2. Code
2.1 Download the models
First, as with the other models, let's download the models.
```python
import shutil
import sys
from pathlib import Path

import cv2
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import Markdown, display
from PIL import Image, ImageOps
from openvino.runtime import Core
from yaspin import yaspin

ie = Core()

model_dir = Path("model")
precision = "FP16"
detection_model = "horizontal-text-detection-0001"
recognition_model = "text-recognition-0014"

# base_model_dir = Path("~/open_model_zoo_models").expanduser()
base_model_dir = Path("./model/open_model_zoo_models").expanduser()
# omz_cache_dir = Path("~/open_model_zoo_cache").expanduser()
omz_cache_dir = Path("./model/open_model_zoo_cache").expanduser()

model_dir.mkdir(exist_ok=True)

'''
Download model
'''
print("1 - Download text detection model: horizontal-text-detection-0001, and text recognition model: text-recognition-0014 from Open Model Zoo. Both models are already in IR format.")

ir_path_detection_model = Path(f"{base_model_dir}/intel/{detection_model}/{precision}/{detection_model}.xml")
ir_path_recognition_model = Path(f"{base_model_dir}/intel/{recognition_model}/{precision}/{recognition_model}.xml")

# Download only if one of the IR files is missing
if not (ir_path_detection_model.exists() and ir_path_recognition_model.exists()):
    download_command = f"omz_downloader " \
                       f"--name {detection_model},{recognition_model} " \
                       f"--output_dir {base_model_dir} " \
                       f"--cache_dir {omz_cache_dir} " \
                       f"--precision {precision}"
    display(Markdown(f"Download command: `{download_command}`"))
    with yaspin(text=f"Downloading {detection_model}, {recognition_model}") as sp:
        download_result = !$download_command
        print(download_result)
        sp.text = f"Finished downloading {detection_model}, {recognition_model}"
        sp.ok("✔")
else:
    print("IR model already exists.")
```
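Note that the `!$download_command` line relies on IPython shell magic, so the cell above only runs in a notebook. If you run the code as a plain Python script, one possible replacement (a sketch, not part of the original code) is to invoke `omz_downloader` through `subprocess`:

```python
import subprocess

# Plain-script alternative to the notebook-only `!$download_command` line:
# run omz_downloader as a subprocess with the same arguments
subprocess.run(download_command.split(), check=True)
```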
2.2 Text detection model
- Load the detection model horizontal-text-detection-0001;
- Load the image and resize it to match the model's input size;
- Run model inference and return the detection results.
First, let's load the detection model and take a look at the input and output of the model:
print("2 - Load detection Model: horizontal-text-detection-0001") detection_model = ie.read_model( model=ir_path_detection_model, weights=ir_path_detection_model.with_suffix(".bin") ) detection_compiled_model = ie.compile_model(model=detection_model, device_name="CPU") detection_input_layer = detection_compiled_model.input(0) detection_output_layer_box = detection_compiled_model.output('boxes') detection_output_layer_label = detection_compiled_model.output('labels') print("- Input of detection model shape: {}".format(detection_input_layer)) print("- Output `box` of detection model shape: {}".format(detection_output_layer_box)) print("- Output `label` of detection model shape: {}".format(detection_output_layer_label))
Terminal output:

```
2 - Load detection Model.
- Input of detection model shape: <ConstOutput: names[image] shape{1,3,704,704} type: f32>
- Output `box` of detection model shape: <ConstOutput: names[boxes] shape{..100,5} type: f32>
- Output `label` of detection model shape: <ConstOutput: names[labels] shape{..100} type: i64>
```
Next, we load the image and resize it to match the model's input size.
print("3 - Load Image and resize into model input shape.") # Read the image image = cv2.imread("data/label4.png") print("- Input image size: {}".format(image.shape)) # N,C,H,W = batch size, number of channels, height, width N, C, H, W = detection_input_layer.shape # Resize image to meet network expected input sizes resized_image = cv2.resize(image, (W, H)) # Reshape to network input shape input_image = np.expand_dims(resized_image.transpose(2, 0, 1), 0) print("- Input image is resized (with padding) into: {}".format(input_image.shape)) plt.imshow(cv2.cvtColor(resized_image, cv2.COLOR_BGR2RGB));
Terminal output:

```
3 - Load Image and resize into model input shape.
- Input image size: (256, 644, 3)
- Input image is resized (with padding) into: (1, 3, 704, 704)
```
The model inference code is as follows:
```python
'''
### Model inference
Text boxes are detected in the image and returned as a blob of shape `[100, 5]`.
Each detection has the format `[x_min, y_min, x_max, y_max, conf]`.
'''
print("4 - Detection model inference.")
output_key = detection_compiled_model.output("boxes")
boxes = detection_compiled_model([input_image])[output_key]

# Remove zero-only boxes
boxes = boxes[~np.all(boxes == 0, axis=1)]
print("- Detect {} boxes.".format(boxes.shape[0]))
```
Terminal output:

```
4 - Detection model inference.
- Detect 4 boxes.
```
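The filtering above only removes all-zero rows. If needed, low-confidence detections can also be dropped before recognition; a minimal sketch, assuming the `[x_min, y_min, x_max, y_max, conf]` layout and an arbitrary 0.3 threshold:

```python
# Keep only boxes whose confidence (last column) exceeds the threshold
conf_threshold = 0.3
boxes = boxes[boxes[:, -1] > conf_threshold]
print("- {} boxes remain after confidence filtering.".format(boxes.shape[0]))
```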
2.3 Text recognition model
Loading and inference for the text recognition model are similar to the text detection model, so we go directly to the code:
```python
def multiply_by_ratio(ratio_x, ratio_y, box):
    return [
        max(shape * ratio_y, 10) if idx % 2 else shape * ratio_x
        for idx, shape in enumerate(box[:-1])
    ]


def run_preprocesing_on_crop(crop, net_shape):
    temp_img = cv2.resize(crop, net_shape)
    temp_img = temp_img.reshape((1,) * 2 + temp_img.shape)
    return temp_img


def convert_result_to_image(bgr_image, resized_image, boxes, threshold=0.3, conf_labels=True):
    # Define colors for boxes and descriptions
    colors = {"red": (255, 0, 0), "green": (0, 255, 0), "white": (255, 255, 255)}

    # Fetch image shapes to calculate the ratio
    (real_y, real_x), (resized_y, resized_x) = image.shape[:2], resized_image.shape[:2]
    ratio_x, ratio_y = real_x / resized_x, real_y / resized_y

    # Convert the base image from BGR to RGB format
    rgb_image = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB)

    # Iterate through non-zero boxes
    for box, annotation in boxes:
        # Pick the confidence factor from the last place in the array
        conf = box[-1]
        if conf > threshold:
            # Convert float to int and multiply the position of each box by the x and y ratio
            (x_min, y_min, x_max, y_max) = map(int, multiply_by_ratio(ratio_x, ratio_y, box))

            # Draw a box. Parameters of cv2.rectangle: image, start_point, end_point, color, thickness
            cv2.rectangle(rgb_image, (x_min, y_min), (x_max, y_max), colors["green"], 3)

            # Add text based on position and confidence. Parameters of cv2.putText:
            # image, text, bottom-left corner of the text, font, font_scale, color, thickness, line_type
            if conf_labels:
                # Create a background box based on the annotation length
                (text_w, text_h), _ = cv2.getTextSize(
                    f"{annotation}", cv2.FONT_HERSHEY_TRIPLEX, 0.8, 1
                )
                image_copy = rgb_image.copy()
                cv2.rectangle(
                    image_copy,
                    (x_min, y_min - text_h - 10),
                    (x_min + text_w, y_min - 10),
                    colors["white"],
                    -1,
                )
                # Blend in the image copy with the white box under the text
                cv2.addWeighted(image_copy, 0.4, rgb_image, 0.6, 0, rgb_image)
                cv2.putText(
                    rgb_image,
                    f"{annotation}",
                    (x_min, y_min - 10),
                    cv2.FONT_HERSHEY_SIMPLEX,
                    0.8,
                    colors["red"],
                    1,
                    cv2.LINE_AA,
                )

    return rgb_image


print("5 - Load Recognition Model: text-recognition-0014")
recognition_model = ie.read_model(
    model=ir_path_recognition_model, weights=ir_path_recognition_model.with_suffix(".bin")
)
recognition_compiled_model = ie.compile_model(model=recognition_model, device_name="CPU")

recognition_output_layer = recognition_compiled_model.output(0)
recognition_input_layer = recognition_compiled_model.input(0)

# Get the height and width of the input layer
_, _, Hrecog, Wrecog = recognition_input_layer.shape
print("- Input of recognition model shape: {}".format(recognition_input_layer))
print("- Output of recognition model shape: {}".format(recognition_output_layer))

'''
Model inference
'''
# Calculate the scale for image resizing
(real_y, real_x), (resized_y, resized_x) = image.shape[:2], resized_image.shape[:2]
ratio_x, ratio_y = real_x / resized_x, real_y / resized_y

# Convert the image to grayscale for the text recognition model
grayscale_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Get the dictionary to decode the output, based on the model documentation
letters = "~0123456789abcdefghijklmnopqrstuvwxyz"

# Prepare empty lists for annotations and cropped images
annotations = list()
cropped_images = list()
# fig, ax = plt.subplots(len(boxes), 1, figsize=(5,15), sharex=True, sharey=True)

# For each crop, based on the boxes given by the detection model, get an annotation
for i, crop in enumerate(boxes):
    # Get the coordinates of the corners of the crop
    (x_min, y_min, x_max, y_max) = map(int, multiply_by_ratio(ratio_x, ratio_y, crop))
    image_crop = run_preprocesing_on_crop(grayscale_image[y_min:y_max, x_min:x_max], (Wrecog, Hrecog))

    # Run inference with the recognition model
    result = recognition_compiled_model([image_crop])[recognition_output_layer]

    # Squeeze the output to remove the unnecessary dimension
    recognition_results_test = np.squeeze(result)

    # Read the annotation based on the probabilities from the output layer
    annotation = list()
    for letter in recognition_results_test:
        parsed_letter = letters[letter.argmax()]
        # Shift recognized digits by one (9 wraps to 0) to compensate for the digit offset in this alphabet
        if parsed_letter.isnumeric():
            parsed_letter = int(parsed_letter)
            parsed_letter = parsed_letter + 1
            if parsed_letter == 10:
                parsed_letter = 0
            parsed_letter = str(parsed_letter)
        # Returning index 0 from argmax signals the end of the string
        if parsed_letter == letters[0]:
            continue
        annotation.append(parsed_letter)
    annotations.append("".join(annotation))
    cropped_image = Image.fromarray(image[y_min:y_max, x_min:x_max])
    cropped_images.append(cropped_image)

boxes_with_annotations = list(zip(boxes, annotations))
```
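Finally, the helper `convert_result_to_image` defined above can be used to draw the boxes and recognized text on the original image; one possible call, matching its signature, is:

```python
# Draw boxes and recognized text on the original image and show it
plt.figure(figsize=(12, 8))
plt.axis("off")
plt.imshow(convert_result_to_image(image, resized_image, boxes_with_annotations, conf_labels=True))
plt.show()
```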
3. Results
I tried several images and the results were mediocre. To be honest, it is not as good as Tesseract. An example is shown below: