OpenVINO Series 15: OpenVINO OCR

This case mainly explains how to use the OpenVINO OCR models for text detection and recognition. Overall, the OCR module provided by OpenVINO is only moderately effective: it can only recognize digits and letters, special characters hurt the recognition accuracy, and there are certain requirements on the angle and resolution of the characters.

Environment Description:

  • Operating environment of this case: Win10, 10th-gen Intel i5 laptop
  • IDE: VSCode
  • openvino version: 2022.1
  • Code link: 11-OCR

1. Using the models

OpenVINO's Open Model Zoo provides many pre-trained models.

1.1 Text detection pre-trained models

For text detection, the Model Zoo provides the following models:

horizontal-text-detection-0001
  • Description: based on the FCOS architecture with a MobileNetV2-like backbone
  • Input: [1,3,704,704], corresponding to [B,C,H,W]
  • Output 1 - boxes: [N,5], where N is the number of detected bounding boxes; each box has the format [x_min, y_min, x_max, y_max, conf]
  • Output 2 - labels: [N], where N is the number of detected bounding boxes; for text detection, the value of each detected box is 0

text-detection-0003
  • Description: based on the PixelLink architecture with a MobileNetV2-like backbone
  • Input: [1,768,1280,3], corresponding to [B,H,W,C]
  • Output 1 - model/link_logits_/add: [1,192,320,16], logits related to linkage between pixels and their neighbors
  • Output 2 - model/segm_logits/add: [1,192,320,2], logits related to text/no-text classification for each pixel

text-detection-0004
  • Description: based on the PixelLink architecture with a MobileNetV2 (depth_multiplier=1.4) backbone
  • Input: [1,768,1280,3], corresponding to [B,H,W,C]
  • Output 1 - model/link_logits_/add: [1,192,320,16], logits related to linkage between pixels and their neighbors
  • Output 2 - model/segm_logits/add: [1,192,320,2], logits related to text/no-text classification for each pixel

B - batch size; H - image height; W - image width; C - number of channels.
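
To make the boxes output of horizontal-text-detection-0001 concrete, here is a minimal NumPy sketch of filtering the [N,5] blob (the raw_boxes values and the 0.5 threshold are illustrative assumptions, not real model output):

import numpy as np

# Hypothetical [N, 5] output: each row is [x_min, y_min, x_max, y_max, conf]
raw_boxes = np.array([
    [ 34.0,  52.0, 180.0,  96.0, 0.93],
    [210.0, 140.0, 380.0, 188.0, 0.61],
    [  0.0,   0.0,   0.0,   0.0, 0.00],  # zero-padded row
])

# Drop all-zero padding rows, then keep boxes above a confidence threshold
boxes = raw_boxes[~np.all(raw_boxes == 0, axis=1)]
confident_boxes = boxes[boxes[:, -1] > 0.5]
print(confident_boxes)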

1.2 FCOS review

The horizontal-text-detection-0001 model is based on FCOS, so here we give a brief review of FCOS (Fully Convolutional One-Stage Object Detection).

FCOS is an end-to-end, anchor-free, one-stage object detection algorithm. The network structure is shown in the figure below and consists of the following three parts:

  1. backbone network;
  2. feature pyramid (FPN) structure;
  3. output heads (classification / regression / center-ness);

Following the FPN idea, objects of different sizes are detected on feature maps at different levels. Specifically, five feature maps are extracted, defined as $\{P_3, P_4, P_5, P_6, P_7\}$. $P_3$, $P_4$ and $P_5$ are obtained from the backbone CNN feature maps $C_3$, $C_4$ and $C_5$ through 1x1 convolution lateral connections. $P_6$ and $P_7$ are obtained from $P_5$ and $P_6$ respectively through a convolution layer with stride = 2. So, in the end, $P_3$, $P_4$, $P_5$, $P_6$ and $P_7$ correspond to strides 8, 16, 32, 64 and 128, respectively.

At each level, the upper branch of the FCOS head is used for classification and the lower branch is used for bounding-box regression. The classification branch also has a center-ness branch that predicts how close a location is to the center of its object. Unlike the traditional center point + width/height or corner-point parameterization, FCOS predicts the position of the object box through the location point and a 4D vector (l, t, r, b), i.e. the distances from that point to the left, top, right and bottom sides of the box.
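
As a quick illustration, here is a sketch of how one such 4D prediction could be decoded back into a box (the stride, location and (l, t, r, b) values below are made-up numbers for illustration):

# A location (i, j) on a feature map with a given stride maps back to image
# coordinates; the predicted (l, t, r, b) are distances to the four box sides.
stride = 8                            # e.g. the P3 level
i, j = 12, 20                         # illustrative feature-map location (row, col)
l, t, r, b = 30.0, 10.0, 54.0, 22.0   # illustrative 4D prediction

cx = stride * (j + 0.5)               # image-space x of this location
cy = stride * (i + 0.5)               # image-space y of this location
x_min, y_min = cx - l, cy - t
x_max, y_max = cx + r, cy + b
print(x_min, y_min, x_max, y_max)     # the decoded box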

Finally, note that in FCOS, any location on the feature map that falls inside a ground-truth bounding box is considered a positive sample, so the number of positive samples used for training is very large.

The loss function will not be repeated here; we only review the overall logic of the FCOS algorithm.

1.3 PixelLink algorithm review

The algorithm behind text-detection-0003 and text-detection-0004 is based on PixelLink: Detecting Scene Text via Instance Segmentation. Here, let's make a brief review of PixelLink.

For a typical deep-learning-based text detection model, the main steps are to judge whether a region is text and to give the position and angle of the text box, as shown in the following figure:

Although the FCOS model in the previous section is not specifically for text detection, the overall logic is similar: there is a regression head and a classification head at the end.

PixelLink has two main components: pixel and link. Based on a CNN, it predicts for each pixel a text/non-text classification and a classification of whether links exist in the pixel's eight neighborhood directions (that is, the eight heat maps in the dotted box in the figure above represent the link predictions in the eight directions).

The backbone of the PixelLink network uses VGG16 as the feature extractor, with the last fully connected layers fc6 and fc7 replaced by convolution layers. Feature fusion and pixel prediction follow the FPN idea (feature pyramid network): the spatial size of the convolution layers is halved in turn while the number of convolution kernels is doubled in turn. The model has two independent heads, one for text/non-text prediction and the other for link prediction. Both use Softmax, outputting 1x2=2 channels (text/non-text classification) and 8x2=16 channels (whether links exist in the 8 neighborhood directions), respectively.
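
To make the two heads concrete, here is a minimal NumPy sketch of turning logits of those shapes into per-pixel text and link probabilities (random logits stand in for the real model/segm_logits/add and model/link_logits_/add outputs; the 0.7 thresholds are illustrative):

import numpy as np

def softmax_last(logits):
    # Softmax over the last axis
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Stand-ins for the real outputs: [1, 192, 320, 2] and [1, 192, 320, 16]
segm_logits = np.random.randn(1, 192, 320, 2).astype(np.float32)
link_logits = np.random.randn(1, 192, 320, 16).astype(np.float32)

# Text / non-text probability for each pixel
text_prob = softmax_last(segm_logits)[0, ..., 1]                             # [192, 320]

# Link probability for each pixel in each of the 8 neighbor directions
link_prob = softmax_last(link_logits.reshape(1, 192, 320, 8, 2))[0, ..., 1]  # [192, 320, 8]

text_mask = text_prob > 0.7
link_mask = link_prob > 0.7
print(text_mask.shape, link_mask.shape)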

1.4 Text recognition pre-trained models

For text recognition, the Model Zoo provides the following models:

text-recognition-0012
  • Description: VGG16-like backbone and bidirectional LSTM encoder-decoder
  • Accuracy on the ICDAR13 dataset: 0.8818
  • Input: [1,32,120,1], corresponding to [B,H,W,C]
  • Note: the source image should be a tightly aligned crop of the detected text, converted to grayscale
  • Output: [30,1,37], corresponding to [W,B,L]; order of L: 0123456789abcdefghijklmnopqrstuvwxyz#

text-recognition-0014
  • Description: ResNext101-like backbone (stages 1-2) and bidirectional LSTM encoder-decoder
  • Accuracy on the ICDAR13 dataset: 0.8887
  • Input: [1,1,32,128], corresponding to [B,C,H,W]
  • Note: the source image should be a tightly aligned crop of the detected text, converted to grayscale
  • Output: [16,1,37], corresponding to [W,B,L]; order of L: #0123456789abcdefghijklmnopqrstuvwxyz

text-recognition-resnet-fc
  • Description: model based on ResNet with a fully connected text recognition head
  • Accuracy on the ICDAR13 dataset: 92.96%
  • Input: [1,1,32,100], corresponding to [B,C,H,W]
  • Note: the source image should be a tightly aligned crop of the detected text, converted to grayscale; mean values: [127.5, 127.5, 127.5], scale factor for each channel: 127.5
  • Output: [1,26,37], corresponding to [B,W,L]; order of L: [s]0123456789abcdefghijklmnopqrstuvwxyz

B - batch size; H - image height; W - image width (for inputs) or output sequence length (for outputs); C - number of channels; L - confidence distribution across alphanumeric symbols.
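
For intuition about the [W,B,L] output layout, here is a minimal greedy-decoding sketch; the alphabet below and treating its first symbol as the blank/padding symbol are assumptions for illustration, and random logits stand in for a real [16,1,37] model output:

import numpy as np

alphabet = "#0123456789abcdefghijklmnopqrstuvwxyz"   # 37 symbols, '#' assumed to be blank

# Stand-in for a real recognition output of shape [W, B, L] = [16, 1, 37]
output = np.random.randn(16, 1, 37).astype(np.float32)

chars = []
for step in np.squeeze(output):       # iterate over the W time steps, each of shape [L]
    symbol = alphabet[int(step.argmax())]
    if symbol == alphabet[0]:         # skip the blank symbol
        continue
    chars.append(symbol)
print("".join(chars))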

1.5 Final selection

Finally, we choose horizontal-text-detection-0001 for text detection and text-recognition-0014 for text recognition.

2. Code

2.1 Download the models

First, as with the other models, let's download them.

import shutil
import sys
from pathlib import Path
import cv2
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import Markdown, display
from PIL import Image, ImageOps
from openvino.runtime import Core
from yaspin import yaspin

ie = Core()
model_dir = Path("model")
precision = "FP16"
detection_model = "horizontal-text-detection-0001"
recognition_model = "text-recognition-0014"
#base_model_dir = Path("~/open_model_zoo_models").expanduser()
base_model_dir = Path("./model/open_model_zoo_models").expanduser()
#omz_cache_dir = Path("~/open_model_zoo_cache").expanduser()
omz_cache_dir = Path("./model/open_model_zoo_cache").expanduser()
model_dir.mkdir(exist_ok=True)
'''
Download model
'''
print("1 - Download text detection model: horizontal-text-detection-0001, and text recognition model: text-recognition-0014 from Open Model Zoo. Both models are already in IR format.")
ir_path_detection_model = Path(f"{base_model_dir}/intel/{detection_model}/{precision}/{detection_model}.xml")
ir_path_recognition_model = Path(f"{base_model_dir}/intel/{recognition_model}/{precision}/{recognition_model}.xml")

if not (ir_path_detection_model.exists() and ir_path_recognition_model.exists()):
    download_command = f"omz_downloader " \
                        f"--name {detection_model},{recognition_model} " \
                        f"--output_dir {base_model_dir} " \
                        f"--cache_dir {omz_cache_dir} " \
                        f"--precision {precision}"

    display(Markdown(f"Download command: `{download_command}`"))
    with yaspin(text=f"Downloading {detection_model}, {recognition_model}") as sp:
        download_result = !$download_command
        print(download_result)
        sp.text = f"Finished downloading {detection_model}, {recognition_model}"
        sp.ok("✔")
else:
    print("IR model already exists.")

2.2 Text detection model

  • Load the detection model: horizontal-text-detection-0001;
  • Load the image and resize it to match the model's input size;
  • Run model inference and return the detection results.

First, let's load the detection model and take a look at the input and output of the model:

print("2 - Load detection Model: horizontal-text-detection-0001")

detection_model = ie.read_model(
    model=ir_path_detection_model, weights=ir_path_detection_model.with_suffix(".bin")
)
detection_compiled_model = ie.compile_model(model=detection_model, device_name="CPU")

detection_input_layer = detection_compiled_model.input(0)
detection_output_layer_box = detection_compiled_model.output('boxes')
detection_output_layer_label = detection_compiled_model.output('labels')

print("- Input of detection model shape: {}".format(detection_input_layer))
print("- Output `box` of detection model shape: {}".format(detection_output_layer_box))
print("- Output `label` of detection model shape: {}".format(detection_output_layer_label))

Terminal print:

2 - Load detection Model.
- Input of detection model shape: <ConstOutput: names[image] shape{1,3,704,704} type: f32>
- Output `box` of detection model shape: <ConstOutput: names[boxes] shape{..100,5} type: f32>
- Output `label` of detection model shape: <ConstOutput: names[labels] shape{..100} type: i64>

Next, we load the image and resize it to match the input size of the model.

print("3 - Load Image and resize into model input shape.")

# Read the image
image = cv2.imread("data/label4.png")
print("- Input image size: {}".format(image.shape))
# N,C,H,W = batch size, number of channels, height, width
N, C, H, W = detection_input_layer.shape

# Resize image to meet network expected input sizes
resized_image = cv2.resize(image, (W, H))

# Reshape to network input shape
input_image = np.expand_dims(resized_image.transpose(2, 0, 1), 0)
print("- Input image is resized (with padding) into: {}".format(input_image.shape))

plt.imshow(cv2.cvtColor(resized_image, cv2.COLOR_BGR2RGB));

Terminal print:

3 - Load Image and resize into model input shape.
- Input image size: (256, 644, 3)
- Input image is resized into: (1, 3, 704, 704)

The model inference code is as follows:

'''
### Model inference
Text boxes are detected in the image and returned as a blob of shape [100, 5]. Each detection has the format [x_min, y_min, x_max, y_max, conf].
'''
print("4 - Detection model inference.")
output_key = detection_compiled_model.output("boxes")
boxes = detection_compiled_model([input_image])[output_key]

# Remove zero only boxes
boxes = boxes[~np.all(boxes == 0, axis=1)]
print("- Detect {} boxes.".format(boxes.shape[0]))

Terminal print:

4 - Detection model inference.
- Detect 4 boxes.

2.3 Text recognition model

Loading and inference for the text recognition model are similar to the text detection model, so we go straight to the code:

def multiply_by_ratio(ratio_x, ratio_y, box):
    return [
        max(shape * ratio_y, 10) if idx % 2 else shape * ratio_x
        for idx, shape in enumerate(box[:-1])
    ]


def run_preprocesing_on_crop(crop, net_shape):
    temp_img = cv2.resize(crop, net_shape)
    temp_img = temp_img.reshape((1,) * 2 + temp_img.shape)
    return temp_img


def convert_result_to_image(bgr_image, resized_image, boxes, threshold=0.3, conf_labels=True):
    # Define colors for boxes and descriptions
    colors = {"red": (255, 0, 0), "green": (0, 255, 0), "white": (255, 255, 255)}

    # Fetch image shapes to calculate ratio
    (real_y, real_x), (resized_y, resized_x) = bgr_image.shape[:2], resized_image.shape[:2]
    ratio_x, ratio_y = real_x / resized_x, real_y / resized_y

    # Convert base image from bgr to rgb format
    rgb_image = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB)

    # Iterate through non-zero boxes
    for box, annotation in boxes:
        # Pick confidence factor from last place in array
        conf = box[-1]
        if conf > threshold:
            # Convert float to int and multiply position of each box by x and y ratio
            (x_min, y_min, x_max, y_max) = map(int, multiply_by_ratio(ratio_x, ratio_y, box))

            # Draw box based on position, parameters in rectangle function are: image, start_point, end_point, color, thickness
            cv2.rectangle(rgb_image, (x_min, y_min), (x_max, y_max), colors["green"], 3)

            # Add text to image based on position and confidence, parameters in putText function are: image, text, bottomleft_corner_textfield, font, font_scale, color, thickness, line_type
            if conf_labels:
                # Create background box based on annotation length
                (text_w, text_h), _ = cv2.getTextSize(
                    f"{annotation}", cv2.FONT_HERSHEY_TRIPLEX, 0.8, 1
                )
                image_copy = rgb_image.copy()
                cv2.rectangle(
                    image_copy,
                    (x_min, y_min - text_h - 10),
                    (x_min + text_w, y_min - 10),
                    colors["white"],
                    -1,
                )
                # Add weighted image copy with white boxes under text
                cv2.addWeighted(image_copy, 0.4, rgb_image, 0.6, 0, rgb_image)
                cv2.putText(
                    rgb_image,
                    f"{annotation}",
                    (x_min, y_min - 10),
                    cv2.FONT_HERSHEY_SIMPLEX,
                    0.8,
                    colors["red"],
                    1,
                    cv2.LINE_AA,
                )

    return rgb_image

print("5 - Load Recognition Model: text-recognition-0014")

recognition_model = ie.read_model(
    model=ir_path_recognition_model, weights=ir_path_recognition_model.with_suffix(".bin")
)

recognition_compiled_model = ie.compile_model(model=recognition_model, device_name="CPU")

recognition_output_layer = recognition_compiled_model.output(0)
recognition_input_layer = recognition_compiled_model.input(0)

# Get height and width of input layer
_, _, Hrecog, Wrecog = recognition_input_layer.shape

print("- Input of recognition model shape: {}".format(recognition_input_layer))
print("- Output of recognition model shape: {}".format(recognition_output_layer))

'''
Model reasoning
'''
# Calculate scale for image resizing
(real_y, real_x), (resized_y, resized_x) = image.shape[:2], resized_image.shape[:2]
ratio_x, ratio_y = real_x / resized_x, real_y / resized_y

# Convert image to grayscale for text recognition model
grayscale_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Get dictionary to encode output, based on model documentation
letters = "~0123456789abcdefghijklmnopqrstuvwxyz"

# Prepare empty list for annotations
annotations = list()
cropped_images = list()
# fig, ax = plt.subplots(len(boxes), 1, figsize=(5,15), sharex=True, sharey=True)
# For each crop, based on boxes given by detection model we want to get annotations
for i, crop in enumerate(boxes):
    # Get coordinates on corners of crop
    (x_min, y_min, x_max, y_max) = map(int, multiply_by_ratio(ratio_x, ratio_y, crop))
    image_crop = run_preprocesing_on_crop(grayscale_image[y_min:y_max, x_min:x_max], (Wrecog, Hrecog))

    # Run inference with recognition model
    result = recognition_compiled_model([image_crop])[recognition_output_layer]

    # Squeeze output to remove the unnecessary dimension
    recognition_results_test = np.squeeze(result)

    # Read annotation based on probabilities from output layer
    annotation = list()
    for letter in recognition_results_test:
        parsed_letter = letters[letter.argmax()]

        # The model's digit indices are shifted by one relative to `letters`, so shift each detected digit (9 wraps around to 0)
        if parsed_letter.isnumeric():
            parsed_letter = int(parsed_letter)
            parsed_letter = parsed_letter + 1
            if parsed_letter == 10:
                parsed_letter = 0
            parsed_letter = str(parsed_letter)
        # An argmax index of 0 signals the end of the string (blank symbol), so skip it
        if parsed_letter == letters[0]:
            continue
        annotation.append(parsed_letter)
    annotations.append("".join(annotation))
    cropped_image = Image.fromarray(image[y_min:y_max, x_min:x_max])
    cropped_images.append(cropped_image)

boxes_with_annotations = list(zip(boxes, annotations))
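
To render the final result, the (box, annotation) pairs can be passed to the convert_result_to_image helper defined above; a minimal sketch of how the result could be displayed:

plt.figure(figsize=(12, 12))
plt.axis("off")
plt.imshow(convert_result_to_image(image, resized_image, boxes_with_annotations, conf_labels=True))
plt.show()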

3. Results

I tried several pictures, but the results are just average. To be honest, it's not as good as Tesseract. As shown below:
