Merging the CASIA-HWDB2.x (offline) dataset into page images annotated with line-level bboxes

I have recently been working on line-level detection in handwritten documents. For this I merged the CASIA-HWDB2.x (offline) line images back into full pages and generated a page-level dataset with a bbox for every text line. If you would like to discuss OCR-related work, you can join the group (at the end of the article).

CASIA-HWDB2.x (offline) dataset download page: http://www.nlpr.ia.ac.cn/databases/handwriting/Download.html

This is the part I downloaded:

For how to parse the CASIA-HWDB2.x (offline) dataset, see: https://www.freesion.com/article/6894959465/

After parsing, the line images are stored under HWDB2.xTrain_images:

Image preview:

001-P16_0.jpg:

 

After parsing, the labels are stored under HWDB2.xTrain_label:

Label preview:

001-P16_0.txt:

Since 2002, domestic entrepreneurs, including many well-known entrepreneurs, have been arrested for illegal crimes.

 

Now to the main topic: stitching the parsed line images back into full pages.

Page result preview for 001-P16:

Label preview:

671, 1000, 2660, 1000, 2660, 1120, 671, 1120, since 2002, domestic entrepreneurs, including many well-known entrepreneurs, have been arrested for illegal crimes.
500, 1120, 2591, 1120, 2591, 1246, 500, 1246, the number of prisoners is increasing, and reports in this regard are often extreme. not which one was caught,
500, 1246, 2565, 1246, 2565, 1347, 500, 1347, which one was sentenced, or whether this case was held, or that case was sentenced. In short, it happens almost every month
500, 1347, 770, 1347, 770, 1442, 500, 1442, news.
667, 1442, 2660, 1442, 2660, 1551, 667, 1551, entrepreneurs were sacked, sentenced, imprisoned, and even executed for capital crimes, the focus of media attention
500, 1551, 2633, 1551, 2633, 1671, 500, 1671, are often not legal issues, but more of entrepreneurial management and management issues. published in the media
500, 1671, 2655, 1671, 2655, 1769, 500, 1769, there are many economists and management scientists, but few legal experts participate in the discussion. This is a
500, 1769, 2624, 1769, 2624, 1892, 500, 1892, abnormal phenomena. No matter what problems the entrepreneur has in operation and management, if the final outcome is
500, 1892, 2660, 1892, 2660, 1984, 500, 1984, walking into prison, if the final conclusion is that the court is convicted, then the most important thing should be the law
500, 1984, 674, 1984, 674, 2065, 500, 2065, questions!
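Each label line above holds eight coordinates (four corners, clockwise from the top left) followed by the text of the line. A minimal sketch of reading such a line back is shown here. The function name `parse_label_line` is hypothetical and not part of the original script, and the sketch assumes the coordinates are plain integers separated by commas. Because the text itself may contain commas, only the first eight commas are treated as separators.

```python
def parse_label_line(line):
    """Split one page-level label line into four (x, y) corners and the text.

    Hypothetical helper: assumes the layout x1,y1,x2,y1,x2,y2,x1,y2,text.
    """
    # The text may contain commas, so split off only the first 8 fields.
    parts = line.rstrip("\r\n").split(",", 8)
    if len(parts) != 9:
        raise ValueError(f"expected 8 coordinates and a text field: {line!r}")
    coords = [int(p) for p in parts[:8]]  # int() tolerates surrounding spaces
    corners = list(zip(coords[0::2], coords[1::2]))
    return corners, parts[8]
```

The corners come back in the same order in which they were written, so they can be passed directly to a polygon-drawing routine when checking the result visually.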

 

The main idea of page generation:

1. The parsed images of one page share a unique prefix. For example, 001-P16_0.jpg, 001-P16_1.jpg, ..., 001-P16_9.jpg all carry the identifier 001-P16, which means that page 001-P16 was split into 10 line images (numbered 0 to 9). To restore the whole page, sort these images by index and stitch them together from top to bottom.

2. We call each line image a segment. Segments differ in both width and height. The height can be ignored for now, but for top-to-bottom stitching the widths must match: take the largest segment width as max_width and pad every narrower segment up to it.

3. The first line of a paragraph (which is indented) and the last line of a paragraph are clearly shorter than the lines in between, so padding every segment on the same side would be wrong. A simple check handles this: a segment that starts a paragraph is padded at the front (left), while a segment that ends a paragraph or sits in the middle of one is padded at the back (right). In the code below, a segment counts as a paragraph start if it is the first segment of the page or if its label has more than 5 characters more than the previous segment's label.

4. Finally, surround the stitched page with a white border. This step is optional: choose the border size, or skip the border entirely, according to your own needs.
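Steps 2 to 4 can be sketched compactly with NumPy alone. This is a simplified illustration under stated assumptions, not the script given later in this article: the function name `stitch_segments` and its parameters are hypothetical, segments are assumed to be H x W x 3 uint8 arrays, and the padding is pure white rather than a copy of a background column.

```python
import numpy as np

def stitch_segments(segments, pad_left_flags, border=(500, 1000)):
    """Stack line images into one page and add a white border.

    Hypothetical helper. segments: list of H x W x 3 uint8 arrays.
    pad_left_flags: one bool per segment, True = pad on the left.
    border: (horizontal, vertical) margin in pixels.
    """
    max_width = max(seg.shape[1] for seg in segments)
    rows = []
    for seg, pad_left in zip(segments, pad_left_flags):
        missing = max_width - seg.shape[1]
        # White filler with the same height as the segment (may be 0 wide).
        filler = np.full((seg.shape[0], missing, 3), 255, dtype=seg.dtype)
        # Paragraph starts are padded on the left, all other lines on the right.
        rows.append(np.hstack((filler, seg) if pad_left else (seg, filler)))
    page = np.vstack(rows)
    bw, bh = border
    # Constant white margin: bh rows above and below, bw columns left and right.
    return np.pad(page, ((bh, bh), (bw, bw), (0, 0)), constant_values=255)
```

Padding with pure white is the simplest choice. Copying a background column from the segment itself, as the full script does, avoids a visible seam when the scanned paper is not perfectly white.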

 

The main idea of label generation:

1. Each label line consists of two parts: the line-level bbox coordinates and the characters of that line. The format is easy to change to suit your own needs.

2. The bbox coordinates are generated first. Then the label of each segment is read and written into the new page-level label file. For how the coordinates are computed, please read the code directly, since describing it in words could easily cause misunderstanding.
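The coordinate arithmetic can also be looked at in isolation. The sketch below is a minimal illustration, assuming the widths, heights and padding side of every segment are already known. The names `line_boxes`, `pad_left_flags` and `border` are hypothetical and do not appear in the script.

```python
def line_boxes(widths, heights, pad_left_flags, border=(500, 1000)):
    """Return one 4-corner box per line, in page coordinates.

    Hypothetical helper. border is the (horizontal, vertical) white margin
    added around the stitched page.
    """
    bw, bh = border
    max_width = max(widths)
    boxes = []
    top = bh  # y coordinate of the current line's top edge
    for width, height, pad_left in zip(widths, heights, pad_left_flags):
        padding = max_width - width
        # A left-padded line starts after its padding, others at the margin.
        x1 = bw + padding if pad_left else bw
        x2 = x1 + width
        y1, y2 = top, top + height
        # Corner order: top-left, top-right, bottom-right, bottom-left.
        boxes.append([(x1, y1), (x2, y1), (x2, y2), (x1, y2)])
        top = y2  # the next line starts where this one ends
    return boxes
```

With a border of (500, 1000) and a maximum width of 2160, a left-padded first line of width 1989 and height 120 starts at x = 500 + 171 = 671 and spans y = 1000 to 1120, which matches the first row of the 001-P16 label preview.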

 

Code part:

import numpy as np
import cv2
import os
from glob import glob
import re
from tqdm import tqdm

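def get_char_nums(segments):
    # For each segment image, read the first line of its label file
    # (same base name, .txt, under the global label_root) and return
    # the character counts and the texts.
    nums = []
    chars = []
    for seg in segments:
        label_head = seg.split('.')[0]
        label_name = label_head + '.txt'
        with open(os.path.join(label_root,label_name), 'r', encoding='utf-8') as f:
            # drop the line break so the page label does not get blank lines
            line = f.readline().rstrip('\r\n')
            nums.append(len(line))
            chars.append(line)
    return nums, chars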

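def addZeros(s_):
    # Zero-pad the index after '_' to 4 digits so that sorting the
    # names as strings gives the numeric order (2 before 10).
    head, tail = s_.split('_')
    num = ''.join(re.findall(r'\d',tail))
    head_num = '0'*(4-len(num)) + num
    return head + '_' + head_num + '.jpg'

def strsort(alist):
    # Sort segment file names in place by their zero-padded form.
    alist.sort(key=lambda i:addZeros(i))
    return alist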

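def pad(img, headpad, padding):
    # Widen img by `padding` columns: on the left if headpad is True,
    # otherwise on the right.
    assert padding>=0
    if padding>0:
        # mark near-white pixels (brighter than 95% of 255)
        logi_matrix = np.where(img > 255*0.95, np.ones_like(img), np.zeros_like(img))
        # columns whose pixels are all near-white (checked per channel)
        ids = np.where(np.sum(logi_matrix, 0) == img.shape[0])
        if ids[0].tolist() != []:
            # repeat the last such background column so the padding matches the paper
            pad_array = np.tile(img[:,ids[0].tolist()[-1],:], (1, padding)).reshape((img.shape[0],-1,3))
        else:
            # no clean background column found: pad with pure white
            pad_array = np.tile(np.ones_like(img[:, 0, :]) * 255, (1, padding)).reshape((img.shape[0], -1, 3))
        if headpad:
            return np.hstack((pad_array, img))
        else:
            return np.hstack((img, pad_array))
    else:
        return img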

def pad_peripheral(img, pad_size):
    assert isinstance(pad_size,tuple)
    w, h = pad_size
    result = cv2.copyMakeBorder(img, h, h, w, w, cv2.BORDER_CONSTANT, value=[255, 255, 255])
    return result



if __name__=='__main__':
    label_root = r'G:\ocr\HWDB2.xTrain_label'
    label_det = r'G:\ocr\HWDB2.xTrain_fullLabels'
    pages_root = r'G:\ocr\HWDB2.xTrain_images'
    pages_det = r'G:\ocr\HWDB2.xTrain_fullpages'
    # create the output directories (the input directories must already exist)
    os.makedirs(label_det, exist_ok=True)
    os.makedirs(pages_det, exist_ok=True)
    pages_for_set = os.listdir(pages_root)
    pages_set = set([pfs.split('_')[0] for pfs in pages_for_set])
    for ds in tqdm(pages_set):
        boxes = []
        pages = []
        # match the page prefix exactly; a substring test could also pick up other pages
        seg_sorted = strsort([d for d in pages_for_set if d.split('_')[0] == ds])
        widths = [cv2.imread(os.path.join(pages_root, d)).shape[1] for d in seg_sorted]
        heights = [cv2.imread(os.path.join(pages_root, d)).shape[0] for d in seg_sorted]
        max_width = max(widths)
        seg_nums, chars = get_char_nums(seg_sorted)
        pad_size = (500, 1000)
        w, h = pad_size
        label_name = ds + '.txt'
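        with open(os.path.join(label_det, label_name), 'w', encoding='utf-8') as f:
            for i,pg in enumerate(seg_sorted):
                # pad on the left for the first segment, or when this line has over
                # 5 more characters than the previous one (treated as a paragraph start)
                headpad = i == 0 or seg_nums[i] - seg_nums[i-1] > 5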
                pg_read = cv2.imread(os.path.join(pages_root, pg))
                padding = max_width - pg_read.shape[1]
                page_new = pad(pg_read, headpad, padding)
                pages.append(page_new)
                if headpad:
                    x1 = str(w + padding)
                    x2 = str(w + max_width)
                    y1 = str(h + sum(heights[:i+1]) - heights[i])
                    y2 = str(h + sum(heights[:i+1]))
                    box = np.array([int(x1),int(y1),int(x2),int(y1),int(x2),int(y2),int(x1),int(y2)])
                else:
                    x1 = str(w)
                    x2 = str(w + max_width - padding)
                    y1 = str(h + sum(heights[:i + 1]) - heights[i])
                    y2 = str(h + sum(heights[:i + 1]))
                    box = np.array([int(x1), int(y1), int(x2), int(y1), int(x2), int(y2), int(x1), int(y2)])
                boxes.append(box.reshape((4,2)))
                char = chars[i]
                f.write(','.join([x1, y1, x2, y1, x2, y2, x1, y2, char]) + '\n')
        pages_array = np.vstack(pages)
        pages_array = pad_peripheral(pages_array,pad_size)
        pages_name = ds + '.jpg'
        # cv2.polylines(pages_array, [box.astype('int32') for box in boxes], True, (0, 0, 255))
        cv2.imwrite(os.path.join(pages_det, pages_name),pages_array)
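
The grouping and ordering of segments can also be written as one small helper that compares indices as integers instead of zero-padded strings. This is only a sketch: the name `group_segments` is hypothetical, and it assumes every file name has the form `<page>_<index>.<ext>` with a purely numeric index.

```python
from collections import defaultdict

def group_segments(filenames):
    """Group segment file names by page prefix, ordered by numeric index.

    Hypothetical helper: expects names such as '001-P16_0.jpg'.
    """
    pages = defaultdict(list)
    for name in filenames:
        stem = name.rsplit(".", 1)[0]               # drop the extension
        page_id, _, index = stem.rpartition("_")    # split at the last '_'
        pages[page_id].append((int(index), name))
    # Sort each page's segments by integer index, then keep only the names.
    return {pid: [n for _, n in sorted(items)] for pid, items in pages.items()}
```

Grouping by the exact prefix keeps the segments of different pages apart even when one page identifier happens to be contained in another.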

group:

Tags: Pytorch Deep Learning

Posted by nwoeddie23 on Sat, 14 May 2022 10:42:27 +0300