Several ways of handling CAPTCHAs in a crawler, wrapped into classes. The full source code is at the end of the article!

To be honest, a lot has been going on lately. I took part in the Blue Bridge Cup and have been preparing for some certificate exams, so the crawler posts were put aside for a while. I did slack off a little, which isn't right. I'm in my junior year and still wondering whether to take the postgraduate entrance exam, keep updating my skills like this, or keep digging into crawlers. I don't know whether that road leads anywhere, and I can't quite see the light yet. It took me more than a month of evenings, but I finally got through Django. What remains is working through the official documentation and some practice projects, and I plan to open a column recording my painful Django learning journey. Learning is like that: if you don't keep learning, you fall behind. Some things are just baffling, though. It is really strange that someone's scholarship depends on connections, and that the winner of a project simply had the name changed at midnight,...

Never mind all that. This post wraps two good ways of handling image CAPTCHAs: Baidu's AIP and the recently popular muggle_ocr.
I mainly want to talk about Baidu's AIP here, because it offers a lot. I also added a function for detecting pornographic images; have a play with it if you're interested. On a side note, after learning to write crawlers I was really overwhelmed by how many such images and websites there are. I hope the "clean up the net" campaign works harder. I won't ramble on; let's look at the actual code.

This article covers how to handle CAPTCHAs in a crawler and wraps the methods into a class for reuse. It shows how to call Baidu AIP and how to use the recent open source library muggle_ocr. Feel free to read, like and bookmark!


Learning to call Baidu's AIP interface:

1. First, register an account:

https://login.bce.baidu.com/

Log in after registering.

2. Create a project

Find text recognition (OCR) among the listed technologies and click to create a project.

Once it is created, the console shows its AppID, API Key and Secret Key. These three values are needed later.
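
Hard-coding these three values in a script makes them easy to leak when the code is shared. Here is a minimal sketch of one alternative, assuming you export them as environment variables first. The variable names BAIDU_APP_ID, BAIDU_API_KEY and BAIDU_SECRET_KEY and the helper load_baidu_credentials are my own choices for illustration, not something Baidu defines:

```python
import os


def load_baidu_credentials(env=None):
    """Read the three Baidu credentials from environment variables.

    The variable names are hypothetical; pick whatever suits your setup.
    """
    env = os.environ if env is None else env
    names = ("BAIDU_APP_ID", "BAIDU_API_KEY", "BAIDU_SECRET_KEY")
    # Collect every missing name so the error message is complete
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in names)
```

Once baidu-aip is installed, the returned tuple can be unpacked straight into the client constructor, for example AipOcr(*load_baidu_credentials()).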

Next, you can consult the official documentation or use the code I wrote directly.

3. Install the dependency: pip install baidu-aip

The method below is only the calling part; it relies on the settings made beforehand in __init__.

    def return_ocr_by_baidu(self, test_image):
        """
        Note: fill in your own baidu_aip settings in __init__ first.

        This method uses the high-precision version.
                    If it is too slow, switch back to the normal version:
                    self.client.basicGeneral(image, options)
                    Reference:
                    https://cloud.baidu.com/doc/OCR/s/3k3h7yeqa
        :param test_image: path of the image file to recognize
        :return: the recognized text of the CAPTCHA; if recognition fails, the call can simply be repeated
        """
        image = self.return_image_content(test_image=self.return_path(test_image))

        # Optional parameters; the full list is in the reference above
        options = {}
        options["detect_direction"] = "true"
        options["probability"] = "true"

        # Call general text recognition (high-precision version)
        result = self.client.basicAccurate(image, options)
        # An error response or an empty result has no usable 'words_result'
        words = result.get('words_result') or []
        if not words or not words[0]['words'].strip():
            raise Exception("The result is None , try it !")
        result_s = words[0]['words'].strip()
        # Comment this out if you don't want the result printed
        print(result_s)
        return result_s
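
The docstring above suggests simply calling the method again when recognition fails. Here is a minimal sketch of that idea as a standalone helper. It assumes that a failed request returns a dict with error_code and error_msg keys, as Baidu's OCR documentation describes. The names OcrError, extract_text and ocr_with_retry are hypothetical helpers, not part of baidu-aip:

```python
import time


class OcrError(Exception):
    """Raised when an OCR response contains no usable text."""


def extract_text(result):
    """Return the first recognized line from a Baidu OCR response dict."""
    if "error_code" in result:
        raise OcrError(f"API error {result['error_code']}: {result.get('error_msg')}")
    words = result.get("words_result") or []
    if not words or not words[0].get("words", "").strip():
        raise OcrError("No text recognized")
    return words[0]["words"].strip()


def ocr_with_retry(call, attempts=3, delay=0.5):
    """Run call() until extract_text succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return extract_text(call())
        except OcrError as exc:
            last_error = exc
            # Wait a little before the next try, but not after the last one
            if attempt < attempts - 1:
                time.sleep(delay)
    raise last_error
```

Inside the class this could be used as ocr_with_retry(lambda: self.client.basicAccurate(image, options)).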

Extension: Baidu's pornographic image detection interface:

Writing code should be fun. It can't be all boring, can it?

The detection interface lives under content audit in the Baidu console. Just look for it there.

Source code of the call:

# -*- coding :  utf-8 -*-
# @Time      :  2020/10/22  17:30
# @author: the hourglass is raining
# @Software  :  PyCharm
# @CSDN      :  https://me.csdn.net/qq_45906219

from aip import AipContentCensor
from ocr import MyOrc


class Auditing(MyOrc):
    """
    Wrapper around Baidu's content audit AIP interface.
    It is mainly used to detect pornographic, terror-related, disgusting and similar content.
    website:  https://ai.baidu.com/ai-doc/ANTIPORN/tk3h6xgkn
    """

    def __init__(self):
        # super().__init__()
        APP_ID = 'Fill in your ID'
        API_KEY = 'Fill in your KEY'
        SECRET_KEY = 'Fill in your SECRET_KEY'

        self.client = AipContentCensor(APP_ID, API_KEY, SECRET_KEY)

    def return_path(self, test_image):
        return super().return_path(test_image)

    def return_image_content(self, test_image):
        return super().return_image_content(test_image)

    def return_Content_by_baidu_of_image(self, test_image, mode=0):
        """
        Inherits from the OCR class so the helpers are shared and less code is needed.
        Content review: checks whether the image contains illegal or harmful content.
        Content review can also check text, but that seemed of little use to me, so it is not wrapped here.
        url: https://ai.baidu.com/ai-doc/ANTIPORN/Wk3h6xg56
        :param test_image: the image to check, either a local file or a URL
        :param mode: 0 (default) means test_image is a local file; 1 means it is an image URL
        :return: the recognition result
        """
        if mode == 0:
            # Local file: pass the raw image bytes
            image = self.return_image_content(self.return_path(test_image=test_image))
        elif mode == 1:
            # URL: pass the address string as it is
            image = test_image
        else:
            raise ValueError(f"The mode is 0 or 1 but your mode is {mode}")
        # Call the image review interface
        result = self.client.imageCensorUserDefined(image)

        # With a URL the call looks like this:
        # result = self.client.imageCensorUserDefined('http://www.example.com/image.jpg')
        print(result)
        return result


a = Auditing()
a.return_Content_by_baidu_of_image("test_image/2.jpg", mode=0)
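
The call above just prints the raw response. Here is a minimal sketch of turning it into a verdict. It assumes the response carries a numeric conclusionType (1 compliant, 2 non-compliant, 3 suspected, 4 review failed) and a data list whose items have a msg field, which is how I read Baidu's content audit documentation. Check the current docs before relying on it. summarize_censor_result is a hypothetical helper name:

```python
# Assumed meaning of conclusionType, based on Baidu's content audit docs
CONCLUSIONS = {1: "compliant", 2: "non-compliant", 3: "suspected", 4: "review failed"}


def summarize_censor_result(result):
    """Return (verdict, reasons) from a content audit response dict."""
    if "error_code" in result:
        raise RuntimeError(f"API error {result['error_code']}: {result.get('error_msg')}")
    verdict = CONCLUSIONS.get(result.get("conclusionType"), "unknown")
    # Each entry in 'data' describes one rule that the image hit
    reasons = [item.get("msg", "") for item in result.get("data") or []]
    return verdict, reasons
```

A crawler could then keep only the images whose verdict is "compliant" and log the reasons for the rest.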

Learning the muggle_ocr recognition interface:

This package has become popular recently. It is very simple to use and does not have many other features.

  1. Install: pip install muggle-ocr

      The download is a little slow; a mobile hotspot may work better.
      The current mirror sites (Tsinghua/Alibaba) have not picked up this package yet because it is a very new OCR model.

  2. Call the interface

    def return_ocr_by_muggle(self, test_image, mode=1):
        """
            Recognize an image with muggle_ocr.
            :param test_image: path of the image file to recognize, preferably an absolute path
            :param mode: 0 means ModelType.OCR, for plain printed text;
                  1 (default) means ModelType.Captcha, for simple 4-6 character alphanumeric CAPTCHAs

            Official website: https://pypi.org/project/muggle-ocr/
            :return: the recognized text of the CAPTCHA; if recognition fails, the call can simply be repeated
        """
        # Pick the model that matches the mode
        if mode == 1:
            sdk = muggle_ocr.SDK(model_type=muggle_ocr.ModelType.Captcha)
        elif mode == 0:
            sdk = muggle_ocr.SDK(model_type=muggle_ocr.ModelType.OCR)
        else:
            raise ValueError(f"The mode is 0 or 1 , but your mode == {mode}")

        filepath = self.return_path(test_image=test_image)

        with open(filepath, 'rb') as fr:
            captcha_bytes = fr.read()
            result = sdk.predict(image_bytes=captcha_bytes)
            # Comment this out if you don't want the result printed
            print(result)
            return result.strip()
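
The method above builds a new muggle_ocr.SDK on every call. When a crawler recognizes many CAPTCHAs in a row, it may be worth building the SDK once per mode and reusing it. Here is a minimal sketch using functools.lru_cache, assuming that constructing the SDK is the expensive part. get_sdk and recognize are hypothetical helper names:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def get_sdk(mode):
    """Build the muggle_ocr SDK for this mode once and reuse it afterwards."""
    import muggle_ocr  # imported lazily, only when an SDK is first needed

    if mode == 1:
        model_type = muggle_ocr.ModelType.Captcha
    elif mode == 0:
        model_type = muggle_ocr.ModelType.OCR
    else:
        raise ValueError(f"The mode is 0 or 1 , but your mode == {mode}")
    return muggle_ocr.SDK(model_type=model_type)


def recognize(image_bytes, mode=1):
    """Recognize raw image bytes with the cached SDK for the given mode."""
    return get_sdk(mode).predict(image_bytes=image_bytes).strip()
```

Inside the class, the two branches that create sdk could then be replaced by sdk = get_sdk(mode).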

Full source code of the wrapper class:

# -*- coding :  utf-8 -*-
# @Time      :  2020/10/22  14:12
# @author: the hourglass is raining
# @Software  :  PyCharm
# @CSDN      :  https://me.csdn.net/qq_45906219

import muggle_ocr
import os
from aip import AipOcr

"""
    PS: This module wraps two commonly used image/CAPTCHA recognition methods in one class; which one to use is up to you.
    
    Interface 1: muggle_ocr 
          pip install muggle-ocr  (the download is a little slow; a mobile hotspot may work better)
          The current mirror sites (Tsinghua/Alibaba) have not picked up this package yet because it is a very new OCR model.
          
    Interface 2: baidu-aip
          pip install baidu-aip
          Many people probably know this one already, but I think the new muggle package is a strong competitor.
          For the calling method, see the official documentation: https://cloud.baidu.com/doc/OCR/index.html
          or simply use the methods below.
    :param image_path  path of the image to recognize; if it sits deep in the directory tree, an absolute path is recommended
    
"""


class MyOrc:
    def __init__(self):
        # Fill in the required settings with your own Baidu AIP credentials
        APP_ID = 'Your ID'
        API_KEY = 'Your KEY'
        SECRET_KEY = 'Your SECRET_KEY'

        self.client = AipOcr(APP_ID, API_KEY, SECRET_KEY)

    def return_path(self, test_image):

        """:return abs image_path"""
        # Determine the path
        if os.path.isabs(test_image):
            filepath = test_image
        else:
            filepath = os.path.abspath(test_image)
        return filepath

    def return_image_content(self, test_image):
        """:return the image content """
        with open(test_image, 'rb') as fr:
            return fr.read()

    def return_ocr_by_baidu(self, test_image):
        """
        Note: fill in your own baidu_aip settings in __init__ first.

        This method uses the high-precision version.
                    If it is too slow, switch back to the normal version:
                    self.client.basicGeneral(image, options)
                    Reference:
                    https://cloud.baidu.com/doc/OCR/s/3k3h7yeqa
        :param test_image: path of the image file to recognize
        :return: the recognized text of the CAPTCHA; if recognition fails, the call can simply be repeated
        """
        image = self.return_image_content(test_image=self.return_path(test_image))

        # Optional parameters; the full list is in the reference above
        options = {}
        options["detect_direction"] = "true"
        options["probability"] = "true"

        # Call general text recognition (high-precision version)
        result = self.client.basicAccurate(image, options)
        # An error response or an empty result has no usable 'words_result'
        words = result.get('words_result') or []
        if not words or not words[0]['words'].strip():
            raise Exception("The result is None , try it !")
        result_s = words[0]['words'].strip()
        # Comment this out if you don't want the result printed
        print(result_s)
        return result_s

    def return_ocr_by_muggle(self, test_image, mode=1):
        """
            Recognize an image with muggle_ocr.
            :param test_image: path of the image file to recognize, preferably an absolute path
            :param mode: 0 means ModelType.OCR, for plain printed text;
                  1 (default) means ModelType.Captcha, for simple 4-6 character alphanumeric CAPTCHAs

            Official website: https://pypi.org/project/muggle-ocr/
            :return: the recognized text of the CAPTCHA; if recognition fails, the call can simply be repeated
        """
        # Pick the model that matches the mode
        if mode == 1:
            sdk = muggle_ocr.SDK(model_type=muggle_ocr.ModelType.Captcha)
        elif mode == 0:
            sdk = muggle_ocr.SDK(model_type=muggle_ocr.ModelType.OCR)
        else:
            raise ValueError(f"The mode is 0 or 1 , but your mode == {mode}")

        filepath = self.return_path(test_image=test_image)

        with open(filepath, 'rb') as fr:
            captcha_bytes = fr.read()
            result = sdk.predict(image_bytes=captcha_bytes)
            # Comment this out if you don't want the result printed
            print(result)
            return result.strip()


# a = MyOrc()

# a.return_ocr_by_baidu(test_image='test_image/digit_img_1.png')
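
Since the class offers two recognizers, one natural way to use it is to try the local one first and fall back to the online API only when that fails. Here is a minimal sketch of the idea. first_successful is a hypothetical helper, and treating an empty string as a failure is my own assumption:

```python
def first_successful(image_path, recognizers):
    """Try each recognizer in order and return the first non-empty result."""
    errors = []
    for recognize in recognizers:
        name = getattr(recognize, "__name__", repr(recognize))
        try:
            text = recognize(image_path)
        except Exception as exc:  # one failing backend should not stop the chain
            errors.append(f"{name}: {exc}")
            continue
        if text:
            return text
        errors.append(f"{name}: empty result")
    raise RuntimeError("All recognizers failed: " + "; ".join(errors))
```

With the class above this might look like first_successful('test_image/digit_img_1.png', [ocr.return_ocr_by_muggle, ocr.return_ocr_by_baidu]), where ocr = MyOrc().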


Tags: Python crawler

Posted by jay1318 on Tue, 10 May 2022 03:10:11 +0300