Learn this skill and the young lady will look at you differently

Preface

Today my classmate Xiaoli came to me. One of her experiments needs data from Qingsongchou (轻松筹, qschou.com) for an analysis, but she doesn't have enough of it. What to do?
I'm always happy to help others, so of course I couldn't ignore it~
(P.S. After all, she's a young lady. It wouldn't be right to refuse, would it?)

So I grabbed my tools and got straight to work.

1, Crawler analysis

A quick analysis shows that Qingsongchou exposes an interface that returns the data for a given project.

The address is as follows, where xxxxxx stands for the project's UUID:

https://gateway.qschou.com/v3.0.0/project/index/text3/xxxxxx

In other words, as long as we have a UUID, we can send a request with requests and easily get the data. Sounds easy, so why not just enumerate the UUIDs?
However, a UUID has 32 hexadecimal digits (36 characters with the hyphens). If you tried to enumerate them all, you would still be waiting next year, and long after that.
So the key to solving the problem (whispering: "and to winning the young lady's favor") is to find a way to obtain the projects' UUIDs.
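To see why enumeration is hopeless, here is a rough back-of-the-envelope sketch. The rate of 1,000 requests per second is purely an assumption for illustration, and `project_api_url` is a hypothetical helper that only fills a UUID into the address shown above.

```python
# Rough estimate of how long enumerating every possible UUID would take.
API_TEMPLATE = "https://gateway.qschou.com/v3.0.0/project/index/text3/%s"


def project_api_url(projuuid):
    """Fill a project UUID into the interface address (hypothetical helper)."""
    return API_TEMPLATE % projuuid


# A UUID is written as 32 hexadecimal digits, so there are at most 16**32 values.
SEARCH_SPACE = 16 ** 32


def years_to_enumerate(requests_per_second):
    """Years needed to try every candidate at the given (assumed) request rate."""
    seconds_per_year = 365 * 24 * 3600
    return SEARCH_SPACE / requests_per_second / seconds_per_year


print(project_api_url("xxxxxx"))
print("%.2e years" % years_to_enumerate(1000))  # assumed 1,000 requests/s
```

Whatever rate you plug in, the result stays astronomically large, which is why the rest of the article collects real UUIDs instead of guessing them.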

2, Crawling the project IDs

I couldn't find the UUIDs on the official website, so I had to look for another way. In the end I decided to start from Baidu Tieba: search for related posts and extract the UUID from the project links in those posts.

1. Grab the URL of the post

Traverse the whole Qingsongchou bar (轻松筹吧) and grab the URL of each post:

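import re
import requests

# Tieba listing page; kw is the bar name (轻松筹, Qingsongchou); pn = 0, 50, 100, ...
url_tieba = 'https://tieba.baidu.com/f?kw=轻松筹&ie=utf-8&pn=%d'
p_href = 'href="/p/(.*?)"'
hrefs = []
for i in range(1):  # Number of pages to crawl
    try:
        pn = i * 50
        res = requests.get(url_tieba % pn)
        html = res.text
        list_href = re.findall(p_href, html)
        for h in list_href:
            hrefs.append('https://tieba.baidu.com/p/' + h)
        print('Page %d fetched successfully' % (i + 1))
    except requests.RequestException as e:
        # Skip a page that fails to download instead of aborting the crawl
        print('Page %d failed: %s' % (i + 1, e))
# Save the post URLs, one per line
with open('tiezi_url.txt', 'w') as f:
    for h in hrefs:
        f.write(h + '\n')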
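The extraction step can be tried offline. The sketch below runs the same kind of regular expression against a hand-written HTML fragment (the sample markup is an assumption, not real Tieba output). It uses a stricter pattern, `\d+`, and removes duplicates, since the same post may be linked more than once on a page. Both are variations on the code above, not the author's method.

```python
import re

# Stricter variant of p_href: a post ID is assumed to consist of digits only.
P_HREF = re.compile(r'href="/p/(\d+)"')


def extract_post_urls(html):
    """Return unique post URLs found in a listing page, in order of appearance."""
    seen = set()
    urls = []
    for post_id in P_HREF.findall(html):
        if post_id not in seen:
            seen.add(post_id)
            urls.append('https://tieba.baidu.com/p/' + post_id)
    return urls


# Made-up sample markup, for illustration only.
sample_html = '''
<a href="/p/1234567890" class="j_th_tit">post A</a>
<a href="/p/1234567890" class="j_th_tit">post A again</a>
<a href="/p/987654321" class="j_th_tit">post B</a>
<a href="/home/main?un=someone">not a post</a>
'''

print(extract_post_urls(sample_html))
```

Keeping the parsing in a small function like this makes it easy to check the pattern without sending any requests.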

 

2. Extract the UUID in the post

There are two main ways to get the UUID of each post:
(1) Extract from links in posts
First define the regular expression that matches the link:

p_projuuid = 'https://m2.qschou.com.*?projuuid=(.*?)&'

Match the pattern against each post:

projuuids = []
for h in hrefs:
    try:
        res = requests.get(h)
        html = res.text
        # Look for direct project links in the post
        list_projuuid = re.findall(p_projuuid, html)
        if len(list_projuuid) == 0:
            print('%s: no direct project link found' % h)
        for p in list_projuuid:
            # Keep only 36-character UUIDs that have not been seen before
            if p not in projuuids and len(p) == 36:
                projuuids.append(p)
                print('%s: extracted projuuid %s' % (h, p))
    except requests.RequestException:
        # Skip posts that fail to download
        pass
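One detail of the link pattern is worth checking on sample text: because it ends with `&`, it only matches when `projuuid` is followed by another query parameter. The sketch below shows this on two made-up links (the UUID, the path, and the `from` parameter are invented) and compares it with a stricter, hypothetical pattern that matches the 36-character UUID shape directly.

```python
import re

p_projuuid = 'https://m2.qschou.com.*?projuuid=(.*?)&'  # pattern from the article
# Hypothetical stricter pattern: 8-4-4-4-12 hexadecimal digits, no trailing '&' needed.
p_strict = r'projuuid=([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})'

uuid_demo = '123e4567-e89b-12d3-a456-426614174000'  # made-up UUID
link_mid = 'https://m2.qschou.com/project/index.html?projuuid=%s&from=tieba' % uuid_demo
link_end = 'https://m2.qschou.com/project/index.html?from=tieba&projuuid=%s' % uuid_demo

matches_mid = re.findall(p_projuuid, link_mid)  # projuuid followed by '&': found
matches_end = re.findall(p_projuuid, link_end)  # projuuid is the last parameter: missed
strict_end = re.findall(p_strict, link_end)     # stricter pattern: found

print(matches_mid, matches_end, strict_end)
```

If posts in practice always carry extra parameters after `projuuid`, the original pattern is enough; the stricter one simply avoids depending on that.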

 

(2) Extract from the QR code in the post
First, we have to match all the pictures in the post, as follows:

p_img = 'class="BDE_Image" src="(.*?)"'

Then read the QR code in each picture. If it contains a project link, extract the UUID:

for h in hrefs:
    try:
        res = requests.get(h)
        html = res.text
        list_projuuid = re.findall(p_projuuid, html)  # Is there a direct link in the post?
        if len(list_projuuid) != 0:
            for p in list_projuuid:
                if p not in projuuids and len(p) == 36:
                    projuuids.append(p)
                    print('%s: extracted projuuid %s' % (h, p))
        else:  # No direct link: extract the pictures in the post and read their QR codes
            list_img = re.findall(p_img, html)
            if len(list_img) == 0:
                print('%s: unable to extract projuuid' % h)
            else:
                for i in list_img:
                    txt_list = get_ewm(i)
                    if len(txt_list) != 0:
                        # Only the first code decoded from the picture is used
                        barcodeData = txt_list[0].data.decode("utf-8")
                        for p in re.findall('projuuid=(.*)', barcodeData):
                            if p not in projuuids and len(p) == 36:
                                projuuids.append(p)
                                print('%s: extracted projuuid %s' % (h, p))
    except Exception as e:
        # Skip posts that fail to download or decode, but report why
        print('%s: skipped (%s)' % (h, e))
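The `len(p) == 36` check accepts any 36-character string. As an alternative sketch (not the author's code), the standard-library `uuid` module can validate the shape, and a set makes the duplicate check fast. `is_uuid` and `add_projuuid` are hypothetical helper names, and the sample values are made up.

```python
import uuid


def is_uuid(text):
    """True if text is a 36-character UUID written as 8-4-4-4-12 hex digits."""
    try:
        return len(text) == 36 and str(uuid.UUID(text)) == text.lower()
    except ValueError:
        return False


def add_projuuid(candidate, seen, ordered):
    """Record candidate if it is a valid UUID not seen before; return True if added."""
    if is_uuid(candidate) and candidate not in seen:
        seen.add(candidate)
        ordered.append(candidate)
        return True
    return False


seen, projuuids = set(), []
for p in ['123e4567-e89b-12d3-a456-426614174000',  # valid (made-up)
          '123e4567-e89b-12d3-a456-426614174000',  # duplicate
          'x' * 36,                                 # right length, wrong shape
          '123e4567e89b12d3a456426614174000']:      # hyphens missing
    add_projuuid(p, seen, projuuids)

print(projuuids)
```

The list keeps the order in which UUIDs were found, while the set only serves the membership test.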

 

The function that reads the QR code is as follows:

def get_ewm(img_adds):
    # Read the content of a QR code. img_adds: address of the QR code image (a URL or a local path)
    if os.path.isfile(img_adds):
        # Load QR code pictures locally
        img = Image.open(img_adds)
    else:
        # Download and load QR code pictures from the network
        rq_img = requests.get(img_adds).content
        img = Image.open(BytesIO(rq_img))

    txt_list = pyzbar.decode(img)
    #barcodeData = txt_list[0].data.decode("utf-8")
    return txt_list
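Once pyzbar has decoded a picture, the payload is just text, usually a URL. The sketch below shows one way to pull `projuuid` out of such a payload with the standard library's URL parser instead of a greedy regular expression. The sample payload is made up, and `projuuid_from_payload` is a hypothetical helper, not part of pyzbar.

```python
from urllib.parse import urlsplit, parse_qs


def projuuid_from_payload(payload):
    """Return the projuuid query parameter of a decoded QR payload, or None."""
    query = urlsplit(payload).query
    values = parse_qs(query).get('projuuid', [])
    return values[0] if values else None


# The article's code decodes the bytes in each result's .data attribute the same way.
raw = b'https://m2.qschou.com/project/index.html?projuuid=123e4567-e89b-12d3-a456-426614174000&from=qr'

print(projuuid_from_payload(raw.decode('utf-8')))
print(projuuid_from_payload('plain text, not a link'))
```

Unlike `'projuuid=(.*)'`, this stops at the next `&`, so a trailing parameter does not make the value longer than 36 characters and cause it to be discarded.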

 

Note: the pyzbar library is needed here. You can install it with pip:

pip install pyzbar

3. Complete code

import re
import os
import requests
from PIL import Image
from io import BytesIO
from pyzbar import pyzbar
 
 
def get_ewm(img_adds):
    # Read the content of a QR code. img_adds: address of the QR code image (a URL or a local path)
    if os.path.isfile(img_adds):
        # Load QR code pictures locally
        img = Image.open(img_adds)
    else:
        # Download and load QR code pictures from the network
        rq_img = requests.get(img_adds).content
        img = Image.open(BytesIO(rq_img))

    txt_list = pyzbar.decode(img)
    #barcodeData = txt_list[0].data.decode("utf-8")
    return txt_list
 
if __name__ == '__main__':
   # Get the URL of each post
   # Tieba listing page; kw is the bar name (轻松筹, Qingsongchou); pn = 0, 50, 100, ...
   url_tieba = 'https://tieba.baidu.com/f?kw=轻松筹&ie=utf-8&pn=%d'
   p_href = 'href="/p/(.*?)"'
   hrefs = []
   for i in range(1):  # Number of pages to crawl
       try:
           pn = i * 50
           res = requests.get(url_tieba % pn)
           html = res.text
           list_href = re.findall(p_href, html)
           for h in list_href:
               hrefs.append('https://tieba.baidu.com/p/' + h)
           print('Page %d fetched successfully' % (i + 1))
       except requests.RequestException as e:
           # Skip a page that fails to download instead of aborting the crawl
           print('Page %d failed: %s' % (i + 1, e))
   with open('tiezi_url.txt', 'w') as f:
       for h in hrefs:
           f.write(h + '\n')
   # Crawl the links in each post
   projuuids = []
   p_projuuid = 'https://m2.qschou.com.*?projuuid=(.*?)&'
   p_img = 'class="BDE_Image" src="(.*?)"'
   for h in hrefs:
       try:
           res = requests.get(h)
           html = res.text
           list_projuuid = re.findall(p_projuuid, html)  # Is there a direct link in the post?
           if len(list_projuuid) != 0:
               for p in list_projuuid:
                   if p not in projuuids and len(p) == 36:
                       projuuids.append(p)
                       print('%s: extracted projuuid %s' % (h, p))
           else:  # No direct link: extract the pictures in the post and read their QR codes
               list_img = re.findall(p_img, html)
               if len(list_img) == 0:
                   print('%s: unable to extract projuuid' % h)
               else:
                   for i in list_img:
                       txt_list = get_ewm(i)
                       if len(txt_list) != 0:
                           barcodeData = txt_list[0].data.decode("utf-8")
                           for p in re.findall('projuuid=(.*)', barcodeData):
                               if p not in projuuids and len(p) == 36:
                                   projuuids.append(p)
                                   print('%s: extracted projuuid %s' % (h, p))
       except Exception as e:
           # Skip posts that fail to download or decode, but report why
           print('%s: skipped (%s)' % (h, e))

Posted by techtheatre on Fri, 13 May 2022 05:17:01 +0300