Preface
Today my classmate Xiaoli came to me: an experiment of hers needs data from Qingsongchou (the "Easy Fundraising" crowdfunding platform) for analysis, but she doesn't have nearly enough of it. What to do?
I'm always happy to help, so of course I wasn't going to ignore her~
(PS: after all, she is a young lady, and it wouldn't be nice to refuse, right?)
So I grabbed my tools and got to work.
1, Crawler analysis
A bit of analysis shows that Qingsongchou exposes an interface that returns the data of a given project.
The address is as follows, where xxxxxx stands for the project's UUID:
https://gateway.qschou.com/v3.0.0/project/index/text3/xxxxxx
In other words, as long as we have a project's UUID, we can hit this interface with requests and get the data with no effort at all. So why not just enumerate UUIDs?
Unfortunately, a UUID is 32 hexadecimal characters long; brute-forcing that space would take until the end of time.
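For reference, once you do have a valid UUID, fetching a project's data really is trivial with requests. A minimal sketch (the projuuid below is a made-up placeholder, and the response fields are not documented here):

import requests

# Placeholder UUID, only to show the request shape -- it will not hit a real project
projuuid = 'abcdef12-3456-7890-abcd-ef1234567890'
url = 'https://gateway.qschou.com/v3.0.0/project/index/text3/' + projuuid
res = requests.get(url)
print(res.status_code)   # HTTP status of the response
print(res.text)          # the project data returned by the interface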
The key to solving the problem (whisper: "and winning the young lady's favor") is finding a way to collect project UUIDs.
2, Crawl project UUIDs
I couldn't find any project UUIDs on the official website, so we have to take another route. In the end I decided to start from Baidu Tieba: search the relevant bar for posts and extract the UUIDs from the project links inside them.
1. Grab the post URLs
Traverse the whole "Qingsongchou" bar and grab the URL of every post:
url_tieba = 'https://tieba.baidu.com/f?kw=轻松筹&ie=utf-8&pn=%d'   # bar list page, pn = 0, 50, 100, ...
p_href = 'href="/p/(.*?)"'
hrefs = []
for i in range(1):   # number of pages to crawl
    try:
        pn = i*50
        res = requests.get(url_tieba%pn)
        html = res.text
        list_href = re.findall(p_href,html)
        for h in list_href:
            hrefs.append('https://tieba.baidu.com/p/'+h)
        print('Page %d fetched successfully'%(i+1))
    except:
        pass
with open('tiezi_url.txt','w') as f:
    for h in hrefs:
        f.write(h+'\n')
2. Extract the UUIDs from the posts
There are two main ways to get a UUID out of a post:
(1) Extract from links in posts
First define the regular expression that matches the link:
p_projuuid = 'https://m2.qschou.com.*?projuuid=(.*?)&'
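As a quick sanity check, here is what the pattern captures from a typical share link (the link below is made up for illustration; real links may carry different paths and query parameters):

import re

p_projuuid = 'https://m2.qschou.com.*?projuuid=(.*?)&'
# hypothetical share link, only for demonstrating the regex
link = 'https://m2.qschou.com/project/index.html?projuuid=abcdef12-3456-7890-abcd-ef1234567890&from=share'
print(re.findall(p_projuuid,link))
# ['abcdef12-3456-7890-abcd-ef1234567890'] -- 36 characters with hyphens, which is what the len(p) == 36 check below relies on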
We then run this pattern over each post's HTML. Because the same loop also falls back to QR codes when a post has no direct link, the full extraction loop is shown under method (2) below.
(2) Extract from the QR code in the post
First, we have to match all the pictures in the post, as follows:
p_img = 'class="BDE_Image" src="(.*?)"'
Then read the QR code in each picture; if it contains a project link, extract the UUID:
for h in hrefs:
    try:
        res = requests.get(h)
        html = res.text
        list_projuuid = re.findall(p_projuuid,html)   # check whether the post contains a direct project link
        if len(list_projuuid) != 0:
            for p in list_projuuid:
                if p not in projuuids and len(p) == 36:
                    projuuids.append(p)
                    print('%s extracted projuuid: %s'%(h,p))
        else:
            # otherwise collect the pictures in the post and try their QR codes
            list_img = re.findall(p_img,html)
            if len(list_img) == 0:
                print('%s: no projuuid extracted'%h)
            else:
                for i in list_img:
                    txt_list = get_ewm(i)
                    if len(txt_list) != 0:
                        barcodeData = txt_list[0].data.decode("utf-8")
                        for p in re.findall('projuuid=(.*)',barcodeData):
                            if p not in projuuids and len(p) == 36:
                                projuuids.append(p)
                                print('%s extracted projuuid: %s'%(h,p))
    except:
        pass
The function that reads the QR code is defined as follows:
def get_ewm(img_adds):
    # Read the content of a QR code; img_adds can be a web URL or a local file path
    if os.path.isfile(img_adds):
        # load the QR code picture from the local disk
        img = Image.open(img_adds)
    else:
        # download the QR code picture from the network and load it
        rq_img = requests.get(img_adds).content
        img = Image.open(BytesIO(rq_img))
    txt_list = pyzbar.decode(img)
    #barcodeData = txt_list[0].data.decode("utf-8")
    return txt_list
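For example, reading the decoded text back out of the result looks like this (the file name is just a placeholder):

results = get_ewm('poster.png')               # local file or image URL
if results:
    print(results[0].data.decode('utf-8'))    # text encoded in the first QR code found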
Note: the pyzbar library is needed here; you can install it with pip:
pip install pyzbar
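(One caveat: pyzbar is only a Python wrapper, so the underlying zbar shared library must also be present on the system, e.g. something like apt-get install libzbar0 on Debian/Ubuntu; the Windows wheels already bundle the DLL.)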
3. Complete code
import re
import os
import requests
from PIL import Image
from io import BytesIO
from pyzbar import pyzbar
def get_ewm(img_adds):
    # Read the content of a QR code; img_adds can be a web URL or a local file path
    if os.path.isfile(img_adds):
        # load the QR code picture from the local disk
        img = Image.open(img_adds)
    else:
        # download the QR code picture from the network and load it
        rq_img = requests.get(img_adds).content
        img = Image.open(BytesIO(rq_img))
    txt_list = pyzbar.decode(img)
    #barcodeData = txt_list[0].data.decode("utf-8")
    return txt_list
if __name__ == '__main__':
    # Grab the URL of every post
    url_tieba = 'https://tieba.baidu.com/f?kw=轻松筹&ie=utf-8&pn=%d'   # bar list page, pn = 0, 50, 100, ...
    p_href = 'href="/p/(.*?)"'
    hrefs = []
    for i in range(1):   # number of pages to crawl
        try:
            pn = i*50
            res = requests.get(url_tieba%pn)
            html = res.text
            list_href = re.findall(p_href,html)
            for h in list_href:
                hrefs.append('https://tieba.baidu.com/p/'+h)
            print('Page %d fetched successfully'%(i+1))
        except:
            pass
    with open('tiezi_url.txt','w') as f:
        for h in hrefs:
            f.write(h+'\n')

    # Crawl the project links / QR codes in each post
    projuuids = []
    p_projuuid = 'https://m2.qschou.com.*?projuuid=(.*?)&'
    p_img = 'class="BDE_Image" src="(.*?)"'
    for h in hrefs:
        try:
            res = requests.get(h)
            html = res.text
            list_projuuid = re.findall(p_projuuid,html)   # check whether the post contains a direct project link
            if len(list_projuuid) != 0:
                for p in list_projuuid:
                    if p not in projuuids and len(p) == 36:
                        projuuids.append(p)
                        print('%s extracted projuuid: %s'%(h,p))
            else:
                # otherwise collect the pictures in the post and try their QR codes
                list_img = re.findall(p_img,html)
                if len(list_img) == 0:
                    print('%s: no projuuid extracted'%h)
                else:
                    for i in list_img:
                        txt_list = get_ewm(i)
                        if len(txt_list) != 0:
                            barcodeData = txt_list[0].data.decode("utf-8")
                            for p in re.findall('projuuid=(.*)',barcodeData):
                                if p not in projuuids and len(p) == 36:
                                    projuuids.append(p)
                                    print('%s extracted projuuid: %s'%(h,p))
        except:
            pass
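With the UUID list in hand, the last step back at the original goal is to feed each one to the interface from section 1 and save whatever comes back for Xiaoli's analysis. A minimal sketch, continuing from the projuuids list above (I am assuming the interface simply returns a text/JSON body; the output file name is arbitrary):

import json
import requests

url_api = 'https://gateway.qschou.com/v3.0.0/project/index/text3/%s'
projects = {}
for p in projuuids:
    try:
        res = requests.get(url_api%p)
        projects[p] = res.text            # raw response body; parse further once the fields are known
    except:
        pass
with open('projects.json','w') as f:
    json.dump(projects, f, ensure_ascii=False)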