Python crawler: crawling Meituan food data (this one article is enough)

The text and images in this article come from the Internet and are for learning and exchange only, not for any commercial use. The copyright belongs to the original author. If you have any questions, please contact us promptly so they can be handled.

The following article comes from Tencent Cloud. Author: Python Advanced.


1. Analyze the URL parameter structure of the Meituan food page

1) Search conditions

Meituan food; city: Beijing; search keyword: hot pot (火锅)

2) Crawl URL

https://bj.meituan.com/s/%E7%81%AB%E9%94%85/

3) Explanation

The URL automatically percent-encodes Chinese characters, so the keyword "hot pot" (火锅) appears as the string %E7%81%AB%E9%94%85 that we otherwise would not recognize.

Looking at how the URL is built from city and keyword: in the current URL, bj stands for Beijing, and /s/ is followed by the search keyword.

In this way, we can understand the structure of the current url.
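This encoding can be reproduced with Python's standard library (a minimal sketch; 火锅 is the keyword from the search above):

from urllib.parse import quote, unquote

# Percent-encode the Chinese keyword the same way the browser does
print(quote('火锅'))                  # %E7%81%AB%E9%94%85
# Decode it back to confirm the mapping
print(unquote('%E7%81%AB%E9%94%85'))  # 火锅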

2. Analyze the data source of the page (F12 developer tool)

Open the F12 developer tools and refresh the current page: you can see that when switching to the second page the URL does not change and the site does not reload or jump. (Ajax is the web technique of loading data without refreshing the page or changing the URL.)

At this point we need to find, under XHR in the developer tools, the response that carries the current data.

From this analysis we can tell that the data is exchanged in JSON format. Compare the request address of the JSON response on page 2 with the one on page 3.

Page 2: https://apimobile.meituan.com/group/v4/poi/pcsearch/1?uuid=xxx&userid=-1&limit=32&offset=32&cateId=-1&q=%E7%81%AB%E9%94%85

Page 3: https://apimobile.meituan.com/group/v4/poi/pcsearch/1?uuid=xxx&userid=-1&limit=32&offset=64&cateId=-1&q=%E7%81%AB%E9%94%85

Comparing the two shows that the offset parameter increases by 32 each time the page is turned: limit is the number of records requested per page, offset is the index of the first record requested, and q is the search keyword. In poi/pcsearch/1, the 1 is the city id of Beijing.
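In other words, offset = (page - 1) * limit. A minimal sketch of how the request address for any page could be built (uuid=xxx stays a placeholder, exactly as in the captured addresses; the helper name page_url is just for illustration):

from urllib.parse import quote

def page_url(page, keyword='火锅', city_id=1, limit=32):
    # offset grows by `limit` each time the page is turned: 0, 32, 64, ...
    offset = (page - 1) * limit
    return ('https://apimobile.meituan.com/group/v4/poi/pcsearch/%d'
            '?uuid=xxx&userid=-1&limit=%d&offset=%d&cateId=-1&q=%s'
            % (city_id, limit, offset, quote(keyword)))

print(page_url(2))  # offset=32, matching the page 2 address above
print(page_url(3))  # offset=64, matching the page 3 address above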

3. Construct requests to crawl Meituan food data

Next, construct the request directly and loop over the data of every page. The final code is as follows.

import requests
import re
import json


def start():
    # Offset goes up in steps of 32, one step per page. To avoid requesting far more
    # pages than there is data, the upper limit is capped at 1600, i.e. 50 pages.
    # try-except catches exceptions such as timeouts and skips that page,
    # which is generally treated as "no data".
    for w in range(0, 1600, 32):
        try:
            # Note: replace the xxx after uuid= with your own uuid parameter
            url = 'https://apimobile.meituan.com/group/v4/poi/pcsearch/1?uuid=xxx&userid=-1&limit=32&offset='+str(w)+'&cateId=-1&q=%E7%81%AB%E9%94%85'
            # The headers below can be copied from Request Headers in the F12 developer tools.
            # If you request frequently, it is advisable to also add your cookie to headers.
            headers = {
                'Accept': '*/*',
                'Accept-Encoding': 'gzip, deflate, br',
                'Accept-Language': 'zh-CN,zh;q=0.9',
                'Connection': 'keep-alive',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400',
                'Host': 'apimobile.meituan.com',
                'Origin': 'https://bj.meituan.com',
                'Referer': 'https://bj.meituan.com/s/%E7%81%AB%E9%94%85/'
            }
            response = requests.get(url, headers=headers)
            # Regular expressions pull the fields straight out of the response text;
            # walking the parsed JSON down to each shop's title is more awkward here, so regex is used instead.
            titles = re.findall('","title":"(.*?)","address":"', response.text)
            addresses = re.findall(',"address":"(.*?)",', response.text)
            avgprices = re.findall(',"avgprice":(.*?),', response.text)
            avgscores = re.findall(',"avgscore":(.*?),', response.text)
            comments = re.findall(',"comments":(.*?),', response.text)
            # Print the lengths to check that the current page returned 32 records
            print(len(titles), len(addresses), len(avgprices), len(avgscores), len(comments))
            # Loop through each record and write it to the file
            for o in range(len(titles)):
                title = titles[o]
                address = addresses[o]
                avgprice = avgprices[o]
                avgscore = avgscores[o]
                comment = comments[o]
                # Write to the local file
                file_data(title, address, avgprice, avgscore, comment)
        except Exception:
            # Skip pages that fail; this usually just means there is no data at that offset
            continue


# File writing method
def file_data(title, address, avgprice, avgscore, comment):
    data = {
        'Shop name': title,
        'Shop address': address,
        'Average consumer price': avgprice,
        'Store score': avgscore,
        'Number of reviewers': comment
    }
    with open('Meituan cuisine.txt', 'a', encoding='utf-8') as fb:
        # ensure_ascii=False is required; otherwise json.dumps escapes the Chinese text into \uXXXX codes
        fb.write(json.dumps(data, ensure_ascii=False) + '\n')


if __name__ == '__main__':
    start()
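As a side note, this is what ensure_ascii=False changes (a standalone illustration; the shop name is made up):

import json

record = {'Shop name': '某某火锅店'}
print(json.dumps(record))                      # {"Shop name": "\u67d0\u67d0\u706b\u9505\u5e97"}
print(json.dumps(record, ensure_ascii=False))  # {"Shop name": "某某火锅店"}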

 

The operation results are as follows:

Local files:

4. Summary

By changing the search keyword and the city, the corresponding parameters in the URL can be adjusted; at the same time, remember to change the matching parameters in headers. The method is simple, and with more practice you will quickly get familiar with scraping Ajax-loaded data.
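For example, switching keyword or city would touch roughly these spots (a sketch; the only city pairing confirmed above is Beijing, subdomain bj with city id 1, so other cities' subdomains and ids are assumptions you would look up the same way in the developer tools):

from urllib.parse import quote

keyword = '火锅'   # search keyword; change as needed
city_sub = 'bj'    # city subdomain in meituan.com URLs, bj = Beijing
city_id = 1        # city id used by the pcsearch API, 1 = Beijing

# The request address carries the city id and the percent-encoded keyword
url = ('https://apimobile.meituan.com/group/v4/poi/pcsearch/%d'
       '?uuid=xxx&userid=-1&limit=32&offset=0&cateId=-1&q=%s' % (city_id, quote(keyword)))

# Host stays apimobile.meituan.com; Origin and Referer follow the city subdomain and keyword
headers_update = {
    'Origin': 'https://%s.meituan.com' % city_sub,
    'Referer': 'https://%s.meituan.com/s/%s/' % (city_sub, quote(keyword)),
}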
