POST crawling on Lagou -- a first glimpse of anti-crawling

Summary of common anti-crawler strategies

Check the request header

User-Agent identification

Solution: build a list of user agents and inject one of them into the request headers at random each time

import random

agent = ['Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
        'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
        'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
        'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)']
agents = random.sample(agent, 1)[0]
# Note: random.sample() returns a list with a single element, so we take [0]
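To actually use it, inject the randomly chosen user agent into the headers of each request. A minimal sketch (the target URL is just a placeholder):

import random
import requests

headers = {'User-Agent': random.choice(agent)}  # agent is the list built above
response = requests.get('https://www.lagou.com', headers=headers, timeout=5)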

Referer identification

In the browser's developer tools, open the Network tab, find the target request you need to send, and look at its request headers: the Referer field there indicates the source of the request.

[Referer: this field identifies the page from which the request was sent. The server can use this information for things such as source statistics and hotlink protection. (Cui Qingcai's personal website explains request methods, request headers, and the request process in detail, including the meaning of the various request-header fields: https://cuiqingcai.com/5465.html)]

When crawling Lagou, if the Referer is not the search results page when the request is sent, the crawl basically fails.
Note: the position searched for here is "python数据分析" ("python data analysis"); "数据分析" is written in Chinese characters, so quote() and urlencode() from the urllib.parse module are needed to construct the Referer. URL request links cannot contain non-ASCII characters (they are considered unsafe), so such characters must be percent-encoded.
An example follows.

from urllib.parse import urlencode
from urllib.parse import quote

url_search = "https://www.lagou.com/jobs/list_" + quote('python数据分析') + "?"
para = {'xl': 'undergraduate', 'px': 'default', 'yx': '2k-5k',
        'gx': 'internship', 'city': 'Beijing', 'district': 'Chaoyang District', 'isSchoolJob': '1'}
url_search = url_search + urlencode(para)

The usage of, and difference between, quote and urlencode in Python are explained at:
https://blog.csdn.net/zjz155/article/details/88060427

urllib.parse.urlencode()
Parameter: a dict
Return value: a string
Function: turns each key: value pair into the form key=encoded_value, joined with &

urllib.parse.quote()
Parameter: a str (for example, one containing Chinese characters)
Return value: the percent-encoded string
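A quick illustration of the difference (the expected output is shown in the comments):

from urllib.parse import quote, urlencode

print(quote('python数据分析'))
# python%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90

print(urlencode({'city': 'Beijing', 'px': 'default'}))
# city=Beijing&px=default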

Cookie identification

Take Lagou as an example. The site inspects the cookies sent with each request. Observing the XHR request in developer tools (XHR stands for XMLHttpRequest; its cookies are listed next to the Referer in the request headers), we find that the request's cookies are those of the search results page. Therefore, when imitating the browser's behavior, you first need to obtain all the cookies of the search results page and add them to the request.
There are usually two ways to do this:
① Copy all the Cookies from the target request's header and paste them into the code. The advantage is convenience; the disadvantage is that if the cookies change, the code is hard to update.
② Use the session method, as in the snippet below:
s = requests.Session()
s.get(url_search, headers=headers, timeout=5)
# The timeout setting is necessary; otherwise, when the server does not respond
# for a while, we stay stuck here. It can also be written as timeout=(3, 7),
# meaning 3s to establish the connection and 7s to read the response.
cookie = s.cookies
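The cookies collected by the session are then passed along with the follow-up POST request; a sketch (url_final, headers, and payload are defined as in the full code below):

response = requests.post(url_final, data=payload, headers=headers,
                         cookies=s.cookies, timeout=5)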

IP identification

The principle is that the server records the IP address and device identifier of devices that send requests too frequently (though I still don't know where this device identifier lives or what it looks like) and blocks those IPs.
The usual solution is to build your own IP proxy pool, which I haven't learned yet.
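For reference, requests can route traffic through a proxy via its proxies parameter. A minimal sketch, assuming you already have working proxy addresses (the address below is a placeholder, not a real proxy):

import requests

# Placeholder proxy address: replace with entries drawn from a real proxy pool
proxies = {
    'http': 'http://127.0.0.1:8888',
    'https': 'http://127.0.0.1:8888',
}
response = requests.get('https://www.lagou.com', proxies=proxies, timeout=5)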

Overall code and process

# -*- coding: utf-8 -*-
"""
Created on Thu Oct 15 15:17:49 2020

method: POST
 Type: XHR
@author: djx
"""
import random

from urllib.parse import urlencode
from urllib.parse import quote
import requests
import pymongo


def getpage(url_final: str, page: int):
    url_search = "https://www.lagou.com/jobs/list_" + quote('python数据分析') + "?"
    para = {'xl': 'undergraduate', 'px': 'default', 'yx': '2k-5k', 'gx': 'internship',
            'city': 'Beijing', 'district': 'Chaoyang District', 'isSchoolJob': '1'}
    url_search = url_search + urlencode(para)
    agent = ['Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
        'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
        'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
        'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)']
    agents = random.sample(agent, 1)[0]
    headers = {'Accept': 'application/json, text/javascript, */*; q=0.01',
               'Host': 'www.lagou.com',
               'User-Agent': agents,
               'Referer': url_search,  # the Referer must be the search results page
               'Accept-Language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7'}
    # This first GET only serves to obtain the cookies of the search results page;
    # they could also be set by hand via requests.cookies.RequestsCookieJar()
    # and s.cookies.update(...).
    s = requests.Session()
    s.get(url_search, headers=headers, timeout=5)
    results = []
    j = 1
    while j <= page:
        print(j)
        # rebuild the payload each pass so 'pn' (the page number) advances
        payload = {'first': 'true', 'pn': j, 'kd': 'python数据分析'}
        try:
            response = requests.post(url_final, data=payload, headers=headers,
                                     cookies=s.cookies, timeout=5)
            if response.status_code == 200:
                results.append(response.json())  # response.json() parses the JSON body
                j += 1
            else:
                print("unexpected status code: " + str(response.status_code))
                return None
        except Exception as e:
            print("error" + str(e))
            return None
    return results
        

def main():
    url_final = "https://www.lagou.com/jobs/positionAjax.json?"
    para = {'xl': 'undergraduate', 'px': 'default', 'yx': '2k-5k', 'gx': 'internship',
            'city': 'Beijing', 'district': 'Chaoyang District',
            'needAddtionalResult': 'false', 'isSchoolJob': '1'}
    url_final = url_final + urlencode(para)  # the final url is that of the JSON (XHR) request
    pages = getpage(url_final, 5)
    if pages is None:
        return
    client = pymongo.MongoClient(host='localhost', port=27017)
    db = client['JobInfo']
    collection = db.SimpleInfo

    for content in pages:
        result = content["content"]["positionResult"]["result"]
        for i in result:
            collection.insert_one(i)


if __name__ == '__main__':
    main()
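To check that the documents actually landed in MongoDB, a quick sketch using pymongo (database and collection names as above):

import pymongo

client = pymongo.MongoClient(host='localhost', port=27017)
print(client['JobInfo'].SimpleInfo.count_documents({}))  # number of stored positions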

Tags: Python

Posted by derekbelcher on Tue, 10 May 2022 09:56:04 +0300