One-click capture of recruitment listings from Lagou and BOSS Zhipin (routine operation, without the help of Selenium)

A few words first: for big commercial sites like these, it goes without saying that their data gets scraped all the time, and their anti-crawling mechanisms are correspondingly solid. The common anti-crawling measures you run into here come down to cookies and the Referer field; content loaded dynamically behind cookies is especially nasty, and trying to crack it without Selenium is, well... (BOSS Zhipin also pins your IP, and more than three or so requests will get you blocked...). As for proxy IPs, most of the free ones either don't work at all or die very quickly (high-anonymity proxies, I mean). Once this stretch is over (the author is in the middle of CET-6 review and postgraduate entrance exam prep), I will make time to play with Selenium scraping properly. For now, clumsy as the method below is, it's better than nothing. For the freeloaders reaching out for ready-made code, there are a few surprises in it~~
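
Before getting to the actual code, here is a minimal sketch of the proxy side, assuming you actually have a high-anonymity proxy to point at; the address below is a placeholder, not a working endpoint, and the fallback simply fails fast when the proxy is already dead (which, with free ones, is most of the time):

import requests

# Hypothetical high-anonymity proxy address; replace it with one you actually control or rent.
PROXY_ADDR = "http://127.0.0.1:8888"
proxies = {"http": PROXY_ADDR, "https": PROXY_ADDR}

def fetch(url, headers=None):
    """Try the proxy first; if it is already dead, fail fast and fall back to a direct request."""
    try:
        return requests.get(url, headers=headers, proxies=proxies, timeout=5)
    except requests.exceptions.RequestException:
        # Proxy unusable (typical for free ones), so go out from your own IP instead
        return requests.get(url, headers=headers, timeout=5)

print(fetch("https://httpbin.org/ip").text)   # shows which IP the target site sees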

First, let's take a look at Lagou:
'''
The website is https://www.lagou.com/ , and the content to crawl is recruitment information for data mining engineers:
the position, the full company name, city, monthly salary, education, work experience and position advantages.
Crawl at least 50 companies and store the results in an Excel file with the .xlsx suffix.
'''

import requests
from lxml import etree
from multiprocessing.dummy import Pool
import time
import json
import csv
class LG():
    def __init__(self):
        self.s=requests.Session()
        self.headers={
            # For the sake of this baby's IP, the real headers are left out;
            # fill in your own User-Agent / Referer / Cookie here
            }
        self.lst=[]
        self.info=[["positionName","companyFullName","city","salary","education","workYear","positionAdvantage"]]
        self.pool = Pool(5)
        
    #Get cookie information
    def Get_Cookie(self):
        url = "https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E6%8C%96%E6%8E%98%E5%B7%A5%E7%A8%8B%E5%B8%88?labelWords=&fromSearch=true&suginput="
        res=self.s.get(url=url,headers=self.headers)
        #print(res.text)
        #return res.cookies
        self.Limit_page()
    
    #Get detailed page information
    def Get_info(self,pages):
        url="https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false"
        for page in pages:
            data={
                "first": "true"
                ,"pn": str(page)
                ,"kd": "Data mining engineer"
            }
            headers={
                # Same as above: fill in your own User-Agent / Referer / Cookie here
            }
            res=self.s.post(url=url,headers=headers,data=data)
            res=json.loads(res.text)
            print("The first{}Page data is nearly loaded!!!".format(page))
            self.lst.append(res['content']['positionResult']['result'])
            time.sleep(1.5)
        
    #Limit the number of pages to fetch
    def Limit_page(self):
        pages=[page for page in range(1,5)]
        self.Get_info(pages)
        
    #Fields captured: position, full company name, city, monthly salary, education, work experience and position advantages
    def Get_postion(self):
        self.Get_Cookie()
        for page in self.lst:
            for job in page:
                dic={
                    "positionName":job['positionName'],
                    "companyFullName":job['companyFullName'],
                    "city":job['city'],
                    "salary":job['salary'],
                    "education":job['education'],
                    "workYear":job['workYear'],
                    "positionAdvantage":job['positionAdvantage'],
                }
                self.info.append([dic["positionName"],dic["companyFullName"],dic["city"],dic["salary"],dic["education"],dic["workYear"],dic["positionAdvantage"]])
        self.Save_info()
        
    #Save the captured information to CSV
    def Save_info(self):
        f=open('./Lg.csv','a',encoding='utf-8',newline="")
        writer=csv.writer(f)
        writer.writerows(self.info)
        f.close()
        
if __name__ == '__main__':
    lg=LG()
    lg.Get_postion()

  • The anti-crawling on commercial sites like BOSS Zhipin has always been nasty: its cookies are short-lived and the number of visits is limited.
  • What I give here is a dumb-looking but quite effective method, though to be honest it only really pays off when there are few pages (that is, when fetching the page we save it to a local file first and parse that copy; those who get it, get it, and if you don't, there's nothing I can do...).
  • To run this method in batches you can set up a cookie pool, but then you have to buy proxy IPs, which I'm too broke for... (free proxy IPs are just too "great"!!!). A rough sketch of the cookie-pool idea follows right after this list.
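
For anyone who does want to try the cookie-pool route, here is a rough sketch built on the same Session trick the class above uses: warm up a few sessions on the Lagou list page so each carries fresh cookies, then rotate through them for the Ajax requests. The pool size, headers and keyword below are only illustrative; with a single IP this merely rotates cookies, so it mostly pays off once every session also goes out through its own proxy.

import itertools
import requests

LIST_URL = "https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E6%8C%96%E6%8E%98%E5%B7%A5%E7%A8%8B%E5%B8%88?labelWords=&fromSearch=true&suginput="
AJAX_URL = "https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false"
HEADERS = {}   # fill in your own User-Agent / Referer, as in the class above

def new_session():
    """Create a session and warm it up on the list page so it picks up fresh cookies."""
    s = requests.Session()
    s.get(LIST_URL, headers=HEADERS, timeout=10)
    return s

# A tiny "cookie pool": a handful of warmed-up sessions used round-robin.
pool = itertools.cycle([new_session() for _ in range(3)])

def fetch_page(page):
    """Fetch one page of results through the next session in the pool."""
    s = next(pool)
    data = {"first": "true", "pn": str(page), "kd": "Data mining engineer"}
    return s.post(AJAX_URL, headers=HEADERS, data=data, timeout=10).json()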

Now let's come back and take a look at the troublesome BOSS Zhipin:
'''
The website is https://www.zhipin.com/ , and the content to crawl is recruitment information for data analysts:
the position, the full company name, city, monthly salary, education, work experience and job description, for at least 30 companies.
Store the results in a CSV file with the .csv suffix.
'''
import requests
import time
import json
import csv
import xlwt
from lxml import etree
class Boss:
    def __init__(self):
        self.url="https://www.zhipin.com/job_detail/?query=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90%E5%B8%88&city=100010000&industry=&position="
        self.headers={
            # For the sake of this baby's IP, the real headers are left out;
            # fill in your own User-Agent / Cookie here
        }
        self.info=[["Position","Full_Company","City","Salary","Eduction","WorkYear","PositionDescribe"]]
    
    #Fetch the HTML of the current page and save it locally; too many direct requests will get you blocked, so the parsing below works off the local copy
    def get_html(self):
        res=requests.get(url=self.url,headers=self.headers)
        with open("./boss1.html","w",encoding="utf-8") as f:
            f.write(res.text)
    
    #Read the local HTML and extract the required fields
    def get_info(self):
        f=open("./boss1.html","r",encoding="utf-8")
        html=f.read()
        html=etree.HTML(html)
        Position=html.xpath('//ul/li/div[@class="job-primary"]/div[1]/div[1]/div/div[1]/span[1]/a/@title')
        Full_Company=html.xpath('//ul/li/div/div[1]/div[2]/div/h3/a/@title')
        City=html.xpath('//ul/li/div/div[1]/div[1]/div/div[1]/span[2]/span/text()')
        Salary=html.xpath('//ul/li/div/div[1]/div[1]/div/div[2]/span/text()')
        PositionDescribe=html.xpath('//ul/li/div/div[2]/div[2]/text()')
        WorkYear_Eduction=html.xpath('//ul/li/div/div[1]/div[1]/div/div[2]/p')
        WorkYear=[]
        Eduction=[]
        for i in WorkYear_Eduction:
            WorkYear.append(i.xpath('./text()')[0])
            Eduction.append(i.xpath('./text()')[1])
        for j in range(len(Position)):
            self.info.append([Position[j],Full_Company[j],City[j],Salary[j],Eduction[j],WorkYear[j],PositionDescribe[j]])
        #print(Position,Full_Company,City,Salary,Eduction,WorkYear,PositionDescribe)
        
    def save_excel_info(self):
        book=xlwt.Workbook()
        sheet = book.add_sheet(sheetname="Boss_01")
        for i in range(len(self.info)):
            for j in range(len(self.info[i])):
                sheet.write(i,j,self.info[i][j])
        book.save("./Boss_01.xlsx")
        
    def save_csv_info(self):
        f=open('./Boss_01.csv','a',encoding='utf-8',newline="")
        writer=csv.writer(f)
        writer.writerows(self.info)
        f.close()
        
if __name__ == '__main__':
    boss=Boss()
    boss.get_html()
    boss.get_info() 
    boss.save_excel_info()
    boss.save_csv_info()
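
One caveat on save_excel_info: xlwt actually writes the legacy .xls format, so saving it under an .xlsx name gives a file that some readers will reject. A minimal alternative sketch, assuming openpyxl is installed (it is not used anywhere else in this post), that writes a genuine .xlsx workbook:

from openpyxl import Workbook

def save_xlsx(rows, path="./Boss_01.xlsx"):
    """Write the collected rows (header row included) to a real .xlsx file."""
    wb = Workbook()
    ws = wb.active
    ws.title = "Boss_01"
    for row in rows:
        ws.append(list(row))   # one listing per row
    wb.save(path)

# Usage with the class above:
# boss = Boss(); boss.get_html(); boss.get_info(); save_xlsx(boss.info)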

Tags: JSON crawler Ajax xpath csv
