Introduction to get and post methods in the python crawler and the role of cookies

First, determine the form submission method of the target website you want to crawl, which can be seen through the browser's developer tools. Chrome is recommended here.

Here, I use 163 Mail (mail.163.com) as an example.

After opening the tool, select the request you want to inspect under the Name column of the Network panel; the Request Method shown in the Headers pane on the right is the submission method. A status of 200 means the request succeeded. In the header information below, the cookie is the stored session information generated after you log in. The first time you visit the website you need to provide a user name and password; after that, you can log in just by providing the cookie in the headers.

The requests library provides get and post methods.

import requests
import ssl

user_agent="Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0"
accept='text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
accept_language='zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3'

upgrade='1'

headers = {
    'User-Agent': user_agent,
    'Accept': accept,
    'Accept-Language': accept_language,
    'Cookie': '....'  # Fill in the cookie generated after you log in here
}

r = requests.get("http://mail.163.com/js6/main.jsp?sid=OAwUtGgglzEJoANLHPggrsKKAhsyheAT&df=mail163_letter#module=welcome.WelcomeModule%7C%7B%7D",
                 headers=headers, verify=False)
fp = open("/temp/csdn.txt", "w", encoding='utf-8')
fp.write(str(r.content, 'utf-8'))
fp.close()



I imported the ssl library here because the certificate of the web page I was visiting had expired.

If a crawler visits such a website, it raises an error: SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:581)

The get and post methods of requests take a verify parameter; setting it to False disables certificate verification.
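A minimal sketch, not from the original post: with verify=False, requests still emits an InsecureRequestWarning through urllib3 on every call, which can be silenced explicitly. The URL below is only a placeholder for a site with a bad certificate.

import requests
import urllib3

# Optional: silence the InsecureRequestWarning that verify=False triggers
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# The URL is a placeholder; verify=False skips certificate verification
r = requests.get("https://expired.example.com/", verify=False)
print(r.status_code)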

 

 

Python crawler: how the urllib module initiates a POST request

POST request initiated by the urllib module

Case: crawling the translation results of Baidu Translate

1. Find the url of the POST request with the browser's packet capture tool

 

To obtain the url corresponding to the ajax request on the page, you need to use the browser's packet capture tool. Look for the url of the Ajax request that Baidu Translate sends for a word.

Click the clear button to clear the requests already captured by the packet capture tool.

Then type in the word to translate and send the ajax request. The red box marks the ajax request that was sent.

 

The All tab of the packet capture tool shows all captured requests, including GET, POST and ajax-based POST requests.


The XHR tab shows only the captured ajax-based POST requests.

Which one is the ajax-based POST request we want? It is the POST request whose parameters carry the word to be translated ("Apple").

Take another look at the request URL corresponding to the POST request. This URL is the URL we want to request

Before initiating a POST request, the parameters carried in the POST request should be processed

3-step process:

1. Put the parameters carried by the POST request into a dictionary

2. Encode them with urlencode from the parse module (its return value is a string)

3. Convert the result of step 2 into bytes

import urllib.request
import urllib.parse
# 1. Specify the url
url = 'https://fanyi.baidu.com/sug'

# Before launching the POST request, process the parameters it carries:

# 1. Put the parameters of the POST request into a dictionary
data = {
    # all the parameters carried by the POST request go into the dictionary
    'kw': 'Apple',
}

# 2. Encode them with urlencode from the parse module (the return value is a string)
data = urllib.parse.urlencode(data)
# 3. Convert the result of step 2 into bytes
data = data.encode()

# 2. Launch the POST request: the data parameter of urlopen carries
#    the processed POST parameters
response = urllib.request.urlopen(url=url, data=data)
data = response.read()
print(data)

Paste the translation result into an online JSON validator and formatter (such as Be JSON).

Click "Format and validate" to convert the unicode escapes into Chinese.
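As an alternative to the online tool, here is a minimal sketch (reusing the request above) that decodes the result directly with the json module; json.loads turns the \uXXXX escapes back into readable Chinese:

import urllib.request
import urllib.parse
import json

url = 'https://fanyi.baidu.com/sug'
data = urllib.parse.urlencode({'kw': 'Apple'}).encode()
response = urllib.request.urlopen(url=url, data=data)

# json.loads converts the unicode escapes in the response into Chinese text
result = json.loads(response.read())
print(json.dumps(result, ensure_ascii=False, indent=2))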

 

import re,json,requests,os
from hashlib import md5
from urllib.parse import urlencode
from requests.exceptions import RequestException
from bs4 import BeautifulSoup
from multiprocessing import Pool

#Request the index page
def get_page_index(offset, keyword):
    #Parameters to send
    data = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': '20',
        'cur_tab': 1
    }
    #Build a url the server recognizes by encoding the parameters automatically
    url = "https://www.toutiao.com/search_content/?" + urlencode(data)
    #Exception handling
    try:
        #Get the returned page
        response = requests.get(url)
        #Check whether the page was fetched normally
        if response.status_code == 200:
            #Return the decoded page
            return response.text
        #Abnormal fetch, return None
        return None
    except RequestException:
        #Prompt information
        print("Error requesting index page")
        return None

#Parse the data of the requested index page
def parse_page_index(html):
    #json load conversion
    data = json.loads(html)
    #The data is truthy and the 'data' key exists in it
    if data and 'data' in data.keys():
        #Traverse and yield the urls of the galleries
        for item in data.get('data'):
            yield item.get('article_url')

#Request the gallery detail page
def get_page_detail(url):
    #Set a UA to simulate normal browser access
    head = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
    #Exception handling
    try:
        response = requests.get(url, headers=head)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print("Error in request details page")
        return None

#Parse the data of the gallery detail page
def parse_page_detail(html, url):
    #Exception handling
    try:
        #Format conversion and gallery title extraction
        soup = BeautifulSoup(html, 'lxml')
        title = soup.select('title')[0].get_text()
        print(title)
        #Regex to find the gallery links
        image_pattern = re.compile('gallery: (.*?),\n', re.S)
        result = re.search(image_pattern, html)
        if result:
            #Clean up the data
            result = result.group(1)
            result = result[12:]
            result = result[:-2]
            #Strip escape backslashes
            result = re.sub(r'\\', '', result)
            #json load
            data = json.loads(result)
            #Make sure the data is not empty and contains sub_images
            if data and 'sub_images' in data.keys():
                #Extract the sub_images data
                sub_images = data.get('sub_images')
                #Extract the list of image urls
                images = [item.get('url') for item in sub_images]
                #Download the pictures
                for image in images:
                    download_images(image)
                #Return a dictionary
                return {
                    'title': title,
                    'url': url,
                    'images': images
                }
    except Exception:
        pass

#Request the picture url
def download_images(url):
    #Prompt information
    print('Downloading', url)
    #Browser simulation
    head = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
    #Exception handling
    try:
        response = requests.get(url, headers=head)
        if response.status_code == 200:
            #Save the picture
            save_image(response.content)
        return None
    except RequestException:
        print("Error requesting picture")
        return None

#Save the picture
def save_image(content):
    #Create the folder if it does not exist
    if 'Street shot' not in os.listdir():
        os.makedirs('Street shot')
    #Set the folder the files are written to
    os.chdir(r'E:\python Write web crawler\CSDN Reptile learning\Street shot')
    #Path, name, suffix
    file_path = '{0}/{1}.{2}'.format(os.getcwd(), md5(content).hexdigest(), 'jpg')
    #Save the picture
    with open(file_path, 'wb') as f:
        f.write(content)

#Main function
def main(offset):
    #Fetch the index page
    html = get_page_index(offset, 'Street shot')
    #Gallery urls
    for url in parse_page_index(html):
        if url != None:
            #Gallery detail page
            html = get_page_detail(url)
            #Gallery content
            result = parse_page_detail(html, url)

if __name__ == '__main__':
    #Create the list of pages to visit (pages 0-9)
    group = [i * 10 for i in range(10)]
    #Create a multiprocessing pool
    pool = Pool()
    #Start the pool and pass in the data
    pool.map(main, group)

 

Detailed explanation of implementing get requests in a python crawler based on the requests module

import requests
# 1.appoint url
url = 'https://www.sogou.com/'
# 2.launch get request:get Method will return the response object of successful request
response = requests.get(url=url)
# 3. Get the data from the response: the text property returns the page data
#    in the response object as a string
page_data = response.text
# 4. Persist the data
with open("sougou.html", "w", encoding="utf-8") as f:
    f.write(page_data)
print("ok")

How the requests module initiates and handles a get request that carries parameters

Sogou search: get the search result page corresponding to a specified term

Previously, with the urllib module, we had to encode Chinese parameters in the url ourselves; requests handles the url encoding automatically.
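A minimal sketch of that contrast (not part of the original example): with urllib you percent-encode non-ASCII query values yourself, while requests performs the same encoding automatically when a params dictionary is passed, as the full example below shows.

from urllib.parse import quote, urlencode

# Manual encoding, as you would do before building a urllib request
print(quote('周杰伦'))                                # percent-encoded query value
print(urlencode({'query': '周杰伦', 'ie': 'utf-8'}))  # encoded query string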

Initiate get request with parameters

params can be dictionaries or lists

def get(url, params=None, **kwargs):
  r"""Sends a GET request.
  :param url: URL for the new :class:`Request` object.
  :param params: (optional) Dictionary, list of tuples or bytes to send
    in the body of the :class:`Request`.
  :param \*\*kwargs: Optional arguments that ``request`` takes.
  :return: :class:`Response <Response>` object
  :rtype: requests.Response
  """
import requests
# appoint url
url = 'https://www.sogou.com/web'

# Encapsulate the get request parameters
prams = {
    'query': 'Jay Chou',
    'ie': 'utf-8'
}
response = requests.get(url=url, params=prams)
page_text = response.text
with open("Jay Chou.html", "w", encoding="utf-8") as f:
    f.write(page_text)
print("ok")

Use the requests module to customize the request header information and initiate a get request with parameters

The get method has a headers parameter, which assigns the dictionary of the request header information to the headers parameter

import requests
# appoint url
url = 'https://www.sogou.com/web'

# Encapsulate the get request parameters
prams = {
    'query': 'Jay Chou',
    'ie': 'utf-8'
}

# Custom request header information
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
}

response = requests.get(url=url, params=prams, headers=headers)
page_text = response.text
with open("Jay Chou.html", "w", encoding="utf-8") as f:
    f.write(page_text)
print("ok")

Python crawler: initiating an ajax get request based on the requests module

ajax get request based on requests module

Requirement: crawl the movie detail data in the Douban movie ranking list at https://movie.douban.com/

Use the packet capture tool to capture the ajax request that loads the page.

Dragging the scroll wheel down loads more movie information; this partial refresh is an ajax request initiated by the current page.

Use the packet capture tool to capture the ajax get request for this partial refresh, which is triggered when scrolling to the bottom of the page.

The Request URL of this get request is the url of the request initiated this time.

The ajax get request carries parameters.

The response content obtained is no longer page data but a json string: the movie detail information fetched by the asynchronous request.

Note that changing the start and limit parameters changes which movie details are returned.

import requests
import json
# Specify the url of the ajax-get request (obtained by capturing packets)
url = 'https://movie.douban.com/j/chart/top_list?'

# Encapsulate the parameters carried by the ajax get request (taken from the
# packet capture tool) into a dictionary
param = {
    'type': '13',
    'interval_id': '100:90',
    'action': '',
    'start': '20',   # get details starting from the 20th movie
    'limit': '20',   # how many movie details to fetch
    # changing these two parameters changes which movie details are returned
}
# Customize the request header information; it must be encapsulated in a dictionary
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36',
}

# Launch the ajax get request with the get method
response = requests.get(url=url, params=param, headers=headers)
# Get the response content: the response content is a json string
data = response.text
data = json.loads(data)
for data_dict in data:
    print(data_dict["rank"], data_dict["title"])
'''
Sample output (movie titles):
Furong Town As quiet as the sea Gold Rush circus troupe Love confusion Soldier song War and Peace palpitate with excitement Moonlight treasure box of Dahua journey to the West Roman Holiday Sound of music one by one Song in the rain I love you! Modi Nights of Cabiria marriage Benjamin·Barton wonder Love letter Spring is suddenly released
'''

 

post request:

#!user/bin/python
#coding=utf-8 
or
#-*-coding:utf-8-*-

#Import tool, built-in Library
import urllib
import urllib2
#Adding a \ lets a statement wrap across lines
#response = \
#  urllib2.urlopen("https://hao.360.cn/?wd_xp1")
#print response.read()
request = urllib2.Request('http://www.baidu.com')
#response = urllib2.urlopen(request)

#Construct the post request parameters
params = {}
params['account'] = 'jredu'
params['pwd'] = ''

#Encode the data
data = urllib.urlencode(params)
response = urllib2.urlopen(request, data)
print response.url
print response.code
print response.read()

get request:

#Import tool, built-in Library
import urllib
import urllib2
#Adding a \ lets a statement wrap across lines
#response = \
  #urllib2.urlopen("https://hao.360.cn/?wd_xp1")
#print response.read()
url='http://www.baidu.com'
#response = urllib2.urlopen(request)

#Construct the get request parameters
params = {}
params['account'] = 'jredu'
params['pwd'] = ''

#Encode the data
data = urllib.urlencode(params)
request = urllib2.Request(url + "?" + data)
response = urllib2.urlopen(request)
print response.url
print response.code
print response.read()

Example of parsing html web page files with the lxml library (Python big data)

lxml is a Python library for parsing html/xml and building a DOM. It is characterized by powerful functionality and good performance. Other xml/html parsing libraries include ElementTree, html5lib, BeautifulSoup and so on.

Precaution before using lxml: first make sure the html has been decoded as UTF-8, i.e. code = html.decode('utf-8', 'ignore'), otherwise there will be parsing errors. Chinese encoded as UTF-8 otherwise appears in forms like '/u2541', and lxml assumes a tag has ended as soon as it encounters "/".
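A minimal sketch of that precaution, assuming the page is fetched with requests (the URL is only a placeholder): decode the raw bytes to a UTF-8 string before handing them to lxml.

import requests
from lxml import etree

resp = requests.get('http://www.example.com')
code = resp.content.decode('utf-8', 'ignore')  # decode first, dropping undecodable bytes
dom = etree.HTML(code)
print(len(dom))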

Specific usage: element node operations (a runnable sketch follows the list below)

1. Parse HTML to build the DOM

from lxml import etree
dom = etree.HTML(html)

 

2. View the number of child elements in the dom: len(dom)

3. View the content of a node: etree.tostring(dom[0])

4. Get the tag name of a node: dom[0].tag

5. Get the parent of a node: dom[0].getparent()

6. Get the value of an attribute of a node: dom[0].get("attribute name")
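Here is the small self-contained sketch referred to above, run against an inline HTML string instead of a crawled page:

from lxml import etree

html = "<html><body><p class='intro'>hello</p><p>world</p></body></html>"
dom = etree.HTML(html)

print(len(dom))                  # number of child elements under <html>
print(etree.tostring(dom[0]))    # serialized content of the first child node
print(dom[0].tag)                # tag name of that node
print(dom[0].getparent().tag)    # its parent node, i.e. 'html'
p = dom.xpath("//p")[0]
print(p.get("class"))            # attribute value, i.e. 'intro'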

 

Support for xpath paths:

XPath is an XML path language that describes paths in XML documents in a way similar to a directory tree, with "/" separating levels. The first "/" represents the document's root node (note that this is the document itself, not the outermost tag node). For an html file, for example, the outermost element node is "/html".

How xpath selects elements:

1. Absolute path, such as page.xpath("/html/body/p"), which finds all the p tags directly under the body node

2. Relative path, such as page.xpath("//p"), which finds all p tags in the whole html document (a short sketch of both follows)
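A minimal sketch of both ways of selecting elements, again on an inline HTML string (page stands in for a parsed document):

from lxml import etree

page = etree.HTML("<html><body><p>a</p><div><p>b</p></div></body></html>")

print(page.xpath("/html/body/p"))   # absolute path: only the <p> directly under <body>
print(page.xpath("//p"))            # relative path: every <p> in the document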

 

xpath filtering methods (see the sketch after this list):

1. When selecting elements, the result is a list, so an element can be picked by index [n]

2. Filter elements by attribute value: p = page.xpath("//p[@style='font-size:200%']")

3. If there is no suitable attribute, elements can be filtered with text() (the element's text), position() (the element's position), last(), etc.
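The sketch mentioned above, showing the three filtering approaches on an inline HTML string:

from lxml import etree

page = etree.HTML(
    "<html><body>"
    "<p style='font-size:200%'>big</p><p>two</p><p>three</p>"
    "</body></html>"
)

print(page.xpath("//p")[1].text)                           # pick by list index
print(page.xpath("//p[@style='font-size:200%']")[0].text)  # filter by attribute value
print(page.xpath("//p[text()='three']")[0].text)           # filter by text()
print(page.xpath("//p[position()=2]")[0].text)             # filter by position()
print(page.xpath("//p[last()]")[0].text)                   # last <p>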

 

Get an attribute value

dom.xpath(".//a/@href")

 

Get text

dom.xpath(".//a/text()")

#!/usr/bin/python
# -*- coding:utf-8 -*-
from scrapy.spiders import Spider
from lxml import etree
from jredu.items import JreduItem
class JreduSpider(Spider):
  name = 'tt' #The name of the spider is mandatory and unique
  allowed_domains = ['sohu.com']
  start_urls = [
    'http://www.sohu.com'
  ]
  def parse(self, response):
    content = response.body.decode('utf-8')
    dom = etree.HTML(content)
    for ul in dom.xpath("//div[@class='focus-news-box']/div[@class='list16']/ul"):
      lis = ul.xpath("./li")
      for li in lis:
        item = JreduItem() #Define object
        if ul.index(li) == 0:
          strong = li.xpath("./a/strong/text()")
          li.xpath("./a/@href")
          item['title']= strong[0]
          item['href'] = li.xpath("./a/@href")[0]
        else:
          la = li.xpath("./a[last()]/text()")
          item['title'] = la[0]
          item['href'] = li.xpath("./a[last()]/@href")[0]
        yield item

 
