Chinese University MOOC crawler notes

Getting Started with the Requests Library

Seven main methods

Method Description
requests.request() Constructs a request; the base method that underlies the other six methods
requests.get() The main method for fetching HTML pages, corresponding to HTTP GET
requests.head() Fetches the header information of an HTML page, corresponding to HTTP HEAD
requests.post() Submits a POST request to an HTML page, corresponding to HTTP POST
requests.put() Submits a PUT request to an HTML page, corresponding to HTTP PUT
requests.patch() Submits a partial-modification request to an HTML page, corresponding to HTTP PATCH
requests.delete() Submits a deletion request to an HTML page, corresponding to HTTP DELETE

1. requests.get(url, params=None, **kwargs)

url : the url link of the page to be retrieved
params : extra parameters in url, dictionary or byte stream format, optional
**kwargs: 12 parameters that control access
get() returns the response object:
Properties of the response object

Attribute Description
r.status_code The return status of the HTTP request; 200 means the connection succeeded, 404 and other values mean failure
r.text The string form of the HTTP response content, i.e. the page content corresponding to the url
r.encoding The response content encoding guessed from the HTTP headers
r.apparent_encoding The response content encoding inferred from the content itself (an alternative encoding)
r.content The binary form of the HTTP response content
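
A minimal sketch pulling these pieces together (httpbin.org is used here only for illustration; any URL works):

import requests

kv = {'q': 'python'}
r = requests.get('http://httpbin.org/get', params=kv, timeout=10)
print(r.status_code)           # 200 if the connection succeeded
print(r.encoding)              # encoding guessed from the HTTP headers
print(r.apparent_encoding)     # encoding inferred from the content itself
print(r.text[:200])            # first 200 characters of the response body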


Common exceptions of the Requests library

Exception Description
requests.ConnectionError Network connection error, such as a DNS lookup failure or a refused connection
requests.HTTPError HTTP error exception
requests.URLRequired Missing-URL exception
requests.TooManyRedirects Raised when the maximum number of redirects is exceeded
requests.ConnectTimeout Raised when connecting to the remote server times out
requests.Timeout Raised when the request to the URL times out
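
These exception classes are exposed directly on the requests module, so they can be caught individually; a minimal sketch (the URL is only an example):

import requests

try:
    r = requests.get('http://httpbin.org/status/404', timeout=5)
    r.raise_for_status()                  # raises requests.HTTPError for 4xx/5xx responses
except requests.ConnectionError:
    print('network connection error')
except requests.Timeout:
    print('request timed out')
except requests.HTTPError as e:
    print('HTTP error:', e)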

Common code framework for crawling web pages

Code:

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()              # raise HTTPError if the status is not 200
        r.encoding = r.apparent_encoding  # use the encoding inferred from the content
        return r.text
    except:
        return "generate exception"

if __name__ == '__main__':
    url = "http://www.baidu.com"
    print(getHTMLText(url))

HTTP protocol

HTTP: Hypertext Transfer Protocol
HTTP is a stateless application layer protocol based on the "request and response" model
The HTTP protocol uses URL as the identifier for locating network resources. The URL format is as follows:
http://host[:port][path]
host: legal Internet host domain name or IP address
port: port number, the default port is 80
path: the path of the requested resource
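
These parts can be inspected with Python's standard urllib.parse module (not part of the Requests library, shown here only as a quick illustration):

from urllib.parse import urlparse

u = urlparse('http://www.example.com:8080/some/path/index.html')
print(u.hostname)   # www.example.com
print(u.port)       # 8080
print(u.path)       # /some/path/index.html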

Operation of HTTP protocol on resources

Method Description
GET Request the resource at the URL location
HEAD Request the response headers of the resource at the URL location, i.e. obtain the resource's header information
POST Request that new data be appended to the resource at the URL location
PUT Request that a resource be stored at the URL location, overwriting the resource originally there
PATCH Request a partial update of the resource at the URL location, i.e. change part of that resource's content
DELETE Request that the resource stored at the URL location be deleted

2. The head() method of the Requests library

>>> r = requests.head('http://httpbin.org/get')
>>> r.headers
{'Date': 'Tue, 18 Aug 2020 01:30:20 GMT', 'Content-Type': 'application/json', 'Content-Length': '307', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}
>>> r.text
''

3. The post() method of Requests

>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.post('http://httpbin.org/post', data = payload)
>>> print(r.text)
{ ...
"form": {
"key2": "value2", #POST a dictionary to the URL
"key1": "value1" #Automatically encode as form (form)
},
}
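
By contrast, if a string rather than a dictionary is posted as data, httpbin stores it under the data field and the form field stays empty; a sketch of the expected output, following the pattern above:

>>> r = requests.post('http://httpbin.org/post', data='ABC')
>>> print(r.text)
{ ...
"data": "ABC", #POST a string to the URL
"form": {}, #The string is sent as the raw request body, not form-encoded
}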

4. The put() method of Requests

>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.put('http://httpbin.org/put', data = payload)
>>> print(r.text)
{ ...
"form": {
"key2": "value2",
"key1": "value1"
},
}

5. requests.request(method, url, **kwargs)

requests.request(method, url, **kwargs) is the basis for the remaining six methods

Parameter Description
method The request method, one of the seven types such as GET/PUT/POST
url The url link of the page to be retrieved
**kwargs Parameters that control access, 13 in total

method : request method

r = requests.request('GET', url, **kwargs)
r = requests.request('HEAD', url, **kwargs)
r = requests.request('POST', url, **kwargs)
r = requests.request('PUT', url, **kwargs)
r = requests.request('PATCH', url, **kwargs)
r = requests.request('DELETE', url, **kwargs)
r = requests.request('OPTIONS', url, **kwargs)

**kwargs: parameters that control access, all optional

params : dictionary or byte sequence, added to the url as a parameter

code

>>> kv = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.request('GET', 'http://python123.io/ws', params=kv)
>>> print(r.url)
http://python123.io/ws?key1=value1&key2=value2

data : dictionary, byte sequence or file object, as the content of the Request

>>> kv = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.request('POST', 'http://python123.io/ws', data=kv)
>>> body = 'Main content'
>>> r = requests.request('POST', 'http://python123.io/ws', data=body)

json : data in JSON format, as the content of Request

>>> kv = {'key1': 'value1'}
>>> r = requests.request('POST', 'http://python123.io/ws', json=kv)

headers : dictionary, HTTP custom headers

>>> hd = {'user-agent': 'Chrome/10'}
>>> r = requests.request('POST', 'http://python123.io/ws', headers=hd)

cookies : dictionary or CookieJar, cookies in Request
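
A minimal sketch of passing cookies as a plain dictionary (httpbin echoes them back):

>>> cj = {'session_id': '12345'}
>>> r = requests.request('GET', 'http://httpbin.org/cookies', cookies=cj)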

auth : tuple, support HTTP authentication function
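
A minimal sketch of HTTP Basic authentication with a (user, password) tuple, using httpbin's test endpoint:

>>> r = requests.request('GET', 'http://httpbin.org/basic-auth/user/passwd', auth=('user', 'passwd'))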

files : dictionary type, transfer files

>>> fs = {'file': open('data.xls', 'rb')}
>>> r = requests.request('POST', 'http://python123.io/ws', files=fs)

timeout : Set the timeout time, in seconds

>>> r = requests.request('GET', 'http://www.baidu.com', timeout=10)

proxies : dictionary type, set the access proxy server, you can add login authentication

>>> pxs = { 'http': 'http://user:pass@10.10.10.1:1234',
            'https': 'https://10.10.10.1:4321' }
>>> r = requests.request('GET', 'http://www.baidu.com', proxies=pxs)

allow_redirects : True/False, the default is True, redirect switch

stream : True/False, defaults to False in the Requests library; when True, the response body is not downloaded immediately but streamed on demand

verify : True/False, the default is True, verify the SSL certificate switch

cert : local SSL certificate path
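
A minimal sketch of verify and cert (verify=False skips SSL certificate verification, which is not recommended outside testing; the certificate paths below are placeholders):

>>> r = requests.request('GET', 'https://www.baidu.com', verify=False)
>>> r = requests.request('GET', 'https://example.com', cert=('/path/client.crt', '/path/client.key'))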

6. requests.patch()
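
Same usage pattern as post()/put() above; a minimal sketch against httpbin's /patch endpoint:

>>> payload = {'key1': 'new value'}
>>> r = requests.patch('http://httpbin.org/patch', data=payload)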

7. requests.delete()
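
Likewise, a minimal sketch against httpbin's /delete endpoint:

>>> r = requests.delete('http://httpbin.org/delete')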

Requests library web crawling in practice

Example 1: Crawling of JD.com product pages

interactive code

>>> import requests
>>> r=requests.get("https://item.jd.com/2967929.html")
>>> r.status_code
200
>>> r.encoding
'UTF-8'
>>> r.text[:1000]
"<script>window.location.href='https://passport.jd.com/uc/login?ReturnUrl=http%3A%2F%2Fitem.jd.com%2F2967929.html'</script>"

Unlike the video example, r.text now returns a script that redirects to the JD.com login page, i.e. viewing the product page requires logging in. This problem remains to be solved.
full code

import requests
url = "https://item.jd.com/2967929.html"
try:
    r = requests.get(url)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except:
    print("Crawl failed")

Example 2: Crawling of Amazon product pages

interactive code

>>> import requests
>>> r=requests.get("https://www.amazon.cn/gp/product/B01M8L5Z3Y")
>>> r.status_code
503
>>> r.encoding
'ISO-8859-1'
>>> r.encoding=r.apparent_encoding
>>> r.text
......  # the crawl failed: the returned content is an error page
>>> kv={'User-Agent':'Mozilla/5.0'}
>>> url="https://www.amazon.cn/gp/product/B01M8L5Z3Y"
>>> r=requests.get(url,headers=kv)
>>> r.status_code
200
>>> r.request.headers
{'User-Agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
>>> r.text[:1000]
'<!DOCTYPE html>\n<!--[if lt IE 7]> <html ...
#Crawled successfully

Analysis: the default User-Agent of the requests library identifies the visitor as a Python crawler, so Amazon rejects the request; the User-Agent header has to be changed to look like a browser.
full code

import requests
url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
try:
    kv = {'User-Agent': 'Mozilla/5.0'}   # pretend to be a browser rather than a Python crawler
    r = requests.get(url, headers=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except:
    print("Crawl failed")

Example 3: Baidu/360 search keyword submission

key interface

Baidu's keyword interface:

http://www.baidu.com/s?wd=keyword

360 keyword interface:

http://www.so.com/s?q=keyword
interactive code

>>> import requests
>>> kv={'wd':'Python'}
>>> r=requests.get("http://www.baidu.com/s",params=kv)
>>> r.status_code
200
>>> r.request.url
'https://wappass.baidu.com/static/captcha/tuxing.html?&ak=c27bbc89afca0463650ac9bde68ebe06&backurl=https%3A%2F%2Fwww.baidu.com%2Fs%3Fwd%3DPython&logid=12091394642915979213&signature=7f7635eaf67ce2b315dc5dd09e36b7e2&timestamp=1597718265'
>>> len(r.text)
1519

full code

import requests
keyword='Python'
try:
	kv={'wd':keyword}
	r=requests.get("http://www.baidu.com/s",params=kv)
	print(r.request.url)
	r.raise_for_status()
	print(len(r.text))
except:
	print("Crawl failed")

Example 4: Crawling and storage of network pictures

Format of web image links:

http://www.example.com/picture.jpg
Links in jpeg or other image formats can be used in the same way

Optional image address:

http://image.ngchina.com.cn/userpic/109099/2020/08011238101090991864.jpeg
interactive code

>>> import requests
>>> path="D://234.jpeg"
>>> url="http://image.ngchina.com.cn/userpic/109099/2020/08011238101090991864.jpeg"
>>> r=requests.get(url)
>>> r.status_code
200
>>> with open(path,'wb') as f:
	f.write(r.content)	
	
343023
>>> f.close()

The image can then be found at the given path
full code

import requests
import os
url ='http://image.ngchina.com.cn/userpic/109099/2020/08011238101090991864.jpeg'
root = 'D://pics//'
path = root + url.split('/')[-1]
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        with open(path,'wb') as f:
            f.write(r.content)   # the with statement closes the file automatically
        print('file saved successfully')
    else:
        print('File already exists')
except:
    print('Crawl failed')

Example 5: Automatic query of IP address attribution

http://m.ip138.com/ip.asp?ip=ipaddress
interactive code

>>> import requests
>>> url ='https://m.ip138.com/ip.asp?ip='
>>> r = requests.get(url+ '202.204.80.112',timeout=2)
'''
Traceback (most recent call last):
  File "D:\python\lib\site-packages\urllib3\connection.py", line 159, in _new_conn
    conn = connection.create_connection(
  File "D:\python\lib\site-packages\urllib3\util\connection.py", line 61, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "D:\python\lib\socket.py", line 918, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 11001] getaddrinfo failed
'''

The traceback shows a DNS resolution failure (getaddrinfo failed) rather than a simple timeout; presumably the BIT VPN is needed before the site can be reached.
full code

import requests
url ='https://m.ip138.com/ip.asp?ip='

try:
    r = requests.get(url+'202.204.80.112')
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-500:])
except:
    print('Crawl failed')

summary

This gives a preliminary grasp of some basic uses of the requests library; combined with the regular-expression material learned earlier, applying regular expressions to the crawled content should make the crawler truly efficient.
