Getting Started with the Requests Library
Seven main methods
Method | Description |
---|---|
requests.request() | Constructs a request; the base method that supports all of the methods below |
requests.get() | The main method for fetching an HTML page, corresponding to HTTP GET |
requests.head() | Gets the header information of an HTML page, corresponding to HTTP HEAD |
requests.post() | Submits a POST request to an HTML page, corresponding to HTTP POST |
requests.put() | Submits a PUT request to an HTML page, corresponding to HTTP PUT |
requests.patch() | Submits a partial-modification request to an HTML page, corresponding to HTTP PATCH |
requests.delete() | Submits a delete request to an HTML page, corresponding to HTTP DELETE |
1. requests.get(url, params=None, **kwargs)
url : the URL of the page to be fetched
params : extra parameters appended to the URL, as a dictionary or byte sequence; optional
**kwargs : 12 optional parameters that control access
get() returns the response object:
Properties of the response object
Attribute | Description |
---|---|
r.status_code | The return status of the HTTP request; 200 means success, 404 and other values mean failure |
r.text | The string form of the HTTP response content, i.e. the page content corresponding to the URL |
r.encoding | The response content encoding guessed from the HTTP headers |
r.apparent_encoding | The response content encoding inferred from the content itself (an alternative to r.encoding) |
r.content | The binary form of the HTTP response content |
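As a quick illustration of these attributes, here is a minimal sketch (the Baidu URL is simply the same example used later in these notes):

```python
import requests

# A quick look at the Response attributes listed above.
r = requests.get("http://www.baidu.com", timeout=30)

print(r.status_code)          # 200 if the request succeeded
print(r.encoding)             # encoding guessed from the HTTP headers
print(r.apparent_encoding)    # encoding inferred from the page content itself
print(type(r.content))        # bytes: the raw binary body

# Switching to apparent_encoding usually fixes garbled text before reading r.text.
r.encoding = r.apparent_encoding
print(r.text[:200])           # first 200 characters of the decoded page
```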
Common exceptions to the Requests library
Exception | Description |
---|---|
requests.ConnectionError | Network connection error, such as DNS lookup failure or connection refused |
requests.HTTPError | HTTP error exception |
requests.URLRequired | URL missing exception |
requests.TooManyRedirects | Raised when the maximum number of redirects is exceeded |
requests.ConnectTimeout | Raised when connecting to the remote server times out |
requests.Timeout | Raised when the request to the URL times out |
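As a hedged sketch of how these exception classes could be caught individually (the framework in the next section simply uses a bare except; fetch is just an illustrative helper name and example.com a placeholder URL):

```python
import requests

def fetch(url):
    """Sketch: catch the specific requests exceptions listed above."""
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()          # raises requests.HTTPError for 4xx/5xx responses
        return r.text
    except requests.ConnectTimeout:   # checked first: subclass of ConnectionError and Timeout
        print("connection to the server timed out")
    except requests.Timeout:
        print("the request timed out")
    except requests.HTTPError as e:
        print("HTTP error:", e)
    except requests.ConnectionError as e:
        print("network connection error:", e)

fetch("http://example.com")
```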
Common code framework for crawling web pages
Code:
    import requests

    def getHTMLText(url):
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except:
            return "generate exception"

    if __name__ == '__main__':
        url = "http://www.baidu.com"
        print(getHTMLText(url))
HTTP protocol
HTTP stands for Hypertext Transfer Protocol.
HTTP is a stateless, application-layer protocol based on the request-response model.
The HTTP protocol uses URLs as identifiers for locating network resources. A URL has the following format:
http://host[:port][path]
host: legal Internet host domain name or IP address
port: port number, the default port is 80
path: the path of the requested resource
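For illustration only, the standard-library urllib.parse module (not part of requests) splits a URL into exactly these components; the URL below is made up:

```python
from urllib.parse import urlparse

u = urlparse("http://www.example.com:8080/path/to/resource")
print(u.hostname)   # 'www.example.com'   -> host
print(u.port)       # 8080                -> port (None means the default, 80 for http)
print(u.path)       # '/path/to/resource' -> path
```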
Operation of HTTP protocol on resources
Method | Description |
---|---|
GET | Request the resource at the URL location |
HEAD | Request the response headers of the resource at the URL location, i.e. only its header information |
POST | Request that new data be appended to the resource at the URL location |
PUT | Request that a resource be stored at the URL location, overwriting the resource originally there |
PATCH | Request a partial update of the resource at the URL location, i.e. change part of that resource's content |
DELETE | Request deletion of the resource stored at the URL location |
2. The head() method of the Requests library
    >>> r = requests.head('http://httpbin.org/get')
    >>> r.headers
    {'Date': 'Tue, 18 Aug 2020 01:30:20 GMT', 'Content-Type': 'application/json', 'Content-Length': '307', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}
    >>> r.text
    ''
3. The post() method of Requests
    >>> payload = {'key1': 'value1', 'key2': 'value2'}
    >>> r = requests.post('http://httpbin.org/post', data=payload)
    >>> print(r.text)
    {
      ...
      "form": {
        "key2": "value2",    # POSTing a dictionary to the URL
        "key1": "value1"     # automatically encodes it as a form
      },
    }
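For comparison, a hedged sketch (not part of the original session): posting a plain string instead of a dictionary is not form-encoded, so httpbin echoes it back under "data" rather than "form":

```python
import requests

# POSTing a string: the body is sent as-is, so httpbin reports it in "data",
# not in "form" (which holds form-encoded key/value pairs).
r = requests.post('http://httpbin.org/post', data='Main content')
print(r.json()['data'])    # 'Main content'
print(r.json()['form'])    # {}  -> empty, nothing was form-encoded
```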
4. The put() method of Requests
    >>> payload = {'key1': 'value1', 'key2': 'value2'}
    >>> r = requests.put('http://httpbin.org/put', data=payload)
    >>> print(r.text)
    {
      ...
      "form": {
        "key2": "value2",
        "key1": "value1"
      },
    }
5. The requests.request(method, url, **kwargs) method
requests.request(method, url, **kwargs) is the basis for the remaining six methods
Parameter | Description |
---|---|
method | The request method, one of the 7 types such as GET/PUT/POST |
url | The URL of the page to be fetched |
**kwargs | Parameters that control access, 13 in total |
method : request method
r = requests.request('GET', url, **kwargs)
r = requests.request('HEAD', url, **kwargs)
r = requests.request('POST', url, **kwargs)
r = requests.request('PUT', url, **kwargs)
r = requests.request('PATCH', url, **kwargs)
r = requests.request('DELETE', url, **kwargs)
r = requests.request('OPTIONS', url, **kwargs)
**kwargs: parameters that control access, all optional
params : dictionary or byte sequence, added to the url as a parameter
code
    >>> kv = {'key1': 'value1', 'key2': 'value2'}
    >>> r = requests.request('GET', 'http://python123.io/ws', params=kv)
    >>> print(r.url)
    http://python123.io/ws?key1=value1&key2=value2
data : dictionary, byte sequence or file object, as the content of the Request
    >>> kv = {'key1': 'value1', 'key2': 'value2'}
    >>> r = requests.request('POST', 'http://python123.io/ws', data=kv)
    >>> body = 'Main content'
    >>> r = requests.request('POST', 'http://python123.io/ws', data=body)
json : data in JSON format, as the content of Request
    >>> kv = {'key1': 'value1'}
    >>> r = requests.request('POST', 'http://python123.io/ws', json=kv)
headers : dictionary, HTTP custom headers
    >>> hd = {'user-agent': 'Chrome/10'}
    >>> r = requests.request('POST', 'http://python123.io/ws', headers=hd)
cookies : dictionary or CookieJar, cookies in Request
auth : tuple, supports HTTP authentication
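Neither cookies nor auth is demonstrated above; a minimal sketch of both, using httpbin test endpoints purely for illustration:

```python
import requests

# cookies: a plain dict is sent as the Cookie header; httpbin echoes it back.
r = requests.get('http://httpbin.org/cookies', cookies={'session_id': 'abc123'})
print(r.json())          # {'cookies': {'session_id': 'abc123'}}

# auth: a (user, password) tuple enables HTTP Basic authentication.
r = requests.get('http://httpbin.org/basic-auth/user/passwd', auth=('user', 'passwd'))
print(r.status_code)     # 200 when the credentials match
```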
files : dictionary type, transfer files
    >>> fs = {'file': open('data.xls', 'rb')}
    >>> r = requests.request('POST', 'http://python123.io/ws', files=fs)
timeout : Set the timeout time, in seconds
>>> r = requests.request('GET', 'http://www.baidu.com', timeout=10)
proxies : dictionary, sets proxy servers for access; login authentication can be included
    >>> pxs = {'http': 'http://user:pass@10.10.10.1:1234',
    ...        'https': 'https://10.10.10.1:4321'}
    >>> r = requests.request('GET', 'http://www.baidu.com', proxies=pxs)
allow_redirects : True/False, default True; switch for automatically following redirects
stream : True/False; switch for downloading the content immediately (the parameter itself defaults to False, meaning the body is downloaded right away; stream=True defers the download)
verify : True/False, default True; switch for verifying the SSL certificate
cert : path to a local SSL client certificate
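These last few switches are not demonstrated above either; a minimal sketch of how they are typically passed (httpbin endpoints and file names are placeholders; the verify/cert lines are left commented out because they need a real HTTPS setup):

```python
import requests

# allow_redirects=False: do not follow 3xx responses; inspect the Location header yourself.
r = requests.get('http://httpbin.org/redirect/1', allow_redirects=False)
print(r.status_code, r.headers.get('Location'))

# stream=True: headers are fetched first, the body is downloaded lazily in chunks.
r = requests.get('http://httpbin.org/bytes/1024', stream=True)
with open('chunk.bin', 'wb') as f:          # placeholder output file
    for chunk in r.iter_content(chunk_size=512):
        f.write(chunk)

# verify=False disables SSL certificate verification (not recommended in production);
# cert points to a local client certificate if the server requires one.
# r = requests.get('https://example.com', verify=False)
# r = requests.get('https://example.com', cert='/path/to/client.pem')
```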
6. requests.patch()
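The original notes give no example here; a minimal sketch against httpbin, mirroring the post()/put() sessions above:

```python
import requests

# PATCH submits a partial modification; httpbin simply echoes the form fields back.
r = requests.patch('http://httpbin.org/patch', data={'key1': 'new value'})
print(r.json()['form'])    # {'key1': 'new value'}
```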
7. requests.delete()
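Likewise no example in the original notes; a minimal sketch:

```python
import requests

# DELETE asks the server to remove the resource at the URL; httpbin just echoes the request.
r = requests.delete('http://httpbin.org/delete')
print(r.status_code)       # 200
```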
Web crawling in practice with the Requests library
Example 1: Crawling of JD.com product pages
interactive interface code
    >>> import requests
    >>> r = requests.get("https://item.jd.com/2967929.html")
    >>> r.status_code
    200
    >>> r.encoding
    'UTF-8'
    >>> r.text[:1000]
    "<script>window.location.href='https://passport.jd.com/uc/login?ReturnUrl=http%3A%2F%2Fitem.jd.com%2F2967929.html'</script>"
Unlike the example in the video, r.text now shows a redirect to JD.com's login page, so the product page cannot be fetched directly; this remains to be solved (a possible workaround is sketched after the full code below).
full code
    import requests

    url = "https://item.jd.com/2967929.html"
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print(r.text[:1000])
    except:
        print("Crawl failed")
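A possible workaround (an assumption, not verified here): since JD.com appears to redirect non-browser clients to its login page, sending a browser-like User-Agent header, exactly as in the Amazon example in the next section, may return the product page directly:

```python
import requests

# Assumption: JD.com may serve the product page when the request looks like a browser.
kv = {'User-Agent': 'Mozilla/5.0'}
r = requests.get("https://item.jd.com/2967929.html", headers=kv)
print(r.status_code)
print(r.text[:500])
```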
Example 2: Crawling of Amazon product pages
interactive page
>>> import requests >>> r=requests.get("https://www.amazon.cn/gp/product/B01M8L5Z3Y") >>> r.status_code 503 >>> r.encoding 'ISO-8859-1' >>> r.encoding=r.apparent_encoding >>> r.text ......#Crawled content failed >>> kv={'User-Agent':'Mozilla/5.0'} >>> url="https://www.amazon.cn/gp/product/B01M8L5Z3Y" >>> r=requests.get(url,headers=kv) >>> r.status_code 200 >>> r.request.headers {'User-Agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'} >>> r.text[:1000] '<!DOCTYPE html>\n<!--[if lt IE 7]> <html ... #Crawled successfully
Analysis: by default the requests library identifies itself as a Python crawler in its User-Agent header, so Amazon rejects the request; changing the User-Agent to look like a browser fixes this.
full code
import requests url="https://www.amazon.cn/gp/product/B01M8L5Z3Y" try: kv={'User-Agent':'Mozilla/5.0'} r=requests.get(url,headers=kv) r.raise_for_status() r.encoding=r.apparent_encoding print(r.text[:1000]) except: return "Crawl failed"
Example 3: Baidu/360 search keyword submission
Keyword interfaces
Baidu's keyword interface:
http://www.baidu.com/s?wd=keyword
360 keyword interface:
http://www.so.com/s?q=keyword
interactive code
    >>> import requests
    >>> kv = {'wd': 'Python'}
    >>> r = requests.get("http://www.baidu.com/s", params=kv)
    >>> r.status_code
    200
    >>> r.request.url
    'https://wappass.baidu.com/static/captcha/tuxing.html?&ak=c27bbc89afca0463650ac9bde68ebe06&backurl=https%3A%2F%2Fwww.baidu.com%2Fs%3Fwd%3DPython&logid=12091394642915979213&signature=7f7635eaf67ce2b315dc5dd09e36b7e2&timestamp=1597718265'
    >>> len(r.text)
    1519
full code
    import requests

    keyword = 'Python'
    try:
        kv = {'wd': keyword}
        r = requests.get("http://www.baidu.com/s", params=kv)
        print(r.request.url)
        r.raise_for_status()
        print(len(r.text))
    except:
        print("Crawl failed")
Example 4: Crawling and storage of network pictures
Format of web image links:
http://www.example.com/picture.jpg
Links with the .jpeg extension work the same way.
Example image URL:
http://image.ngchina.com.cn/userpic/109099/2020/08011238101090991864.jpeg
interactive code
    >>> import requests
    >>> path = "D://234.jpeg"
    >>> url = "http://image.ngchina.com.cn/userpic/109099/2020/08011238101090991864.jpeg"
    >>> r = requests.get(url)
    >>> r.status_code
    200
    >>> with open(path, 'wb') as f:
    ...     f.write(r.content)
    ...
    343023
    >>> f.close()
Then you can find the image in the relevant path
full code
    import requests
    import os

    url = 'http://image.ngchina.com.cn/userpic/109099/2020/08011238101090991864.jpeg'
    root = 'D://pics//'
    path = root + url.split('/')[-1]
    try:
        if not os.path.exists(root):
            os.mkdir(root)
        if not os.path.exists(path):
            r = requests.get(url)
            with open(path, 'wb') as f:
                f.write(r.content)
            print('file saved successfully')
        else:
            print('File already exists')
    except:
        print('Crawl failed')
Example 5: Automatic query of IP address attribution
http://m.ip138.com/ip.asp?ip=ipaddress
interactive code
    >>> import requests
    >>> url = 'https://m.ip138.com/ip.asp?ip='
    >>> r = requests.get(url + '202.204.80.112', timeout=2)
    Traceback (most recent call last):
      File "D:\python\lib\site-packages\urllib3\connection.py", line 159, in _new_conn
        conn = connection.create_connection(
      File "D:\python\lib\site-packages\urllib3\util\connection.py", line 61, in create_connection
        for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
      File "D:\python\lib\socket.py", line 918, in getaddrinfo
        for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
    socket.gaierror: [Errno 11001] getaddrinfo failed
The error looks like a name-resolution/timeout failure; presumably the site can only be reached through the BIT VPN.
full code
    import requests

    url = 'https://m.ip138.com/ip.asp?ip='
    try:
        r = requests.get(url + '202.204.80.112')
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print(r.text[-500:])
    except:
        print('Crawl failed')
summary
This gives a preliminary grasp of the basic uses of the requests library; combining it with the regular-expression material studied earlier to write crawler-oriented regular expressions should make the crawlers genuinely efficient.