urllib Library
Preface
In practice, a crawler needs only two lines of code to fetch a page's content, and this functionality is provided by the urllib library.
In Python 2 there were two separate libraries, urllib2 and urllib; in Python 3 they were merged into a single urllib package. Since it is part of the standard library, there is nothing extra to install.
Using the urllib Library
request: the most basic HTTP request module, used to simulate sending requests.
error: the exception handling module. If a request fails, we can catch the exception and retry or take other action so the program does not terminate unexpectedly.
parse: a tool module that provides many URL handling methods, such as splitting, parsing, and merging; a quick import sketch follows.
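All three modules live inside the urllib package, so importing them looks like this:

import urllib.request   # sending requests
import urllib.error     # exceptions such as URLError and HTTPError
import urllib.parse     # URL utilities such as urlparse() and urlencode()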
Send request
Taking Baidu as an example, we first fetch its page content. As before, make sure your environment is set up; my previous article covered the installation steps in detail.
urlopen method
import urllib.request

# timeout prevents the request from hanging indefinitely if the site is slow or blocks our IP
response = urllib.request.urlopen("https://www.baidu.com", timeout=2)
# read() returns the response body as bytes; decode("utf-8") turns it into text
print(response.read().decode("utf-8"))
The timeout parameter sets a time limit: if the page is not returned within 2 seconds, an exception is raised. When crawling large sites, parameters like this keep the program from hanging on slow responses. The output is shown below.
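If we want the program to keep running instead of crashing when the time limit is exceeded, we can catch the exception. A minimal sketch, with a deliberately tiny timeout so the request is likely to fail:

import socket
import urllib.request
import urllib.error

try:
    # a deliberately tiny timeout, so this request will almost certainly time out
    response = urllib.request.urlopen("https://www.baidu.com", timeout=0.01)
    print(response.read().decode("utf-8"))
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print("The request timed out")
    else:
        print("The request failed:", e.reason)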
Only two lines of code are needed to print the page source. read() is a method that reads the response body, and decode("utf-8") converts the raw bytes into readable text. The response object has many other useful attributes and methods, for example:
import urllib.request

response = urllib.request.urlopen("https://www.baidu.com", timeout=2)
print(response.read().decode("utf-8"))   # the page source, decoded as UTF-8
print(type(response))                    # the type of the response object
print(response.status)                   # the status code; 200 means the request succeeded
print(response.getheaders())             # all response headers
print(response.getheader("Server"))      # the server type reported in the headers
The output is omitted here.
The Request class
The urlopen method can request a page and return its content, but on its own it is not enough for practical problems. For example, we often need to add headers to the request. For that we use the Request class:
class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
from urllib import request, parse

url = "http://httpbin.org/post"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
}
dict = {"name": "wangxiaowang"}
data = bytes(parse.urlencode(dict), encoding="utf-8")

# Setting a browser User-Agent disguises the crawler so the site does not identify and block it
req = request.Request(url=url, headers=headers, data=data, method="POST")
response = request.urlopen(req)   # request the page again, this time with headers and data
print(response.read().decode("utf-8"))

'''
The data parameter must be passed as bytes. If you start from a dictionary,
encode it first with urllib.parse.urlencode(), for example:
    dict = {"key": "value"}
    data = bytes(parse.urlencode(dict), encoding="utf-8")
The method parameter specifies the request method, usually GET, POST or PUT.
'''
Advanced Usage
In fact, an opener is similar to urlopen but gives more advanced control. It is useful, for example, for sites that require authentication; the following example fetches the source of a page protected by basic authentication.
from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:5000/'

p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)   # register the credentials for this URL
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)

try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)
Proxies
from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

proxy_handler = ProxyHandler({
    "http": "http://127.0.0.1:9743",
    "https": "https://127.0.0.1:9743"
})
opener = build_opener(proxy_handler)

try:
    response = opener.open("https://www.baidu.com")
    print(response.read().decode("utf-8"))
except URLError as e:
    print(e.reason)
Here a proxy is assumed to be running locally on port 9743. ProxyHandler takes a dictionary whose keys are protocol types (such as http or https) and whose values are the proxy links; multiple proxies can be added.
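If the proxy requires authentication, the credentials can be embedded directly in the proxy link. A sketch assuming a local proxy on port 9743 with a placeholder username and password:

from urllib.request import ProxyHandler, build_opener

# username, password and the local port are placeholder values
proxy_handler = ProxyHandler({
    "http": "http://username:password@127.0.0.1:9743"
})
opener = build_opener(proxy_handler)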
A proxy can also be configured globally instead of per opener. One last topic to cover: Cookies.
Cookies
import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()                    # a container for the cookies
handler = urllib.request.HTTPCookieProcessor(cookie)   # a handler that stores cookies from responses
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + '=' + item.value)
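Cookies can also be written to a file for later use. A sketch using http.cookiejar.MozillaCookieJar, where "cookies.txt" is just an example filename:

import http.cookiejar, urllib.request

filename = "cookies.txt"   # example filename
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
cookie.save(ignore_discard=True, ignore_expires=True)   # write the cookies to the file in Mozilla format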
Handling exceptions
from urllib import request, error

try:
    # a page that does not exist, used to trigger an error
    response = request.urlopen("https://www.cuiqingcai.com/index.htm")
except error.URLError as e:
    print(e.reason)   # print the reason for the failure
Besides URLError there is its subclass HTTPError, which has attributes such as reason and code. In practice, only by making good use of exception handling can we keep the code running smoothly and reduce the number of bugs.
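A common pattern, sketched below, is to catch the more specific HTTPError first and fall back to URLError:

from urllib import request, error

try:
    response = request.urlopen("https://www.cuiqingcai.com/index.htm")
except error.HTTPError as e:
    print("HTTP error:", e.code, e.reason)   # e.g. 404 Not Found
except error.URLError as e:
    print("URL error:", e.reason)
else:
    print("Request succeeded")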
Parsing links
Previously, we saw how to use urllib to obtain the source code of a web page. The next step is to work with the links themselves, using the urllib.parse module.
A standard URL is made up of several parts: the scheme, the netloc (domain), the path, the params, the query and the fragment.
urlparse()
Its function is to identify the components of a URL and split them apart, following the structure described above.
from urllib.parse import urlparse

result = urlparse("https://www.baidu.com", scheme="https", allow_fragments=False)
print(type(result), result)   # a ParseResult object holding the components of the URL
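The split is easier to see with a URL that has more parts; the address below is just an illustrative example:

from urllib.parse import urlparse

result = urlparse("http://www.baidu.com/index.html;user?id=5#comment")
print(result.scheme)    # http
print(result.netloc)    # www.baidu.com
print(result.path)      # /index.html
print(result.params)    # user
print(result.query)     # id=5
print(result.fragment)  # comment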
urlunparse()
It constructs a URL out of its components, the opposite of urlparse().
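A minimal sketch; urlunparse() expects exactly six components (scheme, netloc, path, params, query, fragment):

from urllib.parse import urlunparse

data = ["https", "www.baidu.com", "index.html", "user", "id=5", "comment"]
print(urlunparse(data))   # https://www.baidu.com/index.html;user?id=5#comment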
from urllib.parse import urlparse, urlencode, quote

result = urlparse("https://www.baidu.com", scheme="https", allow_fragments=False)
print(type(result), result)   # the ParseResult object for the address

'''
Other useful functions in urllib.parse:
urlunparse()  constructs a URL from its components
urlsplit()    like urlparse(), but does not split out the params part, so it returns only 5 results
urlunsplit()  the inverse of urlsplit()
urljoin()     merges a base URL and a relative link into a complete link
urlencode()   serializes a dictionary of parameters when constructing a GET request
parse_qs()    deserializes a query string back into a dictionary
parse_qsl()   deserializes a query string into a list of parameter tuples
quote()       converts content into URL encoding, so Chinese characters in a URL do not produce garbled text
'''

params = {
    "name": "wangxiaowang",
    "age": 100
}
base_url = "https://www.baidu.com?"
keys = "Wang Xiaowang"
url = base_url + urlencode(params) + quote(keys)
print(url)
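A quick sketch of a few of the functions listed above, with illustrative values:

from urllib.parse import urljoin, parse_qs, quote

print(urljoin("https://www.baidu.com", "about.html"))   # https://www.baidu.com/about.html
print(parse_qs("name=wangxiaowang&age=100"))            # {'name': ['wangxiaowang'], 'age': ['100']}
print(quote("Wang Xiaowang"))                           # Wang%20Xiaowang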
Robots protocol
It is used to tell crawlers which pages of a site may be crawled and which may not.
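urllib also ships a robotparser module for reading a site's robots.txt; a minimal sketch (the URLs are just examples):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.baidu.com/robots.txt")
rp.read()                                                  # download and parse robots.txt
print(rp.can_fetch("*", "https://www.baidu.com/"))         # may the "*" user agent crawl this path?
print(rp.can_fetch("*", "https://www.baidu.com/baidu"))    # check another path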
Well, that wraps up the introduction to the urllib library. See you in the next article!