Crawlers: Using the urllib Library

urllib Library

Preface
In practice, a crawler needs only two lines of code to fetch a page's content; that functionality is provided by the urllib library.

In Python 2 there were two separate libraries, urllib2 and urllib, but in Python 3 they were merged into a single urllib package. That is why you never need to install it separately: it ships with the standard library.

Using the urllib Library

request: the most basic HTTP request module, used to simulate sending requests.
error: the exception handling module. If a request fails, we can catch the exception and retry or take other action so the program does not terminate unexpectedly.
parse: a tool module providing many URL-processing methods, such as splitting, parsing, and merging.
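In Python 3 all three are submodules of the standard-library urllib package, so a quick sanity check of the imports looks like this (just a sketch to show where everything lives):

from urllib import request, error, parse

# each name is a submodule of urllib; no separate installation is needed
print(request.__name__, error.__name__, parse.__name__)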

Send request

Taking Baidu as an example, let's first fetch its page information. As before, make sure your environment is set up; my previous article covered installation in detail.

urlopen method

import urllib.request

# timeout=2: give up if the page is not returned within 2 seconds
response = urllib.request.urlopen("https://www.baidu.com", timeout=2)
# read() returns the page source as bytes; decode("utf-8") turns it into a string
print(response.read().decode("utf-8"))

The timeout parameter sets a time limit: if the page does not come back within 2 seconds, an exception is raised. When crawling large sites, parameters like this stop the program from hanging on a slow response.
Only two lines of code are needed to print the page source: read() is a method that returns the response body, and decode("utf-8") converts the bytes it returns into a string. The response object has many more attributes and methods, for example:

import urllib.request

response = urllib.request.urlopen("https://www.baidu.com", timeout=2)  # 2-second time limit
print(response.read().decode("utf-8"))   # the page source, decoded from bytes
print(type(response))                    # the type of the response object
print(response.status)                   # status code; 200 means the request succeeded
print(response.getheaders())             # all of the response headers
print(response.getheader("Server"))      # the server software reported in the headers

The output is omitted here.
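As a side note on the timeout parameter used above, here is a minimal sketch of what happens when the limit is exceeded; the 0.01-second timeout is deliberately tiny so the request is very likely to fail, and the resulting URLError wraps a socket.timeout:

import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen("https://www.baidu.com", timeout=0.01)
except urllib.error.URLError as e:
    # when the time limit is exceeded, the underlying reason is a socket.timeout
    if isinstance(e.reason, socket.timeout):
        print("TIME OUT")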

Request method

The urlopen method can request a page and return its content, but on its own it is not enough for real-world problems. For example, we often need to add headers to the request. That is what the Request class is for:

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
from urllib import request, parse

url = "http://httpbin.org/post"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
}
form = {
    "name": "wangxiaowang"
}
data = bytes(parse.urlencode(form), encoding="utf-8")
# a browser User-Agent disguises the crawler so it is less likely to be blocked
req = request.Request(url=url, headers=headers, data=data, method="POST")
response = request.urlopen(req)  # send the request that was just built
print(response.read().decode("utf-8"))
'''
The data parameter must be passed as bytes (a byte stream). If you start from a dictionary,
encode it first with urllib.parse.urlencode(),
for example: form = {"key": "value"}; data = bytes(parse.urlencode(form), encoding="utf-8")
The method parameter specifies the HTTP method of the request, typically GET, POST or PUT.
'''
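Headers do not have to be passed to the constructor: the Request class also provides an add_header() method, so the same browser disguise can be attached after the object is built. A minimal sketch, reusing the httpbin.org test endpoint from above:

from urllib import request, parse

url = "http://httpbin.org/post"
data = bytes(parse.urlencode({"name": "wangxiaowang"}), encoding="utf-8")
req = request.Request(url=url, data=data, method="POST")
# add_header(key, value) attaches a single header after construction
req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
response = request.urlopen(req)
print(response.read().decode("utf-8"))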


Advanced Usage


An opener works much like urlopen but offers finer control, so it counts as advanced usage. It is needed, for example, for sites protected by HTTP Basic authentication; the following example fetches the source of a page behind such a login.

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:5000/'   # a local test site protected by Basic auth; adjust to your own

# register the credentials for this URL, then build an opener that uses them
p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)
try:
    result = opener.open(url)
    html = result.read().decode("utf-8")
    print(html)
except URLError as e:
    print(e.reason)

Proxy

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

# ProxyHandler takes a dict mapping protocol type to proxy address
proxy_handler = ProxyHandler({
    "http": "http://127.0.0.1:9743",
    "https": "https://127.0.0.1:9743"
})
opener = build_opener(proxy_handler)
try:
    response = opener.open("https://www.baidu.com")
    print(response.read().decode("utf-8"))
except URLError as e:
    print(e.reason)

This assumes a proxy running locally on port 9743. ProxyHandler takes a dictionary whose keys are protocol types (such as http or https) and whose values are proxy URLs; multiple proxies can be added.


Cookies

import http.cookiejar
import urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
# the CookieJar now holds the cookies the site set; print them as name=value
for item in cookie:
    print(item.name + '=' + item.value)
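The cookies above only live in memory for the lifetime of the program. Assuming you want to keep them between runs, CookieJar has file-backed subclasses; a minimal sketch using MozillaCookieJar (the file name cookies.txt is just an example):

import http.cookiejar
import urllib.request

filename = "cookies.txt"   # example file name
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
# write the received cookies to disk in Mozilla/Netscape format
cookie.save(ignore_discard=True, ignore_expires=True)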

Handling exceptions

from urllib import request, error

try:
    # a made-up page that does not exist, used to trigger an error
    response = request.urlopen("https://www.cuiqingcai.com/index.htm")
except error.URLError as e:
    print(e.reason)  # print why the request failed

When handling exceptions there is also HTTPError (a subclass of URLError), which has the attributes reason, code and headers. In practice, good use of exception handling is what lets a crawler run efficiently and keeps bugs to a minimum.
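Because HTTPError is a subclass of URLError, a common pattern is to catch the more specific HTTPError first and fall back to URLError; a minimal sketch, again using a made-up page that should simply fail:

from urllib import request, error

try:
    response = request.urlopen("https://www.cuiqingcai.com/index.htm")
except error.HTTPError as e:
    # HTTPError carries the reason, the status code and the response headers
    print(e.reason, e.code, e.headers, sep="\n")
except error.URLError as e:
    print(e.reason)
else:
    print("Request Successfully")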

Parsing links

Earlier we saw how to use urllib to obtain the source code of a web page. The next step is to work with the links themselves.

A URL has a standard structure: scheme, netloc (domain), path, params, query and fragment.

urlparse()
Its job is to identify the components of a URL and split them apart, following the standard structure above.

from urllib.parse import urlparse

result = urlparse("https://www.baidu.com", scheme="https", allow_fragments=False)
print(type(result), result)  # a ParseResult object holding the pieces of the URL
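The returned ParseResult is a named tuple, so each of the six components can be read by attribute as well as by index; a quick sketch:

from urllib.parse import urlparse

result = urlparse("https://www.baidu.com/index.html;user?id=5#comment")
# scheme, netloc, path, params, query and fragment are the six parts
print(result.scheme, result.netloc, result.path, result.params, result.query, result.fragment)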


urlunparse()
It is the inverse of urlparse(): it constructs a URL from its component parts.

from urllib.parse import urlunparse

# build a URL from its six components: scheme, netloc, path, params, query, fragment
data = ["https", "www.baidu.com", "index.html", "user", "a=6", "comment"]
print(urlunparse(data))   # https://www.baidu.com/index.html;user?a=6#comment
''' Other helpers in urllib.parse:
    urlunparse()  the inverse of urlparse(); builds a URL from its components (demonstrated above)
    urlsplit()    like urlparse() but does not split out the params part, so it returns only 5 components
    urlunsplit()  builds a URL back from those 5 components
    urljoin()     merges a base URL with a (possibly relative) link into a complete URL
    urlencode()   serializes a dict of parameters when constructing a GET request
    parse_qs()    deserializes a query string back into a dict
    parse_qsl()   converts the query parameters into a list of tuples
    quote()       converts content into URL-encoded form, so Chinese characters in a URL do not end up garbled
'''
from urllib.parse import urlencode, quote

params = {
    "name": "wangxiaowang",
    "age": 100
}
base_url = "https://www.baidu.com?"
keys = "Wang Xiaowang"
# urlencode() serializes the dict; quote() URL-encodes the extra text
url = base_url + urlencode(params) + quote(keys)
print(url)
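To make the remaining helpers from the comment above concrete, here is a minimal sketch of urljoin(), parse_qs() and parse_qsl():

from urllib.parse import urljoin, parse_qs, parse_qsl

# urljoin() merges a base URL with a (possibly relative) link
print(urljoin("https://www.baidu.com", "FAQ.html"))   # https://www.baidu.com/FAQ.html

# parse_qs() turns a query string back into a dict, parse_qsl() into a list of tuples
query = "name=wangxiaowang&age=100"
print(parse_qs(query))    # {'name': ['wangxiaowang'], 'age': ['100']}
print(parse_qsl(query))   # [('name', 'wangxiaowang'), ('age', '100')]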


Robots protocol

It is used to tell crawlers which pages may be crawled and which may not; a site declares this in its robots.txt file.
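urllib also ships a robotparser module for exactly this purpose; a minimal sketch (the robots.txt address below is just an example):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.baidu.com/robots.txt")   # example robots.txt location
rp.read()
# can_fetch(user_agent, url) answers whether this crawler may fetch the given page
print(rp.can_fetch("*", "https://www.baidu.com/s?wd=python"))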

Well, that covers the use of the urllib library. See you in the next article!

Tags: Python css http

Posted by ReDucTor on Mon, 23 May 2022 11:16:21 +0300