Python web crawler (practice)


01 quickly crawling web pages

1.1 urlopen() function

 
import urllib.request
# Open the URL and get a response object
file=urllib.request.urlopen("http://www.baidu.com")
# Read the entire response
data=file.read()
# Save the page to a local HTML file
fhandle=open("./1.html","wb")
fhandle.write(data)
fhandle.close()

 

 

There are three common ways to read the retrieved content. Their usage is:
file.read() reads the entire content and assigns it to a single variable
file.readlines() reads the entire content and assigns it to a list, one line per element
file.readline() reads a single line of the content
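A minimal sketch of the three methods (each call below opens a fresh response, since a response object can only be read once):

import urllib.request

# read(): the whole response as one bytes object
data = urllib.request.urlopen("http://www.baidu.com").read()
print(len(data))

# readlines(): the whole response as a list of byte strings, one per line
lines = urllib.request.urlopen("http://www.baidu.com").readlines()
print(len(lines))

# readline(): only the first line of the response
first_line = urllib.request.urlopen("http://www.baidu.com").readline()
print(first_line)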

 

1.2 urlretrieve() function

The urlretrieve() function downloads the data at a URL and writes it directly to a local file.

import urllib.request
# Download the page straight to a local file; urlretrieve() returns a (filename, headers) tuple
filename=urllib.request.urlretrieve("http://edu.51cto.com",filename="./1.html")
# urlretrieve() may leave cached data behind during execution, which can be cleared with urlcleanup()
urllib.request.urlcleanup()

 



1.3 other common usages of urllib

 

import urllib.request
file=urllib.request.urlopen("http://www.baidu.com")
# Get the header information of the current response
print(file.info())
 
# Bdpagetype: 1
# Bdqid: 0xb36679e8000736c1
# Cache-Control: private
# Content-Type: text/html;charset=utf-8
# Date: Sun, 24 May 2020 10:53:30 GMT
# Expires: Sun, 24 May 2020 10:52:53 GMT
# P3p: CP=" OTI DSP COR IVA OUR IND COM "
# P3p: CP=" OTI DSP COR IVA OUR IND COM "
# Server: BWS/1.1
# Set-Cookie: BAIDUID=D5BBF02F4454CBA7D3962001F33E17C6:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
# Set-Cookie: BIDUPSID=D5BBF02F4454CBA7D3962001F33E17C6; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
# Set-Cookie: PSTM=1590317610; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
# Set-Cookie: BAIDUID=D5BBF02F4454CBA7FDDF8A87AF5416A6:FG=1; max-age=31536000; expires=Mon, 24-May-21 10:53:30 GMT; domain=.baidu.com; path=/; version=1; comment=bd
# Set-Cookie: BDSVRTM=0; path=/
# Set-Cookie: BD_HOME=1; path=/
# Set-Cookie: H_PS_PSSID=31729_1436_21118_31592_31673_31464_31322_30824; path=/; domain=.baidu.com
# Traceid: 1590317610038396263412927153817753433793
# Vary: Accept-Encoding
# Vary: Accept-Encoding
# X-Ua-Compatible: IE=Edge,chrome=1
# Connection: close
# Transfer-Encoding: chunked

# Get the status code of the current crawling web page
print(file.getcode())                     
# 200

# Get the current crawl URL address
print(file.geturl())                      
# 'http://www.baidu.com'

 

Generally speaking, the URL standard only allows a subset of ASCII characters, such as digits, letters and a few symbols; other characters, such as Chinese characters, do not conform to the URL standard. In such cases, URL encoding is needed to solve the problem.

import urllib.request
print(urllib.request.quote("http://www.baidu.com"))
# http%3A//www.baidu.com
print(urllib.request.unquote("http%3A//www.baidu.com"))
# http://www.baidu.com

 

 

 

 

02 browser simulation - Headers attribute

To prevent their data from being collected maliciously, some websites apply anti-crawler measures, so a plain crawl returns a 403 error.
We can set some Headers information to simulate a browser when accessing these websites.
There are two ways to let a crawler simulate browser access.

 

2.1 using build_opener() to modify headers

import urllib.request

url= "http://www.baidu.com"
# The header is given as a ("name", "value") tuple
headers=("User-Agent","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0")
# Build an opener and attach the header to it
opener = urllib.request.build_opener()
opener.addheaders = [headers]
data=opener.open(url).read()
fhandle=open("./2.html","wb")
fhandle.write(data)
fhandle.close()

 

 

2.2 using add_header() to add headers

 

import urllib.request

url= "http://www.baidu.com"
req=urllib.request.Request(url)
# Add the User-Agent header directly to the Request object
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0')
data=urllib.request.urlopen(req).read()
fhandle=open("./2.html","wb")
fhandle.write(data)
fhandle.close()

 

03 timeout setting

When visiting a web page, if the server does not respond for a long time, the request is judged to have timed out and the page cannot be opened. The timeout parameter of urlopen() controls how long to wait before giving up.

 

import urllib.request

# The timeout parameter sets the timeout in seconds
file = urllib.request.urlopen("http://yum.iqianyue.com", timeout=1)
data = file.read()
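When the timeout is exceeded, urlopen() raises an exception that can be caught. A minimal sketch (the 0.1-second timeout is deliberately short to make a timeout likely):

import socket
import urllib.error
import urllib.request

try:
    file = urllib.request.urlopen("http://yum.iqianyue.com", timeout=0.1)
    data = file.read()
    print(len(data))
except urllib.error.URLError as e:
    # A connection timeout is reported as URLError whose reason is a timeout
    print("request failed:", e.reason)
except socket.timeout:
    # A timeout while reading the body can raise socket.timeout directly
    print("request timed out while reading")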

 

04 proxy server

When using a proxy server to crawl a website, the site sees the proxy server's IP address rather than our real IP address. Even if the visible IP address gets blocked, we can simply switch to another proxy IP and continue crawling.

 

import urllib.request

def use_proxy(proxy_addr,url):
    # Route HTTP requests through the given proxy
    proxy= urllib.request.ProxyHandler({'http':proxy_addr})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    # Install the opener globally so urlopen() also uses the proxy
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen(url).read().decode('utf-8')
    return data

proxy_addr="xxx.xx.xxx.xx:xxxx"
data=use_proxy(proxy_addr,"http://www.baidu.com")
print(len(data))

 

If urllib.request.install_opener() is used to install a global opener object, the opener we installed will also be used when urlopen() is called.
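To make the contrast explicit, here is a minimal sketch (the User-Agent value is just an example): an opener that is only built applies its settings when used through opener.open(), while an installed opener is also picked up by urlopen().

import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [("User-Agent", "example-crawler/1.0")]

# Before install_opener(): the custom header is sent only via opener.open()
data1 = opener.open("http://www.baidu.com").read()

# After install_opener(): plain urlopen() goes through the same opener
urllib.request.install_opener(opener)
data2 = urllib.request.urlopen("http://www.baidu.com").read()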

05 Cookie

HTTP by itself is stateless: even if we log in to a website successfully, the login state is lost when we visit other pages of the site and we would have to log in again. Therefore the corresponding session information, such as the login state, needs to be saved in some way.
There are two common approaches:
1) Save session information through Cookies
2) Save session information through a Session
However, whichever method is used for session control, cookies are involved most of the time.
A common procedure for cookie handling is as follows:
1) Import the cookie-handling module http.cookiejar.
2) Use http.cookiejar.CookieJar() to create a CookieJar object.
3) Use HTTPCookieProcessor to create a cookie processor and build an opener object with it as a parameter.
4) Install the opener as the global default opener.

import urllib.request
import urllib.parse
import http.cookiejar
url = "http://xx.xx.xx/1.html"
postdata = urllib.parse.urlencode({
    "username":"xxxxxx",
    "password":"xxxxxx"
}).encode("utf-8")
req = urllib.request.Request(url,postdata)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0')
# Use http.cookiejar.CookieJar() to create a CookieJar object
cjar = http.cookiejar.CookieJar()
# Use HTTPCookieProcessor to create a cookie processor and build an opener with it as a parameter
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cjar))
# Install the opener as the global default opener
urllib.request.install_opener(opener)
file = opener.open(req)

data=file.read()
fhandle=open("./4.html","wb")
fhandle.write(data)
fhandle.close()

url1 = "http://xx.xx.xx/2.html"
data1= urllib.request.urlopen(url1).read()
fhandle1=open("./5.html","wb")
fhandle1.write(data1)
fhandle1.close()

 

06 DebugLog

Print the debug log while the program is running.

 

import urllib.request
# debuglevel=1 makes the handlers print the HTTP(S) request and response traffic
httphd=urllib.request.HTTPHandler(debuglevel=1)
httpshd=urllib.request.HTTPSHandler(debuglevel=1)
# Build an opener from the debugging handlers and install it globally
opener=urllib.request.build_opener(httphd,httpshd)
urllib.request.install_opener(opener)
data=urllib.request.urlopen("http://www.baidu.com")

 

07 exception handling - URLError

 

import urllib.request
import urllib.error
try:
    urllib.request.urlopen("http://blog.baidusss.net")
except urllib.error.HTTPError as e:
    print(e.code)
    print(e.reason)
except urllib.error.URLError as e:
    print(e.reason)

 

Or:

import urllib.request
import urllib.error
try:
    urllib.request.urlopen("http://blog.csdn.net")
except urllib.error.URLError as e:
    if hasattr(e,"code"):
        print(e.code)
    if hasattr(e,"reason"):
        print(e.reason)

 

08 HTTP protocol request practice

HTTP requests are mainly divided into six types, whose main functions are as follows:
1) GET request: passes information through the URL. The information can be written directly into the URL, or it can be submitted with a form.
If a form is used, the form data is automatically converted into URL parameters and passed through the URL address.
2) POST request: submits data to the server; it is a mainstream and relatively safer way to transmit data.
3) PUT request: asks the server to store a resource, usually at a specified location.
4) DELETE request: asks the server to delete a resource.
5) HEAD request: asks only for the corresponding HTTP header information.
6) OPTIONS request: obtains the request methods supported by the current URL.
In addition, there are TRACE and CONNECT requests; TRACE requests are mainly used for testing or diagnosis.
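urllib can issue the less common request types as well. A brief sketch (httpbin.org is a public test service used here purely as an example): the method parameter of urllib.request.Request selects the HTTP method.

import urllib.request

# HEAD request: fetch only the status line and headers of a resource
req = urllib.request.Request("http://httpbin.org/get", method="HEAD")
resp = urllib.request.urlopen(req)
print(resp.getcode())
print(resp.info())

# DELETE request: ask the server to delete a resource
req = urllib.request.Request("http://httpbin.org/delete", method="DELETE")
print(urllib.request.urlopen(req).getcode())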

8.1 GET request example

To use a GET request, the steps are as follows:
1) Build the URL, which contains the field names, field values and other information of the GET request.
GET request format: http://web address?field1=value1&field2=value2
2) Build a Request object with that URL as the parameter.
3) Open the Request object with urlopen().
4) Do any follow-up processing as required.



import urllib.request

url="http://www.baidu.com/s?wd="
key="Hello"
# URL-encode the keyword and append it to the query string
key_code=urllib.request.quote(key)
url_all=url+key_code
req=urllib.request.Request(url_all)
data=urllib.request.urlopen(req).read()
# Save the search result page locally
fh=open("./3.html","wb")
fh.write(data)
fh.close()

 

8.2 POST request example

To use a POST request, the steps are as follows:
1) Set the URL.
2) Build the form data and encode it with urllib.parse.urlencode().
3) Create a Request object whose parameters include the URL and the data to be passed.
4) Use add_header() to add header information so the request simulates a browser.
5) Use urllib.request.urlopen() to open the Request object and complete the transfer of the information.
6) Do any follow-up processing.

 

import urllib.request
import urllib.parse

url = "http://www.xxxx.com/post/"
# Encode the form fields as application/x-www-form-urlencoded bytes
postdata =urllib.parse.urlencode({
    "name":"xxx@xxx.com",
    "pass":"xxxxxxx"
}).encode('utf-8')
# Passing data to Request makes this a POST request
req = urllib.request.Request(url,postdata)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0')
data=urllib.request.urlopen(req).read()
fhandle=open("D:/Python35/myweb/part4/6.html","wb")
fhandle.write(data)
fhandle.close()

 

 
