01 Quickly crawl a web page
1.1 urlopen() function
import urllib.request

file = urllib.request.urlopen("http://www.baidu.com")
data = file.read()
fhandle = open("./1.html", "wb")
fhandle.write(data)
fhandle.close()
There are three common ways to read the response content:
file.read() reads the entire content of the response and assigns it to a string (bytes) variable.
file.readlines() reads the entire content and assigns it to a list, one line per element.
file.readline() reads a single line of the response.
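A minimal sketch comparing the three methods (the page is opened again for each one, because a response object can only be read once):

import urllib.request

# read(): the whole response body as one bytes string
data = urllib.request.urlopen("http://www.baidu.com").read()

# readlines(): a list of bytes strings, one per line
lines = urllib.request.urlopen("http://www.baidu.com").readlines()

# readline(): one line per call
file = urllib.request.urlopen("http://www.baidu.com")
first_line = file.readline()
second_line = file.readline()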
1.2 urlretrieve() function
The urlretrieve() function writes the retrieved content directly to a local file.
import urllib.request

filename = urllib.request.urlretrieve("http://edu.51cto.com", filename="./1.html")
# urlretrieve() may leave some cached files behind during execution; urlcleanup() clears them
urllib.request.urlcleanup()
1.3 Other common uses of urllib
import urllib.request file=urllib.request.urlopen("http://www.baidu.com") # Get information about the current environment print(file.info()) # Bdpagetype: 1 # Bdqid: 0xb36679e8000736c1 # Cache-Control: private # Content-Type: text/html;charset=utf-8 # Date: Sun, 24 May 2020 10:53:30 GMT # Expires: Sun, 24 May 2020 10:52:53 GMT # P3p: CP=" OTI DSP COR IVA OUR IND COM " # P3p: CP=" OTI DSP COR IVA OUR IND COM " # Server: BWS/1.1 # Set-Cookie: BAIDUID=D5BBF02F4454CBA7D3962001F33E17C6:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com # Set-Cookie: BIDUPSID=D5BBF02F4454CBA7D3962001F33E17C6; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com # Set-Cookie: PSTM=1590317610; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com # Set-Cookie: BAIDUID=D5BBF02F4454CBA7FDDF8A87AF5416A6:FG=1; max-age=31536000; expires=Mon, 24-May-21 10:53:30 GMT; domain=.baidu.com; path=/; version=1; comment=bd # Set-Cookie: BDSVRTM=0; path=/ # Set-Cookie: BD_HOME=1; path=/ # Set-Cookie: H_PS_PSSID=31729_1436_21118_31592_31673_31464_31322_30824; path=/; domain=.baidu.com # Traceid: 1590317610038396263412927153817753433793 # Vary: Accept-Encoding # Vary: Accept-Encoding # X-Ua-Compatible: IE=Edge,chrome=1 # Connection: close # Transfer-Encoding: chunked # Get the status code of the current crawling web page print(file.getcode()) # 200 # Get the current crawl URL address print(file.geturl()) # 'http://www.baidu.com'
Generally speaking, the URL standard only allows a subset of ASCII characters, such as digits, letters and some symbols; other characters, such as Chinese characters, do not conform to the standard. In such cases the URL must be encoded.
import urllib.request

print(urllib.request.quote("http://www.baidu.com"))
# http%3A//www.baidu.com
print(urllib.request.unquote("http%3A//www.baidu.com"))
# http://www.baidu.com
02 Browser simulation: the Headers attribute
To prevent their information from being collected maliciously, some websites have anti-crawler settings, so a 403 error appears when we crawl them.
We can set some header information to simulate a browser when accessing these sites.
There are two ways to make a crawler simulate browser access.
2.1 Using build_opener() to modify headers
import urllib.request url= "http://www.baidu.com" headers=("User-Agent","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0") opener = urllib.request.build_opener() opener.addheaders = [headers] data=opener.open(url).read() fhandle=open("./2.html","wb") fhandle.write(data) fhandle.close()
2.2 Using add_header() to add headers
import urllib.request url= "http://www.baidu.com" req=urllib.request.Request(url) req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0') data=urllib.request.urlopen(req).read() fhandle=open("./2.html","wb") fhandle.write(data) fhandle.close()
03 Timeout setting
When visiting a web page, if it does not respond for a long time, the system judges that the request has timed out, that is, the page cannot be opened.
import urllib.request

# timeout sets the timeout in seconds
file = urllib.request.urlopen("http://yum.iqianyue.com", timeout=1)
data = file.read()
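If the server does not respond within the timeout, urlopen() raises an exception instead of returning. A minimal sketch of handling it (the URL and the 1-second limit are only illustrative):

import socket
import urllib.error
import urllib.request

try:
    file = urllib.request.urlopen("http://yum.iqianyue.com", timeout=1)
    data = file.read()
except socket.timeout:
    # a timeout while reading the body surfaces as socket.timeout
    print("request timed out")
except urllib.error.URLError as e:
    # a timeout while connecting is wrapped in URLError
    print("request failed:", e.reason)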
04 Proxy server
When we use a proxy server to crawl a website, the site sees the IP address of the proxy server rather than our real IP address. Even if the visible IP address gets blocked, it does not matter, because we can switch to another proxy IP and continue crawling.
import urllib.request

def use_proxy(proxy_addr, url):
    proxy = urllib.request.ProxyHandler({'http': proxy_addr})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen(url).read().decode('utf-8')
    return data

proxy_addr = "xxx.xx.xxx.xx:xxxx"
data = use_proxy(proxy_addr, "http://www.baidu.com")
print(len(data))
urllib.request.install_opener() installs a global opener object, so subsequent calls to urlopen() will also use the opener we installed.
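A minimal sketch of the difference (the proxy address is a placeholder):

import urllib.request

proxy = urllib.request.ProxyHandler({'http': "xxx.xx.xxx.xx:xxxx"})  # placeholder address
opener = urllib.request.build_opener(proxy)

# Without install_opener(), only requests made through this opener use the proxy
data = opener.open("http://www.baidu.com").read()

# After install_opener(), plain urlopen() calls go through the proxy as well
urllib.request.install_opener(opener)
data = urllib.request.urlopen("http://www.baidu.com").read()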
05 Cookie
Plain HTTP is stateless: even if we log in to a website successfully, the login state is lost when we visit other pages of the same site, and we would have to log in again. We therefore need some way to save the corresponding session information, such as the login state.
There are two common ways:
1) Save session information through cookies
2) Save Session information through Session
However, no matter which method is used for session control, cookies will be used most of the time.
A common procedure for Cookie processing is as follows:
1) Import the cookie-handling module http.cookiejar.
2) Use http.cookiejar.CookieJar() to create a CookieJar object.
3) Use HTTPCookieProcessor to create a cookie processor and build an opener object with it as a parameter.
4) Install the opener as the global default opener object.
import urllib.request
import urllib.parse
import http.cookiejar

url = "http://xx.xx.xx/1.html"
postdata = urllib.parse.urlencode({
    "username": "xxxxxx",
    "password": "xxxxxx"
}).encode("utf-8")
req = urllib.request.Request(url, postdata)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0')

# Use http.cookiejar.CookieJar() to create a CookieJar object
cjar = http.cookiejar.CookieJar()
# Use HTTPCookieProcessor to create a cookie processor and build an opener with it as a parameter
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cjar))
# Install the opener as the global default opener
urllib.request.install_opener(opener)

file = opener.open(req)
data = file.read()
fhandle = open("./4.html", "wb")
fhandle.write(data)
fhandle.close()

# Because the cookies were saved above, the session state carries over to this request
url1 = "http://xx.xx.xx/2.html"
data1 = urllib.request.urlopen(url1).read()
fhandle1 = open("./5.html", "wb")
fhandle1.write(data1)
fhandle1.close()
06 DebugLog
Print the debug log while the program is executing.
import urllib.request

httphd = urllib.request.HTTPHandler(debuglevel=1)
httpshd = urllib.request.HTTPSHandler(debuglevel=1)
opener = urllib.request.build_opener(httphd, httpshd)
urllib.request.install_opener(opener)
data = urllib.request.urlopen("http://www.baidu.com")
07 Exception handling: URLError
import urllib.request
import urllib.error

try:
    urllib.request.urlopen("http://blog.baidusss.net")
except urllib.error.HTTPError as e:
    print(e.code)
    print(e.reason)
except urllib.error.URLError as e:
    print(e.reason)
Or:
import urllib.request
import urllib.error

try:
    urllib.request.urlopen("http://blog.csdn.net")
except urllib.error.URLError as e:
    if hasattr(e, "code"):
        print(e.code)
    if hasattr(e, "reason"):
        print(e.reason)
08 HTTP protocol requests in practice
HTTP protocol requests are mainly divided into six types. The main functions of each type are as follows:
1) GET request: a GET request transfers information through the URL. The information to be transferred can be written directly into the URL, or it can be submitted via a form.
If a form is used, the form data is automatically converted into data in the URL address and delivered through the URL.
2) POST request: submits data to the server; this is a mainstream and relatively safer way of transmitting data.
3) PUT request: requests that the server store a resource, usually at a specified location.
4) DELETE request: requests that the server delete a resource.
5) HEAD request: requests only the corresponding HTTP header information.
6) OPTIONS request: obtains the request types supported by the current URL.
In addition, there are TRACE requests and CONNECT requests. TRACE requests are mainly used for testing or diagnosis.
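urlopen() sends a GET request by default and a POST request when data is supplied; the other methods can be chosen explicitly through the method parameter of Request. A minimal sketch, using httpbin.org purely as an assumed test endpoint:

import urllib.request

url = "http://httpbin.org/anything"  # assumed test endpoint; any server accepting these methods works

# HEAD: fetch only the headers
req = urllib.request.Request(url, method="HEAD")
print(urllib.request.urlopen(req).getcode())

# PUT: ask the server to store the given data
req = urllib.request.Request(url, data=b"some data", method="PUT")
print(urllib.request.urlopen(req).getcode())

# DELETE: ask the server to delete the resource
req = urllib.request.Request(url, method="DELETE")
print(urllib.request.urlopen(req).getcode())

# OPTIONS: ask which request types the URL supports
req = urllib.request.Request(url, method="OPTIONS")
print(urllib.request.urlopen(req).headers.get("Allow"))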
8.1 GET request example
To use a GET request, the steps are as follows:
1) Build the corresponding URL address, which contains the field name, field content and other information of the GET request.
GET request format: http://<web address>?field1=value1&field2=value2
2) Take the corresponding URL as the parameter to build the Request object.
3) Open the built Request object through urlopen().
4) Follow up processing as required.
import urllib.request url="http://www.baidu.com/s?wd=" key="Hello" key_code=urllib.request.quote(key) url_all=url+key_code req=urllib.request.Request(url_all) data=urllib.request.urlopen(req).read() fh=open("./3.html","wb") fh.write(data) fh.close()
8.2 POST request example
To use a POST request, the steps are as follows:
1) Set the URL.
2) Build the form data and encode it with urllib.parse.urlencode().
3) Create a Request object with parameters including URL address and data to be delivered.
4) Use add_header() to add header information and simulate a browser.
5) Use urllib.request.urlopen() to open the Request object and complete the transfer of information.
6) Follow-up processing as required.
import urllib.request
import urllib.parse

url = "http://www.xxxx.com/post/"
postdata = urllib.parse.urlencode({
    "name": "xxx@xxx.com",
    "pass": "xxxxxxx"
}).encode('utf-8')
req = urllib.request.Request(url, postdata)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0')
data = urllib.request.urlopen(req).read()
fhandle = open("D:/Python35/myweb/part4/6.html", "wb")
fhandle.write(data)
fhandle.close()