urllib2 Library - Introduction to Python Crawler Foundation Series (12)



The urllib2 Library
urllib2 is a module that ships with Python 2.7 (no separate download needed) and supports a variety of network protocols, such as FTP, HTTP, and HTTPS.
In Python 3.x, urllib2 was renamed urllib.request.

Without further ado, let's start learning.


Learning goal

Use the urlopen function, the interface that urllib2 provides.

See the urllib2 official documentation for details.



urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT)

(1) The first parameter, url, is the URL to request; it is the only required parameter.

(2) The second parameter, data, is the data to send when accessing the URL; it defaults to None. When data is supplied, the request is sent as a POST.

(3) The third parameter, timeout, sets the timeout in seconds; it defaults to socket._GLOBAL_DEFAULT_TIMEOUT (the global socket default).
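Since urllib2 only exists in Python 2, here is a hedged sketch of the same call using Python 3's urllib.request (the renamed module mentioned above). The data: URL is just a stand-in so the example runs without a network connection; with a real http:// URL the call works the same way.

```python
from urllib.request import urlopen  # Python 3 name for urllib2's urlopen

# A data: URL embeds the response body directly, so no network is needed
response = urlopen('data:text/plain,hello', timeout=5)
body = response.read()
print(body)  # b'hello'
```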

GET request mode

Take fetching http://www.itcast.cn as an example:

import urllib2
response = urllib2.urlopen('http://www.itcast.cn/')
data = response.read()
print data
print response.code

Save it as demo.py, change into the file's directory, and run the following command to see the result:

python demo.py

Using the urllib2 Request class to add header information

The urllib2.Request class can be used to construct an HTTP request message with custom headers.


Tip: a regular expression can be used to convert raw copied headers into a dict.
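As a sketch of that tip (Python 3 syntax; the header text here is illustrative), each "Name: value" line copied from the browser's developer tools can be captured into a dict:

```python
import re

# Raw headers as copied from the browser's developer tools
raw_headers = """Host: www.itcast.cn
Connection: keep-alive
Accept-Language: zh-CN,zh;q=0.8"""

# Each line is "Name: value"; capture both sides into dict pairs
headers = dict(re.findall(r'^([\w-]+):\s*(.*)$', raw_headers, re.MULTILINE))
print(headers['Host'])  # www.itcast.cn
```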

# -*- coding: utf-8 -*-
import urllib2

get_headers = {
      'Host': 'www.itcast.cn',
      'Connection': 'keep-alive',
      'Pragma': 'no-cache',
      'Cache-Control': 'no-cache',
      'Upgrade-Insecure-Requests': '1',
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
      # Accept-Encoding is left commented out: a compressed body is hard to inspect and would need decompressing
      #'Accept-Encoding': 'gzip, deflate, sdch',
      'Accept-Language': 'zh-CN,zh;q=0.8',
      'Cookie': 'pgv_pvi=7044633600; tencentSig=6792114176; IESESSION=alive; pgv_si=s3489918976; CNZZDATA4617777=cnzz_eid%3D768417915-1468987955-%26ntime%3D1470191347; _qdda=3-1.1; _qddab=3-dyl6uh.ireawgo0; _qddamta_800068868=3-0'
}
request = urllib2.Request("http://www.itcast.cn/", headers=get_headers)
# Headers can also be added one at a time:
#request.add_header('User-Agent','Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36')
response = urllib2.urlopen(request)
print response.code
data = response.read()
print data

Q: Why do both ways of writing work?

A: One sends no headers at all, while the other fills them in carefully. Both work because this web server can parse the request either way and performs no header verification.

POST request mode

Example: capture recruitment information from Lagou.


# -*- coding: utf-8 -*-
import urllib2
import urllib

# An empty proxy address means no proxy is actually applied here
proxy_handler = urllib2.ProxyHandler({"http" : ''})
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)

Sum = 1
output = open('lagou.json', 'w')
for page in range(1, Sum+1):
      formdata = 'first=false&pn=' + str(page) + '&kd='
      print 'Run to (%2d) page' % (page)
      send_headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'X-Requested-With': 'XMLHttpRequest'
      }
      # Passing data makes this a POST request
      request = urllib2.Request('http://www.lagou.com/jobs/positionAjax.json?px=new&needAddtionalResult=false', data=formdata, headers=send_headers)
      print request.get_data()
      response = urllib2.urlopen(request)
      print response.code
      resHtml = response.read()
      output.write(resHtml)
      #print resHtml
output.close()
print '-'*4 + 'end' + '-'*4
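Assuming Python 3 (where urllib2.Request became urllib.request.Request and the body must be bytes), the POST mechanics above can be sketched without touching the network by inspecting the Request object:

```python
from urllib.request import Request
from urllib.parse import urlencode

# Build the same form body as the loop above; Python 3 requires bytes
formdata = urlencode({'first': 'false', 'pn': '1', 'kd': ''}).encode('utf-8')

req = Request('http://www.lagou.com/jobs/positionAjax.json?px=new&needAddtionalResult=false',
              data=formdata,
              headers={'X-Requested-With': 'XMLHttpRequest'})

# Supplying data switches the request method from GET to POST
print(req.get_method())  # POST
print(req.data)          # b'first=false&pn=1&kd='
```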

Think about it

  • Suppose you want to collect positions in Beijing >> Chaoyang District >> Wangjing, taking this website as an example: how should you interpret the URL?


An online urlencode encoding/decoding tool can be used to check the results.

# -*- coding: utf-8 -*-
import urllib2
import urllib

query = {
      'district': 'Chaoyang District',
}
print urllib.urlencode(query)

page = 3
values = {
      'pn': str(page),
      'kd': 'Back end development',
}
formdata = urllib.urlencode(values)
print formdata


Content-Length: the length of the request body, i.e. the content after the headers; for a POST it is the length of the urlencoded form data.
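To make this concrete, here is a small Python 3 sketch (field names taken from the Lagou form above): the Content-Length of a urlencoded POST is simply the byte length of the form body, not of the headers.

```python
from urllib.parse import urlencode

body = urlencode({'first': 'false', 'pn': '3', 'kd': ''})
print(body)                        # first=false&pn=3&kd=
print(len(body.encode('utf-8')))  # 20 -> the Content-Length value
```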

X-Requested-With: XMLHttpRequest indicates an Ajax (asynchronous) request.

Content-Type: application/x-www-form-urlencoded

indicates that the submitted form data is encoded as name/value pairs, for example: name1=value1&name2=value2.

Both names and values are URL-encoded (in utf-8 or gb2312).
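A quick Python 3 sketch of that encoding (the keyword here, "back-end development" in Chinese, is illustrative): non-ASCII characters in names and values are percent-encoded using the chosen charset, UTF-8 in this case.

```python
from urllib.parse import urlencode, unquote

# Each UTF-8 byte of the non-ASCII value becomes a %XX escape
encoded = urlencode({'kd': '后端开发'})
print(encoded)           # kd=%E5%90%8E%E7%AB%AF%E5%BC%80%E5%8F%91
print(unquote(encoded))  # kd=后端开发
```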

String lengths can be checked with an online tool.


That's all for the urllib2 library in the Python Crawler Foundation Series (12).

Tags: Python crawler Data Mining http

Posted by jagguy on Wed, 04 May 2022 11:44:28 +0300