urllib2 Library - Introduction to Python Crawler Foundation Series (12)



The urllib2 Library
urllib2 is a module that ships with Python 2.7 (no separate download needed) and supports a variety of network protocols, such as FTP, HTTP, and HTTPS.
In Python 3.x, urllib2 was renamed urllib.request.

Without further ado, let's start learning.


Learning goal

Use the urlopen function, the interface that urllib2 provides.

See the urllib2 official documentation for details.



urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT)

(1) The first parameter, url, is the URL to request; it is the only required parameter.

(2) The second parameter, data, is the data to send when accessing the URL; it defaults to None. When data is supplied, the request is sent as a POST.

(3) The third parameter, timeout, sets the timeout in seconds; it defaults to socket._GLOBAL_DEFAULT_TIMEOUT (the global socket default).
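Since urllib2 only exists in Python 2, here is a hedged sketch of the same call using Python 3's urllib.request (the renamed module mentioned above). The data: URL is just a stand-in so the example runs without a network connection; with a real http:// URL the call works the same way.

```python
from urllib.request import urlopen  # Python 3 name for urllib2's urlopen

# A data: URL embeds the response body directly, so no network is needed
response = urlopen('data:text/plain,hello', timeout=5)
body = response.read()
print(body)  # b'hello'
```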

GET request mode

Take fetching http://www.itcast.cn as an example:

import urllib2
response = urllib2.urlopen('http://www.itcast.cn/')
data = response.read()
print data
print response.code

Save it as demo.py, change into the file's directory, and run the following command to see the result:

python demo.py

Using the urllib2 Request class to add header information

The urllib2.Request class can be used to construct an HTTP request message with custom headers.


Tip: a regular expression can be used to convert raw copied headers into a dict.
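As a sketch of that tip (Python 3 syntax; the header text here is illustrative), each "Name: value" line copied from the browser's developer tools can be captured into a dict:

```python
import re

# Raw headers as copied from the browser's developer tools
raw_headers = """Host: www.itcast.cn
Connection: keep-alive
Accept-Language: zh-CN,zh;q=0.8"""

# Each line is "Name: value"; capture both sides into dict pairs
headers = dict(re.findall(r'^([\w-]+):\s*(.*)$', raw_headers, re.MULTILINE))
print(headers['Host'])  # www.itcast.cn
```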

# -*- coding: utf-8 -*-
import urllib2

get_headers = {
      'Host': 'www.itcast.cn',
      'Connection': 'keep-alive',
      'Pragma': 'no-cache',
      'Cache-Control': 'no-cache',
      'Upgrade-Insecure-Requests': '1',
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
      # Accept-Encoding is left commented out: a compressed body is hard to inspect and would need decompressing
      #'Accept-Encoding': 'gzip, deflate, sdch',
      'Accept-Language': 'zh-CN,zh;q=0.8',
      'Cookie': 'pgv_pvi=7044633600; tencentSig=6792114176; IESESSION=alive; pgv_si=s3489918976; CNZZDATA4617777=cnzz_eid%3D768417915-1468987955-%26ntime%3D1470191347; _qdda=3-1.1; _qddab=3-dyl6uh.ireawgo0; _qddamta_800068868=3-0'
}
request = urllib2.Request("http://www.itcast.cn/", headers=get_headers)
# Headers can also be added one at a time:
#request.add_header('User-Agent','Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36')
response = urllib2.urlopen(request)
print response.code
data = response.read()
print data

Q: Why do both ways of writing work?

A: One sends no headers at all, while the other fills them in carefully. Both work because this web server can parse the request either way and performs no header verification.

POST request mode

Example: capture recruitment information from Lagou.


# -*- coding: utf-8 -*-
import urllib2
import urllib

# An empty proxy address means no proxy is actually applied here
proxy_handler = urllib2.ProxyHandler({"http" : ''})
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)

Sum = 1
output = open('lagou.json', 'w')
for page in range(1, Sum+1):
      formdata = 'first=false&pn=' + str(page) + '&kd='
      print 'Run to (%2d) page' % (page)
      send_headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'X-Requested-With': 'XMLHttpRequest'
      }
      # Passing data makes this a POST request
      request = urllib2.Request('http://www.lagou.com/jobs/positionAjax.json?px=new&needAddtionalResult=false', data=formdata, headers=send_headers)
      print request.get_data()
      response = urllib2.urlopen(request)
      print response.code
      resHtml = response.read()
      output.write(resHtml)
      #print resHtml
output.close()
print '-'*4 + 'end' + '-'*4
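Assuming Python 3 (where urllib2.Request became urllib.request.Request and the body must be bytes), the POST mechanics above can be sketched without touching the network by inspecting the Request object:

```python
from urllib.request import Request
from urllib.parse import urlencode

# Build the same form body as the loop above; Python 3 requires bytes
formdata = urlencode({'first': 'false', 'pn': '1', 'kd': ''}).encode('utf-8')

req = Request('http://www.lagou.com/jobs/positionAjax.json?px=new&needAddtionalResult=false',
              data=formdata,
              headers={'X-Requested-With': 'XMLHttpRequest'})

# Supplying data switches the request method from GET to POST
print(req.get_method())  # POST
print(req.data)          # b'first=false&pn=1&kd='
```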

Think about it

  • Suppose you want to collect positions in Beijing >> Chaoyang District >> Wangjing, taking this website as an example: how should you interpret the URL?


An online urlencode encoding/decoding tool can be used to check the results.

# -*- coding: utf-8 -*-
import urllib2
import urllib

query = {
      'district': 'Chaoyang District',
}
print urllib.urlencode(query)

page = 3
values = {
      'pn': str(page),
      'kd': 'Back end development',
}
formdata = urllib.urlencode(values)
print formdata


Content-Length: the length of the request body, i.e. the content after the headers; for a POST it is the length of the urlencoded form data.
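To make this concrete, here is a small Python 3 sketch (field names taken from the Lagou form above): the Content-Length of a urlencoded POST is simply the byte length of the form body, not of the headers.

```python
from urllib.parse import urlencode

body = urlencode({'first': 'false', 'pn': '3', 'kd': ''})
print(body)                        # first=false&pn=3&kd=
print(len(body.encode('utf-8')))  # 20 -> the Content-Length value
```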

X-Requested-With: XMLHttpRequest indicates an Ajax (asynchronous) request.

Content-Type: application/x-www-form-urlencoded

indicates that the submitted form data is encoded as name/value pairs, for example: name1=value1&name2=value2.

Both names and values are URL-encoded (in utf-8 or gb2312).
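A quick Python 3 sketch of that encoding (the keyword here, "back-end development" in Chinese, is illustrative): non-ASCII characters in names and values are percent-encoded using the chosen charset, UTF-8 in this case.

```python
from urllib.parse import urlencode, unquote

# Each UTF-8 byte of the non-ASCII value becomes a %XX escape
encoded = urlencode({'kd': '后端开发'})
print(encoded)           # kd=%E5%90%8E%E7%AB%AF%E5%BC%80%E5%8F%91
print(unquote(encoded))  # kd=后端开发
```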

String lengths can be checked with an online tool.


That's all for the urllib2 library in the Python Crawler Foundation Series (12).

Tags: Python crawler Data Mining http

Posted by jagguy on Wed, 04 May 2022 11:44:28 +0300