urllib2 Library - Introduction to Python Crawler Foundation Series (12)


Preface

urllib2 Library
urllib2 is a module that ships with Python 2.7 (no separate installation needed). It supports a variety of network protocols, such as FTP, HTTP, and HTTPS.
In Python 3.x, urllib2 was merged into urllib.request.
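For reference, a minimal sketch of the Python 3.x equivalent (everything else in this article uses Python 2.7):

# Python 3.x only: urlopen and Request now live in urllib.request
from urllib.request import urlopen, Request

response = urlopen('http://www.itcast.cn/')
print(response.getcode())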

Enough talk; let's get started.

urllib2 Library

Learning goal

Learn to use urllib2 through the urlopen function, the interface it provides for opening URLs.

urllib2 official documentation

https://docs.python.org/2/library/urllib2.html

urlopen

urlopen(url, data, timeout, ...)

(1) The first parameter, url, is the address to request; it is the only required argument.

(2) The second parameter, data, is the data to send when accessing the URL. It defaults to None, which yields a GET request; supplying data turns the request into a POST.

(3) The third parameter, timeout, sets the timeout in seconds. It defaults to socket._GLOBAL_DEFAULT_TIMEOUT (the global socket default), not a fixed value.
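As a quick illustration, a minimal sketch passing an explicit timeout (URL as used throughout this article):

# -*- coding: utf-8 -*-
import urllib2

# data is omitted (None), so this is a GET; give up after 10 seconds
response = urllib2.urlopen('http://www.itcast.cn/', timeout=10)
print response.code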

GET request mode

Take grabbing http://www.itcast.cn as an example:

import urllib2

# urlopen returns a file-like response object
response = urllib2.urlopen('http://www.itcast.cn/')
# read() returns the whole response body as a string
data = response.read()
print data
# code holds the HTTP status code, e.g. 200
print response.code

Save the script as demo.py, change into its directory, and run the following command to see the result:

python demo.py

Using the urllib2.Request class to add header information

The urllib2.Request class can be used to construct an HTTP request message:

help(urllib2.Request)

Regex trick: turn raw headers (as copied from the browser's developer tools) into dict entries with an editor find/replace:

^(.*):\s(.*)$
"\1":"\2",
# -*- coding: utf-8 -*-
import urllib2
get_headers={
      'Host': 'www.itcast.cn',
      'Connection': 'keep-alive',
      'Pragma': 'no-cache',
      'Cache-Control': 'no-cache',
      'Upgrade-Insecure-Requests': '1',
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
      # Accept-Encoding is left commented out: a gzip-compressed body
      # would have to be decompressed before it could be read as text
      #'Accept-Encoding': 'gzip, deflate, sdch',
      'Accept-Language': 'zh-CN,zh;q=0.8',
      'Cookie': 'pgv_pvi=7044633600; tencentSig=6792114176; IESESSION=alive; pgv_si=s3489918976; CNZZDATA4617777=cnzz_eid%3D768417915-1468987955-%26ntime%3D1470191347; _qdda=3-1.1; _qddab=3-dyl6uh.ireawgo0; _qddamta_800068868=3-0'
 }
request = urllib2.Request("http://www.itcast.cn/", headers=get_headers)
# Alternative: set a single header on the request object
#request.add_header('User-Agent','Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36')
response = urllib2.urlopen(request)
print response.code
data = response.read()
print data

Q: Why do both writings work?

A: One sends no headers at all, the other sends a full set; both succeed because the web server can understand the request either way and has no header-verification mechanism.
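For comparison, a minimal sketch of the two writings side by side (same URL as above; the User-Agent value is shortened here for illustration):

# -*- coding: utf-8 -*-
import urllib2

# Writing 1: bare URL, no custom headers sent
response1 = urllib2.urlopen('http://www.itcast.cn/')

# Writing 2: a Request object carrying headers
request = urllib2.Request('http://www.itcast.cn/',
                          headers={'User-Agent': 'Mozilla/5.0'})
response2 = urllib2.urlopen(request)

# Both return 200 because the server does not verify headers
print response1.code, response2.code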

POST request mode

Take capturing recruitment listings from lagou.com as an example:

http://www.lagou.com/jobs/list_?px=new&city=%E5%85%A8%E5%9B%BD#order

# -*- coding: utf-8 -*-
import urllib2
import urllib

# Route traffic through a local HTTP debugging proxy so requests can be inspected
proxy_handler = urllib2.ProxyHandler({"http" : 'http://192.168.17.1:8888'})
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)

Sum = 1    # number of pages to fetch
output = open('lagou.json', 'w')
for page in range(1,Sum+1):
      # POST body: page number pn, empty search keyword kd
      formdata = 'first=false&pn='+str(page)+'&kd='
      print 'Fetching page (%2d)' % (page)
      send_headers={
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'X-Requested-With': 'XMLHttpRequest'
      }
      request = urllib2.Request('http://www.lagou.com/jobs/positionAjax.json?px=new&needAddtionalResult=false',headers=send_headers)
      #request.add_header('X-Requested-With','XMLHttpRequest')
      #request.headers=send_headers
      # Attaching a body turns the request into a POST
      request.add_data(formdata)
      print request.get_data()
      response = urllib2.urlopen(request)
      print response.code
      resHtml = response.read()
      #print resHtml
      output.write(resHtml+'\n')
output.close()
print '-'*4 + 'end' + '-'*4

Think about it

  • If you want to collect positions in Beijing > > Chaoyang District > > Wangjing, taking this website as an example, how should this URL be understood? (See the decoding sketch below.)

http://www.lagou.com/jobs/list_?px=default&city=%E5%8C%97%E4%BA%AC&district=%E6%9C%9D%E9%98%B3%E5%8C%BA&bizArea=%E6%9C%9B%E4%BA%AC#filterBox
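A minimal sketch that decodes the percent-encoded query parameters with urllib.unquote, which answers the question (the encoded values are taken from the URL above):

# -*- coding: utf-8 -*-
import urllib

# Each percent-encoded value decodes to a UTF-8 Chinese string
print urllib.unquote('%E5%8C%97%E4%BA%AC')           # city: Beijing
print urllib.unquote('%E6%9C%9D%E9%98%B3%E5%8C%BA')  # district: Chaoyang District
print urllib.unquote('%E6%9C%9B%E4%BA%AC')           # bizArea: Wangjing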

An online urlencode encoding/decoding tool can also be used; the same encoding can be produced in code:

# -*- coding: utf-8 -*-
import urllib

# urlencode turns a dict into a 'name1=value1&name2=value2' query string,
# percent-encoding every name and value
query = {
      'city':'Beijing',
      'district':'Chaoyang District',
      'bizArea':'Wangjing'
}
print urllib.urlencode(query)
page = 3
values = {
      'first':'false',
      'pn':str(page),
      'kd':'Back end development',
}
formdata = urllib.urlencode(values)
print formdata

Summary

Content-Length: the length of the request body, i.e. the form data, not including the headers.

X-Requested-With: XMLHttpRequest indicates an Ajax asynchronous request.

Content-Type: application/x-www-form-urlencoded

Indicates that the submitted form data is encoded as name/value pairs, for example:

name1=value1&name2=value2...

Both names and values are URL-encoded (utf-8, gb2312).
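To tie these together, a minimal sketch (reusing the form fields from the POST example above) showing that Content-Length is simply the length of the urlencoded body:

# -*- coding: utf-8 -*-
import urllib

# Build the body the same way as in the POST example
formdata = urllib.urlencode({'first': 'false', 'pn': '3', 'kd': ''})
print formdata        # e.g. first=false&pn=3&kd= (dict order may vary)
print len(formdata)   # this is the value Content-Length would carry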

An online string-length tester can also be used to check this.

Wrap-up

That's all for the urllib2 library, installment (12) of the Python crawler basics introduction series. If you're learning to crawl, keep following along!


Tags: Python crawler Data Mining http

Posted by jagguy on Wed, 04 May 2022 11:44:28 +0300