The following article comes from Tencent Cloud. Author: python learning tutorial
1, Request library
1.requests
Requests is probably the most popular and practical library for crawlers today, and its API is very user-friendly. I wrote an earlier article about how to use it, "Let's take a look at the requests library of Python", which you can read for more detail.
For the most detailed usage of requests, see the official documentation: https://requests.readthedocs.io/en/master/
Use case:
>>> import requests
>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
'{"type":"User"...'
>>> r.json()
{'disk_usage': 368627, 'private_gists': 484, ...}
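Crawlers usually also pass query parameters, a custom User-Agent, and a timeout. The following is a minimal sketch that builds such a request without sending it, so the encoded URL can be inspected offline. The URL, parameter names, header value, and the helper `fetch` are illustrative assumptions, not part of the original article.

```python
import requests

# Hypothetical target and parameters, chosen only for illustration.
url = "https://httpbin.org/get"
params = {"q": "python", "page": 2}
headers = {"User-Agent": "my-crawler/0.1"}

# Build and prepare the request without sending it, to inspect what would go out.
prepared = requests.Request("GET", url, params=params, headers=headers).prepare()
print(prepared.url)


def fetch(session, prepared_request, timeout=10):
    """Send a prepared request; raise requests.HTTPError on a 4xx/5xx status."""
    response = session.send(prepared_request, timeout=timeout)
    response.raise_for_status()
    return response.text
```

To actually send the request, open a session and pass it in, for example `with requests.Session() as s: html = fetch(s, prepared)`.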
2.urllib3
urllib3 is a powerful HTTP client library. It provides connection pooling, retries, and a set of helper functions for working with URLs.
For detailed usage, see: https://urllib3.readthedocs.io/en/latest/
Use case:
>>> import urllib3
>>> http = urllib3.PoolManager()
>>> r = http.request('GET', 'http://httpbin.org/robots.txt')
>>> r.status
200
>>> r.data
b'User-agent: *\nDisallow: /deny\n'
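As a rough illustration of the URL helpers and of configuring a pool for crawling, the sketch below splits a URL into its parts and creates a `PoolManager` with retry and timeout defaults. The query string and the retry/timeout numbers are assumptions picked for the example; nothing here touches the network.

```python
from urllib3 import PoolManager
from urllib3.util import Retry, Timeout, parse_url

# Split a URL into its components (no network access needed).
parts = parse_url("http://httpbin.org/robots.txt?lang=en")
print(parts.scheme, parts.host, parts.path, parts.query)

# A pool manager with retry and timeout defaults; the numbers are illustrative.
http = PoolManager(
    retries=Retry(total=3, backoff_factor=0.5),
    timeout=Timeout(connect=2.0, read=5.0),
)
# http.request("GET", "http://httpbin.org/robots.txt") would now use these defaults.
```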
3.selenium
Selenium is an automated testing tool that drives a real browser. With this library you can control the browser directly to complete operations such as entering a verification code.
Selenium is not limited to Python: bindings are also available for Java, C#, and other languages.
Use case:
from selenium import webdriver

browser = webdriver.Firefox()
browser.get('http://seleniumhq.org/')
4.aiohttp
An HTTP framework built on asyncio. With the async/await keywords, requests are issued asynchronously, so the crawler can wait on many responses at once instead of one at a time, which can greatly improve crawling efficiency.
This is an asynchronous library that advanced crawler developers should master.
Use case:
import aiohttp
import asyncio


async def fetch(session, url):
    # Download one page and return its body as text
    async with session.get(url) as response:
        return await response.text()


async def main():
    # One ClientSession is shared by all requests made inside this block
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, 'http://python.org')
        print(html)


if __name__ == '__main__':
    asyncio.run(main())
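The efficiency gain comes from running many downloads concurrently. The sketch below shows the pattern with `asyncio.gather`, but it uses `asyncio.sleep` as a stand-in for a real aiohttp request so that it runs without network access. The function `fake_fetch`, the 0.2 second delay, and the URLs are all assumptions for illustration.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.2):
    """Stand-in for an aiohttp request: waits `delay` seconds, then returns a dummy body."""
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def crawl(urls):
    # Schedule all downloads at once and wait for them together.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


urls = [f"http://example.com/page/{i}" for i in range(5)]  # hypothetical URLs
start = time.perf_counter()
pages = asyncio.run(crawl(urls))
elapsed = time.perf_counter() - start
print(f"fetched {len(pages)} pages in {elapsed:.2f}s")
```

Five sequential waits would take about one second here, while the concurrent version takes roughly as long as a single wait. In a real crawler, `fake_fetch(u)` would be replaced by a coroutine such as `fetch(session, u)` from the use case above.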
2, Parsing library
1.beautifulsoup
Beautiful Soup parses HTML and XML and extracts information from web pages. It has a powerful API and supports several parsers. It is a parsing library I use often, and it is very easy to use for HTML parsing. For anyone who writes crawlers, this is also a library that must be mastered.
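A minimal sketch of typical Beautiful Soup usage, run against a small inline HTML string (an assumption made so the example works offline). It reads the page title and collects links by class, once with `find_all` and once with a CSS selector.

```python
from bs4 import BeautifulSoup

# A small inline document so the example runs without network access.
html = """
<html><head><title>Demo page</title></head>
<body>
  <a class="item" href="/a">First</a>
  <a class="item" href="/b">Second</a>
  <a href="/c">Other</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")  # stdlib parser; "lxml" also works if installed

title = soup.title.string
links = [a["href"] for a in soup.find_all("a", class_="item")]
texts = [a.get_text() for a in soup.select("a.item")]
print(title, links, texts)
```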
2.lxml
lxml supports HTML and XML parsing as well as XPath queries, and its parsing efficiency is very high.
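A short sketch of XPath extraction with lxml. The inline HTML and the `news` id are made up for the example.

```python
from lxml import html as lxml_html

# Inline document so the example runs offline.
page = """
<html><body>
  <ul id="news">
    <li><a href="/a">First</a></li>
    <li><a href="/b">Second</a></li>
  </ul>
</body></html>
"""

tree = lxml_html.fromstring(page)

# XPath can select attributes and text nodes directly.
hrefs = tree.xpath('//ul[@id="news"]/li/a/@href')
titles = tree.xpath('//ul[@id="news"]/li/a/text()')
print(hrefs, titles)
```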
3.pyquery
pyquery is a jQuery-like library for Python: it lets you parse and manipulate HTML documents with jQuery syntax, and it offers good ease of use and parsing speed.
3, Data storage library
1.pymysql
A MySQL client library implemented in pure Python. Very practical and very simple to use.
2.pymongo
As the name suggests, this library connects directly to a MongoDB database so you can run queries from Python.
3.redisdump
redis-dump is a tool for converting between Redis data and JSON. It is developed in Ruby and requires a Ruby environment. Newer versions of redis-dump require Ruby 2.2.2 or above, while yum on CentOS can only install Ruby 2.0, so you need to install the Ruby version manager rvm first and then use it to install a newer Ruby.