(Practical guide) If you don't know these libraries, can you really say you know Python crawlers?

The following article comes from Tencent Cloud. Author: python learning tutorial

1, Request libraries

1.requests

Requests is probably the most popular and practical library for crawlers today, and its API is very user-friendly. I previously wrote an article on its usage, "Let's take a look at Python's requests library", which you can refer to.

For the most detailed usage of requests, see the official documentation: https://requests.readthedocs.io/en/master/

Use case:

>>> import requests
>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
'{"type":"User"...'
>>> r.json()
{'disk_usage': 368627, 'private_gists': 484, ...}

 

2.urllib3

urllib3 is a powerful HTTP client library that provides a set of functions for making requests to URLs.

For detailed usage, see: https://urllib3.readthedocs.io/en/latest/

Use case:

>>> import urllib3
>>> http = urllib3.PoolManager()
>>> r = http.request('GET', 'http://httpbin.org/robots.txt')
>>> r.status
200
>>> r.data
b'User-agent: *\nDisallow: /deny\n'

 

3.selenium

Selenium is an automated testing tool that drives a real browser. Through this library you can control the browser directly to complete operations such as entering a verification code.

Selenium is not limited to Python; it can also be used from Java, C#, and other languages.

Use case:

from selenium import webdriver

# Open a Firefox window controlled by Selenium
browser = webdriver.Firefox()
browser.get('http://seleniumhq.org/')
# Close the browser when finished
browser.quit()

 

4.aiohttp

An HTTP framework built on asyncio. With the async/await keywords, a crawler can fetch data asynchronously instead of waiting for each request in turn, which can greatly improve efficiency.

This is an asynchronous library that advanced crawler developers should master.

Use case:

import aiohttp
import asyncio

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, 'http://python.org')
        print(html)

if __name__ == '__main__':
    # asyncio.run() creates the event loop, runs main(), and closes the loop
    asyncio.run(main())
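
To illustrate why asynchronous fetching helps, here is a minimal sketch that uses only the standard library. The names `fake_fetch` and `fetch_all` are hypothetical, and `asyncio.sleep` stands in for network latency so the example runs offline; in a real crawler the body of `fake_fetch` would be an aiohttp request like the one above.

```python
import asyncio
import time

async def fake_fetch(url, delay=0.2):
    # Stand-in for an aiohttp request: sleeps instead of touching the network.
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"

async def fetch_all(urls):
    # gather() starts every coroutine before waiting, so the delays overlap.
    return await asyncio.gather(*(fake_fetch(url) for url in urls))

if __name__ == "__main__":
    urls = ["page1", "page2", "page3"]
    start = time.perf_counter()
    pages = asyncio.run(fetch_all(urls))
    elapsed = time.perf_counter() - start
    # The three simulated requests overlap instead of running back to back.
    print(len(pages), f"pages in {elapsed:.2f}s")
```

Because the waits overlap, the total time stays close to the slowest single request rather than the sum of all of them.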

 

2, Parsing libraries

1.beautifulsoup

Parses HTML and XML and extracts information from web pages, with a powerful API and support for several parsers. It is the parsing library I use most often and is very convenient for HTML. Anyone who writes crawlers should master it.
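
A minimal sketch of extracting data with Beautiful Soup might look like the following. The HTML snippet, the `lib` class name, and the `extract_links` helper are made up for illustration; it uses the `html.parser` backend that ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for illustration.
HTML = """
<html><body>
  <h1 id="title">Python crawler libraries</h1>
  <ul>
    <li><a class="lib" href="https://example.com/requests">requests</a></li>
    <li><a class="lib" href="https://example.com/lxml">lxml</a></li>
  </ul>
</body></html>
"""

def extract_links(html):
    """Return the page heading and a list of (text, href) pairs."""
    # "html.parser" is bundled with Python, so no extra parser is needed.
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h1", id="title").get_text(strip=True)
    links = [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", class_="lib")
    ]
    return heading, links

heading, links = extract_links(HTML)
print(heading)
print(links)
```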

2.lxml

Supports HTML and XML parsing as well as XPath queries, and its parsing efficiency is very high.
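
As a rough sketch of XPath parsing with lxml, the example below pulls link text and `href` attributes out of a small document. The HTML and the `lib` class name are hypothetical.

```python
from lxml import etree

# Hypothetical HTML snippet used only for illustration.
HTML = """
<html><body>
  <a class="lib" href="https://example.com/requests">requests</a>
  <a class="lib" href="https://example.com/lxml">lxml</a>
  <a class="other" href="https://example.com/skip">skip</a>
</body></html>
"""

# etree.HTML() parses the string with lxml's HTML parser and returns the root element.
tree = etree.HTML(HTML)

# XPath: text and href of every <a> whose class attribute is exactly "lib".
names = tree.xpath('//a[@class="lib"]/text()')
hrefs = tree.xpath('//a[@class="lib"]/@href')

print(names)
print(hrefs)
```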

3.pyquery

A jQuery-style library for Python: it lets you query and manipulate HTML documents with jQuery syntax, and it is easy to use and parses quickly.

3, Data storage libraries

1.pymysql

A MySQL client library implemented in pure Python. It is practical and simple to use.

2.pymongo

As the name suggests, this library connects directly to a MongoDB database to run queries and other operations.

3.redis-dump

redis-dump is a tool for converting between Redis data and JSON. It is developed in Ruby and requires a Ruby environment. Newer versions of redis-dump require a Ruby version above 2.2.2, but yum on CentOS can only install Ruby 2.0, so you need to install the Ruby version manager rvm first and use it to install a newer Ruby.

Tags: Python crawler

Posted by gojiita on Thu, 05 May 2022 22:45:19 +0300