Big data acquisition case study: a Python web crawler example

Web crawler:

A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Less common names include ant, automatic indexer, emulator, and worm.

The definition above is Baidu's description of a web crawler. Now let's use Python to write a crawler that collects data.

The crawler collects real-time COVID-19 data. The tool used is PyCharm. Create a new Python file named get_data, and use requests, the module most commonly used by crawlers.

Part I:

Get web page information:

import requests
url = "https://voice.baidu.com/act/newpneumonia/newpneumonia"
response = requests.get(url)
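
A bare requests.get call has no timeout by default, so it can wait indefinitely if the server stalls, and it does not complain when the server answers with an HTTP error page. The sketch below is one possible, more defensive variant. The function name fetch_page, the 10-second timeout, and the User-Agent value are illustrative assumptions, not part of the original script.

```python
import requests


def fetch_page(url, timeout=10, session=None):
    """Return the body of `url` as text, raising on HTTP errors.

    `session` is a hypothetical hook: anything with a requests-style
    `get` method (for example a requests.Session) can be passed in.
    """
    http = session if session is not None else requests
    response = http.get(
        url,
        headers={"User-Agent": "Mozilla/5.0"},  # assumed value, not from the original script
        timeout=timeout,                        # give up instead of waiting forever
    )
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text
```

With this helper, the page text could be obtained with fetch_page(url) and passed to the parsing step in place of response.text.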

Part II:

Looking at the page source, the data is contained in a script tag, so XPath is used to extract it. Import the parser with from lxml import etree, generate an HTML object from the response text, and query it. The XPath query returns a list, and its first item holds all of the content. Next, get the contents of component: use the json module to convert the string into a dictionary (a Python data structure). The domestic data is in caseList, inside the first element of component.

Next, add the following code:

from lxml import etree
import json
# Generate HTML objects
html = etree.HTML(response.text)
result = html.xpath('//script[@type="application/json"]/text()')
result = result[0]
# json.loads() converts a JSON string to a Python data type
result = json.loads(result)
result_in = result['component'][0]['caseList'] 
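
The chain of result[0], json.loads, and ['component'][0]['caseList'] fails with a bare IndexError or KeyError if the page layout ever changes. As a rough sketch, the same steps can be wrapped in a helper that says what was missing. The helper name extract_component and the sample string are made up for illustration; the sample only imitates the nesting used above and contains no real figures.

```python
import json


def extract_component(script_texts):
    """Parse the first script text and return the first `component` entry.

    `script_texts` is the list of strings returned by the XPath query.
    """
    if not script_texts:
        raise ValueError('no <script type="application/json"> block found')
    payload = json.loads(script_texts[0])  # JSON string -> Python dict
    components = payload.get("component") or []
    if not components:
        raise ValueError("JSON payload has no 'component' entries")
    return components[0]


# Hand-written sample that imitates the nesting of the real payload.
sample = ['{"component": [{"caseList": [{"area": "A", "confirmed": "12"}], "globalList": []}]}']
component = extract_component(sample)
result_in = component["caseList"]
print(result_in[0]["area"], result_in[0]["confirmed"])  # prints: A 12
```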

Part III:

Store the domestic data in an Excel sheet using the openpyxl module (import openpyxl). First create a workbook and get a worksheet in it, then name the worksheet and add a header row.

The code is as follows:

import openpyxl
# Create a workbook
wb = openpyxl.Workbook()
# Get the default worksheet that a new workbook starts with
ws = wb.active
ws.title = "Domestic epidemic"
ws.append(['province', 'Cumulative diagnosis', 'death', 'cure', 'Existing diagnosis', 'Cumulative diagnosis increment', 'Increment of death', 'Cure increment', 'Existing diagnosis increment'])
'''
area --> province (mostly)
city --> city
confirmed --> cumulative confirmed cases
crued --> cured (the data spells this key 'crued')
relativeTime -->
confirmedRelative --> increment in cumulative confirmed cases
curedRelative --> increment in cured
curConfirm --> currently confirmed cases
curConfirmRelative --> increment in currently confirmed cases
'''
for each in result_in:
    temp_list = [each['area'], each['confirmed'], each['died'], each['crued'], each['curConfirm'],
                 each['confirmedRelative'], each['diedRelative'], each['curedRelative'],
                 each['curConfirmRelative']]
    for i in range(len(temp_list)):
        if temp_list[i] == '':
            temp_list[i] = '0'
    ws.append(temp_list)
wb.save('./data.xlsx')
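
The inner loop above replaces empty strings with '0' one index at a time. The same clean-up can be sketched as a small helper built on a list comprehension. The names DOMESTIC_KEYS and build_row are hypothetical, and treating a missing key like an empty value is an added assumption (the original code would raise KeyError instead).

```python
DOMESTIC_KEYS = [
    "area", "confirmed", "died", "crued", "curConfirm",
    "confirmedRelative", "diedRelative", "curedRelative", "curConfirmRelative",
]


def build_row(record, keys):
    """Return the values of `keys` from `record`, using '0' for empty ones."""
    # record.get(key, "") treats a missing key the same as an empty string;
    # `or "0"` then swaps any empty value for the string '0'.
    return [record.get(key, "") or "0" for key in keys]


# Made-up record for illustration only.
sample_record = {"area": "A", "confirmed": "12", "died": "", "crued": "10"}
print(build_row(sample_record, DOMESTIC_KEYS))
# prints: ['A', '12', '0', '10', '0', '0', '0', '0', '0']
```

Inside the loop this would be used as ws.append(build_row(each, DOMESTIC_KEYS)).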

Part IV:

Store the foreign data in Excel: the foreign data is in globalList inside component. Create one sheet per continent in the same workbook.

The code is as follows:

data_out = result['component'][0]['globalList']
for each in data_out:
    sheet_title = each['area']
    # Create a new worksheet
    ws_out = wb.create_sheet(sheet_title)
    ws_out.append(['country', 'Cumulative diagnosis', 'death', 'cure', 'Existing diagnosis', 'Cumulative diagnosis increment'])
    for country in each['subList']:
        list_temp = [country['country'], country['confirmed'], country['died'], country['crued'],
                     country['curConfirm'], country['confirmedRelative']]
        for i in range(len(list_temp)):
            if list_temp[i] == '':
                list_temp[i] = '0'
        ws_out.append(list_temp)
wb.save('./data.xlsx')
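
Each each['area'] value is used directly as a worksheet name. Excel limits worksheet names to 31 characters and does not allow the characters \ / ? * [ ] : in them. The continent names here are unlikely to run into either rule, but as a precaution the title could be cleaned first. safe_sheet_title below is a hypothetical helper, and replacing the disallowed characters with an underscore is an arbitrary choice.

```python
import re

# Characters Excel does not allow in worksheet names.
_INVALID_TITLE_CHARS = re.compile(r"[\\/?*\[\]:]")


def safe_sheet_title(name, fallback="Sheet"):
    """Return `name` adjusted to Excel's worksheet naming rules."""
    cleaned = _INVALID_TITLE_CHARS.sub("_", str(name)).strip()
    # Excel caps worksheet names at 31 characters.
    return cleaned[:31] or fallback


print(safe_sheet_title("Asia"))         # prints: Asia
print(safe_sheet_title("North/South"))  # prints: North_South
```

In the loop above it would be applied as wb.create_sheet(safe_sheet_title(each['area'])).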

The overall code is as follows:

import requests
from lxml import etree
import json
import openpyxl
 
url = "https://voice.baidu.com/act/newpneumonia/newpneumonia"
response = requests.get(url)
#print(response.text)
# Generate HTML objects
html = etree.HTML(response.text)
result = html.xpath('//script[@type="application/json"]/text()')
result = result[0]
# json.loads() converts a JSON string to a Python data type
result = json.loads(result)
# Create a workbook
wb = openpyxl.Workbook()
# Get the default worksheet that a new workbook starts with
ws = wb.active
ws.title = "Domestic epidemic"
ws.append(['province', 'Cumulative diagnosis', 'death', 'cure', 'Existing diagnosis', 'Cumulative diagnosis increment', 'Increment of death', 'Cure increment', 'Existing diagnosis increment'])
result_in = result['component'][0]['caseList']
data_out = result['component'][0]['globalList']
'''
area --> province (mostly)
city --> city
confirmed --> cumulative confirmed cases
crued --> cured (the data spells this key 'crued')
relativeTime -->
confirmedRelative --> increment in cumulative confirmed cases
curedRelative --> increment in cured
curConfirm --> currently confirmed cases
curConfirmRelative --> increment in currently confirmed cases
'''
for each in result_in:
    temp_list = [each['area'], each['confirmed'], each['died'], each['crued'], each['curConfirm'],
                 each['confirmedRelative'], each['diedRelative'], each['curedRelative'],
                 each['curConfirmRelative']]
    for i in range(len(temp_list)):
        if temp_list[i] == '':
            temp_list[i] = '0'
    ws.append(temp_list)
# Obtain foreign epidemic data
for each in data_out:
    sheet_title = each['area']
    # Create a new worksheet
    ws_out = wb.create_sheet(sheet_title)
    ws_out.append(['country', 'Cumulative diagnosis', 'death', 'cure', 'Existing diagnosis', 'Cumulative diagnosis increment'])
    for country in each['subList']:
        list_temp = [country['country'], country['confirmed'], country['died'], country['crued'],
                     country['curConfirm'], country['confirmedRelative']]
        for i in range(len(list_temp)):
            if list_temp[i] == '':
                list_temp[i] = '0'
        ws_out.append(list_temp)
wb.save('./data.xlsx')

The results are as follows:

Domestic:

Abroad:


Tags: Python Big Data strip xlsx lxml openpyxl

Posted by lobobr on Mon, 23 May 2022 13:35:16 +0300