Web crawler:
Web crawler (also known as web page) spider , network robot, in FOAF In the middle of the community, more often referred to as web page chaser), it is a kind of automatic crawling according to certain rules web A program or script that contains information. Other names that are not often used are Ants , automatic indexing, emulator, or worm.
The above is Baidu of web crawler. Now let's introduce the use of Python for web crawler to obtain data.
It is used to obtain real-time data of COVID-19.
Tool used PyCharm
Create a new Python file named get_data
Use the request module most commonly used by crawlers
Part I:
Get web page information:
import requests url = "https://voice.baidu.com/act/newpneumonia/newpneumonia" response = requests.get(url)
Part II:
Characteristics of observable data:
The data is contained in the script tag, and xpath is used to get the data.
Import a module from lxml import etree
Generate an html object and parse it
You can get a content of type list, and you can get all the content by using the first item
Next, first get the content of the component. At this time, use the json module to convert the string type into a dictionary (Python data structure)
In order to obtain domestic data, you need to find caseList in component
Next, add the following code:
from lxml import etree import json # Generate HTML objects html = etree.HTML(response.text) result = html.xpath('//script[@type="application/json"]/text()') result = result[0] # json. The load () method converts a string to a python data type result = json.loads(result) result_in = result['component'][0]['caseList']
Part III:
To store domestic data in excel:
Using the openyxl module, import openpyxl
First create a workbook and create a worksheet under the workbook
Next, name the worksheet and assign attributes to the worksheet
The code is as follows:
import openpyxl #Create Workbook wb = openpyxl.Workbook() #Create worksheet ws = wb.active ws.title = "Domestic epidemic" ws.append(['province', 'Cumulative diagnosis', 'death', 'cure', 'Existing diagnosis', 'Cumulative diagnosis increment', 'Increment of death', 'Cure increment', 'Confirmed existing increment']) ''' area --> Mostly provinces city --> city confirmed --> Cumulative crued --> range relativeTime --> confirmedRelative --> Cumulative increment curedRelative --> Increment of value range curConfirm --> Existing quezhen curConfirmRelative --> Increment of existing towns ''' for each in result_in: temp_list = [each['area'], each['confirmed'], each['died'], each['crued'], each['curConfirm'], each['confirmedRelative'], each['diedRelative'], each['curedRelative'], each['curConfirmRelative']] for i in range(len(temp_list)): if temp_list[i] == '': temp_list[i] = '0' ws.append(temp_list) wb.save('./data.xlsx')
Part IV:
Store foreign data in excel:
Get foreign data in the globalList of component
Then create sheet s in excel to represent different continents
The code is as follows:
data_out = result['component'][0]['globalList'] for each in data_out: sheet_title = each['area'] # Create a new worksheet ws_out = wb.create_sheet(sheet_title) ws_out.append(['country', 'Cumulative diagnosis', 'death', 'cure', 'Existing diagnosis', 'Cumulative diagnosis increment']) for country in each['subList']: list_temp = [country['country'], country['confirmed'], country['died'], country['crued'], country['curConfirm'], country['confirmedRelative']] for i in range(len(list_temp)): if list_temp[i] == '': list_temp[i] = '0' ws_out.append(list_temp) wb.save('./data.xlsx')
The overall code is as follows:
import requests from lxml import etree import json import openpyxl url = "https://voice.baidu.com/act/newpneumonia/newpneumonia" response = requests.get(url) #print(response.text) # Generate HTML objects html = etree.HTML(response.text) result = html.xpath('//script[@type="application/json"]/text()') result = result[0] # json. The load () method converts a string to a python data type result = json.loads(result) #Create Workbook wb = openpyxl.Workbook() #Create worksheet ws = wb.active ws.title = "Domestic epidemic" ws.append(['province', 'Cumulative diagnosis', 'death', 'cure', 'Existing diagnosis', 'Cumulative diagnosis increment', 'Increment of death', 'Cure increment', 'Existing diagnosis increment']) result_in = result['component'][0]['caseList'] data_out = result['component'][0]['globalList'] ''' area --> Mostly provinces city --> city confirmed --> Cumulative crued --> range relativeTime --> confirmedRelative --> Cumulative increment curedRelative --> Increment of value range curConfirm --> Existing quezhen curConfirmRelative --> Increment of existing towns ''' for each in result_in: temp_list = [each['area'], each['confirmed'], each['died'], each['crued'], each['curConfirm'], each['confirmedRelative'], each['diedRelative'], each['curedRelative'], each['curConfirmRelative']] for i in range(len(temp_list)): if temp_list[i] == '': temp_list[i] = '0' ws.append(temp_list) # Obtain foreign epidemic data for each in data_out: sheet_title = each['area'] # Create a new worksheet ws_out = wb.create_sheet(sheet_title) ws_out.append(['country', 'Cumulative diagnosis', 'death', 'cure', 'Existing diagnosis', 'Cumulative diagnosis increment']) for country in each['subList']: list_temp = [country['country'], country['confirmed'], country['died'], country['crued'], country['curConfirm'], country['confirmedRelative']] for i in range(len(list_temp)): if list_temp[i] == '': list_temp[i] = '0' ws_out.append(list_temp) wb.save('./data.xlsx')
The results are as follows:
Domestic:

Abroad:

recommend:
- 020 is continuously updated. There are new contents in the boutique circle every day, and the concentration of dry goods is very high.
- Strong contacts, discuss technology, you want here!
- Get in the group first and beat your peers! (no charge for joining the group)
- Click here to exchange and learn with Python development Daniel
- Group No.: 858157650
Upon application:
- Python software installation package, python practical tutorial
- Materials are free, including basic Python learning, advanced learning, crawler, artificial intelligence, automatic operation and maintenance, automatic testing, etc
