Recently, Mr. Ma Baoguo is very popular in station b. the video playback volume of him is very high. The video comment area of station b is full of talents who speak well. Write a crawler to crawl the user information and comment content in the comment area of station b.
1, Preparatory work
1. Tools
(1) Chrome Google browser installation address: https://www.google.cn/chrome/ (plugin: JSON handle download address: http://jsonhandle.sinaapp.com/ , JSON handle installation method:
https://blog.csdn.net/xb12369/article/details/79002208
Used to analyze the structure of web pages
(2)python 3.x installation address: https://www.python.org/ For code writing, development tool: JetBrains PyCharm Professional download address: https://www.jetbrains.com/pycharm/
(3) Mongodb database storage data installation address: https://www.mongodb.com/try/download/community , the database visualization tool MongoDB Compass is used to dynamically manage the database. For the installation tutorial, see:
https://blog.csdn.net/weixin_41466575/article/details/105326230
2, Train of thought
1. Overall thinking
PS: if you need Python learning materials, you can click the link below to get them by yourself
python free learning materials and group communication solutions Click to join
2. Reptile ideas
3, Analyze web pages
If you want to write a good crawler, you must first analyze the web page structure thoroughly.
1. Analyze web page loading mode
We want to crawl user information and comments, so open a video first.
Right click to view the source code, search for relevant comments in the source code, and no relevant data can be found. You can judge that this page is rendered by ajax asynchronous loading data.
2. Analysis data interface
Go back to the video page F12, open the developer tool, refresh it, search with ctrl+f, and find that the comment data is in this json.
This json points to the following address:
https://api.bilibili.com/x/v2/reply?callback=jQuery1720631904798407396_1605664873948&jsonp=jsonp&pn=1&type=1&oid=82179137&sort=2&_=1605664874976
Using the json handle to view this json, you can see that the user information is in member and the comment information is in message.
Back to this interface, this interface needs to pass the following parameters:
callback: jQuery1720631904798407396_1605664873948 #After testing, it can not be transmitted jsonp: jsonp #After testing, it can not be transmitted pn: 1 #Page number identification type: 1 #Type oid: 82179137 #Video identification sort: 2 #Classification _: 1605664874976 #The current timestamp can not be transmitted after testing 1234567
Through analysis, it is found that the key parameters are oid and pn, sort. I guess oid is the video ID, pn is the number of pages where comments are located, and sort is the category. We want to get oid.
3. Get oid
Try to search for oid in the video web page source code.
It's easy to do if we retrieve 7 relevant information. We first request the video web page to obtain the oid, and then construct the loading interface through the oid.
In this way, we can find out the loading method and loading interface of video comment information, and we can write a crawler.
4, Write crawler
No nonsense, go directly to the source code
import json import requests import pymongo import re import time import random class Bilibili_Comment_Parse(object): #Set parameters. The data is stored in different sets according to different video titles, and all sets are stored in Bilibili library def set_page(self): host='127.0.0.1' port=27017 myclient=pymongo.MongoClient(host=host,port=port) mydb='Bilibili' sheetname=self.video_title db=myclient[mydb] self.post=db[sheetname] #Get oid and video title according to the entered URL def get_oid(self,url): headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'} r = requests.get(url,headers=headers) res = r.text patten = '<meta data-vue-meta="true" itemprop="url" content="https://www.bilibili.com/video/av(.*?)/">' oid = re.findall(patten, res) aim_oid = oid[0] patten1='<meta data-vue-meta="true" property="og:title" content="(.*?)">' video_title=re.findall(patten1,res) if video_title: self.video_title=video_title[0].split('_Bilibili')[0] return aim_oid #Crawler main program, used to parse video comment user information def parse(self,oid): base_url=f'https://api.bilibili.com/x/v2/reply?jsonp=jsonp&type=1&oid={oid}&sort=2' n=0 url=base_url+'&pn={}' headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'} try: while True: r=requests.get(url.format(n),headers=headers) _json=json.loads(r.text) replies=_json.get('data').get('replies') item={} n+=1 if len(replies)!=0: print(f'\033[34;47m--------------------Crawling{n}page--------------------') for replie in replies: item['user_id']=replie.get('member').get('mid')#User id item['user_name']=replie.get('member').get('uname')#user name item['user_sex']=replie.get('member').get('sex')#Gender item['user_level']=replie.get('member').get('level_info').get('current_level')#Grade vip=replie.get('member').get('vip').get('vipStatus')#vip if vip==1: item['user_is_vip']='Y' elif vip==0 : item['user_is_vip']='N' comment_date=replie.get('member').get('ctime')#Comment date timeArray = time.localtime(comment_date) otherStyleTime = time.strftime("%Y-%m-%d %H:%M:%S", timeArray) item['apprecate_count']=replie.get('like')#Number of likes item['reply_count']=replie.get('rcount')#Number of replies item['comment_date']=otherStyleTime item['comment']=replie.get('content').get('message')#Comment content #It can also be used to judge whether there is this document in the database res=self.post.count_documents(item) if res==0: data = dict(item) self.post.insert(data) print(f'\033[35;46m{item}\033[0m') else: print('\033[31;44m pass\033[0m') time.sleep(0.5) else: print(f'\033[31;44m--------------------Procedure at{n}Page exit normally!--------------------\033[0m') break except: pass def main(): while True: type = input('Please enter the video address:') #Simple filtering using regular expressions patten1 = re.compile(r'^(?:http|https)?://(\S)*') flag= re.findall(patten1, type) if flag: try: bilibili_parse=Bilibili_Comment_Parse() oid =bilibili_parse.get_oid(type) bilibili_parse.set_page() bilibili_parse.parse(oid) break except: print('\033[35;49m The information you entered is incorrect! Please check!\033[0m') else: print('\033[35;49m You have entered the information incorrectly! Please check!\033[0m') if __name__ == '__main__': main() 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102
Let's talk about it here. At first, I judged the input content with regularity, and I didn't pass all the parameters of the api, such as_ Such a timestamp can be used as time Time() is used to construct user information, which can be extracted as needed.
The console keeps outputting data until the end.
5, Store data
For data storage, I choose mongodb database. You can choose the storage method by yourself.
There are more than 9000 entries in total. You can choose to export the data in json or csv format.
6, Summary
- This crawl is the user information and comment data of station b video comment. It is not only applicable to Mr. Ma's video, but also general. Station b has restrictions on the frequency of ip access, so to be on the safe side, the speed limit is added. The focus of this crawler is to find the comment api interface. After the url is constructed, the data extraction is simple and easy. There are more than 9000 pieces of data in total, and more than 1w3K pieces of data shown below the video should contain replies. If there are any deficiencies in ideas and codes, you are welcome to correct and criticize!