Crawling user information from Bilibili video comments! The comment section is full of masters!

Recently, Mr. Ma Baoguo has become hugely popular on Bilibili, and videos about him rack up very high play counts. Bilibili's comment sections are full of talented people with a way with words. Let's write a crawler to scrape the user information and comment content from a Bilibili comment section.

1, Preparation

1. Tools

(1) Chrome browser. Download: https://www.google.cn/chrome/
Plugin: JSON Handle, download: http://jsonhandle.sinaapp.com/ , installation guide: https://blog.csdn.net/xb12369/article/details/79002208
Used to analyze the structure of web pages.

(2) Python 3.x. Download: https://www.python.org/
Used for writing the code. IDE: JetBrains PyCharm Professional, download: https://www.jetbrains.com/pycharm/

(3) MongoDB, used to store the data. Download: https://www.mongodb.com/try/download/community
The database visualization tool MongoDB Compass is used to manage the database; for an installation tutorial, see: https://blog.csdn.net/weixin_41466575/article/details/105326230
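
Once MongoDB is installed, a minimal sketch like the one below (assuming MongoDB runs locally on the default port 27017) can confirm that pymongo can reach it before the crawl starts:

import pymongo

#Connect to the local MongoDB instance (default host and port assumed)
client=pymongo.MongoClient(host='127.0.0.1',port=27017)

#'ping' forces a round trip to the server, so a stopped or missing
#MongoDB fails here instead of midway through the crawl
client.admin.command('ping')
print('MongoDB is reachable:',client.list_database_names())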

2, Approach

1. Overall approach


2. Crawler approach

3, Analyzing the web page

To write a good crawler, you must first thoroughly analyze the structure of the web page.

1. Analyze how the page loads

We want to crawl user information and comments, so start by opening a video.


Right-click to view the page source and search it for any of the comments: no matching data can be found. From this we can conclude that the page loads the comment data asynchronously via Ajax and renders it afterwards.
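
A minimal sketch of this check (the video URL is reconstructed from the oid that appears later, and the search string is a placeholder for a comment you actually see in the browser):

import requests

#Fetch the raw HTML of the video page, before any JavaScript runs
headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
html=requests.get('https://www.bilibili.com/video/av82179137/',headers=headers).text

#If a comment visible in the browser is absent from the raw HTML,
#the comments must be loaded asynchronously after the page renders
print('comment found in raw HTML?','some comment you saw in the browser' in html)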

2. Analyze the data interface

Go back to the video page, press F12 to open the developer tools, refresh, and search with Ctrl+F: the comment data turns out to be in this JSON response.


This JSON is served from the following address:

https://api.bilibili.com/x/v2/reply?callback=jQuery1720631904798407396_1605664873948&jsonp=jsonp&pn=1&type=1&oid=82179137&sort=2&_=1605664874976

Viewing this JSON with JSON Handle, you can see that the user information sits under member and the comment text under message.

Returning to the interface, it takes the following parameters:

callback: jQuery1720631904798407396_1605664873948  #Testing shows this can be omitted
jsonp: jsonp  #Testing shows this can be omitted
pn: 1  #Page number
type: 1  #Type
oid: 82179137  #Video identifier
sort: 2  #Sort order
_: 1605664874976  #Current timestamp; testing shows it can be omitted

Analysis shows that the key parameters are oid, pn, and sort. My guess: oid is the video ID, pn is the comment page number, and sort is the sort order. The one we need to obtain is the oid.
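
To verify this, a quick sketch that requests the interface with only those key parameters (using the oid from the captured URL above) might look like:

import requests

#Request one page of comments with only the key parameters
params={'pn':1,'type':1,'oid':82179137,'sort':2}
headers={'user-agent': 'Mozilla/5.0'}
r=requests.get('https://api.bilibili.com/x/v2/reply',params=params,headers=headers)
data=r.json()

#User info sits under 'member', comment text under 'content' -> 'message'
first=data['data']['replies'][0]
print(first['member']['uname'],':',first['content']['message'])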

3. Obtain the oid

Try searching for oid in the video page's source code.

The search returns seven relevant matches, so this is easy: first request the video page to extract the oid, then use the oid to construct the comment-loading interface, as sketched below.
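
A short sketch of just this step (the meta-tag pattern is the one used in the full source further down, and may break if Bilibili changes its page markup):

import re
import requests

headers={'user-agent': 'Mozilla/5.0'}
res=requests.get('https://www.bilibili.com/video/av82179137/',headers=headers).text

#The oid is embedded in a meta tag of the form .../video/av<oid>/
patten='<meta data-vue-meta="true" itemprop="url" content="https://www.bilibili.com/video/av(.*?)/">'
oid=re.findall(patten,res)[0]
print('oid =',oid)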

With that, we have worked out both how the comment data loads and the interface it loads from, and we can write the crawler.

4, Writing the crawler

Without further ado, here is the source code:

import json
import requests
import pymongo
import re
import time
import random

class Bilibili_Comment_Parse(object):

    #Set up storage: comments are saved to a separate collection per video title, and all collections live in the Bilibili database
    def set_page(self):
        host='127.0.0.1'
        port=27017
        myclient=pymongo.MongoClient(host=host,port=port)
        mydb='Bilibili'
        sheetname=self.video_title
        db=myclient[mydb]
        self.post=db[sheetname]

    #Extract the oid and the video title from the given video URL
    def get_oid(self,url):
        headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
        r = requests.get(url,headers=headers)
        res = r.text
        patten = '<meta data-vue-meta="true" itemprop="url" content="https://www.bilibili.com/video/av(.*?)/">'
        oid = re.findall(patten, res)
        aim_oid = oid[0]
        patten1='<meta data-vue-meta="true" property="og:title" content="(.*?)">'
        video_title=re.findall(patten1,res)
        if video_title:
            self.video_title=video_title[0].split('_Bilibili')[0]
        return aim_oid

    #Main crawler routine: parse the comment users' information page by page
    def parse(self,oid):
        base_url=f'https://api.bilibili.com/x/v2/reply?jsonp=jsonp&type=1&oid={oid}&sort=2'
        n=0
        url=base_url+'&pn={}'
        headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
        try:
            while True:
                r=requests.get(url.format(n),headers=headers)
                _json=json.loads(r.text)
                replies=_json.get('data').get('replies')
                n+=1
                #'replies' comes back null once every comment page has been fetched
                if replies:
                    print(f'\033[34;47m--------------------Crawling page {n}--------------------\033[0m')
                    for reply in replies:
                        item={}
                        item['user_id']=reply.get('member').get('mid')#User id
                        item['user_name']=reply.get('member').get('uname')#User name
                        item['user_sex']=reply.get('member').get('sex')#Gender
                        item['user_level']=reply.get('member').get('level_info').get('current_level')#Level
                        vip=reply.get('member').get('vip').get('vipStatus')#VIP status
                        item['user_is_vip']='Y' if vip==1 else 'N'
                        comment_date=reply.get('member').get('ctime')#Comment date
                        timeArray=time.localtime(comment_date)
                        otherStyleTime=time.strftime("%Y-%m-%d %H:%M:%S",timeArray)
                        item['apprecate_count']=reply.get('like')#Number of likes
                        item['reply_count']=reply.get('rcount')#Number of replies
                        item['comment_date']=otherStyleTime
                        item['comment']=reply.get('content').get('message')#Comment text
                        #Only insert documents that are not already in the database
                        if self.post.count_documents(item)==0:
                            self.post.insert_one(dict(item))
                            print(f'\033[35;46m{item}\033[0m')
                        else:
                            print('\033[31;44m pass\033[0m')
                    time.sleep(0.5)
                else:
                    print(f'\033[31;44m--------------------Program exited normally at page {n}--------------------\033[0m')
                    break
        except Exception as e:
            print(f'Stopped on error: {e}')

def main():
    while True:
        video_url = input('Please enter the video address:')
        #Simple URL validation with a regular expression
        patten1 = re.compile(r'^(?:http|https)?://(\S)*')
        flag = re.findall(patten1, video_url)
        if flag:
            try:
                bilibili_parse=Bilibili_Comment_Parse()
                oid=bilibili_parse.get_oid(video_url)
                bilibili_parse.set_page()
                bilibili_parse.parse(oid)
                break
            except Exception:
                print('\033[35;49mThe information you entered is incorrect! Please check!\033[0m')
        else:
            print('\033[35;49mThe address you entered is not a valid URL! Please check!\033[0m')


if __name__ == '__main__':
    main()


A couple of notes. The input is validated up front with a regular expression, and I did not pass every parameter of the API: a timestamp such as _ can be constructed with time.time() if needed. The user information fields can be extracted as required.
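
For instance, if the _ parameter were ever required, it could be rebuilt from the current time like this:

import time

#The '_' parameter is just a millisecond timestamp,
#e.g. 1605664874976 in the request captured earlier
timestamp=int(time.time()*1000)
print(timestamp)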

The console keeps printing data until the crawl finishes.

5, Storing the data

For storage I chose MongoDB; you can pick whatever storage method suits you.


There are more than 9,000 entries in total. You can export the data in JSON or CSV format.
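
For example, a small pymongo sketch along these lines (the collection name is a placeholder, since each video title is stored as its own collection) dumps a collection to JSON:

import json
import pymongo

client=pymongo.MongoClient(host='127.0.0.1',port=27017)
db=client['Bilibili']
post=db['your_video_title']  #placeholder: one collection per video title

#Strip MongoDB's internal _id field so the documents serialize cleanly
docs=list(post.find({},{'_id':0}))
with open('comments.json','w',encoding='utf-8') as f:
    json.dump(docs,f,ensure_ascii=False,indent=2)
print(f'Exported {len(docs)} records')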

6, Summary

  • This crawl collected the user information and comment data from a Bilibili video's comment section. It is not limited to Mr. Ma's videos; it works for videos in general. Bilibili limits the frequency of access per IP, so to be safe a rate limit was added. The key to this crawler is finding the comment API interface; once the URL is constructed, extracting the data is simple. There are more than 9,000 records in total; the 13,000+ count shown below the video presumably includes replies to comments. If there are any shortcomings in the approach or the code, corrections and criticism are welcome!

 

Tags: Python Big Data Visualization Data Mining

Posted by Hiro on Sat, 07 May 2022 07:36:40 +0300