What are netizens roasting about in "Exciting Offer" Season 2? A bullet-screen analysis

Variety shows are not only a way to relax after a long day, they are also what we chat about over dinner. Time spent watching a favourite show is time well spent, and "Exciting Offer" is a good choice. Some people say it makes them feel inferior; I just find it very interesting.

"Exciting offer" has been broadcast for two seasons so far. In the first quarter, Douban scored 8.3 points, with a total score of more than 50000 people. At present, the score in the second quarter is lower than that in the first quarter, with a score of only 7.1 points.


In this article we crawl the 130,000+ bullet-screen comments of "Exciting Offer" Season 2 and run visual analysis and sentiment analysis on them.

Data acquisition

Season 2 of "Exciting Offer" is broadcast exclusively on Tencent Video. So far four episodes have aired (including the interview special), so we crawl the bullet screens episode by episode.

# -*- coding: utf-8 -*-
#@Time   : 2020/11/30 21:35
#@Author : official account "Cai J learns Python"
#@File   : tengxun_danmu.py

import requests
import json
import time
import pandas as pd

target_id = "6130942571%26" #Target of interview_ id
vid = "%3Dt0034o74jpr" #vid of interview
df = pd.DataFrame()
for page in range(15, 3214, 30):  #the video is 3214 seconds long; the timestamp advances 30 seconds per request
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}
    url = 'https://mfm.video.qq.com/danmu?otype=json&timestamp={0}&target_id={1}vid{2}&count=80'.format(page,target_id,vid)
    print("Extracting page" + str(page) + "page")
    html = requests.get(url,headers = headers)
    bs = json.loads(html.text,strict = False)  #strict=False tolerates control characters that would otherwise break JSON parsing
    time.sleep(1)
    #Traverse to get the target field
    for i in bs['comments']:
        content = i['content']  #bullet chat
        upcount = i['upcount']  #Number of likes
        user_degree =i['uservip_degree'] #Membership level
        timepoint = i['timepoint']  #Release time
        comment_id = i['commentid']  #Barrage id
        cache = pd.DataFrame({'bullet chat':[content],'Membership level':[user_degree],'Release time':[timepoint],'Bullet screen praise':[upcount],'bullet chat id':[comment_id]})
        df = pd.concat([df,cache])
df.to_csv('Interview article.csv',encoding = 'utf-8')

After crawling, put the four bullet-screen CSV files into one folder.

Open the interview csv file and preview it as follows:

Data cleaning

Merge barrage data

First, merge the four bullet-screen CSV files with pandas' concat method.

import pandas as pd
import numpy as np
df1 = pd.read_csv("/food J learn Python/bullet chat/tencent/Exciting offer/Interview article.csv")
df1["Number of periods"] = "Interview article"
df2 = pd.read_csv("/food J learn Python/bullet chat/tencent/Exciting offer/Phase 1.csv")
df2["Number of periods"] = "Phase 1"
df3 = pd.read_csv("/food J learn Python/bullet chat/tencent/Exciting offer/Phase 2.csv")
df3["Number of periods"] = "Phase 2"
df4 = pd.read_csv("/food J learn Python/bullet chat/tencent/Exciting offer/Issue 3.csv")
df4["Number of periods"] = "Issue 3"
df = pd.concat([df1,df2,df3,df4])
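
If the four CSV files all sit in the same folder, the merge above can also be written more compactly with glob. This is only an equivalent sketch, assuming the file name (without the extension) can serve as the episode label:

import glob
import os
import pandas as pd

frames = []
for path in glob.glob("/food J learn Python/bullet chat/tencent/Exciting offer/*.csv"):
    tmp = pd.read_csv(path)
    tmp["Number of periods"] = os.path.splitext(os.path.basename(path))[0]  #file name as the episode label
    frames.append(tmp)
df = pd.concat(frames)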

Preview the merged data:

df.sample(10)

View data information

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 133627 entries, 0 to 34923
Data columns (total 8 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  133627 non-null  int64 
 1   user name         49040 non-null   object
 2   content          133626 non-null  object
 3   Membership level        133627 non-null  int64 
 4   Comment time point       133627 non-null  int64 
 5   Comment like        133627 non-null  int64 
 6   comment id        133627 non-null  int64 
 7   Number of periods          133627 non-null  object
dtypes: int64(5), object(3)
memory usage: 9.2+ MB

The following problems are found in the data:
1. The field names can be tidied up (personal preference)
2. The Unnamed: 0 column is redundant
3. The user name column has missing values, which can be filled in
4. The content and Comment time point columns need type adjustments
5. The comment id is meaningless for the analysis and can be dropped

Rename field

df = df.rename(columns={'user name':'User nickname','content':'Barrage content','Comment time point':'Sending time','Comment like':'Bullet screen praise','Number of periods':'Number of periods'})

Filter field

#Select the field to analyze
df = df[["User nickname","Barrage content","Membership level","Sending time","Bullet screen praise","Number of periods"]]

Missing value processing

df["User nickname"] = df["User nickname"].fillna("anonymous person")

Send time processing

The sending time field is in seconds and needs to be converted into a time. Here we define a custom time_change function.

def time_change(seconds):
    m, s = divmod(seconds, 60)
    h, m = divmod(m, 60)
    ss_time = "%d:%02d:%02d" % (h, m, s)
    print(ss_time)
    return ss_time
time_change(seconds=8888)  #e.g. 8888 seconds -> "2:28:08"

Apply the time_change function to the sending time field:

df["Sending time"] = df["Sending time"].apply(time_change)

Set to the desired time format:

df['Sending time'] = pd.to_datetime(df['Sending time'])
df['Sending time'] = df['Sending time'].apply(lambda x : x.strftime('%H:%M:%S'))
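
As an aside, pandas can do the whole seconds-to-time conversion in one step. This is only an equivalent alternative to the two steps above, applied directly to the raw seconds column, and it relies on every episode being shorter than 24 hours:

#Treat the seconds as an offset from the epoch and keep only the time-of-day part
df['Sending time'] = pd.to_datetime(df['Sending time'], unit='s').dt.strftime('%H:%M:%S')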

Barrage content processing

Change the object data type to str:

df["Barrage content"] = df["Barrage content"].astype("str")

Mechanical compression deduplication (collapsing repeated substrings):

#Define the mechanical compression function: collapse consecutive repeats of the same substring
def yasuo(st):
    for i in range(1,int(len(st)/2)+1):      #i: length of the candidate repeating unit
        for j in range(len(st)):             #j: start position of the candidate unit
            if st[j:j+i] == st[j+i:j+2*i]:   #the unit repeats immediately after itself
                k = j + i
                while st[k:k+i] == st[k+i:k+2*i] and k<len(st):
                    k = k + i                #extend over further consecutive repetitions
                st = st[:j] + st[k:]         #keep a single copy of the repeated unit
    return st
yasuo(st="food J learn Python It's really, really, really delicious")
#Call mechanical compression function
df["Barrage content"] = df["Barrage content"].apply(yasuo)

Special character filtering:

df['Barrage content'] = df['Barrage content'].str.extract(r"([\u4e00-\u9fa5]+)") #Extract Chinese content
df = df.dropna()  #pure-emoji bullet screens become NaN after the extraction and can be dropped directly

The data preview after cleaning is as follows:

Data analysis

Comparison of the number of bullet screens per episode

Season 2 of "Exciting Offer" has aired four episodes so far (including the interview special). Episode 1, in which the rules are upgraded and the interns face a high-pressure assessment, drew the most bullet screens, 42,422 in total. The interview special, in which the interns face probing questions, drew the fewest, only 17,332.

import pyecharts.options as opts
from pyecharts.charts import *
from pyecharts.globals import ThemeType  

df7 = df["Number of periods"].value_counts()
print(df7.index.to_list())
print(df7.to_list())
c = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK))
    .add_xaxis(df7.index.to_list())
    .add_yaxis("",df7.to_list()) 
    .set_global_opts(title_opts=opts.TitleOpts(title="Number of barrages in each period",subtitle="Data source: Tencent video screen \t Drawing: dishes J learn Python",pos_left = 'left'),
                       xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=13)), #Change abscissa font size
                       yaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=13)), #Change ordinate font size
                       )
    .set_series_opts(label_opts=opts.LabelOpts(font_size=16,position='top'))
    )
c.render_notebook()

Who is the bullet-screen machine

The user nicknamed "Think too much de cat" sent a total of 227 bullet screens across the episodes, far ahead of all other commenters and a veritable bullet-screen machine.

df8 = df["User nickname"].value_counts()[1:11]
df8 = df8.sort_values(ascending=True)
df8 = df8.tail(10)
print(df8.index.to_list())
print(df8.to_list())
c = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK))
    .add_xaxis(df8.index.to_list())
    .add_yaxis("",df8.to_list()).reversal_axis() #Exchange sequence of X-axis and y-axis
    .set_global_opts(title_opts=opts.TitleOpts(title="Number of barrages sent TOP10",subtitle="Data source: Tencent video \t Drawing: dishes J learn Python",pos_left = 'left'),
                       xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=13)), #Change abscissa font size
                       yaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=13)), #Change ordinate font size
                       )
    .set_series_opts(label_opts=opts.LabelOpts(font_size=16,position='right'))
    )
c.render_notebook()


Randomly sampling the bullet screens of "Think too much de cat" shows a deep love for Season 2 of "Exciting Offer". Judging from the content, this viewer watches quite attentively, and almost every bullet screen received some likes.

df[df["User nickname"]=="Think too much de cat"].sample(10)

Membership level distribution

Judging from the membership-level distribution, about 74.5% of the bullet-screen senders in Season 2 are not Tencent Video VIP members (level 0), and the various VIP levels split the remaining share; the pie chart below shows the exact proportion of each level.

df2 = df["Membership level"].astype("str").value_counts()
print(df2)
df2 = df2.sort_values(ascending=False)
regions = df2.index.to_list()
values = df2.to_list()
c = (
        Pie(init_opts=opts.InitOpts(theme=ThemeType.DARK))
        .add("", list(zip(regions,values)))
        .set_global_opts(legend_opts = opts.LegendOpts(is_show = False),title_opts=opts.TitleOpts(title="Membership level distribution",subtitle="Data source: Tencent video\t Drawing: dishes J learn Python",pos_top="0.5%",pos_left = 'left'))
        .set_series_opts(label_opts=opts.LabelOpts(formatter="Grade{b}Proportion:{d}%",font_size=14))
        
    )
c.render_notebook()

What are the bullet screens talking about

Building a word cloud from the 130,000+ bullet screens, we find that the high-frequency words include Ding Hui, lawyer, like, come on, Lawyer Xu, "gan fan" (dig in), Teacher Sa and so on. Ding Hui, the oldest of the eight interns, has been hotly debated by the audience from the very start. Lawyer Xu, a mentor from Season 1, had already won over much of the audience with her brisk, sharp professional style. "Gan fan" is a recently popular piece of internet slang, so it is no surprise to see it in a hit variety show. Teacher Sa, who appears on this season's panel, is also widely discussed for his comic relief and his "Versailles" (humble-brag) moments.

import jieba
import stylecloud
from IPython.display import Image

# Define the word segmentation function
def get_cut_words(content_series):
    # Read in stop list
    stop_words = [] 
    with open("/food J learn Python/offer/stop_words.txt", 'r', encoding='utf-8') as f:
        lines = f.readlines()
        for line in lines:
            stop_words.append(line.strip())
    # Add keyword
    my_words = ['Teacher SA', 'Fan Chengcheng','First season']  
    for i in my_words:
        jieba.add_word(i) 
    # Custom stop words
    my_stop_words = ['be like', 'really','feel']   
    stop_words.extend(my_stop_words)               
    # participle
    word_num = jieba.lcut(content_series.str.cat(sep='. '), cut_all=False)
    # Conditional screening
    word_num_selected = [i for i in word_num if i not in stop_words and len(i)>=2]
    return word_num_selected
# Draw word cloud
text1 = get_cut_words(content_series=df['Barrage content'])
stylecloud.gen_stylecloud(text=' '.join(text1), max_words=100,
                          collocations=False,
                          font_path='Character KuTang clear regular script.ttf',
                          icon_name='fas fa-square',
                          size=653,
                          #palette='matplotlib.Inferno_9',
                          output_name='./offer.png')
Image(filename='./offer.png') 

How does the audience comment on the eight interns

Let's first look at the photos of eight interns:

Among all the bullet screens, Ding Hui is mentioned far more often than the other seven interns, 9,298 times in total, followed by Zhan Qiuyi with 2,455 mentions; Liu Yucheng is mentioned the least, only 526 times.
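
The chart below relies on a "Character mention" column that tags each bullet screen with the intern it talks about, which is not built by the code shown earlier. Here is a minimal sketch of one way to derive it; the interns list uses the English renderings from this article (real matching would use the Chinese names), and a comment mentioning several interns is counted once, for the first name found:

#Hypothetical helper: tag each bullet screen with the first intern name it contains
interns = ["Ding Hui", "Zhan Qiuyi", "Wang Xiao", "Zhu Yixuan",
           "Qu Zelin", "Li Jinye", "Wang Yingfei", "Liu Yucheng"]

def find_mention(text):
    for name in interns:
        if name in text:
            return name
    return "No mention"

df["Character mention"] = df["Barrage content"].apply(find_mention)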

df8 = df["Character mention"].value_counts()[1:11]
print(df8.index.to_list())
print(df8.to_list())
c = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK))
    .add_xaxis(df8.index.to_list())
    .add_yaxis("",df8.to_list()) 
    .set_global_opts(title_opts=opts.TitleOpts(title="Number of people mentioned",subtitle="Data source: Tencent video \t Drawing: dishes J learn Python",pos_left = 'left'),
                       xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=13)), #Change abscissa font size
                       yaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=13)), #Change ordinate font size
                       )
    .set_series_opts(label_opts=opts.LabelOpts(font_size=16,position='top'))
    )
c.render_notebook()


Drawing separate word clouds for the eight interns' bullet screens (see the sketch after this paragraph), we find that many viewers approve of Ding Hui: words such as come on, like, optimistic and support appear frequently. The audience is also fond of the introverted Zhan Qiuyi; high-frequency words such as beautiful, Liu Yifei and good-looking suggest many people like her for her looks. Opinion on Wang Xiao, who comes from Stanford, is split: some say he is good, others find him "Versailles" (a humble-bragger). The same goes for Zhu Yixuan: some find her cute, others dislike her. Qu Zelin is praised for his high EQ and for being cute. Li Jinye is praised for being handsome, and many even think he resembles He Yunchen, a popular intern from Season 1. Wang Yingfei, a graduate of Renmin University of China, is likewise praised for being good-looking. Liu Yucheng, who passed his exams with high scores, is recognised for his solid professional knowledge; because Wang Xiao snatched his chance in Episode 3, viewers felt he was wronged and expressed their sympathy one after another.
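
The eight per-intern word clouds can be produced in a loop that reuses get_cut_words and stylecloud from the word-cloud step above. A sketch, assuming the "Character mention" column and interns list from the earlier snippet, with illustrative output file names:

#One word cloud per intern, built from the bullet screens that mention that intern
for name in interns:
    sub = df[df["Character mention"] == name]
    if sub.empty:
        continue
    words = get_cut_words(content_series=sub["Barrage content"])
    stylecloud.gen_stylecloud(text=' '.join(words), max_words=100,
                              collocations=False,
                              font_path='Character KuTang clear regular script.ttf',
                              icon_name='fas fa-square',
                              size=653,
                              output_name='./offer_{}.png'.format(name))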

Sentiment analysis

Using Baidu's open-source NLP model to score the sentiment of the bullet-screen content, we find that the overall sentiment score for Season 2 of "Exciting Offer" is above 0.5, i.e. generally positive. Viewers with higher membership levels are more likely to stick around to the end; the number of bullet-screen likes rises as the video plays and then plummets in the last 15 minutes; and the sentiment score is high at the beginning and end of an episode and lower in the middle.

import paddlehub as hub
#Baidu's open source sentiment prediction model is used here
senta = hub.Module(name="senta_bilstm")
texts = df['Barrage content'].tolist()
input_data = {'text':texts}
res = senta.sentiment_classify(data=input_data)
df['Emotional score'] = [x['positive_probs'] for x in res]
#Resample to 15-minute windows; resample needs a DatetimeIndex, so parse the HH:MM:SS strings first
df.index = pd.to_datetime(df['Sending time'])
data = df.resample('15min').mean().reset_index()

#Add palette to data table
import seaborn as sns
color_map = sns.light_palette('orange', as_cmap=True)  #light_palette palette
data.style.background_gradient(color_map)

c = (
        Line(init_opts=opts.InitOpts(theme=ThemeType.DARK))
       .add_xaxis(data["Sending time"].to_list())
       .add_yaxis('Emotional tendency', list(data["Emotional score"].round(2)), is_smooth=True,is_connect_nones=True,areastyle_opts=opts.AreaStyleOpts(opacity=0.5))
       .set_global_opts(title_opts=opts.TitleOpts(title="Emotional tendency",subtitle="Data source: Tencent video \t Drawing: dishes J learn Python",pos_left = 'left'))
    )
c.render_notebook()



