R language for Twitter data visualization

By Audhi Aprilliant
Compile VK
Source: towards Data Science

summary

For this project, we used Twitter's raw data through crawlers from May 28 to 29, 2019. In addition, the data is in CSV format (comma separated) and can be downloaded here.

https://github.com/audhiaprilliant/Indonesia-Public-Election-Twitter-Sentiment-Analysis/tree/master/Datasets

It involves two topics, one is the data of Joko Widodo containing the keyword "Joko Widodo", and the other is the data of Prabowo Subianto with the keyword "Prabowo Subianto". It includes several variables and information to determine user emotion. In fact, the data has 16 variables or attributes and more than 1000 observations. Table 1 lists some variables.

# Import library
library(ggplot2)
library(lubridate)

# Load data of Joko Widodo
data.jokowi.df = read.csv(file = 'data-joko-widodo.csv',
                          header = TRUE,
                          sep = ',')
senti.jokowi = read.csv(file = 'sentiment-joko-widodo.csv',
                        header = TRUE,
                        sep = ',')
                        
# Load Prabowo Subianto's data
data.prabowo.df = read.csv(file = 'data-prabowo-subianto.csv',
                           header = TRUE,
                           sep = ',')
senti.prabowo = read.csv(file = 'sentiment-prabowo-subianto.csv',
                         header = TRUE,
                         sep = ',')

Data visualization

Data exploration aims to get any information from Twitter data. It should be noted that the data has been text preprocessed. We explore variables that are considered interesting..

# TWEETS bar chart - JOKO WIDODO
data.jokowi.df$created = ymd_hms(data.jokowi.df$created,
                                 tz = 'Asia/Jakarta')
# Another way to make "date" and "hour" variables
data.jokowi.df$date = date(data.jokowi.df$created)
data.jokowi.df$hour = hour(data.jokowi.df$created)
# Date: May 29, 2019
data.jokowi.date1 = subset(x = data.jokowi.df,
                           date == '2019-05-29')
data.hour.date1 = data.frame(table(data.jokowi.date1$hour))
colnames(data.hour.date1) = c('Hour','Total.Tweets')
# Create data visualization
ggplot(data.hour.date1)+
  geom_bar(aes(x = Hour,
               y = Total.Tweets,
               fill = I('blue')),
           stat = 'identity',
           alpha = 0.75,
           show.legend = FALSE)+
  geom_hline(yintercept = mean(data.hour.date1$Total.Tweets),
             col = I('black'),
             size = 1)+
  geom_text(aes(fontface = 'italic',
                label = paste('Average:',
ceiling(mean(data.hour.date1$Total.Tweets)),
                              'Tweets per hour'),
                x = 8,
                y = mean(data.hour.date1$Total.Tweets)+20),
            hjust = 'left',
            size = 4)+
  labs(title = 'Total Tweets per Hours - Joko Widodo',
       subtitle = '28 May 2019',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlab('Time of Day')+
  ylab('Total Tweets')+
  scale_fill_brewer(palette = 'Dark2')+
  theme_bw()
  
# Bar chart of TWEETS - Prabowo subanto
data.prabowo.df$created = ymd_hms(data.prabowo.df$created,
                                  tz = 'Asia/Jakarta')
                                  
# Another way to make "date" and "hour" variables
data.prabowo.df$date = date(data.prabowo.df$created)
data.prabowo.df$hour = hour(data.prabowo.df$created)

# Date: May 28, 2019
data.prabowo.date1 = subset(x = data.prabowo.df,
                            date == '2019-05-28')
data.hour.date1 = data.frame(table(data.prabowo.date1$hour))
colnames(data.hour.date1) = c('Hour','Total.Tweets')

# Date: May 29, 2019
data.prabowo.date2 = subset(x = data.prabowo.df,
                            date == '2019-05-29')
data.hour.date2 = data.frame(table(data.prabowo.date2$hour))
colnames(data.hour.date2) = c('Hour','Total.Tweets')
data.hour.date3 = rbind(data.hour.date1,data.hour.date2)
data.hour.date3$Date = c(rep(x = '2019-05-28',
                             len = nrow(data.hour.date1)),
                         rep(x = '2019-05-29',
                             len = nrow(data.hour.date2)))
data.hour.date3$Labels = c(letters,'A','B')
data.hour.date3$Hour = as.character(data.hour.date3$Hour)
data.hour.date3$Hour = as.numeric(data.hour.date3$Hour)

# Data preprocessing
for (i in 1:nrow(data.hour.date3)) {
  if (i%%2 == 0) {
    data.hour.date3[i,'Hour'] = ''
  }
  if (i%%2 == 1) {
    data.hour.date3[i,'Hour'] = data.hour.date3[i,'Hour']
  }
}
data.hour.date3$Hour = as.factor(data.hour.date3$Hour)

# Data visualization
ggplot(data.hour.date3)+
  geom_bar(aes(x = Labels,
               y = Total.Tweets,
               fill = Date),
           stat = 'identity',
           alpha = 0.75,
           show.legend = TRUE)+
  geom_hline(yintercept = mean(data.hour.date3$Total.Tweets),
             col = I('black'),
             size = 1)+
  geom_text(aes(fontface = 'italic',
                label = paste('Average:',
ceiling(mean(data.hour.date3$Total.Tweets)),
                              'Tweets per hour'),
                x = 5,
                y = mean(data.hour.date3$Total.Tweets)+6),
            hjust = 'left',
            size = 3.8)+
  scale_x_discrete(limits = data.hour.date3$Labels,
                   labels = data.hour.date3$Hour)+
  labs(title = 'Total Tweets per Hours - Prabowo Subianto',
       subtitle = '28 - 29 May 2019',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlab('Time of Day')+
  ylab('Total Tweets')+
  ylim(c(0,100))+
  theme_bw()+
  theme(legend.position = 'bottom',
        legend.title = element_blank())+
  scale_fill_brewer(palette = 'Dark2')

According to figure 1, we can conclude that the number of tweet s obtained by data capture (keywords "Jokow Widodo" and "Prabowo Subianto") is not similar, even on the same date.

For example, in Figure 1 (left), from a visual point of view, tweets with the keyword "Joko Widodo" are only obtained during WIB at 03:00 – 17:00 on May 28, 2019. In Figure 1 (right), we conclude that tweets with the keyword "Prabowo Subianto" were obtained during 12:00-23:59 WIB (May 28, 2019) and 00:00-15:00 WIB (May 29, 2019) from May 28 to 29, 2019.

# Twitter on May 28, 2019
ggplot(data.hour.date1)+
  geom_bar(aes(x = Hour,
               y = Total.Tweets,
               fill = I('red')),
           stat = 'identity',
           alpha = 0.75,
           show.legend = FALSE)+
  geom_hline(yintercept = mean(data.hour.date1$Total.Tweets),
             col = I('black'),
             size = 1)+
  geom_text(aes(fontface = 'italic',
                label = paste('Average:',
ceiling(mean(data.hour.date1$Total.Tweets)),
                              'Tweets per hour'),
                x = 6.5,
                y = mean(data.hour.date1$Total.Tweets)+5),
            hjust = 'left',
            size = 4)+
  labs(title = 'Total Tweets per Hours - Prabowo Subianto',
       subtitle = '28 May 2019',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlab('Time of Day')+
  ylab('Total Tweets')+
  ylim(c(0,100))+
  theme_bw()+
  scale_fill_brewer(palette = 'Dark2')
  
# Twitter on May 29, 2019
ggplot(data.hour.date2)+
  geom_bar(aes(x = Hour,
               y = Total.Tweets,
               fill = I('red')),
           stat = 'identity',
           alpha = 0.75,
           show.legend = FALSE)+
  geom_hline(yintercept = mean(data.hour.date2$Total.Tweets),
             col = I('black'),
             size = 1)+
  geom_text(aes(fontface = 'italic',
                label = paste('Average:',
ceiling(mean(data.hour.date2$Total.Tweets)),
                              'Tweets per hour'),
                x = 1,
                y = mean(data.hour.date2$Total.Tweets)+6),
            hjust = 'left',
            size = 4)+
  labs(title = 'Total Tweets per Hours - Prabowo Subianto',
       subtitle = '29 May 2019',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlab('Time of Day')+
  ylab('Total Tweets')+
  ylim(c(0,100))+
  theme_bw()+
  scale_fill_brewer(palette = 'Dark2')

According to figure 2, we get a significant difference between users using the keywords "Joko Widodo" and "Prabowo Subianto". Tweets with the keyword "Joko Widodo" talk about Joko Widodo very intensely at a specific time (07:00 – 09:00 WIB), and 08:00 WIB has the largest number of tweets. It has 348 tweets. However, tweets with the keyword "Prabowo Subianto" tend to keep talking about Prabowo Subianto from May 28 to 29, 2019. From May 28 to 29, 2019, the average number of tweets uploaded per hour with the keyword "Prabowo Subianto" was 36.

# JOKO WIDODO
df.score.1 = subset(senti.jokowi,class == c('Negative','Positive'))
colnames(df.score.1) = c('Score','Text','Sentiment')
# Data viz
ggplot(df.score.1)+
  geom_density(aes(x = Score,
                   fill = Sentiment),
               alpha = 0.75)+
  xlim(c(-11,11))+
  labs(title = 'Density Plot of Sentiment Scores',
       subtitle = 'Joko Widodo',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlab('Score')+ 
  ylab('Density')+
  theme_bw()+
  scale_fill_brewer(palette = 'Dark2')+
  theme(legend.position = 'bottom',
        legend.title = element_blank())
        
# PRABOWO SUBIANTO
df.score.2 = subset(senti.prabowo,class == c('Negative','Positive'))
colnames(df.score.2) = c('Score','Text','Sentiment')
ggplot(df.score.2)+
  geom_density(aes(x = Score,
                   fill = Sentiment),
               alpha = 0.75)+
  xlim(c(-11,11))+
  labs(title = 'Density Plot of Sentiment Scores',
       subtitle = 'Prabowo Subianto',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlab('Density')+ 
  ylab('Score')+
  theme_bw()+
  scale_fill_brewer(palette = 'Dark2')+
  theme(legend.position = 'bottom',
        legend.title = element_blank())

Figure 3 is a bar chart of multiple tweets with the keywords "Joko Widodo" and "Prabowo Subianto" from May 28 to 29, 2019. As can be seen from Figure 3 (left), Twitter users talk about Prabowo Subianto less frequently on WIB from 19:00 to 23:59. This is due to the rest time of Indonesians. However, these themed tweets are always updated at midnight, because some users live abroad and some users are still active. Then, the user starts the WIB activity at 04:00, reaches the peak at 07:00, and then decreases until the WIB rises again at 12:00.

# JOKO WIDODO
df.senti.score.1 = data.frame(table(senti.jokowi$score))
colnames(df.senti.score.1) = c('Score','Freq')
# Data preprocessing
df.senti.score.1$Score = as.character(df.senti.score.1$Score)
df.senti.score.1$Score = as.numeric(df.senti.score.1$Score)
Score1 = df.senti.score.1$Score
sign(df.senti.score.1[1,1])
for (i in 1:nrow(df.senti.score.1)) {
  sign.row = sign(df.senti.score.1[i,'Score'])
  for (j in 1:ncol(df.senti.score.1)) {
    df.senti.score.1[i,j] = df.senti.score.1[i,j] * sign.row
  }
}
df.senti.score.1$Label = c(letters[1:nrow(df.senti.score.1)])
df.senti.score.1$Sentiment = ifelse(df.senti.score.1$Freq < 0,
                                    'Negative','Positive')
df.senti.score.1$Score1 = Score1
# Data visualization
ggplot(df.senti.score.1)+
  geom_bar(aes(x = Label,
               y = Freq,
               fill = Sentiment),
           stat = 'identity',
           show.legend = FALSE)+
  # Positive emotion
  geom_hline(yintercept = mean(abs(df.senti.score.1[which(df.senti.score.1$Sentiment == 'Positive'),'Freq'])),
             col = I('black'),
             size = 1)+
  geom_text(aes(fontface = 'italic',
                label = paste('Average Freq:',
ceiling(mean(abs(df.senti.score.1[which(df.senti.score.1$Sentiment == 'Positive'),'Freq'])))),
                x = 10,
                y = mean(abs(df.senti.score.1[which(df.senti.score.1$Sentiment == 'Positive'),'Freq']))+30),
            hjust = 'right',
            size = 4)+
  # Negative emotion
  geom_hline(yintercept = mean(df.senti.score.1[which(df.senti.score.1$Sentiment == 'Negative'),'Freq']),
             col = I('black'),
             size = 1)+
  geom_text(aes(fontface = 'italic',
                label = paste('Average Freq:',
ceiling(mean(abs(df.senti.score.1[which(df.senti.score.1$Sentiment == 'Negative'),'Freq'])))),
                x = 5,
                y = mean(df.senti.score.1[which(df.senti.score.1$Sentiment == 'Negative'),'Freq'])-15),
            hjust = 'left',
            size = 4)+
  labs(title = 'Barplot of Sentiments',
       subtitle = 'Joko Widodo',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlab('Score')+
  scale_x_discrete(limits = df.senti.score.1$Label,
                   labels = df.senti.score.1$Score1)+
  theme_bw()+
  scale_fill_brewer(palette = 'Dark2')

# PRABOWO SUBIANTO
df.senti.score.2 = data.frame(table(senti.prabowo$score))
colnames(df.senti.score.2) = c('Score','Freq')
# Data preprocessing
df.senti.score.2$Score = as.character(df.senti.score.2$Score)
df.senti.score.2$Score = as.numeric(df.senti.score.2$Score)
Score2 = df.senti.score.2$Score
sign(df.senti.score.2[1,1])
for (i in 1:nrow(df.senti.score.2)) {
  sign.row = sign(df.senti.score.2[i,'Score'])
  for (j in 1:ncol(df.senti.score.2)) {
    df.senti.score.2[i,j] = df.senti.score.2[i,j] * sign.row
  }
}
df.senti.score.2$Label = c(letters[1:nrow(df.senti.score.2)])
df.senti.score.2$Sentiment = ifelse(df.senti.score.2$Freq < 0,
                                    'Negative','Positive')
df.senti.score.2$Score1 = Score2
# Data visualization
ggplot(df.senti.score.2)+
  geom_bar(aes(x = Label,
               y = Freq,
               fill = Sentiment),
           stat = 'identity',
           show.legend = FALSE)+
  # Positive emotion
  geom_hline(yintercept = mean(abs(df.senti.score.2[which(df.senti.score.2$Sentiment == 'Positive'),'Freq'])),
             col = I('black'),
             size = 1)+
  geom_text(aes(fontface = 'italic',
                label = paste('Average Freq:',
ceiling(mean(abs(df.senti.score.2[which(df.senti.score.2$Sentiment == 'Positive'),'Freq'])))),
                x = 11,
                y = mean(abs(df.senti.score.2[which(df.senti.score.2$Sentiment == 'Positive'),'Freq']))+20),
            hjust = 'right',
            size = 4)+
  # Negative emotion
  geom_hline(yintercept = mean(df.senti.score.2[which(df.senti.score.2$Sentiment == 'Negative'),'Freq']),
             col = I('black'),
             size = 1)+
  geom_text(aes(fontface = 'italic',
                label = paste('Average Freq:',
ceiling(mean(abs(df.senti.score.2[which(df.senti.score.2$Sentiment == 'Negative'),'Freq'])))),
                x = 9,
                y = mean(df.senti.score.2[which(df.senti.score.2$Sentiment == 'Negative'),'Freq'])-10),
            hjust = 'left',
            size = 4)+
  labs(title = 'Barplot of Sentiments',
       subtitle = 'Prabowo Subianto',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlab('Score')+
  scale_x_discrete(limits = df.senti.score.2$Label,
                   labels = df.senti.score.2$Score1)+
  theme_bw()+
  scale_fill_brewer(palette = 'Dark2')

Figure 4 shows the emotional score density of the keywords "Joko Widodo" and "Prabowo Subianto". The score of tweets is obtained from the average score of the roots that make up tweets. Therefore, its score is given for each root, and its value is between - 10 and 10. The smaller the score, the more negative emotions in the microblog, and vice versa. According to figure 4 (left), it can be concluded that the negative emotion of tweets containing the keyword "Joko Widodo" is between - 10 and - 1, with a median score of - 4. It also applies to positive emotions (of course, there is a positive score). According to the density chart in Figure 4 (left), we found that the score of positive emotion has a fairly small variance. Therefore, we conclude that the positive emotions towards the microblog containing the keyword "Joko Widodo" are not too diverse.

Figure 4 (right) shows the emotional score density diagram containing the keyword "Prabowo Subianto". It is different from Figure 4 (left) because the negative emotion on Figure 4 (right) is between - 8 and - 1. This means that tweets don't have too many negative emotions (tweets have negative emotions, but not high enough). In addition, the distribution of negative emotion scores has two peaks between 4 and 1. However, positive emotions range from 1 to 10. Compared with figure 4 (left), the positive emotion in Figure 4 (right) has a higher variance, with two peaks in the range of 3 and 10. This shows that the microblog containing the keyword "Prabowo Subianto" has a high positive mood.

# JOKO WIDODO
df.senti.3 = as.data.frame(table(senti.jokowi$class))
colnames(df.senti.3) = c('Sentiment','Freq')
# Data preprocessing
df.pie.1 = df.senti.3
df.pie.1$Prop = df.pie.1$Freq/sum(df.pie.1$Freq)
df.pie.1 = df.pie.1 %>%
  arrange(desc(Sentiment)) %>%
  mutate(lab.ypos = cumsum(Prop) - 0.5*Prop)
# Data visualization
ggplot(df.pie.1,
       aes(x = 2,
           y = Prop,
           fill = Sentiment))+
  geom_bar(stat = 'identity',
           col = 'white',
           alpha = 0.75,
           show.legend = TRUE)+
  coord_polar(theta = 'y', 
              start = 0)+
  geom_text(aes(y = lab.ypos,
                label = Prop),
            color = 'white',
            fontface = 'italic',
            size = 4)+
  labs(title = 'Piechart of Sentiments',
       subtitle = 'Joko Widodo',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlim(c(0.5,2.5))+
  theme_void()+
  scale_fill_brewer(palette = 'Dark2')+
  theme(legend.title = element_blank(),
        legend.position = 'right')
        
# PRABOWO SUBIANTO
df.senti.4 = as.data.frame(table(senti.prabowo$class))
colnames(df.senti.4) = c('Sentiment','Freq')
# Data preprocessing
df.pie.2 = df.senti.4
df.pie.2$Prop = df.pie.2$Freq/sum(df.pie.2$Freq)
df.pie.2 = df.pie.2 %>%
  arrange(desc(Sentiment)) %>%
  mutate(lab.ypos = cumsum(Prop) - 0.5*Prop)
# Data visualization
ggplot(df.pie.2,
       aes(x = 2,
           y = Prop,
           fill = Sentiment))+
  geom_bar(stat = 'identity',
           col = 'white',
           alpha = 0.75,
           show.legend = TRUE)+
  coord_polar(theta = 'y', 
              start = 0)+
  geom_text(aes(y = lab.ypos,
                label = Prop),
            color = 'white',
            fontface = 'italic',
            size = 4)+
  labs(title = 'Piechart of Sentiments',
       subtitle = 'Prabowo Subianto',
       caption = 'Twitter Crawling 28 - 29 May 2019')+
  xlim(c(0.5,2.5))+
  theme_void()+
  scale_fill_brewer(palette = 'Dark2')+
  theme(legend.title = element_blank(),
        legend.position = 'right')

Figure 5 is a summary of the emotional scores of twitter. These microblogs are divided into negative emotions, neutral emotions and positive emotions. Negative emotion refers to the emotion whose score is lower than zero, neutral refers to the emotion whose score is equal to zero, and positive emotion score is greater than zero. As can be seen from Figure 5, the percentage of negative emotions of microblogs with the keyword "Joko Widodo" is lower than that of tweets with the keyword "Prabowo Subianto". There is a difference of 6.3%. The study also found that compared with the microblog with the keyword Prabowo Subianto, the microblog with the keyword "Joko Widodo" has higher neutral and positive emotions. According to piechart's research, tweets with the keyword "Joko Widodo" tend to have a higher proportion of positive emotions than tweets with the keyword "Prabowo Subianto". However, through the density map, it is found that the distribution of positive and negative emotion scores shows that microblogs containing the keyword "Prabowo Subianto" tend to have higher emotion scores than "Joko Widodo". It must be further analyzed.

Figure 6 shows the terms or words in tweets (keywords "Joko Widodo" and "Prabowo Subianto") frequently uploaded by users from May 28 to 29, 2019. Through this WordCloud visualization, you can find hot topics, which are discussed for keywords. For tweets containing the keyword "Joko Widodo", we found that the terms "tuang", "petisi", "negara", "aman" and "nusantara" were the top five, and each tweet appeared the most. However, tweets containing the keyword "Joko Widodo" found that "Prabowo", "Subianto", "kriminalisasi", "selamat" and "dubai" were the top five words in each tweet. This indirectly shows the pattern of tweets uploaded with the keyword "Prabowo Subianto", that is, it is almost certain that each uploaded tweet directly contains the name of "Prabowo Subianto", rather than through reference (@). This is because in text preprocessing, the reference (@) has been deleted.

You can go to my GitHub repo to find the code: https://github.com/audhiaprilliant/Indonesia-Public-Election-Twitter-Sentiment-Analysis

Reference reference

[1] K. Borau, C. Ullrich, J. Feng, R. Shen. Microblogging for Language Learning: Using Twitter to Train Communicative and Cultural Competence (2009), Advances in Web-Based Learning — ICWL 2009, 8th International Conference, Aachen, Germany, August 19–21, 2009.

Original link: https://towardsdatascience.com/twitter-data-visualization-fb4f45b63728

Welcome to panchuang AI blog:
http://panchuang.net/

Official Chinese document of sklearn machine learning:
http://sklearn123.com/

Welcome to panchuang blog resources summary station:
http://docs.panchuang.net/

Posted by johnie on Wed, 11 May 2022 12:40:22 +0300