TF-IDF model: analysis of epidemic text data based on jieba word segmentation and wordcloud
Recently, I did a text-data analysis of China's COVID-19 policies. This post summarizes and consolidates the relevant knowledge, and I hope it helps more people.
1, TF-IDF: keyword extraction
Stop words: words that are automatically filtered out before or after processing natural-language data (or text), in order to save storage space and improve search efficiency in information retrieval.
If a word is rare in general but appears many times in a given article, it reflects what that article is about, so it is a keyword.
1. Word frequency TF
$\text{Word frequency (TF)} = \dfrac{\text{number of times a word appears in the article}}{\text{total number of words in the article}}$
2. Inverse document frequency IDF
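The standard formula is (the +1 in the denominator avoids division by zero when a word appears in no document):

$\text{Inverse document frequency (IDF)} = \log\dfrac{\text{total number of documents in the corpus}}{\text{number of documents containing the word} + 1}$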
The rarer a word is across the whole corpus, the larger its IDF, and the more weight the word carries when it does appear in a document.
TF-IDF = word frequency (TF) × inverse document frequency (IDF)
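To make the two formulas concrete, here is a minimal hand-rolled sketch on a toy corpus (the documents and words are made up for illustration; jieba.analyse.extract_tags, used later in this post, computes this internally with its own IDF table):

import math

# toy corpus: three already-tokenized "documents" (made-up data)
docs = [
    ["epidemic", "policy", "control", "policy"],
    ["epidemic", "travel", "mask"],
    ["economy", "policy", "recovery"],
]

def tf_idf(word, doc, corpus):
    tf = doc.count(word) / len(doc)           # word frequency within this document
    df = sum(1 for d in corpus if word in d)  # number of documents containing the word
    idf = math.log(len(corpus) / (df + 1))    # +1 avoids division by zero
    return tf * idf

print(tf_idf("control", docs[0], docs))  # rare word: relatively high score (about 0.10)
print(tf_idf("policy", docs[0], docs))   # appears in most documents: IDF drives the score to 0.0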
2, jieba word segmentation
jieba is a Python library for Chinese word segmentation, with many built-in methods that make it easy to use.
pip install jieba
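As a quick sanity check that the install works (a minimal sketch; the sample sentence is arbitrary):

import jieba

# lcut returns the segmentation result as a list of words
print(jieba.lcut("我们对新冠疫情政策文本进行分析"))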
The following is a demonstration of jieba word segmentation.
1. First, import the jieba library
import pandas as pd
import jieba
import os
import jieba.analyse
2. Get all file names under the folder
If you want to follow along, collect some txt files and put them into a folder (the more data, the better).
content_S = []  # top keywords of each txt file
contents = []   # full path of each txt file
path1 = r"policy_2020"
files_1 = os.listdir(path1)  # get the names of all files in the path1 folder
# print(files_1)
3. Traverse each file and use jieba for word segmentation
Here I use a very naive approach with an unnecessarily high time complexity. It is not recommended when you have a large amount of data, but it is fine for learning, and experienced readers are welcome to suggest improvements. Because of the data volume, this part took me 5 hours to finish (2144 txt files).
for i in range(len(files_1)):
    new_path = path1 + '\\' + files_1[i]
    contents.append(new_path)
# print(contents)
for j in range(len(contents)):
    with open(contents[j], 'r', encoding='utf-8') as f:
        myString = f.read().replace(' ', '').replace('\n', '')
    # take the top five keywords
    tags = jieba.analyse.extract_tags(myString, topK=5)
    # print(tags)
    content_S.append(tags)
    print("complete", j)
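Since suggestions were invited: here is a leaner sketch of the same step (my variant, not the original code), using a single pass and os.path.join so that the path handling also works outside Windows:

content_S = []
for name in os.listdir(path1):
    file_path = os.path.join(path1, name)  # portable path joining
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read().replace(' ', '').replace('\n', '')
    # keep the top five keywords per file
    content_S.append(jieba.analyse.extract_tags(text, topK=5))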
4. Read the segmented content_S and the stop words
df_content = pd.DataFrame({'content_S': content_S})
df_content.head()
# stop words
stopwords = pd.read_csv("baidu_stopwords.txt", index_col=False, sep="\t",
                        quoting=2, names=['stopword'], encoding='utf-8')
stopwords.head()
5. Stop-word filtering logic
def drop_stopwords(contents, stopwords):
    contents_clean = []
    all_words = []
    for line in contents:
        line_clean = []
        for word in line:
            if word in stopwords:
                continue
            line_clean.append(word)
            all_words.append(str(word))
        contents_clean.append(line_clean)
    return contents_clean, all_words

cons = df_content.content_S.values.tolist()
stopwords = stopwords.stopword.values.tolist()
contents_clean, all_words = drop_stopwords(cons, stopwords)
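One easy speed-up (my suggestion, not part of the original post): the check "word in stopwords" scans the whole list on every call, which is O(n); passing a set instead makes each lookup O(1) on average, with no change to drop_stopwords itself.

# same call, but with O(1) average-time membership tests
stopword_set = set(stopwords)
contents_clean, all_words = drop_stopwords(cons, stopword_set)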
6. View the data after stop-word removal
df_content = pd.DataFrame({'content_clean': contents_clean})
df_content.head()
# keywords
df_all_words = pd.DataFrame({'all_words': all_words})
df_all_words.head()
# check the frequency of all keywords after jieba word segmentation
words_count = df_all_words.groupby(by=['all_words'])['all_words'].agg([("count", "count")])
words_count = words_count.reset_index().sort_values(by=["count"], ascending=False)
words_count.head(20)
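As a side note (an equivalent shortcut, not the original code), pandas' value_counts builds the same frequency table already sorted in descending order:

# same frequency table, already sorted by count descending
df_all_words['all_words'].value_counts().head(20)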
3, Draw the word cloud
Using the wordcloud library, we can draw a word cloud to visualize the high-frequency words.
First, install it: pip install wordcloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib

matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)
# set the font, background color and maximum font size;
# if simhei.ttf is missing, any Chinese font file will do
wordcloud = WordCloud(font_path="./data/simhei.ttf", background_color="white", max_font_size=80)
word_frequence = {x[0]: x[1] for x in words_count.head(100).values}
wordcloud = wordcloud.fit_words(word_frequence)
plt.imshow(wordcloud)
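A small optional polish (not in the original post): hide the axis ticks so that only the cloud is shown.

plt.axis('off')  # hide the axis ticks and frame around the word cloud
plt.show()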
Running the code displays the word cloud of the top 100 high-frequency words.