TF-IDF model for NLP text keyword extraction: epidemic text data analysis based on jieba word segmentation and wordcloud


Recently, we did a text data analysis of China's policies on COVID-19. This post introduces the relevant techniques as a summary and consolidation, and hopefully it will help others as well.

1, TF-IDF: keyword extraction

Stop words: words that are automatically filtered out before or after processing natural language data (or text), in order to save storage space and improve search efficiency in information retrieval.

If a word is rare across the corpus but appears many times in a given article, it reflects what that article is about, and is therefore a keyword.

1. Word frequency TF

$$\text{word frequency (TF)} = \frac{\text{number of times the word appears in the article}}{\text{total number of words in the article}}$$

2. Inverse document frequency IDF

The larger the IDF, the rarer the word is across the whole corpus and the more distinctive it is; words that appear in almost every document get an IDF close to zero.
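A commonly used form of the definition (implementations differ in the exact smoothing term) is:

$$\text{inverse document frequency (IDF)} = \log\frac{\text{total number of documents in the corpus}}{\text{number of documents containing the word} + 1}$$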

TF-IDF = word frequency * inverse document frequency
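To make the two factors concrete, here is a minimal sketch that computes TF-IDF by hand on a tiny made-up corpus (the documents and words are invented purely for illustration):

import math

# Toy corpus: each "document" is already split into words
docs = [
    ["epidemic", "mask", "policy", "mask"],
    ["travel", "policy", "economy"],
    ["economy", "recovery", "travel"],
    ["policy", "vaccine", "epidemic"],
]

def tf(word, doc):
    # Occurrences of the word divided by the total number of words in the document
    return doc.count(word) / len(doc)

def idf(word, docs):
    # log(total documents / (documents containing the word + 1))
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / (containing + 1))

for word in ["mask", "epidemic", "policy"]:
    print(word, round(tf(word, docs[0]) * idf(word, docs), 4))

# "mask" scores highest in docs[0]: frequent in this document, rare elsewhere.
# "policy" appears in almost every document, so its score drops to 0.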

2, jieba word segmentation

jieba is a Python library for Chinese word segmentation, with many built-in methods that make it easy to use.

pip install jieba
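Before processing whole files, a quick look at what the library does (the sample sentence is made up purely for illustration):

import jieba
import jieba.analyse

sentence = "我们对新冠疫情相关的政策文本做了关键词分析"   # sample sentence, for illustration only

print(jieba.lcut(sentence))                           # precise mode: split into a list of words
print(jieba.lcut(sentence, cut_all=True))             # full mode: list every possible word
print(jieba.analyse.extract_tags(sentence, topK=3))   # top 3 keywords by TF-IDF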

The following demonstrates word segmentation and keyword extraction with jieba.

1. First, import the jieba library

import pandas as pd
import jieba
import os
import jieba.analyse

2. Get all file names under the folder

If you want to follow along, collect some txt files and put them in a folder (the more data, the better).

content_S = []   # Top keywords extracted from each txt file
contents = []    # Paths of the txt files
path1 = r"policy_2020"
files_1 = os.listdir(path1)   # Get the names of all files in the path1 folder
# print(files_1)

3. Traverse each file and use jieba to extract keywords

A note on this step: my original version used a nested loop here, which re-processed all earlier files on every pass and blew up the time complexity; with 2,144 txt files it took me about 5 hours to finish this part. The version below goes through each file exactly once, and more experienced readers are welcome to suggest further improvements.

for i in range(len(files_1)):
    new_path = os.path.join(path1, files_1[i])
    contents.append(new_path)
    with open(new_path, 'r', encoding='utf-8') as f:
        myString = f.read().replace(' ', '').replace('\n', '')
    # Take the top five keywords of this file
    tags = jieba.analyse.extract_tags(myString, topK=5)
    # print(tags)
    content_S.append(tags)
    print("complete", i)

4. Read the segmented content_S and the stop words

df_content=pd.DataFrame({'content_S':content_S})
df_content.head()
# Stop words
stopwords = pd.read_csv("baidu_stopwords.txt",index_col=False,sep="\t",quoting=2,names=['stopword'],encoding='utf-8')
stopwords.head()

5. Stop word filtering logic

def drop_stopwords(contents, stopwords):
    # Remove stop words from each word list and collect all remaining words
    contents_clean = []
    all_words = []
    for line in contents:
        line_clean = []
        for word in line:
            if word in stopwords:
                continue
            line_clean.append(word)
            all_words.append(str(word))
        contents_clean.append(line_clean)
    return contents_clean, all_words
cons = df_content.content_S.values.tolist()
stopwords = stopwords.stopword.values.tolist()
contents_clean, all_words = drop_stopwords(cons, stopwords)
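One small performance note: the membership test in drop_stopwords scans the whole stop word list for every word. Converting the list to a set makes each test O(1) and is noticeably faster on a couple of thousand documents:

stopword_set = set(stopwords)   # O(1) membership tests instead of scanning a list
contents_clean, all_words = drop_stopwords(cons, stopword_set)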

6. View data after stop word processing

df_content=pd.DataFrame({'content_clean':contents_clean})
df_content.head()
# Keywords
df_all_words=pd.DataFrame({'all_words':all_words})
df_all_words.head()
# Check the frequency of every keyword after segmentation and stop word removal
words_count=df_all_words.groupby(by=['all_words'])['all_words'].agg([("count","count")])
words_count=words_count.reset_index().sort_values(by=["count"],ascending=False)
words_count.head(20)
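An equivalent, slightly shorter way to get the same ranking is value_counts:

# Count each word and sort in descending order in one call
words_count = df_all_words['all_words'].value_counts().reset_index()
words_count.columns = ['all_words', 'count']
words_count.head(20)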

3, Draw word cloud

Using the wordcloud library, we can draw a word cloud that visualizes the high-frequency words.

First, install it with pip install wordcloud, then import it:

from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib
matplotlib.rcParams['figure.figsize'] = (10.0,5.0)
#Set the font, background color and maximum font size. If you don't have simhei.ttf, point font_path at any Chinese font file on your machine
wordcloud=WordCloud(font_path="./data/simhei.ttf",background_color="white",max_font_size=80)
word_frequence = {x[0]:x[1] for x in words_count.head(100).values}
wordcloud=wordcloud.fit_words(word_frequence)
plt.imshow(wordcloud)
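Optionally, hide the axes and save the cloud to an image file (the file name here is just an example):

plt.axis("off")                      # hide the tick marks and axis labels
wordcloud.to_file("wordcloud.png")   # save the rendered cloud to disk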

The result is as follows



(Word cloud of the top 100 keywords)
