TF-IDF, TextRank, and WordCount: three methods for English keyword extraction (Python implementation)

Source code: https://github.com/Cpaulyz/BigDataAnalysis/tree/master/Assignment2

Data preprocessing

Before keyword extraction, you need to perform a series of preprocessing on the source file:

  • Extract the PDF text into TXT files
  • Sentence segmentation
  • Tokenization (stemming, lemmatization)
  • Filtering out numbers and special characters, and converting to lowercase

Extract PDF

Extract the PDF text using the Apache PDFBox tool.

The dependencies are as follows:

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.13</version>
</dependency>

The extraction logic of the utility class utils/PDFParser is as follows:

try {
    // Read the PDF folder and save the PDF file path into an Array
    File dir = new File("src\\main\\resources\\ACL2020");
    ArrayList<String> targets = new ArrayList<String>();
    for(File file:dir.listFiles()){
        if(file.getAbsolutePath().endsWith(".pdf")){
            targets.add(file.getAbsolutePath());
        }
    }
    // readPdf is the extraction method
    for(String path:targets){
        readPdf(path);
    }
} catch (Exception e) {
    e.printStackTrace();
}

At this point, the text of each PDF file has been extracted and saved as a TXT file for the subsequent steps.

Sentence segmentation

Use the nltk library in Python for sentence segmentation:

from nltk.tokenize import sent_tokenize
sens = sent_tokenize(text)  # text is the full document string read from the TXT file

The resulting sentence splits are reasonably accurate.
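As a quick illustration, sent_tokenize splits plain text into sentences like this (the sample string below is made up for demonstration, not taken from the ACL corpus):

from nltk.tokenize import sent_tokenize

# nltk.download('punkt') is required once before first use
sample = "We propose a new model. It outperforms strong baselines on two benchmarks."
print(sent_tokenize(sample))
# ['We propose a new model.', 'It outperforms strong baselines on two benchmarks.']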

Tokenization (stemming, lemmatization)

nltk provides a lemmatization tool (WordNetLemmatizer). The API is as follows:

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
print(wnl.lemmatize('ate', 'v'))
print(wnl.lemmatize('fancier', 'a'))

# Output: eat fancy

However, this lemmatization method requires the word's part of speech to be passed in. Fortunately, nltk also provides a part-of-speech tagger, and we wrap the tag mapping as follows:

# Get the part of speech of the word
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

Combining the two, the call looks like this:

from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

tokens = word_tokenize(sentence)  # tokenization
tagged_sent = pos_tag(tokens)  # part-of-speech tagging

wnl = WordNetLemmatizer()
lemmas_sent = []
for tag in tagged_sent:
    wordnet_pos = get_wordnet_pos(tag[1]) or wordnet.NOUN
    lemmas_sent.append(wnl.lemmatize(tag[0], pos=wordnet_pos))  # lemmatization

The tokenization and lemmatization results are generally good, but some problems remain:

  1. Special symbols such as ; : , and . are not removed

  2. Numbers are not removed

  3. Stop words such as a, the, and of are not removed

Filtering

Problems 1 and 2 are easy to handle with a regular expression.

For problem 3, we filter with the list of English stop words provided by nltk, plus the heuristic that any string shorter than 4 characters is treated as invalid.

import re
from nltk.corpus import stopwords

invalid_word = stopwords.words('english')

# Token filter: tokens for which this returns False are discarded
def is_valid(word):
    if re.match(r"[()\-:;,.0-9]+", word):
        return False
    elif len(word) < 4 or word in invalid_word:
        return False
    else:
        return True
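Putting the preprocessing steps together, a single function can take the raw text of one article and return filtered, lemmatized word lists per sentence. The sketch below is our assumed combination of the snippets above (the helper name preprocess_article is not from the original code); it reuses get_wordnet_pos and is_valid defined earlier:

from nltk import sent_tokenize, word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

def preprocess_article(text):
    # Returns a list of sentences, each a list of lowercased, lemmatized, filtered words
    result = []
    for sentence in sent_tokenize(text):
        tagged = pos_tag(word_tokenize(sentence))
        words_in_sentence = []
        for word, tag in tagged:
            pos = get_wordnet_pos(tag) or wordnet.NOUN
            lemma = wnl.lemmatize(word.lower(), pos=pos)
            if is_valid(lemma):
                words_in_sentence.append(lemma)
        result.append(words_in_sentence)
    return result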

Method 1 TF-IDF

The structured process of extracting keywords by TF-IDF algorithm is as follows:

1.1 Sentence segmentation and tokenization

It is the same as data preprocessing and will not be repeated

1.2 construction of corpus

Since computing IDF requires a corpus, we build one from all the articles and store it in all_dic = {}.

all_dic is a Map whose structure is (article name → Map<word, frequency>).

An example is as follows

{
	'A Generative Model for Joint Natural Language Understanding and Generation.txt': 
		{'natural': 13, 
		'language': 24, 
		'understanding': 4,
		'andnatural': 1, 
		'generation': 9, 
		'twofundamental': 1,
		...
		},
	...
}
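The corpus-building code is not shown in the post; a minimal sketch, assuming the TXT files extracted earlier and the hypothetical preprocess_article helper sketched in the preprocessing section, could look like this:

import os

all_dic = {}  # {article name: {word: frequency}}

def build_corpus(txt_dir):
    for filename in os.listdir(txt_dir):
        if not filename.endswith('.txt'):
            continue
        with open(os.path.join(txt_dir, filename), encoding='utf-8') as f:
            text = f.read()
        word_freq = {}
        for sentence_words in preprocess_article(text):
            for word in sentence_words:
                word_freq[word] = word_freq.get(word, 0) + 1
        all_dic[filename] = word_freq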

1.3 calculate TF-IDF

(1)TF

Term frequency (TF) refers to the number of times a given word appears in the file. This number is usually normalized (usually the word frequency divided by the total number of words in the article) to prevent it from being biased towards long documents. (the same word may have a higher word frequency in a long file than in a short file, regardless of whether the word is important or not.)

TF = article_dict[word] / article_word_counts

(2)IDF

The main idea of inverse document frequency (IDF) is: the fewer the documents containing term t, the larger its IDF, which indicates that the term has good discriminating power. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word and then taking the logarithm of the quotient.

contain_count = 1  # number of documents containing the word; start at 1 for smoothing (avoids division by zero)
for article1 in all_dic.keys():
    if word in all_dic[article1].keys():
        contain_count += 1
IDF = log(article_nums / contain_count)

(3)TF-IDF

The implementation core code is as follows:

from math import log

def TFIDF():
    article_nums = len(all_dic)
    for article in all_dic.keys():
        article_dict: dict = all_dic[article]
        article_word_counts = 0
        for count in article_dict.values():
            article_word_counts += count
        local_dict = {}
        for word in article_dict:
            TF = article_dict[word] / article_word_counts
            contain_count = 1  # number of documents containing the word; start at 1 for smoothing
            for article1 in all_dic.keys():
                if word in all_dic[article1].keys():
                    contain_count += 1
            IDF = log(article_nums / contain_count)
            local_dict[word] = TF * IDF
        all_dic[article] = local_dict  # Replace word frequency with TFIDF

1.4 output results

It is worth mentioning that, since TF-IDF is a corpus-based algorithm and our corpus is all the articles of ACL 2020, the extracted TF-IDF value measures a keyword's weight within its own article relative to that corpus.

Therefore, through this method, we generate the keywords of each article rather than the keywords of the corpus.

Here, we select the word with the highest TF-IDF value in each article and write it, together with its weight (the TF-IDF value), to method1_dict.txt, sorted alphabetically by article title.
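The writing-out step itself is not included in the code above; a minimal sketch, assuming all_dic now holds TF-IDF values after TFIDF() has run, might be:

with open('method1_dict.txt', 'w', encoding='utf-8') as out:
    for article in sorted(all_dic.keys()):  # alphabetical by article title
        word_weights = all_dic[article]
        best_word = max(word_weights, key=word_weights.get)  # word with the highest TF-IDF
        out.write('%s %s\n' % (best_word, word_weights[best_word]))

The resulting output, one top word per article, is: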

unlabelled 0.03366690429509488
database 0.025963153344621098
triplet 0.06007324859328521
anaphor 0.054325239855360946
sparse 0.05140787295501171
dialog 0.02857688733696682
evaluator 0.047046849916043215
article 0.03181976626426247
dialogue 0.05009864522556742
false 0.05046963249913187
explanation 0.06756267918534663
keyphrases 0.07257334117762049
switch 0.02057258339292402
response 0.03487928535131968
hcvae 0.01490817643452481
response 0.01691069785427619
fragment 0.036740214670107636
concept 0.10144398960055125
node 0.026861943279698357
type 0.021568639909022032
hierarchy 0.04174740425673965
legal 0.09062083506033958
confidence 0.03208193690887942
question 0.018326715354972434
follow-up 0.0768915254934173
graph 0.030139792811985255
quarel 0.03142980753777034
instruction 0.04310656492734328
summary 0.023522349291620226
mutual 0.021794659657633334
malicious 0.03361252033133951
nucleus 0.03062106234461863
supervision 0.02716542294214428
relation 0.026017607441275774
calibrator 0.053113533081036744
centrality 0.06527959271708282
question 0.015813880735872966
slot 0.04442739804723785
graph 0.017963145985978687
taxonomy 0.05263359765861765
question 0.01694100733341999
transformer 0.019573842786351815
response 0.027652528223249546
topic 0.04541019920353925
paraphrase 0.024098507886884227

Method 2 TextRank

The structured process of extracting keywords by TextRank algorithm is as follows

2.1 Sentence segmentation

This is the same as the sentence segmentation in the preprocessing stage and is not repeated here.

2.2 establishing relationship matrix

Build the relationship matrix M of size n*n, where n is the number of distinct words (each word is counted only once); a nonzero M[i][j] indicates that word j and word i are related, and M[i][j] is the weight of that relation.

Relationship is defined as follows:

Given a window size win, within each sentence (after removing stop words, punctuation, and invalid words) each word is related to every word within a distance of win.

To make the relationship matrix easier to represent, a Map (String word → Array related_words) is used to express the word → related_words relations. An example (taken from http://www.hankcs.com/nlp/textrank-algorithm-to-extract-the-keywords-java-implementation.html) follows:

Tokenized sentence = [programmer, English, program, development, maintenance, professional, personnel, programmer, divided into, program, design, personnel, program, coding, personnel, boundary, special, China, software, personnel, divided into, programmer, senior, programmer, system, analyst, project, manager]

Then windows of size 5 are established, and each word votes for the words within a distance of 5 before and after it:

{development = [professional, programmer, maintenance, English, program, personnel],

Software = [programmer, divided into, boundary, advanced, China, special, personnel],

Programmer = [development, software, analyst, maintenance, system, project, manager, divided into, English, program, major, design, senior, personnel, China],

Analyst = [programmer, system, project, manager, senior],

Maintenance = [professional, developer, programmer, divided into, English, program, personnel],

System = [programmer, analyst, project manager, divided into, senior],

Project = [programmer, analyst, system, manager, senior],

Manager = [programmer, analyst, system, project],

Divided into = [professional, software, design, programmer, maintenance, system, advanced, program, China, special, personnel],

English = [professional, development, programmer, maintenance, program],

Program = [professional, development, design, programmer, coding, maintenance, boundary, divided into, English, special, personnel],

Special = [software, coding, division, boundary, program, China, personnel],

Major = [development, programmer, maintenance, divided into, English, program, personnel],

Design = [programmer, code, divided into, program, personnel],

Code = [design, boundary, procedure, China, special, personnel],

Boundary = [software, code, program, China, special, personnel],

Senior = [programmer, software, analyst, system, project, divided into, personnel],

China = [programmer, software, coding, divided into, boundary, special, personnel],

Personnel = [development, programmer, software, maintenance, divided into, program, special, professional, design, coding, boundary, advanced, China]}

The implementation code is as follows

words = {}  # global map: word -> list of related words

def add_to_dict(word_list, windows=5):
    valid_word_list = []  # Filter first
    for word in word_list:
        word = str(word).lower()
        if is_valid(word):
            valid_word_list.append(word)
    # Establish relationship according to window
    if len(valid_word_list) < windows:
        win = valid_word_list
        build_words_from_windows(win)
    else:
        index = 0
        while index + windows <= len(valid_word_list):
            win = valid_word_list[index:index + windows]
            index += 1
            build_words_from_windows(win)

# Build the relations from one small window into the words map
def build_words_from_windows(win):
    for word in win:
        if word not in words.keys():
            words[word] = []
        for other in win:
            if other == word or other in words[word]:
                continue
            else:
                words[word].append(other)
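The driver that feeds sentences into add_to_dict is not shown; a minimal sketch, assuming the raw article text and nltk tokenization as in the preprocessing section (build_word_graph is our name for the wrapper), might be:

from nltk import sent_tokenize, word_tokenize

def build_word_graph(text, windows=5):
    # add_to_dict lowercases and filters the tokens itself
    for sentence in sent_tokenize(text):
        add_to_dict(word_tokenize(sentence), windows)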

2.3 iteration

The TextRank scoring formula is similar to PageRank's.
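For reference, the scoring formula from the original TextRank paper, which the code below follows, is

$$WS(V_i) = (1 - d) + d \cdot \sum_{V_j \in In(V_i)} \frac{WS(V_j)}{|Out(V_j)|}$$

where d is the damping factor (0.85 in the code) and |Out(V_j)| is the number of words linked to word j.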

There are two termination conditions for iteration:

  1. max_diff < the specified threshold, indicating convergence

  2. the number of iterations reaches max_iter, indicating the upper limit on iterations has been hit

The code implementation is as follows

def text_rank(d=0.85, max_iter=100):
    min_diff = 0.05
    words_weight = {}  # {word: weight}
    for word in words.keys():
        words_weight[word] = 1 / len(words.keys())
    for i in range(max_iter):
        n_words_weight = {}  # {word: weight}
        max_diff = 0
        for word in words.keys():
            n_words_weight[word] = 1 - d
            for other in words[word]:
                if other == word or len(words[other]) == 0:
                    continue
                n_words_weight[word] += d * words_weight[other] / len(words[other])
            max_diff = max(n_words_weight[word] - words_weight[word], max_diff)
        words_weight = n_words_weight
        print('iter', i, 'max diff is', max_diff)
        if max_diff < min_diff:
            print('break with iter', i)
            break
    return words_weight

2.4 output results

We select the top 30 keywords. In this method, the weight is the score computed by TextRank, and the results are saved in method2_dict.txt.
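The selection step is not shown in the code above; a minimal sketch, assuming the words_weight dict returned by text_rank(), might be:

words_weight = text_rank()
top30 = sorted(words_weight.items(), key=lambda kv: kv[1], reverse=True)[:30]
with open('method2_dict.txt', 'w', encoding='utf-8') as out:
    for word, weight in top30:
        out.write('%s %s\n' % (word, weight))

The top 30 keywords and their TextRank scores are: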

 model 176.5304347133946
 question 85.40181168045564
 response 62.507994652932325
 data 60.65722815422958
 method 59.467011421798766
 result 58.625521805302576
 show 58.328949197586205
 graph 57.56085447050974
 answer 56.016412290514324
 generate 53.04744866326927
 example 52.68958963476476
 training 52.109756756305856
 also 51.35655567676399
 input 50.69980375572206
 word 50.52677865990237
 train 49.34118286080509
 representation 48.497427796293245
 sentence 48.21207111035171
 dataset 48.07840701700186
 work 47.57844139247928
 system 47.03771276235998
 propose 46.88347913956473
 task 46.518530285062205
 performance 45.70988317875179
 base 45.675096486932375
 different 44.92213315873288
 score 43.76950706001539
 test 42.996530025663326
 give 42.40794849944198
 information 42.39192128940212

Method 3 WordCount

The last method is simple word-frequency counting. The idea is straightforward: count how often each word occurs, on the assumption that the more frequently a word appears, the more likely it is to be a keyword. The process is as follows:

3.1 Sentence segmentation and tokenization

Same as the preprocessing stage; not repeated here.

3.2 Counting word frequencies

Use a Map to store (word → frequency):

dic = {}

def add_to_dict(word_list):
    for word in word_list:
        word = str(word).lower()
        if is_valid(word):
            if word in dic.keys():
                dic[word] += 1
            else:
                dic[word] = 1

3.3 Output results

Select the top 30 keywords; the output is as follows. In this method, the weight is the word frequency, and the results are saved in method3_dict.txt.

model 1742
question 813
response 579
graph 515
data 490
method 464
show 456
result 447
answer 445
representation 408
generate 398
example 394
training 393
word 387
dataset 377
sentence 368
input 365
propose 360
train 351
test 349
system 345
also 342
task 330
performance 327
score 325
different 315
work 312
document 304
base 294
information 293
