1. Implementation steps of text classification:
Definition stage: define the data and the classification scheme, i.e. which categories are used and which data are needed
Data preprocessing: segment the documents into words and remove stop words
Feature extraction: reduce the dimensionality of the document matrix and keep the most useful features of the training set
Model training stage: choose a specific classification model and algorithm and train the text classifier
Evaluation stage: test and evaluate the classifier's performance on the test set
Application stage: apply the best-performing model to classify new, unseen documents
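As a rough end-to-end sketch of these stages (a minimal illustration with made-up toy data and a TF-IDF + naive Bayes pipeline; the real project code appears in section 4):

# Minimal sketch of the stages above (toy data; not the project code)
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["win free cash now", "free prize waiting", "meeting at 10am", "project update attached"]
train_labels = [0, 0, 1, 1]                     # 0 = spam, 1 = normal mail (definition stage)

pipe = Pipeline([
    ("features", TfidfVectorizer()),            # preprocessing + feature extraction stage
    ("clf", MultinomialNB()),                   # model selection stage
])
pipe.fit(train_texts, train_labels)             # training stage
print(pipe.score(["win a free prize", "see the attached update"], [0, 1]))  # evaluation stage (accuracy)
print(pipe.predict(["free cash prize now"]))    # application stage: classify a new document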
2. Several classical feature extraction methods:
Bag-of-words (BOW): the most basic feature set, in which each word (each token after segmentation) is one feature.
This often yields a data set with tens of thousands of features. Simple criteria can filter out words that do not help classification, e.g. removing stop words or ranking words by mutual information.
Even so, the feature dimensionality stays high and each individual feature carries little information.
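A minimal sketch of what the bag-of-words representation looks like (toy sentences, scikit-learn assumed):

# Bag-of-words sketch: each distinct word becomes one feature, the value is its count
from sklearn.feature_extraction.text import CountVectorizer

docs = ["free cash prize", "meeting tomorrow morning", "free meeting invite"]  # toy corpus
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)            # sparse document-term matrix
print(vectorizer.get_feature_names_out())     # the vocabulary (get_feature_names() on older scikit-learn)
print(X.toarray())                            # one row per document, one column per word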
Statistical features: the TF-IDF method (term frequency - inverse document frequency) uses word-level statistics as the feature set, so each feature has a clear physical meaning. It looks more principled than bag-of-words, but in practice the results are often about the same.
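For reference, in the smoothed form that scikit-learn uses (and that the extractor in section 4 relies on), the statistic is roughly:
tf-idf(t, d) = tf(t, d) × idf(t), with idf(t) = ln((1 + N) / (1 + df(t))) + 1
where tf(t, d) is the count of term t in document d, N is the number of documents, and df(t) is the number of documents containing t; each document vector is then L2-normalized.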
N-gram: a model that takes word order into account, i.e. an n-order Markov chain; each sample is converted into a matrix of transition probabilities. It works well in practice.
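In practice, word-order information is often captured more simply by using n-gram counts as features rather than an explicit transition matrix; a small sketch with CountVectorizer's ngram_range (toy data, not part of the original code):

# Bigram features: every pair of adjacent tokens becomes a feature, preserving some word order
from sklearn.feature_extraction.text import CountVectorizer

docs = ["please transfer the money", "the money transfer failed"]   # toy corpus
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))             # bigrams only
X = bigram_vectorizer.fit_transform(docs)
print(bigram_vectorizer.get_feature_names_out())                    # e.g. 'please transfer', 'transfer the', ...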
3. Classifier methods:
Naive Bayes (NB)
For a given training set, naive Bayes first learns the joint probability distribution P(X,Y) of input and output under the assumption that the features are conditionally independent given the class; then, for a given input x, it uses Bayes' theorem to find the output y with the largest posterior probability.
Assuming conditional independence of the features, the joint distribution P(X,Y) is learned from the training set:
P(X, Y) = P(Y|X)·P(X) = P(X|Y)·P(Y)
Rearranging this identity gives the general form of Bayes' theorem:
P(Y|X) = P(X|Y)·P(Y) / P(X)
where the denominator comes from the law of total probability:
P(X) = Σ_k P(X|Y=c_k)·P(Y=c_k)
Therefore, using the conditional-independence assumption P(X|Y=c_k) = Π_j P(X^(j)=x^(j)|Y=c_k), naive Bayes can be expressed as:
y = argmax_{c_k} [ P(Y=c_k)·Π_j P(X^(j)=x^(j)|Y=c_k) ] / [ Σ_k P(Y=c_k)·Π_j P(X^(j)=x^(j)|Y=c_k) ]
To simplify the computation, the denominator, which is the same for every class, can be dropped:
y = argmax_{c_k} P(Y=c_k)·Π_j P(X^(j)=x^(j)|Y=c_k)
Advantages: simple to implement; training and prediction are efficient
Disadvantages: classification performance is not always high
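A minimal usage sketch (toy data, scikit-learn assumed; this is not the mail corpus used later): MultinomialNB estimates the class priors P(Y=c_k) and the per-class word probabilities from the training counts, and predict_proba returns the posterior P(Y|X) from the formula above.

# Multinomial naive Bayes on bag-of-words counts (toy data)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["win free cash now", "free prize waiting", "project meeting notes", "lunch meeting tomorrow"]
train_y = [0, 0, 1, 1]                       # 0 = spam, 1 = normal mail

vec = CountVectorizer()
X = vec.fit_transform(train_docs)
clf = MultinomialNB().fit(X, train_y)

test = vec.transform(["free cash prize"])
print(clf.predict(test))                     # most likely class
print(clf.predict_proba(test))               # posterior probabilities P(Y|X)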
Logistic regression (LR)
A log-linear model whose output is a probability rather than a hard class label
The linear score is mapped to a probability with the logistic (sigmoid) function, P(y=1|x) = 1 / (1 + e^-(w·x + b)), whose S-shaped curve squashes any real value into (0, 1)
For a given data set, the model parameters are estimated by maximum likelihood
Advantages: simple to implement; classification requires little computation, is fast and uses little storage
Disadvantages: prone to underfitting, so accuracy can be low
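A small numerical sketch of how a linear score is turned into a probability (the weights and input below are made up for illustration):

# Logistic (sigmoid) function: squashes any real-valued score into (0, 1)
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([1.5, -2.0]), 0.1            # hypothetical weights and bias
x = np.array([0.8, 0.3])                     # hypothetical feature vector
score = np.dot(w, x) + b                     # linear score w·x + b
print(sigmoid(score))                        # predicted probability of the positive class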
Support vector machine (SVM)
Find a hyperplane in the feature space that separates the two classes as well as possible (with the largest margin)
For linearly non-separable problems, a kernel function is introduced to implicitly map the data into a higher-dimensional space (a small sketch follows below)
Advantages: can be used for linear and non-linear classification as well as regression; low generalization error; easy to interpret; low computational cost; the derivation is elegant
Disadvantages: sensitive to the choice of parameters and kernel function
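A small hedged sketch of the kernel idea, using scikit-learn's SVC on a toy non-linearly-separable data set (the project itself later uses a linear SVM via SGDClassifier):

# SVM with a linear vs. an RBF kernel on concentric-circle data
from sklearn.svm import SVC
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)  # not linearly separable
linear_svm = SVC(kernel='linear').fit(X, y)
rbf_svm = SVC(kernel='rbf').fit(X, y)        # kernel implicitly maps the data to a higher-dimensional space
print('linear kernel accuracy:', linear_svm.score(X, y))
print('RBF kernel accuracy:', rbf_svm.score(X, y))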
4. Chinese spam classification in practice
The data set consists of two files, ham_data.txt and spam_data.txt, corresponding to normal mail and spam respectively
Dataset download
Each line of a file is one message
The main steps are:
Data loading and splitting
# Get data
def get_data():
    """
    Get data
    :return: text data and the corresponding labels
    """
    with open("../../testdata/ham_data.txt", encoding='utf-8') as ham_f, \
            open("../../testdata/spam_data.txt", encoding='utf-8') as spam_f:
        ham_data = ham_f.readlines()
        spam_data = spam_f.readlines()

        ham_label = np.ones(len(ham_data)).tolist()   # tolist converts the array to a Python list
        spam_label = np.zeros(len(spam_data)).tolist()

        corpus = ham_data + spam_data
        labels = ham_label + spam_label

    return corpus, labels


# Split data
def prepare_datasets(corpus, labels, test_data_proportion=0.3):
    """
    :param corpus: text data
    :param labels: text labels
    :param test_data_proportion: proportion of data used for the test set
    :return: training data, test data, training labels, test labels
    """
    # Fixing random_state makes the split (and hence the trained model) reproducible
    x_train, x_test, y_train, y_test = train_test_split(corpus, labels,
                                                        test_size=test_data_proportion,
                                                        random_state=42)
    return x_train, x_test, y_train, y_test


# Delete empty messages
def remove_empty_docs(corpus, labels):
    filtered_corpus = []
    filtered_labels = []
    for docs, label in zip(corpus, labels):
        # strip() removes leading/trailing whitespace; keep only non-empty documents
        if docs.strip():
            filtered_corpus.append(docs)
            filtered_labels.append(label)
    return filtered_corpus, filtered_labels
Data normalization and preprocessing
# Load stop words (strip the trailing newline from each line so lookups match the tokens)
with open("../../testdata/stop_words.utf8", encoding="utf8") as f:
    stopword_list = [word.strip() for word in f.readlines()]


# jieba word segmentation
def tokenize_text(text):
    tokens = jieba.cut(text)
    tokens = [token.strip() for token in tokens]
    return tokens


# Remove all special characters and punctuation marks
def remove_special_characters(text):
    # jieba word segmentation
    tokens = tokenize_text(text)
    # re.escape makes the punctuation characters match literally; string.punctuation is all punctuation marks
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
    filtered_tokens = filter(None, [pattern.sub('', token) for token in tokens])
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text


# Remove stop words
def remove_stopwords(text):
    # jieba word segmentation
    tokens = tokenize_text(text)
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text


# Clean data and segment words
def normalize_corpus(corpus, tokenize=False):
    normalized_corpus = []
    for text in corpus:
        # Remove all special characters and punctuation marks
        text = remove_special_characters(text)
        # Remove stop words
        text = remove_stopwords(text)
        normalized_corpus.append(text)
        if tokenize:
            text = tokenize_text(text)
            normalized_corpus.append(text)
    return normalized_corpus
Feature extraction (bag-of-words and tfidf models)
# Bag-of-words features
bow_vectorizer, bow_train_features = bow_extractor(norm_train_corpus)
bow_test_features = bow_vectorizer.transform(norm_test_corpus)

# tfidf features
tfidf_vectorizer, tfidf_train_features = tfidf_extractor(norm_train_corpus)
tfidf_test_features = tfidf_vectorizer.transform(norm_test_corpus)


# Bag-of-words model
def bow_extractor(corpus, ngram_range=(1, 1)):
    vectorizer = CountVectorizer(min_df=1, ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features


# Turn an existing bag-of-words matrix into tfidf weights
def tfidf_transformer(bow_matrix):
    transformer = TfidfTransformer(norm='l2', smooth_idf=True, use_idf=True)
    tfidf_matrix = transformer.fit_transform(bow_matrix)
    return transformer, tfidf_matrix


# tfidf model
def tfidf_extractor(corpus, ngram_range=(1, 1)):
    vectorizer = TfidfVectorizer(min_df=1, norm='l2', smooth_idf=True, use_idf=True, ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features
Training the classifiers
# Train, predict and evaluate a model
def train_predict_evaluate_model(classifier,
                                 train_features, train_labels,
                                 test_features, test_labels):
    # Build the model
    classifier.fit(train_features, train_labels)
    # Predict with the model
    predictions = classifier.predict(test_features)
    # Evaluate prediction performance
    get_metrics(true_labels=test_labels, predicted_labels=predictions)
    return predictions
Multinomial naive Bayes with bag-of-words features
Logistic regression with bag-of-words features
Support vector machine with bag-of-words features
Multinomial naive Bayes with tfidf features
Logistic regression with tfidf features
Support vector machine with tfidf features
# Naive Bayes model
mnb = MultinomialNB()
# Support vector machine model
svm = SGDClassifier(loss='hinge', n_iter_no_change=100)
# Logistic regression model
lr = LogisticRegression()

# Multinomial naive Bayes with bag-of-words features
print("Naive Bayes classifier based on bag-of-words features")
mnb_bow_predictions = train_predict_evaluate_model(classifier=mnb,
                                                   train_features=bow_train_features,
                                                   train_labels=train_labels,
                                                   test_features=bow_test_features,
                                                   test_labels=test_labels)

# Logistic regression with bag-of-words features
print("Logistic regression based on bag-of-words features")
lr_bow_predictions = train_predict_evaluate_model(classifier=lr,
                                                  train_features=bow_train_features,
                                                  train_labels=train_labels,
                                                  test_features=bow_test_features,
                                                  test_labels=test_labels)

# Support vector machine with bag-of-words features
print("Support vector machine based on bag-of-words features")
svm_bow_predictions = train_predict_evaluate_model(classifier=svm,
                                                   train_features=bow_train_features,
                                                   train_labels=train_labels,
                                                   test_features=bow_test_features,
                                                   test_labels=test_labels)

# Multinomial naive Bayes with tfidf features
print("Naive Bayes model based on tfidf features")
mnb_tfidf_predictions = train_predict_evaluate_model(classifier=mnb,
                                                     train_features=tfidf_train_features,
                                                     train_labels=train_labels,
                                                     test_features=tfidf_test_features,
                                                     test_labels=test_labels)

# Logistic regression with tfidf features
print("Logistic regression model based on tfidf features")
lr_tfidf_predictions = train_predict_evaluate_model(classifier=lr,
                                                    train_features=tfidf_train_features,
                                                    train_labels=train_labels,
                                                    test_features=tfidf_test_features,
                                                    test_labels=test_labels)

# Support vector machine with tfidf features
print("Support vector machine model based on tfidf features")
svm_tfidf_predictions = train_predict_evaluate_model(classifier=svm,
                                                     train_features=tfidf_train_features,
                                                     train_labels=train_labels,
                                                     test_features=tfidf_test_features,
                                                     test_labels=test_labels)
Accuracy, precision, recall and the F1 score are used to evaluate each model
# Evaluate predictions
def get_metrics(true_labels, predicted_labels):
    print('Accuracy:', np.round(
        metrics.accuracy_score(true_labels, predicted_labels), 2))
    print('Precision:', np.round(
        metrics.precision_score(true_labels, predicted_labels, average='weighted'), 2))
    print('Recall:', np.round(
        metrics.recall_score(true_labels, predicted_labels, average='weighted'), 2))
    print('F1 score:', np.round(
        metrics.f1_score(true_labels, predicted_labels, average='weighted'), 2))
5. Complete code
1. Data preprocessing: normalization.py
# -*- coding: utf-8 -*-
import re       # regular expressions
import string
import jieba

# Load stop words (strip the trailing newline from each line so lookups match the tokens)
with open("../../testdata/stop_words.utf8", encoding="utf8") as f:
    stopword_list = [word.strip() for word in f.readlines()]


# jieba word segmentation
def tokenize_text(text):
    tokens = jieba.cut(text)
    tokens = [token.strip() for token in tokens]
    return tokens


# Remove all special characters and punctuation marks
def remove_special_characters(text):
    # jieba word segmentation
    tokens = tokenize_text(text)
    # re.escape makes the punctuation characters match literally; string.punctuation is all punctuation marks
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
    filtered_tokens = filter(None, [pattern.sub('', token) for token in tokens])
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text


# Remove stop words
def remove_stopwords(text):
    # jieba word segmentation
    tokens = tokenize_text(text)
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text


# Clean data and segment words
def normalize_corpus(corpus, tokenize=False):
    normalized_corpus = []
    for text in corpus:
        # Remove all special characters and punctuation marks
        text = remove_special_characters(text)
        # Remove stop words
        text = remove_stopwords(text)
        normalized_corpus.append(text)
        if tokenize:
            text = tokenize_text(text)
            normalized_corpus.append(text)
    return normalized_corpus
2. Feature extraction: feature_extractors.py
# -*- coding: utf-8 -*-
# CountVectorizer counts how often each word occurs in a text
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer


# Bag-of-words model
def bow_extractor(corpus, ngram_range=(1, 1)):
    vectorizer = CountVectorizer(min_df=1, ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features


# Turn an existing bag-of-words matrix into tfidf weights
def tfidf_transformer(bow_matrix):
    transformer = TfidfTransformer(norm='l2', smooth_idf=True, use_idf=True)
    tfidf_matrix = transformer.fit_transform(bow_matrix)
    return transformer, tfidf_matrix


# tfidf model
def tfidf_extractor(corpus, ngram_range=(1, 1)):
    vectorizer = TfidfVectorizer(min_df=1, norm='l2', smooth_idf=True, use_idf=True, ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features
3. Main script: classifier.py
# -*- coding: utf-8 -*-
# date: 09/22/2020
import numpy as np
from sklearn.model_selection import train_test_split
from nlpstudycode.spam_classification.normalization import normalize_corpus          # adjust the package path to your project layout
from nlpstudycode.spam_classification.feature_extractors import bow_extractor, tfidf_extractor
import gensim
import jieba
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression


# Get data
def get_data():
    """
    Get data
    :return: text data and the corresponding labels
    """
    with open("../../testdata/ham_data.txt", encoding='utf-8') as ham_f, \
            open("../../testdata/spam_data.txt", encoding='utf-8') as spam_f:
        ham_data = ham_f.readlines()
        spam_data = spam_f.readlines()

        ham_label = np.ones(len(ham_data)).tolist()   # tolist converts the array to a Python list
        spam_label = np.zeros(len(spam_data)).tolist()

        corpus = ham_data + spam_data
        labels = ham_label + spam_label

    return corpus, labels


# Split data
def prepare_datasets(corpus, labels, test_data_proportion=0.3):
    """
    :param corpus: text data
    :param labels: text labels
    :param test_data_proportion: proportion of data used for the test set
    :return: training data, test data, training labels, test labels
    """
    # Fixing random_state makes the split (and hence the trained model) reproducible
    x_train, x_test, y_train, y_test = train_test_split(corpus, labels,
                                                        test_size=test_data_proportion,
                                                        random_state=42)
    return x_train, x_test, y_train, y_test


# Delete empty messages
def remove_empty_docs(corpus, labels):
    filtered_corpus = []
    filtered_labels = []
    for docs, label in zip(corpus, labels):
        # strip() removes leading/trailing whitespace; keep only non-empty documents
        if docs.strip():
            filtered_corpus.append(docs)
            filtered_labels.append(label)
    return filtered_corpus, filtered_labels


# Evaluate predictions
def get_metrics(true_labels, predicted_labels):
    print('Accuracy:', np.round(
        metrics.accuracy_score(true_labels, predicted_labels), 2))
    print('Precision:', np.round(
        metrics.precision_score(true_labels, predicted_labels, average='weighted'), 2))
    print('Recall:', np.round(
        metrics.recall_score(true_labels, predicted_labels, average='weighted'), 2))
    print('F1 score:', np.round(
        metrics.f1_score(true_labels, predicted_labels, average='weighted'), 2))


# Train, predict and evaluate a model
def train_predict_evaluate_model(classifier,
                                 train_features, train_labels,
                                 test_features, test_labels):
    # Build the model
    classifier.fit(train_features, train_labels)
    # Predict with the model
    predictions = classifier.predict(test_features)
    # Evaluate prediction performance
    get_metrics(true_labels=test_labels, predicted_labels=predictions)
    return predictions


def main():
    # Get data
    corpus, labels = get_data()
    print("Total number of documents:", len(labels))

    # Delete empty messages
    corpus, labels = remove_empty_docs(corpus, labels)
    print('One sample:', corpus[10])
    print('Sample label:', labels[10])
    label_name_map = ['Spam', 'Normal mail']  # label 0 = spam, label 1 = normal mail
    print('Actual type:', label_name_map[int(labels[10])], label_name_map[int(labels[5900])])

    # Split data
    train_corpus, test_corpus, train_labels, test_labels = prepare_datasets(corpus, labels,
                                                                            test_data_proportion=0.3)

    # Clean data and segment words
    norm_train_corpus = normalize_corpus(train_corpus)
    norm_test_corpus = normalize_corpus(test_corpus)

    # Bag-of-words features
    bow_vectorizer, bow_train_features = bow_extractor(norm_train_corpus)
    bow_test_features = bow_vectorizer.transform(norm_test_corpus)

    # tfidf features
    tfidf_vectorizer, tfidf_train_features = tfidf_extractor(norm_train_corpus)
    tfidf_test_features = tfidf_vectorizer.transform(norm_test_corpus)

    # Tokenize documents
    tokenized_train = [jieba.lcut(text) for text in norm_train_corpus]
    print(tokenized_train[2:10])
    tokenized_test = [jieba.lcut(text) for text in norm_test_corpus]

    # Build word2vec model
    # model = gensim.models.Word2Vec(tokenized_train,
    #                                size=500,
    #                                window=100,
    #                                min_count=30,
    #                                sample=1e-3)

    # Naive Bayes model
    mnb = MultinomialNB()
    # Support vector machine model
    svm = SGDClassifier(loss='hinge', n_iter_no_change=100)
    # Logistic regression model
    lr = LogisticRegression()

    # Multinomial naive Bayes with bag-of-words features
    print("Naive Bayes classifier based on bag-of-words features")
    mnb_bow_predictions = train_predict_evaluate_model(classifier=mnb,
                                                       train_features=bow_train_features,
                                                       train_labels=train_labels,
                                                       test_features=bow_test_features,
                                                       test_labels=test_labels)

    # Logistic regression with bag-of-words features
    print("Logistic regression based on bag-of-words features")
    lr_bow_predictions = train_predict_evaluate_model(classifier=lr,
                                                      train_features=bow_train_features,
                                                      train_labels=train_labels,
                                                      test_features=bow_test_features,
                                                      test_labels=test_labels)

    # Support vector machine with bag-of-words features
    print("Support vector machine based on bag-of-words features")
    svm_bow_predictions = train_predict_evaluate_model(classifier=svm,
                                                       train_features=bow_train_features,
                                                       train_labels=train_labels,
                                                       test_features=bow_test_features,
                                                       test_labels=test_labels)

    # Multinomial naive Bayes with tfidf features
    print("Naive Bayes model based on tfidf features")
    mnb_tfidf_predictions = train_predict_evaluate_model(classifier=mnb,
                                                         train_features=tfidf_train_features,
                                                         train_labels=train_labels,
                                                         test_features=tfidf_test_features,
                                                         test_labels=test_labels)

    # Logistic regression with tfidf features
    print("Logistic regression model based on tfidf features")
    lr_tfidf_predictions = train_predict_evaluate_model(classifier=lr,
                                                        train_features=tfidf_train_features,
                                                        train_labels=train_labels,
                                                        test_features=tfidf_test_features,
                                                        test_labels=test_labels)

    # Support vector machine with tfidf features
    print("Support vector machine model based on tfidf features")
    svm_tfidf_predictions = train_predict_evaluate_model(classifier=svm,
                                                         train_features=tfidf_train_features,
                                                         train_labels=train_labels,
                                                         test_features=tfidf_test_features,
                                                         test_labels=test_labels)


if __name__ == '__main__':
    main()
4. Results
Naive Bayes classifier based on bag-of-words features
Accuracy: 0.79
Precision: 0.85
Recall: 0.79
F1 score: 0.78
Logistic regression based on bag-of-words features
Accuracy: 0.96
Precision: 0.96
Recall: 0.96
F1 score: 0.96
Support vector machine based on bag-of-words features
Accuracy: 0.97
Precision: 0.97
Recall: 0.97
F1 score: 0.97
Naive Bayes model based on tfidf features
Accuracy: 0.79
Precision: 0.85
Recall: 0.79
F1 score: 0.78
Logistic regression model based on tfidf features
Accuracy: 0.94
Precision: 0.94
Recall: 0.94
F1 score: 0.94
Support vector machine model based on tfidf features
Accuracy: 0.97
Precision: 0.97
Recall: 0.97
F1 score: 0.97