# 1. Implementation steps of text classification

1. Definition stage: define the data and the classification scheme, i.e. which categories exist and which data are needed.
2. Data preprocessing: segment the documents into words and remove stop words.
3. Feature extraction: reduce the dimensionality of the document matrix and keep the most useful features in the training set.
4. Model training stage: select a specific classification model and algorithm and train the text classifier.
5. Evaluation stage: test and evaluate the performance of the classifier on the test set.
6. Application stage: apply the best-performing classification model to the documents that need to be classified.

# 2. Several classical methods of feature extraction

Bag-of-words (BOW) method: the most primitive feature set, in which each word (each token produced by segmentation) is one feature.
It often leads to a data set with tens of thousands of features. Simple criteria can filter out words that do not help classification, for example removing stop words or ranking words by mutual information.
Even so, the feature dimension generally remains large, and each individual feature carries very little information.
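As a minimal sketch of the idea, the snippet below builds a bag-of-words matrix with only the standard library. The toy corpus and the helper name `bow_vector` are assumptions made for illustration; the practical part of this post uses scikit-learn's `CountVectorizer` instead.

```python
from collections import Counter

# Hypothetical toy corpus, already segmented into space-separated tokens
docs = ["free prize click now", "meeting at noon", "free free meeting"]

# Vocabulary: one feature per distinct token, in sorted order
vocab = sorted({tok for doc in docs for tok in doc.split()})


def bow_vector(doc, vocab):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]


bow_matrix = [bow_vector(doc, vocab) for doc in docs]
print(vocab)
for row in bow_matrix:
    print(row)
```

Even with three short documents the matrix already has seven columns, most of them zero, which is the dimensionality problem described above.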

Statistical features: the TF-IDF method (term frequency – inverse document frequency) uses statistical properties of the vocabulary as the feature set, so each feature has a clear interpretation. It looks more refined than bag-of-words, but in practice the results are almost the same.
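The weighting can be sketched by hand as follows. This is an illustrative version, not the exact computation used later in the post: it uses the smoothed form idf = ln((1 + n) / (1 + df)) + 1 and skips the L2 normalization step, and the toy corpus and function names are assumptions.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already segmented
docs = [d.split() for d in ["free prize click now", "meeting at noon", "free free meeting"]]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = Counter(tok for doc in docs for tok in set(doc))


def idf(term):
    """Smoothed inverse document frequency: rarer terms get larger weights."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tfidf(doc):
    """Raw term count multiplied by idf, without length normalization."""
    tf = Counter(doc)
    return {term: count * idf(term) for term, count in tf.items()}


weights = tfidf(docs[2])
for term, weight in sorted(weights.items()):
    print(term, round(weight, 4))
```

A term such as "prize" that occurs in a single document receives a higher idf than "free", which occurs in two.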

N-gram: a model that takes word order into account; an n-gram model corresponds to a Markov chain of order n-1. Each sample is converted into a transition probability matrix, which works well in practice.
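For the bigram case (n = 2), the transition probabilities can be estimated by counting adjacent token pairs. The sketch below assumes a toy token sequence and hypothetical helper names; it stores the matrix sparsely as nested dictionaries.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bigram_transition_probs(tokens):
    """Estimate P(next | previous) from bigram counts."""
    pair_counts = Counter(ngrams(tokens, 2))
    # Every token except the last one starts exactly one bigram
    first_counts = Counter(tokens[:-1])
    probs = defaultdict(dict)
    for (prev, nxt), count in pair_counts.items():
        probs[prev][nxt] = count / first_counts[prev]
    return dict(probs)


tokens = "click now click here click now".split()
probs = bigram_transition_probs(tokens)
print(probs)
```

Each row of the resulting matrix sums to 1, since it is a probability distribution over the next token.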

# 3. Classifier methods

### Naive Bayesian (NB)

Given a training set, naive Bayes first learns the joint probability distribution P(X,Y) of input and output under the assumption that the features are conditionally independent given the class. Then, for a given input x, it uses Bayes' theorem to find the output y with the largest posterior probability.

Assuming the training samples are drawn independently and identically from P(X,Y), the joint distribution is learned from the training set and can be factored in two ways:
P(X, Y)=P(Y|X)·P(X)=P(X|Y)·P(Y)

Rearranging this equation gives the general form of Bayes' theorem, where the denominator is obtained from the law of total probability:
P(Y|X) = P(X|Y)·P(Y) / P(X), with P(X) = Σ_k P(X|Y=c_k)·P(Y=c_k)

Under the conditional independence assumption, P(X|Y=c_k) = Π_j P(X_j|Y=c_k). Because the denominator is the same for every class, it can be dropped to simplify the calculation, so the naive Bayes classifier can be expressed as:
y = argmax over c_k of P(Y=c_k)·Π_j P(X_j=x_j|Y=c_k)

Advantages: simple to implement; learning and prediction are both efficient
Disadvantages: classification performance is not necessarily high
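To make the decision rule concrete, here is a small from-scratch multinomial naive Bayes: it scores each class by its log prior plus the sum of per-word log likelihoods and picks the maximum. The training messages, the function names and the Laplace smoothing parameter `alpha` are assumptions added for this sketch (smoothing keeps unseen words from producing a zero probability); the practical section uses scikit-learn's `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log P(class) and log P(word | class) with Laplace smoothing."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((word_counts[c][w] + alpha) / total)
                             for w in vocab}
    return log_prior, log_likelihood, vocab


def predict_nb(model, doc):
    """Return the class with the largest (unnormalized) log posterior."""
    log_prior, log_likelihood, vocab = model
    scores = {}
    for c in log_prior:
        # Words never seen in training are ignored
        scores[c] = log_prior[c] + sum(log_likelihood[c][w]
                                       for w in doc.split() if w in vocab)
    return max(scores, key=scores.get)


# Hypothetical training messages
train_docs = ["free prize now", "win free prize", "meeting at noon", "project meeting notes"]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_nb(train_docs, train_labels)
print(predict_nb(model, "free prize"))
print(predict_nb(model, "meeting notes"))
```

Working in log space replaces the product of many small probabilities with a sum, which avoids numerical underflow on long messages.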

### Logistic regression (LR)

A log-linear model whose output is a probability rather than a hard class label.
For a given data set, the model parameters are estimated by maximum likelihood.
Advantages: simple to implement, cheap to compute at classification time, fast, and low storage requirements
Disadvantages: prone to underfitting, and accuracy can be low
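A minimal sketch of maximum likelihood estimation for a one-feature logistic regression, using plain gradient ascent on the average log-likelihood. The data, learning rate and number of epochs are assumptions chosen for illustration; scikit-learn's `LogisticRegression`, used later in this post, relies on different solvers and adds regularization by default.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Gradient ascent on the average log-likelihood of a 1-D model."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # The gradient of the log-likelihood is the sum of (y - p) * x
        errors = [y - sigmoid(w * x + b) for x, y in zip(xs, ys)]
        w += lr * sum(e * x for e, x in zip(errors, xs)) / n
        b += lr * sum(errors) / n
    return w, b


# Hypothetical feature: number of spam-like words in a message; label 1 = spam
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(xs, ys)
p_low = sigmoid(w * 1 + b)
p_high = sigmoid(w * 4 + b)
print(round(p_low, 3), round(p_high, 3))
```

The output is a probability; a hard label is obtained only afterwards by thresholding, typically at 0.5.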

### Support vector machine (SVM)

Find a hyperplane in the feature space that separates the two classes with the largest possible margin. For problems that are not linearly separable, a kernel function is introduced to map the problem into a higher-dimensional space.

Advantages: can be used for linear and nonlinear classification as well as regression; low generalization error; easy to interpret; low computational complexity; an elegant derivation
Disadvantages: sensitive to the choice of parameters and kernel function
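The role of the kernel can be illustrated with a tiny example. The four 1-D points below cannot be split by a single threshold, but after the feature map phi(x) = (x², √2·x, 1) a hyperplane separates them, and the degree-2 polynomial kernel (x·z + 1)² equals the inner product in that mapped space. The data, the choice of kernel and the hand-picked weight vector are all assumptions for illustration; no SVM is actually trained here.

```python
import math


def poly_kernel(x, z):
    """Degree-2 polynomial kernel on scalars."""
    return (x * z + 1) ** 2


def phi(x):
    """Explicit feature map whose inner product equals poly_kernel."""
    return (x * x, math.sqrt(2) * x, 1.0)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Outer points are class +1, inner points are class -1: not separable in 1-D
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [1, -1, -1, 1]

# Hand-picked hyperplane in the mapped space: x^2 - 2.5 = 0
w = (1.0, 0.0, -2.5)
predictions = [1 if dot(w, phi(x)) > 0 else -1 for x in xs]
print(predictions)

# The kernel gives the same value without building phi explicitly
print(poly_kernel(1.5, -0.5), dot(phi(1.5), phi(-0.5)))
```

In practice the classifier only ever needs kernel values between pairs of samples, so the high-dimensional vectors are never materialized.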

# 4. Chinese spam classification practice

The data set consists of two files, ham_data.txt and spam_data.txt, containing normal mail and spam respectively.
Each line represents one message.

The main steps are:
Data extraction and splitting

```python
import numpy as np
from sklearn.model_selection import train_test_split


# Get data
def get_data():
    """
    Read the raw messages.
    :return: text data and the corresponding labels
    """
    with open("../../testdata/ham_data.txt", encoding='utf-8') as ham_f, \
            open("../../testdata/spam_data.txt", encoding='utf-8') as spam_f:
        # One message per line
        ham_data = ham_f.readlines()
        spam_data = spam_f.readlines()
    ham_label = np.ones(len(ham_data)).tolist()  # normal mail -> 1; tolist() converts the array to a list
    spam_label = np.zeros(len(spam_data)).tolist()  # spam -> 0
    corpus = ham_data + spam_data
    labels = ham_label + spam_label
    return corpus, labels


# Split data
def prepare_datasets(corpus, labels, test_data_proportion=0.3):
    """
    :param corpus: text data
    :param labels: text labels
    :param test_data_proportion: proportion of the data used for the test set
    :return: training data, test data, training labels, test labels
    """
    # With a fixed random_state the split (and therefore the model) is the same on every run
    x_train, x_test, y_train, y_test = train_test_split(corpus, labels,
                                                        test_size=test_data_proportion,
                                                        random_state=42)
    return x_train, x_test, y_train, y_test


# Delete empty messages
def remove_empty_docs(corpus, labels):
    filtered_corpus = []
    filtered_labels = []
    for docs, label in zip(corpus, labels):
        # strip() removes leading and trailing whitespace, so blank messages are falsy
        if docs.strip():
            filtered_corpus.append(docs)
            filtered_labels.append(label)
    return filtered_corpus, filtered_labels
```
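To see why a fixed seed makes the split reproducible, the sketch below reimplements a seeded split with only the standard library. The function name `split_dataset` and the toy data are assumptions for illustration; the post itself relies on scikit-learn's `train_test_split`, whose shuffling differs in detail, so the two will not produce the same partition.

```python
import random


def split_dataset(corpus, labels, test_proportion=0.3, seed=42):
    """Shuffle indices with a seeded generator, then cut off the test share."""
    indices = list(range(len(corpus)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(indices) * test_proportion)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [corpus[i] for i in train_idx]
    x_test = [corpus[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Hypothetical toy data: ten messages, the first five ham (1), the rest spam (0)
toy_corpus = ["message %d" % i for i in range(10)]
toy_labels = [1] * 5 + [0] * 5

first = split_dataset(toy_corpus, toy_labels)
second = split_dataset(toy_corpus, toy_labels)
print(first == second)               # same seed, same split
print(len(first[0]), len(first[1]))  # sizes of the training and test sets
```

Shuffling indices rather than the texts themselves keeps each message paired with its label.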

Data normalization and preprocessing

```python
import re
import string

import jieba

# Load stop words (assumed format: one stop word per line)
with open("../../testdata/stop_words.utf8", encoding="utf8") as f:
    stopword_list = [line.strip() for line in f]


# Segment text with jieba
def tokenize_text(text):
    tokens = jieba.cut(text)
    tokens = [token.strip() for token in tokens]
    return tokens


# Remove all special characters and punctuation marks
def remove_special_characters(text):
    tokens = tokenize_text(text)
    # re.escape makes every character literal inside the character class;
    # string.punctuation contains the ASCII punctuation marks only
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
    # filter(None, ...) drops tokens that became empty
    filtered_tokens = filter(None, [pattern.sub('', token) for token in tokens])
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text


# Remove stop words
def remove_stopwords(text):
    tokens = tokenize_text(text)
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    # Joined without separators, as in the original code; use ' '.join(...)
    # if the vectorizer should split on the jieba tokens
    filtered_text = ''.join(filtered_tokens)
    return filtered_text


# Clean the data and optionally segment it
def normalize_corpus(corpus, tokenize=False):
    normalized_corpus = []
    for text in corpus:
        # Remove all special characters and punctuation marks
        text = remove_special_characters(text)
        # Remove stop words
        text = remove_stopwords(text)
        # Append each document once: either as a token list or as a string
        if tokenize:
            text = tokenize_text(text)
        normalized_corpus.append(text)

    return normalized_corpus
```
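One detail of the punctuation filter is worth checking in isolation: `string.punctuation` covers ASCII punctuation only, so full-width Chinese punctuation passes through unchanged. The sketch below demonstrates this with the standard library alone. The variable `cjk_punct` is a hypothetical, non-exhaustive extension added for illustration and is not part of the original code.

```python
import re
import string

# Same pattern as in remove_special_characters
ascii_punct = re.compile("[{}]".format(re.escape(string.punctuation)))

# Hypothetical extension: a few common full-width punctuation marks
cjk_punct = "，。！？；：、（）《》“”‘’【】…"
all_punct = re.compile("[{}]".format(re.escape(string.punctuation + cjk_punct)))

sample = "你好，world!"
print(ascii_punct.sub("", sample))  # the full-width comma survives
print(all_punct.sub("", sample))    # both marks are removed
```

Whether the extra marks should be stripped depends on the data; leaving them in also affects how a downstream vectorizer splits unsegmented text.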

Feature extraction (TF-IDF and bag-of-words model)

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer


# Bag-of-words model
def bow_extractor(corpus, ngram_range=(1, 1)):
    vectorizer = CountVectorizer(min_df=1, ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features


# Convert an existing bag-of-words matrix to TF-IDF
def tfidf_transformer(bow_matrix):
    transformer = TfidfTransformer(norm='l2',
                                   smooth_idf=True,
                                   use_idf=True)
    tfidf_matrix = transformer.fit_transform(bow_matrix)
    return transformer, tfidf_matrix


# TF-IDF computed directly from the text
def tfidf_extractor(corpus, ngram_range=(1, 1)):
    vectorizer = TfidfVectorizer(min_df=1,
                                 norm='l2',
                                 smooth_idf=True,
                                 use_idf=True,
                                 ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features


# Usage (norm_train_corpus and norm_test_corpus come from normalize_corpus):
# fit on the training set only, then reuse the fitted vectorizer on the test set

# Bag-of-words features
bow_vectorizer, bow_train_features = bow_extractor(norm_train_corpus)
bow_test_features = bow_vectorizer.transform(norm_test_corpus)

# TF-IDF features
tfidf_vectorizer, tfidf_train_features = tfidf_extractor(norm_train_corpus)
tfidf_test_features = tfidf_vectorizer.transform(norm_test_corpus)
```

Training the classifiers

```python
# Train a model, predict on the test set and report the metrics
def train_predict_evaluate_model(classifier,
                                 train_features, train_labels,
                                 test_features, test_labels):
    # build model
    classifier.fit(train_features, train_labels)
    # predict using model
    predictions = classifier.predict(test_features)
    # evaluate model prediction performance
    get_metrics(true_labels=test_labels,
                predicted_labels=predictions)
    return predictions
```

Multinomial naive Bayes based on the bag-of-words model
Logistic regression based on the bag-of-words model
Support vector machine based on the bag-of-words model
Multinomial naive Bayes based on TF-IDF
Logistic regression based on TF-IDF
Support vector machine based on TF-IDF

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier, LogisticRegression

# Naive Bayes model
mnb = MultinomialNB()
# Support vector machine model (linear SVM trained with SGD on the hinge loss)
svm = SGDClassifier(loss='hinge', n_iter_no_change=100)
# Logistic regression model
lr = LogisticRegression()

# Multinomial naive Bayes based on the bag-of-words model
print("Bayesian classifier based on the features of word bag model")
mnb_bow_predictions = train_predict_evaluate_model(classifier=mnb,
                                                   train_features=bow_train_features,
                                                   train_labels=train_labels,
                                                   test_features=bow_test_features,
                                                   test_labels=test_labels)

# Logistic regression based on the bag-of-words model
print("Logistic regression based on the characteristics of word bag model")
lr_bow_predictions = train_predict_evaluate_model(classifier=lr,
                                                  train_features=bow_train_features,
                                                  train_labels=train_labels,
                                                  test_features=bow_test_features,
                                                  test_labels=test_labels)

# Support vector machine based on the bag-of-words model
print("Support vector machine based on word bag model")
svm_bow_predictions = train_predict_evaluate_model(classifier=svm,
                                                   train_features=bow_train_features,
                                                   train_labels=train_labels,
                                                   test_features=bow_test_features,
                                                   test_labels=test_labels)

# Multinomial naive Bayes based on TF-IDF
print("Bayesian model based on tfidf")
mnb_tfidf_predictions = train_predict_evaluate_model(classifier=mnb,
                                                     train_features=tfidf_train_features,
                                                     train_labels=train_labels,
                                                     test_features=tfidf_test_features,
                                                     test_labels=test_labels)

# Logistic regression based on TF-IDF
print("Logistic regression model based on tfidf")
lr_tfidf_predictions = train_predict_evaluate_model(classifier=lr,
                                                    train_features=tfidf_train_features,
                                                    train_labels=train_labels,
                                                    test_features=tfidf_test_features,
                                                    test_labels=test_labels)

# Support vector machine based on TF-IDF
print("Support vector machine model based on tfidf")
svm_tfidf_predictions = train_predict_evaluate_model(classifier=svm,
                                                     train_features=tfidf_train_features,
                                                     train_labels=train_labels,
                                                     test_features=tfidf_test_features,
                                                     test_labels=test_labels)
```

Accuracy, precision, recall and F1 score are used to evaluate the models

```python
import numpy as np
from sklearn import metrics


# Evaluate the predictions
def get_metrics(true_labels, predicted_labels):
    print('Accuracy:', np.round(
        metrics.accuracy_score(true_labels,
                               predicted_labels),
        2))
    # average='weighted' averages the per-class scores, weighted by class support
    print('Precision:', np.round(
        metrics.precision_score(true_labels,
                                predicted_labels,
                                average='weighted'),
        2))
    print('Recall rate:', np.round(
        metrics.recall_score(true_labels,
                             predicted_labels,
                             average='weighted'),
        2))
    print('F1 score:', np.round(
        metrics.f1_score(true_labels,
                         predicted_labels,
                         average='weighted'),
        2))
```
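The four measures can also be computed by hand from the counts of true positives, false positives and false negatives, which makes the difference between them explicit. The sketch below does this for a single positive class on made-up labels; the function name and data are assumptions, and unlike the `average='weighted'` calls above it does not average over classes.

```python
def binary_metrics(true_labels, predicted_labels, positive=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Precision: of everything predicted positive, how much was right
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of everything actually positive, how much was found
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


# Hypothetical labels: 1 = normal mail, 0 = spam
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

Here three positives are found, one is missed and one negative is flagged by mistake, so all four measures happen to equal 0.75; on imbalanced data they usually diverge.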

# 5. Complete code

1. Data processing methods: normalization.py

```python
# -*- coding: utf-8 -*-
import re  # regular expression module
import string

import jieba

# Load stop words (assumed format: one stop word per line)
with open("../../testdata/stop_words.utf8", encoding="utf8") as f:
    stopword_list = [line.strip() for line in f]


# Segment text with jieba
def tokenize_text(text):
    tokens = jieba.cut(text)
    tokens = [token.strip() for token in tokens]
    return tokens


# Remove all special characters and punctuation marks
def remove_special_characters(text):
    tokens = tokenize_text(text)
    # re.escape makes every character literal inside the character class;
    # string.punctuation contains the ASCII punctuation marks only
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
    # filter(None, ...) drops tokens that became empty
    filtered_tokens = filter(None, [pattern.sub('', token) for token in tokens])
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text


# Remove stop words
def remove_stopwords(text):
    tokens = tokenize_text(text)
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    # Joined without separators, as in the original code; use ' '.join(...)
    # if the vectorizer should split on the jieba tokens
    filtered_text = ''.join(filtered_tokens)
    return filtered_text


# Clean the data and optionally segment it
def normalize_corpus(corpus, tokenize=False):
    normalized_corpus = []
    for text in corpus:
        # Remove all special characters and punctuation marks
        text = remove_special_characters(text)
        # Remove stop words
        text = remove_stopwords(text)
        # Append each document once: either as a token list or as a string
        if tokenize:
            text = tokenize_text(text)
        normalized_corpus.append(text)

    return normalized_corpus
```

2. Feature extraction methods: feature_extractors.py

```python
# -*- coding: utf-8 -*-
# CountVectorizer counts how often each word occurs in the text
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer


# Bag-of-words model
def bow_extractor(corpus, ngram_range=(1, 1)):
    vectorizer = CountVectorizer(min_df=1, ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features


# Convert an existing bag-of-words matrix to TF-IDF
def tfidf_transformer(bow_matrix):
    transformer = TfidfTransformer(norm='l2',
                                   smooth_idf=True,
                                   use_idf=True)
    tfidf_matrix = transformer.fit_transform(bow_matrix)
    return transformer, tfidf_matrix


# TF-IDF computed directly from the text
def tfidf_extractor(corpus, ngram_range=(1, 1)):
    vectorizer = TfidfVectorizer(min_df=1,
                                 norm='l2',
                                 smooth_idf=True,
                                 use_idf=True,
                                 ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features
```

3. Main program: classfier.py

```python
# -*- coding: utf-8 -*-
# date: 09/22/2020
import numpy as np
from sklearn.model_selection import train_test_split
# Assumes normalization.py and feature_extractors.py sit next to this file
from normalization import normalize_corpus
from feature_extractors import bow_extractor, tfidf_extractor
# import gensim  # only needed for the commented-out word2vec model below
import jieba
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression


# Get data
def get_data():
    """
    Read the raw messages.
    :return: text data and the corresponding labels
    """
    with open("../../testdata/ham_data.txt", encoding='utf-8') as ham_f, \
            open("../../testdata/spam_data.txt", encoding='utf-8') as spam_f:
        # One message per line
        ham_data = ham_f.readlines()
        spam_data = spam_f.readlines()
    ham_label = np.ones(len(ham_data)).tolist()  # normal mail -> 1; tolist() converts the array to a list
    spam_label = np.zeros(len(spam_data)).tolist()  # spam -> 0
    corpus = ham_data + spam_data
    labels = ham_label + spam_label
    return corpus, labels


# Split data
def prepare_datasets(corpus, labels, test_data_proportion=0.3):
    """
    :param corpus: text data
    :param labels: text labels
    :param test_data_proportion: proportion of the data used for the test set
    :return: training data, test data, training labels, test labels
    """
    # With a fixed random_state the split (and therefore the model) is the same on every run
    x_train, x_test, y_train, y_test = train_test_split(corpus, labels,
                                                        test_size=test_data_proportion,
                                                        random_state=42)
    return x_train, x_test, y_train, y_test


# Delete empty messages
def remove_empty_docs(corpus, labels):
    filtered_corpus = []
    filtered_labels = []
    for docs, label in zip(corpus, labels):
        # strip() removes leading and trailing whitespace, so blank messages are falsy
        if docs.strip():
            filtered_corpus.append(docs)
            filtered_labels.append(label)
    return filtered_corpus, filtered_labels


# Evaluate the predictions
def get_metrics(true_labels, predicted_labels):
    print('Accuracy:', np.round(
        metrics.accuracy_score(true_labels,
                               predicted_labels),
        2))
    # average='weighted' averages the per-class scores, weighted by class support
    print('Precision:', np.round(
        metrics.precision_score(true_labels,
                                predicted_labels,
                                average='weighted'),
        2))
    print('Recall rate:', np.round(
        metrics.recall_score(true_labels,
                             predicted_labels,
                             average='weighted'),
        2))
    print('F1 score:', np.round(
        metrics.f1_score(true_labels,
                         predicted_labels,
                         average='weighted'),
        2))


# Train a model, predict on the test set and report the metrics
def train_predict_evaluate_model(classifier,
                                 train_features, train_labels,
                                 test_features, test_labels):
    # build model
    classifier.fit(train_features, train_labels)
    # predict using model
    predictions = classifier.predict(test_features)
    # evaluate model prediction performance
    get_metrics(true_labels=test_labels,
                predicted_labels=predictions)
    return predictions


def main():
    # Get data
    corpus, labels = get_data()
    print("Total data volume:", len(labels))
    # Delete empty messages
    corpus, labels = remove_empty_docs(corpus, labels)
    # Inspect the first message (ham) and compare its label with the last one (spam);
    # the indices are chosen here for illustration
    print('One sample:', corpus[0])
    print('Sample label:', labels[0])
    label_name_map = ['Spam', 'Normal mail']  # 0 1
    print('Actual type:', label_name_map[int(labels[0])], label_name_map[int(labels[-1])])
    # Split data
    train_corpus, test_corpus, train_labels, test_labels = prepare_datasets(corpus, labels,
                                                                            test_data_proportion=0.3)

    # Clean the data (punctuation and stop-word removal)
    norm_train_corpus = normalize_corpus(train_corpus)
    norm_test_corpus = normalize_corpus(test_corpus)

    # Bag-of-words features
    bow_vectorizer, bow_train_features = bow_extractor(norm_train_corpus)
    bow_test_features = bow_vectorizer.transform(norm_test_corpus)

    # TF-IDF features
    tfidf_vectorizer, tfidf_train_features = tfidf_extractor(norm_train_corpus)
    tfidf_test_features = tfidf_vectorizer.transform(norm_test_corpus)

    # Tokenize documents (only needed for the word2vec model below)
    tokenized_train = [jieba.lcut(text)
                       for text in norm_train_corpus]
    print(tokenized_train[2:10])
    tokenized_test = [jieba.lcut(text)
                      for text in norm_test_corpus]
    # build word2vec model
    # model = gensim.models.Word2Vec(tokenized_train,
    #                                size=500,
    #                                window=100,
    #                                min_count=30,
    #                                sample=1e-3)

    # Naive Bayes model
    mnb = MultinomialNB()
    # Support vector machine model (linear SVM trained with SGD on the hinge loss)
    svm = SGDClassifier(loss='hinge', n_iter_no_change=100)
    # Logistic regression model
    lr = LogisticRegression()

    # Multinomial naive Bayes based on the bag-of-words model
    print("Bayesian classifier based on the features of word bag model")
    mnb_bow_predictions = train_predict_evaluate_model(classifier=mnb,
                                                       train_features=bow_train_features,
                                                       train_labels=train_labels,
                                                       test_features=bow_test_features,
                                                       test_labels=test_labels)

    # Logistic regression based on the bag-of-words model
    print("Logistic regression based on the characteristics of word bag model")
    lr_bow_predictions = train_predict_evaluate_model(classifier=lr,
                                                      train_features=bow_train_features,
                                                      train_labels=train_labels,
                                                      test_features=bow_test_features,
                                                      test_labels=test_labels)

    # Support vector machine based on the bag-of-words model
    print("Support vector machine based on word bag model")
    svm_bow_predictions = train_predict_evaluate_model(classifier=svm,
                                                       train_features=bow_train_features,
                                                       train_labels=train_labels,
                                                       test_features=bow_test_features,
                                                       test_labels=test_labels)

    # Multinomial naive Bayes based on TF-IDF
    print("Bayesian model based on tfidf")
    mnb_tfidf_predictions = train_predict_evaluate_model(classifier=mnb,
                                                         train_features=tfidf_train_features,
                                                         train_labels=train_labels,
                                                         test_features=tfidf_test_features,
                                                         test_labels=test_labels)

    # Logistic regression based on TF-IDF
    print("Logistic regression model based on tfidf")
    lr_tfidf_predictions = train_predict_evaluate_model(classifier=lr,
                                                        train_features=tfidf_train_features,
                                                        train_labels=train_labels,
                                                        test_features=tfidf_test_features,
                                                        test_labels=test_labels)

    # Support vector machine based on TF-IDF
    print("Support vector machine model based on tfidf")
    svm_tfidf_predictions = train_predict_evaluate_model(classifier=svm,
                                                         train_features=tfidf_train_features,
                                                         train_labels=train_labels,
                                                         test_features=tfidf_test_features,
                                                         test_labels=test_labels)


if __name__ == '__main__':
    main()
```

4. Results

Bayesian classifier based on the features of word bag model
Accuracy: 0.79
Precision: 0.85
Recall rate: 0.79
F1 score: 0.78

Logistic regression based on the characteristics of word bag model
Accuracy: 0.96
Precision: 0.96
Recall rate: 0.96
F1 score: 0.96

Support vector machine based on word bag model
Accuracy: 0.97
Precision: 0.97
Recall rate: 0.97
F1 score: 0.97

Bayesian model based on tfidf
Accuracy: 0.79
Precision: 0.85
Recall rate: 0.79
F1 score: 0.78

Logistic regression model based on tfidf
Accuracy: 0.94
Precision: 0.94
Recall rate: 0.94
F1 score: 0.94

Support vector machine model based on tfidf
Accuracy: 0.97
Precision: 0.97
Recall rate: 0.97
F1 score: 0.97

Tags: Machine Learning NLP

Posted by lucerias on Mon, 16 May 2022 00:57:33 +0300