NLP learning - text classification in practice - Chinese spam classification (Python 3)

1, Implementation steps of text classification:

Definition stage: define the data and the classification scheme - which categories are used and which data are needed
Data preprocessing: segment the documents into words and remove stop words
Feature extraction: reduce the dimensionality of the document matrix and extract the most useful features from the training set
Model training stage: select a specific classification model and algorithm and train the text classifier
Evaluation stage: test and evaluate the performance of the classifier on the test set
Application stage: apply the best-performing model to classify new, unseen documents

2, Several classical methods of feature extraction:

Bag of words (BOW): the most primitive feature set, where each word / token is one feature.
This often produces a data set with tens of thousands of features. Simple measures can filter out words that do not help classification, such as removing stop words or computing mutual information.
In general, though, the feature dimensionality is high and each individual feature carries little information.

Statistical features: the TF-IDF method (term frequency - inverse document frequency). It uses statistical properties of the vocabulary as the feature set, so each feature has a physical meaning. It looks better than bag of words, but in practice the effect is almost the same.

N-gram: a model that takes word order into account, i.e. an n-order Markov chain. Each sample is converted into a transition probability matrix, which works well in practice. A minimal sketch comparing these three feature sets follows below.
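As a minimal illustration of the three feature sets (using a recent scikit-learn; the two-document toy corpus is invented purely for demonstration):

# A minimal sketch of the three feature sets with scikit-learn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["buy cheap pills now", "meeting notes for the project"]

# Bag of words: one feature per word, raw counts
bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())
print(bow.get_feature_names_out())

# TF-IDF: the same vocabulary, with counts re-weighted by inverse document frequency
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray())

# N-grams: ngram_range=(1, 2) adds word pairs, partially preserving word order
bigram = CountVectorizer(ngram_range=(1, 2))
bigram.fit(corpus)
print(bigram.get_feature_names_out())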

3, Classifier methods:

Naive Bayes (NB)

For a given training set, first learn the joint probability distribution P(X,Y) under the assumption that the features are conditionally independent given the class; then, for a given input x, use Bayes' theorem to output the class y with the largest posterior probability.

The joint probability distribution P(X,Y) is learned from the training set via
P(X, Y) = P(Y|X)·P(X) = P(X|Y)·P(Y)

Rearranging gives the general form of Bayes' theorem:

P(Y=c_k | X=x) = P(X=x | Y=c_k)·P(Y=c_k) / P(X=x)

where the denominator follows from the law of total probability:

P(X=x) = Σ_k P(X=x | Y=c_k)·P(Y=c_k)

Under the conditional independence assumption, P(X=x | Y=c_k) = Π_j P(x^(j) | Y=c_k), so naive Bayes can be expressed as:

y = argmax_{c_k} [ P(Y=c_k)·Π_j P(x^(j) | Y=c_k) / Σ_k P(Y=c_k)·Π_j P(x^(j) | Y=c_k) ]

To simplify the calculation, the denominator, which is the same for every class, can be removed:

y = argmax_{c_k} P(Y=c_k)·Π_j P(x^(j) | Y=c_k)
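A tiny numeric sketch of this decision rule (all probabilities below are invented for illustration):

# Hypothetical statistics for classifying the two-word message ["cheap", "pills"];
# every number here is made up for illustration only
priors = {"spam": 0.4, "ham": 0.6}
cond = {
    "spam": {"cheap": 0.05,  "pills": 0.03},
    "ham":  {"cheap": 0.002, "pills": 0.001},
}

words = ["cheap", "pills"]
scores = {}
for c in priors:
    score = priors[c]                  # P(Y = c)
    for w in words:
        score *= cond[c][w]            # times the conditional probability of each word
    scores[c] = score                  # the shared denominator is omitted
print(max(scores, key=scores.get))     # -> spam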

Advantages: simple to implement; learning and prediction are both efficient
Disadvantages: classification performance is not always high

Logistic regression (LR)

A log-linear model whose output is a probability rather than a hard class label

(Figure: the sigmoid curve. Logistic regression maps the linear score to a probability via P(Y=1|x) = 1 / (1 + e^(-(w·x + b))).)

For a given data set, the model parameters are estimated by maximum likelihood.

Advantages: simple to implement; classification requires little computation, is fast, and needs little storage
Disadvantages: prone to underfitting; accuracy can be low
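A minimal numpy sketch of that mapping (the weights and the input vector are invented for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias and a two-feature input, for illustration only
w, b = np.array([1.5, -0.8]), 0.1
x = np.array([0.9, 0.3])
p = sigmoid(w @ x + b)        # probability of the positive class
print(p, int(p >= 0.5))       # probability, then the hard label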

Support vector machine (SVM)

Find a hyperplane in the feature space that separates the two classes as well as possible.

For linearly non-separable problems, a kernel function is introduced to map the data into a higher-dimensional space where a separating hyperplane exists.

Advantages: can be used for linear / nonlinear classification and regression; low generalization error; easy to interpret; low computational complexity; an elegant derivation
Disadvantages: sensitive to the choice of parameters and kernel function
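A minimal scikit-learn sketch of the kernel idea on a linearly non-separable toy problem (the XOR-style data below is made up for illustration):

from sklearn.svm import SVC

# XOR-style points: no straight line in 2-D separates the two classes
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

clf = SVC(kernel='rbf', gamma='scale')  # the RBF kernel implicitly maps the data into a higher-dimensional space
clf.fit(X, y)
print(clf.predict(X))  # typically recovers [0 1 1 0]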

4, Chinese spam classification practice

The data set is split into ham_data.txt and spam_data.txt, corresponding to normal mail and spam respectively
Each line represents a message

The main processes are:
Data extraction and splitting

# Get the data
def get_data():
    """
    Load the corpus.
    :return: text data and the corresponding labels
    """
    with open("../../testdata/ham_data.txt", encoding='utf-8') as ham_f, \
            open("../../testdata/spam_data.txt", encoding='utf-8') as spam_f:
        ham_data = ham_f.readlines()
        spam_data = spam_f.readlines()
        ham_label = np.ones(len(ham_data)).tolist()     # normal mail -> label 1 (tolist converts the ndarray to a list)
        spam_label = np.zeros(len(spam_data)).tolist()  # spam -> label 0
        corpus = ham_data + spam_data
        labels = ham_label + spam_label
    return corpus, labels

# Split the data
def prepare_datasets(corpus, labels, test_data_proportion=0.3):
    """
    :param corpus: text data
    :param labels: text labels
    :param test_data_proportion: proportion of data held out for the test set
    :return: training data, test data, training labels, test labels
    """
    # Fixing random_state makes the split (and hence the model) reproducible across runs
    x_train, x_test, y_train, y_test = train_test_split(corpus, labels,
                                                        test_size=test_data_proportion,
                                                        random_state=42)
    return x_train, x_test, y_train, y_test

# Delete empty messages
def remove_empty_docs(corpus, labels):
    filtered_corpus = []
    filtered_labels = []
    for doc, label in zip(corpus, labels):
        # strip() removes leading/trailing whitespace; blank lines are dropped
        if doc.strip():
            filtered_corpus.append(doc)
            filtered_labels.append(label)
    return filtered_corpus, filtered_labels

Data normalization and preprocessing

# Load stop words (strip the trailing newline from each line so tokens can match)
with open("../../testdata/stop_words.utf8", encoding="utf8") as f:
    stopword_list = [line.strip() for line in f.readlines()]

# jieba word segmentation
def tokenize_text(text):
    tokens = [token.strip() for token in jieba.cut(text)]
    return [token for token in tokens if token]  # drop empty tokens left by whitespace

# Remove special characters and punctuation marks
def remove_special_characters(text):
    # jieba word segmentation
    tokens = tokenize_text(text)
    # re.escape makes every punctuation character match literally;
    # string.punctuation covers ASCII punctuation only (full-width Chinese punctuation is untouched)
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
    filtered_tokens = filter(None, [pattern.sub('', token) for token in tokens])
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

# Remove stop words
def remove_stopwords(text):
    # jieba word segmentation
    tokens = tokenize_text(text)
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    # join with spaces so the downstream vectorizers can still split the tokens
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

# Clean the data and optionally tokenize
def normalize_corpus(corpus, tokenize=False):
    normalized_corpus = []
    for text in corpus:
        # Remove special characters and punctuation marks
        text = remove_special_characters(text)
        # Remove stop words
        text = remove_stopwords(text)
        if tokenize:
            # append the token list instead of the cleaned string
            normalized_corpus.append(tokenize_text(text))
        else:
            normalized_corpus.append(text)
    return normalized_corpus

Feature extraction (bag-of-words and TF-IDF)

# Bag-of-words features
bow_vectorizer, bow_train_features = bow_extractor(norm_train_corpus)
bow_test_features = bow_vectorizer.transform(norm_test_corpus)

# TF-IDF features
tfidf_vectorizer, tfidf_train_features = tfidf_extractor(norm_train_corpus)
tfidf_test_features = tfidf_vectorizer.transform(norm_test_corpus)

# Bag-of-words model
def bow_extractor(corpus, ngram_range=(1, 1)):
    vectorizer = CountVectorizer(min_df=1, ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features


# Convert an existing bag-of-words matrix into TF-IDF features
def tfidf_transformer(bow_matrix):
    transformer = TfidfTransformer(norm='l2',
                                   smooth_idf=True,
                                   use_idf=True)
    tfidf_matrix = transformer.fit_transform(bow_matrix)
    return transformer, tfidf_matrix

# TF-IDF model
def tfidf_extractor(corpus, ngram_range=(1, 1)):
    vectorizer = TfidfVectorizer(min_df=1,
                                 norm='l2',
                                 smooth_idf=True,
                                 use_idf=True,
                                 ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features
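A quick check of what these extractors return, on a toy corpus invented for illustration (the real pipeline feeds in norm_train_corpus instead):

toy_corpus = ["cheap pills cheap", "project meeting notes"]
vec, feats = bow_extractor(toy_corpus)
print(feats.shape)                  # (2, number of distinct terms)
print(vec.get_feature_names_out())  # the learned vocabulary (recent scikit-learn)
print(feats.toarray())              # raw term counts per document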

Training classifier

#Training model
def train_predict_evaluate_model(classifier,
                                 train_features, train_labels,
                                 test_features, test_labels):
    # build model
    classifier.fit(train_features, train_labels)
    # predict using model
    predictions = classifier.predict(test_features)
    # evaluate model prediction performance
    get_metrics(true_labels=test_labels,
                predicted_labels=predictions)
    return predictions

Six combinations are compared:

Multinomial naive Bayes with bag-of-words features
Logistic regression with bag-of-words features
Support vector machine with bag-of-words features
Multinomial naive Bayes with TF-IDF features
Logistic regression with TF-IDF features
Support vector machine with TF-IDF features

# Naive Bayes model
mnb = MultinomialNB()
# Support vector machine model (a linear SVM trained with SGD and hinge loss)
svm = SGDClassifier(loss='hinge', n_iter_no_change=100)
# Logistic regression model
lr = LogisticRegression()

# Multinomial naive Bayes with bag-of-words features
print("Multinomial naive Bayes with bag-of-words features")
mnb_bow_predictions = train_predict_evaluate_model(classifier=mnb,
                                                   train_features=bow_train_features,
                                                   train_labels=train_labels,
                                                   test_features=bow_test_features,
                                                   test_labels=test_labels)

# Logistic regression with bag-of-words features
print("Logistic regression with bag-of-words features")
lr_bow_predictions = train_predict_evaluate_model(classifier=lr,
                                                  train_features=bow_train_features,
                                                  train_labels=train_labels,
                                                  test_features=bow_test_features,
                                                  test_labels=test_labels)

# Support vector machine with bag-of-words features
print("Support vector machine with bag-of-words features")
svm_bow_predictions = train_predict_evaluate_model(classifier=svm,
                                                   train_features=bow_train_features,
                                                   train_labels=train_labels,
                                                   test_features=bow_test_features,
                                                   test_labels=test_labels)

# Multinomial naive Bayes with TF-IDF features
print("Multinomial naive Bayes with TF-IDF features")
mnb_tfidf_predictions = train_predict_evaluate_model(classifier=mnb,
                                                     train_features=tfidf_train_features,
                                                     train_labels=train_labels,
                                                     test_features=tfidf_test_features,
                                                     test_labels=test_labels)

# Logistic regression with TF-IDF features
print("Logistic regression with TF-IDF features")
lr_tfidf_predictions = train_predict_evaluate_model(classifier=lr,
                                                    train_features=tfidf_train_features,
                                                    train_labels=train_labels,
                                                    test_features=tfidf_test_features,
                                                    test_labels=test_labels)

# Support vector machine with TF-IDF features
print("Support vector machine with TF-IDF features")
svm_tfidf_predictions = train_predict_evaluate_model(classifier=svm,
                                                     train_features=tfidf_train_features,
                                                     train_labels=train_labels,
                                                     test_features=tfidf_test_features,
                                                     test_labels=test_labels)

Accuracy, precision, recall and the F1 score are used to evaluate the models.
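For reference, these metrics are computed from the true/false positive and negative counts (weighted across the two classes here):

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 · Precision · Recall / (Precision + Recall)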

# Evaluate the predictions
def get_metrics(true_labels, predicted_labels):
    print('Accuracy:', np.round(
        metrics.accuracy_score(true_labels,
                               predicted_labels),
        2))
    print('Precision:', np.round(
        metrics.precision_score(true_labels,
                                predicted_labels,
                                average='weighted'),
        2))
    print('Recall:', np.round(
        metrics.recall_score(true_labels,
                             predicted_labels,
                             average='weighted'),
        2))
    print('F1 score:', np.round(
        metrics.f1_score(true_labels,
                         predicted_labels,
                         average='weighted'),
        2))

5, Complete code

1. Data preprocessing: normalization.py

# -*- coding: utf-8 -*-
import re  # regular expressions
import string
import jieba

# Load stop words (strip the trailing newline from each line so tokens can match)
with open("../../testdata/stop_words.utf8", encoding="utf8") as f:
    stopword_list = [line.strip() for line in f.readlines()]

# jieba word segmentation
def tokenize_text(text):
    tokens = [token.strip() for token in jieba.cut(text)]
    return [token for token in tokens if token]  # drop empty tokens left by whitespace

# Remove special characters and punctuation marks
def remove_special_characters(text):
    # jieba word segmentation
    tokens = tokenize_text(text)
    # re.escape makes every punctuation character match literally;
    # string.punctuation covers ASCII punctuation only (full-width Chinese punctuation is untouched)
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
    filtered_tokens = filter(None, [pattern.sub('', token) for token in tokens])
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

# Remove stop words
def remove_stopwords(text):
    # jieba word segmentation
    tokens = tokenize_text(text)
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    # join with spaces so the downstream vectorizers can still split the tokens
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

# Clean the data and optionally tokenize
def normalize_corpus(corpus, tokenize=False):
    normalized_corpus = []
    for text in corpus:
        # Remove special characters and punctuation marks
        text = remove_special_characters(text)
        # Remove stop words
        text = remove_stopwords(text)
        if tokenize:
            # append the token list instead of the cleaned string
            normalized_corpus.append(tokenize_text(text))
        else:
            normalized_corpus.append(text)
    return normalized_corpus

2. Feature extraction: feature_extractors.py

# -*- coding: utf-8 -*-
# CountVectorizer considers the frequency of words in the text
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

# Bag-of-words model
def bow_extractor(corpus, ngram_range=(1, 1)):
    vectorizer = CountVectorizer(min_df=1, ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features


# Convert an existing bag-of-words matrix into TF-IDF features
def tfidf_transformer(bow_matrix):
    transformer = TfidfTransformer(norm='l2',
                                   smooth_idf=True,
                                   use_idf=True)
    tfidf_matrix = transformer.fit_transform(bow_matrix)
    return transformer, tfidf_matrix

# TF-IDF model
def tfidf_extractor(corpus, ngram_range=(1, 1)):
    vectorizer = TfidfVectorizer(min_df=1,
                                 norm='l2',
                                 smooth_idf=True,
                                 use_idf=True,
                                 ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features

3. Main program: classfier.py

# -*- coding: utf-8 -*-
# date: 09/22/2020
import numpy as np
from sklearn.model_selection import train_test_split
# normalization.py and feature_extractors.py are assumed to sit alongside this script
from normalization import normalize_corpus
from feature_extractors import bow_extractor, tfidf_extractor
import gensim  # only needed for the commented-out word2vec experiment below
import jieba
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression

# Get the data
def get_data():
    """
    Load the corpus.
    :return: text data and the corresponding labels
    """
    with open("../../testdata/ham_data.txt", encoding='utf-8') as ham_f, \
            open("../../testdata/spam_data.txt", encoding='utf-8') as spam_f:
        ham_data = ham_f.readlines()
        spam_data = spam_f.readlines()
        ham_label = np.ones(len(ham_data)).tolist()     # normal mail -> label 1 (tolist converts the ndarray to a list)
        spam_label = np.zeros(len(spam_data)).tolist()  # spam -> label 0
        corpus = ham_data + spam_data
        labels = ham_label + spam_label
    return corpus, labels

# Split the data
def prepare_datasets(corpus, labels, test_data_proportion=0.3):
    """
    :param corpus: text data
    :param labels: text labels
    :param test_data_proportion: proportion of data held out for the test set
    :return: training data, test data, training labels, test labels
    """
    # Fixing random_state makes the split (and hence the model) reproducible across runs
    x_train, x_test, y_train, y_test = train_test_split(corpus, labels,
                                                        test_size=test_data_proportion,
                                                        random_state=42)
    return x_train, x_test, y_train, y_test

# Delete empty messages
def remove_empty_docs(corpus, labels):
    filtered_corpus = []
    filtered_labels = []
    for doc, label in zip(corpus, labels):
        # strip() removes leading/trailing whitespace; blank lines are dropped
        if doc.strip():
            filtered_corpus.append(doc)
            filtered_labels.append(label)
    return filtered_corpus, filtered_labels

# Evaluate the predictions
def get_metrics(true_labels, predicted_labels):
    print('Accuracy:', np.round(
        metrics.accuracy_score(true_labels,
                               predicted_labels),
        2))
    print('Precision:', np.round(
        metrics.precision_score(true_labels,
                                predicted_labels,
                                average='weighted'),
        2))
    print('Recall:', np.round(
        metrics.recall_score(true_labels,
                             predicted_labels,
                             average='weighted'),
        2))
    print('F1 score:', np.round(
        metrics.f1_score(true_labels,
                         predicted_labels,
                         average='weighted'),
        2))

#Training model
def train_predict_evaluate_model(classifier,
                                 train_features, train_labels,
                                 test_features, test_labels):
    # build model
    classifier.fit(train_features, train_labels)
    # predict using model
    predictions = classifier.predict(test_features)
    # evaluate model prediction performance
    get_metrics(true_labels=test_labels,
                predicted_labels=predictions)
    return predictions


def main():
    # Get the data
    corpus, labels = get_data()
    print("Total data volume:", len(labels))
    # Delete empty messages
    corpus, labels = remove_empty_docs(corpus, labels)
    print('One sample:', corpus[10])
    print('Sample label:', labels[10])
    label_name_map = ['Spam', 'Normal mail']  # label 0 -> spam, label 1 -> normal mail
    print('Actual type:', label_name_map[int(labels[10])], label_name_map[int(labels[5900])])
    # Split the data
    train_corpus, test_corpus, train_labels, test_labels = prepare_datasets(corpus, labels,
                                                                            test_data_proportion=0.3)

    # Clean the data and segment words
    norm_train_corpus = normalize_corpus(train_corpus)
    norm_test_corpus = normalize_corpus(test_corpus)

    # Bag-of-words features
    bow_vectorizer, bow_train_features = bow_extractor(norm_train_corpus)
    bow_test_features = bow_vectorizer.transform(norm_test_corpus)

    # TF-IDF features
    tfidf_vectorizer, tfidf_train_features = tfidf_extractor(norm_train_corpus)
    tfidf_test_features = tfidf_vectorizer.transform(norm_test_corpus)

    # tokenize documents
    tokenized_train = [jieba.lcut(text)
                       for text in norm_train_corpus]
    print(tokenized_train[2:10])
    tokenized_test = [jieba.lcut(text)
                      for text in norm_test_corpus]
    # build word2vec model
    # model = gensim.models.Word2Vec(tokenized_train,
    #                                size=500,
    #                                window=100,
    #                                min_count=30,
    #                                sample=1e-3)
    # Naive Bayes model
    mnb = MultinomialNB()
    # Support vector machine model (a linear SVM trained with SGD and hinge loss)
    svm = SGDClassifier(loss='hinge', n_iter_no_change=100)
    # Logistic regression model
    lr = LogisticRegression()

    # Multinomial naive Bayes with bag-of-words features
    print("Multinomial naive Bayes with bag-of-words features")
    mnb_bow_predictions = train_predict_evaluate_model(classifier=mnb,
                                                       train_features=bow_train_features,
                                                       train_labels=train_labels,
                                                       test_features=bow_test_features,
                                                       test_labels=test_labels)

    # Logistic regression with bag-of-words features
    print("Logistic regression with bag-of-words features")
    lr_bow_predictions = train_predict_evaluate_model(classifier=lr,
                                                      train_features=bow_train_features,
                                                      train_labels=train_labels,
                                                      test_features=bow_test_features,
                                                      test_labels=test_labels)

    # Support vector machine with bag-of-words features
    print("Support vector machine with bag-of-words features")
    svm_bow_predictions = train_predict_evaluate_model(classifier=svm,
                                                       train_features=bow_train_features,
                                                       train_labels=train_labels,
                                                       test_features=bow_test_features,
                                                       test_labels=test_labels)

    # Multinomial naive Bayes with TF-IDF features
    print("Multinomial naive Bayes with TF-IDF features")
    mnb_tfidf_predictions = train_predict_evaluate_model(classifier=mnb,
                                                         train_features=tfidf_train_features,
                                                         train_labels=train_labels,
                                                         test_features=tfidf_test_features,
                                                         test_labels=test_labels)

    # Logistic regression with TF-IDF features
    print("Logistic regression with TF-IDF features")
    lr_tfidf_predictions = train_predict_evaluate_model(classifier=lr,
                                                        train_features=tfidf_train_features,
                                                        train_labels=train_labels,
                                                        test_features=tfidf_test_features,
                                                        test_labels=test_labels)

    # Support vector machine with TF-IDF features
    print("Support vector machine with TF-IDF features")
    svm_tfidf_predictions = train_predict_evaluate_model(classifier=svm,
                                                         train_features=tfidf_train_features,
                                                         train_labels=train_labels,
                                                         test_features=tfidf_test_features,
                                                         test_labels=test_labels)


if __name__ == '__main__':
    main()

4. Results

Multinomial naive Bayes with bag-of-words features
Accuracy: 0.79
Precision: 0.85
Recall: 0.79
F1 score: 0.78
Logistic regression with bag-of-words features
Accuracy: 0.96
Precision: 0.96
Recall: 0.96
F1 score: 0.96
Support vector machine with bag-of-words features
Accuracy: 0.97
Precision: 0.97
Recall: 0.97
F1 score: 0.97
Multinomial naive Bayes with TF-IDF features
Accuracy: 0.79
Precision: 0.85
Recall: 0.79
F1 score: 0.78
Logistic regression with TF-IDF features
Accuracy: 0.94
Precision: 0.94
Recall: 0.94
F1 score: 0.94
Support vector machine with TF-IDF features
Accuracy: 0.97
Precision: 0.97
Recall: 0.97
F1 score: 0.97
