NLP05: Sentiment Classification Based on CNN-LSTM

Official Account: Notes on Data Mining and Machine Learning

This post builds a binary sentiment classification model with a CNN-LSTM network. The work is divided into the following steps:

  • Environment and parameter settings
  • Data preprocessing
  • Model network structure construction and training
  • Model usage

1. Environment and parameter settings

The environment is the set of required packages. The parameter settings cover the Embedding, CNN, and LSTM layers plus a few basic training settings.

from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout,Activation,Embedding,LSTM,Conv1D,MaxPooling1D
from tensorflow.keras.datasets import imdb
import re
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
#Embedding parameters
maxlen=100 #Maximum sample length: shorter samples are padded, longer ones are truncated
embedding_size=200 #Word vector dimension

#Convolution parameters

#LSTM parameters
lstm_output_size=100 #Output dimension of the LSTM layer

#training parameters

2. Data preprocessing and training data preparation

2.1 Data overview

The sentiment classification data used here consists mainly of reviews of shopping, hotel stays, and similar services. It is provided as Excel files in which each row is one sample. The specific form and content are as follows:

negative reviews

positive reviews

2.2 Data preprocessing

Only simple text processing is done here: Chinese characters are retained and all other characters are removed. No word segmentation is performed, so the model is trained at the character level.

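def textToChars(filePath):
  """
  Read an Excel file and keep only the Chinese characters of each row
  :param filePath:file path
  :return:list of cleaned strings, one per row
  """
  lines = []
  # Assumption: the sheet has no header row and the review text is in the first column
  df = pd.read_excel(filePath, header=None)
  for index, row in df.iterrows():
    text = re.sub("[^\u4e00-\u9fa5]", "", str(row.iloc[0]))  # Chinese only
    lines.append(text)
  return lines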
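
As a small illustration of the character filter used above, the following sketch applies the same regular expression to a mixed-language string. The helper name `keep_chinese` and the sample review are made up for this example.

```python
import re


def keep_chinese(text):
    """Drop every character outside the range U+4E00..U+9FA5."""
    return re.sub("[^\u4e00-\u9fa5]", "", str(text))


# Hypothetical review mixing Chinese, English, digits and punctuation
sample = "酒店很好, very clean! 2晚300元。"
cleaned = keep_chinese(sample)
chars = list(cleaned)  # character-level tokens, no word segmentation
print(chars)
```

Latin letters, digits, spaces, and both ASCII and full-width punctuation fall outside the range and are removed, so only the six Chinese characters remain as tokens.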

2.3 Training data preparation

Convert the text data into the matrix format required for training, and divide the training and test sets.

def getWordIndex(vocabPath):
  """
  Obtain word2Index and index2Word
  :param vocabPath:vocabulary file; the vocab.txt shipped with BERT is used here
  """
  word2Index = {}
  with open(vocabPath, encoding="utf-8") as f:
    for line in f:
      # Each line holds one token; its index is its line position
      word2Index[line.strip()] = len(word2Index)
  index2Word = dict(zip(word2Index.values(), word2Index.keys()))
  return word2Index, index2Word
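
To make the mapping concrete, here is a minimal sketch that builds the two dictionaries in the same way from a toy in-memory vocabulary and then encodes a string character by character. The five-entry vocabulary is an assumption standing in for the real vocab.txt.

```python
# Toy vocabulary standing in for vocab.txt (assumption: one token per line)
vocab_lines = ["[PAD]", "[UNK]", "好", "店", "酒"]

word2Index = {}
for line in vocab_lines:
    word2Index[line.strip()] = len(word2Index)
index2Word = {index: word for word, index in word2Index.items()}

# Characters missing from the vocabulary fall back to index 0
encoded = [word2Index.get(ch, 0) for ch in "酒店很好"]
print(encoded)

# Decoding shows which positions were unknown (they come back as the index-0 token)
decoded = [index2Word[i] for i in encoded]
print(decoded)
```

Because unknown characters share index 0 with the first vocabulary entry, they become indistinguishable from padding once the sequences are padded with zeros.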

def lodaData(posFile, negFile, word2Index):
  """
  Get training data
  :param posFile:positive sample file
  :param negFile:negative sample file
  :param word2Index:character-to-index mapping
  """
  posLines = textToChars(posFile)
  negLines = textToChars(negFile)
  textLines = posLines + negLines  # cleaned text, returned together with the labels
  print("Number of positive samples: %d, number of negative samples: %d"%(len(posLines),len(negLines)))
  # Characters missing from the vocabulary are mapped to index 0
  posIndexLines = [[word2Index.get(word, 0) for word in line] for line in posLines]
  negIndexLines = [[word2Index.get(word, 0) for word in line] for line in negLines]
  lines = posIndexLines + negIndexLines
  print("Total number of training and test samples: %d"%(len(lines)))
  # lens = [len(line) for line in lines]
  labels = [1] * len(posIndexLines) + [0] * len(negIndexLines)
  # Pad or truncate every sample at the end so that all have length maxlen
  padSequences = sequence.pad_sequences(lines, maxlen=maxlen, padding="post", truncating="post")
  # Split into training and test sets at a ratio of 8:2
  X_train,X_test,y_train,y_test=train_test_split(padSequences,np.array(labels),test_size=0.2,random_state=42)
  return (textLines,labels),(X_train,X_test,y_train,y_test)
vocabPath="/content/drive/My Drive/data/vocab.txt"
negFilePath="/content/drive/My Drive/data/text_classify/sentiment/neg.xls"
posFilePath="/content/drive/My Drive/data/text_classify/sentiment/pos.xls"
word2Index, index2Word=getWordIndex(vocabPath)
(textLines,labels),(X_train,X_test,y_train,y_test)=lodaData(posFilePath,negFilePath,word2Index)

The total number of samples is 21005, and the number of positive and negative samples is roughly equal.
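
The padding step can be pictured with a plain-Python sketch that mimics `pad_sequences` with `padding="post"` and `truncating="post"`. The helper `pad_post` is a hypothetical stand-in written for this example, not part of Keras.

```python
def pad_post(seq, maxlen, value=0):
    """Truncate from the end, then pad at the end, to exactly maxlen items."""
    seq = list(seq)[:maxlen]
    return seq + [value] * (maxlen - len(seq))


short_sample = [4, 3, 2]
long_sample = [1, 2, 3, 4, 5, 6, 7]

padded_short = pad_post(short_sample, 5)
padded_long = pad_post(long_sample, 5)
print(padded_short)
print(padded_long)
```

With "post" for both options, the start of each review is always kept: short reviews get trailing zeros and long reviews lose their tail.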

3. Model network structure construction and training

The overall network structure is Embedding + Conv + LSTM + Dense. The convolutional layer is a one-dimensional convolution applied along the time-step axis. Dropout is applied after the Embedding layer, max pooling follows the convolution, and a sigmoid activation follows the final fully connected layer. The loss function is binary cross-entropy and the optimizer is Adam.
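
The model construction code is not shown in this post. The following is a minimal sketch of how the structure described above could be assembled in Keras. `maxlen`, `embedding_size`, and `lstm_output_size` come from the parameter section, while `vocab_size`, `dropout_rate`, `filters`, `kernel_size`, and `pool_size` are assumed values chosen only for illustration.

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Activation, Conv1D, Dense, Dropout,
                                     Embedding, LSTM, MaxPooling1D)

# Values from the article
maxlen = 100
embedding_size = 200
lstm_output_size = 100

# Assumptions for illustration; the article does not list these values
vocab_size = 21128   # placeholder; use len(word2Index) in practice
dropout_rate = 0.25
filters = 64
kernel_size = 5
pool_size = 4

model = Sequential([
    Input(shape=(maxlen,)),                  # one padded index sequence per sample
    Embedding(vocab_size, embedding_size),   # index -> dense character vector
    Dropout(dropout_rate),                   # dropout after the embedding
    Conv1D(filters, kernel_size, padding="valid", activation="relu"),
    MaxPooling1D(pool_size=pool_size),       # shortens the sequence seen by the LSTM
    LSTM(lstm_output_size),                  # keeps only the final hidden state
    Dense(1),
    Activation("sigmoid"),                   # probability of the positive class
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()
```

The single sigmoid output pairs naturally with the binary cross-entropy loss, and a value above 0.5 can be read as a positive review.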

print("start training")
model.fit(X_train,y_train,batch_size=batch_size,epochs=epochs,validation_data=(X_test,y_test))

4. Model usage

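def predict_one(sentence,model,word2Index):
  sentence=re.sub("[^\u4e00-\u9fa5]", "", str(sentence))  # Chinese only
  # Map characters to indices; unknown characters become 0
  sentence=[word2Index.get(word, 0) for word in sentence]
  # Pad with 0 or truncate to maxlen so the input matches the training data
  sentence=sentence+[0]*(maxlen-len(sentence)) if len(sentence)<maxlen else sentence[0:maxlen]
  sentence=np.array([sentence])  # batch containing one sample, shape (1, maxlen)
  pred_prob=model.predict(sentence)
  label = 1 if pred_prob[0][0]>0.5 else 0
  return label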
sentence="A very unpleasant shopping experience. The page said the goods would arrive the next day, but they were shipped from Shaanxi, and the seller knew they could not possibly arrive the next day. Many places on the page promise 100% delivery, yet the promise was not fulfilled, and after many days of contacting customer service the ball was still kicked to the courier company. Lesson learned."

Tags: NLP

Posted by hkothari on Mon, 16 May 2022 12:42:47 +0300