Official Account: Notes on Data Mining and Machine Learning
Sentiment classification using CNN-LSTM: here we build a binary classification model. The overall workflow is divided into the following steps:
- Environment and parameter settings
- Data preprocessing
- Model network structure construction and training
- Model usage
1. Environment and parameter settings
The environment section lists the required packages. The parameter settings cover the Embedding, CNN, and LSTM network layers as well as some basic training parameters.
```python
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Embedding, LSTM, Conv1D, MaxPooling1D
from tensorflow.keras.datasets import imdb
import re
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Embedding parameters
maxlen = 100            # maximum sample length: shorter samples are padded, longer ones truncated
embedding_size = 200    # word-vector dimension

# Convolution parameters
kernel_size = 5
filters = 128
pool_size = 4

# LSTM parameters
lstm_output_size = 100  # output dimension of the LSTM layer

# Training parameters
batch_size = 128
epochs = 20
```
2. Data preprocessing and training data preparation
2.1 Data overview
The sentiment classification data used here are mainly reviews of shopping, hotel stays, and similar services, provided as Excel files in which each row is one sample; a quick way to inspect them is sketched below.
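A minimal sketch for peeking at the raw data (the file path is the positive-sample path used later in this post; adjust it to your own layout):

```python
import pandas as pd

# Peek at the first few positive samples; the files have no header row,
# and each row holds one review.
df = pd.read_excel("/content/drive/My Drive/data/text_classify/sentiment/pos.xls", header=None)
df.columns = ["content"]
print(df.shape)
print(df.head())
```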
2.2 Data preprocessing
Only simple text processing is done here: Chinese characters are retained and all non-Chinese characters are removed. No word segmentation is performed; the model is trained at the character level.
```python
def textToChars(filePath):
    """
    Read an Excel file of reviews and split each row into characters.
    :param filePath: file path
    :return: list of character lists, one per sample
    """
    lines = []
    df = pd.read_excel(filePath, header=None)
    df.columns = ['content']
    for index, row in df.iterrows():
        row = row['content']
        row = re.sub("[^\u4e00-\u9fa5]", "", str(row))  # keep Chinese characters only
        lines.append(list(str(row)))
    return lines
```
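As a quick sanity check of the cleaning step: the regex keeps only characters in the CJK Unified Ideographs range (\u4e00-\u9fa5), so punctuation, digits, and Latin text are all dropped:

```python
import re

sample = "这家酒店很不错!Great stay, 5 stars。"
print(list(re.sub("[^\u4e00-\u9fa5]", "", sample)))
# ['这', '家', '酒', '店', '很', '不', '错']
```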
2.3 Training data preparation
Convert the text data into the matrix format required for training, and split it into training and test sets.
```python
def getWordIndex(vocabPath):
    """
    Build the word2Index and index2Word mappings.
    :param vocabPath: vocabulary file; BERT's vocab.txt is used here
    :return: (word2Index, index2Word)
    """
    word2Index = {}
    with open(vocabPath, encoding="utf-8") as f:
        for line in f.readlines():
            word2Index[line.strip()] = len(word2Index)
    index2Word = dict(zip(word2Index.values(), word2Index.keys()))
    return word2Index, index2Word


def loadData(posFile, negFile, word2Index):
    """
    Build the training data.
    :param posFile: positive-sample file
    :param negFile: negative-sample file
    :param word2Index:
    :return:
    """
    posLines = textToChars(posFile)
    negLines = textToChars(negFile)
    textLines = posLines + negLines
    print("Number of positive samples: %d, number of negative samples: %d" % (len(posLines), len(negLines)))
    # Map characters to vocabulary indices; out-of-vocabulary characters map to 0
    posIndexLines = [[word2Index.get(word, 0) for word in line] for line in posLines]
    negIndexLines = [[word2Index.get(word, 0) for word in line] for line in negLines]
    lines = posIndexLines + negIndexLines
    print("Total number of training and test samples: %d" % (len(lines)))
    labels = [1] * len(posIndexLines) + [0] * len(negIndexLines)
    # Pad or truncate every sample to maxlen
    padSequences = sequence.pad_sequences(lines, maxlen=maxlen, padding="post", truncating="post")
    # Split into training and test sets at a ratio of 8:2
    X_train, X_test, y_train, y_test = train_test_split(padSequences, np.array(labels), test_size=0.2, random_state=42)
    return (textLines, labels), (X_train, X_test, y_train, y_test)


vocabPath = "/content/drive/My Drive/data/vocab.txt"
negFilePath = "/content/drive/My Drive/data/text_classify/sentiment/neg.xls"
posFilePath = "/content/drive/My Drive/data/text_classify/sentiment/pos.xls"
word2Index, index2Word = getWordIndex(vocabPath)
(textLines, labels), (X_train, X_test, y_train, y_test) = loadData(posFile=posFilePath, negFile=negFilePath, word2Index=word2Index)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```
The total number of samples is 21005, and the number of positive and negative samples is roughly equal.
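With 21005 samples and an 80:20 split, X_train and X_test should come out as (16804, 100) and (4201, 100). A minimal sketch to verify the class balance from the labels returned by loadData:

```python
import numpy as np

labels_arr = np.array(labels)
print("Total samples:", len(labels_arr))
print("Positive:", int((labels_arr == 1).sum()), "Negative:", int((labels_arr == 0).sum()))
```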
3. Model network structure construction and training
The overall network structure is Embedding + Conv + LSTM + Dense, where the convolutional layer is a one-dimensional convolution that slides over the time dimension. Dropout is applied after the Embedding layer, max pooling follows the convolution, and the final fully connected layer is followed by a sigmoid activation. The loss function is binary cross-entropy, and the optimizer is Adam.
```python
model = Sequential()
model.add(Embedding(len(word2Index), embedding_size, input_length=maxlen))
model.add(Dropout(0.2))
model.add(Conv1D(filters, kernel_size, padding="valid", activation="relu", strides=1))
model.add(MaxPooling1D(pool_size))
model.add(LSTM(lstm_output_size))
model.add(Dense(1))
model.add(Activation("sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print("start training")
model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, validation_data=(X_test, y_test))

model.summary()
```
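To see why these layers fit together, the sequence lengths can be traced by hand: a "valid" convolution with kernel_size=5 and stride 1 turns the 100 embedded time steps into 100 - 5 + 1 = 96, max pooling with pool_size=4 reduces that to 96 // 4 = 24, and the LSTM consumes those 24 steps and emits a single 100-dimensional vector. A small sketch of the arithmetic, which should agree with the model.summary() output above:

```python
# Hand-computed sequence lengths through the network
conv_steps = maxlen - kernel_size + 1  # 100 - 5 + 1 = 96 steps after the 'valid' Conv1D
pool_steps = conv_steps // pool_size   # 96 // 4 = 24 steps fed into the LSTM
print(conv_steps, pool_steps)          # 96 24
```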
4. Model usage
```python
def predict_one(sentence, model, word2Index):
    sentence = re.sub("[^\u4e00-\u9fa5]", "", str(sentence))  # keep Chinese characters only
    sentence = [word2Index.get(word, 0) for word in sentence]
    # Pad with zeros up to maxlen, or truncate to maxlen
    sentence = sentence + [0] * (maxlen - len(sentence)) if len(sentence) < maxlen else sentence[0:maxlen]
    sentence = np.reshape(np.array(sentence), (-1, len(sentence)))
    pred_prob = model.predict(sentence)
    label = 1 if pred_prob[0][0] > 0.5 else 0
    print(label)
    return label
```
sentence="A very unpleasant shopping, the page said that the goods would arrive the next day, but the goods were sent from Shaanxi, and the seller knew that the goods would not arrive the next day at all. It is mentioned in many places that there are still 100 for delivery.%The delivery was not fulfilled, and after contacting the customer service for many days, the ball was still kicked to the courier company. It's a lesson." predict_one(sentence,model,word2Index)
- Data: https://github.com/chongzicbo/nlp-ml-dl-notes/tree/master/data/data2
- Code: https://github.com/chongzicbo/nlp-ml-dl-notes/blob/master/code/textclassification/cnn_lstm.ipynb