Artificial Intelligence - Text Classification (a course-project essential)

👦👦 A handsome boy, you can call me Love And Program
🖱 ⌨ Personal homepage: Love And Program's personal homepage
💖💖 If this helps you, I'd appreciate a like, a favorite, and a follow 💨💨 to support the blogger

Preface

Recently I've noticed that my course-project series is being bookmarked like crazy. I get it, project season is here!! Over the next few days I'll post several project-level articles with step-by-step, nanny-level explanations so you can pass easily. Of course, if you're aiming for an excellent grade you'll need to add something of your own; you can refer back to these posts.

Text Classification of Movie Reviews

This time we'll walk through the TensorFlow beginner tutorial: classifying movie reviews, dividing the texts into positive and negative categories. It's a binary classification problem, which you can briefly explain to your teacher. I'll attach the source code address, but due to possible version differences I suggest you follow my code.

Python version: 3.8
TensorFlow version: 2.4 and 2.5 (2.3 is recommended; different versions produce different warnings, but the results are not affected)
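
If you're not sure which versions you're running, a quick check (standard library plus TensorFlow only):

import sys
import tensorflow as tf
print(sys.version)       #Python version
print(tf.__version__)    #TensorFlow version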

Importing the Data

We use the IMDB (Internet Movie Database) dataset. A quick introduction: this dataset is built into the Keras library and can be loaded directly.

First, download the dataset as shown below (num_words=10000 means we keep only the 10,000 most frequent words; you can change this value).

imdb = keras.datasets.imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

If you see an error message like VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or-ndarrays with different lengths or shapes), it's a minor warning from a NumPy incompatibility. Step into the imdb.load_data source and add the following around line 140. If the warning doesn't appear, skip this paragraph.

x_train = np.array(x_train, dtype=object)
x_test = np.array(x_test, dtype=object)


And change the code around line 160 to:

  x_train, y_train = np.array(xs[:idx],dtype=object), np.array(labels[:idx],dtype=object)
  x_test, y_test = np.array(xs[idx:],dtype=object), np.array(labels[idx:],dtype=object)

Then inspect the data. The second printout may be hard to read at first: each integer marks a specific position in the dictionary. For example, a 1 corresponds to the 1st word in the dictionary, a 14 to the 14th word, and so on.

#All data
print(train_data)
#The first piece of data, each integer represents a word in the dictionary
print(train_data[0])
#result: [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8,  316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144,  30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
#Film review length
print(len(train_data[0]), len(train_data[1]))
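
The labels are just as simple: 0 means a negative review and 1 means a positive one. A quick check:

#Each label is 0 (negative) or 1 (positive)
print(train_labels[0])
#prints 1 here: the first training review is positive (you'll see it decoded below)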

Converting the Data Back into Words

First, get the complete word index. The point of this step is to show you that every word corresponds to a number; as mentioned above, the final form looks like 65117: 'sakal', 44868: 'marveling'. As for why (v+3) appears, keep reading.

word_index = imdb.get_word_index()
word_index = {k:(v+3) for k,v in word_index.items()}

I wanted to print the actual content of train_data[0], rather than a string of numbers, using the following code.
Note: here I temporarily replace (v+3) with plain v.

This converts the data from the form 'gussied': 65114, "bullock's": 32069, ... into the reversed form 65117: 'sakal', 44868: 'marveling', .... The code is as follows:

word_index = imdb.get_word_index()
#'gussied': 65114, "bullock's": 32069...
word_index = {k:(v) for k,v in word_index.items()}
#Reverse the mapping so we can look words up by index: 65117: 'sakal', 44868: 'marveling', ...
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
#print(reverse_word_index)
def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
print(decode_review(train_data[0]))

But when you print it, you'll see the text below and wonder: what is this nonsense???? Why? Read on for a crash course in how text is preprocessed for a network.

the as you with out themselves powerful lets loves their becomes reaching had journalist of lot from anyone to have after out atmosphere never more room and it so heart shows to years of every never going and help moments or of every chest visual movie except her was several of enough more with is now current film as you of mine potentially unfortunately of you than him that with out themselves her get for was camp of you movie sometimes movie that with scary but and to story wonderful that in seeing in character to of 70s musicians with heart had shadows they of here that with her serious to have does when from why what have critics they is you that isn't one will very to as itself with other and in of seen over landed for anyone of and br show's to whether from than out themselves history he name half some br of and odd was two most of mean for 1 any an boat she he should is thought frog but of script you not while history he heart to real at barrel but when from one bit then have two of script their with her nobody most that with wasn't to with armed acting watch an for with heartfelt film want an

The official website shows that a block like the following is needed. Without further ado, here is what each token means.

word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

<PAD>: when processing data, every sequence in a batch must have the same length, so shorter texts are expanded. For example: my name is loveandprogram becomes my name is loveandprogram <PAD> <PAD> <PAD> <PAD>, forcibly extending the sequence to the set maximum length of 8.

<START>: added at the beginning of each text.

<UNK>: used to replace words that are not in the known vocabulary. What does that mean? If LoveAndProgram is out of vocabulary, my name is LoveAndProgram is rendered as my name is <UNK>.

<UNUSED>: a reserved token; I haven't found a clear explanation of its use.

Now you know why we wrote (v+3): indices 0 through 3 are reserved for these four tokens.

After the above processing, the code is as follows:

word_index = imdb.get_word_index()
word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0
#start
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
print(decode_review(train_data[0]))

Finally, the text below appears and the data conversion is complete.

<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all
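
To go the other way, from a raw sentence to the integer form the model expects, here is a minimal sketch. The helper name encode_review is my own invention, not part of Keras; it assumes word_index has already been shifted by (v+3) and extended with the four special tokens as above (and note it ignores the num_words=10000 cutoff used when loading the data).

def encode_review(text):
    #Every encoded sequence starts with <START> (index 1)
    encoded = [word_index["<START>"]]
    for word in text.lower().split():
        #Out-of-vocabulary words fall back to <UNK> (index 2)
        encoded.append(word_index.get(word, word_index["<UNK>"]))
    return encoded

print(encode_review("this film was just brilliant"))
#[1, 14, 22, 16, 43, 530], matching the start of train_data[0] above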

Preparing the Data for the Neural Network

The integer arrays must be converted into tensors before entering the neural network. Here we choose to pad the arrays so that every input has the same length, creating an integer tensor of shape max_length * num_reviews; the first layer of the network can then be an embedding layer that handles data of this shape.

The sequence-preprocessing function pad_sequences() pads sequences to standardize text length and returns a two-dimensional tensor of length maxlen. Does the value parameter look familiar? As described above, it is <PAD>, which corresponds to 0 and is used to fill out sequences that are too short. (0 does not represent any actual word; it is padding only, while unknown words are encoded as <UNK> = 2.)

max_sequence_len = 256
#print(decode_review(train_data[0]))
train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        maxlen=max_sequence_len)
print(train_data,train_data.shape)
test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=word_index["<PAD>"],
                                                       padding='post',
                                                       maxlen=max_sequence_len)
#Training entries: 25000, labels: 25000
#[[   1   14   22 ...    0    0    0]
# [   1  194 1153 ...    0    0    0]
# [   1   14   47 ...    0    0    0]
# ...
# [   1   11    6 ...    0    0    0]
# [   1 1446 7079 ...    0    0    0]
# [   1   17    6 ...    0    0    0]] (25000, 256)

Let's take a look at the final data using the code below (len(train_data) is 25000, as mentioned above):

for i in range(len(train_data)):
    print(len(train_data[i]))
#256
#.
#.
#.
#256
#256

Building the Model

The keras.Sequential() function builds a sequential model. Here vocab_size must be consistent with the number of words kept above: 10,000 is the number of distinct words in the text. Each word is represented by a 16-dimensional vector; this can be raised to 64 dimensions, at the cost of more computation.

  1. The first layer is the embedding layer; just add it at the start of the network. It converts each word index into a vector of the dimension you chose.
  2. The second layer is a global average pooling layer, which simply averages over the sequence dimension, letting the model handle variable-length input in the simplest possible way.
  3. The third and fourth layers are fully connected (Dense) layers. You can change the unit counts to let the model learn more complex representations, though this significantly increases the amount of computation, so adjust as appropriate. This part is worth explaining to your teacher in detail, and you can add more layers yourself. If the code fails to run, it is the same batch_size issue flagged in the note further below.
vocab_size = 10000

model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))

model.summary()
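
For reference, with the settings above model.summary() should report parameter counts like these (layer names and exact formatting vary across TensorFlow versions):

#Embedding: 10000 words x 16 dims             -> 160,000 params
#GlobalAveragePooling1D: no weights           -> 0 params
#Dense(16): 16 inputs x 16 units + 16 biases  -> 272 params
#Dense(1): 16 inputs x 1 unit + 1 bias        -> 17 params
#Total                                        -> 160,289 params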

Loss Function, Optimizer, and Training the Model

optimizer specifies the optimizer. For your report, you can train with several optimizers, compare them, and finally choose Adam, attaching the comparison data. Here's an example: the Adagrad optimizer is used for comparison below, and you can see its accuracy improves more slowly than Adam's.

(This is mainly to show your teacher the amount of work you did. Optimizers worth listing: RMSprop, Adagrad, Adadelta, and SGD, which I've sorted by accuracy. A sketch that automates this comparison appears after the Adagrad example below.)

loss specifies the loss function. It could also be replaced with mean_squared_error, but generally speaking binary_crossentropy is better suited to dealing with probabilities. You can read up on this to supplement your knowledge.
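
To get a feel for the difference, here is a small illustration of my own (not from the tutorial): cross-entropy punishes a confidently wrong prediction much harder than mean squared error does.

import tensorflow as tf
bce = tf.keras.losses.BinaryCrossentropy()
mse = tf.keras.losses.MeanSquaredError()
y_true = [[1.0], [0.0]]
y_pred = [[0.01], [0.99]]   #confidently wrong on both examples
print(bce(y_true, y_pred).numpy())   #about 4.6, a heavy penalty
print(mse(y_true, y_pred).numpy())   #about 0.98, bounded by 1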

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
x_val = train_data[:10000]
partial_x_train = train_data[10000:]
y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]
#With some TensorFlow versions, model.fit below raises the error shown next; my version required casting the types manually (if so, uncomment the four lines below)
#ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type int).
#partial_x_train = partial_x_train.astype(np.int64)
#partial_y_train = partial_y_train.astype(np.int64)
#x_val = x_val.astype(np.int64)
#y_val = y_val.astype(np.int64)
print(type(partial_x_train))
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=1)

Now with the Adagrad optimizer. Higher accuracy is better; val_accuracy is the accuracy on the validation set. A smaller loss is better. epochs can be changed, and within limits larger values help.

model.compile(optimizer='Adagrad',
              loss='binary_crossentropy',
              metrics=['accuracy'])
x_val = train_data[:10000]
partial_x_train = train_data[10000:]
y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]

#ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type int).
#partial_x_train = partial_x_train.astype(np.int64)
#partial_y_train = partial_y_train.astype(np.int64)
#x_val = x_val.astype(np.int64)
#y_val = y_val.astype(np.int64)
#Uncomment the four lines above if the error occurs; as before, this is a version-specific manual fix
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=1)
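
To automate the optimizer comparison mentioned earlier, here is a minimal sketch of my own (build_model is a hypothetical helper, not from the tutorial). It retrains the same architecture once per optimizer and records the final validation accuracy; fewer epochs keep it quick.

def build_model():
    m = keras.Sequential([
        keras.layers.Embedding(vocab_size, 16),
        keras.layers.GlobalAveragePooling1D(),
        keras.layers.Dense(16, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid')])
    return m

for opt in ['adam', 'rmsprop', 'adagrad', 'sgd']:
    m = build_model()
    m.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
    h = m.fit(partial_x_train, partial_y_train, epochs=10, batch_size=512,
              validation_data=(x_val, y_val), verbose=0)
    print(opt, h.history['val_accuracy'][-1])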


Note: if an epoch gets stuck and training makes no progress, make batch_size smaller: change it to 64 or less. This is also related to the number of layers, and comes down to the limits of your own computer rather than a problem with the code.

Evaluating the Model

Call model.evaluate directly; it returns the loss value plus whatever was specified in metrics. Since we specified accuracy above, the accuracy is reported, i.e. results is [loss, accuracy].

#ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type int).
#test_data = test_data.astype(np.int64)
#test_labels = test_labels.astype(np.int64)
#If the two lines below report the error above, uncomment the two casting lines; this is a TensorFlow version difference, and my version required changing the types manually
results = model.evaluate(test_data,  test_labels, verbose=2)
print(results)

👏👏 Prediction: Viewing the Detailed Results

If you've made it this far, you've shown real perseverance: model training is complete. Now let's see how to use the trained model!!

print(decode_review(test_data[0]))
print(test_labels[0])
print(decode_review(test_data[1]))
print(test_labels[1])
predictions = model.predict(test_data)
print(predictions)

Running the first review through translation software confirms it is genuinely negative; the second, after translation, is positive. And with that, the walkthrough is complete.

<START> please give this one a miss br br <UNK> <UNK> and the rest of the cast rendered terrible performances the show is flat flat flat br br i don't know how michael madison could have allowed this one on his plate he almost seemed to know this wasn't going to work out and his performance was quite <UNK> so all you madison fans give this a miss <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
0
a lot of patience because it focuses on mood and character development the plot is very simple and many of the scenes take place on the same set in frances <UNK> the sandy dennis character apartment but the film builds to a disturbing climax br br the characters create an atmosphere <UNK> with sexual tension and psychological <UNK> it's very interesting that robert altman directed this considering the style and structure of his other films still the trademark altman audio style is evident here and there i think what really makes this film work is the brilliant performance by sandy dennis it's definitely one of her darker characters but she plays it so perfectly and convincingly that it's scary michael burns does a good job as the mute young man regular altman player michael murphy has a small part the <UNK> moody set fits the content of the story very well in short this movie is a powerful study of loneliness sexual <UNK> and desperation be patient <UNK> up the atmosphere and pay attention to the wonderfully written script br br i praise robert altman this is one of his many films that deals with unconventional fascinating subject matter this film is disturbing but it's sincere and it's sure to <UNK> a strong emotional response from the viewer if you want to see an unusual film some might even say bizarre this is worth the time br br unfortunately it's very difficult to find in video stores you may have to buy it off the internet
1
[[0.41339412]#In the first sentence, less than 0.5 is negative
 [0.9098137 ]#In the second sentence, greater than 0.5 is positive
 [0.6236479 ]
 ...
 [0.30431408]
 [0.48930165]
 [0.63445085]]
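
To turn these probabilities into hard class labels, threshold at 0.5, the standard choice for a sigmoid output (this snippet is my own addition):

import numpy as np
predicted_labels = (predictions > 0.5).astype(int).flatten()
print(predicted_labels[:2])
#[0 1], matching test_labels[0] and test_labels[1] above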

Finally, you can add loss and accuracy curves for data visualization.

Here you can follow the official-website code directly without any errors, but you can also differentiate your work a little. I'll use the loss curve as the detailed example.

color accepts red, blue, yellow, tan, lime, g, cyan, and other values.

linestyle accepts '-', '--', '-.', ':', 'None', 'solid', 'dashed', 'dashdot', 'dotted'.

import matplotlib.pyplot as plt
#Keep training history
history_dict = history.history
history_dict.keys()
acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, loss, color = "orchid",linestyle='dotted', label='Training loss')
plt.plot(epochs, val_loss, color = "blue",linestyle='--', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

Complete code

import tensorflow as tf
from tensorflow import keras
import numpy as np
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
imdb = keras.datasets.imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
print("Training entries: {}, labels: {}".format(len(train_data), len(train_labels)))
word_index = imdb.get_word_index()
word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
max_sequence_len = 256
train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        maxlen=max_sequence_len)
test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=word_index["<PAD>"],
                                                       padding='post',
                                                       maxlen=max_sequence_len)
vocab_size = 10000
model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())

model.add(keras.layers.Dense(16, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))
# model = keras.Sequential()
# model.add(keras.layers.Embedding(vocab_size, 160))
# model.add(keras.layers.GlobalAveragePooling1D())
# model.add(keras.layers.Dense(160, activation='relu'))
# model.add(keras.layers.Dense(80, activation='relu'))
# model.add(keras.layers.Dense(10, activation='relu'))
# model.add(keras.layers.Dense(1, activation='sigmoid'))
# model.summary()

model.summary()
model.compile(optimizer='Adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

x_val = train_data[:10000]
partial_x_train = train_data[10000:]

y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]
partial_x_train = partial_x_train.astype(np.int64)
partial_y_train = partial_y_train.astype(np.int64)
x_val = x_val.astype(np.int64)
y_val = y_val.astype(np.int64)
#print(type(partial_x_train))
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=256,
                    validation_data=(x_val, y_val),
                    verbose=1)
test_data = test_data.astype(np.int64)
test_labels = test_labels.astype(np.int64)
results = model.evaluate(test_data,  test_labels, verbose=2)

print(results)
import matplotlib.pyplot as plt
history_dict = history.history
history_dict.keys()
acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, loss, color = "orchid",linestyle='dotted', label='Training loss')
plt.plot(epochs, val_loss, color = "g",linestyle='--', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()
plt.clf()

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'g', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()
#Prediction function test
#print(decode_review(test_data[0]))
#print(test_labels[0])
#print(decode_review(test_data[1]))
#print(test_labels[1])
#predictions = model.predict(test_data)
#print(predictions)

Tags: AI TensorFlow
