TF 2.0: basic text classification

This article is adapted from the TF 2.0 official tutorial (https://www.tensorflow.org/tutorials/keras/text_classification).
The example in this article uses IMDB review data for sentiment analysis:
Data source address: https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
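
If the archive is not already on your machine, one possible way to download and extract it is with tf.keras.utils.get_file (a minimal sketch; with cache_dir='.' and cache_subdir='' the archive is extracted into ./aclImdb):

import tensorflow as tf

url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

# Download the archive and extract it into the current directory (./aclImdb)
dataset_dir = tf.keras.utils.get_file(
    "aclImdb_v1", url, untar=True, cache_dir='.', cache_subdir='')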

1. Load dataset

Use the tf.keras.preprocessing.text_dataset_from_directory() function to load a text dataset from a directory. The directory must have the following structure:

main_directory/
...class_a/
......a_text_1.txt
......a_text_2.txt
...class_b/
......b_text_1.txt
......b_text_2.txt

Here class_a and class_b are the classification labels. In this example, the dataset is the IMDB review data, and the directory structure is:

➜  aclImdb ll
total 3432
-rw-r--r--  1 hongbin.dhb  staff    4037  6 26  2011 README
-rw-r--r--  1 hongbin.dhb  staff  845980  4 13  2011 imdb.vocab
-rw-r--r--  1 hongbin.dhb  staff  903029  6 12  2011 imdbEr.txt
drwxr-xr-x  7 hongbin.dhb  staff     224  4 13  2011 test
drwxr-xr-x  9 hongbin.dhb  staff     288 10 10 19:51 train

The code for loading the dataset is as follows:

# Size of each batch
batch_size = 32
seed = 42

# Extract 80% from the train directory as the training set
raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', 
    batch_size=batch_size, 
    validation_split=0.2, 
    subset='training', 
    seed=seed)

# Extract 20% from the directory train as the validation set
raw_val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', 
    batch_size=batch_size, 
    validation_split=0.2, 
    subset='validation', 
    seed=seed)

# Load test sets from the test directory
raw_test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/test', 
    batch_size=batch_size)

The return value raw_train_ds is a labeled dataset of type tf.data.Dataset.
Note that when validation_split and subset are used, you must specify the seed parameter (or use shuffle=False) so that the training set and validation set do not overlap.

Let's take a look at the labels and class_names of the loaded dataset:

for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(3):
    print("Review", text_batch.numpy()[i])
    print("Label", label_batch.numpy()[i])

The label value is 0 or 1; the corresponding class names can be obtained from raw_train_ds.class_names[0] and raw_train_ds.class_names[1]:

print("Label 0 corresponds to", raw_train_ds.class_names[0])
print("Label 1 corresponds to", raw_train_ds.class_names[1])

output

Label 0 corresponds to neg
Label 1 corresponds to pos

2. Data preprocessing

Data preprocessing requires the text to be standardized, tokenized, and vectorized.

  • Standardization usually refers to removing punctuation or HTML tags from the text
  • Tokenization refers to splitting the text into individual words (tokens)
  • Vectorization refers to converting tokens into numbers so they can be fed into the neural network

Here is a custom standardization function:

import re
import string

import tensorflow as tf

def custom_standardization(input_data):
	# Convert text to lowercase
	lowercase = tf.strings.lower(input_data)
	# Replace the <br /> HTML tag with a space (word separator)
	stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
	# Remove punctuation marks
	return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation),
                                  '')
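
To see what the standardization does, you can call it on a sample string (a quick check, assuming the imports above):

sample = tf.constant(["This movie was GREAT!<br />Loved it..."])
print(custom_standardization(sample))
# Expected: the lowercased text with the <br /> tag and punctuation removed,
# roughly tf.Tensor([b'this movie was great loved it'], shape=(1,), dtype=string)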

The TextVectorization layer handles text standardization, tokenization, and vectorization:

from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

# Maximum number of features, i.e. the maximum vocabulary size
max_features = 10000
# Output sequence length (sequences are padded or truncated to this length)
sequence_length = 250

# Build the vectorization layer (for text vectorization):
# - uses the custom standardization function above
# - keeps at most 10000 tokens in the vocabulary
# - outputs integer sequences of length 250
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length)

Next, call the adapt method so that the vectorization layer builds its vocabulary (the mapping from words to integer indices) from the training text.

# Take the text part of the original training set (without taking the label field)
train_text = raw_train_ds.map(lambda x, y: x)
vectorize_layer.adapt(train_text)
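
After adapt, the layer holds a vocabulary of at most max_features tokens; index 0 is reserved for padding and index 1 for out-of-vocabulary words. A quick way to inspect it:

# Look at the learned vocabulary
vocab = vectorize_layer.get_vocabulary()
print(len(vocab))      # at most max_features (10000)
print(vocab[:5])       # typically ['', '[UNK]', 'the', 'and', 'a'] for this corpus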

Define a function to vectorize the raw dataset:

def vectorize_text(text, label):
	# Add a dimension so the scalar string becomes a batch of one element
	text = tf.expand_dims(text, -1)
	# Return the text vector and the label
	return vectorize_layer(text), label
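
To see the effect on a single example, you can take one batch from the raw training set and vectorize its first review (a small sketch):

# Vectorize the first review of the first raw batch
text_batch, label_batch = next(iter(raw_train_ds))
first_review, first_label = text_batch[0], label_batch[0]
print("Review:", first_review)
print("Label:", raw_train_ds.class_names[first_label])
print("Vectorized:", vectorize_text(first_review, first_label))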

Apply it to the training, validation, and test sets:

# Each element of the dataset is a (text, label) pair
train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)

Finally, an important step: to improve data-loading performance, cache and prefetch the vectorized data.

AUTOTUNE = tf.data.experimental.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)
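
If you want to confirm that the pipeline now produces what the model expects, inspect one batch (a quick sanity check; each element should be a batch of integer sequences of length sequence_length plus a batch of labels):

print(train_ds.element_spec)

for features, labels in train_ds.take(1):
    print(features.shape)   # (batch_size, sequence_length), e.g. (32, 250)
    print(labels.shape)     # (batch_size,), e.g. (32,)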

At this point the data preprocessing is essentially complete; train_ds, val_ds, and test_ds can be fed into our model.

3. Build a network model

from tensorflow.keras import layers

embedding_dim = 16

model = tf.keras.Sequential([
	layers.Embedding(max_features + 1, embedding_dim),
	layers.Dropout(0.2),
	layers.GlobalAveragePooling1D(),
	layers.Dropout(0.2),
	layers.Dense(1)])

model.summary()

output

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 16)          160016    
_________________________________________________________________
dropout (Dropout)            (None, None, 16)          0         
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 1)                 17        
=================================================================
Total params: 160,033
Trainable params: 160,033
Non-trainable params: 0
_________________________________________________________________

Model description:

  • The first layer is an Embedding layer. It takes integer-encoded word indices as input and looks up the embedding vector for each index; these vectors are learned during training. We will cover the details later.
  • Next is a GlobalAveragePooling1D layer, which averages over the sequence dimension to produce a fixed-length output vector for each example (see the sketch after this list).
  • Finally, a fully connected (Dense) layer with a single output node. It has no activation, so it outputs a raw logit; the sigmoid is applied by the loss (from_logits=True) during training and added explicitly in the exported model below.
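
To make the first two points concrete, here is a tiny standalone sketch (toy sizes, not the model above) showing how an Embedding layer maps integer word indices to vectors and how GlobalAveragePooling1D averages those vectors over the sequence dimension:

import tensorflow as tf

# Toy example: vocabulary of 5 tokens, 3-dimensional embeddings
embedding = tf.keras.layers.Embedding(input_dim=5, output_dim=3)
pooling = tf.keras.layers.GlobalAveragePooling1D()

# A "sentence" of 4 token indices (batch of 1)
token_ids = tf.constant([[1, 2, 2, 0]])

vectors = embedding(token_ids)   # shape (1, 4, 3): one vector per token
pooled = pooling(vectors)        # shape (1, 3): mean over the 4 tokens

print(vectors.shape, pooled.shape)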

4. Loss function and optimizer

Because this is a binary classification problem and the model outputs a raw logit, we use the losses.BinaryCrossentropy loss function with from_logits=True.

from tensorflow.keras import losses

model.compile(
	loss=losses.BinaryCrossentropy(from_logits=True), 
	optimizer='adam', 
	metrics=tf.metrics.BinaryAccuracy(threshold=0.0)
)
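
A small sketch of what from_logits=True means: the loss applies a sigmoid to the raw model output internally, so computing the loss on a logit with from_logits=True gives (approximately) the same value as computing it on the sigmoid of that logit with from_logits=False:

import tensorflow as tf

y_true = tf.constant([[1.0]])
logit = tf.constant([[0.7]])    # raw model output (a logit)
prob = tf.sigmoid(logit)        # probability after sigmoid

loss_from_logits = tf.keras.losses.BinaryCrossentropy(from_logits=True)(y_true, logit)
loss_from_probs = tf.keras.losses.BinaryCrossentropy(from_logits=False)(y_true, prob)

print(float(loss_from_logits), float(loss_from_probs))  # approximately equal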

5. Model training

Simply pass the datasets to the model's fit method:

epochs = 10
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs)

6. Evaluate the model

loss, accuracy = model.evaluate(test_ds)

print("Loss: ", loss)
print("Accuracy: ", accuracy)

output

782/782 [==============================] - 3s 3ms/step - loss: 0.3104 - binary_accuracy: 0.8734
Loss:  0.31037017703056335
Accuracy:  0.8733999729156494

7. Plot the loss and accuracy curves

The model's fit method returns a History object that records the accuracy and loss during training. Use it to plot the corresponding curves.

Draw the loss curve:

import matplotlib.pyplot as plt

history_dict = history.history

# Accuracy on training set
acc = history_dict['binary_accuracy']
# Accuracy on validation set
val_acc = history_dict['val_binary_accuracy']
# Loss on training set
loss = history_dict['loss']
# Loss on validation set
val_loss = history_dict['val_loss']

epochs = range(1, len(acc) + 1)

# Plot the training loss
plt.plot(epochs, loss, 'bo', label='Training loss')
# Plot the validation loss
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

Draw the accuracy curve:

# Plot the training accuracy
plt.plot(epochs, acc, 'ro', label='Training acc')
# Plot the validation accuracy
plt.plot(epochs, val_acc, 'r', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')

plt.show()

8. Model export

Above, we used the TextVectorization layer to preprocess the text before feeding it into model training. If we want a model that accepts raw text directly for training and prediction, we can build a new model whose first layer is the vectorize_layer, whose second layer is the model trained above, and whose last layer applies a sigmoid activation. This new model does not need to be trained; it can be compiled and used directly.

# Build a new model and use the previously trained model
export_model = tf.keras.Sequential([
  vectorize_layer,
  model,
  layers.Activation('sigmoid')
])

# Model compilation, specifying loss function and optimizer
export_model.compile(
    loss=losses.BinaryCrossentropy(from_logits=False), 
    optimizer="adam", 
    metrics=['accuracy']
)

# Use the original test set data to evaluate the accuracy of the model
loss, accuracy = export_model.evaluate(raw_test_ds)
print(accuracy)

output

782/782 [==============================] - 4s 5ms/step - loss: 0.3104 - accuracy: 0.8734
0.8733999729156494

Use the new model to predict the sentiment of new examples:

examples = [
  "The movie was great!",
  "The movie was okay.",
  "The movie was terrible..."
]

export_model.predict(examples)

output

array([[0.634246  ],
       [0.45762002],
       [0.37179616]], dtype=float32)
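
These outputs are sigmoid probabilities of the positive class. A simple way to turn them into class names (a small sketch using the class_names loaded earlier and a 0.5 threshold):

probs = export_model.predict(examples)
for text, p in zip(examples, probs):
    label = raw_train_ds.class_names[int(p[0] > 0.5)]
    print(f"{text!r} -> {label} ({p[0]:.3f})")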

Putting the text-processing layer (standardization, tokenization, vectorization) inside the exported model lets the model consume raw text directly and reduces the risk of training/serving skew (processing text differently at training time and at inference time).

However, keeping the text-processing layer outside the model makes better use of asynchronous CPU processing and data buffering when training on a GPU. Therefore, we usually keep text processing outside the model while developing it, and move it inside the model when deploying.
