Regression Prediction of Vehicle Efficiency Using Fully Connected Neural Networks


Keywords: fully connected neural network, TensorFlow, regression

Overview

This article uses a fully connected neural network to solve a regression problem: predicting a car's fuel-efficiency index, MPG (miles per gallon).

python packages include: os, pandas, tensorflow, sklearn, matplotlib

Data loading

The dataset is Auto MPG, which records performance indicators for a range of cars.

Except for Origin, all fields are numeric. Origin is a categorical field: 1 = America, 2 = Europe, 3 = Japan.

  • MPG: miles per gallon (efficiency index, y)
  • Cylinders: Number of cylinders
  • Displacement: displacement
  • Horsepower: horsepower
  • Weight: weight
  • Acceleration: acceleration
  • Model Year: Model Year
  • Origin: Origin
import os
import pandas as pd

# Pass the path parts separately so the path also works outside Windows
data_path = os.path.join(os.getcwd(), 'data', 'auto-mpg.data')
column_names = [
    'MPG','Cylinders','Displacement','Horsepower','Weight',
    'Acceleration', 'Model Year', 'Origin'
]
raw_data = pd.read_csv(
    data_path, names=column_names,
    na_values='?', comment='\t',
    sep=' ', skipinitialspace=True
)

After loading, it is good practice to inspect the data and save a backup copy.

# Work on a copy named df so that raw_data stays untouched
df = raw_data.copy()
df.info()
df.head()
df.describe()
df.tail()
df.to_pickle(r'D:\data\auto-mpg.pkl')

Data processing

Inspection shows that the data contains missing values and that one field, Origin, is categorical. In addition, because the parameters are optimized by gradient descent, the features should preferably be normalized.

Missing value handling

df.isna().sum()

Here the rows with missing values are simply dropped.

df = df.dropna()

Feature processing

# Handle the categorical Origin column, whose values 1, 2 and 3 stand for USA, Europe and Japan respectively
# Pop (remove and return) the Origin column first
origin = df.pop('Origin')
# Add three new indicator columns derived from the Origin column
df.loc[:, 'USA'] = (origin == 1)*1.0
df.loc[:, 'Europe'] = (origin == 2)*1.0
df.loc[:, 'Japan'] = (origin == 3)*1.0
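
The same one-hot encoding can also be produced with pandas.get_dummies. The snippet below is only a minimal sketch on a made-up Origin column (the four values are illustrative, not taken from the dataset), and the name mapping is an assumption based on the category codes described earlier.

```python
import pandas as pd

# Toy stand-in for the Origin column (1 = USA, 2 = Europe, 3 = Japan).
origin = pd.Series([1, 3, 2, 1], name='Origin')

# Map the numeric codes to readable names, then one-hot encode them.
origin_names = origin.map({1: 'USA', 2: 'Europe', 3: 'Japan'})
one_hot = pd.get_dummies(origin_names, dtype=float)

print(one_hot)
```

Each row contains exactly one 1.0, in the column of its origin, which is the same result the three manual assignments produce.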

Data partitioning and normalization

Standardization is implemented here directly from its definition: subtract the mean and divide by the standard deviation. The sklearn.preprocessing.StandardScaler class offers a ready-made alternative.

The data is directly divided into training and test sets in a ratio of 8:2.

from sklearn.model_selection import train_test_split

# Divide the training set and test set in a ratio of 8:2
x_columns = df.columns.to_list()
x_columns.remove('MPG')
y_columns = ['MPG']
x_sample = df[x_columns]
y_sample = df[y_columns]
train_dataset, test_dataset, train_labels, test_labels = train_test_split(x_sample, y_sample, test_size=0.2, random_state=0)

The data is then standardized.

# Compute the mean and standard deviation of each field in the training set; they are used to standardize the data
train_stats = train_dataset.describe()
train_stats = train_stats.transpose()

# Standardize with the training-set statistics
def norm(x):
    return (x - train_stats['mean']) / train_stats['std']
    
normed_train_data = norm(train_dataset)
normed_test_data = norm(test_dataset)
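
For comparison, here is a minimal sketch of the StandardScaler route on made-up arrays (the numbers are illustrative only). One difference worth noting: StandardScaler divides by the population standard deviation (ddof=0), whereas the std reported by pandas describe() is the sample standard deviation (ddof=1), so the two approaches give slightly different scaled values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the training and test feature matrices.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
test = np.array([[2.0, 30.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # fit on the training set only
test_scaled = scaler.transform(test)        # reuse the training statistics

# Manual equivalent: ddof=0 matches StandardScaler.
manual = (train - train.mean(axis=0)) / train.std(axis=0, ddof=0)
print(np.allclose(train_scaled, manual))
```

As in the manual version, the statistics come from the training set only and are then reused for the test set.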

Print out the training set and test set size:

# print the size of the training set and test set
print(f'The size of the training set is:{normed_train_data.shape}, Number of labels:{train_labels.shape}')
print(f'The size of the test set is:{normed_test_data.shape}, Number of labels:{test_labels.shape}')

Data observation

Statistics and plots of the features give a deeper understanding of the data, which helps with model selection and with deciding on further data processing.

Feature names:

normed_train_data.columns.to_list()

Scatter plot between features to observe the linear relationship between features:

from pandas.plotting import scatter_matrix

attributes = [
    'Cylinders',
    'Displacement',
    'Horsepower',
    'Weight',
    'Acceleration'
]
scatter_matrix(normed_train_data[attributes], figsize=(12, 8))

Figure 1 Linear relationship between features

Observe the linear relationship between the feature and the target.

df.plot.scatter(
    x='Cylinders',
    y='MPG'
)

Figure 2 Linear relationship between features and targets
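
A scatter plot can be complemented by a number. The sketch below computes the Pearson correlation coefficient with NumPy on a few made-up (Cylinders, MPG) pairs; the values are hypothetical and only illustrate the call, they are not results from the dataset.

```python
import numpy as np

# Hypothetical (Cylinders, MPG) pairs, for illustration only.
cylinders = np.array([4, 4, 6, 8, 8], dtype=float)
mpg = np.array([32.0, 28.0, 20.0, 15.0, 13.0])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between the two variables.
r = np.corrcoef(cylinders, mpg)[0, 1]
print(f'Pearson r = {r:.3f}')
```

A value close to -1 or 1 indicates a strong linear relationship, while a value near 0 indicates a weak one.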

Create a network model

Because the dataset is relatively small, a fully connected network with only 3 layers is used for the MPG prediction task.

There are 9 input features, so the input layer has 9 nodes. The two hidden layers have 64 output nodes each. Since a single value is predicted, the output layer has 1 node. Because the output is a numerical prediction, the output layer can either use no activation function or, since MPG is never negative, use the ReLU activation function.

TensorFlow offers several ways to build a network; two of them are shown below.
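
The size of such a network follows directly from the layer widths: a dense layer with n_in inputs and n_out outputs has n_in * n_out weights plus n_out biases. A small sketch of that arithmetic (the helper name dense_params is made up for illustration):

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

# Input width followed by the width of each dense layer.
layer_sizes = [9, 64, 64, 1]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)

print(counts, total)  # [640, 4160, 65] 4865
```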

Way 1

Network construction

from tensorflow.keras import layers
from tensorflow import keras
import tensorflow as tf


class Network(keras.Model):
    # regression network
    def __init__(self):
        super(Network, self).__init__()
        # Create 3 fully connected layers
        self.fc1 = layers.Dense(64, activation='relu')
        self.fc2 = layers.Dense(64, activation='relu')
        self.fc3 = layers.Dense(1)
        
    def call(self, inputs, training=None, mask=None):
        # Pass through 3 fully connected layers in sequence
        x = self.fc1(inputs)
        x = self.fc2(x)
        x = self.fc3(x)
        return x

Instantiate the model

model = Network()
# build() creates the internal tensors; 4 is an arbitrary batch size and 9 is the number of input features
model.build(input_shape=(4, 9))
# print network information
model.summary()

# Create an optimizer, specifying a learning rate
optimizer = keras.optimizers.RMSprop(0.001)

Output of model.summary():

Model: "network_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_3 (Dense)              multiple                  640       
_________________________________________________________________
dense_4 (Dense)              multiple                  4160      
_________________________________________________________________
dense_5 (Dense)              multiple                  65        
=================================================================
Total params: 4,865
Trainable params: 4,865
Non-trainable params: 0

Build the Dataset object:

# data
# Build the Dataset object
train_db = tf.data.Dataset.from_tensor_slices((
    normed_train_data.values,
    train_labels.values
))
# Shuffle, then split into batches of 32
train_db = train_db.shuffle(100).batch(32)

Train the model

loss_log = list()  # [global step, batch loss] pairs
i = 0  # global step counter across all epochs

for epoch in range(100):
    for step, (x, y) in enumerate(train_db):
        # Gradient recorder
        with tf.GradientTape() as tape:
            out = model(x)
            loss = tf.reduce_mean(tf.losses.MSE(y, out))
#             mae_loss = tf.reduce_mean(tf.losses.MAE(y, out))
        i += 1
        loss_log.append([i, float(loss)])
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
    

Viewing the metrics

For regression problems, the evaluation indicators generally include MSE (mean square error), RMSE (root mean square error), and MAE (mean absolute error). Here we choose MSE for evaluation.
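
To make the three definitions concrete, here is a minimal pure-Python sketch; the function name and the sample values are made up for illustration.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE and MAE for two equal-length sequences."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {'mse': mse, 'rmse': math.sqrt(mse), 'mae': mae}

# Hypothetical true and predicted MPG values.
metrics = regression_metrics([20.0, 30.0, 25.0], [22.0, 27.0, 25.0])
print(metrics)
```

RMSE and MAE are in the same unit as the target (miles per gallon here), while MSE is in squared units, which makes it harder to read directly.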

Note that the column named epoch is really the global step counter i from the loop above: it increases by one for every batch, not once per epoch.

loss_log_df = pd.DataFrame(loss_log, columns=['epoch', 'MSE'])
epoch = loss_log_df['epoch']
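
If a per-epoch curve is preferred, the step-level log can be grouped by epoch. The sketch below is one possible approach and uses assumed values: a training set of 313 rows, a batch size of 32, and a synthetic loss log. None of these numbers are results from the article.

```python
import math

# Assumed sizes, for illustration only.
n_train, batch_size = 313, 32
steps_per_epoch = math.ceil(n_train / batch_size)

# Synthetic [global step, loss] log covering three epochs.
loss_log = [[i, 100.0 / i] for i in range(1, 3 * steps_per_epoch + 1)]

# Group the batch losses by the epoch each step belongs to.
epoch_losses = {}
for step, loss in loss_log:
    epoch_idx = (step - 1) // steps_per_epoch
    epoch_losses.setdefault(epoch_idx, []).append(loss)

# Average the batch losses within each epoch.
epoch_mean = {e: sum(v) / len(v) for e, v in epoch_losses.items()}
print(epoch_mean)
```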

The MSE column is the training loss of each batch:

mse = loss_log_df['MSE']

MSE on the test set:

# Test set results
# normed_test_data, test_labels
out = model(normed_test_data.values)
test_mse = tf.reduce_mean(tf.losses.MSE(test_labels.values, out))

Way 2

Model building

# Sequential is accessed through keras, which was imported in Way 1
network = keras.Sequential([
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)
])
network.build(input_shape=(None, 9))
network.summary()
network.compile(
    optimizer=keras.optimizers.RMSprop(0.001),
    loss='mse',
    metrics=['mse', 'mae']
)
EPOCH = 200
history = network.fit(
    normed_train_data.values, 
    train_labels.values, 
    epochs=EPOCH,
    validation_split=0.2
)

Viewing the metrics

The indicator used here is MSE.

The final test-set result is a single value:

out = network.predict(normed_test_data.values)
test_mse = tf.reduce_mean(tf.losses.MSE(test_labels.values, out))

The intermediate training results are stored in history.history:

history.history.keys()
dict_keys(['loss', 'mse', 'mae', 'val_loss', 'val_mse', 'val_mae'])

The training error, and the validation set error are as follows:

mse = history.history['mse']
val_mse = history.history['val_mse']
epoch = range(EPOCH)

Plotting

Way 1 - Results

View Results

loss_log_df.plot.line(
    x='epoch',
    y='MSE'
)

As shown in Figure 3, the final training set MSE is 2.66, and the test set result is 6.33.

Figure 3 MSE of each batch of data

Way 2 - Results

import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['KaiTi']
plt.rcParams['axes.unicode_minus'] = False # Solve the problem that the minus sign '-' is displayed as a square when saving an image

plt.plot(epoch, mse, 'r', label='train mse')
plt.plot(epoch, val_mse, 'b', label='validate mse')
plt.plot(EPOCH, float(test_mse), 'go', label='test mse')

plt.title('training set and validation set MSE,final test set MSE')
plt.xlabel('epoch')
plt.ylabel('MSE')
plt.legend()

As shown in Figure 4, the final training set MSE is 2.82, the validation set MSE is 7.50, and the test set MSE is 7.60.

Figure 4 MSE of training set and validation set, final MSE of test set

from sklearn.metrics import r2_score

# r2_score expects the arguments in the order (y_true, y_pred)
out_r2 = r2_score(test_labels.values, out)
print(out_r2)

MSE on its own is hard to interpret without a good feel for the scale of the data. The R2 score is more intuitive: the closer it is to 1, the better the model. The R2 score of this model on the test set is 0.88.
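
For reference, the R2 score is defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. The sketch below implements that definition in plain Python on made-up values. It also shows that the score is not symmetric, which is why the argument order (true values first, predictions second) matters in sklearn's r2_score.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical true and predicted values.
y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 20.0, 25.0]

score = r2(y_true, y_pred)    # correct argument order
swapped = r2(y_pred, y_true)  # swapped order gives a different number
print(score, swapped)
```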

Conclusion

A fully connected neural network was used to predict the car efficiency index MPG as a regression task. The modeling process covered data loading, data processing, data observation, and model creation, where model creation includes model selection, training, and evaluation.

The problem is clearly a regression problem, the business context is clear, and the model was chosen in advance. The plots make the training process visible, and evaluation metrics such as MSE and the R2 score summarize the result. The final R2 score is 0.88.

A fully connected neural network is the most basic neural network model. Neural networks are generally regarded as end-to-end models that need no manual feature processing. That holds mainly for images and audio, where the network extracts features automatically. For other problems, however, some feature processing can speed up training and reduce model complexity.

References

1. TensorFlow Deep Learning, https://github.com/dragen1860/Deep-Learning-with-TensorFlow-book, 2019

Tags: Machine Learning Deep Learning

Posted by kevinc on Sun, 22 May 2022 19:00:03 +0300