A lightweight convolutional neural network for image classification in autopilot systems

Autopilot has become a popular technology in recent years and is used by many car brands. People are becoming increasingly dependent on it. However, before the technology matures, many shortcomings still need to be addressed. This essay focuses on the object-detection weaknesses that exist in current autopilot systems. It first introduces a lightweight CNN-based method to improve the object detection and identification module. The results section presents the training results, and a new prediction function is implemented. Accuracy-improving methods, including data augmentation and dropout, are also introduced, and their effects are demonstrated. After training, the autopilot system can distinguish pedestrians from vehicles. Although real road conditions are far more complex than just vehicles and pedestrians, for convenience this essay implements only these two datasets. Possible directions for further improvement are also discussed.


Introduction
In recent years, autopilot has been a popular research topic in many areas, especially for cars. According to Quan et al. [1], there are six levels of vehicle autonomy: Level 0 (No Driving Automation), Level 1 (Driver Assistance), Level 2 (Partial Driving Automation), Level 3 (Conditional Driving Automation), Level 4 (High Driving Automation), and Level 5 (Full Driving Automation). The details introduced by Quan et al. are displayed in Table 1.

Table 1. Levels of vehicle autonomy (adapted from Quan et al. [1]).

Level | Name | Definition
0 | No automation | Full-time performance by the human driver of all aspects of the dynamic driving task, even when enhanced by warning or intervention systems.
1 | Driver assistance | Driving mode-specific execution by a driver assistance system of either steering or acceleration/deceleration.
2 | Partial automation | Driving mode-specific execution by one or more driver assistance systems of both steering and acceleration/deceleration.
3 | Conditional automation | Driving mode-specific performance by an automated driving system of all aspects of the dynamic driving task, with the expectation that the human driver will respond appropriately to a request to intervene.
4 | High automation | Driving mode-specific performance by an automated driving system of all aspects of the dynamic driving task, even if a human driver does not respond appropriately to a request to intervene.
5 | Full automation | Full-time performance by an automated driving system of all aspects of the dynamic driving task.
Each level except Level 0 requires detection results from the combination of several modules, especially lidar. After analyzing the detected road situation, the autopilot system decides on the next driving action. However, emergencies will inevitably happen on the road that the autopilot system may not know how to handle. According to S. Ingle and M. Phute [2], autopilot systems have so far been unable to distinguish pedestrians from other objects. As a result, this may cause injuries that could be avoided when the car is driven by an experienced driver. For instance, if two objects suddenly appear in the car's path and it is too late to brake, which one will the system decide to hit? Assume one of the objects is a pedestrian and the other is a vehicle. Following the intuition of most drivers, a human would choose to hit the vehicle instead of the pedestrian to avoid injuries. Unfortunately, because the autopilot system cannot distinguish a pedestrian from other objects, it may steer the car into whichever object it considers least avoidable. A human driver could have avoided this tragedy. Improving this situation is the main focus of this essay.
To improve this situation, a lightweight CNN-based image classification model is introduced. Lightweight CNN-based image classification is a method that helps the machine identify the objects it is trained on. The simplest example is cat-and-dog detection: given datasets of cats and dogs containing thousands of images, the machine learns the common characteristics of each class. After training, it can make predictions on new input images. With this method, a model can be trained on several datasets containing objects that may appear on the road. After training, the model can classify newly detected objects, so autopilot cars will not only detect objects but also identify what they are and make decisions based on that identification.
For simplicity, only the pedestrian and vehicle datasets provided by N. J. Karthika et al. [3] and B. Dincer [4] are used in this essay. After the datasets are introduced, the construction of the convolutional neural network is described, including the usage and principles of the activation and pooling functions. A first training result is then shown and discussed. Because of overfitting, this first result is not as good as expected, so several improvement methods, including data augmentation and dropout, are introduced and implemented. Finally, a new-data prediction function is introduced; it is used for checking results and for further integration into the autopilot system.

Dataset
During the training of the model, the dataset is made up of three individual parts, each containing two classes. As shown in Figure 1, the complete dataset is divided into three parts: training, testing, and validation. Each of the three parts is then divided into the two classes, pedestrian and vehicle.

Data generation
After packaging the images into the dataset, the data must be processed and fed into the model for training. Because the model cannot read images directly, the image data must be converted into a numerical format the machine can read. A function from tensorflow.keras named ImageDataGenerator is used for this conversion. ImageDataGenerator(rescale=x) converts image data into a machine-readable format; it also "controls the process of distributing unique batches to each node" (J. Gregory Pauloski, para. 2). After ImageDataGenerator, the data is processed by another tensorflow.keras function named flow_from_directory, which receives the data prepared by ImageDataGenerator and feeds it to the model. The output of this function summarizes the total number of images it has received. The result below is the output for the model described in this essay: the upper line is the total number of images in the training set and the lower line is the total number in the validation set. According to the result, 1803 images were found belonging to the training dataset and 1005 images belonging to the validation dataset.
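The effect of the rescale step can be illustrated without TensorFlow. The following sketch, using a toy single-pixel image as a stand-in, shows what ImageDataGenerator(rescale=1./255) does to each pixel value:

```python
import numpy as np

# A toy stand-in for one RGB image: 8-bit pixel values in [0, 255].
image = np.array([[[0, 128, 255]]], dtype=np.uint8)

# ImageDataGenerator(rescale=1./255) multiplies every pixel by 1/255,
# mapping the values into [0.0, 1.0] so the network trains on small inputs.
rescaled = image.astype(np.float32) * (1.0 / 255.0)

print(rescaled)  # values now lie in [0.0, 1.0]
```

Normalizing inputs to [0, 1] keeps the early-layer activations in a well-behaved range, which generally makes gradient-based training more stable.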

Data augmentation
According to Shorten and Khoshgoftaar [6], "data augmentation is a powerful tool when dealing with overfitting. The augmented data will represent a more comprehensive set of possible data points, thus minimizing the distance between the training and validation set, as well as any future testing sets." In this essay, this preprocessing method transforms a single image into multiple variants that together contain more information, and it is applied after the data generation step to avoid overfitting. For example, take an image from the pedestrian dataset, as shown in Figure 2. As the graphs show, the data augmentation process turns a single image from the dataset into several variants, reshaping it into different sizes and orientations. As a result, each image in the dataset provides more data for the model to train on. Beyond data augmentation, another method is introduced in the next section.
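A minimal sketch of the idea, using a toy 2x2 array in place of a real image, shows how simple geometric transforms of the kind ImageDataGenerator can apply produce new training samples from one original:

```python
import numpy as np

# Toy 2x2 single-channel "image" standing in for a dataset sample.
image = np.array([[1, 2],
                  [3, 4]])

# Two simple augmentations of the kind used in data augmentation:
flipped = image[:, ::-1]   # horizontal flip
rotated = np.rot90(image)  # 90-degree counterclockwise rotation

# Each augmented copy is a new training sample derived from the original,
# so the model sees more variation without collecting more images.
print(flipped)
print(rotated)
```

In practice the same transforms (plus shifts, zooms, and shears) are applied randomly on the fly during training, so every epoch sees slightly different versions of each image.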

CNN construction
During the construction of the CNN, ReLU is used as the activation function instead of softmax or sigmoid. The reason is that ReLU sets the output of some neurons to 0, which makes the network sparse, reduces the interdependence of parameters, and alleviates overfitting.
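The sparsity effect described above can be seen directly from the definition of ReLU, sketched here in NumPy:

```python
import numpy as np

def relu(x):
    """ReLU activation: negative inputs become 0, positive inputs pass through."""
    return np.maximum(0, x)

pre_activations = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
outputs = relu(pre_activations)

# The zeros produced for negative inputs are what make the network sparse.
print(outputs)  # [0.  0.  0.  1.5 3. ]
```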
A function named Conv2D is used together with ReLU. Its purpose is to convolve the input data through several layers; after several applications, the input is reduced to a relatively small size, so the parameter count of the feature maps is reduced as well. As a simple example, if the input size is (N, C_in, H, W) and the output size is (N, C_out, H_out, W_out), then

H_out = floor((H + 2P - K) / S + 1),  W_out = floor((W + 2P - K) / S + 1),

where K is the kernel size, P the padding, and S the stride. To conclude, Conv2D groups the input data into smaller feature maps, reducing the number of parameters involved in the process. After this reduction, a pooling function named max_pooling2d is introduced. As stated by B. Graham [7], "Max-pooling is a procedure that takes an N_in x N_in input matrix and returns a smaller output matrix, say N_out x N_out. This is achieved by dividing the N_in x N_in square into N_out^2 pooling regions (P_i,j)." The max_pooling2d function takes four parameters: the pool size, strides, padding, and data format. The output of the function is the shape of the feature map. As shown in Table 2, the output shape is (None, 74, 74, 32), where the elements represent the batch size, width, height, and channel, in that order. The techniques above form the feature-extraction stages of the CNN introduced in this model. With the alternating use of the two functions, the input data is reduced to a relatively small size for the model to train on.
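Max pooling itself is simple enough to sketch in a few lines of NumPy. The function below implements the 2x2, stride-2 case described by B. Graham [7]: each non-overlapping 2x2 region of the feature map is replaced by its maximum, halving the spatial size:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2: keep the largest value in each
    non-overlapping 2x2 region, halving the spatial size."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# A 4x4 feature map is reduced to 2x2, keeping the strongest activations.
feature_map = np.array([[1, 3, 2, 1],
                        [4, 6, 5, 2],
                        [7, 2, 9, 8],
                        [1, 5, 3, 4]])

pooled = max_pool_2x2(feature_map)
print(pooled)  # [[6 5]
               #  [7 9]]
```

Discarding all but the strongest activation in each region both shrinks the feature maps and gives the network a small amount of translation invariance.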

Drop-out layer.
The dropout method is an advanced technique used when training neural networks; its main motivation is to avoid the co-adaptation of feature detectors, i.e. overfitting. As stated by P. Baldi and P. Sadowski [8], feature detectors (also known as neural nodes) are deleted with a certain probability q = 1 - p, and the remaining weights are trained by backpropagation.
The formula referenced above represents dropout in linear networks, where i denotes the unit, h the layer, w the weights, and I the input vector.
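The mechanics can be sketched in NumPy. The version below is "inverted" dropout, the variant commonly used in practice (the surviving activations are rescaled by 1/p so the expected activation is unchanged, and nothing needs to change at test time):

```python
import numpy as np

def dropout(activations, p_keep, rng):
    """Inverted dropout: zero each unit with probability q = 1 - p_keep and
    scale the survivors by 1/p_keep so the expected activation is unchanged."""
    mask = rng.random(activations.shape) < p_keep
    return activations * mask / p_keep

rng = np.random.default_rng(0)
layer_output = np.ones(10_000)

dropped = dropout(layer_output, p_keep=0.8, rng=rng)

# Roughly 20% of units are zeroed; the mean stays near 1.0 because of scaling.
print((dropped == 0).mean(), dropped.mean())
```

Because a different random subset of units is deleted on every forward pass, no feature detector can rely on the presence of any particular other detector, which is exactly the co-adaptation that dropout is designed to break.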

Model training
The model is trained after the data has been generated from the dataset and the CNN has been built. The results of the training process are shown in the following part of the essay.

Result
At the very beginning of the CNN construction procedure, methods that help avoid overfitting were not implemented in the model, and the resulting training curves are shown in Figure 5. As can be seen, the result is not as good as expected: when training reached the 7th epoch, the validation accuracy began to oscillate. This is caused by overfitting. According to Ying [9], overfitting can be attributed to three causes: "1) noise learning on the training set; 2) hypothesis complexity; 3) multiple comparisons procedures which are ubiquitous in induction algorithms." In this essay, the overfitting most likely arises from the first cause: when the training fits the images too closely, the result is more easily affected by noise in the images. The following paragraphs compare this result with those obtained using methods that help avoid overfitting.

Effectiveness of data augmentation and drop-out layer
This section discusses the effect that data augmentation and the dropout layer have on the training result. After implementing the two methods, the model was trained again, and the result is shown below. As shown, the training process completed successfully. A graphical result was then produced with the 'plot' function in Python and is shown in Figure 6. As can be seen, the validation accuracy with data augmentation and dropout turns out to be much better.

New data prediction
This function is used to predict new input data after the model is trained. It can check the accuracy of the trained model and, as mentioned earlier, can be further used in the implementation of the autopilot system.
There are two classes in total: pedestrian and vehicle. In this model, pedestrians are labelled 0 and vehicles are labelled 1.
For testing purposes, a thousand pedestrian images were input into the model. After prediction, 976 images were identified as pedestrians and 24 as vehicles. The prediction accuracy is therefore approximately 97.6%, which is similar to the validation accuracy at the end of training and is quite satisfactory.
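The accuracy calculation above can be sketched as follows. The scores here are invented for illustration (the essay's 976/1000 result corresponds to the same computation at larger scale): a binary classifier with a sigmoid output produces a score per image, the score is thresholded at 0.5 to pick a class, and accuracy is the fraction of predictions matching the true labels.

```python
import numpy as np

# Hypothetical sigmoid outputs for 10 pedestrian images (true label 0).
# Scores below 0.5 are classified as pedestrian (0), otherwise vehicle (1).
scores = np.array([0.02, 0.10, 0.30, 0.85, 0.07, 0.45, 0.12, 0.60, 0.01, 0.20])
predictions = (scores >= 0.5).astype(int)

true_labels = np.zeros(10, dtype=int)
accuracy = (predictions == true_labels).mean()

print(predictions)  # [0 0 0 1 0 0 0 1 0 0]
print(accuracy)     # 0.8
```

With 976 of 1000 pedestrian images below the threshold, the same computation yields the 97.6% figure reported above.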

Discussion
According to the results shown above, the model has been trained to a relatively satisfactory accuracy. With the two datasets implemented in this essay, the image identification module can work as the eyes of the autopilot car. After implementing this module, based on the lightweight deep-learning CNN model trained on the pedestrian and vehicle datasets, the car is able to distinguish pedestrians from vehicles. When a pedestrian and a vehicle are both in the path of the autopilot car, the system can make a decision and avoid unnecessary injuries. According to Jefferson and McDonald [10], there were approximately 7.3 million crashes in the United States. Although the appearance of autopilot cars has reduced the number of accidents to some degree, new challenges have arisen in the autopilot field. This essay attempts to address one of the main problems that exist in autopilot systems. However, although the system can now recognize the difference between a pedestrian and a vehicle, this is still far from enough, because road situations are complex and change all the time. When unexpected conditions arise and the system fails to make the right decision, injuries and property loss might occur that could be avoided when a human is in control of the car. The autopilot system is invented not only for the convenience of the driver but also as a guarantee of the safety of drivers and pedestrians. According to Mingtsung et al. [11], current law has not yet decided who should be responsible for accidents involving an autopilot car, so improving the safety level of autopilot cars should be the highest priority in this industry.

Conclusion
This essay introduced a lightweight deep-learning CNN-based image identification module for autopilot cars. The trained model allows the autopilot system to distinguish pedestrians from vehicles. With the implementation of the datasets and the construction of the lightweight convolutional neural network, the model was trained from scratch. Moreover, when overfitting appeared after the first round of training, two methods were introduced: data augmentation and dropout. After the improved training process completed, the model was able to predict new data with an accuracy of over 95%. This kind of improvement reduces injuries in accidents involving autopilot cars to some degree. However, it is still not enough. Road conditions involve far more than just vehicles and pedestrians, and other objects may affect the decision of the autopilot system. Therefore, more data should be fed into the model when a more precise model is trained, so that the autopilot system can accommodate complex road situations. Moreover, after new data is added, advanced algorithms should be developed that assign a priority order to the objects detected and analyzed. For example, pedestrians should be given the highest priority, so that the autopilot car hits them with the least probability. In conclusion, the results of this essay provide a platform for future work, but there is still a long way to go in the autopilot area.

Figure 2. Example of an original image. This is the original image; after the data augmentation process it becomes Figure 3.

Figure 4. Demonstration of the dropout operation. P. Baldi and P. Sadowski [8] also introduced a formula indicating the principle of the dropout method, which is shown above.

Figure 5. Training and validation accuracy over epochs.

Figure 6. Result of training and validation using data augmentation.

Table 2. Illustration of the max-pooling operation.