Traffic Surveillance Using Computer Vision and Deep Learning Methods

From the early 21st century to the present day, traffic surveillance has been one of the most challenging problems, and over the years many solutions have been proposed to tackle it. However, flaws can still be noticed in the currently existing systems, and some of them will be addressed in this work. In our project, Traffic Surveillance using Computer Vision and Deep Learning, we propose a smart vision-based system which accurately provides live traffic statistics such as the speed of the traffic and the count of vehicles that passed on a road within a time interval. The system takes a video feed as input, which could be live footage provided by a CCTV camera at a traffic junction, and on querying start and end timestamps, it outputs the count of vehicles and the speed of the traffic between the queried timestamps. We divide this problem into two major sub-problems: first detecting a vehicle, then tracking the detected vehicle. Although many algorithms exist for detecting objects as well as for tracking them, one of the major challenges is integrating both, as ambiguities such as re-detecting an already detected vehicle may occur while tracking. In our work, we emphasize handling such ambiguities using deep learning, and we also provide a comparative performance analysis of our system against existing ones.


Introduction
In a large city, or any area where many vehicles pass each day, from school buses to private cars, heavy congestion builds up on highways. One solution to avoid traffic congestion is the Internet of Vehicles (IoV) [1], but the computation behind this design has not been fully worked out yet; in the near future, 5G technologies may make it possible to counter real-time problems with an IoV design. With traffic increasing each day, however, we cannot wait that long to relieve congestion on highways in this fast-paced life. With the help of a deep learning model and computer vision, we design a real-time object detector. The system takes video footage as input, which could be live footage provided by a CCTV camera at a traffic junction, and the statistics it produces include the count of vehicles and the average speed [2]. We can then estimate the count of vehicles in a city over an interval of time; suppose there is heavy congestion in the morning and evening, the authority can restrict the number of vehicles during those intervals and the remaining vehicles can take another route. In this way, the road authority can handle traffic congestion efficiently [3].

Need of Traffic Surveillance
Traffic surveillance is used to conduct a survey of the traffic, by which we can know the number of vehicles and the speed of vehicles in an area at a particular time [4]. We can pass this information to the government so that it can take the necessary actions to reduce the traffic.

Convolution Layer
It is the first layer of the Convolutional Neural Network. In this layer we take an image of dimension H x W x D as input for classification [5]. On this input matrix, we apply a filter (or kernel) of dimension Fh x Fw x D (here D is the same as the depth of the input matrix) and produce an output of dimension (H - Fh + 1) x (W - Fw + 1) x 1.
For example, suppose we are given an image and we must detect the edges in it. To illustrate this, consider a 6 x 6 x 1 monochromatic image. We then convolve this 6 x 6 x 1 image with a 3 x 3 x 1 edge detection filter [6].
After the convolution, we will get an output image of dimension (6 - 3 + 1) x (6 - 3 + 1) x 1, i.e. 4 x 4 x 1. To obtain the first value of the output, the filter is multiplied element-wise with the top-left corner of the image matrix and the products are summed. Similarly, we shift (stride) the filter along the height and width of the entire image, using an appropriate stride value, to calculate all the values [7]. Consider another example: here, higher pixel values represent the brighter portion and lower pixel values represent the darker area of the picture, and in this way we can detect edges in the picture [8]. There are various types of filters available with which we can detect different features in an image by convolving the image with them [9].
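The valid convolution described above can be sketched in a few lines of NumPy. The vertical-edge filter and the half-bright, half-dark test image below are illustrative choices, not taken from the paper:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid (no-padding) 2-D convolution: output is (H-Fh+1) x (W-Fw+1)."""
    h, w = image.shape
    fh, fw = kernel.shape
    out = np.zeros((h - fh + 1, w - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # element-wise product of the filter and the image patch, then sum
            out[i, j] = np.sum(image[i:i + fh, j:j + fw] * kernel)
    return out

# 6 x 6 monochromatic image: bright left half, dark right half
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)

# 3 x 3 vertical edge detection filter
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

edges = conv2d(image, kernel)
print(edges.shape)  # (4, 4) = (6-3+1) x (6-3+1)
print(edges)        # non-zero only where the bright/dark edge lies
```

The output is zero over the uniform regions and large along the brightness transition, which is exactly the edge response discussed in the text.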

Padding
We have seen that applying convolution to an image shrinks it, and that the pixels at the edges of the picture take part in the convolution operation only a few times compared to the pixels at the centre, which loses information near the borders [10]. To prevent this, we pad the input image with an extra border of pixels all around. This means that instead of 6 x 6 x 1 we will have an input matrix of 8 x 8 x 1, and convolving it with a 3 x 3 x 1 filter will result in a 6 x 6 x 1 output, preserving the original dimensions of the image [11].
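The dimension arithmetic for "same" padding can be checked with a small sketch (the all-ones image is just a placeholder):

```python
import numpy as np

image = np.ones((6, 6))        # 6 x 6 input image
f = 3                          # 3 x 3 filter
p = (f - 1) // 2               # p = 1 gives "same" padding for an odd filter

padded = np.pad(image, p)      # zero border of width p all around -> 8 x 8
out_size = padded.shape[0] - f + 1
print(padded.shape, out_size)  # (8, 8) 6 -> the output stays 6 x 6
```

With p = (f - 1) / 2 the output dimension (H + 2p - f + 1) equals H, so nothing is lost at the borders.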

Convolution Over Volume
Suppose we have a 3-D image of 6 x 6 x 3 (here 3 refers to the RGB channels). On applying a 3 x 3 x 3 filter to the image, the output is 4 x 4 x 1. The first value in the output matrix is the sum of the element-wise products of the 27 values of the image patch and the 27 values of the filter, 9 values from each channel of the input and the filter [13]. We can use multiple filters as well to detect multiple features at a time; say, for example, we want to detect vertical edges with the first filter and horizontal edges with the second. Using multiple filters changes the output dimension [12].
Figure 5 Detecting multiple features at a time
More general dimensions can be given as follows: nc indicates the number of channels in the filter's input and nc' denotes the number of filters. Now, combining all the concepts, we have our convolution network layer.
Figure 6 Convolution network layer
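The convolution over volume with multiple filters can be sketched as follows; the random image and random filters stand in for real RGB data and learned edge filters:

```python
import numpy as np

def conv_volume(image, filters):
    """Convolve an H x W x C image with a list of Fh x Fw x C filters;
    each filter produces one 2-D channel of the output."""
    h, w, c = image.shape
    fh, fw, fc = filters[0].shape
    assert fc == c, "filter depth must match image depth"
    out = np.zeros((h - fh + 1, w - fw + 1, len(filters)))
    for k, f in enumerate(filters):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                # 27 element-wise products for a 3x3x3 filter: 9 per channel
                out[i, j, k] = np.sum(image[i:i + fh, j:j + fw, :] * f)
    return out

rng = np.random.default_rng(0)
image = rng.random((6, 6, 3))                        # 6 x 6 RGB image
filters = [rng.random((3, 3, 3)) for _ in range(2)]  # e.g. vertical + horizontal edge filters
out = conv_volume(image, filters)
print(out.shape)  # (4, 4, 2): one output channel per filter
```

This matches the dimensions in the text: two 3 x 3 x 3 filters applied to a 6 x 6 x 3 image give a 4 x 4 x 2 output, i.e. nc' = 2 channels.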

Pooling Layers
Pooling layers are used to reduce the size of the image and speed up the computation. Consider an input matrix: applying max pooling with a 2 x 2 window to this matrix results in a 2 x 2 output. We can also use average pooling, which takes the average of the numbers instead of the maximum.
After applying a pooling layer to our convolutional neural network, the network will look something like this.
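Both pooling variants mentioned above can be sketched with a small helper (the 4 x 4 input values are illustrative):

```python
import numpy as np

def pool(x, size=2, mode="max"):
    """Non-overlapping pooling: the stride equals the window size."""
    h, w = x.shape
    out = np.zeros((h // size, w // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 2],
              [7, 8, 1, 0],
              [2, 9, 3, 4]], dtype=float)

print(pool(x, mode="max"))      # [[6. 5.]
                                #  [9. 4.]]
print(pool(x, mode="average"))  # mean of each 2 x 2 window
```

A 4 x 4 input shrinks to 2 x 2, halving each spatial dimension while keeping the strongest (or average) activation per window.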

Object detection using YOLO
To tackle the problems faced by sliding window detection, we use a Deep Learning based algorithm, You Only Look Once (YOLO) [14]. After referring to several research papers and analysing the trade-off between accuracy and the computation involved, we concluded that YOLO is the best algorithm for this use case [15]. The algorithm proceeds in three steps:

1. Resize image: for any given image, we resize its dimensions to match the input dimensions of the trained convolutional network.

2. Run the convolutional network: this deep neural network takes the image as input and outputs bounding boxes for the objects detected in it.

3. Non-max suppression: this is used to avoid an object being detected multiple times.
TensorFlow object detection is efficient with pre-trained models, but YOLO is notably fast at object detection. The Darknet framework is used to train YOLO. This algorithm overcomes the limitations of previous detection algorithms [16]. It divides an image into grid cells and assigns an object to a cell if the centre of the object falls within that cell; in this way each object is assigned to exactly one cell, which is then represented by a rectangular bounding box.
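Step 3 above, non-max suppression, can be sketched as follows. The boxes, scores, and the 0.5 IoU threshold are illustrative values, not parameters from the paper:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

# two overlapping detections of the same vehicle plus one separate vehicle
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

The two heavily overlapping boxes collapse into the single higher-scoring one, so each vehicle is reported exactly once.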

Anchor Boxes
We can have multiple objects falling in the same grid cell. Multiple anchor boxes of different shapes can be used to resolve this problem [17], each anchor box of a different shape being likely to detect an object of a matching shape. For every single frame in our video, the first step is to compute the centroids of the detections. To assign a unique ID to each object, we must check whether a new object centroid is related to an old object centroid (the yellow and purple dots in the figure).
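The centroid matching step described above can be sketched as a nearest-neighbour assignment. The function, the 50-pixel matching threshold, and the policy of dropping unmatched tracks immediately are our own simplifying assumptions, not the original system's:

```python
import math

def update_tracks(tracks, new_centroids, max_dist=50):
    """tracks: dict of id -> (x, y). Match each new centroid to the nearest
    existing track within max_dist; unmatched centroids start new tracks."""
    next_id = max(tracks, default=-1) + 1
    unmatched = dict(tracks)
    updated = {}
    for cx, cy in new_centroids:
        best_id, best_d = None, max_dist
        for tid, (tx, ty) in unmatched.items():
            d = math.hypot(cx - tx, cy - ty)
            if d < best_d:
                best_id, best_d = tid, d
        if best_id is not None:
            updated[best_id] = (cx, cy)   # existing vehicle moved
            del unmatched[best_id]
        else:
            updated[next_id] = (cx, cy)   # new vehicle enters the frame
            next_id += 1
    return updated  # tracks left unmatched are dropped in this sketch

tracks = {0: (100, 100), 1: (200, 50)}
tracks = update_tracks(tracks, [(105, 102), (300, 300)])
print(tracks)  # {0: (105, 102), 2: (300, 300)}
```

The centroid near track 0 keeps its ID (the vehicle moved slightly), while the far-away centroid is treated as a new vehicle; a production tracker would also keep unmatched tracks alive for a few frames before deleting them.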

Conclusion
A system has been developed to count the number of vehicles and calculate the average speed of traffic, given live video footage as input. The system effectively combines vehicle detection with tracking of the detected vehicles. The classification accuracy is 95% and the detection accuracy is 85%.
The system can also detect drivers and pedestrians on the road, which can be used to extend this project to various other applications.

Figure 2 Monochromatic image

Figure 3 Convolving the image

Figure 9 Implementation of the YOLO Algorithm

Figure 10 Anchor boxes

Figure 13 Update (x, y)-coordinates of existing objects