On-board drone classification with Deep Learning and System-on-Chip implementation

In recent years, the increasing use of drones has raised significant safety concerns and made them a serious threat to security. To address these concerns, Counter-UAS Systems (CUS) are capturing the interest of both research and industry. Consequently, the development of effective drone detection technologies has become a critical research focus. The proposed work explores the application of edge computing to drone classification. It tunes a Deep Learning model, You Only Look Once (YOLO), and implements it on Field Programmable Gate Array (FPGA) technology. FPGAs are considered advantageous over conventional processors since they enable parallelism and can be used to create high-speed, low-power, and low-latency circuit designs, thus satisfying the stringent Size, Weight and Power (SWaP) requirements of a drone-based implementation. In detail, two different YOLO neural networks, YOLO v3 and v8, are trained and evaluated on a large dataset constructed from images of drones at various distances. The two models are then implemented on a System-on-Chip (SoC). In order to demonstrate the feasibility of on-board AI image processing on a drone, the evaluation assesses the classification accuracy and the computational performance, such as latency.


Introduction
The drone market has grown significantly in recent years, particularly among consumers. Due to their relatively low price, drones intended for this market are widely available. From the perspective of research, the use of these flying platforms promotes the creation of technologies that have beneficial social effects, such as those used in search and rescue operations, intelligent logistics, environmental monitoring, and precision agriculture. The widespread use of drones also raises safety and security issues, such as malfunctions, misuse, or deliberate criminal use. Indeed, a significant increase has been observed in the number of accidents involving drones [1], [2]. For this reason, the development of detection and tracking systems for the security of critical buildings, such as power plants, or critical infrastructure, such as airports, has become of primary importance. To address these concerns, Counter-UAS Systems (CUS) are capturing the interest of both research and industry [3]-[5]. A Counter-UAS system must be able to detect one or more drones within the airspace of interest, track their movements, extrapolate their emerging behavior, and finally neutralize them, assuring certain levels of security. Although suppression of individual drones has been addressed (at least partially), countering a hostile swarm of drones represents an emerging challenge. Very recently, research has been addressing the adoption of CUS based on drones themselves [6]-[7]. These present some advantages, such as the proximal sensing enabled by the extreme mobility of drones or the chance to increase the coverage of a protected area [8]. For these purposes, the drones of the CUS must have the on-board capability of identifying an attacking drone team. This capability has two main benefits: on the one hand, it improves the autonomy of the drone swarm, which does not need to rely on the communications infrastructure. On the other hand, there is also a communications efficiency improvement, because the network does not exchange raw data but high-level semantic information [9], [10]. Drone classification powered by Artificial Intelligence (AI) can be considered an enabling technology for such a system. Bringing this technology on board is still challenging [11].
The present study discusses the use of Field Programmable Gate Array (FPGA) based System-on-Chip (SoC) devices for implementing neural network algorithms, highlighting their hardware acceleration capabilities thanks to dedicated processing units. To demonstrate the feasibility of on-board processing on a drone, the evaluation assesses the classification accuracy and the computational performance, such as the latency of the inference task.
The rest of this article is organized as follows. We first describe the background on Deep Learning and Edge Computing in Section 2. The dataset gathering and the AI model training are described in Section 3. The hardware deployment flow is presented in Section 4. The performance evaluation is described in Section 5. Concluding remarks are given in Section 6.

Background
Object detection is a computer vision task that involves identifying and locating objects within an image or a video stream. It is a fundamental problem in computer vision with numerous practical applications, such as autonomous driving, surveillance, medical image analysis, and more. One possible approach to solve this problem is through Artificial Intelligence data-driven algorithms such as Neural Networks. In particular, Deep Neural Networks (DNNs, also called Deep Learning models for the great number of layers in the network) have been recognized as the state of the art for image classification and object detection since 2012, when, at the ILSVRC contest, a DNN surpassed the accuracy of classical Computer Vision methods [12]. Among DNNs, Convolutional Neural Networks (CNNs) can be considered a game changer thanks to the introduction of the Convolutional Layer, since it helps reduce the training complexity and the computational cost with respect to a Fully Connected network. The convolution operation is performed using the network kernel to find similarities. Finally, feature extraction is performed using the resulting feature map [13]. Different types of convolutional neural networks are available, such as R-CNN (Region-based CNN) [14] and Faster R-CNN [15]. In these networks, candidate regions are first defined using Region Proposal Networks (RPNs). Then, convolutional filters are applied to these regions, and the extracted features are obtained as the result of the convolution operation. In alternative deep learning approaches, like SSD (Single Shot MultiBox Detector) [16] and YOLO (You Only Look Once) [17], the image is analyzed comprehensively, leading to improved accuracy and speed in object recognition when compared to conventional techniques. The enhanced speed of these methods can be attributed to their simpler architecture in contrast to region-based methods. YOLO, for instance, is a CNN-based method designed for both object detection and recognition. It predicts the coordinates of bounding boxes and the probabilities of object classes, considering the entire image.
The performance of an object detector can be evaluated considering three main factors: i) latency, which refers to the time required to process a single frame; ii) accuracy, which measures the quality of the system's output given the input; and iii) power consumption. These factors represent a commonly recognized trade-off, where improving one often comes at the expense of the others.
Edge AI is the intersection between edge computing and Artificial Intelligence: it consists of deploying AI computation directly on the edge devices of a network, instead of relying on cloud-based processing [9]-[11]. This approach presents some advantages that are particularly important for a drone-based CUS application. Reducing the time data spends traveling from the drones to the computational node (in the cloud or in the ground control station) enables drones to make critical decisions locally and quickly, enhancing their autonomy level. Furthermore, drones exchange high-level, semantic data instead of raw data, resulting in a more bandwidth-efficient use of the communication channel. Drones, in particular small commercial ones, have limited on-board power availability. As a consequence, the adoption of an edge computing approach is still challenging. To overcome this limitation, it is crucial to be aware of the energy efficiency of i) the AI algorithms and ii) their hardware/software implementations or architecture. For instance, single-stage object detectors are more efficient than region-proposal-based ones. Implementation performance, in turn, depends on the processing hardware/software architecture. Besides the classical CPU/GPU implementation, Field Programmable Gate Array (FPGA) based System-on-Chip (SoC) devices have recently become a highly attractive platform for implementing CNNs in real time, as FPGAs are generally more energy efficient [18], [19]. Thanks to the programmable logic inside FPGA devices, it is possible to design a custom hardware accelerator tailored to the task of a CNN. In particular, the design of a Single Computation Unit (SCU) that implements deep neural networks provides acceleration for the computationally intensive algorithms [20]. FPGAs also enable the integration of Deep Learning tasks with signal processing for communications tasks [21]. This benefit comes at the expense of a heavier FPGA design effort compared to the programming effort needed to implement algorithms on CPU/GPU platforms [22].
Nowadays, FPGA vendors already provide optimized SCU cores. For instance, Xilinx provides the Deep Learning Processor Unit (DPU), an accelerator designed for deep learning inference on Xilinx FPGAs. The DPU is optimized for CNNs and provides a high-performance solution for running deep learning models on edge devices with low power consumption [23].
This work studies the implementation of the YOLO v3 and v8 algorithms on a demo board and provides a comparison of their performance when implemented on chip. The research work: (i) first, sets up the training data so that it is effective for drone detection in the different drone attitudes encountered during a mission; (ii) second, assesses the performance on a conventional hardware setting, establishing the baseline for the comparison; (iii) third, implements YOLO v3 and YOLO v8 on the Xilinx SoC XCZU9EG featured on the ZCU102 demo board [24] and assesses the performance.

Object detectors training
Two energy-efficient deep learning models have been considered for FPGA implementation: YOLO v3 [25] and YOLO v8 [26]. The objective is to compare the accuracy and latency of two different architectures of the most popular object detector in the context of the drone detection task.
The YOLO v3 architecture consists of three main parts: the backbone, the neck, and the head. The backbone, based on the Darknet-53 architecture [25], is a series of convolutional layers that extract features from the input image. The neck combines features from different scales to improve object detection. The head is a set of convolutional layers that predict the location and class of each object in the image.
The YOLO v8 architecture consists of two main parts: the backbone and the head. The backbone is a modified version of the Darknet-53 architecture in which a new kind of layer has been introduced: the C2f (Coarse to fine) module, which combines high-level features with contextual information to improve detection accuracy. The head consists of multiple convolutional layers responsible for predicting bounding boxes, objectness scores, and class probabilities for the objects detected in an image [26].

Dataset construction
The first step in constructing an object detection dataset is gathering images and videos that are relevant to the target task. The quality and diversity of the collected data significantly impact the model's performance. Different amateur drones have been considered to ensure enough data diversity. Note that the detection task considers only two classes: "drone" and "no drone". To kickstart our dataset, two publicly available ones have been considered: Amateur Unmanned Aerial Vehicle Detection [27] and Competition for Unmanned Aerial Vehicle [28]. In the first dataset (about 3000 images), about 22% of the images show drones at short distance, 42% at medium distance, and 36% at large distance. The second dataset (about 1300 images) contains only drones at short distance. A data augmentation process has been performed to increase the volume of the dataset and thus improve its generality and diversity. The starting datasets are randomly processed by applying image processing techniques such as:

• Cropping, rotation, flipping, and resizing
• Color adjustments, such as brightness, contrast, and saturation variations
• Adding noise and blur
• Affine transforms
As a result, a new dataset of about 10000 images has been generated, more than doubling the volume of the starting datasets.
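As an illustration only (the paper does not name its augmentation tooling), such a pipeline could be expressed with the albumentations library; the parameter values below are hypothetical, and the boxes are assumed to be in the YOLO format (normalized center x, center y, width, height):

```python
# Hedged sketch of the augmentation pipeline; the library choice and parameter
# values are assumptions, not the authors' actual configuration.
import albumentations as A

augment = A.Compose(
    [
        A.HorizontalFlip(p=0.5),                         # flipping
        A.Rotate(limit=15, p=0.5),                       # rotation
        A.RandomBrightnessContrast(p=0.5),               # brightness/contrast
        A.HueSaturationValue(p=0.3),                     # saturation variations
        A.GaussNoise(p=0.3),                             # added noise
        A.Blur(blur_limit=3, p=0.3),                     # blur
        A.Affine(scale=(0.8, 1.2), translate_percent=0.1, p=0.5),  # affine transforms
        A.Resize(320, 320),                              # match the training input size
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Usage: augmented = augment(image=img, bboxes=boxes, class_labels=labels)
```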

Training
The YOLO models have been trained on the novel dataset following the usual 90/10 rule: a subset of about 9000 images has been used for the training set and about 1000 images for the test set. The two networks are set up as in the following table (Table 1):

                     YOLO v3       YOLO v8
Framework            TensorFlow    PyTorch
Input image size     320x320       320x320
Batch size           16            64
Epochs               500           500

The loss curves during the training steps are shown in Figure 2.
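For the YOLO v8 side, a training run with the parameters of Table 1 can be reproduced with the Ultralytics API, as in the following minimal sketch (the dataset YAML file and model variant are hypothetical placeholders, since the paper does not state which v8 variant was used):

```python
from ultralytics import YOLO

# Hedged sketch: "yolov8n.pt" and "drones.yaml" are placeholders.
model = YOLO("yolov8n.pt")                                # pretrained v8 weights
model.train(data="drones.yaml", imgsz=320, batch=64, epochs=500)
metrics = model.val()                                     # mAP on the held-out split
```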

Accuracy Evaluation
To properly evaluate the performance of the networks and the results of the training steps for the binary classification task, the mean of the Average Precisions (mAP) over the Intersection over Union (IoU) thresholds has been considered. mAP is a comprehensive metric that combines precision and recall values over multiple confidence thresholds; it provides a more complete picture of a model's ability to detect objects or retrieve relevant information compared to a single-point precision or recall value [29]. A comparison of the results is given in Figure 3. The v8 model reaches a mAP score of 0.72, while the v3 model has a mAP score of 0.58 for the training set considered. As expected, the v8 model performs better than the v3 one, thanks to the effective novel layers introduced in this version. Considering an IoU threshold of 50%, the two models score 0.98 and 0.97, so when the localization accuracy requirement is relaxed, the performances are quite similar.
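For reference, the IoU underlying the mAP metric is the ratio between the overlap and the union of a predicted box and a ground-truth box. A minimal sketch (corner-coordinate convention assumed):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

mAP@[.50:.95] then averages the Average Precision over IoU thresholds from 0.50 to 0.95, while mAP@50 (the 0.98 / 0.97 scores above) fixes the threshold at 0.50.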

Hardware Deployment
After the AI model development, the deployment flow for the SoC can take place. Thanks to the Xilinx Vitis AI framework, it is possible to convert a model into a set of instructions for the DPU accelerator [30].
The deployment flow is represented in Figure 4. The first step is the network optimization, which consists in pruning the network, i.e., some nodes of the network are removed while accepting a tolerable degradation of the performance. The reduction of the number of network coefficients also implies a faster inference time during execution.
The second step is the conversion of the 32-bit floating-point arithmetic into an 8-bit fixed-point arithmetic. This conversion is mandatory to be compliant with the arithmetic unit inside the DPU, which operates on 8-bit fixed-point values. Here too, of course, there is a small degradation of the detection accuracy, due to the quantization error on the network coefficients.
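As a rough illustration of this step (the actual Vitis AI quantizer is more elaborate, e.g. it also calibrates activations on sample data), a symmetric power-of-two int8 weight quantization can be sketched as follows:

```python
import numpy as np

def quantize_int8_pow2(w):
    """Toy power-of-two quantization to 8-bit fixed point.
    Picks the number of fractional bits fl so that the largest weight
    fits in the int8 range, then rounds to multiples of 2**-fl."""
    max_abs = max(np.abs(w).max(), 1e-8)
    fl = 7 - int(np.ceil(np.log2(max_abs)))        # fractional bit width
    q = np.clip(np.round(w * 2.0 ** fl), -128, 127).astype(np.int8)
    return q, fl

w = np.random.randn(64).astype(np.float32) * 0.2   # example float32 weights
q, fl = quantize_int8_pow2(w)
w_hat = q.astype(np.float32) / 2.0 ** fl           # dequantized values
# (w - w_hat) is the quantization error mentioned above.
```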
The final step involves the compilation of the optimized and quantized network for the DPU. Not all layers are supported by the Vitis compiler, so it can happen that some operations are not executed on the DPU but are performed by the CPU. The goal is to keep all the intermediate computations on the DPU, to avoid data exchanges between the DPU and the CPU that slow down the execution of the model during the inference task.
The output of the Vitis framework is an application that leverages the hardware acceleration provided by the DPU of the Zynq Multi-Processor System-on-Chip (MPSoC).
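On the processing system side, such an application typically drives the DPU through the Vitis AI Runtime (VART) Python bindings. The sketch below is a hedged outline only: the .xmodel file name and the preprocessing are placeholders, the expected input dtype depends on the model's quantization settings, and the subgraph partitioning is model-dependent.

```python
import numpy as np
import vart
import xir

graph = xir.Graph.deserialize("yolo_drone.xmodel")        # hypothetical file name
# The DPU-executable part of the compiled network is one of the subgraphs.
dpu_sg = [s for s in graph.get_root_subgraph().toposort_child_subgraph()
          if s.has_attr("device") and s.get_attr("device").upper() == "DPU"][0]
runner = vart.Runner.create_runner(dpu_sg, "run")

in_dims = tuple(runner.get_input_tensors()[0].dims)
out_dims = tuple(runner.get_output_tensors()[0].dims)
inp = np.zeros(in_dims, dtype=np.float32)                 # preprocessed frame goes here
out = np.zeros(out_dims, dtype=np.float32)

job_id = runner.execute_async([inp], [out])               # run on the DPU
runner.wait(job_id)
# 'out' holds the raw predictions; box decoding and NMS then run on the CPU.
```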

Performance Evaluation
The performances before and after deployment are compared to measure the overall degradation of the models caused by network pruning and quantization. Figure 5 compares, at different input image sizes, the mean Average Precision scores of the YOLO models before and after compiling the model for the SoC. The quantization and pruning of the neural network have the advantage of fitting the hardware resources better, increasing the execution speed, and limiting the power consumption. On the other hand, there is a degradation of the accuracy of the detection task. In this case study, the average degradation on the test set is 8%.
Figure 6 represents the latency performance, in terms of frames per second (FPS) versus input image size, that the two detectors achieve on the hardware. YOLO v8 scores worse than the v3 model because not all of its processing is done on the DPU: some functions were not supported by the compiler at the time of the evaluation. Hence, part of the processing is performed by the DPU and another part by the CPU; this means that a large amount of data is exchanged on the internal bus, which slows down the inference time for that implementation.
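FPS figures of this kind are usually obtained with a plain timing loop around the inference call; a generic sketch follows (the run_inference callable is a placeholder for the full DPU/CPU pipeline):

```python
import time

def measure_fps(run_inference, frames, warmup=10):
    """Average end-to-end frames per second over a list of test frames."""
    for f in frames[:warmup]:          # warm-up runs, excluded from timing
        run_inference(f)
    start = time.perf_counter()
    for f in frames:
        run_inference(f)
    return len(frames) / (time.perf_counter() - start)
```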

Conclusions
Multi-agent systems that rely on drones as the edge nodes of a network must face challenges such as improving the autonomy level and increasing the effectiveness of communications. The application of Artificial Intelligence on the edge devices has been proposed to reduce the raw data exchange and to increase the on-board awareness of the operational context. The use of dedicated hardware accelerators on programmable logic, such as FPGA-based SoCs, allows AI algorithms to be implemented efficiently on power-limited devices. In this work, two different models of the YOLO algorithm are considered to assess the performance, in terms of latency and accuracy, of the detection task. A dataset is constructed starting from publicly available ones. Data augmentation operations have been applied to increase the diversity and generality of the novel dataset. The models are trained on the same dataset, with the latest YOLO model showing a superior accuracy score. This is due to the effectiveness of the new architecture compared to the previous one. The model deployment on real edge computing hardware shows a degradation of the scores for both, due to the compression of the neural network and to the precision loss of the floating-to-fixed-point conversion. The latency factor has also been compared, and the results lead to a different conclusion than the accuracy: YOLO v3 reaches peak performance at the lowest input image size, scoring better than YOLO v8. This is essentially due to the poorer hardware fit of the more complex v8 architecture compared to the v3 one. In conclusion, this work shows the feasibility of an FPGA-based hardware implementation of YOLO drone detectors, highlighting the trade-off between accuracy and latency. As future steps, from an application perspective, a deeper mission analysis is needed to check the performance with respect to the timing and accuracy requirements; from an implementation perspective, the DPU design and the set of operations that are not well integrated should be explored further.

Figure 1. Images of drones at (a) short, (b) medium and (c) large distances.

Figure 2. Trends of the loss curve during the training phase for (a) YOLO v3 and (b) YOLO v8.

Figure 3. Comparison between the YOLO v3 and YOLO v8 mean Average Precision: YOLO v8 in orange and YOLO v3 in blue.

Figure 4. Deployment flow for neural networks on Xilinx FPGA.

Figure 5. mAP comparison for (a) YOLO v3 and (b) YOLO v8 at different input image sizes.

Figure 6. Latency comparison in terms of FPS versus image size for (a) YOLO v3 and (b) YOLO v8.

Table 1. Main setup parameters for the training step.