Asymptotic feature pyramid based YOLOv5s for bird detection

The detection of birds of all kinds has become increasingly important in the fields of ecological balance and biological protection. To tackle the problems of low accuracy, high omission rate, and low detection confidence when applying artificial intelligence and deep learning to bird detection, this paper proposes a bird detection method that builds on the YOLOv5s model and incorporates the asymptotic feature pyramid network (AFPN) module. Compared with the conventional feature pyramid network (FPN), AFPN offers a more efficient solution, characterized by reduced computation time and memory consumption, and it minimizes conflicts that may arise during feature fusion. To further enhance the model's performance and efficiency, this work also evaluates several distinct bounding box regression loss functions, training and analyzing them on a standardized dataset to identify an optimal choice. Through rigorous model construction, training, and repeated testing, notable improvements have been achieved in precision and other relevant indicators, meeting the necessary application standards. The refined method shows great potential for fulfilling diverse bird detection requirements.


Introduction
The natural world harbors a wide variety of avian species, and bird detection is of pivotal significance in fields such as wildlife monitoring, agricultural protection, and ecological balance. However, traditional approaches to bird detection rely heavily on the human eye and manual effort, which require individuals with extensive experience and exceptional skills [1]. In this era of technological and mechanized advancement, such methods are inefficient and unstable. With the development of computer technology, computer vision and neural networks have found increasing application in fields such as fire safety, logistics, and transportation. The accuracy and efficiency of object recognition achieved with these technologies far exceed human capabilities. Despite this, few researchers have applied machine recognition to the identification of avian species and behaviors.
Existing biological recognition methods mostly rely on deep learning, YOLOv3, or convolutional neural networks [2] [3]. Some researchers have used modified loss functions to improve detection models and built web-based detection systems with Flask (Python) [4]. Others have explored iterative CAM networks, bilinear networks, or the fusion of YOLOv5s with attention mechanisms to accomplish and improve on the target tasks. However, many bird detection and recognition methods still face challenges such as slow processing speed, high model complexity, and significant power and computational resource consumption.
To address these issues, this paper puts forward a bird recognition method based on the YOLOv5s model, integrated with the asymptotic feature pyramid network (AFPN) and alternative bounding box regression loss functions. These additions aim to further enhance the performance and efficiency of the model.

Dataset
The dataset used in this experiment is named 'Bird Detection Dataset.' It includes the single category label 'bird' and was extracted from the VOCtrainval2012 dataset. It contains 811 bird images, with 600 images used for training and 211 images for testing (Figure 1).

YOLOv5s model
In this experiment, we chose YOLOv5s as our object detection model. YOLOv5s is an end-to-end, single-stage detector that achieves fast and accurate object detection through a lightweight network architecture and efficient inference strategies [5]. Its strong performance and real-time capability make it well suited to bird detection, enabling better results in detecting bird species.
The YOLOv5s network mainly consists of a backbone, a feature pyramid network (FPN), and a prediction head. In this study, we abandoned the original FPN of YOLOv5s and replaced it with a progressive feature network, AFPN, which performs multi-level incremental feature extraction. This change enables the extraction of multi-scale features while largely preserving the original feature information.

Backbone.
YOLOv5s uses CSPDarknet as its backbone to extract rich features from input images. CSPNet addresses the problem of redundant gradient information in the optimization of other large convolutional backbones. By integrating gradient changes into the feature map from start to finish, it effectively reduces the model's parameter count and FLOPs (floating-point operations). This results in improved inference speed and accuracy, and a smaller model size.

Asymptotic feature pyramid network.
During network construction, we abandoned YOLOv5s' built-in top-down FPN structure that propagates strong features, together with the bottom-up feature pyramid containing two PAN (Path Aggregation Network) structures. Instead, we adopted a progressive feature pyramid network, AFPN. The reason for this change is that FPN uses a top-down approach to transfer high-level features to low-level features for fusion at different levels; in this process, however, the high-level features are never truly fused with the low-level features [7].
The architecture of YOLOv5s with AFPN proposed in this paper is shown in Figure 2. During the bottom-up feature extraction of the backbone network, AFPN progressively integrates low-level, high-level, and top-level features. Specifically, AFPN first fuses low-level features, then deeper-level features, and finally the topmost, most abstract features. In the original feature fusion network, the semantic gap between non-adjacent levels is larger than that between adjacent levels, especially between the bottom and top levels, which directly leads to suboptimal fusion of non-adjacent-level features. Directly fusing features from c2, c3, c4, and c5 is therefore not reasonable. Thanks to AFPN's progressive architecture, the semantic information of different levels approaches each other during the progressive fusion, mitigating this issue. For example, fusing c2 and c3 reduces their semantic gap; since c3 and c4 are adjacent levels, the semantic gap between c2 and c4 is reduced as well. In subsequent stages, AFPN fuses higher-level features, and in the final stage it adds the top-level features to the fusion process. In Figure 2, black arrows represent convolutions and blue arrows represent adaptive spatial fusion.
To align dimensions and prepare for feature fusion, we used 1x1 convolutions followed by bilinear interpolation to upsample features. For downsampling, we used convolutions with matching kernel sizes and strides: 2x2 convolutions with a stride of 2 for 2x downsampling, 4x4 convolutions with a stride of 4 for 4x downsampling, and 8x8 convolutions with a stride of 8 for 8x downsampling. After feature fusion, we continued to learn features with four residual units similar to those in ResNet; each residual unit consists of two 3x3 convolutions.
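As an illustration, the resampling components described above can be sketched in PyTorch as follows. The module names and channel counts are our own simplifications, and normalization/activation details are omitted; this is not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Upsample(nn.Module):
    """1x1 convolution to align channels, then bilinear interpolation."""
    def __init__(self, in_ch, out_ch, scale):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.scale = scale

    def forward(self, x):
        return F.interpolate(self.conv(x), scale_factor=self.scale,
                             mode="bilinear", align_corners=False)


class Downsample(nn.Module):
    """Strided convolution with kernel size equal to the stride:
    2x2/s2 halves H and W; 4x4/s4 and 8x8/s8 downsample by 4x and 8x."""
    def __init__(self, in_ch, out_ch, stride):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=stride, stride=stride)

    def forward(self, x):
        return self.conv(x)


class ResidualUnit(nn.Module):
    """Two 3x3 convolutions with an identity shortcut, as in ResNet."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)

    def forward(self, x):
        return x + self.conv2(F.relu(self.conv1(x)))
```

Because the kernel size equals the stride, each strided convolution partitions the feature map into non-overlapping patches, which is what makes a single layer perform 2x, 4x, or 8x downsampling.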

Prediction head.
The prediction head of YOLOv5s consists of a series of convolutional layers and maps the output of the feature pyramid network to the object detection results. It outputs the bounding box positions, class confidences, and other relevant information for each detected object. YOLOv5s predicts objects of different scales and aspect ratios by using anchor boxes of different sizes together with these convolutional layers.
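As a minimal sketch of how an anchor-based head turns raw outputs into boxes, the standard YOLOv5 decoding formulas can be written as follows. The function name and scalar interface are illustrative (the real head operates on whole tensors), but the sigmoid-based offset formulas are YOLOv5's published decoding.

```python
import math


def decode_yolov5(tx, ty, tw, th, grid_x, grid_y, anchor_w, anchor_h, stride):
    """Decode one raw YOLOv5 prediction into a box center and size in pixels.

    YOLOv5 decoding:
      bx = (2*sigmoid(tx) - 0.5 + grid_x) * stride
      by = (2*sigmoid(ty) - 0.5 + grid_y) * stride
      bw = (2*sigmoid(tw))**2 * anchor_w
      bh = (2*sigmoid(th))**2 * anchor_h
    """
    s = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = (2 * s(tx) - 0.5 + grid_x) * stride
    by = (2 * s(ty) - 0.5 + grid_y) * stride
    bw = (2 * s(tw)) ** 2 * anchor_w   # size is a scaled anchor, never negative
    bh = (2 * s(th)) ** 2 * anchor_h
    return bx, by, bw, bh
```

Bounding the offsets with a sigmoid keeps each predicted center near its grid cell, and scaling the anchor by a squared sigmoid keeps sizes positive and within a bounded multiple of the anchor.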

Loss function
In the experiment, we attempted to replace YOLOv5s' original CIOU loss function with more advanced loss functions to improve the model's performance. These replacement loss functions include SIOU (SCYLLA-IoU), WIOU (Wise-IoU), and EIOU (Efficient-IoU).

EIOU.
EIOU is a modification of the CIOU loss function that splits the aspect ratio influence factor between the predicted box and the ground truth box [8]. Building on the penalty term of CIOU, it computes the width and height differences of the predicted box and the ground truth box independently, addressing issues present in CIOU (Complete-IoU) [9].
EIoU comprises three components: IoU loss, distance loss, and aspect ratio loss (covering the overlapping area, the center point distance, and the aspect ratio). The aspect ratio loss reduces the disparities in width and height between the predicted and ground truth bounding boxes, leading to faster convergence and better localization. The EIOU loss can be written as:

L_EIoU = 1 − IoU + ρ²(b, b^gt)/c² + ρ²(w, w^gt)/C_w² + ρ²(h, h^gt)/C_h²

where C_w and C_h are the width and height of the minimum bounding rectangle covering the predicted box and the ground truth box, c is the length of its diagonal, and ρ(·, ·) is the Euclidean distance between two points.
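The EIoU loss described above can be sketched for a single pair of boxes as follows. This is a simplified scalar version for illustration; real training code operates on tensors and must guard against zero-area enclosing boxes.

```python
def eiou_loss(box1, box2):
    """EIoU loss between two axis-aligned boxes in (x1, y1, x2, y2) format:
    L_EIoU = 1 - IoU + center_dist_term + width_term + height_term."""
    x1, y1, x2, y2 = box1
    gx1, gy1, gx2, gy2 = box2
    w1, h1 = x2 - x1, y2 - y1
    w2, h2 = gx2 - gx1, gy2 - gy1

    # Intersection and union -> IoU
    iw = max(0.0, min(x2, gx2) - max(x1, gx1))
    ih = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = iw * ih
    iou = inter / (w1 * h1 + w2 * h2 - inter)

    # Smallest enclosing box: width C_w, height C_h, squared diagonal c^2
    cw = max(x2, gx2) - min(x1, gx1)
    ch = max(y2, gy2) - min(y1, gy1)
    c2 = cw ** 2 + ch ** 2

    # Squared center distance, normalized by the enclosing diagonal
    dx = (x1 + x2) / 2 - (gx1 + gx2) / 2
    dy = (y1 + y2) / 2 - (gy1 + gy2) / 2
    dist = (dx ** 2 + dy ** 2) / c2

    # Separate width and height penalties (the EIoU change over CIoU)
    asp = (w1 - w2) ** 2 / cw ** 2 + (h1 - h2) ** 2 / ch ** 2
    return 1 - iou + dist + asp
```

For identical boxes all three penalty terms vanish and the loss is zero; any misalignment in position, width, or height adds a separate, directly interpretable penalty.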

WIOU.
The traditional Intersection over Union (IoU) only considers the overlap between the predicted box and the ground truth box, without taking into account the area in between, which may bias the evaluation [10]. Building on this idea, Wise-IoU (WIoU) is an IoU-based loss with a dynamic non-monotonic focusing mechanism.
The final formula of WIOU is derived in two steps, yielding WIOUv1 and WIOUv3 (WIOUv2 is another version that we did not use in this experiment). WIOUv1 is obtained by constructing distance attention based on a distance metric, which introduces a two-layer attention mechanism:

L_WIoUv1 = R_WIoU · L_IoU, with R_WIoU = exp(((x − x^gt)² + (y − y^gt)²)/(W_g² + H_g²))

where (x, y) and (x^gt, y^gt) are the centers of the predicted and ground truth boxes and W_g and H_g are the width and height of their minimum enclosing box. WIOUv3 then uses the outlier degree β to build a non-monotonic focusing coefficient r = β/(δ·α^(β−δ)), with hyperparameters α and δ, which is applied to WIOUv1:

L_WIoUv3 = r · L_WIoUv1

Through the intelligent gradient gain allocation strategy of this dynamic non-monotonic focusing mechanism, WIOUv3 demonstrates remarkable performance improvements. In this experiment, we adopted the final version, WIOUv3.
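A scalar sketch of WIOUv1 is shown below, for illustration only. The v3 focusing coefficient additionally requires a running mean of the IoU loss over the batch (to compute the outlier degree β), which is omitted here; in the original formulation the attention term is also detached from the gradient.

```python
import math


def wiou_v1(box1, box2):
    """WIoU v1 for boxes in (x1, y1, x2, y2) format:
    L_WIoUv1 = R_WIoU * (1 - IoU), where
    R_WIoU = exp(center_dist^2 / (Wg^2 + Hg^2)) amplifies the loss
    of boxes whose centers are far from the target."""
    x1, y1, x2, y2 = box1
    gx1, gy1, gx2, gy2 = box2

    # IoU
    iw = max(0.0, min(x2, gx2) - max(x1, gx1))
    ih = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = iw * ih
    union = (x2 - x1) * (y2 - y1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / union

    # Minimum enclosing box (Wg, Hg) and squared center distance
    wg = max(x2, gx2) - min(x1, gx1)
    hg = max(y2, gy2) - min(y1, gy1)
    dx = (x1 + x2) / 2 - (gx1 + gx2) / 2
    dy = (y1 + y2) / 2 - (gy1 + gy2) / 2

    r_wiou = math.exp((dx * dx + dy * dy) / (wg * wg + hg * hg))
    return r_wiou * (1 - iou)
```

Since R_WIoU ≥ 1 and equals 1 only when the centers coincide, the attention term strictly enlarges the IoU loss for misaligned boxes without changing its zero point.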

SIOU.
The SIOU loss function redefines the penalty metric by considering the angle between the vector of the expected regression and the coordinate axes. This allows the predicted bounding box to quickly move to the nearest axis, after which it only needs to regress one coordinate (X or Y) [11], effectively decreasing the total degrees of freedom. The SIOU loss function comprises the three components described below.
Angle cost
The angle cost measures the minimum angle between the line connecting the center points and the x or y axis. When the center points are aligned with the x-axis or y-axis, Λ equals 0; when the connecting line is at 45 degrees to the x-axis, Λ becomes 1. This penalty drives the anchor box toward the nearest axis of the target box, reducing the overall degrees of freedom. The formula is as follows:

Λ = 1 − 2·sin²(arcsin(c_h/σ) − π/4)

where σ is the distance between the two center points and c_h is their vertical offset.

Distance cost
The distance cost is influenced by the separation between center points and is directly tied to the angle cost. As the angle of the line connecting the centers of the two boxes approaches 0, the contribution of the distance cost diminishes; as the angle approaches π/4, its significance increases. The formula is as follows:

Δ = Σ_{t=x,y} (1 − e^(−γ·ρ_t)), where ρ_x = ((b^gt_cx − b_cx)/c_w)², ρ_y = ((b^gt_cy − b_cy)/c_h)², and γ = 2 − Λ

Here c_w and c_h are the width and height of the minimum enclosing box.

Shape cost
The shape cost is defined as:

Ω = Σ_{t=w,h} (1 − e^(−ω_t))^θ, where ω_w = |w − w^gt|/max(w, w^gt) and ω_h = |h − h^gt|/max(h, h^gt)

The value of θ plays a crucial role in this equation, as it governs the emphasis placed on the shape cost. When θ is set to 1, shape optimization becomes the primary focus, potentially restricting the free movement of the shape.
Finally, we define the SIOU loss function as:

L_SIoU = 1 − IoU + (Δ + Ω)/2
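Putting the three costs together, a scalar sketch of the SIOU loss might look like this. It is illustrative only: the hyperparameter theta = 4 is an assumption (a commonly used default), and tensor batching and numerical safeguards are omitted.

```python
import math


def siou_loss(box1, box2, theta=4):
    """SIoU loss between boxes in (x1, y1, x2, y2) format:
    L_SIoU = 1 - IoU + (Delta + Omega) / 2."""
    x1, y1, x2, y2 = box1
    gx1, gy1, gx2, gy2 = box2
    w1, h1 = x2 - x1, y2 - y1
    w2, h2 = gx2 - gx1, gy2 - gy1

    # IoU
    iw = max(0.0, min(x2, gx2) - max(x1, gx1))
    ih = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = iw * ih
    iou = inter / (w1 * h1 + w2 * h2 - inter)

    # Enclosing box and center offsets
    cw = max(x2, gx2) - min(x1, gx1)
    ch = max(y2, gy2) - min(y1, gy1)
    dx = (gx1 + gx2) / 2 - (x1 + x2) / 2
    dy = (gy1 + gy2) / 2 - (y1 + y2) / 2

    # Angle cost: Lambda = 1 - 2*sin^2(arcsin(sin_alpha) - pi/4)
    sigma = math.hypot(dx, dy)
    sin_alpha = abs(dy) / sigma if sigma > 0 else 0.0  # coincident centers
    lam = 1 - 2 * math.sin(math.asin(sin_alpha) - math.pi / 4) ** 2

    # Distance cost: Delta = sum_t (1 - exp(-gamma * rho_t)), gamma = 2 - Lambda
    gamma = 2 - lam
    rho_x, rho_y = (dx / cw) ** 2, (dy / ch) ** 2
    delta = (1 - math.exp(-gamma * rho_x)) + (1 - math.exp(-gamma * rho_y))

    # Shape cost: Omega = sum_t (1 - exp(-omega_t))**theta
    ow = abs(w1 - w2) / max(w1, w2)
    oh = abs(h1 - h2) / max(h1, h2)
    omega = (1 - math.exp(-ow)) ** theta + (1 - math.exp(-oh)) ** theta

    return 1 - iou + (delta + omega) / 2
```

When the boxes coincide, the IoU term is 1 and both Δ and Ω vanish, so the loss is zero; misaligned centers raise Δ and mismatched widths or heights raise Ω.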

Experimental environment and parameter settings
Our experiments are conducted on a computer configured for high-performance computing and deep learning tasks, with an NVIDIA GeForce RTX 3060 GPU (16 GB memory) and an Intel Core i7-10750H processor. We use Python to write the experimental code and scripts, as Python is widely used in machine learning and deep learning and has a wealth of third-party libraries and tools. Before the experiment, we download the dataset and perform preprocessing, including image loading, label processing, and data augmentation; images are resized to 640×640. The initial learning rate is set to 0.001, and a learning rate decay strategy is used for optimization. Furthermore, an adequate number of iterations is set to ensure that the network fully learns the bird features in the dataset.
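A minimal PyTorch sketch of these training settings is given below, not the authors' actual script. The optimizer choice (SGD with momentum) and the cosine decay schedule are assumptions, since the exact decay strategy is not specified; `model` is a stand-in module.

```python
import torch

# Placeholder for the YOLOv5s + AFPN network
model = torch.nn.Conv2d(3, 16, 3)

# Initial learning rate 0.001, as stated above; SGD momentum value assumed
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.937)

# Learning rate decay strategy (cosine annealing assumed, 300-epoch horizon)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

for epoch in range(3):  # real training runs for many more epochs
    # ... forward pass, loss computation, and backward pass would go here ...
    optimizer.step()
    scheduler.step()     # decay the learning rate once per epoch
```

Stepping the scheduler once per epoch gradually lowers the learning rate from 0.001, which stabilizes the later stages of training.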

Analysis of experimental results
Evaluation indicators.
P: Precision is the percentage of samples classified as positive that are actually positive:

P = TP/(TP + FP)

where TP (true positives) is the number of samples the model correctly predicted as positive and FP (false positives) is the number of negative samples the model incorrectly predicted as positive. The higher the precision, the more accurate the model's positive classifications are.
R: Recall is the percentage of actually positive samples that are correctly classified as positive:

R = TP/(TP + FN)

where FN (false negatives) is the number of positive samples the model incorrectly predicted as negative. Recall measures the model's ability to find all positive samples that are actually present.
mAP: mAP (mean Average Precision) is a comprehensive metric computed from the Precision-Recall curve as the area under the curve, averaged over classes. mAP@.5 is the mean average precision at an IoU threshold of 0.5. mAP@.5:.95 is a more comprehensive metric that averages the precision over IoU thresholds ranging from 0.5 to 0.95, providing a more complete assessment of performance across thresholds.
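These indicators can be computed directly from the confusion counts; the small sketch below illustrates them, with AP approximated as a rectangular area under the P-R curve (evaluation toolkits use interpolated variants).

```python
def precision(tp, fp):
    """P = TP / (TP + FP): share of predicted positives that are correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """R = TP / (TP + FN): share of actual positives that are detected."""
    return tp / (tp + fn)


def average_precision(points):
    """Area under the precision-recall curve, approximated by rectangles.

    `points` is a list of (precision, recall) pairs sorted by ascending
    recall, e.g. obtained by sweeping the confidence threshold.
    """
    ap, prev_r = 0.0, 0.0
    for p, r in points:
        ap += p * (r - prev_r)  # rectangle of height p over the recall step
        prev_r = r
    return ap
```

mAP is then the mean of these per-class AP values; mAP@.5:.95 additionally averages over IoU thresholds from 0.5 to 0.95.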

Results of the experiment
Compared with YOLOv5s, YOLOv5s + AFPN increased the P value from 0.643 to 0.703, the R value from 0.526 to 0.589, mAP@.5 from 0.546 to 0.579, and mAP@.5:.95 from 0.261 to 0.272 (Table 1). Adding the AFPN module thus improves the model's performance: the gains in precision, recall, and average precision indicate that AFPN helps enhance feature representation and target detection performance (Figure 3).

Analysis of ablation experiments
Comparing the models with the AFPN module and different IOU loss functions, the precision of the model improves after adding the asymptotic feature pyramid network (AFPN). Among the loss functions, adding SIOU actually reduces the model's P and R values, so SIOU does not have a positive effect on this model, whereas adding WIOU or EIOU improves performance. The YOLOv5s + AFPN + WIOU model performs well in terms of precision, with higher accuracy and a lower false detection rate than the other models. The impact of SIOU and WIOU on overall performance is not significant, but EIOU slightly improves both the precision and the recall of the model. In these experiments the increase in mAP@.5 is modest, but overall the model's performance improves. In conclusion, adding the asymptotic feature pyramid network (AFPN) improves detection performance, introducing different IOU loss functions (SIOU, WIOU, EIOU) can further enhance it, and different loss functions excel on different indicators, so a suitable loss function can be selected according to the needs and optimization objectives.

Conclusion
To achieve bird detection, this paper proposes a YOLOv5s algorithm based on a progressive feature pyramid. Traditional feature pyramid networks tend to consume more time and memory during training and inference, and they often encounter conflicts and contradictions in feature learning. To enhance the accuracy of deep learning in bird recognition and detection, this paper deviates from the FPN structure in YOLOv5s, which employs a top-down pathway for strong feature propagation, and instead integrates the progressive feature pyramid network (AFPN) into YOLOv5s. Through the multi-level structure of the pyramid, fine-grained attention progressively guides the learning of coarse-grained attention, which allows the model to focus on birds from the local to the global level and capture more detailed features. Additionally, different bounding box regression loss functions (SIOU, WIOU, EIOU) are incorporated into the YOLOv5s framework, trained, and compared. The experimental and training data are extracted from the VOCtrainval2012 dataset, and data annotation is performed to ensure accurate bounding boxes around the target birds. The improved model demonstrates significant advantages over the standalone YOLOv5s model. The precision of the YOLOv5s + AFPN + WIOU model increases from 0.643 to 0.717, while the recall of YOLOv5s + AFPN and YOLOv5s + AFPN + EIOU rises from 0.526 to 0.589 and 0.579, respectively. The average precision mAP@.5:.95 of YOLOv5s + AFPN + SIOU increases from 0.261 to 0.287, showing notable improvement and meeting the predetermined objectives.

Figure 1. Some samples in the bird dataset. During dataset preprocessing, we adopted the following steps: (1) resizing all images to 640x640 pixels to meet the input requirement of the YOLOv5s model; (2) applying data augmentation techniques, including random rotation, translation, scaling, and horizontal flipping, to increase the diversity of data samples and enhance the model's robustness; (3) filtering and cleaning the annotation data to ensure that each annotation box accurately encloses the target bird.

Figure 2. The architecture of YOLOv5s with AFPN. AFPN initially fuses two low-level features; in subsequent stages it fuses higher-level features, and in the final stage it adds the top-level features to the fusion process. Black arrows represent convolutions, and blue arrows represent adaptive spatial fusion.

Figure 3. Example images of bird detection.

Figure 4. Bar charts of the different indicators of the models. According to the results in Figure 4 for the different IOU loss functions, the YOLOv5s + AFPN + EIOU model has the best overall performance when precision (P value), recall (R value), and the mAP indicators are considered together.

Table 1. Results after training the different models.