Enhanced Security Contraband Detection through Integration of Attention Mechanism in R3Det

Detecting dangerous goods in security images is a challenging task. To overcome the localization difficulty and the loss of directional features of contraband in X-ray images, we propose an improved R3Det detector that incorporates the Convolutional Block Attention Module (CBAM). By integrating ResNeSt into the original detector, our detector gains a soft attention mechanism that redistributes weights among feature channels. This enhances the network's ability to extract important features and facilitates the extraction of target-object features under complex backgrounds. We then introduce the spatial and channel attention mechanism at the connection between the backbone and the Feature Pyramid Network (FPN), enabling the model to focus on significant features while ignoring complex background information. The subsequent Feature Refinement Module achieves feature alignment in a pixel-by-pixel manner. Our approach successfully achieves rotated target detection against the complex background of X-ray images. Through end-to-end training, our proposed method achieves a 2.6% improvement over the original detector, with a mean Average Precision (mAP) of 86.7%. Notably, our approach shows remarkable results in detecting scissors, pressure, and firecrackers. We have deployed the proposed method on actual security machines for hazardous material detection tasks.


Introduction
In recent years, deep learning has gradually matured in the automatic processing of X-ray image data [1]. In public places such as subways, stations and airports, traditional security inspection machines are slow and their results are affected by human factors. To address this, object detection has been applied to security inspection machines, which greatly relieves the pressure of security inspection. However, detecting contraband in X-ray security images still suffers from two main shortcomings:
(1) Objects obstructing each other. In daily security checks, backpacks, boxes, and other items are often stacked and block each other. This causes items to overlap, making it difficult to accurately locate the contraband in the X-ray image.
(2) Loss of directional features. Items in luggage are oriented arbitrarily. A horizontal detection box around a piece of contraband therefore encloses a large amount of background and carries no directional information.
To address the issues associated with horizontal target detection in X-ray images, we propose a network called AT-R3det, which fuses R3det [2] with attention mechanisms. The work comprises two improvements to the detector and the construction of a dataset:
(1) To improve on the detection precision obtained with ResNet [3], the backbone is replaced with ResNeSt [4]. Compared with the original backbone network, ResNeSt is more modular and improves accuracy without adding much computational cost.
(2) In response to the complex background of X-ray images, the CBAM [5] attention mechanism, which comprises a spatial attention mechanism and a channel attention mechanism, is added between ResNeSt and FPN [6] to enrich the feature maps.
(3) Since existing object data from security inspection settings is scarce and expensive, we constructed the XS-6359 benchmark dataset from actual security inspection images for the detection task in this paper.

ResNeSt structure based on attention fusion
ResNeSt outperforms ResNet, SE-Net [7], and ResNeXt [8] in terms of accuracy, without adding any extra parameters. This is achieved by keeping the group convolution intact while introducing a sub-attention block into the 1×1 + 3×3 sub-module, which enables the extraction of more valuable features [9]. Figure 1 illustrates the ResNeSt structure, where h, w and c represent the height, width and number of channels of the input feature map, respectively. The input is split into n branches, each of which undergoes group convolution; K denotes the Kth sub-convolution, and c' is the number of channels in the intermediate feature layer. After the different sub-convolutions, the features from each branch are concatenated. In the ResNeSt block, the attention mechanisms formed on different n are different for the same K (dotted box), but are the same for different K formed on the same n. ResNeSt can handle all branches simultaneously, eliminating the need to process each K group individually as in the original structure, thus ensuring code robustness and network equivalency.
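The attention step over the sub-convolution outputs can be illustrated with a minimal sketch. It assumes the usual split-attention form (sum the splits, pool globally, then apply a softmax across the splits for each channel). The `gate` callable is a hypothetical stand-in for the small dense layers that produce the attention logits, and the plain nested-list layout is chosen only for readability; none of these names come from the paper.

```python
import math

def split_attention(splits, gate):
    """Fuse R split feature maps with per-channel softmax weights.

    splits: list of R feature maps, each a list of C channels, each channel a
            flat list of pixel values (assumed layout, for illustration only).
    gate:   hypothetical callable mapping the pooled C-vector to R lists of
            C logits (stands in for the dense layers of the real block).
    """
    num_splits, num_channels = len(splits), len(splits[0])
    num_pixels = len(splits[0][0])
    # Sum the splits element-wise, then global-average-pool each channel.
    pooled = [
        sum(sum(splits[r][c]) for r in range(num_splits)) / num_pixels
        for c in range(num_channels)
    ]
    logits = gate(pooled)
    fused = []
    for c in range(num_channels):
        # Softmax across the splits, computed separately for every channel.
        peak = max(logits[r][c] for r in range(num_splits))
        exps = [math.exp(logits[r][c] - peak) for r in range(num_splits)]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the splits for this channel.
        fused.append([
            sum(weights[r] * splits[r][c][i] for r in range(num_splits))
            for i in range(num_pixels)
        ])
    return fused
```

With equal logits the block reduces to a plain average of the splits; a strongly peaked gate selects a single split for that channel.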

Integration of CBAM attention mechanisms
CBAM is a convolutional block attention module that combines spatial and channel attention. The schematic of CBAM, including the channel attention block, is shown in Figure 2. In the channel attention block, the feature map is first compressed along the spatial dimensions to obtain a one-dimensional vector. Average pooling takes every pixel of the feature map into account, whereas max pooling, during gradient backpropagation, passes gradient feedback only to the locations with the largest response. Average pooling and max pooling are both used to aggregate the spatial information of the feature map; the two pooled descriptors are sent through a shared multilayer perceptron (MLP) and summed element by element to generate the channel attention map. Viewed per feature map, channel attention focuses on which features are important. The channel attention mechanism can be expressed as:
$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) \tag{1}$$
where F is the input feature map and σ is the sigmoid function.
The spatial attention block is also shown in Figure 2. The feature map output by the channel attention mechanism is used as the input feature map for this module. Similarly, the spatial attention mechanism compresses the channels by applying global average pooling and global max pooling along the channel dimension. Global max pooling extracts the maximum value across the channels, and global average pooling extracts the average value across the channels. The two resulting single-channel feature maps are then combined into a feature map with two channels.
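The two pooling paths described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the feature map is a nested list of C channels of size H×W, and the weight matrices `w0` and `w1` of the shared MLP are hypothetical placeholders for learned parameters. The convolution and sigmoid that CBAM applies to the two-channel spatial descriptor are left out, since the text stops at that descriptor.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def shared_mlp(vec, w0, w1):
    """Two-layer perceptron with ReLU; w0 is (C/r x C), w1 is (C x C/r)."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w0]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w1]

def channel_attention(feature, w0, w1):
    """Return one weight in (0, 1) per channel of a C x H x W nested list."""
    flat = [[v for row in channel for v in row] for channel in feature]
    avg = [sum(ch) / len(ch) for ch in flat]  # spatial average pooling
    mx = [max(ch) for ch in flat]             # spatial max pooling
    # Both descriptors pass through the same MLP and are summed element-wise.
    summed = [a + m for a, m in zip(shared_mlp(avg, w0, w1),
                                    shared_mlp(mx, w0, w1))]
    return [sigmoid(s) for s in summed]

def spatial_descriptor(feature):
    """Pool across channels at every pixel, giving a two-channel (avg, max) map."""
    channels, height, width = len(feature), len(feature[0]), len(feature[0][0])
    avg_map = [[sum(feature[c][i][j] for c in range(channels)) / channels
                for j in range(width)] for i in range(height)]
    max_map = [[max(feature[c][i][j] for c in range(channels))
                for j in range(width)] for i in range(height)]
    return [avg_map, max_map]
```

In a full module, each channel would be multiplied by its weight from `channel_attention` before `spatial_descriptor` is computed on the result.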

Rotation angle representation
There are three common ways to define an arbitrarily rotated box in rotated target detection: the five-parameter definition with two different angle ranges, and the eight-parameter quadrilateral definition, as shown in Figure 3. The OpenCV definition [10] in Figure 3 is used in this experiment. Starting from the positive half of the X-axis, the first side encountered in the counterclockwise direction is taken as the width, and the angle range is [-π/2, 0]. The major advantage of this definition is that the width is always the side from which the angle is defined, eliminating the need to search for the longer of width and height before computing θ, which can improve detection speed.
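The relationship between the five-parameter and eight-parameter representations can be made concrete with a small conversion sketch. It assumes a box given as (cx, cy, w, h, θ) with θ in radians and a standard mathematical rotation; in image coordinates, where the y-axis points down, the visual sense of the rotation is flipped. The function name is illustrative and not taken from the paper.

```python
import math

def rbox_to_corners(cx, cy, w, h, theta):
    """Convert a five-parameter rotated box to its four (x, y) corners.

    The four corners together form the eight-parameter quadrilateral
    representation of the same box.
    """
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Corner offsets in the box's own frame: width along local x, height along local y.
    offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each offset by theta and translate it to the box center.
    return [
        (cx + dx * cos_t - dy * sin_t, cy + dx * sin_t + dy * cos_t)
        for dx, dy in offsets
    ]
```

Because the rotation is rigid, the distance between the first two corners always equals w regardless of θ, which is a convenient sanity check when converting annotations.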

Schematics of AT-R3det
AT-R3det is formed by fusing attention mechanisms into R3det, an enhanced single-stage rotation detector based on RetinaNet [11] that introduces the concepts of rotated detection and rotated anchors. As shown in Figure 4, AT-R3det mainly consists of ResNeSt, FPN and a classification and regression subnetwork, in which ResNeSt extracts image features layer by layer from the bottom up [12]. To enhance the ability of the backbone network to extract features, we add CBAM to extract more important features from ResNeSt. The feature maps are then linked and enhanced by FPN, effectively constructing a rich multi-scale feature pyramid from a single-resolution input image. Each level of the pyramid is connected to a classification and regression subnetwork, so the network can better adapt to targets of different scales in complex environments.

Dataset, experimental environment, and evaluation indicators
Given that existing object data from security inspection settings is scarce and expensive, we constructed the XS-6359 benchmark dataset from actual security inspection images to solve the detection task. The contraband dataset includes 8 types of contraband: knife, scissors, lighter, zippooil, pressure, handcuffs, powerbank, and firecrackers, with a total of 6359 X-ray images. To test the effectiveness of the proposed AT-R3det detector for security contraband detection, extensive training and testing were carried out on the XS-6359 benchmark dataset. The experimental environment is shown in Table 1. The ratio of the training set to the validation set is 9:1. The images were annotated with roLabelImg, which adds an angle parameter θ to the horizontal bounding-box label. The evaluation indicators used in the experiment are mAP@0.5 (mean Average Precision at an IoU threshold of 0.5) and the Average Precision (AP) of each category.
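As a rough illustration of how AP and mAP@0.5 are obtained, the following sketch computes AP as the area under the precision-recall curve with all-point interpolation. The interpolation scheme is an assumption, since the text does not state which variant was used. Matching each detection to a ground-truth box at the IoU threshold of 0.5 is assumed to have been done beforehand, so every detection arrives as a (score, is_true_positive) pair.

```python
def average_precision(detections, num_gt):
    """AP for one class.

    detections: list of (score, is_true_positive) pairs, already matched to
                ground truth at the chosen IoU threshold (assumed upstream).
    num_gt:     number of ground-truth boxes of this class.
    """
    if num_gt == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing from left to right (precision envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Accumulate rectangle areas wherever recall increases.
    ap, prev_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

def mean_average_precision(per_class):
    """per_class maps a class name to (detections, num_gt); returns (mAP, per-class APs)."""
    aps = {name: average_precision(dets, n) for name, (dets, n) in per_class.items()}
    return sum(aps.values()) / len(aps), aps
```

For the eight contraband categories, `per_class` would hold one entry per category, and the returned mean is the mAP@0.5 value.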

Experimental comparison results based on the XS-6359 benchmark dataset
Figure 6 compares the mAP curves of the three target detectors.

Analysis of experimental results
The experiment compared AT-R3det with other target detectors, namely YOLOv5 and R3det. The results in Tables 2.a and 2.b show that AT-R3det outperforms YOLOv5 in detecting targets under complex backgrounds. YOLOv5, which does not model rotation, has low detection accuracy and is prone to missed and false detections. By adding the angle parameter, AT-R3det achieves significantly improved detection accuracy. The other modifications, namely enhancing the backbone network and adding an attention mechanism, also contribute to better detection accuracy for specific targets such as scissors, pressure, and firecrackers, leading to an overall increase of 2.3%. The comparison results are shown in Figure 7.

Conclusion
In this paper, we propose AT-R3det to address two primary challenges encountered during contraband detection in complex X-ray images. AT-R3det employs ResNeSt to reduce the number of parameters while maintaining group convolution, and it enhances the attention block submodule to increase the weight of valuable regions in the feature map. Furthermore, the integration of CBAM between ResNeSt and FPN allows the network to extract features of different dimensions, thereby focusing on relevant features. Experimental results indicate that AT-R3det effectively improves the precision of contraband detection without increasing the number of parameters or the computational load. Future work should aim to make the model more lightweight, reduce resource usage, improve detection efficiency, and adapt it for use in various security inspection machines to achieve accurate detection of prohibited items.

Figure 1. An illustration of the unit structure of ResNeSt-50. The input feature map (h, w, c) is split into n groups. Each group is then divided into k parts for 1×1 and 3×3 convolutions. Upon completion, the various parts within each group are concatenated to form a feature map (h, w, c'), and features are extracted after receiving group attention.

Figure 2. Schematics of CBAM. The attention mechanism assigns different weights to the input part of the feature network, enabling the model to ignore certain background information and focus more on the features themselves.

Figure 3. From left to right: OpenCV definition method (90°), long edge definition (90°), ordered quadrilateral representation.

Figure 5. Schematics of the Feature Refinement Module. As depicted in Figure 5, the Feature Refinement Module of AT-R3det overlays the feature map extracted from the FPN through double-channel convolution to optimize the feature map. To enhance the detection speed, during the refinement stage only the bounding box with the highest score is retained and the other bounding boxes are discarded, ensuring that each feature point corresponds to only one refined bounding box. The module retrieves the corresponding feature vectors on the feature map based on the five coordinates of the refined bounding box, and then obtains more precise feature vectors through bilinear interpolation. AT-R3det reconstructs the entire feature map on a pixel-by-pixel basis, using feature alignment to perform classification and regression.
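The bilinear interpolation step can be sketched as follows. The sketch assumes that the five coordinates are the center and the four corners of the refined box, and that the feature map is a single-channel nested list indexed as `fmap[row][column]`. The function names are illustrative and not taken from the paper.

```python
import math

def bilinear_sample(fmap, x, y):
    """Sample a 2-D feature map (list of rows) at a fractional (x, y) location."""
    height, width = len(fmap), len(fmap[0])
    # Clamp the location so that all four neighbours stay inside the map.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    ax, ay = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - ax) * fmap[y0][x0] + ax * fmap[y0][x1]
    bottom = (1 - ax) * fmap[y1][x0] + ax * fmap[y1][x1]
    return (1 - ay) * top + ay * bottom

def sample_box_features(fmap, center, corners):
    """Sample the map at the box center followed by its four corners."""
    points = [center] + list(corners)
    return [bilinear_sample(fmap, x, y) for x, y in points]
```

In a multi-channel setting the same sampling would be repeated for every channel, giving one feature vector per sampled point.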

Figure 6. Visualization of mAP curves in several comparative experiments.

Figure 7. Visualization of several comparative experiments.

Table 2. The data in Tables 2.a and 2.b represent the per-category accuracy and the average accuracy over all categories of the different detectors on the same dataset.
Table 2.a. Comparison of the detection accuracy of non-rotating and rotating target detection algorithms on the same dataset; bold font marks the highest accuracy in each comparison.

Table 2.b. Comparison of the detection accuracy of non-rotating and rotating target detection algorithms on the same dataset; bold font marks the highest accuracy in each comparison.