Improved Drug Traceability Code Detection Algorithm Based on YOLOv5

In order to enhance the storage efficiency of drug traceability code information on the blockchain and improve the extraction capability of drug traceability codes on drug packaging, a detection algorithm based on an enhanced version of YOLOv5 is proposed for the drug production and transportation scenario. The proposed algorithm introduces the SPD-Conv module into the backbone network, thereby enhancing the network's ability to extract detailed feature information. Additionally, the CA attention mechanism is incorporated into the Neck of the network, providing the network with superior feature fusion capabilities. Furthermore, the activation function in the network is replaced with the LeakyReLU activation function, reducing computational requirements during training and inference. This replacement improves the model's test accuracy and detection speed. Experimental evaluation, conducted on a self-built dataset, demonstrates the effectiveness of the improved YOLOv5 model. The results indicate a compression of model parameters from 6.14M to 2.71M, an increase in mAP@.5 from 82.4% to 93.7%, and a boost in detection speed from 32 FPS to 51 FPS. These findings establish the superior performance of the enhanced model compared to the original version.


Introduction
Drug traceability codes play a crucial role in recording and tracking various stages of drug production, processing, logistics, transportation, and retail usage. As China's pharmaceutical industry and internet business continue to evolve, blockchain technology is gradually being implemented for the storage and retrieval of drug traceability codes. However, before transmitting these codes to the blockchain, it is necessary to extract them from the drug's outer packaging. Currently, this extraction process heavily relies on manual photography, which is both costly and inefficient. Moreover, manual operations increase the chances of missing or falsely detecting damaged traceability codes. These limitations pose challenges for effective drug traceability on the blockchain, reducing the efficiency of storage and retrieval processes. Consequently, there is a pressing need to develop automatic detection algorithms for drug traceability codes, aiming to enhance algorithmic detection accuracy and sensitivity in distinguishing damaged traceability codes. Such research is of significant importance for advancing the field of drug traceability codes.
With the advancement of deep learning research, significant progress has been made in object detection tasks in recent years. Currently, image target detection algorithms can be broadly categorized into two types. The first category is Two-Stage target detection algorithms, including R-CNN [1] and its subsequent developments such as SPP-Net [2], Fast R-CNN [3], and Faster R-CNN [4].
The Two-Stage detection algorithm divides the target detection process into two stages: candidate region selection and target classification/recognition. R-CNN, the influential work in applying deep learning to target detection, combines the Selective Search algorithm with a deep convolutional neural network. It predicts candidate regions using Selective Search, employs AlexNet and an SVM for target classification, and uses bounding box regression and non-maximum suppression to obtain the best candidate boxes. The R-CNN algorithm significantly improves average accuracy compared with traditional feature extraction and feature matching approaches. However, the high complexity of the Selective Search algorithm hampers the practical applicability of the model due to slow training and prediction speeds, making it difficult to use in industrial production. To address these issues, Girshick et al. and later researchers proposed faster successors such as Fast R-CNN and Faster R-CNN. While Two-Stage detection algorithms can achieve high detection accuracy, they face significant limitations in detection speed. To address this challenge and improve detection speed without compromising accuracy, One-Stage target detection algorithms have been introduced. These include the SSD [5] algorithm, the RefineDet [6] algorithm, and the YOLO [7] series. One-Stage algorithms utilize deep convolutional neural networks to simultaneously select candidate regions and identify targets, resulting in a significant improvement in detection speed. Take the widely recognized YOLO series as an example: in 2015, Redmon et al.
proposed the YOLO target detection algorithm, which divides the entire image into S × S grids and makes predictions only from the grid containing the target's center. While the model's detection accuracy is somewhat limited, its detection speed is greatly enhanced. In 2020, the introduction of the YOLOv5 algorithm enabled real-time target recognition in practical scenarios. YOLOv5 incorporates the strengths and advantages of its previous versions, utilizing up-to-date feature extraction networks and effective data augmentation techniques; as a result, it performs exceptionally well in various target detection tasks. Currently, the YOLOv5 algorithm and its subsequent improvements have been successfully applied to target detection tasks across diverse fields, demonstrating their practical utility and effectiveness. Based on the YOLOv5 algorithm, Jing proposed CrackNet, used to detect cracks at intersections, effectively alleviating missed detections and inaccurate localization of cracks [8]; Liu used GhostNet to reconstruct the Neck of the YOLOv5 network, improving the detection of fire and smoke in coal mines [9].
Based on the aforementioned research, we propose an improved YOLOv5s-based drug traceability code detection algorithm. This algorithm automates the extraction and storage of drug traceability codes from the drug's outer packaging, thereby enhancing the efficiency of blockchain storage for drug traceability codes. Our approach involves replacing the traditional convolution module in the Backbone section of the YOLOv5 network with a single-stride convolution module. This modification reduces the loss of detailed features and enhances the network's ability to detect small target objects. Furthermore, we embed a CA (Coordinate Attention) mechanism in the Neck section of the network. This mechanism encodes position information to obtain attention over the image's width and height, thereby improving the network's feature fusion capability. Additionally, we introduce the LeakyReLU activation function to replace the original SiLU activation function. This replacement reduces the nonlinear calculations of the network during training and increases training and inference speed. By incorporating these enhancements, our algorithm demonstrates improved performance in drug traceability code detection. It enables the automatic extraction and preservation of drug traceability codes, thereby streamlining the process and facilitating efficient storage on the blockchain.
This paper focuses on constructing a dataset comprising on-site collected images of drug traceability codes. The designed network is then subjected to extensive experiments using this dataset. Additionally, the SPPF structure is employed for pooling operations to expand the network's receptive field and enhance its feature extraction capabilities. During the YOLOv5 detection process, the network initially applies data augmentation techniques to the input training data. These techniques include operations such as HSV colour gamut augmentation, Scale, Shear, Mosaic [10], and Copy-Paste. These augmentations play a vital role in improving the accuracy of detection. Subsequently, the pre-processed training images are fed into the backbone network. Each image is divided into S × S grids, with each grid obtaining a unique size calculated by the adaptive anchor frame [11]. The network detects candidate object frames by analysing the grid containing the object's center. Then, features are extracted through the convolutional layers of the backbone network, generating a multi-scale feature map for prediction. The prediction layer utilizes this multi-scale feature map to make predictions. To obtain the final prediction, the feature pyramid structure fuses the multi-scale feature maps. The feature maps of different scales possess distinct receptive field sizes, enabling the detection of objects of various sizes. After processing by the prediction layer, each grid obtains B predicted anchor boxes, confidence scores, and C conditional probabilities [12].
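As a quick arithmetic sketch of the prediction layout described above: each grid cell carries B anchor boxes, each with four box coordinates and a confidence score, plus C conditional class probabilities per anchor. The values of S, B, and C below are illustrative assumptions, not values fixed by this paper.

```python
# Each YOLO head cell predicts B anchor boxes, each with (x, y, w, h, confidence)
# plus C conditional class probabilities, i.e. B * (5 + C) values per grid cell.
S, B, C = 80, 3, 2   # assumed: an 80 x 80 grid, 3 anchors per cell, 2 classes
values_per_cell = B * (5 + C)
head_outputs = S * S * values_per_cell  # total predictions from this one head
print(values_per_cell, head_outputs)    # 21 134400
```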

Improved YOLOv5 traceability code detection model
To fulfil the identification requirements of drug traceability codes, enhancements have been made to the YOLOv5s model. These improvements maintain the accuracy of YOLOv5s in detecting intact traceability codes while reducing confidence in detecting damaged traceability codes. This refinement aims to improve the sensitivity in distinguishing between normal and damaged traceability codes [13]. Moreover, the modifications optimize the model's inference speed to meet the deployment conditions on production and transportation lines, as well as the real-time demands of on-site operations. This chapter introduces the following enhancements: (1) Recognizing the characteristics of low-resolution and small target detection tasks in logistics electronic evidence storage traceability code detection, the Conv module in the YOLOv5s backbone network is replaced with the SPD-Conv module [14]. This substitution enhances the network's capability to extract features from small target objects. (2) In order to maximize the utilization of computing and storage resources on detection equipment, the CA (Coordinate Attention) attention mechanism [15] is integrated into the YOLOv5s Neck network. This integration enables the network to focus more attention on relevant objects during detection, thereby enhancing the model's detection accuracy. (3) The default SiLU activation function used in all modules of YOLOv5s is replaced with the LeakyReLU activation function. This replacement reduces the computational load on the model without compromising its performance, ultimately improving both training and inference speeds. The improved YOLOv5 architecture is depicted in Figure 2.
Here, scale is the downsampling factor, and Figure 3(a), (b), and (c) show the processing of an intermediate feature map by the SPD layer when scale = 2. The SPD layer splits and reorganizes the input intermediate feature map at the pixel level, compressing the feature map while retaining the details of the original feature map as far as possible, thereby realizing feature fusion and feature extraction.
These feature sub-maps are then concatenated along the channel dimension to obtain a feature map $X'$ that is reduced by a factor of scale in each spatial dimension and enlarged $\text{scale}^2$ times in the channel dimension. Then $C_2$ convolution kernels of size 3 × 3 perform a convolution with stride 1 on $X'$ to obtain the final output feature map $X''$ of the SPD-Conv module. Compared with a traditional 3 × 3 convolution with stride 2, the single-stride convolution avoids asymmetric sampling, retains as much discriminative feature information as possible, and improves the utilization of the information in the input feature map.
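A minimal PyTorch sketch of the SPD-Conv operation described above — space-to-depth slicing followed by a stride-1 convolution — assuming scale = 2; the class name and channel counts are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Sketch of an SPD-Conv block: space-to-depth layer + single-stride conv."""
    def __init__(self, in_channels, out_channels, scale=2):
        super().__init__()
        self.scale = scale
        # stride-1 convolution applied after the channel dimension grows by scale**2
        self.conv = nn.Conv2d(in_channels * scale**2, out_channels,
                              kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        s = self.scale
        # space-to-depth: split the H x W plane into s*s interleaved sub-maps
        # and stack them along the channel dimension (no information is discarded)
        x = torch.cat([x[..., i::s, j::s] for i in range(s) for j in range(s)], dim=1)
        return self.conv(x)

y = SPDConv(64, 128)(torch.randn(1, 64, 32, 32))
print(y.shape)  # torch.Size([1, 128, 16, 16])
```

Unlike a stride-2 convolution, every input pixel contributes to exactly one sub-map, so the downsampling itself loses no fine-grained detail; the subsequent stride-1 convolution then learns how to fuse the stacked sub-maps.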

Add CA attention mechanism
During the production and transportation process, lighting conditions can vary, leading to diverse logistics electronic evidence traceability code images with variations in brightness and blur. Such variations pose challenges by causing false detections and reducing the sensitivity to distinguish damaged traceability codes. To address this issue, the attention mechanism, inspired by human vision research, is introduced. Its core concept involves assigning weights to each pixel in the feature map, enabling the model to prioritize more relevant areas given limited computing and storage resources. Consequently, this enhances the network's detection accuracy. In this chapter, the CA (Coordinate Attention) attention mechanism is introduced and integrated into the Neck network. By doing so, the feature extraction capability of the network is enhanced, resulting in improved detection accuracy and the ability to discern damaged traceability codes. This integration leverages the inherent strengths of the attention mechanism to selectively focus on significant regions, thereby optimizing the network's performance.
The CA attention mechanism plays a crucial role in enhancing the expressive capability of network feature learning. Its objective is to capture not only the significance of different channel features but also the positional information related to their spatial characteristics. By automatically learning the importance of the various channel features, including position encoding, it models the interdependence between convolutional feature channels and their spatial information, thereby improving the network's ability to learn meaningful features. One notable advantage of the CA attention mechanism is its versatility: it can accept any intermediate feature tensor, denoted x, as input and produce an output tensor y of the same dimensions with an enhanced representation, so it integrates seamlessly into existing network structures. The structural diagram of the CA attention mechanism module is depicted in Figure 4. To obtain attention over the width and height of the feature map and to encode position information, the feature map is first subjected to global average pooling along the width and height directions respectively, yielding the direction-aware feature vectors $z_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i)$ and $z_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w)$.
The feature vectors covering the global receptive field in the width and height directions are then concatenated, and the result is fed into a 1 × 1 convolution module $F_1$, which reduces the channel dimension to $C/r$ of the original. After batch normalization, the processed feature vector $f_1$ is passed through a Sigmoid activation function to obtain a feature vector f of shape $1 \times (H + W) \times C/r$, i.e. $f = \sigma(\mathrm{BN}(F_1([z^h, z^w])))$.
A 1 × 1 convolution is then applied to f to restore the original number of channels, and f is split back into $f^h$ and $f^w$ according to the concatenated lengths. After a Sigmoid activation, the attention weights $g^h$ and $g^w$ in the height and width directions are obtained: $g^h = \sigma(F_h(f^h))$ and $g^w = \sigma(F_w(f^w))$. Finally, multiplicative weighting is applied to the original feature map to obtain a feature map carrying attention weights in both width and height: $y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$.
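The pipeline above — directional pooling, a shared 1 × 1 transform, and per-direction sigmoid gates — can be sketched in PyTorch as follows. The reduction ratio r and the ReLU used for the intermediate non-linearity are assumptions for illustration (the CA paper uses a non-linear activation at that point), not values fixed by this paper.

```python
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    """Sketch of a CA (Coordinate Attention) block; r is an assumed reduction ratio."""
    def __init__(self, channels, r=16):
        super().__init__()
        mid = max(8, channels // r)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # average over width  -> N x C x H x 1
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # average over height -> N x C x 1 x W
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)  # intermediate non-linearity (assumption)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.shape
        zh = self.pool_h(x)                      # N x C x H x 1
        zw = self.pool_w(x).permute(0, 1, 3, 2)  # N x C x W x 1
        # concatenate the two directional descriptors and transform them jointly
        f = self.act(self.bn(self.conv1(torch.cat([zh, zw], dim=2))))
        fh, fw = torch.split(f, [h, w], dim=2)   # split back by height / width
        gh = torch.sigmoid(self.conv_h(fh))                       # N x C x H x 1
        gw = torch.sigmoid(self.conv_w(fw.permute(0, 1, 3, 2)))   # N x C x 1 x W
        return x * gh * gw   # broadcast the two attention maps over the input

x = torch.randn(2, 32, 20, 24)
print(CoordAttention(32)(x).shape)  # torch.Size([2, 32, 20, 24])
```

Because the output keeps the input's shape, the block drops into the Neck between existing layers without any other architectural change.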

Replace the Activation Function
In the context of drug production and transportation, the timely collection and recognition of drug traceability code images are crucial. The target detection model needs to achieve fast detection while ensuring high accuracy. However, the configuration settings of drug production logistics pipelines are often imperfect, and cost constraints limit the use of high-end hardware for the detection model. Therefore, it is essential for the detection model to minimize computational requirements during the inference process to meet practical application needs. In CNN network models, activation functions are applied to introduce nonlinearity and improve the generalization ability of the model. The choice of activation function directly impacts the computation, detection accuracy, and training and inference time of the network. By default, YOLOv5s employs the SiLU (Sigmoid Linear Unit) activation function. The SiLU function effectively prevents the gradient from vanishing during backpropagation, allowing parameter updates. However, the SiLU function involves exponential operations, significantly increasing the computational workload during training and inference, which leads to slower training and inference speeds.
In this chapter, the Leaky ReLU activation function is introduced and employed as the activation function in the improved YOLOv5s network. The Leaky ReLU function is an enhancement of the traditional ReLU function, allowing it to activate negative values as well. Unlike the ReLU function, which discards negative values entirely, the Leaky ReLU function retains some activation effect on negative values. Additionally, it is a piecewise linear function with a slope coefficient that is set prior to training. Compared to the SiLU function, which involves exponential operations, the Leaky ReLU function significantly reduces the computational workload during both training and inference. As a result, it greatly improves the training and inference speed of the model while maintaining its ability to learn and generalize effectively. By adopting the Leaky ReLU activation function, the improved YOLOv5s network achieves faster computations and more efficient training and inference processes.
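As a sketch, the two activations and the swap inside a Conv-BN-activation block can be written as follows; the negative slope of 0.1 and the block layout are illustrative assumptions, not values fixed by this paper.

```python
import torch
import torch.nn as nn

x = torch.linspace(-3, 3, 7)

silu = nn.SiLU()           # SiLU(x) = x * sigmoid(x): smooth, but needs an exponential
leaky = nn.LeakyReLU(0.1)  # LeakyReLU(x) = x if x > 0 else 0.1 * x: two linear pieces

print(silu(x))
print(leaky(x))

# Swapping the activation in a YOLOv5-style Conv block (a sketch; the stride-2
# conv + batch norm + activation layout mirrors the Conv modules described above):
conv = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(16),
    nn.LeakyReLU(0.1, inplace=True),  # replaces the default SiLU
)
print(conv(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 16, 16, 16])
```

Because LeakyReLU is evaluated with a comparison and a multiply rather than an exponential, the per-activation cost drops while negative inputs still receive a small gradient.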

Experimental Equipment and Parameter Settings
In this experiment, an Intel Core i5-10300H CPU and a GTX 1650 Ti GPU are used; the graphics card has a 128-bit memory interface, 256 GB/s memory bandwidth, a core frequency of 1530–1800 MHz, and a memory frequency of 12,000 MHz. The machine has a 1 TB disk and 8 GB of memory. The PyTorch framework is used. The experimental parameter configuration is shown in Table 1.
FPS denotes the number of images processed by the algorithm per second and is often employed to evaluate the operational speed of the model.A higher FPS value indicates a faster detection speed of the model and lower time consumption for image detection.
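Under this definition, FPS can be measured roughly by timing forward passes in inference mode; the stand-in model, warm-up count, and image size below are assumptions for illustration, not the paper's measurement protocol.

```python
import time
import torch
import torch.nn as nn

def measure_fps(model, imgs, warmup=3):
    """Rough FPS estimate: images processed per second in inference mode."""
    model.eval()
    with torch.inference_mode():
        for _ in range(warmup):
            model(imgs[0])          # warm-up passes are excluded from the timing
        t0 = time.perf_counter()
        for im in imgs:
            model(im)
        dt = time.perf_counter() - t0
    return len(imgs) / dt

# e.g. with a stand-in model; on real hardware, pass the detector and real frames
fps = measure_fps(nn.Identity(), [torch.randn(1, 3, 640, 640) for _ in range(20)])
print(f"{fps:.1f} FPS")
```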

Ablation Experiment
In order to verify the effectiveness of the improved algorithm in this paper, ablation experiments were carried out on the same dataset. Starting from the original YOLOv5s model, the modifications were applied in turn: replacing the Conv modules with SPD-Conv, embedding the CA attention mechanism, and replacing the activation function with LeakyReLU. The results of the ablation experiments are shown in Table 2.

Demonstration of Detection Results
To better demonstrate the detection effect of the improved YOLOv5 model on drug traceability codes, we select normal traceability codes and traceability codes with different degrees of damage for verification. The damage includes stains, missing parts, posture offset, and excessive reflection. The results of the basic YOLOv5s model are presented in images (a) through (e), while the results of our model are shown in images (f) through (j). Among them, (a) and (f), (b) and (g), (c) and (h), (d) and (i), and (e) and (j) are five pairs of comparisons obtained from five sample images, corresponding to normal, stained, partially missing, posture-offset, and excessively reflective traceability codes, respectively. For ease of presentation, we have captured only the area where the traceability code is located in each detected image; this is also the region the model needs to focus on during detection. The experimental results show that the improved YOLOv5s model not only improves the recognition of normal traceability codes but also enhances the ability to distinguish damaged ones, confirming its improved capability to identify small targets such as traceability codes and its higher detection accuracy for both normal and damaged traceability codes.

Conclusion
This paper presents an enhanced network structure for YOLOv5s that improves both the detection accuracy and the speed of the model while reducing the number of network parameters and increasing training and inference speed. The modifications focus on three aspects. First, the traditional Conv modules in the backbone network are replaced with SPD-Conv modules, which significantly enhances the network's ability to retain detailed information during feature extraction while maintaining parameter efficiency. Second, the CA attention mechanism is introduced into the Neck component, boosting the model's feature fusion ability and detection accuracy. Third, the SiLU activation function originally employed by the YOLOv5s network is replaced with the LeakyReLU activation function, which effectively reduces computational requirements during both training and inference without compromising the model's performance. Evaluation of the improved model on a custom-built dataset demonstrates notable improvements over the original YOLOv5s model: reduced computational complexity, faster inference, and higher detection accuracy. The experimental findings establish that the YOLOv5s model refined through the proposed approach delivers the most promising overall performance.
Girshick et al. introduced the Fast R-CNN algorithm in 2015. It incorporates SPP-Net, proposed by He et al., by redesigning the network's SPP layer as a single ROI (Region of Interest) Pooling layer. Additionally, the algorithm employs SVD (Singular Value Decomposition) to speed up inference by decomposing the fully connected layers. Also in 2015, Ren et al. proposed the Faster R-CNN algorithm, which introduces RPN (Region Proposal Networks) and achieves near real-time detection results. However, computational redundancy remains a concern in the detection stage of the algorithm.

Figure 1. Schematic diagram of the network structure of YOLOv5s.

Figure 2. Schematic diagram of our improved network structure for YOLOv5s.

Improvements to the Backbone

The YOLOv5s model employs DarkNet as its backbone network for feature extraction. The network consists of multiple traditional convolution (Conv) modules with a kernel size of 3 × 3, a stride of 2, and varying numbers of output channels. However, this design leads to large model parameters and slow computation. Additionally, stacking multiple traditional convolution operations can discard fine-grained information from the original image, so the feature map used for prediction may lack crucial details of small target objects, reducing the model's recognition accuracy and its sensitivity in discriminating small targets. In this chapter, all Conv modules in the YOLOv5s backbone network are replaced with SPD-Conv modules. This replacement preserves the feature information of small targets to the maximum extent, minimizes the loss of detailed information, and enhances the backbone network's capability for feature extraction; the model becomes more adept at capturing and retaining fine-grained details. The SPD-Conv module is specially designed for low-resolution images and small-object detection tasks and can be applied to most, or even all, CNN architectures; it overcomes the fine-grained information loss and inefficient feature learning to which traditional Conv-based CNN networks are prone. The SPD-Conv module consists of a Space-to-Depth (SPD) layer and a single-stride convolutional layer. The schematic diagram of the entire process is shown in Figure 3, where (a) is an intermediate feature map output by the CNN at any stage of feature extraction; after processing by the Space-to-Depth layer in (b) and (c), the feature map $X'$ (d) is obtained; after processing by the single-stride convolutional layer, the feature map $X''$ (e) is obtained, which is the final output of the SPD-Conv module.

Figure 3. Procedure map of the SPD block.

Figure 4. The structural diagram of the CA attention mechanism module.

Figure 5. The shapes of SiLU and LeakyReLU.

Drug traceability code detection experiments

Data Collection and Preprocessing

The drug traceability code recognition experiment uses a dataset of images captured from the aluminium foil outer packaging of drugs bearing the traceability code, collected throughout the drug production process. A total of 1,586 images are included, covering complete and damaged traceability codes, different poses, and different degrees of reflection. The collected images are labelled with LabelImg in YOLO format: each drug traceability code is fully enclosed in a rectangular box, and a txt annotation file containing the key elements is generated for each image. The dataset is randomly divided into training, validation, and test sets at a ratio of 8:1:1, yielding 1,270 training images, 158 validation images, and 158 test images. The test set contains 79 normal traceability codes and 79 damaged traceability codes. The figure below shows a sample of a complete drug traceability code picture taken by the camera during inspection.

Figure 6. A sample of a complete drug traceability code picture.
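The random 8:1:1 split described in the data preparation above can be sketched as follows; note that simple truncation gives subset sizes that differ slightly from the paper's 1270/158/158, which depends on the exact rounding used, and the file names and seed are assumptions for illustration.

```python
import random

def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle and split a list of sample paths into train/val/test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)   # fixed seed for a reproducible split
    n_train = int(len(items) * ratios[0])
    n_val = int(len(items) * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_dataset([f"img_{i:04d}.jpg" for i in range(1586)])
print(len(train), len(val), len(test))
```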
The detection results of the basic YOLOv5s model and of our improved model are shown in Figure 7: the results of the basic YOLOv5s model are presented in images (a) through (e), and those of our improved model in images (f) through (j).

Figure 7. Examples comparing model detection performance before and after the improvement.

Table 1. The experimental parameter configuration.

During the evaluation of model performance, several key indicators are utilized, including mean average precision (mAP), number of model parameters, model computation amount (FLOPs), and frames per second (FPS). These metrics are commonly employed in object detection to assess the effectiveness of models. Among these indicators, $AP_i$ represents the average precision for the i-th category, while C denotes the total number of categories; P signifies the precision rate, and $p(r)$ represents the smoothed precision-recall curve, whose integration gives the area under the curve. TP corresponds to the number of successfully predicted positive examples, while FP represents the number of negative examples misjudged as positive by the model. The calculation formulas are as follows:

$P = \frac{TP}{TP + FP}$

$AP = \int_0^1 p(r)\,dr$

$mAP = \frac{1}{C}\sum_{i=1}^{C} AP_i$
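The AP integral over the smoothed precision-recall curve can be approximated numerically. Below is a sketch in the VOC style (monotone smoothing of the precision envelope, then summing rectangle areas where recall increases); the precision/recall points are made-up values for illustration, not results from this paper.

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the monotonically smoothed precision-recall curve (sketch)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(p) - 2, -1, -1):   # enforce a non-increasing precision envelope
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]    # points where recall actually increases
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# made-up precision/recall points for the two categories (normal / damaged)
ap_normal = average_precision(np.array([0.0, 0.5, 1.0]), np.array([1.0, 1.0, 0.8]))
ap_damaged = average_precision(np.array([0.0, 0.5, 1.0]), np.array([1.0, 0.9, 0.7]))
map50 = (ap_normal + ap_damaged) / 2      # mAP is the mean AP over the C categories
print(ap_normal, ap_damaged, map50)
```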

Table 2. Results of the ablation experiments.