Detecting defects in fused deposition modeling based on improved YOLO v4

Fused deposition modeling brings many conveniences to the manufacturing industry, but defects frequently appear in actual production due to the nature of the FDM mechanism itself. Although some deep learning-based object detection models show excellent performance in detecting defects in the additive manufacturing process, their detection efficiency is relatively low, and their performance degrades when large numbers of defects must be processed. In this paper, an improved model based on the YOLO v4 network structure is developed. We lighten the model and modify its loss function to achieve better performance. Experimental results show that the improved model, MobileNetV2-YOLO v4, achieves a mAP of 98.96% and an FPS of 50.8 after training, attaining higher detection accuracy and faster detection speed than the original YOLO v4 model. Through testing, the improved model can accurately identify the location and class of target defects, showing great potential for real-time detection in the additive manufacturing process.


Introduction
Additive manufacturing (AM) is an advanced manufacturing technology that integrates multiple disciplines such as mechanical manufacturing, materials science, and computer science. This technology enables the manufacturing of parts by accumulating material layer by layer, making it possible to manufacture complex structural parts that were previously unattainable due to the constraints of traditional manufacturing methods [1,2]. As one of the most widely used and lowest-cost 3D printing technologies, fused deposition modeling (FDM) extrudes material through a nozzle layer by layer to form a 3D structure, where the layers are obtained by slicing software that slices the 3D design in fixed increments [3][4][5]. Figure 1 briefly illustrates the working principle of an FDM printer. It is mainly composed of a filament feeding mechanism, a motion mechanism, a nozzle, a thermocouple, a print bed, and a control system.
Despite the many conveniences that FDM brings to the manufacturing industry, some deficiencies caused by the nature of the FDM mechanism limit its industrial use [6]. An air gap refers to the gap between two adjacent fuses on a deposited layer [7]. Under normal conditions, adjacent fuses extruded by the printer exhibit a tight fit (see figure 2(a)). However, for several reasons, including poor printer calibration and incorrect selection of operating parameters, such gaps may occur during the printing process (see figures 2(c) and (e)). Once these defects appear during printing, they are difficult to eliminate by adjusting the printing parameters. When they accumulate to a certain extent, the printed part faces serious problems such as low strength, low toughness, and rough surfaces [8,9]. Therefore, fuses in printed parts often need to be kept in a tight fit to improve their mechanical properties [10]. The most effective way to reduce the occurrence of these issues is to monitor the production process of FDM and other additive manufacturing technologies in real time, which is itself a challenge.
Today, given the great complexity of manufacturing engineering and the large number of parameters involved, conventional approaches are no longer sufficient [11]. The rapid development of deep learning algorithms has provided numerous benefits within the realm of additive manufacturing, and integrating these algorithms with additive manufacturing has emerged as a way to surmount the challenges currently encountered in this technology [12]. Hence, many researchers have turned to deep learning algorithms to address issues in additive manufacturing, including design optimization, real-time monitoring, performance prediction, and energy management [13][14][15][16]. Some researchers collect large amounts of data through simulations or experiments to train deep learning models that predict the mechanical properties of parts [17,18]. These studies emphasize using the powerful computing capacity of deep learning models to summarize underlying laws from big data and make predictions in subsequent applications; however, such operations based on mathematical models often have limitations. Other researchers use computer vision and deep learning to achieve defect detection in the additive manufacturing production process. Jin et al developed a real-time monitoring and automatic correction system for surface defects of FDM prints [19,20]. Khan et al developed convolutional neural network deep learning models, based on feature extraction of geometric anomalies occurring in fill patterns, to detect malicious defects in real time, preventing production losses and reducing manual involvement in quality inspection [21]. Deep learning models for recognizing other types of images (such as CT and thermal images) have also been developed [22,23]. The studies mentioned above use deep learning models to extract features from images acquired during additive manufacturing and accurately pinpoint the location of defects. These methods can instantly identify the occurrence of every defect as well as the process parameters that contribute to it [24]. It should be noted, however, that these studies focus on the application of the model and ignore the performance of the model itself. Because of its complex layer structure and large number of parameters, the calculation process of such a model becomes time-consuming, and when processing large amounts of data its performance falls short of expectations. Therefore, improving the structure of deep learning models is itself a major challenge.
In this paper, the research focus is to improve the deep learning algorithm to avoid these drawbacks when processing large amounts of data. To achieve this goal, the model improvement schemes include lightweight processing and loss function optimization. The structure of this paper is as follows. Section 2 introduces the deep learning models used in this paper and the ideas behind the algorithm improvements. Section 3 describes the experimental setup used for the study and the training process of the model. Section 4 discusses the training results and performance evaluation of the model. Section 5 summarizes the work and proposes future directions.

Deep learning model
In this section, the working principles of two deep learning algorithms, YOLO v4 and MobileNetV2, are introduced. In addition, this section details the idea of how the two algorithms are combined.

You only look once (YOLO) v4
The core strengths of the You Only Look Once (YOLO) target detection algorithm are its small model size and fast computation [25]. Unlike two-stage target detection algorithms such as RCNN, YOLO only needs to pass the image through the network once to obtain the final detection result. For detection, the input image is divided into grids of different sizes, and each grid cell is responsible for a different region: if the center of a target falls in a grid cell, that cell is responsible for detecting the target. The structure of YOLO v4 is shown in figure 3 and can be summarized into three parts: backbone, neck, and head.
The backbone feature extraction network of YOLO v4, called CSPDarknet53, introduces the Resblock_body module, which is essentially a large convolutional block composed of a series of residual network structures. When an image (416 × 416 × 3, for example) is input to CSPDarknet53, feature extraction is performed by successive convolutions, compressing the width and height of the input while expanding the number of feature channels. This is repeated five times, finally yielding three effective feature layers: 52 × 52 × 256, 26 × 26 × 512, and 13 × 13 × 1024. The neck of YOLO v4 consists of the Spatial Pyramid Pooling (SPP) [26] structure and the Path Aggregation Network (PANet) [27] structure, which enhance the feature extraction of the network from the input images. The SPP structure applies max pooling at four different scales, with kernel sizes of 13 × 13, 9 × 9, 5 × 5, and 1 × 1, after three convolutions of the last feature layer of CSPDarknet53; this greatly increases the receptive field and separates out the most significant contextual features. After pooling, the feature maps are stacked and convolved three times again, then fed into the PANet structure together with the other effective feature layers. In the PANet structure, the three effective feature layers undergo feature fusion to produce the prediction results in the YOLO head. Each prediction result can be expressed as a tensor of N × N × 3 × (5 + C), where N is the number of grid cells along each side of the image and C is the number of detected target classes. This prediction result does not yet correspond to the final prediction boxes on the image; decoding is still needed to obtain the final prediction result. Once the decoded predictions are sorted by score and filtered by non-maximum suppression (NMS), they are drawn directly on the image as the output.
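To make the decoding step concrete, the following is a minimal sketch (in PyTorch) of how a single YOLO head output of shape N × N × 3 × (5 + C) could be decoded and filtered with NMS. The anchor sizes, thresholds, and helper name are illustrative assumptions, not values from this paper.

```python
# Minimal sketch of decoding one YOLO head output and filtering it with NMS.
# Anchor sizes, thresholds, and the helper name are illustrative assumptions.
import torch
from torchvision.ops import nms

def decode_head(pred, anchors, num_classes, img_size=416,
                conf_thresh=0.5, iou_thresh=0.45):
    """pred: raw head tensor of shape (N, N, 3 * (5 + num_classes));
    anchors: tensor of shape (3, 2) holding (w, h) in pixels."""
    n = pred.shape[0]
    pred = pred.view(n, n, 3, 5 + num_classes)
    stride = img_size / n

    # Each grid cell offsets the predicted box center it is responsible for.
    ys, xs = torch.meshgrid(torch.arange(n), torch.arange(n), indexing="ij")
    cx = (torch.sigmoid(pred[..., 0]) + xs.unsqueeze(-1)) * stride
    cy = (torch.sigmoid(pred[..., 1]) + ys.unsqueeze(-1)) * stride
    w = torch.exp(pred[..., 2]) * anchors[:, 0]   # anchor-relative width
    h = torch.exp(pred[..., 3]) * anchors[:, 1]   # anchor-relative height
    obj = torch.sigmoid(pred[..., 4])             # objectness confidence
    cls_prob, cls_id = torch.sigmoid(pred[..., 5:]).max(dim=-1)

    # Score = objectness * class probability; keep confident boxes, then NMS.
    scores = (obj * cls_prob).flatten()
    boxes = torch.stack([cx - w / 2, cy - h / 2,
                         cx + w / 2, cy + h / 2], dim=-1).reshape(-1, 4)
    keep = scores > conf_thresh
    boxes, scores, cls_id = boxes[keep], scores[keep], cls_id.flatten()[keep]
    kept = nms(boxes, scores, iou_thresh)
    return boxes[kept], scores[kept], cls_id[kept]
```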

MobileNetV2
The MobileNet [28] family of models is a series of lightweight deep neural networks proposed by Google for embedded devices such as cell phones, built around the core idea of depthwise separable convolution. Figure 4 shows the structure of a depthwise separable convolution, which can be regarded as a splitting of an ordinary convolution. In the first part, each channel of the input feature map is convolved with its own single-channel kernel, which performs feature extraction. In the second part, the result of the first convolution is convolved with several kernels of size 1 × 1, which adjust the number of channels.
We assume that the size of the input feature map is N × N × C₁, the size of the output feature map is N × N × C₂, and the size of the convolution kernel used for feature extraction is K × K. The computational cost P₁ of the depthwise separable convolution is given in equation (1):

P₁ = N² × K² × C₁ + N² × C₁ × C₂ (1)

The computational cost P₂ of an ordinary convolution is given in equation (2):

P₂ = N² × K² × C₁ × C₂ (2)

Taking the ratio of equation (1) to equation (2) yields

P₁/P₂ = 1/C₂ + 1/K² (3)

As can be seen from equation (3), the computation of a depthwise separable convolution is much smaller than that of an ordinary convolution. This means that training and inference time can be effectively reduced when the MobileNet family of networks is used.
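As a concrete check, here is a small PyTorch sketch comparing an ordinary 3 × 3 convolution with its depthwise separable counterpart; parameter counts obey the same ratio as the computational cost in equation (3), and the channel sizes C₁ = 64, C₂ = 128 are arbitrary example values.

```python
# Sketch: an ordinary 3 x 3 convolution versus its depthwise separable
# counterpart; parameter counts follow the same ratio as equation (3).
import torch.nn as nn

c1, c2, k = 64, 128, 3  # example channel sizes and kernel size

ordinary = nn.Conv2d(c1, c2, k, padding=1, bias=False)

depthwise_separable = nn.Sequential(
    # Depthwise: one single-channel k x k kernel per input channel.
    nn.Conv2d(c1, c1, k, padding=1, groups=c1, bias=False),
    # Pointwise: 1 x 1 kernels that adjust the number of channels.
    nn.Conv2d(c1, c2, 1, bias=False),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(ordinary))             # k^2 * c1 * c2      = 73728
print(count(depthwise_separable))  # k^2 * c1 + c1 * c2 = 8768
print(count(depthwise_separable) / count(ordinary))  # 1/c2 + 1/k^2 ≈ 0.119
```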
In this paper, MobileNetV2 is chosen as the model to replace the YOLO v4 backbone network. MobileNetV2 [29], an upgraded version of MobileNet, introduces an inverted residual module and a linear bottleneck layer on top of the depthwise separable convolution. Its structural module can be divided into a backbone part and a residual edge part, as shown in figure 5(a). The backbone part first uses 1 × 1 convolution for dimensionality enhancement, then uses 3 × 3 depthwise convolution for feature extraction, and finally uses 1 × 1 convolution for dimensionality reduction. The whole process can be briefly summarized as 'dimensionality enhancement - convolution - dimensionality reduction'; in the residual edge part, the input is directly connected to the output. Figure 5(b) shows part of the structure of MobileNetV2, and this part can be used directly as the backbone network of YOLO v4.
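The following is a minimal PyTorch sketch of the inverted residual structure described above; the expansion factor of 6 and the channel sizes are illustrative assumptions rather than values taken from the paper.

```python
# Minimal sketch of a MobileNetV2 inverted residual block: 1 x 1 expansion,
# 3 x 3 depthwise convolution, 1 x 1 linear projection, plus a residual
# edge when input and output shapes match.
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, c_in, c_out, stride=1, expand=6):
        super().__init__()
        c_mid = c_in * expand
        self.use_shortcut = stride == 1 and c_in == c_out
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, bias=False),   # dimensionality enhancement
            nn.BatchNorm2d(c_mid),
            nn.ReLU6(inplace=True),
            nn.Conv2d(c_mid, c_mid, 3, stride, 1,    # depthwise feature extraction
                      groups=c_mid, bias=False),
            nn.BatchNorm2d(c_mid),
            nn.ReLU6(inplace=True),
            nn.Conv2d(c_mid, c_out, 1, bias=False),  # dimensionality reduction
            nn.BatchNorm2d(c_out),                   # no activation: linear bottleneck
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_shortcut else out

print(InvertedResidual(32, 32)(torch.randn(1, 32, 52, 52)).shape)
# torch.Size([1, 32, 52, 52])
```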

Lightweight processing
As can be seen from section 2.1, the three effective feature layers obtained from the backbone feature extraction network of YOLO v4 are input to the neck network for feature fusion. Here we select 52 × 52 × 32, 26 × 26 × 96, and 13 × 13 × 320 as the effective feature layers of MobileNetV2. By replacing CSPDarknet53 with MobileNetV2, we obtain a preliminary lightweight YOLO v4. At this point, the number of parameters in the network is reduced to 60.69% of the original. To further reduce the number of parameters, we replace the ordinary convolutions in the PANet structure with depthwise separable convolutions, which reduces the number of parameters to 16.78% of the original. Table 1 lists the number of parameters for YOLO v4, MobileNetV2-YOLO v4 without the modified PANet structure, and MobileNetV2-YOLO v4 with the full modification. The overall structure of MobileNetV2-YOLO v4 is shown in figure 5.
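As an illustration of where the three effective feature layers come from, the sketch below pulls intermediate outputs from torchvision's MobileNetV2 implementation; the split indices assume torchvision's layer ordering and are our reconstruction, not the paper's code.

```python
# Sketch: extracting the three effective feature layers (52 x 52 x 32,
# 26 x 26 x 96, 13 x 13 x 320) from torchvision's MobileNetV2. The split
# indices assume torchvision's layer ordering.
import torch
from torchvision.models import mobilenet_v2

features = mobilenet_v2().features
stages = [features[:7], features[7:14], features[14:18]]

x = torch.randn(1, 3, 416, 416)
for stage in stages:
    x = stage(x)
    print(tuple(x.shape))
# Expected: (1, 32, 52, 52), (1, 96, 26, 26), (1, 320, 13, 13)
```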

Improvements to the loss function
The loss function is commonly used to calculate the difference between the predicted values output by the network model and the true values, and to update the model parameters through the backpropagation algorithm so as to guide the network model toward more accurate predictions. The loss function of the YOLO v4 model is composed of three parts: the complete IoU (CIoU) loss [30], the classification loss, and the confidence loss. The classification loss measures the error between the object class predicted for a detection box and the actual object class; the confidence loss measures the difference between the confidence level predicted by the model and the actual presence of the target. As an important part of the loss function of the whole model, the CIoU loss measures the intersection over union (IoU) between the prediction box and the ground truth (GT), the distance between their center points, and their aspect ratios. Figure 6 shows the calculation principle of the CIoU loss; we assume that box A is the prediction box output by the model and box B is the ground truth marked in advance. The CIoU loss is calculated as follows:

IoU = (A ∩ B)/(A ∪ B) (4)

L_CIoU = 1 − IoU + ρ²(b, b^GT)/c² + αv (5)

v = (4/π²)(arctan(w^GT/h^GT) − arctan(w/h)) (6)

α = v/((1 − IoU) + v) (7)

where ρ(b, b^GT) represents the Euclidean distance between the center points of the prediction box b and the ground truth b^GT, c represents the diagonal length of the minimum bounding rectangle enclosing the prediction box and the ground truth, w and h represent the width and height of the prediction box, and w^GT and h^GT represent the width and height of the ground truth. In the penalty term for aspect ratio, α is a positive trade-off coefficient and v measures the consistency of the aspect ratios.
From equations (5)-(7), we can see that v reflects the difference in the aspect ratios of the prediction box and the ground truth, not the actual difference in their widths and heights. This loss calculation is too coarse, so the model sometimes cannot be optimized efficiently. To avoid this situation, we replace the CIoU loss used in the YOLO v4 model with the EIoU loss [31], which is calculated as follows:

L_EIoU = 1 − IoU + ρ²(b, b^GT)/c² + ρ²(w, w^GT)/c_w² + ρ²(h, h^GT)/c_h² (8)

where c_w and c_h represent the width and height of the minimum bounding rectangle. As equation (8) shows, the EIoU loss improves on the CIoU loss by splitting the aspect-ratio penalty into separate width and height losses, each measuring the difference between the prediction box and the ground truth and normalized by the size of the minimum bounding rectangle. Directly calculating the width and height losses lets the model continuously narrow the gap between the width and height of the prediction box and those of the ground truth in subsequent predictions, thereby accelerating convergence and improving accuracy.
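A minimal PyTorch sketch of equation (8) for batches of axis-aligned boxes in (x1, y1, x2, y2) form follows; it is a didactic reimplementation under our own conventions, not the training code used in this paper.

```python
# Minimal sketch of the EIoU loss of equation (8) for batches of
# axis-aligned boxes given as (x1, y1, x2, y2) tensors of shape (B, 4).
import torch

def eiou_loss(pred, gt, eps=1e-7):
    # IoU term.
    ix1 = torch.max(pred[:, 0], gt[:, 0]); iy1 = torch.max(pred[:, 1], gt[:, 1])
    ix2 = torch.min(pred[:, 2], gt[:, 2]); iy2 = torch.min(pred[:, 3], gt[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + eps)

    # Widths, heights, and center points of both boxes.
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_g, h_g = gt[:, 2] - gt[:, 0], gt[:, 3] - gt[:, 1]
    cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx_g, cy_g = (gt[:, 0] + gt[:, 2]) / 2, (gt[:, 1] + gt[:, 3]) / 2

    # Minimum bounding rectangle of prediction and ground truth.
    cw = torch.max(pred[:, 2], gt[:, 2]) - torch.min(pred[:, 0], gt[:, 0])
    ch = torch.max(pred[:, 3], gt[:, 3]) - torch.min(pred[:, 1], gt[:, 1])

    # Center-distance term plus direct width and height losses, equation (8).
    center = ((cx_p - cx_g) ** 2 + (cy_p - cy_g) ** 2) / (cw ** 2 + ch ** 2 + eps)
    return (1 - iou + center
            + (w_p - w_g) ** 2 / (cw ** 2 + eps)
            + (h_p - h_g) ** 2 / (ch ** 2 + eps))
```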

Acquisition of training data
In this work, all printed specimens were produced by a CR-6 SE FDM printer using white polylactic acid (PLA) filament. The experimental setup is shown in figure 7. Before printing, we used Simplify3D as the slicing software to generate G-code files, which were transferred to the printer. The target defects we wanted to detect were made more visible by adjusting the extrusion rate of the printer or by modifying the content of the G-code file. To obtain clear training data, a camera was set up above the printer to take pictures of the printed specimens at random times during the printing process. As shown in figures 2(b)-(f), the acquired images were labeled with the appropriate tags: 'defect' and 'no-defect'.
To give the model better generalization ability and higher robustness, this paper uses Mosaic data augmentation to augment the training data. Mosaic data augmentation randomly crops four images and stitches them into a single image used as training data. In this way, multiple images each containing a single defect category can be randomly combined into a new image with multiple scales and categories, enriching the background features of the images.
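The NumPy sketch below illustrates the stitching step of Mosaic augmentation; a real pipeline must also remap the bounding-box labels into the new image, which is omitted here, and the function name and size choices are illustrative.

```python
# Sketch of the stitching step of Mosaic augmentation: four source images
# are randomly cropped and assembled around a random center point.
# Remapping the bounding-box labels into the new image is omitted here.
import random
import numpy as np

def mosaic(images, out_size=416):
    """images: four H x W x 3 uint8 arrays, each at least out_size x out_size."""
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    # A random stitch point splits the canvas into four quadrants.
    cx = random.randint(out_size // 4, 3 * out_size // 4)
    cy = random.randint(out_size // 4, 3 * out_size // 4)
    quadrants = [(0, 0, cx, cy), (cx, 0, out_size, cy),
                 (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, quadrants):
        h, w = y2 - y1, x2 - x1
        # Take a random crop of matching size from the source image.
        top = random.randint(0, img.shape[0] - h)
        left = random.randint(0, img.shape[1] - w)
        canvas[y1:y2, x1:x2] = img[top:top + h, left:left + w]
    return canvas
```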

Training procedure
The dataset used in this paper consists of about 600 collected images containing both defective and good-quality features. By random division, 90% of the images in the dataset were used as the training set for model training, and the remaining images were used as the test set to evaluate the performance of the model.
The improvements to the model described above are twofold: lightweight processing and modification of the loss function. To further study the influence of these schemes on model performance, we designed models with different combinations of schemes for comparative training; the specific combinations are shown in table 2, where '√' indicates that the model adopts the corresponding improvement scheme and '-' indicates that it keeps the original structure or function. It should be emphasized that all models use the same set of training parameters, detailed in table 3, so that their performance can be compared directly.

From table 2, Model 1 adopts both improvement schemes, lightweight processing and the EIoU loss; Model 2 only replaces the backbone network; and Model 3 adopts no improvement scheme, serving as the original network model for comparison. Figure 8 shows the training curves of these models under the same parameter settings, and the values used to evaluate detection accuracy and real-time performance are listed in table 4. The results show that replacing the original backbone feature extraction network with a simpler lightweight structure effectively accelerates convergence during training. Compared with Model 3, Models 1 and 2, which use lightweight processing, achieve improvements in detection accuracy of 4.14% and 1.74%, respectively, along with greatly improved detection speed. Because Model 1 also changes the loss function, and the EIoU loss accelerates the regression of the predicted box width and height, Model 1 obtains the highest detection accuracy of these models, with a mAP of 98.96%, and its detection speed also increases slightly compared with Model 2, reaching an FPS of 50.8.

Comparison with other detection models
To further verify the performance of MobileNetV2-YOLO v4 in detecting defects, we trained other detection models under the same conditions. In this comparative experiment, we selected YOLO v5 and ResNet50 as reference models to analyze the feasibility of applying the proposed model to industrial production. Figure 9 shows the training curves of these models, and their evaluation index values are listed in detail in table 5.
Since YOLO v5 has gone through many versions, we select the model with the smallest network depth in the series as the reference. The results show that MobileNetV2-YOLO v4 improves detection accuracy by 1.89% compared with YOLO v5. Although it is slightly slower than YOLO v5, its detection speed can still meet the needs of actual production. Benefiting from a deep residual-block network structure that can fully extract the characteristics of target defects, ResNet50 obtains slightly higher detection accuracy than MobileNetV2-YOLO v4 in the test. However, limited by the structure of its two-stage object detection pipeline, its detection speed is much lower than that of all the other detection models mentioned in this paper.

Application of the model
To verify the actual detection performance of the MobileNetV2-YOLO v4 model, a surface image of a workpiece printed at a low extrusion rate was input into the different models and the results were compared. Figure 10 shows the outputs of the three models, and table 6 details the time each model required to inspect the image. As can be seen from figure 10, the proposed model accurately identifies the defective surface areas and performs excellently in terms of detection speed: it is comparable to YOLO v5 and significantly faster than the two-stage detection model ResNet50. This demonstrates the feasibility of defect detection using the improved model.
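For reference, per-image inference times and FPS figures like those in table 6 can be obtained with a simple timing loop; the sketch below is a generic pattern, where `model` and `image` are placeholders for any of the compared detectors and a preprocessed input tensor, not names from this paper.

```python
# Generic timing loop for per-image inference time and FPS; `model` and
# `image` are placeholders for a detector and a preprocessed input tensor.
import time
import torch

@torch.no_grad()
def measure_fps(model, image, warmup=10, runs=100):
    model.eval()
    for _ in range(warmup):           # warm-up iterations stabilize timings
        model(image)
    if torch.cuda.is_available():
        torch.cuda.synchronize()      # wait for queued GPU work before timing
    start = time.perf_counter()
    for _ in range(runs):
        model(image)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    per_image = (time.perf_counter() - start) / runs
    return 1.0 / per_image            # frames per second
```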

Conclusions
This paper presents a lightweight neural network based on the YOLO v4 algorithm model, which achieves better performance by adopting two improvement schemes: lightweight processing and modification of the loss function. From the comparison of different combinations of improvement schemes and the comparison with other detection models, the following conclusions are drawn:
• With lightweight processing and the improved loss function, the MobileNetV2-YOLO v4 model shows better detection accuracy and detection speed than the original model during training, with a mAP of 98.96% and a processing rate of 50.8 frames per second (FPS).
• Compared with the two classical object detection models YOLO v5 and ResNet50, the detection accuracy of MobileNetV2-YOLO v4 is higher than that of YOLO v5 and slightly lower than that of ResNet50, while its detection speed is close to that of YOLO v5 and significantly higher than that of ResNet50. The performance of MobileNetV2-YOLO v4 therefore meets the requirements of real-time detection.
• In identifying the surface images of a workpiece printed at a low extrusion rate, MobileNetV2-YOLO v4 can accurately locate and detect defects, verifying its good inspection performance.
Future work to improve the model includes enriching the dataset to improve the model's generalization performance. In addition, the defect detection in this paper is based on localization with rectangular boxes; in the future, pixel-level defect detection could be carried out through semantic segmentation and other methods to achieve more detailed defect-area detection.

Data availability statement
All data that support the findings of this study are included within the article (and any supplementary files).