Classification of wafer map defects based on ResNeSt

With the increasing circuit complexity of chip design, wafers are prone to defects during production. Since different defect patterns arise from different causes, post-analysis is important to wafer production. However, the sample sizes of the defect classes are imbalanced, and the effectiveness of classical residual networks is limited. This article uses a Split-Attention network (ResNeSt), in which the feature maps are split into k cardinal groups, each consisting of r splits. Thanks to the large receptive field, image features can be captured more effectively. We add channel attention and assign higher weights to the effective feature channels to improve network performance. Compared with the classical residual network, the new network improves accuracy by 43.06%, 32.73%, and 10.86% on the Scratch, Near-full, and Local defects, respectively.


Introduction
With the advance of informatization across industries, the progress of integrated-circuit technology bears on national security and economic development. Wafer manufacturing, as the basis of semiconductor manufacturing, fundamentally determines the level of the semiconductor industry. Unfortunately, the current foundation of the Chinese semiconductor industry is still underdeveloped, and wafer manufacturing is one of its weak points. Therefore, the post-analysis of wafer defects has become an important method for improving wafer yield.
Early research on wafer maps relied mainly on statistical methods. The development of deep learning addressed the problem of incomplete feature extraction. Liu [1] used an SDAE-based model to identify defects effectively. Fang and Shi [2] used the ZFNet convolutional neural network and an improved Faster R-CNN classifier for classification and detection.
Increasing the depth of a neural network does improve test accuracy. Unfortunately, training a CNN becomes increasingly difficult as the network deepens, and deep networks are prone to vanishing and exploding gradients [3]. The residual neural network (ResNet) was proposed to solve these problems. It uses shortcut connections to combine low-level feature information with deep feature information, so that useful features from shallow layers can still be reached in the deep layers, which improves performance [4][5].
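The shortcut connection described above can be sketched in a few lines. This is a minimal NumPy illustration of the idea, not the paper's implementation; `f` is a stand-in for the block's convolutional transform:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, f):
    """Residual block: the block's transform f(x) is added back to the
    input x via the shortcut connection, then passed through ReLU."""
    return relu(f(x) + x)

# Toy transform standing in for the block's convolutions.
transform = lambda x: 0.5 * x

x = np.array([1.0, -2.0, 3.0])
y = residual_block(x, transform)  # relu(0.5*x + x) = relu(1.5*x)
```

Because the shortcut passes `x` through unchanged, the gradient of the output with respect to the input always contains an identity term, which is what mitigates vanishing gradients in deep stacks of such blocks.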
Since the advent of ResNet, many derived networks have appeared, such as GoogLeNet, ResNeXt, SE-Net, and SK-Net, each improving on the original ResNet in some respect. Zhang et al. [6] proposed ResNeSt, which adopts a Split-Attention mechanism. Each ResNeSt block divides the feature map into k cardinal groups, and each cardinal group is further split into r groups. The weight of each split is determined by a weighted combination of the global context information it represents [7].

Dataset
The dataset WM-811K is collected from real production processes and contains eight recognized types of basic wafer maps, shown in Figure 1. These images reflect specific abnormal behaviors of certain processes during wafer production. For example, the Center defect [8] occurs in the film-deposition step. The Edge-Loc type [9] is caused in most cases by uneven heating during diffusion. The Scratch type [10] is generally caused by human error during transportation and handling, or by faults in chemical-mechanical polishing. The dataset contains 811,457 images in total, of which approximately 20% (172,950 images) were labeled by experts.

Network structure
Similar to ResNeXt, the feature map is divided into k cardinal groups. However, each cardinal group of a ResNeSt block further consists of r splits, which differs from ResNeXt, so the input features are divided into k × r independent groups.
The split-attention module inside a cardinal group is shown in Figure 3. Within each cardinal group, after the split feature maps pass through the convolution layers, the model sums the resulting features and applies global average pooling to match the dimensions of the pooled feature to the split feature maps. Two sets of 1×1 convolutions then produce weight coefficients for the feature vectors, and an r-softmax recalculates the weight of each split. Finally, each split feature is multiplied by its corresponding weight and the results are summed, aggregating the features within the cardinal group. Both ends of the ResNeSt block are then linked by the standard residual structure, and the new features are fed into the next layer of the network.
As an improvement of ResNet, ResNeSt retains the residual structure and introduces split attention, which captures cross-feature interaction information, reduces computation, and improves the representational ability of the network.
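The weighting steps described above (sum the splits, global-average-pool, two dense transforms, r-softmax, weighted sum) can be sketched as follows. This is a simplified NumPy illustration of split attention within a single cardinal group; the shapes, weight matrices `w1`/`w2`, and hidden size are hypothetical, and the two dense layers stand in for the 1×1 convolutions:

```python
import numpy as np

def softmax(z, axis=0):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def split_attention(splits, w1, w2):
    """Aggregate r split feature maps (each of shape C x H x W) within
    one cardinal group: sum the splits, global-average-pool, apply two
    dense layers, take the r-softmax over the splits, and form the
    weighted sum of the splits."""
    r, C = len(splits), splits[0].shape[0]
    gathered = sum(splits)                          # element-wise sum over splits
    s = gathered.mean(axis=(1, 2))                  # global average pooling -> (C,)
    z = np.maximum(w1 @ s, 0.0)                     # first 1x1 conv (as dense) + ReLU
    attn = softmax((w2 @ z).reshape(r, C), axis=0)  # r-softmax: per-channel weights
    return sum(a[:, None, None] * x for a, x in zip(attn, splits))

# Hypothetical shapes for illustration: r=2 splits, C=4 channels, 3x3 maps.
rng = np.random.default_rng(0)
C, H, W, hidden, r = 4, 3, 3, 8, 2
w1 = rng.standard_normal((hidden, C))
w2 = rng.standard_normal((r * C, hidden))
splits = [rng.standard_normal((C, H, W)) for _ in range(r)]
out = split_attention(splits, w1, w2)               # shape (C, H, W)
```

Since the r-softmax weights sum to one per channel, the module is a convex per-channel combination of the splits: if all splits were identical, the output would reproduce them unchanged.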

Experiment results and analysis
In this paper, 19,885 wafer maps are selected as the dataset and randomly divided into two parts at a ratio of 1:4: the set with 3,977 maps is used for training, and the remaining 15,908 maps are used as the test set. To stabilize the classification results, the training set is used for 10 training rounds, after which the test set is used for classification; the classification test was conducted in 10 rounds. The per-class accuracy of the proposed model and of the control group, ResNet50, is shown in Figure 4, and the overall accuracy of the 10 classification tests is shown in Table 1. As Figure 4 and Table 1 show, ResNeSt performs better than ResNet on this imbalanced dataset. After removing the maximum and minimum values and averaging the rest, the average accuracy of ResNeSt50 is 96.97%, higher than the 94.07% of ResNet50 (Table 1). The accuracy of ResNeSt50 for every class except Random is higher than that of ResNet50. This advantage is particularly evident in the low-volume categories, which shows that ResNeSt enhances feature capture. Compared with the classical residual network, the improvements are 43.06%, 32.73%, and 10.86% for the Scratch, Near-full, and Local defects, respectively, with Scratch showing the largest gain.
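The trimmed average used above (dropping the single best and worst of the 10 runs before averaging) can be computed as follows. The per-round accuracy values here are hypothetical placeholders for illustration, not the paper's measurements:

```python
import numpy as np

def trimmed_mean(accuracies):
    """Average after removing the single maximum and single minimum,
    as done for the 10 classification tests in the text."""
    a = np.sort(np.asarray(accuracies, dtype=float))
    return a[1:-1].mean()

# Hypothetical per-round accuracies for illustration only.
runs = [0.95, 0.97, 0.96, 0.98, 0.94, 0.97, 0.96, 0.95, 0.99, 0.93]
avg = trimmed_mean(runs)
```

Trimming one value from each end reduces the influence of a single unusually good or bad round on the reported average.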

Conclusion
A wafer-defect detection method based on a split-attention residual network is proposed. The average accuracy of the classification results obtained with the split-attention network is 95.41%, which shows the effectiveness of this method.

Table 1. The overall accuracy of the 10 classification tests.