Wafer Defect Identification with Optimal Hyper-Parameter Tuning of Support Vector Machine using the Deep Feature of ResNet 101

As semiconductor processing technologies continue to advance, wafers are becoming more densely packed and intricate, resulting in a higher incidence of surface imperfections. Detecting these defects early and classifying them accurately is therefore crucial for pinpointing their root causes in the manufacturing process and ultimately improving yield, making defect detection critical in the industrial production of monocrystalline silicon. This study employs deep learning techniques to propose a framework for detecting defects on silicon wafers, focusing on optimizing the hyperparameters of the support vector machine (SVM). Three methods were utilized to fine-tune the SVM parameters: Bayesian optimization, grid search, and random search. This study demonstrates how selecting optimal values for the SVM parameters leads to better classification. Additionally, real manufacturing data were utilized to evaluate the performance of the proposed SVM classifier, with a comparison to state-of-the-art techniques in the field. By using deep features from ResNet 101 and a support vector machine, this work achieves 74.5% accuracy in identifying wafer defects without employing any optimization technique. The performance of the model was further improved by utilizing the random search optimization technique, which yielded the best result among the three techniques tested, with an accuracy of 88.1%.


Introduction
The manufacturing sector suffers enormous losses owing to semiconductor wafer defects, which also have a significant impact on product development. For wafer defect categorization, semiconductor engineers use a variety of methods, such as manual extraction of pertinent attributes or machine-learning algorithms. However, these methods are unreliable, and their classification power needs to be increased. To automatically identify wafer faults, this study proposes a deep-learning-based framework. The variety of potential faults makes it challenging to develop a reliable machine-vision system that can correctly identify and categorize the numerous types of wafer defects. Determining the causes of flaws, which can be achieved by examining defect patterns, is essential for reducing the frequency of defects. A productive defect detection method can reduce the wafer test time and increase the yield. Pre-processing and categorization techniques frequently lead to the loss of important sensor-signal data that are necessary for identifying wafer defects. To manufacture semiconductors, wafer flaws must be identified and categorized accurately; doing so allows quality control and yield-enhancement techniques to be implemented and provides useful insights into potential fault sources.
The DTL-MobileNetV2 model performed better in identifying and categorizing six types of monocrystalline silicon wafer defects (cracks, double contrast, holes, microcracks, saw marks, and stains), owing to the use of an existing deep-learning model to extract features from the defect images. When tested against the testing set, the DTL-MobileNetV2-based classification method attained a remarkable accuracy of 98.99%, demonstrating that DTL is a highly effective method for identifying various types of flaws in monocrystalline Si wafers and is suitable for minimizing misclassification and increasing total production capacity [1]. The CNN-WDI model used a data augmentation strategy to address the problem of class imbalance and used convolutional layers to extract significant characteristics automatically rather than manually. Additionally, cutting-edge regularization methods, including batch normalization and spatial dropout, were used to improve its classification performance; nine different types of wafer map defects were classified with an average accuracy of 96.2% using the CNN-WDI model [2]. A visual classification technique based on machine learning was proposed for visible surface flaws in semiconductor wafers. The method uses deep-learning-based convolutional neural networks to identify and classify four different types of surface flaws: center, local, random, and scrape. The experimental findings showed that the proposed method, without any further improvement, had a high accuracy rate of 98-99% and outperformed previous machine-learning approaches for wafer-defect classification [3]. This study introduces DLADC, an ADC system that uses
deep-learning technologies and a deep convolutional neural network (CNN) architecture to identify and categorize surface defects on semiconductor wafers. Scanning electron microscopy (SEM) images are the input to the proposed system, which outputs the type and location of the identified fault. Additionally, knowledge about the size of the flaws can offer important insights into the cause of machine failure, which can help to enhance the manufacturing process and reduce defects; the suggested system's subclassification of particle-type defects into different size groups is therefore quite helpful for semiconductor makers. Furthermore, based on experimental findings using a genuine semiconductor defect dataset, an accuracy rate of 93.69% was attained [4]. Another work constructed a machine-learning system to look for surface flaws on semiconductor wafers, combining a Mask R-CNN-based AI algorithm with an optical scanning microscopy device. The system was trained on a dataset of microscopic images of wafers with different devices, including MEMS, silicon photonics, and superconductors, at different manufacturing phases; images of surface flaws were included in the training dataset. With an average precision of 92.6% and a recall of 93.8%, the experimental findings demonstrated that the proposed system achieved high flaw detection and classification accuracy. Results from testing on the most recent dataset of 192 photos with a resolution of 1600 × 1200 pixels and 5x optical magnification revealed an accuracy of 86% and a detection time of 0.5 seconds per image [5]. The DEFF ADC method makes use of an ensemble feature structure, with a featured network layer and a decision network layer. The ensemble features are computed by concatenating the features learned by various pre-trained CNN models to represent wafer faults, and different types of wafer surface damage can be categorized automatically using the DEFF approach [6]. A CNN is a well-liked
deep-learning architecture, and its acceptance is credited to its efficiency. To perform wafer map retrieval tasks and classify defect patterns, one study suggests employing CNN and XGBoost. The authors assessed the performance of CNN and XGBoost against various machine-learning models, such as support vector machines, random decision forests, and adaptive boosting. The findings demonstrate that the proposed approach, which combines CNN and XGBoost, outperforms competing models, with XGBoost being among the most widely used machine-learning frameworks among data scientists. The proposed solution employing CNN with extreme gradient boosting attained a high classification accuracy of 99.2% on the test dataset, considered the first time such an accuracy has been recorded for this challenge [7]. The suggested DL model was built on convolutional neural networks (CNNs) and utilized to identify wafer defect patterns. Convolutional and fully connected layers were employed for feature extraction,

classification, and pooling. Eight convolutional layers and three fully connected layers constitute the model. CNN is a reliable technique for locating flaws in semiconductor wafers; even in the presence of random noise, it performs well for both single-defect detection and multiple-defect identification.
In particular, the model exhibited 84% accuracy for mixed-defect detection and 100% efficiency for single-defect pattern recognition [8]. The stacked denoising autoencoder (SdA), a deep-learning technique, was used in a proposed method for fault detection and classification (FDC). By locating global and invariant features in sensor signals, the SdA model can extract features and categorize them simultaneously. In comparison with 12 other models that used various feature extractors and classifiers, experiments on wafer samples obtained from a photolithography tool revealed that using an SdA to create the FDC model led to higher classification accuracy; in particular, as the measurement-noise severity increased, the SdA model achieved up to 14% higher classification accuracy [9]. The You Only Look Once (YOLO) architecture was assessed for wafer map defect identification, classification, and localization. The categorization of wafer defects using the ResNet50 and DenseNet121 architectures was also examined in the absence of localization capabilities. On a dataset of 19,200 wafer maps, the YOLOv3 and YOLOv4 variants achieved more than 94% classification accuracy in real time, whereas ResNet50 and DenseNet121 achieved only 89% and 92% accuracy, respectively. Defects in semiconductor wafers can thus be identified and classified using object detection algorithms [10].

About Dataset
The dataset of wafer defects was obtained from the Kaggle website and comprises 54.02k images. The dataset is divided into two categories, balanced and imbalanced, and both contain nine distinct types of wafer surface defects. "Experienced process engineers are hired to define wafer defect patterns and assign them unique labels such as Center, Donut, Local, Edge-Loc, Edge-Ring, Scratch, Random, Near-Full, and None. Additionally, these defects become more common owing to the increasing integration density of circuits and wafer design complexity. Each wafer defect occurs because of the specific abnormal behavior of certain fabrication processes. For instance, center defects may occur due to uniformity issues in chemical and mechanical planarization, Edge-Loc defects may occur because of thin film deposition, and Edge-Ring defects due to etching problems" [2]. In the imbalanced dataset, the "None" category has the highest number of images with 36.4k, followed by "Edge_Ring" with 8,554 images, "Center" with 3,462 images, "Edge_Loc" with 2,417 images, "LOC" with 1,620 images, "Random" with 609 images, "Scratch" with 500 images, "Donut" with 409 images, and "Near_Full" with only 54 images.
In contrast, the balanced dataset has an equal number of images for each defect type, with 409 images per category. Each image in both datasets has a resolution of 26 × 26 pixels. Figure 1 shows samples of the wafer surface defects from the balanced dataset.

Proposed Methodology
A strategy must be devised for constructing and sustaining the categorization algorithm in an actual production setting, where wafers go through multiple pieces of machinery before being assessed for their yield. To take advantage of early yield forecasting, the classification model must capture wafer measurements before their arrival at the end of the production line. Nonetheless, the manufacturing process can encounter alterations or deviations that affect the effectiveness of the classifier. Therefore, we suggest empowering an experienced process engineer to determine when to revise the model. This determination can be made with the aid of in-situ monitors installed in every piece of processing equipment to track the key process variables. However, the SVM model must wait until several new wafers have been processed with the updated change and subsequently evaluated at the end-of-line stage. The gathered measurements and yield values linked to these m new wafers are then employed as the new training set for updating the SVM model. The experimental section demonstrates how the adaptable SVM model enhances yield classification when the model change is implemented following the desired process changes [11].

Support Vector Machine
The SVM, developed by Vapnik in 1995, has emerged as a method that has gained significant attention owing to its impressive outcomes. The primary distinction between an ANN and an SVM lies in the principle of risk minimization. Whereas an ANN employs empirical risk minimization to minimize the error on the training data, the SVM implements the principle of structural risk minimization by constructing an optimal separating hyperplane, w·x + b = 0, in the feature space and utilizing quadratic programming to determine a unique solution.
To obtain the optimal hyperplane {x : w·x + b = 0}, the norm of the vector w must be minimized; equivalently, the margin 1/||w|| between the hyperplane and each of the two classes should be maximized. This is achieved by minimizing ½||w||² subject to yi(w·xi + b) ≥ 1, i = 1, 2, …, N (1). The circled points in the SVM are referred to as "support vectors." These vectors satisfy the condition yi(w·xi + b) = 1 and confine the margin; therefore, any movement of these support vectors changes the normal vector w.
A nonlinear SVM utilizes a mapping function ɸ to map the training samples from the input space into a higher-dimensional feature space. In the nonlinear case, the data are first mapped to another Euclidean space H using a mapping function ɸ: Rd → H. Instead of dot products, a "kernel function" K is employed, such that K(xi, xj) = ɸ(xi)·ɸ(xj). There are various kernel functions, including the polynomial, radial basis function (RBF), and sigmoid kernels (2), which can enhance classification accuracy.
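The kernel trick described above can be illustrated in a few lines of code. The sketch below (plain Python, with illustrative default parameter values, not the settings used in this study) evaluates the kernels mentioned, each standing in for the inner product ɸ(xi)·ɸ(xj) in the mapped space without ever computing ɸ explicitly:

```python
import math

def linear_kernel(x, z):
    """K(x, z) = x . z : a plain dot product (no mapping)."""
    return sum(a * b for a, b in zip(x, z))

def poly_kernel(x, z, degree=3, c=1.0):
    """Polynomial kernel: K(x, z) = (x . z + c)^d."""
    return (linear_kernel(x, z) + c) ** degree

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel: K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(x, z, alpha=0.1, c=0.0):
    """Sigmoid kernel: K(x, z) = tanh(alpha * x . z + c)."""
    return math.tanh(alpha * linear_kernel(x, z) + c)
```

Note that the RBF kernel of a point with itself is always 1, since ||x - x|| = 0; this is the Gaussian kernel used later with the Bayesian-optimized classifier.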

Optimization Techniques
To achieve optimal results using deep-learning algorithms, it is essential to tune their parameters appropriately. Selecting a powerful deep-learning algorithm and adjusting its parameters are critical for developing a high-accuracy classification model. However, manually performing parameter optimization can be extremely time-consuming, particularly when the learning algorithm has many parameters. One of the biggest challenges in setting up an SVM model is selecting an appropriate kernel function and determining the optimal parameter values. Poor classification outcomes often result from the selection of inappropriate parameter settings. Therefore, it is crucial to identify the most appropriate parameter settings for an SVM model in order to ensure the best possible classification results.
Hyperparameters in neural networks include the number of layers, learning rate, momentum, minibatch size, and other factors that directly impact training performance. Therefore, optimizing the hyperparameters and selecting the most suitable set to achieve good training results, robustness, and strong generalization ability of the neural network is essential. However, optimizing hyperparameters is a complex combinatorial optimization problem that can be challenging. Furthermore, evaluating a set of hyperparameter configurations is time-consuming and costly. Therefore, it is crucial to carefully and efficiently optimize hyperparameters to achieve the best possible training outcomes [13].
This paper employed three methods to tune the SVM parameters: Bayesian optimization, grid search, and random search techniques.The aim of parameter tuning is to reduce testing time while maintaining test accuracy, and a feasible parameter setting must achieve comparable accuracy to the original model.

Grid Search Optimization Technique
The grid search method systematically tests a predefined subset of hyperparameter values by exhaustively evaluating all possible combinations of hyperparameters within a specified range. Hyperparameters are typically defined using the minimum value, maximum value, and step size for each hyperparameter. The search can be conducted on linear, quadratic, or logarithmic scales. Performance metrics are then used to evaluate each combination of hyperparameters.
Grid search tunes the SVM hyper-parameters (such as C, γ, degree, etc.) using a performance metric based on the cross-validation (CV) technique.The primary objective is to find optimal hyperparameter values that lead to an accurate prediction of new data by the classifier while avoiding overfitting.
To determine the optimal values of C and γ using k-fold cross-validation, we divided the available data into k subsets (typically, k = 10). One subset was used as the validation set, and the model was trained on the remaining k-1 subsets. We then evaluated the performance of the SVM classifier on the validation set using different values of C, γ, and the other parameters. This process was repeated for all k possible partitions of the data, and the average cross-validation error was calculated for each combination of hyperparameters. Finally, the combination that achieved the highest cross-validation accuracy (or lowest error) was selected and used to train an SVM on the entire dataset [14].
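The procedure above can be sketched as a generic skeleton. This is not the MATLAB implementation used in the paper; `train_eval` is a hypothetical placeholder that would train an SVM on the training folds and return the validation error for the given (C, γ) pair:

```python
import itertools

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds (last fold takes the remainder)."""
    fold = n // k
    folds = []
    for i in range(k):
        start = i * fold
        stop = n if i == k - 1 else start + fold
        folds.append(list(range(start, stop)))
    return folds

def cv_error(train_eval, X, y, params, k=5):
    """Average validation error of one hyperparameter setting over k folds."""
    folds = k_fold_indices(len(X), k)
    errors = []
    for val_idx in folds:
        val = set(val_idx)
        train_idx = [i for i in range(len(X)) if i not in val]
        # train_eval: fit on train_idx, score on val_idx, return error
        errors.append(train_eval(train_idx, val_idx, X, y, params))
    return sum(errors) / k

def grid_search(train_eval, X, y, grid, k=5):
    """Exhaustively evaluate every (C, gamma) combination; return the best."""
    best_params, best_err = None, float("inf")
    for C, gamma in itertools.product(grid["C"], grid["gamma"]):
        err = cv_error(train_eval, X, y, {"C": C, "gamma": gamma}, k)
        if err < best_err:
            best_params, best_err = {"C": C, "gamma": gamma}, err
    return best_params, best_err
```

The cost is the product of the grid sizes times k model fits, which is why grid search becomes expensive as the number of hyperparameters grows.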

Random Search Optimization Technique
A grid search can result in unnecessary computations of hyperparameters that have little impact on the performance of the model. A more efficient technique called random search can be used to overcome this problem. In a random search, hyperparameters are randomly sampled from the search space, and the configuration with the best performance is chosen. The intuition behind this approach is that a sufficiently large set of random samples is likely to include the global optimum, or a close approximation, with high probability. Moreover, random search is typically faster than grid search [13].
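A minimal sketch of this idea follows. The log-uniform sampling ranges for C and γ are illustrative assumptions, not the ranges used in this study; `evaluate` stands in for training the SVM and returning its validation error:

```python
import random

def random_search(evaluate, n_iter=30, seed=0):
    """Randomly sample (C, gamma) configurations; keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_err = None, float("inf")
    for _ in range(n_iter):
        params = {
            "C": 10 ** rng.uniform(-2, 3),      # C sampled log-uniformly in [0.01, 1000]
            "gamma": 10 ** rng.uniform(-3, 1),  # gamma sampled log-uniformly in [0.001, 10]
        }
        err = evaluate(params)                  # validation error of this configuration
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err
```

Unlike grid search, the budget (`n_iter`) is fixed regardless of how many hyperparameters are searched, which is the main source of its efficiency.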

Bayesian Optimization Technique
Bayesian optimization is an adaptive method for hyperparameter optimization that can arrive at parameter values different from those found by grid search and random search. It uses information from previous experiments to predict the next set of parameters that might provide the greatest benefit; unlike grid search and random search, it exploits everything it already knows when testing new parameter combinations. The algorithm starts from a prior distribution based on what is known about the objective function. As new sampling points are tested, this information is used to update the prior, creating a posterior distribution that better represents the true objective function. The algorithm then tests points in the region where the posterior distribution indicates the global optimum is most likely to be found. This process of testing new points and updating the posterior repeats until the algorithm determines the best set of hyperparameters. Bayesian optimization balances exploration and exploitation so that it does not become stuck in a local optimum: exploration refers to sampling new regions of the hyperparameter space, whereas exploitation refers to sampling regions that have worked well in the past. By striking a balance between the two, Bayesian optimization can reach the global optimum of the objective function with fewer samples than grid or random search [13].
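The explore/exploit loop described above can be sketched in miniature. The code below is a deliberately simplified stand-in, not a full Gaussian-process implementation: the surrogate is an RBF-weighted average of the errors observed so far, the "uncertainty" is simply the distance to the nearest evaluated point, and the acquisition function is a lower-confidence bound trading the two off. All constants here are illustrative assumptions:

```python
import math
import random

def bayes_opt_1d(objective, bounds, n_init=3, n_iter=15, beta=1.0, seed=0):
    """Minimal Bayesian-style optimizer over a single hyperparameter."""
    rng = random.Random(seed)
    lo, hi = bounds
    X = [rng.uniform(lo, hi) for _ in range(n_init)]  # initial random design
    Y = [objective(x) for x in X]

    def surrogate_mean(x):
        # Kernel-weighted average of observed errors: a crude posterior mean.
        w = [math.exp(-25 * ((x - xi) / (hi - lo)) ** 2) for xi in X]
        s = sum(w)
        return sum(wi * yi for wi, yi in zip(w, Y)) / s if s > 1e-12 else max(Y)

    def uncertainty(x):
        # Far from every sample => high uncertainty => worth exploring.
        return min(abs(x - xi) for xi in X)

    for _ in range(n_iter):
        # Acquisition: lower-confidence bound (minimize mean - beta * uncertainty).
        cands = [lo + (hi - lo) * i / 200 for i in range(201)]
        x_next = min(cands, key=lambda x: surrogate_mean(x) - beta * uncertainty(x))
        X.append(x_next)
        Y.append(objective(x_next))  # evaluate and fold into the "posterior"

    best = min(range(len(Y)), key=Y.__getitem__)
    return X[best], Y[best]
```

A real implementation (such as MATLAB's Bayesian optimization used in this study) would replace the surrogate with a Gaussian process and the grid scan with a proper acquisition optimizer, but the update-then-sample structure is the same.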

ResNet 101
The ResNet-101 network is a deep convolutional neural network with 101 layers. Its design was based on the VGG-19 model, and it is one of the most complex structures proposed for the ImageNet competition. In a CNN, many layers are linked to one another and trained to perform different tasks, and ResNet learns various feature levels through its layers. The convolutional layers of the model usually have 3×3 filters. One aspect that makes ResNet stand out is that layers producing the same output feature map size have the same number of filters; if the size of the output feature map is halved, the number of filters doubles, which keeps the time complexity of each layer the same. Direct downsampling is performed by convolutional layers with a stride of two. ResNet ends with a global average pooling layer and a fully connected layer activated by softmax. ResNet's main characteristic is residual learning, which can be thought of as learning the difference between a layer's output and its input features. ResNet uses shortcut connections between each pair of 3×3 filters: the input of the kth layer is directly connected to the output of the (k + x)th layer. Skipping layers in this way helps prevent the problem of gradients shrinking until they vanish, because the activations from an earlier layer are reused until the subsequent layer has learned its weights. During training, the weights adapt to amplify the contribution of the subsequent layer and attenuate that of the earlier one. ResNet is easier to train than ordinary deep convolutional neural networks and mitigates the accuracy degradation problem [15].
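The residual connection amounts to computing y = F(x) + x. The toy sketch below (plain Python on vectors, with `transform` standing in for the block's stacked 3×3 convolutions) shows why such a block is easy to train: if the residual branch outputs zeros, the block reduces to the identity followed by ReLU, so adding depth cannot hurt the starting point:

```python
def relu(v):
    """Element-wise ReLU activation."""
    return [max(0.0, x) for x in v]

def residual_block(x, transform):
    """y = relu(F(x) + x): the block learns a residual F, not a full mapping.

    `transform` is a stand-in for the stacked convolutions of a real ResNet
    block; the identity shortcut lets gradients flow around it unchanged.
    """
    fx = transform(x)  # F(x): the learned residual
    assert len(fx) == len(x), "the shortcut requires matching shapes"
    return relu([f + xi for f, xi in zip(fx, x)])  # add the skip connection, then activate
```

With `transform` returning all zeros, `residual_block` simply passes its input through a ReLU, which is the "easy identity" that underlies ResNet's trainability.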

Results and Discussion
In this experiment, two different methods were used to compare the performance of the SVM models. The first method involves using the SVM model without any optimization technique, whereas the second involves using the SVM model with optimization techniques to improve the classification accuracy. This experiment used three different optimization techniques: Bayesian, Grid Search, and

Random Search. The proposed method was implemented on an HP Victus laptop with a 12th-generation Intel Core i7 CPU, running the Windows 11 operating system, including a built-in NVIDIA graphics processing unit, and employing MATLAB 2022a software. The training dataset comprised 70% of the data, the validation dataset comprised 20%, and the remaining 10% was reserved for testing the augmented dataset. The classification accuracy refers to the proportion of wafers classified correctly out of the total number of wafers. In contrast, the false-negative rate is the percentage of bad wafers incorrectly classified as good. Bayesian optimization was utilized to tune the hyperparameters of the SVM method, employing a one-vs-one approach for multiclass classification. The hyperparameters chosen were the box constraint level and the kernel scale, with values of 825.6872 and 12.9019, respectively, and a Gaussian kernel function was employed. To evaluate the performance of the proposed method, validation and evaluation were performed. After 30 iterations, the minimum classification error was 0.10946. The proposed method was validated using 20% of the dataset, achieving a validation accuracy of 87.8%. The testing accuracy was evaluated using 10% of the dataset, and an accuracy of 85% was achieved. Figure 3 presents the iteration vs. classifier error graph of the SVM classifier with Bayesian optimization, and the confusion matrix of this approach is shown in Figure 4.
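The two evaluation metrics defined above can be computed directly from a confusion matrix (rows = true class, columns = predicted class). The sketch below is illustrative and uses a hypothetical two-class matrix rather than the paper's nine-class results:

```python
def accuracy(confusion):
    """Correctly classified wafers (the diagonal) over all wafers."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total

def false_negative_rate(confusion, bad_classes, good_class):
    """Share of bad (defect-class) wafers predicted as the good class."""
    bad_total = sum(sum(confusion[i]) for i in bad_classes)
    missed = sum(confusion[i][good_class] for i in bad_classes)
    return missed / bad_total

# Hypothetical 2-class example: class 0 = "None" (good), class 1 = any defect.
cm = [[90, 10],
      [5, 95]]
```

For this hypothetical matrix, 185 of 200 wafers are classified correctly (accuracy 0.925), and 5 of the 100 defective wafers are missed (false-negative rate 0.05); the same computation extends directly to the nine-class confusion matrices in Figures 2, 4, and 6.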
A grid search is a systematic search over a predefined subset of the hyperparameter space: the range of parameters to be optimized is partitioned into a grid, and all points on the grid are evaluated to obtain the optimal parameters, using cross-validation as the performance metric. Our next strategy for enhancing accuracy was therefore to utilize the grid search optimization technique with the SVM classifier for multiclass classification. This approach used a linear kernel function with a box constraint level of 0.1. After 30 iterations, the minimum classification error was 0.11292. The proposed method was validated and evaluated using 20% and 10% of the dataset for validation and testing, respectively; the achieved validation accuracy was 87.3%, and the testing accuracy was 88.4%. Random search hyperparameter tuning of the SVM method also used the one-vs-one approach for multiclass classification. The box constraint level for this approach was 0.062075, and a linear kernel function was used. Validation and testing were performed on the dataset to evaluate the performance of the proposed approach. After 30 iterations, the minimum classification error was 0.11804. The validation accuracy achieved was 88.1%, and the testing accuracy was 88.1%, using 20% and 10% of the dataset for validation and testing, respectively. The confusion matrix for this approach is shown in Fig. 6.

Fig. 1. Samples of wafer surface defects from the balanced dataset.

Fig. 2. Confusion matrix of the SVM classifier without optimization. Fig. 3. Iteration vs. classifier error graph of the SVM classifier with Bayesian optimization. Fig. 4. Confusion matrix of the SVM classifier with Bayesian optimization.

Figure 2 presents the accuracy of ResNet 101 deep features with the SVM classifier for the nine types of wafer defect detection. As shown in the figure, the model achieved an accuracy score of 74.5%. However, because accuracy is critical for wafer defect detection in the manufacturing industry, it is necessary to improve it further. To achieve this, an optimization technique was employed with the SVM classifier to enhance the accuracy of wafer defect detection. Bayesian optimization has emerged as a popular technique for addressing optimization problems in various domains where traditional numerical methods fall short. One of its popular applications is hyperparameter tuning, which aims to minimize the validation error of a machine-learning algorithm by tuning its hyperparameters. The process involves evaluating the objective function, the validation error, by training the machine-learning model and assessing its performance on validation data. In this context, Bayesian optimization was applied in conjunction with the SVM classifier to improve the accuracy of wafer defect detection.

Fig. 5. Iteration vs. classifier error graph and confusion matrix of the SVM classifier with grid search optimization. Fig. 6. Confusion matrix of the SVM classifier with random search optimization.

Fig. 7. Iteration vs. classifier error graph and confusion matrix of the SVM classifier with random search optimization.

Table 1. Comparison between the three optimization techniques.

In this paper, we propose a solution for classifying defects in monocrystalline silicon wafers using ResNet 101 deep features, and we explore the use of Bayesian optimization, grid search (GS), and random search (RS) techniques to fine-tune the hyperparameters of support vector machines (SVMs) for accurate defect detection. This approach aims to enable semiconductor engineers to classify wafers automatically and diagnose defects early without relying on specialized or empirical knowledge. Based on the results presented in Table 1, the test accuracy achieved by Bayesian optimization is lower than its validation accuracy, indicating that the model may overfit the training data. Grid search, on the other hand, has a larger difference between test accuracy and validation accuracy than the random search technique, whereas random search showed no difference between test accuracy and validation accuracy. Therefore, random search is the preferred optimization technique for adjusting the hyperparameters of the SVM using the deep features of ResNet 101 to detect defective wafers.