Automated visual inspection of CMS HGCAL silicon sensor surface using an ensemble of a deep convolutional autoencoder and classifier

More than a thousand 8″ silicon sensors will be visually inspected to look for anomalies on their surface during the quality control preceding assembly into the High-Granularity Calorimeter for the CMS experiment at CERN. A deep learning-based algorithm that pre-selects potentially anomalous images of the sensor surface in real time was developed to automate the visual inspection. The anomaly detection is done by an ensemble of independent deep convolutional neural networks: an autoencoder and a classifier. The algorithm was deployed and has been continuously running in production, and the data gathered were used to evaluate its performance. The pre-selection reduces the number of images requiring human inspection by 85%, with a recall of 97%, and saves 15 person-hours per batch of a hundred sensors. Data gathered in production can be used for continuous learning to improve the accuracy incrementally.


Introduction
Silicon sensors are used in high-energy physics experiments due to their sufficient radiation tolerance, energy resolution and cost-effectiveness. In the high radiation area, the active element of the High-Granularity Calorimeter (HGCAL) [1], which will replace the endcap calorimeters of the CMS [2] experiment at the Large Hadron Collider (LHC) [3], will consist of more than 27,000 hexagonal 8" silicon sensor wafers to achieve unprecedented transverse and longitudinal segmentation. An HGCAL sensor is shown in figure 1 (left). The sensors are produced by Hamamatsu Photonics K.K. and will arrive at CERN in batches. In order to ensure that the sensors meet the criteria for operation at the LHC, a fraction (5%) of each batch will undergo quality control (QC) in a dedicated clean room at CERN. Thus, more than a thousand sensors will be processed over the course of several months.
The QC procedure adopted during the construction of the CMS silicon trackers involved a visual inspection (VI) in addition to quantification of several electrical properties [4]. Similarly, a major part of the QC of the HGCAL sensors is the electrical characterization of the sensors, during which the sensors are biased up to 1000 V [5]. Defects and dust on a sensor surface can potentially lead to an electrical failure of the sensor. Examples of a typical defect, a scratch, and a dust particle are shown in figure 2. Given that these defects are rare and unwanted, they are referred to as anomalies. The anomalies can occur during manufacturing, packaging, delivery, or associated handling of the sensors. In an effort to prevent failures, the sensor surface is visually inspected and cleaned prior to the electrical characterization. Aside from exceptionally severe scratches, most of the anomalies are invisible to the naked eye. Traditionally, silicon VI is carried out manually with the help of a microscope. However, dozens of square meters of sensor surface will be inspected during the assembly of HGCAL, and therefore, a standardized and automated method must be in place. Prior to this work, hundreds of microscope images were taken using a scan program, and the images were inspected on a computer monitor by a human operator. This work presents a deep learning-based pre-selection algorithm (PSA) that fully automates the VI. In addition, the PSA is believed to reduce human bias in the VI. The PSA is built upon the proof-of-concept work described in [6]. Although the PSA is presented in the context of HGCAL sensor QC, the same approach could be applied to similar use cases of automating the VI of images.
The PSA proposed in this work detects anomalous scan images via an ensemble of a deep convolutional autoencoder (AE) and a deep convolutional classifier neural network. The AE acts on each scan image, and the classifier acts on patches of the image, allowing for localization of anomalous areas (annotation). The PSA has been deployed in a clean room at CERN, and data gathered in production are used to evaluate the performance of the anomaly detection. The performance is mainly measured using two metrics. First, the False Negative Rate (FNR), defined as

$$\mathrm{FNR} = \frac{\text{False Negative}}{\text{False Negative} + \text{True Positive}}, \tag{1}$$

should be minimized for a reliable PSA. Second, a relatively large False Positive Rate (FPR), defined as

$$\mathrm{FPR} = \frac{\text{False Positive}}{\text{False Positive} + \text{True Negative}}, \tag{2}$$

is allowed, but an upper limit of 10% is set to sufficiently automate the VI.
This paper is structured as follows. The data acquisition and characteristics are described in section 2. An introduction to automated VI is given in section 3, followed by a description of the proposed architecture and the model training process in section 4. The results of the deployment at CERN are presented in section 5. In section 6, the efficiency and causes of incorrect predictions are discussed, and the proposal for continuous improvement of the anomaly detection capabilities is presented. Finally, conclusions are provided in section 7.
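As a minimal sketch of the two metrics introduced in this section, the snippet below computes the FNR and the FPR from confusion-matrix counts, with the FPR taken in its standard form, False Positive / (False Positive + True Negative). The function names and the counts are hypothetical and chosen only for illustration; they are not results from this work.

```python
def false_negative_rate(fn: int, tp: int) -> float:
    """FNR = FN / (FN + TP): fraction of anomalous samples that were missed."""
    return fn / (fn + tp)


def false_positive_rate(fp: int, tn: int) -> float:
    """FPR = FP / (FP + TN): fraction of normal samples flagged as anomalous."""
    return fp / (fp + tn)


# Hypothetical confusion-matrix counts, chosen only for illustration.
fn, tp, fp, tn = 3, 97, 50, 950
fnr = false_negative_rate(fn, tp)
fpr = false_positive_rate(fp, tn)
recall = 1.0 - fnr  # recall (true positive rate) is the complement of the FNR
print(f"FNR={fnr:.3f}  FPR={fpr:.3f}  recall={recall:.3f}")
```

Since recall equals 1 − FNR, minimizing the FNR is equivalent to maximizing the recall.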

Setup and data set
A custom semi-automated VI system has been implemented in the clean room for HGCAL sensor testing. Using a programmable xy-stage, the sensor is moved beneath a combination of a microscope and a camera in a scan pattern. An example of a scan map is shown in figure 1 (right), where 385 images are taken. A scan image, referred to as a whole image, contains 2720 × 3840 pixels and is stored in Bayer format [7]. The Bayer format is a particular arrangement of RGB color filters common to camera systems, which retains the color information but reduces the required bits per pixel from 24 to 8 bits. Examples of whole images in RGB format are shown in figure 2.
The images acquired during the semi-automated VI require human inspection. A small fraction of the images of a typical scan are anomalous, meaning that the operator has to inspect hundreds of normal images to find the anomalous ones. This makes the semi-automated VI tiring, slow, and typically not 100% effective due to visual fatigue. Moreover, it can be biased by the inspector's overexposure to normal images, which are prevalent in the data set. In addition, since multiple inspectors with varying experience and alertness share the QC task, the VI is further biased by their subjectivity.
The environmental conditions, such as zoom level, sensor alignment underneath the microscope and lighting conditions, can change in between scans, and the PSA must be invariant to these changes. An example of changing lighting conditions between the measurements is demonstrated in figure 2, where the left and right images differ in the overall hue. As the PSA is integrated into the data acquisition of the semi-automated system, it must run in real time.
Taking into consideration the data imbalance, variable environment, and requirements for accuracy and speed, the PSA presented in this paper was developed using images acquired during semi-automated VI of 50 sensors. The data were acquired in batches over the course of several months, and consist of more than 25,000 images. Fifteen sensor scans acquired after the deployment of the PSA are used to evaluate its performance.
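To illustrate how a Bayer mosaic keeps one 8-bit sample per pixel instead of three, the sketch below samples an RGB image with an RGGB filter layout. The RGGB layout, the function name and the toy image are assumptions made for illustration; the actual filter arrangement of the camera is not specified here.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]


def rgb_to_bayer_rggb(rgb: List[List[RGB]]) -> List[List[int]]:
    """Keep one 8-bit colour sample per pixel, following an RGGB mosaic.

    Even rows alternate R, G; odd rows alternate G, B. The RGGB layout is an
    assumption for illustration; the camera's actual layout may differ.
    """
    mosaic = []
    for y, row in enumerate(rgb):
        out_row = []
        for x, (r, g, b) in enumerate(row):
            if y % 2 == 0:
                out_row.append(r if x % 2 == 0 else g)
            else:
                out_row.append(g if x % 2 == 0 else b)
        mosaic.append(out_row)
    return mosaic


# A hypothetical 2 x 2 RGB image: three samples (24 bits) per pixel.
rgb_image = [[(1, 2, 3), (4, 5, 6)],
             [(7, 8, 9), (10, 11, 12)]]
bayer_image = rgb_to_bayer_rggb(rgb_image)  # one sample (8 bits) per pixel
print(bayer_image)
```

Each 2 × 2 block of the mosaic still contains red, green and blue samples, which is why the color information can be recovered by interpolation.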

Overview of existing methods for automated visual inspection
The task of the PSA is identification and localization of rarely occurring outliers in data. Sometimes, an analytical approach in the form of a series of image processing filters and functions can be used to detect anomalies instead of more complex methods such as deep learning. For example, anomalies have been detected from images of the silicon strip sensors of the Inner Tracker of the ATLAS detector [8] using methods such as a Gaussian filter and Sobel derivatives [9,10]. However, due to changing environmental conditions (including room lighting) and characteristics of the normal HGCAL sensor surface, these methods cannot produce robust results. Thus, deep learning, and specifically deep convolutional neural networks (CNNs) [11], are explored in this work. CNNs are known to perform well in image classification tasks. Several classifier networks have been developed, such as the VGG16 [12]. Characteristic of its architecture are sequential convolutional layers with small 3 × 3 filters and 2 × 2 max pooling layers. However, classifier networks are not object detectors, as they do not indicate the location of the object. Instead, a widely used network for object detection is the Region-based Convolutional Neural Network (R-CNN) [13], which performs object detection via three distinct networks. The first network is a region proposal network, which extracts up to 2,000 regions of interest from the input image. The regions of interest are passed on to the second model, which is a CNN that extracts the features of each region. Finally, a classifier CNN is applied to the features to produce the classification output in the form of bounding boxes. The R-CNN has been used in automating the VI of silicon micro-strip sensors [14].
Unfortunately, the R-CNN is too slow for real-time object detection, as thousands of iterations are required per image to produce the detection output. Thus, faster versions of the model, such as the Faster R-CNN [15], have been developed. However, a preferred approach for real-time object detection is to perform both the region and feature extraction in the same network and with a single iteration per image. An example of such a network is the You Only Look Once (YOLO) network [16], which splits an image into cells via a grid and predicts n bounding boxes and the class probabilities for each cell. In the context of the HGCAL project, the use of VGG16 and a version of YOLO known as YOLOv4-tiny has been studied for the VI of wire bonds of the sensor modules [17].
The CNNs discussed above have to be trained in a supervised fashion, with training samples from all classes. Self-supervised anomaly detection can be implemented using AEs, which are composed of two neural networks. The first network, known as the encoder, reduces the dimensionality of the input data into a representation referred to as the latent space. The second network is a decoder, which reconstructs the latent space back into the original dimensionality of the input. An AE is trained by minimizing the loss function, expressed as the reconstruction error. The reconstruction error is usually quantified as the mean absolute error between the input y and the reconstructed output ŷ, defined as

$$L_1 = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|, \tag{3}$$

where n is the number of elements in y. The squared error, defined as

$$L_2 = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2, \tag{4}$$

can also be used. For anomaly detection, the AE is trained on only normal samples, and thus, it will reconstruct anomalies poorly. Convolutional AEs have been used for anomaly detection from image-like data in multiple applications [18,19,20], also at the LHC [21,22]. For images, the reconstruction error is calculated pixel-wise, and localized increases in the reconstruction error indicate anomalies.
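The reconstruction errors described above can be sketched in a few lines. Plain Python lists stand in for image arrays, and the 3 × 3 toy image, its reconstruction and the function names are hypothetical; the reconstruction is written by hand to mimic an AE that reproduces the normal surface but not a bright anomalous pixel.

```python
from typing import List

Image = List[List[float]]


def flatten(image: Image) -> List[float]:
    return [value for row in image for value in row]


def mean_absolute_error(y: Image, y_hat: Image) -> float:
    """Mean of |y_i - y_hat_i| over all pixels."""
    a, b = flatten(y), flatten(y_hat)
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def mean_squared_error(y: Image, y_hat: Image) -> float:
    """Mean of (y_i - y_hat_i)^2 over all pixels."""
    a, b = flatten(y), flatten(y_hat)
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def pixelwise_abs_difference(y: Image, y_hat: Image) -> Image:
    """Per-pixel error map; localized peaks indicate anomalies."""
    return [[abs(p - q) for p, q in zip(row, row_hat)]
            for row, row_hat in zip(y, y_hat)]


# Hypothetical input with one bright anomalous pixel in the centre.
image = [[10, 10, 10],
         [10, 200, 10],
         [10, 10, 10]]
# Hypothetical reconstruction: normal pixels reproduced, anomaly suppressed.
reconstruction = [[10, 10, 10],
                  [10, 12, 10],
                  [10, 10, 10]]

error_map = pixelwise_abs_difference(image, reconstruction)
print(error_map)
print(mean_absolute_error(image, reconstruction),
      mean_squared_error(image, reconstruction))
```

The error map is zero everywhere except at the anomalous pixel, which is the localized increase that the subsequent detection step relies on.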

Proposed approach
In this section, the architecture of the PSA is described. The full inference pipeline of a whole image consists of the following steps:
(i) Apply a patching grid.
(ii) Apply a background detecting classifier, referred to as the background detector, to patches of the whole image.
(iii) Apply an AE to the whole image and calculate the reconstruction error as the pixel-wise absolute difference D.
(iv) Apply an anomaly detecting classifier, referred to as the anomaly detector, to the patches of D.
Thus, each whole image is iterated over three times, and notably, the images are kept in the Bayer format throughout the entire inference pipeline. The anomaly detection process is schematically shown in figure 3. The models were implemented using Keras [23] and TensorFlow 2 [24], and trained on an NVIDIA GeForce GTX 1080 GPU [25].
The hyperparameters of all three models were optimized manually using their respective losses as a metric. The main reasons for patching are that the fraction of area covered by an anomaly is much larger for an anomalous patch than for an anomalous whole image, and that the input size for a classifier becomes smaller. Also, the patching allows the general location of the anomaly in the whole image to be computed. In addition, data augmentation can be done more efficiently by applying it to anomalous patches only, and a class can be under-sampled flexibly from patched data.
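The four steps listed above can be sketched as follows. This is a toy illustration under stated assumptions, not the implementation used in this work: the three models are replaced by hypothetical stand-in functions (`toy_autoencoder`, `toy_is_background`, `toy_is_anomalous`), plain lists stand in for image arrays, and a 4 × 4 image with 2 × 2 patches replaces the real image and patch sizes.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]


def crop(image: Image, top: int, left: int, size: int) -> Image:
    """Return the size x size patch whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def abs_difference(a: Image, b: Image) -> Image:
    """Pixel-wise absolute difference D between two equally sized images."""
    return [[abs(p - q) for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def run_pipeline(image: Image,
                 autoencoder: Callable[[Image], Image],
                 is_background: Callable[[Image], bool],
                 is_anomalous: Callable[[Image], bool],
                 patch_size: int) -> List[Tuple[int, int]]:
    height, width = len(image), len(image[0])
    # (i) Patching grid: top-left corners of non-overlapping patches.
    grid = [(top, left)
            for top in range(0, height - patch_size + 1, patch_size)
            for left in range(0, width - patch_size + 1, patch_size)]
    # (ii) Background detector on patches of the input image.
    on_sensor = [(t, l) for t, l in grid
                 if not is_background(crop(image, t, l, patch_size))]
    # (iii) Autoencode the whole image and compute the difference D.
    diff = abs_difference(image, autoencoder(image))
    # (iv) Anomaly detector on patches of D; background patches are ignored.
    return [(t, l) for t, l in on_sensor
            if is_anomalous(crop(diff, t, l, patch_size))]


# Toy stand-ins (hypothetical, not the trained networks).
def toy_autoencoder(image: Image) -> Image:
    return [[10.0 for _ in row] for row in image]  # flat "normal" surface


def toy_is_background(patch: Image) -> bool:
    pixels = [v for row in patch for v in row]
    return sum(v < 5 for v in pixels) > len(pixels) / 2  # mostly dark


def toy_is_anomalous(diff_patch: Image) -> bool:
    return max(v for row in diff_patch for v in row) > 50


toy_image = [[99, 0, 10, 10],   # top-left patch: background with a feature
             [0, 0, 10, 10],
             [10, 10, 10, 10],
             [10, 10, 10, 90]]  # bottom-right patch: anomaly on the sensor
flagged = run_pipeline(toy_image, toy_autoencoder, toy_is_background,
                       toy_is_anomalous, patch_size=2)
print(flagged)
```

In this toy case the bright feature in the background patch produces a large difference but is ignored, so only the patch on the sensor surface is flagged.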

Background elimination
During a typical scan, approximately 15% of the images contain the background, referring to the surface the sensor lies on. Occasionally, the background, which is typically a black sheet of plastic, contains features which can be incorrectly selected as anomalies. While the FNR is the most important metric to optimize, a background detector is applied to the patches before the whole image is autoencoded in order to further reduce the FPR of the PSA. Anomalies in the background patches are ignored. The background detector was built as a standalone model so that it can be removed from the inference pipeline.
A CNN with four convolutional layers using the ReLU activation, followed by dropout layers with a rate of 20%, and totaling 84,401 trainable parameters, was trained on images sampled from the training data. The patches were given binary labels corresponding to sensor surface and background. The training data and parameters are described in table 1. Classification between the sensor surface and background is a trivial task for a deep CNN, and a test accuracy of over 99% was achieved.

Anomaly enhancement using an autoencoder
The structure of the AE is schematically shown in figure 5. The encoder consists of five convolutional layers, and in mirror-like fashion, the decoder consists of five transposed convolutional layers. The compression factor of the AE is 1,600. The Exponential Linear Unit [26] was used as the activation function. For the training, the L2 loss and the Adam [27] optimizer with a learning rate of 10⁻⁴ were used. In total, the AE has 126,353 trainable parameters, and it was trained with 16,000 normal whole images for 277 epochs, until the validation loss reached a plateau. Due to memory constraints, the batch size had to be set to one. The training data and parameters are summarized in table 1. The AE can be interpreted as a data pre-processing step that makes the subsequent anomaly detection and its localization more robust against environmental changes. The normal and constant features in the images are reduced, while the anomalies are enhanced. An example of how the AE reconstructs an anomalous whole image is shown in figure 6.
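As a rough worked example of what a compression factor of 1,600 implies, the latent size can be estimated under two assumptions that are not stated explicitly in the text: that the factor is the ratio of input values to latent values, and that a whole image holds 2720 × 3840 single-channel Bayer pixels (the size covered by a 17 × 24 grid of 160 × 160 patches).

```latex
\begin{align*}
N_\text{input}  &= 2720 \times 3840 = 10\,444\,800, \\
N_\text{latent} &= \frac{N_\text{input}}{1600} = 6\,528 .
\end{align*}
```

Under these assumptions, each whole image would be represented by a few thousand latent values.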

Figure 6. An example of the autoencoder output (center) for an anomalous input image (left). (right) The difference, i.e. the reconstruction error, measured as the absolute pixel-wise difference between the autoencoder output and the input. A dust particle is enhanced compared to the normal area, and can be easily isolated. The anomaly is zoomed in on all images.

Anomaly detection
First, the performance of the AE as a standalone anomaly detector was studied using 1,465 anomalous and 225,370 normal patches as training data. A threshold for the reconstruction error was determined based on a validation data set, consisting of 157 anomalous and 27,179 normal patches, such that the validation FNR and FPR were minimized. An increase in AE reconstruction error for anomalous patches is visible in figure 7, where the selected threshold is also indicated ‡. However, as the distributions overlap greatly, the AE reconstruction error cannot be used as a robust enough classifier.
‡ Some patches, e.g. of the black areas on the sensor surface, are easy to reconstruct by the AE, appearing as the small bump at the lower range of the reconstruction error in figure 7.
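One way to pick such a threshold can be sketched as a scan over candidate values. The criterion used here, minimizing the sum FNR + FPR on the validation set, is an assumption, since the text only states that both rates were minimized; the per-patch reconstruction errors and the function names are hypothetical.

```python
from typing import List, Tuple


def rates(normal_errors: List[float], anomalous_errors: List[float],
          threshold: float) -> Tuple[float, float]:
    """Return (FNR, FPR) when patches with error > threshold are flagged."""
    fn = sum(e <= threshold for e in anomalous_errors)
    fp = sum(e > threshold for e in normal_errors)
    return fn / len(anomalous_errors), fp / len(normal_errors)


def best_threshold(normal_errors: List[float],
                   anomalous_errors: List[float]) -> float:
    """Scan observed error values and minimise FNR + FPR (assumed criterion)."""
    candidates = sorted(set(normal_errors) | set(anomalous_errors))
    return min(candidates,
               key=lambda t: sum(rates(normal_errors, anomalous_errors, t)))


# Hypothetical, overlapping validation errors for the two classes.
normal = [0.1, 0.2, 0.3, 0.4, 0.6]
anomalous = [0.35, 0.5, 0.7, 0.9]

threshold = best_threshold(normal, anomalous)
fnr, fpr = rates(normal, anomalous, threshold)
print(threshold, fnr, fpr)
```

With overlapping distributions, as in this toy example, no threshold gives low values for both rates at once, which mirrors the limitation described above.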
The test FNR is 27%, while the FPR is 37%. Therefore, while the AE works very well as a pre-processing step that enhances the anomalies, it cannot be efficiently used to detect and localize them. To tackle anomaly detection and localization, an additional classifier was trained to detect the anomalies. A modified version of the VGG16 network was used as the anomaly detector classifier. The following modifications were made to the original network structure: the input size was decreased to 160 × 160, the number of filters in the hidden layers was decreased, dropout layers were added in between the fully connected layers, and the final softmax layer was replaced with a sigmoid layer. A normalizing pre-processing layer was used to scale features between zero and one. The resulting CNN with 23 layers has 2,847,777 trainable parameters. The architecture is illustrated in figure 8, and a summary of the training data and parameters for the anomaly detector is given in table 2. In total, ∼10⁶ training patches were used.
During inference, the expected normal-to-anomalous patch ratio is approximately 1,900. Such an imbalanced data set can lead to poor performance of the classifier; therefore, data augmentation and under-sampling of the normal class were applied to the training data set to bring this ratio to around 100. The anomalous patches were augmented by applying a random, uniform brightness change in the range 0.75 to 1.25, rotations by multiples of 90 degrees, and horizontal and vertical flipping. In addition, the focal loss [28], a dynamically weighted binary cross-entropy loss commonly used with imbalanced training data sets, was used as the loss function with default parameters γ = 2 and α = 0.25.
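The augmentation and the loss described above can be sketched in plain Python. The binary focal loss follows its usual definition, FL = −α_t (1 − p_t)^γ log(p_t); the augmentation function, the 50% flip probabilities and the use of a single brightness factor per patch are assumptions made for illustration, not details taken from the training code of this work.

```python
import math
import random
from typing import List

Patch = List[List[float]]


def augment(patch: Patch, rng: random.Random) -> Patch:
    """Random 90-degree rotations, flips and a uniform brightness change."""
    for _ in range(rng.randrange(4)):            # 0 to 3 quarter turns
        patch = [list(row) for row in zip(*patch[::-1])]  # rotate clockwise
    if rng.random() < 0.5:                       # horizontal flip (assumed p)
        patch = [row[::-1] for row in patch]
    if rng.random() < 0.5:                       # vertical flip (assumed p)
        patch = patch[::-1]
    factor = rng.uniform(0.75, 1.25)             # one factor for all pixels
    return [[value * factor for value in row] for row in patch]


def binary_focal_loss(y_true: int, p: float,
                      gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Focal loss for one sample; p is the predicted anomaly probability."""
    p_t = p if y_true == 1 else 1.0 - p
    alpha_t = alpha if y_true == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


rng = random.Random(0)
patch = [[1.0, 2.0], [3.0, 4.0]]
augmented = augment(patch, rng)

easy_normal = binary_focal_loss(y_true=0, p=0.1)     # confidently correct
hard_anomalous = binary_focal_loss(y_true=1, p=0.1)  # missed anomaly
print(augmented)
print(easy_normal, hard_anomalous)
```

The factor (1 − p_t)^γ suppresses the contribution of well-classified samples, so the abundant easy normal patches contribute far less to the loss than a missed anomaly.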

Validation
In production, the pre-selected whole images are shown to an inspector. In addition, a fraction (10%) of the normal images is added to the set shown to the inspector to ensure the minimization of false negatives. The inspector either accepts, rejects or adds anomalous patches to validate the predictions. If only pre-selected images were shown, there would be a natural tendency to approve all images as anomalous and trust the PSA. The validated data set is used as ground truth for performance monitoring and for continuous learning to incrementally improve the accuracy of the PSA.
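A minimal sketch of how the set shown to the inspector could be assembled is given below. The function name, the use of image identifiers, the random sampling with a fixed seed and the toy numbers are assumptions made for illustration; only the 10% fraction comes from the text.

```python
import random
from typing import List, Set


def build_review_set(image_ids: List[int], preselected: Set[int],
                     fraction: float = 0.10, seed: int = 0) -> List[int]:
    """Pre-selected images plus a random fraction of the remaining ones."""
    rng = random.Random(seed)
    not_selected = [i for i in image_ids if i not in preselected]
    n_extra = round(fraction * len(not_selected))
    return sorted(preselected) + rng.sample(not_selected, n_extra)


# Hypothetical scan of 100 images, 20 of which were pre-selected.
image_ids = list(range(100))
preselected = set(range(20))
review = build_review_set(image_ids, preselected)
print(len(review))
```

Any anomaly the inspector finds among the added images is a false negative of the pre-selection, which is what makes this sample useful for monitoring.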

Results
Fifteen sensor scans acquired after the deployment of the PSA, corresponding to 2,052,240 patches from 5,030 whole images, were manually given ground truth labels.
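As a quick consistency check of these numbers, the patch count follows from the default grid of 17 × 24 patches per whole image:

```latex
\begin{align*}
N_\text{patches per image} &= 17 \times 24 = 408, \\
N_\text{patches} &= 5\,030 \times 408 = 2\,052\,240 .
\end{align*}
```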
The absolute and normalized confusion matrices for the patches are shown in figure 9 (left). Due to data imbalance, the classification threshold must be set so that the FPR for patches is significantly lower than the FNR to achieve an acceptable FPR for the whole images. A more relevant evaluation metric for the PSA is its performance on whole images. A whole image is pre-selected if one or more patches are classified as anomalous. The confusion matrices for the whole images are shown in figure 9 (right), and evaluation metrics are reported in table 3. Twelve anomalous whole images are incorrectly not pre-selected, resulting in an FNR of 2.5%. Two missed images were of novel anomalies and ten were of light scratches and small dust particles. Examples of missed anomalous whole images are shown in figure 10. The FNR of whole images is lower than the FNR of patches because most anomalous whole images have multiple anomalous patches. Thus, as long as at least one of the anomalous patches is classified correctly, the whole image is selected. The FPR of whole images is less than 10%. Examples of whole images pre-selected as anomalous are shown in figure 11, where images (d-f) are false positives. In total, 85% of all whole images can be considered normal and do not require any human inspection.
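The whole-image decision rule, and the reason the patch-level FPR has to be much lower than the image-level one, can be sketched as follows. The 0.5 score threshold and the assumption that patch errors are independent are illustrative simplifications, not statements from this work; 408 is the number of patches in the default 17 × 24 grid.

```python
from typing import Iterable


def is_preselected(patch_scores: Iterable[float],
                   threshold: float = 0.5) -> bool:
    """A whole image is pre-selected if any patch is classified as anomalous."""
    return any(score >= threshold for score in patch_scores)


def image_level_fpr(patch_fpr: float, n_patches: int = 408) -> float:
    """P(at least one false positive among n normal patches), if independent."""
    return 1.0 - (1.0 - patch_fpr) ** n_patches


def image_level_fnr(patch_fnr: float, n_anomalous_patches: int) -> float:
    """P(all anomalous patches of an image are missed), if independent."""
    return patch_fnr ** n_anomalous_patches


print(is_preselected([0.1, 0.9, 0.2]), is_preselected([0.1, 0.2]))
print(image_level_fpr(2.58e-4))   # close to the 10% upper limit
print(image_level_fnr(0.2, 3))    # three anomalous patches in one image
```

Under these assumptions, a patch-level FPR of roughly 0.03% already gives an image-level FPR near 10%, while an image with several anomalous patches is missed only if every one of them is missed.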

False negatives and positives
It was observed that false negatives can sometimes be attributed to the fixed grid. If only the default grid is used during inference, anomalies overlapping with or close to the grid lines can be missed. A proposed method for combating this is applying a secondary grid, which is illustrated in figure 3. The secondary grid is 16 × 23 patches in size, and it is overlaid on top of the default grid so that the patching is shifted from the top left corner by 80 pixels in both directions. Only the whole images not selected using the default grid would then be evaluated with the additional secondary grid.
The anomaly detector tends to select patches at a guard ring §, cell numbers and cell borders as false positives, due to the increase in AE reconstruction error in these areas. In addition, contact marks are a common source of false positives, see (e) in figure 11. With the proposed approach, where the detection is invariant to the anomaly location on the sensor, these cannot be eliminated.
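The geometry of the two grids and the two-stage evaluation can be sketched as below. The image size of 2720 × 3840 pixels is the size covered by a 17 × 24 grid of 160 × 160 patches; the toy detector, which only fires when an anomaly lies at least 40 pixels inside a patch, and all function names are hypothetical stand-ins used to mimic anomalies missed near grid lines.

```python
from typing import Callable, List, Tuple

PATCH = 160
HEIGHT, WIDTH = 2720, 3840  # covered exactly by 17 x 24 patches


def grid_origins(height: int, width: int, patch: int = PATCH,
                 offset: int = 0) -> List[Tuple[int, int]]:
    """Top-left corners of the patches of a grid shifted by `offset` pixels."""
    return [(top, left)
            for top in range(offset, height - patch + 1, patch)
            for left in range(offset, width - patch + 1, patch)]


def is_selected(patch_is_anomalous: Callable[[int, int], bool],
                height: int = HEIGHT, width: int = WIDTH) -> bool:
    """Default grid first; the secondary grid only if nothing was flagged."""
    for offset in (0, PATCH // 2):
        if any(patch_is_anomalous(top, left)
               for top, left in grid_origins(height, width, offset=offset)):
            return True
    return False


def make_toy_detector(anomaly_row: int, anomaly_col: int,
                      margin: int = 40) -> Callable[[int, int], bool]:
    """Hypothetical detector that misses anomalies close to patch borders."""
    def detector(top: int, left: int) -> bool:
        return (margin <= anomaly_row - top < PATCH - margin
                and margin <= anomaly_col - left < PATCH - margin)
    return detector


default_grid = grid_origins(HEIGHT, WIDTH)
secondary_grid = grid_origins(HEIGHT, WIDTH, offset=80)
print(len(default_grid), len(secondary_grid))

detector = make_toy_detector(anomaly_row=160, anomaly_col=160)  # on a grid line
found_by_default = any(detector(t, l) for t, l in default_grid)
print(found_by_default, is_selected(detector))
```

In this toy case the anomaly sits on a corner of the default grid and is missed there, but it falls in the center of a patch of the shifted grid and is recovered.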

Efficiency
Average inference times for a whole image are reported in table 4, with both the default grid and the secondary grid. An average sensor scan consisted of 335 images, for which the picture-taking time is 9.2 minutes. On average, an inspector inspects a whole image in two seconds, evaluating one scan in 11 minutes. On an RTX A2000, the inference time is less than 6 minutes for the scan with both grids. The GPU is necessary, as the short run time allows parallel picture-taking and pre-selection. In contrast, the evaluation of a scan would take approximately 45 minutes on a regular CPU, significantly delaying the subsequent electrical characterization.
Table 4. Average inference times on a GPU or a CPU for whole images using the default and secondary grids.
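The manual inspection time quoted above, and a rough estimate of the time saved per scan, follow from simple arithmetic. The estimate assumes that the 85% of images not pre-selected need no inspection at all, and it neglects the 10% of normal images that are added for validation.

```latex
\begin{align*}
t_\text{manual}    &= 335 \times 2\,\text{s} = 670\,\text{s} \approx 11.2\,\text{min}, \\
t_\text{remaining} &\approx (1 - 0.85) \times 11.2\,\text{min} \approx 1.7\,\text{min}, \\
t_\text{saved}     &\approx 0.85 \times 11.2\,\text{min} \approx 9.5\,\text{min}.
\end{align*}
```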

Continuous learning
Given that new data will be measured continuously, the PSA should be retrained to adapt to the new data. Because the purpose of the AE is to reconstruct anomalies poorly, training it with additional normal images would not improve it, so it does not require retraining. The background detector is also considered robust enough not to require immediate retraining. However, the accuracy of the anomaly detector could be improved via training with novel anomalous patches. For example, training instances such as the missed anomaly (d) in figure 10 did not exist in the original training data set.
Anomalous images acquired after the initial deployment were given ground truth labels to extend the training data set. The anomaly detector was retrained starting from randomly initialized weights with 2.1 times more anomalous patches than the original model. An independent test data set was used to compare the original and retrained models. The test data consist of 136 anomalous whole images and 754 normal whole images, for which the test metrics are presented in table 5. The retrained model performs significantly better on the test set.
§ A guard ring is a structure at the periphery of a sensor designed to protect it from currents from the cutting edge. It is shown in figure 11 (f) as the black line with false positives.
Table 5. The performance of the pre-selection algorithm using the original and retrained anomaly detectors. Metrics are calculated for whole images using an independent test set.

Conclusion
A deep learning-based pre-selection algorithm (PSA) that fully automates the visual inspection (VI) of the silicon sensors produced for the construction of the CMS HGCAL detector was developed. An ensemble of a deep convolutional autoencoder and a neural network for classification is used, with patching applied before the classification to allow the general localization of the anomalies in the images of the sensors. The automated VI is a vital part of the quality control (QC) of the analyzed sensors. The performance of the PSA was evaluated using fifteen full sensor scans acquired in production in a clean room dedicated to sensor testing at CERN. The recall, which measures the fraction of anomalous images that are found, was 97.46%, with an acceptable FPR of 6.25%. The images are evaluated in real time, and approximately 85% of all images can be discarded as normal, thus removing the need for human labor to inspect them. The developed automated VI is standardized, and therefore also believed to be less biased by the subjectivity of a human inspector. On average, it saves 10 minutes of resources per sensor, and for each batch of a hundred sensors, this corresponds to 17 fewer person-hours required to manually inspect the images.
The accuracy is considered sufficient for deployment, even though the PSA was shown to fail to select a small fraction of images with light and small scratches and dust particles, in addition to novel types of anomalies. It was demonstrated that as more anomalous images are acquired in production, the data can be used to retrain the anomaly detection model to further improve the accuracy of the PSA.
A major advantage of the presented approach is its intrinsic generality. The algorithm acts on microscope images of the sensor surface, each of which covers only a small fraction of the total surface area of a full 8" sensor, and the anomaly detection is invariant to the image location on the sensor. Thus, the PSA is applicable to variable or incomplete scans, or partial sensors.
Thanks to its generality, accuracy and speed, the presented architecture of the pre-selection and annotation model could be used in other applications of automating the detection of small anomalies from images taken in a changing environment.

Figure 1. (left) An HGCAL silicon sensor wafer, with a pad zoomed in. The diameter of the sensor is 8". (right) An example of a scan map.

Figure 2. Scan images in RGB format for anomalous sensors with a scratch (left), and a dust particle (right). The difference in color is induced by lighting conditions.

Figure 3. The inference pipeline of the anomaly detection. The input image x is processed with an autoencoder, and the pixel-wise reconstruction error D is calculated between the input and the autoencoder output x̂. D is given as input to the anomaly detecting classifier in patches.

Patching
The whole images are split into patches using a fixed grid. The patches are 160 × 160 pixels in size, and the default grid covers the entire image, resulting in 17 × 24 patches. The patching can be considered as a simplified version of the region proposal for R-CNN. For training the background and anomaly detectors, the patches are given the corresponding binary labels: 0 for sensor surface/normal and 1 for background/anomalous. Examples of anomalous and normal patches processed with the AE are shown in figure 4.

Figure 4. Sample of anomalous (top) and normal (bottom) patches of the sensor surface used to train the anomaly detector.

Figure 5. An illustration of the dimensions of the autoencoder, which consists of two convolutional neural networks: the encoder and the decoder. The autoencoder takes a whole image as an input, and the compression factor into the latent space is 1,600.

Figure 7. Normalized autoencoder reconstruction error for anomalous and normal patches. A threshold to classify new samples is also illustrated.

Figure 8. An illustration of the architecture of the anomaly detector, which is a modified version of the VGG16 network. Figure note: dropout with rate 0.2.

Figure 9. Confusion matrices for the patches (left) and the whole images (right). The top row corresponds to the confusion matrix normalized over ground truths.

Figure 10. Examples of missed anomalous whole images: (a) a light stain, (b, c) a light dust particle, (d) a novel black anomaly, (e, f) a minor dust particle.

Figure 11. Examples of whole images pre-selected with the annotated patches: (a) a large scratch, (b) a dust particle and a stain, (c) a small dust particle, (d) a false positive, (e) a false positive on contact marks, (f) a false positive on a guard ring.

Table 1. Summary of training data and parameters for the background detector and the autoencoder. For the background detector, Class 0 refers to sensor surface and Class 1 to background.

Table 2. Summary of training data and parameters for the anomaly detector. Class 0 refers to normal and Class 1 to anomalous.