Outlier Detection Based on Autoencoder Ensembles with Denoising Layer and Attention Mechanism

In the field of outlier detection, two challenges have persisted. First, outlier detection datasets are often small, which can lead to overfitting when deep learning models such as autoencoders are used. Second, as the dimensionality of datasets increases, many dimensions may be irrelevant or noisy, which degrades the model's ability to learn meaningful features; this phenomenon is known as "the curse of dimensionality." To address these challenges, this study proposes an ensemble of autoencoders with denoising layers to mitigate overfitting. Additionally, a novel attention mechanism is introduced to predict the importance of each feature, thereby addressing the curse-of-dimensionality problem. The proposed approach is evaluated on five datasets, including BreastW and Vowels, and compared with existing methods. Experimental results demonstrate that the proposed method outperforms existing methods on four of the five datasets, showcasing its effectiveness.


Introduction
With the rapid development of information technology, more and more industries are introducing computer-based scientific management for important data. Alongside continuous global economic growth, there has been significant progress in various fields, accompanied by an explosive growth of data. While people have access to vast and complex datasets, extracting truly valuable information from them has become a major challenge. As a result, there is a new demand for faster and more accurate extraction of important information from these large and complex datasets [1]. Outlier detection, as a subfield of data mining, aims to swiftly and accurately identify exceptional samples from a large set of normal samples that may contain a small number of anomalies [2]. In many cases, people consider this small portion of exceptional data as noise while regarding normal data as the signal. However, the concepts of "noise" and "signal" are not absolute, as the meaning conveyed by exceptional data is no less significant than that of normal data. For example, in the context of financial fraud, exceptional data represents fraudulent transactions hidden among numerous legitimate transactions [3]. Early outlier detection methods based on statistical approaches relied on expert knowledge.

A representative and widely used method is Isolation Forest [14], in which outliers are defined as objects that can be easily isolated. It uses a simple yet efficient procedure: the algorithm randomly selects features and then divides the data into two groups based on these features, repeating this process until all data points are partitioned into isolated groups. Under this definition, outliers are typically separated into isolated states earlier, which means they usually have shorter paths. The greatest advantage of Isolation Forest is that it requires no labeled training data and has low time complexity. However, when performing outlier detection on high-dimensional datasets, it may exhibit unstable performance due to its random feature selection.
Thanks to the advancements in deep learning, there has been an increasing number of outlier detection methods based on autoencoders. These methods inherit the advantages of deep learning, such as strong performance, simplicity of design, and the ability to uncover intrinsic features of the data. In general, outlier detection can be performed directly using a single autoencoder [15]. The basic principle is as follows: the autoencoder is first trained on normal samples, and the trained model is then used to discriminate the test set, with the reconstruction error serving as the outlier score for each point. Since the autoencoder is trained on normal samples, the reconstruction error tends to be small for normal samples and large for abnormal ones. The limitations of this approach mainly lie in its susceptibility to overfitting on small to medium-sized datasets and its lack of any remedy for the curse of dimensionality. AAE (Active Autoencoder) is a novel method that improves upon the basic autoencoder [16]. AAE enhances the performance of autoencoders by incorporating influence-based active learning and utilizing an expanded-contracted operator to modify sample weights. By employing these techniques, AAE achieves better performance than traditional autoencoders.
In addition, RandNet needs to be introduced separately because the method proposed in this paper builds upon it [7]. RandNet is also an autoencoder-based method, and it contributes in two main respects. First, it replaces the fully connected autoencoder with a randomly connected autoencoder to prevent overfitting. Second, it uses an ensemble of autoencoders for outlier detection to compensate for the limitations of individual autoencoders. RandNet additionally incorporates techniques such as random-sample training and an adaptive learning rate. The method proposed in this paper improves upon RandNet in two aspects: enhancing the structure of the autoencoder and incorporating an attention mechanism.

Autoencoder with denoising layer
To address the challenges of overfitting and the need for diversity among base learners in ensemble learning, RandNet uses randomly connected autoencoders instead of fully connected ones. Although this approach sacrifices some performance by discarding certain connections, it helps reduce the risk of overfitting. Building on this idea, the proposed method further improves RandNet by introducing an additional denoising layer alongside the randomly connected autoencoder. This denoising layer serves two purposes. First, it adds Gaussian noise to the training data, which acts as a constraint on the base learners, preventing them from overfitting and encouraging them to learn the most important features. Second, adding Gaussian noise to the training data increases the diversity of the data; combined with random sampling during the training of each base learner, this satisfies the diversity requirement for base learners in ensemble learning. Moreover, the modified model no longer requires a high connection dropout rate, as it balances avoiding overfitting against sacrificing a minimal amount of performance. The structure of the autoencoder with the added denoising layer is illustrated in Figure 1.
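As an illustration only (not the authors' implementation), the random connectivity can be sketched in PyTorch as a linear layer whose weight matrix is multiplied by a fixed binary mask sampled once at construction; the keep probability is a hypothetical parameter:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RandomlyConnectedLinear(nn.Module):
    """Linear layer with a fixed random binary mask over its weights,
    emulating a randomly connected (rather than fully connected) layer."""
    def __init__(self, in_features, out_features, keep_prob=0.8):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # The mask is sampled once and frozen: dropped connections stay dropped,
        # unlike dropout, which resamples the mask at every forward pass.
        mask = (torch.rand(out_features, in_features) < keep_prob).float()
        self.register_buffer("mask", mask)

    def forward(self, x):
        return F.linear(x, self.linear.weight * self.mask, self.linear.bias)
```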
During training, the data first passes through the denoising layer, where random noise is added to simulate the presence of noise in the data. To mimic real-world scenarios, the noise is generated from a Gaussian model; unlike conventional denoising autoencoders, which simply set certain dimensions of the training data to zero, this corruption is more realistic. The noise addition is given by Equation (1).

$$X_{noise} = X + nf \cdot randn \tag{1}$$

Here $X \in \mathbb{R}^d$ represents the original data, $randn \in \mathbb{R}^d$ is randomly generated Gaussian noise, $X_{noise} \in \mathbb{R}^d$ represents the data after adding noise, and $nf$ is the noise factor used to control the magnitude of the added noise. A smaller $nf$ means the noise has less impact on the training data, while a larger $nf$ means a greater impact. $nf$ is an adjustable hyperparameter, and selecting an appropriate $nf$ plays a crucial role in the model's performance.
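A minimal NumPy sketch of Equation (1), assuming data already scaled to [0, 1] as in the experiments:

```python
import numpy as np

def add_noise(X, nf=0.1, seed=0):
    """Equation (1): X_noise = X + nf * randn, where randn is standard
    Gaussian noise of the same shape as X and nf scales its magnitude."""
    rng = np.random.default_rng(seed)
    return X + nf * rng.standard_normal(X.shape)

X = np.array([[0.20, 0.80], [0.50, 0.10]])
X_noise = add_noise(X, nf=0.1)  # nf = 1e-1 perturbs [0, 1]-scaled values by ~10%
```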

$$L\left(X, \hat{X}\right) = \sum_{i=1}^{d} \left(X_i - \hat{X}_i\right)^2 \tag{2}$$

The loss function of the autoencoder is the reconstruction error, generally calculated as the Mean Squared Error (MSE). The reconstruction error measures the difference between an input sample and its reconstructed output. The loss calculation for the autoencoder in this method is shown in Equation (2), where $\hat{X} \in \mathbb{R}^d$ represents the output. Although the input is the noise-corrupted data, the reconstruction error is still calculated with respect to the original data.
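A sketch of this loss in PyTorch (an assumed framework; `model` stands for any autoencoder): the corrupted input is reconstructed, but the error is measured against the clean data:

```python
import torch

def denoising_reconstruction_loss(model, X_clean, nf=0.1):
    """Equation (2): MSE between the reconstruction of the *noisy* input
    and the *original* (clean) data."""
    X_noisy = X_clean + nf * torch.randn_like(X_clean)  # Equation (1)
    X_hat = model(X_noisy)
    return torch.mean((X_hat - X_clean) ** 2)
```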

Attention mechanism module (FE block)
To mitigate the impact of the curse of dimensionality on the model, this paper draws inspiration from the method proposed by Hu et al. [17] and introduces an attention mechanism module that calculates the importance of the features extracted by the autoencoder. The attention mechanism evaluates and assigns weights to the individual features within each feature group, then enhances or diminishes the original features according to those weights. This further strengthens important features while reducing the influence of less important ones; the module is therefore referred to as the Feature Enhancement block (FE block). Figure 2 illustrates the calculation process of the FE block. As shown in Figure 2, the first step is to transform the input feature $X \in \mathbb{R}^d$ through the function $F_{tr}$ into the feature $U \in \mathbb{R}^d$, as Equation (3) shows:

$$U = F_{tr}(X) \tag{3}$$
The next step is to calculate the weight of each feature using $F_{ex}(\cdot, W)$. This part consists of two fully connected (FC) layers; the network structure of $F_{ex}(\cdot, W)$ is illustrated in Figure 3.
After the weights are obtained, they are multiplied back onto the features. In Equation (5), $X_{\otimes} \in \mathbb{R}^d$ represents the features after feature enhancement, that is, the new features obtained by multiplying the weights $s$ onto the features $U$.

$$X_F = X_{\otimes} + X \tag{6}$$

The new features $X_{\otimes}$ are added to the original features $X$ through a residual connection, resulting in the final features $X_F \in \mathbb{R}^d$; Equation (6) illustrates this process. Using a residual connection instead of taking $X_{\otimes}$ directly as the final features serves two purposes. First, it facilitates network training: at the beginning of training, the weights $W$ can be set to 0, which is equivalent to $X_F = X$, allowing for effective training. Second, it prevents network degradation: the residual connection breaks the symmetry in the neural network, ensuring that even if training a certain layer brings no improvement, the network's performance does not degrade.
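A minimal PyTorch sketch of the FE block following Equations (3) to (6); $F_{tr}$ is taken as the identity here, and the exact layer sizes are assumptions based on Figure 3:

```python
import torch
import torch.nn as nn

class FEBlock(nn.Module):
    """Sketch of the FE block: score each feature with a small network
    ending in a Sigmoid, rescale the features, and add a residual connection.
    Every layer keeps the input dimensionality, as described in the text."""
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.fc3 = nn.Linear(dim, dim)  # weights of the final "Sigmoid layer"

    def forward(self, x):
        u = x                            # Equation (3) with F_tr as identity (assumption)
        h = torch.relu(self.fc1(u))      # Equation (4): first FC layer
        h = torch.relu(self.fc2(h))      # Equation (4): second FC layer
        s = torch.sigmoid(self.fc3(h))   # feature weights s in [0, 1]
        x_enh = s * u                    # Equation (5): feature enhancement
        return x_enh + x                 # Equation (6): residual connection
```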
Owing to the modular design of the FE block, it can be added to each network layer of the autoencoder; Figure 4 illustrates the autoencoder with the FE block included. During the testing phase, the test data is not corrupted with noise and the denoising layer is not involved; the reconstruction error of the autoencoder is used as the outlier score for each point. The framework of the model during the testing phase is illustrated in Figure 6.

Figure 6. The model framework during the testing phase.

Outlier scoring
During the testing phase, each point in the test set is evaluated on the ensemble of autoencoders, and its reconstruction error is calculated as its outlier score. Assume there are $m$ autoencoders in the ensemble and the test set consists of $n$ points, each with $d$ dimensions. The $j$-th sample fed to the $i$-th autoencoder is denoted as $X_j \in \mathbb{R}^d$, and its corresponding output is denoted as $\hat{X}_j^{(i)} \in \mathbb{R}^d$. As a result, for the $i$-th autoencoder we obtain a vector of outlier scores, $OS^{(i)}$, as shown in Equation (7):

$$OS^{(i)} = \left[os_1^{(i)}, \ldots, os_n^{(i)}\right], \quad os_j^{(i)} = \sum_{k=1}^{d}\left(X_{j,k} - \hat{X}_{j,k}^{(i)}\right)^2 \tag{7}$$
The final outlier score of a sample is the median of its reconstruction errors across all autoencoders. Figure 7 illustrates the scoring process.
Figure 7. Example of the outlier score vector and final outlier score computation.
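A sketch of the ensemble scoring under assumed interfaces: `reconstruct` is a hypothetical method returning each autoencoder's output for the test set, not an API from the paper:

```python
import numpy as np

def ensemble_outlier_scores(autoencoders, X_test):
    """Per-autoencoder reconstruction errors as in Equation (7), combined
    by taking the median across the ensemble (Figure 7)."""
    # scores[i, j] = squared reconstruction error of sample j under autoencoder i
    scores = np.stack([
        np.sum((ae.reconstruct(X_test) - X_test) ** 2, axis=1)  # hypothetical API
        for ae in autoencoders
    ])
    return np.median(scores, axis=0)  # final outlier score for each sample
```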

Dataset description and evaluation metrics
The datasets used in the following experiments are publicly available datasets from UCI [18]. Table 1 summarizes their basic information. The majority of points in these datasets are labeled as normal, while a small portion are labeled as outliers, so there is a significant class imbalance between normal points and outliers. Additionally, all datasets have been standardized, with values scaled to the range [0, 1].

The BreastW dataset records breast cancer cases classified into two classes, benign and malignant; the malignant class is considered the outlier class, while the benign class is considered normal. The Vowels dataset is a multivariate time series dataset that primarily consists of time series of the vowel /ae/. Each time series represents a sentence of length 7 to 29, with twelve features per pronunciation. For the outlier detection task, each frame in the training data is treated as an individual data point, and classes 6, 7, and 8 are considered outliers. The WBC dataset records measurements from breast cancer cases, classified into benign and malignant; the malignant class is downsampled to 21 points and treated as outliers. The Glass dataset is a glass identification dataset that primarily records information about different types of glass found at crime scenes to assist forensic investigations; class 6 has significantly fewer instances than the other classes and is labeled as the outlier class. The Wine dataset records the chemical analysis results of different varieties of wine, covering the concentrations of 13 different components; the first class is downsampled to 10 instances and treated as outliers.
In the field of outlier detection, the AUC (Area Under the Curve) score is commonly used to assess the performance of a method. AUC is a metric for evaluating binary classifiers, computed as the area under the Receiver Operating Characteristic (ROC) curve, and it ranges between 0 and 1. A perfect classifier has an AUC of 1, while a random classifier has an AUC of approximately 0.5. A higher AUC indicates that the model distinguishes positive from negative samples more accurately, whereas an AUC close to 0.5 indicates that the model struggles to separate them.
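For instance, with scikit-learn the AUC can be computed directly from ground-truth labels and outlier scores (toy values for illustration only):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 0, 1])                    # 1 marks an outlier
scores = np.array([0.10, 0.20, 0.15, 0.90, 0.30, 0.70])  # higher = more anomalous
print(roc_auc_score(y_true, scores))  # 1.0: every outlier outranks every normal point
```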

Ablation experiments
To test the effects of the denoising layer and the attention mechanism module (FE block), ablation experiments were conducted on five datasets: BreastW, Vowels, WBC, Glass, and Wine. In this experiment, RandNet served as the baseline model; the model with the added denoising layer is referred to as Rand-DL Net, and the model with both the denoising layer and the FE block is referred to as FE Net. The experimental results are summarized in Table 2 and Figure 8. On the five datasets, Rand-DL Net, with the denoising layer added, showed an average improvement of 1.4% over the baseline model. FE Net, with both the denoising layer and the FE block, further improved performance by an average of 1.7% over Rand-DL Net, and by an average of 3.3% over the baseline model. These results indicate that the denoising layer allows the autoencoder to extract more important features, while the FE block enhances the impact of useful features and suppresses the influence of irrelevant ones.

Comparison experiments with other methods
To further validate the effectiveness of the proposed model, comparative experiments were conducted on the aforementioned five datasets against existing methods. The following methods were included in the comparison: COF [11], IForest [14], AE [15], RandNet [7], and AAE [16].
For COF, the hyperparameters were set as follows: the number of nearest neighbors (n_neighbors) was set to 20, and the contamination ratio (contamination) for outlier points was set to 0.1. For IForest, the number of isolation trees (n_estimators) was set to 100; the maximum number of samples (max_samples) used in building each tree was set to min(256, n_sample), where n_sample is the number of samples in the dataset; the maximum number of features (max_features) considered for each tree was set to min(64, n_features), where n_features is the dimensionality of the dataset; and the contamination ratio was set to 0.1. For AE, the number of layers (n_layer) in the neural network was set to 7, and the ratio of the number of nodes between adjacent layers was set to 0.5. For RandNet, the number of autoencoders (n_ae) was set to 100 and the number of layers was set to 9. For AAE, the number of layers was set to 7. For FE Net, the number of autoencoders (n_ae) was set to 100, the number of layers was set to 9, and the noise factor (nf) was set to $10^{-1}$. These hyperparameter settings were used for the respective methods in the comparative experiments on the five datasets. Table 3 presents the AUC scores of each method on the five datasets. The experimental results demonstrate that the proposed method achieves the best performance on four datasets. We also observe that the COF algorithm performs best on the Glass dataset, while the various autoencoder-based methods show moderate performance there; this may be attributed to the characteristics of that dataset's distribution. Overall, the proposed method outperforms the other methods on outlier detection tasks, indicating its superior performance.
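As a sketch, the COF and IForest baselines with these settings could be instantiated via PyOD and scikit-learn (assuming those libraries; the dataset below is a placeholder, and the autoencoder-based baselines are omitted):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from pyod.models.cof import COF

rng = np.random.default_rng(0)
X = rng.random((500, 30))             # placeholder for a standardized dataset
n_sample, n_features = X.shape

iforest = IsolationForest(
    n_estimators=100,                 # number of isolation trees
    max_samples=min(256, n_sample),   # samples used to build each tree
    max_features=min(64, n_features), # features considered per tree
    contamination=0.1,                # assumed outlier ratio
).fit(X)

cof = COF(n_neighbors=20, contamination=0.1).fit(X)
```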

Hyperparametric analysis
Because this method is an improvement on RandNet, we do not revisit the hyperparameters already discussed for RandNet. This section mainly analyzes the impact of the noise factor $nf$ in the denoising layer on the model's performance. Table 4 and Figure 9 show how the AUC score on the Wine dataset changes for different values of $nf$. The line plot shows a trend of initially decreasing, then increasing, followed by another decrease, and finally stabilizing. As $nf$ increases toward $10^0$ (i.e., 1), the model's AUC score decreases, because a large $nf$ excessively alters the original distribution of the dataset, degrading model performance. The model achieves its highest AUC score at $nf = 10^{-1}$: since the dataset has been normalized to the range [0, 1] and the noise follows a Gaussian distribution, $nf = 10^{-1}$ changes the data values by roughly 10%, striking a balance between excessive and insufficient alteration of the dataset. As $nf$ decreases from $10^{-2}$ to $10^{-5}$, the model's performance first declines and then gradually stabilizes, eventually matching the performance without the denoising layer ($nf = 0$); a very small $nf$ has minimal impact on the dataset, so the model behaves as if the denoising layer were absent. In conclusion, choosing an appropriate noise factor is crucial, and the experimental results indicate that the model performs best at $nf = 10^{-1}$.
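A sketch of such a sweep; `train_fe_net_and_score` is a hypothetical stand-in, not a function from this paper, assumed to train the ensemble with a given noise factor and return its test AUC:

```python
# Hypothetical sweep over the noise factor nf, mirroring the analysis above.
for nf in [0.0, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e0]:
    auc = train_fe_net_and_score(nf=nf)  # hypothetical helper
    print(f"nf = {nf:g}: AUC = {auc:.3f}")
```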

Conclusion
This paper proposes a novel outlier detection method based on an ensemble of autoencoders with a denoising layer and an attention mechanism. The method mitigates the overfitting that commonly occurs when autoencoders are used for outlier detection on small to medium-sized datasets, and the attention mechanism addresses the curse-of-dimensionality problem. We conducted ablation experiments, comparative experiments against existing methods, and hyperparameter experiments on five publicly available datasets; the evaluation results show that the proposed method outperforms the others. Although the proposed method improves outlier detection performance, it also introduces additional parameters that must be trained. In the future, reducing the number of parameters while maintaining model performance could improve the efficiency of outlier detection tasks.

Figure 1. Structure of the autoencoder with the denoising layer.

Figure 2. Diagram of the FE block.

Figure 3. Network architecture of $F_{ex}(\cdot, W)$. Each layer in $F_{ex}(\cdot, W)$ has the same number of input and output nodes as the input data size. The activation function is ReLU. To facilitate the enhancement or attenuation of features, the output of the last layer is passed through the Sigmoid function to map it to the range [0, 1]. Equation (4) gives the weight calculation:

$$s = F_{ex}(U, W) = \sigma\left(W_3\,\delta\left(W_2\,\delta\left(W_1 U\right)\right)\right) \tag{4}$$

where $W_1 \in \mathbb{R}^{d \times d}$, $W_2 \in \mathbb{R}^{d \times d}$, and $W_3 \in \mathbb{R}^{d \times d}$ represent the weights of the two fully connected layers and the Sigmoid layer, respectively, $\delta$ denotes the activation function, and $\sigma$ the Sigmoid function. According to Equation (4), given the input features $U \in \mathbb{R}^d$, the calculation of $F_{ex}(U, W)$ yields the weight scores of the features, $s \in \mathbb{R}^d$.

$$X_{\otimes} = s \otimes U \tag{5}$$

Figure 4. Autoencoder with FE block.

Model framework

In this section, we discuss the framework of the model during the training and testing phases. During the training phase, the training data is corrupted by adding noise, and the optimization objective is to minimize the reconstruction error of the autoencoder; the autoencoder employed during training includes the denoising layer. The framework of the model during the training phase is illustrated in Figure 5.

Figure 5. The model framework during the training phase.

Figure 9. The impact of the $nf$ value on the AUC score.

Table 1. Summary of the datasets.

Table 3. AUC scores of each method. The best AUC score is highlighted in bold.