Comparison of Robust Estimators for Detecting Outliers in Multivariate Datasets

Detecting outliers in multivariate data is difficult and cannot be done by visual inspection alone. The Mahalanobis distance (MD) is a classical method for detecting outliers in multivariate data. However, the classical mean and covariance matrix used in MD suffer from masking and swamping effects. Masking occurs when outliers are not identified, and swamping occurs when inliers are identified as outliers. Hence, robust estimators have been proposed to overcome these problems. In this study, the performance of a new robust estimator named Test on Covariance (TOC) is tested and compared with other robust estimators, namely Fast Minimum Covariance Determinant (FMCD), Minimum Vector Variance (MVV), Covariance Matrix Equality (CME) and Index Set Equality (ISE). The performance of these five robust estimators is tested on five real multivariate datasets: the Brain and Weight, Hawkins-Bradu-Kass, Stackloss, Bushfire and Milk datasets, which are well known in most outlier detection studies. Results show that TOC is able to detect outliers, does not suffer from a masking effect and performs as well as the other robust estimators on all datasets.


Introduction
The presence of outliers in multivariate data can invalidate classical multivariate analysis, mislead conclusions, make modeling difficult and distort estimates of the mean and covariance matrix. Outliers can easily be detected in univariate and bivariate data using graphical presentations. However, detection becomes difficult as the dimension increases [1,2].
One of the ways to detect multivariate outliers is to calculate the distance from each point to the center of the data. This approach is known as the distance-based method and is based on the Mahalanobis distance (MD) [3]. An outlier is then a point with a distance larger than some cut-off value. MD is one of the important tools for detecting outliers in multivariate data [4]. MD is given by

d_i(x̄, S) = √((x_i − x̄)ᵀ S⁻¹ (x_i − x̄)),   (1)

where d_i(x̄, S) is the MD of the i-th observation x_i, x̄ is the sample mean and S is the sample covariance matrix [3]. The x̄ and S used in equation (1) are the classical estimators of the mean and covariance matrix and are not robust; even a small portion of outliers will distort the estimates of x̄ and S.
MD thus depends on the sample mean and covariance matrix, which are subject to masking and swamping effects [3-5]. Masking occurs when some of the outliers are left unidentified (false negatives), and swamping occurs when non-outlying data are mistakenly identified as outliers (false positives) [6].
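Concretely, the classical MD of equation (1) can be sketched in a few lines of NumPy. This is an illustrative sketch only; the data and variable names are made up and not from the paper.

```python
# Sketch: classical Mahalanobis distances for a small synthetic dataset.
import numpy as np

def mahalanobis_distances(X):
    """Return the classical MD of each row of X from the sample mean."""
    xbar = X.mean(axis=0)                # classical sample mean
    S = np.cov(X, rowvar=False)          # classical covariance matrix
    S_inv = np.linalg.inv(S)
    diff = X - xbar
    # d_i = sqrt((x_i - xbar)^T S^{-1} (x_i - xbar))
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, S_inv, diff))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[0] += 10.0                             # plant one obvious outlier
d = mahalanobis_distances(X)
print(d.argmax())                        # the planted outlier has the largest MD
```

Note that because x̄ and S here are the classical, non-robust estimates, a cluster of several outliers could inflate S enough to mask one another, which is exactly the weakness the robust estimators below address.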
The outlier detection problems and the disadvantages of classical estimators in contaminated data have become a motivation to study robust methods for estimating mean and covariance matrix. A robust method is designed specifically to be resistant to outliers [3]. The robust method aims to lessen the effect of outliers and allow the majority of data to determine the result of the analysis [7]. A robust estimate of mean and covariance matrices are then replaced and used in MD and will yield robust MD or robust distance that is less sensitive to outliers [3,8].
Various robust estimators have been proposed and developed in previous studies, such as the S, M, MM, Minimum Volume Ellipsoid (MVE), Minimum Covariance Determinant (MCD) and Fast-MCD (FMCD) estimators. Among these, FMCD, developed by [9], is widely used because it possesses the desirable properties of robust estimators: it is affine equivariant, has a high breakdown point and a bounded influence function, and has lower computational complexity [2,10-12].
However, FMCD still has weaknesses, namely growing computational complexity as the dimension increases and singularity problems, since FMCD is based on the covariance determinant [2]. Therefore, in 2007, [2] proposed Minimum Vector Variance (MVV) to overcome the problems of FMCD. MVV avoids the singularity problem because its computation is based on the vector variance [2]. MVV was also found to have the same breakdown point as the MCD-based methods and lower computational time than FMCD; its covariance matrix does not need to be positive definite, and it can be applied to high-dimensional datasets [2].
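The contrast between the two scatter measures can be shown in a small sketch: the vector variance Tr(S²) remains informative even when the determinant vanishes for a singular covariance matrix. The matrix below is made up purely for illustration.

```python
# Sketch: vector variance Tr(S^2) vs. the determinant as a scatter measure.
import numpy as np

def vector_variance(S):
    """VV(S) = Tr(S @ S); for symmetric S this is the sum of squared entries."""
    return np.trace(S @ S)

S = np.array([[2.0, 1.0],
              [1.0, 0.5]])              # rank-deficient: det(S) = 0
print(np.linalg.det(S))                 # ~0 -> determinant is uninformative
print(vector_variance(S))               # 4 + 1 + 1 + 0.25 = 6.25
```

This is why an MVV-style criterion can still compare candidate subsets when the subset covariance matrix is singular or nearly so.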
Although MVV solves the problems faced by FMCD, its computation is still slow when the number of variables increases [13]. Hence, [14] proposed Covariance Matrix Equality (CME) and Index Set Equality (ISE). CME and ISE are tests of equality between two covariance structures, and both can produce robust mean and covariance matrices [15]. It was also found that ISE is simple to compute and performs better than FMCD, MVV and CME [15].
However, ISE does not involve any arithmetic computation on the covariance structures themselves, and the problem of testing the equality of two covariance structures remained open [13]. This motivated [16] to propose a test of equality between two covariance structures, naming the new robust estimator Test on Covariance (TOC). Details of TOC are discussed in [16].
A simulation study was carried out in [16] to investigate the performance of TOC. The results of [16] show that TOC is applicable and a promising approach for detecting outliers in multivariate data. Hence, in this study the performance of TOC is investigated further using real multivariate datasets: the Brain and Weight, Hawkins-Bradu-Kass, Stackloss, Bushfire and Milk datasets. These five datasets have been used in most of the multivariate outlier detection literature and have become benchmarks for measuring the performance of proposed methods. The performance of TOC is compared with the other robust estimators (FMCD, MVV, CME and ISE) in detecting outliers in these five datasets.

Robust Estimators
MVV, CME, ISE and TOC are modifications of the FMCD estimator. All these estimators differ only at Step 6 of the FMCD algorithm. The FMCD algorithm is given as follows [15].
Step 1: Select an arbitrary subset H_old containing h different observations, where h = ⌊(n + p + 1)/2⌋, p is the number of variables and n is the sample size.
Step 2: Compute the mean vector x̄_old and covariance matrix S_old of the observations in H_old.
Step 3: Compute the Mahalanobis distance of every observation based on x̄_old and S_old.
Step 4: Sort the distances in ascending order.
Step 5: Define H_new as the set of h observations with the smallest distances, and compute its mean vector x̄_new and covariance matrix S_new.
Step 6_FMCD: If det(S_new) < det(S_old), set H_old = H_new and go to Step 3. Otherwise, the process stops and (x̄_new, S_new) is taken as the robust estimate.
The procedures of MVV, CME and ISE are obtained by replacing Step 6_FMCD with the following steps, as given in [13,15].
Step 6_MVV: If Tr(S_new²) < Tr(S_old²), set H_old = H_new and go to Step 3. Otherwise, the process stops.
Step 6_CME: If S_new ≠ S_old, set H_old = H_new and go to Step 3. Otherwise, the process stops.
Step 6_ISE: If H_new ≠ H_old, set H_old = H_new and go to Step 3. Otherwise, the process stops.
A new robust estimator named TOC was proposed by [16]. The idea of TOC comes from CME and ISE, which test the equality of the covariance structures of the old subset and the new subset in the algorithm. The equality of the two covariance structures is tested using the test statistic in equation (2), where p is the number of variables, as given in [17].
Step 6 for TOC is given below.
Step 6_TOC: If the test in equation (2) rejects the equality of S_new and S_old, set H_old = H_new and go to Step 3. Otherwise, the process stops.
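The iteration described above can be sketched in code, assuming the standard FMCD concentration-step (C-step) structure with the determinant rule in Step 6; the other estimators swap only that stopping condition. This is a rough sketch on synthetic data, not the authors' implementation.

```python
# Sketch of an FMCD-style concentration iteration (C-step).
import numpy as np

def c_step_estimate(X, seed=0):
    n, p = X.shape
    h = (n + p + 1) // 2                      # Step 1: subset size h
    rng = np.random.default_rng(seed)
    H = rng.choice(n, size=h, replace=False)  # arbitrary starting subset
    crit_old = np.inf
    while True:
        xbar = X[H].mean(axis=0)              # Step 2: subset mean
        S = np.cov(X[H], rowvar=False)        #         and covariance matrix
        diff = X - xbar                       # Step 3: squared MDs of all points
        d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff)
        H = np.argsort(d2)[:h]                # Steps 4-5: h smallest distances
        crit_new = np.linalg.det(np.cov(X[H], rowvar=False))
        if crit_new >= crit_old:              # Step 6 (FMCD): stop when the
            break                             # determinant no longer decreases
        crit_old = crit_new
    return xbar, S

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
X[:6] += 8.0                                  # contaminate 10% of the points
mu, S = c_step_estimate(X)
print(np.abs(mu).max())                       # robust mean is near zero
```

Termination is guaranteed because the determinant is non-increasing across C-steps and there are finitely many subsets of size h; a practical implementation would also use many random starts, as FMCD does.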

Illustrative Examples and Performance Measures
In this study, five real multivariate datasets are used as illustrative examples for identifying outliers in multivariate data: the Brain and Weight, Hawkins-Bradu-Kass, Stackloss, Bushfire and Milk datasets. These datasets have become a standard in most outlier detection studies for multivariate data, such as [1,18-22]. Table 1 shows a summary of the datasets [20,21,29].
Robust mean and covariance matrices from FMCD, MVV, CME, ISE and TOC are obtained and used to identify outliers in the datasets. The steps to identify outliers are as follows.
Step 1: Compute the robust distance of each observation using the robust mean and covariance matrix in equation (1).
Step 2: Use the cut-off value √(χ²_{p,0.975}); any observation whose robust distance exceeds this cut-off is declared an outlier.
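The chi-square cut-off commonly used with robust distances can be computed directly; this sketch assumes the 97.5% quantile rule and uses SciPy. For p = 3 explanatory variables (as in, e.g., the Stackloss data) it gives approximately 3.058.

```python
# Sketch: cut-off value sqrt(chi2_{p,0.975}) for robust distances.
import numpy as np
from scipy.stats import chi2

p = 3                                    # number of variables, e.g. Stackloss
cutoff = np.sqrt(chi2.ppf(0.975, df=p))
print(round(cutoff, 3))                  # -> 3.058
```

The quantile follows from the fact that squared Mahalanobis distances of multivariate normal data are approximately χ²-distributed with p degrees of freedom.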
The performance of each robust estimator for each dataset will be measured by three measurements.
i. Number of outliers successfully detected. The number of outliers detected by each robust estimator is counted. The outlying observations in each dataset have already been identified in previous studies (refer to Table 1). Each robust estimator is checked for whether it detects these outliers.
ii. Number of outliers falsely detected as inliers (masking effect). Any outlier that is not identified as an outlier is counted toward the masking effect. Each robust estimator is checked for whether it misclassifies outliers as inliers.
iii. Number of inliers falsely detected as outliers (swamping effect). Any inlier that is not identified as an inlier is counted toward the swamping effect. Each robust estimator is checked for whether it misclassifies inliers as outliers.
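The three measures above reduce to simple set operations on the indices an estimator flags versus the known outlier indices of a dataset; the flagged indices below are hypothetical, for illustration only.

```python
# Sketch: the three performance measures computed from flagged indices.
def performance(flagged, true_outliers):
    flagged, true_outliers = set(flagged), set(true_outliers)
    detected = len(flagged & true_outliers)    # (i) outliers detected
    masked = len(true_outliers - flagged)      # (ii) masking effect
    swamped = len(flagged - true_outliers)     # (iii) swamping effect
    return detected, masked, swamped

# Hypothetical flags on HBK-style data (outliers at 0-based indices 0..13):
flagged = list(range(12)) + [20, 25]           # misses 2 outliers, swamps 2 inliers
print(performance(flagged, range(14)))         # -> (12, 2, 2)
```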

Results and Discussion
In this section, we compare and discuss the performance of FMCD, MVV, CME, ISE and TOC on multivariate outlier detection by using five real multivariate datasets.

Brain and Weight data
The Brain and Weight (BW) dataset contains two variables, body weight and brain weight, for 28 species of animals. According to [18] and [19], this dataset is part of a larger dataset in [30]. [1] used the Minimum Volume Ellipsoid (MVE) in their study and found that observations 6th, 14th, 16th, 17th and 25th are outliers. Observations 6th, 16th and 25th are dinosaurs with a small brain and heavy body, while observations 14th and 17th are the human and the rhesus monkey, with high brain weight [1]. However, the method used by [1] tends to detect too many outliers [31]. According to [18,19,23], it is believed that this dataset has only three outliers: observations 6th, 16th and 25th.

Hawkins-Bradu Kass (HBK) data
The Hawkins-Bradu-Kass (HBK) dataset is an artificial dataset generated by [32]. It was generated to show some of the merits of robust methods and their effectiveness in identifying outliers [18]. The dataset has 75 observations and four variables (one response and three explanatory variables). In this study, only the three explanatory variables are used. Observations 1-14 are known to be outliers for this dataset [1,18-20].

Stackloss data
Stackloss data is a dataset obtained from an experiment on the oxidation of ammonia into nitric acid, measured on 21 consecutive days [22]. The dataset has three explanatory variables (rate of incoming ammonia, cooling water temperature and acid concentration) and one response variable (stackloss) [19,22]. In this study, only the three explanatory variables are used. Observations 1st, 2nd, 3rd and 21st are outliers [18,19].
From Table 4, all robust estimators successfully detect the outliers and do not misclassify outliers as inliers at the cut-off value of 3.058.

Bushfire data
Bushfire data is a dataset used to locate bushfire scars and was taken from [33]. The dataset contains satellite measurements on five frequency bands, corresponding to each of 38 pixels, with 13 outliers, i.e. about 34% of the observations. According to [20,21], observations 7th-11th and 31st-38th are classified as outliers.
Only FMCD, ISE and TOC successfully detect all outliers in the Bushfire dataset. MVV detects only 84.6% of the outliers, while CME detects only 53.8%. Therefore, MVV and CME misclassify outliers as inliers (masking effect) at rates of 15.4% and 46.2%, respectively. Table 5 shows that all robust estimators misclassify some inliers as outliers (swamping effect). CME has the highest percentage at 40%, while FMCD misclassifies only 8% of inliers as outliers. Meanwhile, TOC and ISE have similar performance.

Milk data
The Milk dataset provided by [34] comprises 86 containers of milk measured on 8 variables: density, fat content, protein content, casein content, cheese dry substance measured in the factory, cheese dry substance measured in the laboratory, milk dry substance and cheese produced. There are 17 outliers in this dataset, i.e. about 20% of the observations. Observations 1st-3rd, 12th-17th, 27th, 41st, 44th, 47th, 70th, 74th, 75th and 77th are classified as outliers by [20], [21] and [29].
As can be seen from Table 6, all robust estimators successfully detect the 17 outliers and do not misclassify outliers as inliers (masking effect). However, all robust estimators show a swamping effect. FMCD, MVV and ISE show the lowest swamping effect at 5.8%, while CME and TOC have 7.2%.

Table 7 shows a summary of the best robust estimators for each dataset. All robust estimators successfully detected the outliers in four datasets (BW, HBK, Stackloss and Milk). For the Bushfire dataset, only FMCD, ISE and TOC successfully detected all outliers. The same pattern holds for the masking effect: no robust estimator shows a masking effect on those four datasets, while for the Bushfire dataset only FMCD, ISE and TOC avoid misclassifying outliers as inliers. For the swamping effect, all robust estimators misclassify inliers as outliers in four datasets (BW, Stackloss, Bushfire and Milk); for the HBK dataset, only MVV has a swamping effect.

Conclusions
In this study, the performance of the new robust estimator of [16], named Test on Covariance (TOC), for detecting outliers in real multivariate datasets is tested and compared with other robust estimators: Fast Minimum Covariance Determinant (FMCD), Minimum Vector Variance (MVV), Covariance Matrix Equality (CME) and Index Set Equality (ISE). The performance of these five robust estimators is tested on five real multivariate datasets: the Brain and Weight (BW), Hawkins-Bradu-Kass (HBK), Stackloss, Bushfire and Milk datasets. The performance of each robust estimator is measured by the number of outliers successfully detected, the number of outliers falsely detected as inliers (masking effect) and the number of inliers falsely detected as outliers (swamping effect). Ideally, the best robust estimator detects all outliers and has the lowest masking and swamping effects. This study finds that all robust estimators successfully detected the outliers in the BW, HBK, Stackloss and Milk datasets, with no masking effect; for the Bushfire dataset, only FMCD, ISE and TOC detected all outliers without a masking effect. For the swamping effect, all robust estimators misclassify inliers as outliers in the BW, Stackloss, Bushfire and Milk datasets, while FMCD, CME, ISE and TOC show no swamping effect on the HBK dataset, i.e. they do not misclassify any HBK inliers as outliers. From these results, TOC has proven able to detect outliers, does not suffer from a masking effect and performs as well as the other robust estimators on the five real multivariate datasets. This shows that TOC is an applicable and promising approach for outlier detection in multivariate data; hence, TOC can be used when outliers exist in multivariate datasets.