An Efficient Algorithm for Server Thermal Fault Diagnosis Based on Infrared Image

Maintaining server security and stability is essential for a data center. Prolonged overload operation or a high room temperature may cause service disruption or even a server crash, which can result in great economic loss for a business. Currently, server outages are avoided through monitoring and forecasting. A thermal camera can provide fine texture information for monitoring and intelligent thermal management in a large data center. This paper presents an efficient method for server thermal fault monitoring and diagnosis based on infrared images. First, the thermal distribution of the server is standardized and the regions of interest in the image are segmented manually. Then the texture feature, the Hu moments feature, and the modified entropy feature are extracted from the segmented regions. These features are used to analyze and classify thermal faults, which in turn supports efficient, energy-saving thermal management decisions such as job migration. For the larger feature space, principal component analysis is employed to reduce the feature dimensions and keep processing fast while retaining most of the fault feature information. Finally, the different feature vectors are taken as input for SVM training, and thermal faults are diagnosed with the optimized SVM classifier. The method provides suggestions for optimizing data center management; it can improve air conditioning efficiency and reduce the energy consumption of the data center. The experimental results show that the maximum detection accuracy is 81.5%.


Introduction
Recently there has been a boom in data center usage, prompted by great developments in new information technologies and applications (e-commerce, Big Data, and cloud computing) and other services [1]. Servers are complicated to maintain and require a special operating environment. To make sure that each server operates below its overheating threshold, and to prevent abnormal shutdown due to excessive temperature or overload operation, many data centers have to increase the power of their air conditioning, resulting in high energy consumption and low efficiency. A report by Koomey in 2010 reveals that electricity use in data centers during 2010 was about 1.1%-1.5% of worldwide electricity use, and this proportion is growing [2]. Remarkably, the computer room air conditioner (CRAC) units account for nearly 40% of the total power consumption in a data center [3]. One of the main factors affecting cooling efficiency is the uneven distribution of temperature inside the data center [4]. Super-hot servers with hotspots account for about 5% of the servers in a data center. The heat dissipation problem must be solved both for these 5% of very hot cabinets and for the vast majority of servers, but the two cannot be handled in the same way. Research shows that if the temperature variation inside a data center is reduced from 10 °C to 2 °C, the energy consumption of the CRAC can be reduced by about 25% [5]. It is therefore important to monitor the temperature within data centers and then find the cause of the hotspots. Local high temperatures can then be eliminated and the internal temperature of the data center balanced by adjusting the load distribution or the distribution of refrigeration resources. This improves the efficiency of air conditioning and saves energy in the data center.
At present, the traditional solution for monitoring the temperature inside a large data center is to place temperature sensors at key positions and to collect all measurement data by a specific method. Wired sensors are not widely used because of their expensive installation and configuration costs. Although wireless sensors have the advantages of low cost and non-invasive measurement, reference [6] shows that the wireless electromagnetic environment in a large-scale data center is not conducive to a sensor network. Moreover, the spatial resolution of temperature sensors is not sufficient to obtain detailed information. If a sensor is not deployed in the right position, the size and temperature of hot spots cannot be obtained accurately. Infrared thermography (IRT) has been applied widely in the military, medical diagnostics, and security monitoring fields, with the advantages of non-contact detection, freedom from electromagnetic interference, safety and reliability, and a wide monitoring range [7,8,9]. Compared with temperature sensors, a thermal camera obtains more intuitive two-dimensional thermal images. Faults associated with an abnormal temperature distribution can be detected easily by IRT, and image processing can further analyze the cause of hot spots.
In order to obtain a rapid and accurate fault diagnosis from the inspection data of electrical equipment, some intelligent diagnosis systems have been constructed based on image processing and artificial intelligence. An intelligent fault diagnosis system can assess the degree of abnormality of electronic equipment even if experts or experienced personnel are not on site [10]. Generally, an intelligent diagnosis system based on infrared images consists of four steps, as shown in Figure 1. Firstly, infrared images of the electronic equipment are captured under different fault conditions with an infrared camera. Secondly, the region of interest (ROI) is located and segmented. Thirdly, feature information with sufficient discriminative power is extracted from the segmented regions. Finally, an artificial intelligence algorithm and a decision rule are used to determine the equipment state.

Figure 1. General steps of an intelligent diagnosis system
This paper proposes an efficient algorithm for server thermal fault diagnosis based on infrared images. The intent is to optimize the management of the data center, improve the efficiency of the CRAC, and reduce the energy consumption of the data center. The innovation of this paper is an infrared image processing technique based on modified entropy features, which outperforms the texture and Hu moments features and produces a higher fault diagnosis rate. The rest of this paper is organized as follows. Section 2 presents the intelligent diagnosis method based on infrared images, with the main focus on the feature extraction algorithms. Section 3 explains the implementation of the proposed method. Experimental results and the classification performance of the SVM classifier are presented in Section 4. The last section summarizes and concludes our work.

Methodology
This section presents the classification and diagnosis system for server thermal failures. Firstly, the image distortion caused by different shooting positions is eliminated by image standardization, and the ROIs of the infrared image are segmented manually. Three algorithms are then used for feature extraction. For the larger feature space, principal component analysis (PCA) is employed to reduce the feature space dimensions and keep processing fast while retaining most of the fault feature information. Finally, the different feature vectors are each taken as the input of a support vector machine (SVM) to train a classification model, and thermal faults are diagnosed with the optimized SVM classifier. Deciding the condition of a server depends on two key issues: extracting information that is distinctive enough to be classified, and the pattern recognition method used in the intelligent diagnosis. We extract features that reflect the important, primitive characteristics of the target images. In this research, three feature extraction algorithms are applied to the segmented images, namely the texture feature extraction algorithm, the Hu moments feature extraction algorithm, and the modified entropy feature extraction algorithm.

Texture Feature
Texture is an important characteristic of an image that reflects its spatial structure, and it cannot be ignored during classification. Traditionally, texture features are calculated from the neighborhood of each pixel in the image. A considerable number of approaches to texture feature extraction have been reported. The most widely used is the grey level co-occurrence matrix (GLCM) proposed by Haralick et al. (1973). The GLCM of an image reflects the combined information of its pixels with regard to direction, spacing between adjacent pixels, and range of variation, and it is the basis for analyzing image textures. Seven texture parameters are extracted from the GLCM and used as the basis of classification: contrast, uniformity, entropy, variance, covariance (product moment), inverse difference moment, and correlation.
The texture feature vector is used to train the SVM model, and the thermal fault conditions of the server are then diagnosed by the optimized SVM classifier.
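The seven GLCM parameters listed above can be illustrated with a short sketch. The Python code below is a minimal illustration under stated assumptions, not the implementation used in the experiments: the gray values are assumed to be requantized to a small number of levels (8 here), a single horizontal offset of one pixel is used, and the matrix is made symmetric so that row and column statistics coincide. The function names `quantize`, `glcm`, and `glcm_features` are hypothetical, and the exact parameter definitions used in the paper may differ.

```python
import math


def quantize(image, levels=8, max_value=255):
    """Map gray values in [0, max_value] to integer levels in [0, levels)."""
    return [[v * levels // (max_value + 1) for v in row] for row in image]


def glcm(image, dx=1, dy=0, levels=8):
    """Normalized, symmetric co-occurrence matrix for the offset (dx, dy).

    Assumes `image` is a list of rows of integers in [0, levels) and that
    at least one pixel pair exists for the chosen offset.
    """
    counts = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                a, b = image[r][c], image[r2][c2]
                counts[a][b] += 1
                counts[b][a] += 1  # count both directions: symmetric matrix
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]


def glcm_features(p):
    """Seven texture parameters computed from a normalized symmetric GLCM."""
    n = range(len(p))
    cells = [(i, j, p[i][j]) for i in n for j in n]
    mean = sum(i * v for i, _, v in cells)
    variance = sum((i - mean) ** 2 * v for i, _, v in cells)
    covariance = sum((i - mean) * (j - mean) * v for i, j, v in cells)
    return {
        "contrast": sum((i - j) ** 2 * v for i, j, v in cells),
        "uniformity": sum(v * v for _, _, v in cells),
        "entropy": -sum(v * math.log2(v) for _, _, v in cells if v > 0),
        "variance": variance,
        "covariance": covariance,
        "inverse_difference_moment": sum(
            v / (1 + (i - j) ** 2) for i, j, v in cells
        ),
        # Assumption: correlation is reported as 0 for a constant image.
        "correlation": covariance / variance if variance > 0 else 0.0,
    }
```

In practice the features are often averaged over several offsets and directions (for example 0, 45, 90, and 135 degrees); the source does not state which offsets were used.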

Hu Moments Feature
Hu moments are used in pattern recognition to provide a scale-, orientation-, and position-invariant representation of an object [11]. In practice, raw (origin) moments and central moments alone do not guarantee invariant image characteristics. To this end, M. K. Hu first proposed the invariant moments in 1961. He defined the moments of continuous functions, gave their basic properties, and proved their invariance to translation, rotation, and scale.
Hu moments are suitable for describing the overall shape of an object, so they are widely applied in edge extraction, image matching, and object recognition. The seven Hu moments carry different amounts of information because they are calculated differently. The useful information of an image is generally concentrated in the lower-order moments, which are relatively cheap to calculate. The higher-order moments are more expensive to calculate and contain details that are easily affected by noise, so differences between targets are hard to distinguish in them. In this research, the seven Hu moments of the image are adopted as the input data. The feature space is M1, M2, M3, M4, M5, M6, and M7.
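For reference, the seven invariants M1 to M7 can be computed directly from the normalized central moments of the image. The sketch below follows the standard textbook definitions of Hu's invariants; it is an illustration in pure Python, and the function name `hu_moments` is hypothetical. It assumes a non-empty gray image, given as a list of rows, whose pixel values do not sum to zero.

```python
def hu_moments(image):
    """Return the seven Hu invariants [M1, ..., M7] of a gray image."""
    rows, cols = len(image), len(image[0])
    pixels = [(x, y, image[y][x]) for y in range(rows) for x in range(cols)]
    m00 = sum(v for _, _, v in pixels)
    cx = sum(x * v for x, _, v in pixels) / m00  # centroid, x coordinate
    cy = sum(y * v for _, y, v in pixels) / m00  # centroid, y coordinate

    def eta(p, q):
        """Normalized central moment of order (p, q)."""
        mu = sum((x - cx) ** p * (y - cy) ** q * v for x, y, v in pixels)
        return mu / m00 ** (1 + (p + q) / 2)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    a, b = n30 + n12, n21 + n03          # recurring sums
    c, d = n30 - 3 * n12, 3 * n21 - n03  # recurring differences

    return [
        n20 + n02,
        (n20 - n02) ** 2 + 4 * n11 ** 2,
        c ** 2 + d ** 2,
        a ** 2 + b ** 2,
        c * a * (a ** 2 - 3 * b ** 2) + d * b * (3 * a ** 2 - b ** 2),
        (n20 - n02) * (a ** 2 - b ** 2) + 4 * n11 * a * b,
        d * a * (a ** 2 - 3 * b ** 2) - c * b * (3 * a ** 2 - b ** 2),
    ]
```

Because the moments are centred on the intensity centroid and normalized by the zeroth moment, shifting the object within the frame or rotating it by 90 degrees leaves the seven values unchanged up to floating-point error.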

Modified Entropy Feature
The relationship between the gray value of an infrared image and its temperature is shown in formula (1). Although the global entropy can represent the statistical properties of the image, it cannot reflect the spatial characteristics of the gray distribution. To address this problem, a local entropy feature extraction algorithm is applied in this paper. The mean gray value of the neighborhood is taken as the spatial feature of the image and combined with the gray value of the pixel to form a two-tuple, denoted (i, j), where i is the gray value of the pixel and j is the mean gray value of its neighborhood.
$$P_{ij} = \frac{f(i, j)}{M \times N} \tag{3}$$
Formula (3) gives the proportion of the two-tuple (i, j) in the image, where f(i, j) is the frequency of the corresponding two-tuple and M×N is the size of the local window. When the variable P_i in formula (1) is replaced with P_ij, the local entropy over an M×N local window is obtained. The local entropy of the image reflects the gray value together with its position information and the gray level distribution of the surrounding pixels. It results from the interaction of all pixels in the local window and is therefore not sensitive to single-point noise, so the local entropy feature has a certain noise-filtering ability. In order to extract the statistical characteristics of the local entropy, this paper proposes the modified entropy feature algorithm. For an image of size X×Y, the mean value of the local entropy is calculated for each row and each column of pixels, which yields X row features and Y column features.
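The construction described above can be sketched as follows. This is a hedged illustration rather than the authors' code, and several details are assumptions because the source does not specify them: the neighborhood mean is taken over a 3×3 window and rounded to an integer, windows are clipped at the image border (so the frequencies are normalized by the actual number of pixels in the clipped window rather than by M×N), and the entropy uses a base-2 logarithm. All function names are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a frequency table."""
    total = sum(counts.values())
    return -sum(n / total * math.log2(n / total) for n in counts.values())


def global_entropy(image):
    """Entropy of the gray-level histogram of the whole image."""
    return entropy(Counter(v for row in image for v in row))


def window(size, centre, half):
    """Index range of a window around `centre`, clipped to [0, size)."""
    return range(max(0, centre - half), min(size, centre + half + 1))


def neighbourhood_mean(image, half=1):
    """Rounded mean gray value around each pixel (3x3 when half=1)."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            vals = [image[rr][cc]
                    for rr in window(rows, r, half)
                    for cc in window(cols, c, half)]
            out[r][c] = round(sum(vals) / len(vals))
    return out


def local_entropy_map(image, size=9):
    """Entropy of the (gray value, neighborhood mean) pairs in each window."""
    rows, cols = len(image), len(image[0])
    means = neighbourhood_mean(image)
    half = size // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            pairs = Counter((image[rr][cc], means[rr][cc])
                            for rr in window(rows, r, half)
                            for cc in window(cols, c, half))
            out[r][c] = entropy(pairs)
    return out


def modified_entropy_features(image, size=9):
    """Global entropy, then row means and column means of local entropy."""
    emap = local_entropy_map(image, size)
    row_means = [sum(row) / len(row) for row in emap]
    col_means = [sum(col) / len(col) for col in zip(*emap)]
    return [global_entropy(image)] + row_means + col_means
```

For an image with X rows and Y columns the returned vector has 1 + X + Y entries, one global entropy value followed by the row and column means.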
The modified local entropy reflects the inhomogeneous gray distribution of the row (column) pixels and the relationship between neighboring pixels. Taking the global entropy and the modified local entropy as input features is helpful for the thermal fault diagnosis of a server.
Our approach was applied in a laboratory at Dalian University. A FLIR E8 thermal camera with fusion technology was set up facing a rack equipped with a server of the DELL710 model. The distance between the target server and the thermal camera was between 0.5 and 1 meter. During monitoring, the ambient temperature was between 20 °C and 30 °C. Other information related to the accuracy of the results was also recorded, such as the ambient humidity, which was between 16% and 25%. In this research, the thermal faults are classified into 4 classes, namely main fan failure, inlet air blocking, CPU  Figure 2) are taken as the research objects. The number of samples for each running state is 255, and a total of 1275 infrared images are obtained.

Image Standardization
Because the viewing angle and distance of the infrared camera vary, geometric distortion appears in the infrared image during acquisition, which can introduce errors into the extracted features and cause inaccurate recognition. Hence, this paper adopts an image standardization method that maps the thermal distribution of the server onto a uniform model and provides information for evaluating its operating state [12]. Firstly, a template is prepared, scaled to a unified rectangle with the same aspect ratio as the server facade. Then the thermal distribution of each server is standardized by image registration. Image registration establishes a one-to-one pixel relationship between the infrared images and the visible images, so the temperature at any position on the visible image is known, which makes feature extraction easy. The server region is segmented manually and mapped to a standard rectangle that is proportional to the size of the actual server. In this research it is mapped to a server image of 87×424 pixels (shown in Figure 3) for subsequent image processing.
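One simple way to realize the mapping onto a standard rectangle is a four-point projective (perspective) warp. The sketch below is an assumption for illustration only: the source does not state which registration transform is used, the corner coordinates would come from the manual segmentation, and the output is interpreted as 87 rows by 424 columns. The function names `homography` and `standardize` are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def homography(src, dst):
    """3x3 projective transform that maps four src points onto dst points."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = np.linalg.solve(np.array(A, dtype=float), np.array(b, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)


def standardize(image, corners, out_h=87, out_w=424):
    """Warp the quadrilateral `corners` to an out_h x out_w rectangle.

    `corners` lists (x, y) points in the order top-left, top-right,
    bottom-right, bottom-left. Nearest-neighbour sampling is used.
    """
    rect = [(0, 0), (out_w - 1, 0), (out_w - 1, out_h - 1), (0, out_h - 1)]
    H = homography(rect, corners)  # inverse mapping: output -> source pixel
    ys, xs = np.mgrid[0:out_h, 0:out_w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    sx, sy, sw = H @ pts
    sx = np.clip(np.rint(sx / sw).astype(int), 0, image.shape[1] - 1)
    sy = np.clip(np.rint(sy / sw).astype(int), 0, image.shape[0] - 1)
    return image[sy, sx].reshape(out_h, out_w)
```

Mapping from each output pixel back to the source image (rather than forward) guarantees that every pixel of the standardized image receives a value; bilinear interpolation could replace the nearest-neighbour lookup if smoother temperature values are needed.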

Parameter Optimization
An SVM with a radial basis function (RBF) kernel has been shown to perform better than an MLP artificial neural network in detecting thermal faults of electrical installations [13]. In this paper, a support vector machine (SVM) is used as the classifier because it has good generalization ability and is well suited to practical problems with small sample sizes [14]. The Gaussian function is usually used as the RBF kernel. The classification performance of the SVM is affected by the choice of the penalty parameter C and the kernel function parameter γ. Parameter C controls the trade-off between maximizing the margin and minimizing the classification error. Parameter γ is the parameter of the Gaussian radial basis kernel function, and it determines the complexity of the distribution of the sample data in the high-dimensional space. It is therefore necessary to search the parameter space; the goal is to find a good parameter pair (C, γ) with which the classifier predicts the sample data correctly.
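For reference, the roles of the two parameters can be read from the standard soft-margin formulation with a Gaussian kernel. The source does not write these formulas out, so the notation below (training pairs (x_i, y_i), slack variables ξ_i, feature map φ) is the common textbook one and is an assumption here.

```latex
K(x_i, x_j) = \exp\!\left(-\gamma \,\lVert x_i - x_j \rVert^{2}\right), \qquad \gamma > 0

\min_{w,\, b,\, \xi} \;\; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i} \xi_i
\quad \text{subject to} \quad
y_i \left( w^{\top} \phi(x_i) + b \right) \ge 1 - \xi_i, \;\; \xi_i \ge 0
```

A larger C penalizes training errors more heavily at the cost of a narrower margin, while a larger γ makes the kernel more local and the decision boundary more complex.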
Cross validation is a method that helps to avoid overfitting of the SVM model [15]. In k-fold cross validation, the training data is first divided into k subsets of equal size; each subset is then used in turn as the test set while the remaining k-1 subsets are used as the training set. Every sample of the training set is thus predicted exactly once, which mitigates the problem of overfitting. Because there is no theoretical guidance for choosing (C, γ), the parameters have to be found by repeated training. Herein we use the grid search algorithm to select the best combination. Some parameters are shown in Table I.  [-5,5]
Additionally, considering the time required in practical applications, we adopt the principal component analysis (PCA) algorithm to reduce the dimensions of the extracted features when the feature space is large. PCA not only eliminates redundant information, which reduces the correlation among the features and decreases the amount of data, but also makes the sample space more compact and better suited to the SVM [16].
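The cross-validated grid search described above can be sketched in a few lines. The code below is an illustration under assumptions, not the authors' implementation: the range [-5, 5] is interpreted as base-2 exponents for both C and γ, and `fit_and_score` is a hypothetical callback that would train an RBF SVM on the training indices with the given (C, γ) and return its accuracy on the held-out indices. The names `k_fold_indices` and `grid_search` are also hypothetical.

```python
import itertools
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


def grid_search(n_samples, fit_and_score, exponents=range(-5, 6), k=5):
    """Return (C, gamma, mean score) of the best grid point.

    Every candidate is evaluated on the same folds (fixed seed), so the
    mean cross-validation scores are directly comparable.
    """
    best = (None, None, float("-inf"))
    for c_exp, g_exp in itertools.product(exponents, repeat=2):
        C, gamma = 2.0 ** c_exp, 2.0 ** g_exp
        scores = [fit_and_score(train, test, C, gamma)
                  for train, test in k_fold_indices(n_samples, k)]
        mean = sum(scores) / len(scores)
        if mean > best[2]:
            best = (C, gamma, mean)
    return best
```

With this grid, 11 × 11 = 121 parameter pairs are evaluated, each requiring k training runs. Under the base-2 interpretation, the grid point γ = 2^-5 = 0.03125 rounds to 0.0313.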

Experimental Result and Classification Performance Analysis
The data set contains a total of 1275 infrared images of the server captured by the thermal camera under different thermal fault conditions, with an equal number of images for each thermal state. In this experiment, 60% of the data are selected randomly as training data and the remaining 40% are used for testing. For each infrared image, the Hu moments method extracts a feature vector of 7 features, and the texture feature extraction algorithm extracts another 7 features. However, the feature space of the global entropy and modified local entropy features is too large for the SVM. Specifically, the global entropy of each image is extracted first, which is a single feature. A 9×9 local window is then used to extract the row and column mean values of the local entropy, which give 87 features and 424 features respectively, for a total of 512 image features. This necessitates the use of PCA, which speeds up data processing while retaining most of the information in the original data and makes the feature information more representative. The new set of modified entropy features compressed by PCA contains 49 dimensions. The combined input features are mapped by PCA into a new feature space with 50 dimensions.
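The dimensionality reduction step can be illustrated with a small PCA sketch based on the singular value decomposition. It is not the authors' code: the function names are hypothetical, NumPy is assumed to be available, and the data in the usage lines are synthetic random numbers that only mimic the shapes in the text (765 training images, i.e. 60% of 1275, with 512 entropy features each, reduced to 49 dimensions).

```python
import numpy as np


def pca_fit(X, n_components):
    """Return the mean, principal axes, and explained variance ratios."""
    mean = X.mean(axis=0)
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    ratio = s ** 2 / np.sum(s ** 2)
    return mean, vt[:n_components], ratio[:n_components]


def pca_transform(X, mean, components):
    """Project samples onto the principal axes found by pca_fit."""
    return (X - mean) @ components.T


# Synthetic stand-in for the training features (shapes only, not real data).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(765, 512))
mean, components, ratio = pca_fit(X_train, 49)
Z_train = pca_transform(X_train, mean, components)
print(Z_train.shape)  # (765, 49)
```

Fitting the projection on the training split only and reusing `mean` and `components` for the test split keeps test information out of the model; the sum of `ratio` indicates how much of the variance the retained components preserve.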
In this paper, we use the SVM classifier combined with different input features to analyze the thermal faults of the server. The accuracy of thermal fault diagnosis with the different input features is shown in Table II. The results show that the classification accuracy is highest for the CPU overloading condition and the No.2 CPU fan failure. All of the feature extraction algorithms introduced in this paper identify the causes of these two hotspots completely. For the remaining thermal fault cases, the global entropy and modified local entropy features are superior to the Hu moments and texture features. Different feature vectors are used to train the SVM classifier. In the experiment, we also use 5-fold cross validation and the grid search method to optimize the combination of the penalty parameter C and the kernel function parameter γ. According to the results in Table III, when all features are combined as input data, the maximum correct rate of thermal fault diagnosis reaches 81%, with parameters C = 1 and γ = 0.0313. The classification performance of the SVM with the global entropy and modified local entropy features is better than that with the Hu moments and texture features. The results indicate that the method can predict server thermal faults, provide suggestions for thermal management, and help improve the cooling efficiency of data centers. This study therefore has practical value.

Conclusion
In this paper, an efficient algorithm for thermal fault diagnosis based on infrared images is presented. In the intelligent fault diagnosis system, we extract the Hu moments feature, the texture feature, and the global entropy and modified local entropy features as different inputs of the SVM classifier. The 5-fold cross validation method and the grid search method are used to tune the penalty parameter C and the kernel function parameter γ. The comparison demonstrates that the global entropy and modified local entropy features developed in this paper achieve a higher accuracy of server thermal fault diagnosis than the Hu moments and texture features. The experimental results show that the SVM trained on all feature vectors performs best, with an accuracy of 81%. The proposed method has been validated through the diagnosis of actual thermal faults.