Online detection of failures generated by storage simulator

Modern large-scale data farms consist of hundreds of thousands of storage devices spread across distributed infrastructure. Devices used in modern data centers (such as controllers, links, SSDs, and HDDs) can fail due to hardware as well as software problems. Such failures or anomalies can be detected by monitoring the activity of components using machine learning techniques. To use these techniques, researchers need plenty of historical data from devices in normal and failure modes to train the algorithms. In this work, we address two problems: 1) the lack of storage data for such methods, which we tackle by creating a simulator, and 2) the application of existing online algorithms that can more quickly detect a failure in one of the components. We created a Go-based (golang) package for simulating the behavior of modern storage infrastructure. The software is based on the discrete-event modeling paradigm and captures the structure and dynamics of high-level storage system building blocks. The package's flexible structure allows us to create a model of a real-world storage system with a configurable number of components. The primary area of interest is exploring the storage machine's behavior under stress testing or during medium- or long-term operation in order to observe failures of its components. To discover failures in the time series generated by the simulator, we modified a change-point detection algorithm that works in online mode. The goal of change-point detection is to discover changes in the time series distribution. This work describes an approach for failure detection in time series data based on direct density ratio estimation via binary classifiers.


Introduction
The disk drive is one of the crucial elements of any computer and IT infrastructure. Disk failures are a major contributing factor to outages of the overall computing system. Over the last decades, storage system reliability and modeling has been an active area of research in industry and academia [1][2][3]. Nowadays, the total number of hard disk drives (HDD) and solid-state drives (SSD) deployed in data farms and cloud systems has passed tens of millions of units [4]. Consequently, early identification of defects that can lead to future failures can bring significant benefits. Such failures or anomalies can be detected by monitoring components' activity using machine learning techniques known as change-point detection [5][6][7]. To use these techniques, especially for anomaly detection, historical data from devices in normal and failure modes is needed to train the algorithms. In this paper, for the reasons mentioned above, we address two problems: 1) the lack of storage data for such methods, which we tackle by creating a simulator, and 2) the application of new online algorithms that can more quickly detect a failure in one of the components [8].
We created a Go-based (golang) package for simulating the behavior of modern storage infrastructure. The primary area of interest is exploring the storage machine's behavior under stress testing or during medium- or long-term operation in order to observe failures of its components. The software is based on the discrete-event modeling paradigm and captures the structure and dynamics of high-level storage system building blocks. It represents a hybrid approach to modeling storage attached networks [9,10]. This method uses additional blocks with a neural network that tunes the internal model parameters while a simulation is running, as described in [11]. The critical advantage of this approach is a reduced need for detailed simulation and a smaller number of modeled parameters of real-world system components and, as a result, a significant reduction in the intellectual cost of development. The package's modular structure allows us to create a model of a real-world storage system with a configurable number of components. Compared to other techniques, parameter tuning does not require major changes to the service under development [12].
To discover failures in the time series generated by the simulator, we modified a change-point detection algorithm that works in online mode. The goal of change-point detection is to discover changes in the time series distribution. This work uses an approach for failure detection in time series data based on direct density ratio estimation via binary classifiers [8].

Internals
The simulator uses the Discrete Event Simulation (DES) [13] paradigm for modeling storage infrastructure. In a broad sense, DES simulates a system as a discrete sequence of events in time. Each event happens at a specific moment in time and marks a change of state in the system. No change in the system is assumed to happen between two consecutive events; thus, the simulation time can jump directly to the occurrence time of the next event. The scheme of the process is shown in Figure 1. The event handling loop is the central part that is responsible for advancing time in the simulator. The Master process creates the necessary logical processes (Client1, IOBalancer, HDD Write, etc.) and populates a Priority Queue by collecting events from the modeling processes. The last part of the implementation is running the event handling loop, which removes successive elements from the queue and performs the associated actions. This is correct because the queue is already sorted by time.
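The event handling loop described above can be illustrated with a minimal sketch. It is not the package's actual code: the `Event`, `Simulator`, `Schedule`, and `Run` names are hypothetical, and the priority queue is assumed to be a binary heap built on Go's standard `container/heap` package.

```go
package main

import (
	"container/heap"
	"fmt"
)

// Event is a hypothetical event record: a timestamp plus an optional action.
type Event struct {
	Time   float64
	Name   string
	Action func(s *Simulator)
}

// eventQueue implements heap.Interface, ordered by event time (earliest first).
type eventQueue []*Event

func (q eventQueue) Len() int            { return len(q) }
func (q eventQueue) Less(i, j int) bool  { return q[i].Time < q[j].Time }
func (q eventQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *eventQueue) Push(x interface{}) { *q = append(*q, x.(*Event)) }
func (q *eventQueue) Pop() interface{} {
	old := *q
	n := len(old)
	e := old[n-1]
	*q = old[:n-1]
	return e
}

// Simulator holds the simulation clock, the pending events, and a trace log.
type Simulator struct {
	Now   float64
	queue eventQueue
	Log   []string
}

// Schedule adds an event that fires `delay` time units after the current time.
func (s *Simulator) Schedule(delay float64, name string, action func(*Simulator)) {
	heap.Push(&s.queue, &Event{Time: s.Now + delay, Name: name, Action: action})
}

// Run is the event handling loop: it pops the earliest event, jumps the clock
// directly to that event's time, and performs the associated action.
func (s *Simulator) Run() {
	for s.queue.Len() > 0 {
		e := heap.Pop(&s.queue).(*Event)
		s.Now = e.Time
		s.Log = append(s.Log, fmt.Sprintf("%.1f %s", s.Now, e.Name))
		if e.Action != nil {
			e.Action(s) // an action may schedule further events
		}
	}
}

func main() {
	sim := &Simulator{}
	sim.Schedule(5, "hdd-write-done", nil)
	sim.Schedule(1, "client-request", func(s *Simulator) {
		s.Schedule(2, "balancer-dispatch", nil)
	})
	sim.Run()
	for _, line := range sim.Log {
		fmt.Println(line)
	}
}
```

In this sketch the events are processed in time order (client request, balancer dispatch, HDD write) regardless of the order in which they were scheduled, because the heap always yields the earliest pending event.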
The simulator's programming environment provides the functionality to set up a model for specific computing environments, especially storage area networks. In the simulator, the load on the storage system is represented by two action types: reading a file from disk and writing a file to disk. Each file has corresponding attributes, such as name, block size, and total size. Together with the current load, these attributes determine the amount of time required to perform the corresponding action. Three basic types of resources are provided: CPU, network interface, and storage. Their representation is shown in Figure 3, and a description is given in Table 1. Real-world systems can be constructed from these basic blocks, as shown in Figure 2.
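As a rough illustration of how file attributes and the current load could determine an action's duration, consider the sketch below. All type and field names (`File`, `Storage`, `ServiceTime`, `ActiveRequests`) are hypothetical, and the timing rule (a fixed cost per block, scaled linearly by the number of concurrent requests) is an assumption made for this example rather than the simulator's actual model.

```go
package main

import "fmt"

// ActionType distinguishes the two load actions: read and write.
type ActionType int

const (
	Read ActionType = iota
	Write
)

// File carries the attributes named in the text.
type File struct {
	Name      string
	BlockSize int64 // bytes
	TotalSize int64 // bytes
}

// Blocks returns the number of blocks needed to hold the file, rounded up.
func (f File) Blocks() int64 {
	if f.BlockSize <= 0 {
		return 0
	}
	return (f.TotalSize + f.BlockSize - 1) / f.BlockSize
}

// Storage is a hypothetical storage resource with per-block service costs.
type Storage struct {
	ReadSecPerBlock  float64
	WriteSecPerBlock float64
	ActiveRequests   int // current load: requests sharing the device
}

// ServiceTime estimates how long an action takes.
// ASSUMPTION: the device is shared fairly, so the time grows linearly with
// the number of concurrent requests.
func (s Storage) ServiceTime(a ActionType, f File) float64 {
	perBlock := s.ReadSecPerBlock
	if a == Write {
		perBlock = s.WriteSecPerBlock
	}
	load := s.ActiveRequests
	if load < 1 {
		load = 1
	}
	return float64(f.Blocks()) * perBlock * float64(load)
}

func main() {
	disk := Storage{ReadSecPerBlock: 0.25, WriteSecPerBlock: 0.5, ActiveRequests: 2}
	f := File{Name: "data.bin", BlockSize: 4096, TotalSize: 10000}
	fmt.Println(f.Blocks())                 // 3
	fmt.Println(disk.ServiceTime(Read, f))  // 1.5
	fmt.Println(disk.ServiceTime(Write, f)) // 3
}
```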

Comparison with the real data
Data from a real-world storage system were used to validate the behavior of the simulator. A similar write load scenario was generated on the model prototype, together with an intentional controller failure (turn-off). The comparison is shown in Figure 4. As we can see, the simulator's data qualitatively reflect the component breakdown.

Change point detection
Consider a d-dimensional time series described by a vector of observations $x(t) \in \mathbb{R}^d$ at time $t$. A sequence of $k$ observations at time $t$ is denoted $X(t)$, and a sample of $n$ such sequences is denoted $\mathcal{X}(t)$. The observation distribution is assumed to change at time $t^*$. The goal is to detect this change. The idea is to estimate a dissimilarity score between a reference sample $\mathcal{X}_{rf}(t-n)$ and a test sample $\mathcal{X}_{te}(t)$. The larger the dissimilarity, the more likely it is that a change point occurs at time $t-n$.
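The sequence and sample referred to in the paragraph above can be written out explicitly. The form below is a sketch that assumes the backward-looking window convention common in density-ratio change-point detection, which is consistent with the reference sample ending at $t-n$; the exact notation used in [8] may differ.

```latex
% Sequence of k consecutive observations ending at time t
X(t) = \left[ x(t)^\top,\; x(t-1)^\top,\; \ldots,\; x(t-k+1)^\top \right]^\top \in \mathbb{R}^{kd}

% Sample of n consecutive sequences ending at time t
\mathcal{X}(t) = \left\{ X(t),\; X(t-1),\; \ldots,\; X(t-n+1) \right\}
```

Under this convention the test sample $\mathcal{X}_{te}(t)$ covers the $n$ most recent sequences and the reference sample $\mathcal{X}_{rf}(t-n)$ covers the $n$ sequences immediately before them, so the boundary between the two lies at time $t-n$.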
In this work, we apply a CPD algorithm based on direct density ratio estimation developed in [8]. The main idea is to estimate the density ratio $w(X)$ between two probability distributions $P_{te}(X)$ and $P_{rf}(X)$, which correspond to the test and reference sets, respectively. Different binary classifiers can be used to estimate $w(X)$, such as decision trees, random forests, or SVMs. We use neural networks for this purpose. The network $f(X, \theta)$ is trained on mini-batches with the cross-entropy loss function $L(\mathcal{X}(t-l), \mathcal{X}(t), \theta)$. We use a dissimilarity score based on the Kullback-Leibler divergence, $D(\mathcal{X}(t-l), \mathcal{X}(t))$, defined following [14].
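For concreteness, one standard way to write these quantities is sketched below. It assumes that reference sequences are labelled 0 and test sequences 1, that both samples have size $n$, and that the score is the symmetrised Kullback-Leibler estimate; these are assumptions of this sketch, and the exact expressions in [8] and [14] may differ.

```latex
% Density ratio recovered from a balanced binary classifier
w(X) = \frac{P_{te}(X)}{P_{rf}(X)} \approx \frac{f(X, \theta)}{1 - f(X, \theta)}

% Cross-entropy loss over the reference and test samples
L(\mathcal{X}(t-l), \mathcal{X}(t), \theta) =
  -\frac{1}{n} \sum_{X \in \mathcal{X}(t-l)} \log\bigl(1 - f(X, \theta)\bigr)
  -\frac{1}{n} \sum_{X \in \mathcal{X}(t)} \log f(X, \theta)

% Symmetrised KL-based dissimilarity score
D(\mathcal{X}(t-l), \mathcal{X}(t)) =
   \frac{1}{n} \sum_{X \in \mathcal{X}(t-l)} \log \frac{1 - f(X, \theta)}{f(X, \theta)}
 + \frac{1}{n} \sum_{X \in \mathcal{X}(t)} \log \frac{f(X, \theta)}{1 - f(X, \theta)}
```

The first sum in $D$ estimates $\mathrm{KL}(P_{rf} \,\|\, P_{te})$ and the second estimates $\mathrm{KL}(P_{te} \,\|\, P_{rf})$, so the score is close to zero when the classifier cannot tell the two samples apart and grows as they become easier to separate.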
Following [8], the training algorithm is shown in Alg. 1. It consists of the following steps performed in a loop: 1) initializing the hyper-parameters; 2) preparing the datasets $\mathcal{X}_{rf}$ and $\mathcal{X}_{te}$; 3) calculating the loss function J; 4) applying the gradients to the weights of the neural network.
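The steps above can be sketched in Go under several simplifying assumptions, all of which differ from the method in [8]: a logistic-regression classifier stands in for the neural network, full-batch gradient descent replaces mini-batch training, each sequence holds a single scalar observation (k = 1, d = 1), and the score is the symmetrised log-ratio estimate. The `Score` and `Detect` functions, the window length, and the toy series are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

func sigmoid(z float64) float64 { return 1 / (1 + math.Exp(-z)) }

// Score trains a logistic-regression classifier to separate ref (label 0)
// from test (label 1) and returns a symmetrised KL-style dissimilarity.
func Score(ref, test []float64, epochs int, lr float64) float64 {
	w, b := 0.0, 0.0 // step 1: initialise the parameters
	n := float64(len(ref) + len(test))
	for e := 0; e < epochs; e++ {
		// step 3: gradient of the cross-entropy loss over both windows
		gw, gb := 0.0, 0.0
		for _, x := range ref {
			p := sigmoid(w*x + b) // label 0: residual is p - 0
			gw += p * x
			gb += p
		}
		for _, x := range test {
			p := sigmoid(w*x + b) // label 1: residual is p - 1
			gw += (p - 1) * x
			gb += p - 1
		}
		// step 4: apply the gradients to the weights
		w -= lr * gw / n
		b -= lr * gb / n
	}
	// For logistic regression log(f/(1-f)) equals the logit w*x + b, so the
	// score is mean(logit over test) - mean(logit over ref).
	score := 0.0
	for _, x := range ref {
		score -= (w*x + b) / float64(len(ref))
	}
	for _, x := range test {
		score += (w*x + b) / float64(len(test))
	}
	return score
}

// Detect slides two adjacent windows of length n over the series (step 2)
// and scores each candidate change-point position t - n.
func Detect(series []float64, n int) (times []int, scores []float64) {
	for t := 2 * n; t <= len(series); t++ {
		ref := series[t-2*n : t-n]
		test := series[t-n : t]
		times = append(times, t-n)
		scores = append(scores, Score(ref, test, 200, 0.1))
	}
	return times, scores
}

func main() {
	// Toy series: small oscillation whose mean shifts by 3 at index 60.
	series := make([]float64, 120)
	for i := range series {
		series[i] = 0.3 * math.Sin(1.7*float64(i))
		if i >= 60 {
			series[i] += 3
		}
	}
	times, scores := Detect(series, 20)
	best := 0
	for i := range scores {
		if scores[i] > scores[best] {
			best = i
		}
	}
	fmt.Printf("largest score %.3f at candidate change point %d\n", scores[best], times[best])
}
```

On this toy series the largest score should appear at or near index 60, where the reference window contains only pre-change values and the test window only post-change values. In an online setting the score would be compared against a threshold as each new observation arrives instead of searching for the maximum afterwards.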

Results
To

Conclusion
A simulator for modeling storage infrastructure based on the event-driven paradigm was presented. It allows researchers to try different I/O load scenarios to test disk performance and to model failures of hardware components. By providing large amounts of synthetic anomaly data and time series of a machine in various modes, the simulator can also be used as a benchmark for comparing different change-point detection algorithms. In this work, the density ratio estimation CPD algorithm was successfully applied to the simulator data.