Analysis of the mean time to data loss of fault-tolerant RAID-6 disk arrays based on a specialized Markov chain

This paper analyzes the mean time to data loss (MTTDL) of redundant RAID-6 disk arrays with data striping, taking into account the different disk failure rates in the normal, degraded and rebuild states of the array, as well as a nonzero disk replacement time. The paper presents a reliability model developed by the authors on the basis of a Markov chain, together with the resulting calculation formula for estimating the MTTDL of RAID-6 disk arrays. Finally, a technique for estimating the initial reliability parameters and examples of MTTDL calculations for different numbers of disks are given.


Introduction
These days, data storage systems [1,2] are widely used as the hardware platform for the information systems that support business processes in modern enterprises. The stability of these information systems directly depends on the reliability of the underlying data storage systems. Therefore, disk arrays with data striping are often applied to increase the fault tolerance of data storage systems.
The key reliability index of modern disk arrays is the mean time to data loss, and, accordingly, the development of reliability models for disk arrays is a relevant scientific problem.
A number of academic books [3][4][5][6] on reliability theory contain simplified reliability models of technical systems; these models do not consider the specific features of modern disk arrays with data striping and therefore yield considerably overestimated values of the mean time to data loss. There is also a set of specialized reliability models for disk arrays [7][8][9][10], but they do not take into account the time required to replace the faulty disks.
Accordingly, in this paper the authors propose a reliability model for RAID-6 disk arrays with data striping that takes the disk replacement time into account.

Data redundancy in the RAID-6 disk arrays
A RAID-6 disk array consists of n ≥ 4 independent disks of equal capacity. It remains operational and keeps data available after the failure of at most two disks (any two of them). The effective capacity of the array equals (n − 2)/n of the total capacity of all disks: a 2/n share of each disk is reserved for redundant (control) data, which is calculated from the user data stored on the other disks.
In case of failure of any one or two disks, the missing information can be recalculated from the user and control data stored on the remaining disks. If a third disk fails before at least one of the faulty disks has been replaced and the data replication onto the replaced disk has completed, all data of the entire array are irreparably lost.
Thus, the RAID-6 array can be considered a good compromise between fault tolerance and redundancy. Figure 1 gives, as an example, the distribution scheme of user and redundant data blocks in a RAID-6 array with five disks.
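To illustrate how the redundant (control) blocks can be computed and how two simultaneously lost blocks are recovered, the sketch below implements one common dual-parity construction: P is the bytewise XOR of the data blocks, and Q is a Reed-Solomon syndrome over GF(2^8) (the construction used, for example, by the Linux md driver). The paper does not fix a particular code, so this encoding is an assumption for illustration, not the authors' scheme.

```python
# GF(2^8) tables, generator polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d), g = 2.
EXP = [0] * 512
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11d
for i in range(255, 512):          # duplicate so exponent sums need no modulo
    EXP[i] = EXP[i - 255]

def gmul(a, b):
    """Multiplication in GF(2^8) via log/exp tables."""
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def gdiv(a, b):
    """Division in GF(2^8); b must be nonzero."""
    return 0 if a == 0 else EXP[LOG[a] - LOG[b] + 255]

def pq(data):
    """P and Q parity bytes for one stripe of data bytes d_0..d_{k-1}."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d                     # P = d_0 xor d_1 xor ...
        q ^= gmul(EXP[i], d)       # Q = sum g^i * d_i over GF(2^8)
    return p, q

def recover_two(data, x_idx, y_idx, p, q):
    """Rebuild the two data blocks at x_idx and y_idx lost simultaneously."""
    px, qx = p, q
    for i, d in enumerate(data):   # fold in the surviving blocks only
        if i not in (x_idx, y_idx):
            px ^= d
            qx ^= gmul(EXP[i], d)
    # Now px = d_x xor d_y and qx = g^x*d_x xor g^y*d_y; solve for d_x, d_y.
    gx, gy = EXP[x_idx], EXP[y_idx]
    dx = gdiv(gmul(gy, px) ^ qx, gx ^ gy)
    return dx, px ^ dx
```

A single lost block needs only P (plain XOR), which is why RAID-6 degrades gracefully through one- and two-disk failures but loses everything on a third.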

Analysis of the mean time to data loss of RAID-6 disk arrays
At first, let us consider the reliability model, based on a Markov chain, offered by the authors for RAID-6 disk arrays with data striping. The model takes into account the different disk failure rates in the normal, degraded and rebuild states, and the nonzero time of faulty disk replacement. It uses the following set of states of a RAID-6 disk array and transitions between them:
• State 0 (normal state): all n disks of the array are operable and the array data are available. The array can pass from this state to state 1 with rate nλ0 (failure of any operable disk).
• State 1 (degraded state 1): one disk is faulty and waits for replacement, the remaining n − 1 disks are operable, the array data are available. The array can pass from this state either to state 2 with rate (n − 1)λ1 (failure of one more operable disk), or to state 3 with rate μD (replacement of the faulty disk).
• State 2 (degraded state 2): two disks are faulty and wait for replacement, the remaining n − 2 disks are operable, the array data are available. The array can pass from this state either to state 4 with rate μD (replacement of one of the faulty disks), or to state F with rate (n − 2)λ2 (failure of any third disk).
• State 3 (rebuild state 3): the faulty disk has been replaced and is involved in the data replication process, the remaining n − 1 disks are operable, the array data are available. The array can pass from this state either to state 0 with rate θ1 (completion of the data replication on the replaced disk), or to state 1 with rate λR (failure of the replaced disk during the data replication), or to state 4 with rate (n − 1)λ1 (failure of one of the operable disks during the data replication), or to state F with rate ε1 (read error on one of the operable disks during the data replication).
• State 4 (rebuild state 4): one of the faulty disks has been replaced and is involved in the data replication process, the other faulty disk waits for replacement, the remaining n − 2 disks are operable, the array data are available. The array can pass from this state either to state 1 with rate θ2 (completion of the data replication on the replaced disk), or to state 2 with rate λR (failure of the replaced disk during the data replication), or to state 5 with rate μD (replacement of the second faulty disk), or to state F with rate ε2 (read error on one of the operable disks during the data replication).
• State 5 (rebuild state 5): both faulty disks have been replaced and are involved in the data replication process, the other n − 2 disks are operable, the array data are available. The array can pass from this state either to state 0 with rate θ2 (completion of the data replication on both replaced disks), or to state 4 with rate 2λR (failure of one of the replaced disks during the data replication), or to state F with rate ε2 (read error on one of the operable disks during the data replication).
• State F (failed state): the array data are unavailable and irreparably lost.
The Markov chain representing the set of states and transitions discussed above is shown in figure 2.
Accordingly, the Kolmogorov-Chapman system of differential equations for this Markov chain is as follows:
dP0/dt = −nλ0·P0 + θ1·P3 + θ2·P5,
dP1/dt = nλ0·P0 − ((n − 1)λ1 + μD)·P1 + λR·P3 + θ2·P4,
dP2/dt = (n − 1)λ1·P1 − ((n − 2)λ2 + μD)·P2 + λR·P4,
dP3/dt = μD·P1 − (θ1 + λR + (n − 1)λ1 + ε1)·P3,
dP4/dt = μD·P2 + (n − 1)λ1·P3 + 2λR·P5 − (θ2 + λR + μD + ε2)·P4,
dP5/dt = μD·P4 − (θ2 + 2λR + ε2)·P5,
dPF/dt = (n − 2)λ2·P2 + ε1·P3 + ε2·P4 + ε2·P5.
Considering that the initial state of the RAID-6 disk array is state 0, and that only in states 0-5 the array is operable and the user data are available, the mean time to data loss of the RAID-6 array can be defined as the mean time the array spends in states 0-5:
MTTDL = ∫0..∞ [P0(t) + P1(t) + P2(t) + P3(t) + P4(t) + P5(t)] dt.
On the basis of advanced mathematical analysis, the authors solved this problem and obtained a calculation formula for the mean time to data loss of the RAID-6 array: denoting by Ti = ∫0..∞ Pi(t) dt the mean total time spent in state i, the values T0, ..., T5 are found from the linear system obtained by integrating the equations above with the initial condition P0(0) = 1, and MTTDL = T0 + T1 + T2 + T3 + T4 + T5.
Note 1. In case the faulty disk replacement rate μD → ∞ (the average replacement time for the faulty disks tends to zero), the mathematical analysis provides the following simplified formula:
MTTDL = [(θ1 + (n − 1)λ1 + ε1)(θ2 + ε2)/(nλ0) + θ2 + ε2 + (n − 1)λ1] / [ε1(θ2 + ε2) + (n − 1)λ1·ε2].
Note 2. In case the faulty disk replacement rate μD = 0 (no replacement of the faulty disks), the mathematical analysis provides the following simplified formula:
MTTDL = 1/(nλ0) + 1/((n − 1)λ1) + 1/((n − 2)λ2).
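The MTTDL can also be obtained numerically by solving the linear system for the mean time the chain spends in each operable state before absorption in F. A minimal sketch, assuming the transition rates described for states 0-5 (in particular, that state 2 holds two faulty disks awaiting replacement and that unrecoverable read errors lead from the rebuild states to state F):

```python
import numpy as np

def raid6_mttdl(n, lam0, lam1, lam2, lamR, muD, th1, th2, eps1, eps2):
    """MTTDL (hours) of the 6-state RAID-6 Markov chain, starting in state 0."""
    Q = np.zeros((6, 6))   # transition rates between the operable states 0..5
    toF = np.zeros(6)      # rates into the absorbing failed state F
    Q[0, 1] = n * lam0                 # state 0: any disk fails
    Q[1, 2] = (n - 1) * lam1           # state 1: second disk fails
    Q[1, 3] = muD                      #          faulty disk replaced
    Q[2, 4] = muD                      # state 2: one faulty disk replaced
    toF[2] = (n - 2) * lam2            #          third disk fails -> data loss
    Q[3, 0] = th1                      # state 3: replication completes
    Q[3, 1] = lamR                     #          replaced disk fails
    Q[3, 4] = (n - 1) * lam1           #          an operable disk fails
    toF[3] = eps1                      #          read error during replication
    Q[4, 1] = th2                      # state 4: replication completes
    Q[4, 2] = lamR                     #          rebuilding disk fails
    Q[4, 5] = muD                      #          second faulty disk replaced
    toF[4] = eps2
    Q[5, 0] = th2                      # state 5: replication completes
    Q[5, 4] = 2 * lamR                 #          one rebuilding disk fails
    toF[5] = eps2
    np.fill_diagonal(Q, -(Q.sum(axis=1) + toF))
    # Mean times to absorption t solve Q t = -1; MTTDL is t for state 0.
    t = np.linalg.solve(-Q, np.ones(6))
    return t[0]
```

Setting muD = 0 makes the result collapse to the three-term series of Note 2, which is a convenient sanity check on the generator matrix.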

Estimation of the initial reliability parameters for the RAID-6 disk arrays
The disk failure rate λ0 in the fully operable state of the array can be estimated from the mean time to disk failure T_DF (Mean Time to Failure), obtained from practical experience or provided by the disk manufacturer. The failure rates λ1 and λ2, which apply when one or two disks of the RAID-6 array are unavailable, are higher than λ0, because besides the primary load the operable disks bear extra read operations needed to recalculate the data of the unavailable disks. As a rough estimate, one may take λ1 twice as high and λ2 three times as high as λ0. The failure rate λR of a freshly replaced disk undergoing replication is significantly higher than λ0 because of the large amount of write operations; one may take it five times as high as λ0.
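These estimates amount to one-line formulas; a minimal sketch (T_DF in hours, rates in hour⁻¹; the factors 2, 3 and 5 are the rough multipliers suggested above):

```python
def disk_failure_rates(t_df_hours):
    """Estimate lambda_0, lambda_1, lambda_2 and lambda_R from the disk MTTF."""
    lam0 = 1.0 / t_df_hours            # normal state of the array
    return (lam0,                      # lambda_0
            2 * lam0,                  # lambda_1: one disk unavailable
            3 * lam0,                  # lambda_2: two disks unavailable
            5 * lam0)                  # lambda_R: replaced disk being rebuilt
```

For example, a manufacturer MTTF of 1,000,000 hours gives λ0 = 10⁻⁶ hour⁻¹.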
Finally, taking the aforesaid into consideration, the parameters λ0, λ1, λ2 and λR can be estimated using the following simple formulas:
λ0 = 1/T_DF;  λ1 = 2λ0;  λ2 = 3λ0;  λR = 5λ0.
The disk replacement rate varies depending on the replacement method: the disk may be replaced automatically using additional spare disks and the hot-spare technology, or detected and replaced manually by technical specialists. In both cases, however, the replacement rate is defined by the given (or obtained from practical experience) mean time of waiting for a spare τWS:
μD = 1/τWS.
The data replication rate θ1 in case of unavailability of one disk depends on the given disk capacity V (bytes), the average write speed vWR (bytes/s) of the disks and the average data recalculation speed vC1 (bytes/s) of the disk array with one disk unavailable. In the case of two unavailable disks the recalculation speed vC2 is obviously lower, because the recalculation takes more time, and therefore the replication rate θ2 is also lower. The rates θ1 and θ2 (hour⁻¹) can be simply estimated as
θ1 = 3600 / (V/vC1 + V/vWR);  θ2 = 3600 / (V/vC2 + V/vWR).
The rate ε1 of read errors on the operable disks during data replication with one disk unavailable is estimated from the probability of an unrecoverable bit read error P_URE, provided by the disk manufacturer or obtained from practical experience, the given disk capacity V (bytes) and the computed replication rate θ1 (hour⁻¹); the rate ε2 for replication at rate θ2 (hour⁻¹) with two disks unavailable is estimated similarly:
ε1 = 8·V·P_URE·θ1;  ε2 = 8·V·P_URE·θ2.
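The remaining parameters can be computed the same way. The sketch below assumes one plausible reading of the estimation formulas, since the source expressions were partly lost: the replication time is taken as a recalculation pass plus a write pass over the capacity V, and the read-error rate as the expected number of unrecoverable bit errors per rebuild (8·V·P_URE) times the replication rate. Treat these as assumptions, not necessarily the authors' exact expressions.

```python
def replacement_rate(tau_ws_hours):
    """mu_D (1/hour) from the mean time of waiting for a spare disk."""
    return 1.0 / tau_ws_hours

def replication_rates(v_bytes, v_wr, v_c1, v_c2):
    """theta_1, theta_2 (1/hour); speeds in bytes/s, capacity in bytes.

    Assumed model: replication = recalculation pass + write pass over V.
    """
    t1 = v_bytes / v_c1 + v_bytes / v_wr   # seconds, one disk unavailable
    t2 = v_bytes / v_c2 + v_bytes / v_wr   # seconds, two disks unavailable
    return 3600.0 / t1, 3600.0 / t2

def read_error_rates(p_ure, v_bytes, th1, th2):
    """epsilon_1, epsilon_2 (1/hour): expected UREs per rebuild x rebuild rate."""
    per_rebuild = 8.0 * v_bytes * p_ure    # bits read from one disk of size V
    return per_rebuild * th1, per_rebuild * th2
```

With a 2 TB disk (V = 2·10¹²) and P_URE = 10⁻¹⁵, the factor 8·V·P_URE ≈ 0.016 expected unrecoverable errors per full-disk read, which then scales with how often rebuilds occur.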