Detection and identification of faults in clock ensembles with the generalized likelihood ratio test

In this paper we propose an approach for performing fault detection and identification in clock ensembles based on the generalized likelihood ratio test. We show that by applying a set of purposefully-designed statistical tests, one can successfully detect faults occurring in a clock of the ensemble, and identify which measurement in the ensemble is most likely to have triggered the detection. We first develop the theoretical framework for the characterization of the detectors and their performance, and validate the derivations via Monte Carlo simulations. Then, we apply the statistical tests to an ensemble of cesium clocks, aiming at detecting and identifying three types of non-nominal behaviors. The faulty conditions are obtained by injecting a pattern of phase steps, a phase and frequency drift, and an oscillatory phase component.


Introduction
Clock ensembles play a fundamental role in providing a stable and robust time scale for high-integrity applications. Compared to single-clock approaches, clock ensembles are characterized by improved stability and robustness. However, clock faults can still impact the stability and availability of the generated time scale. Thus, a clock ensemble must be equipped with a dedicated fault monitoring system performing three steps: fault detection, fault identification, and fault adaptation. Detecting a fault means raising an alarm when a fault occurrence is observed, while the identification step aims at locating where in the system the fault is occurring. Finally, fault adaptation consists in taking corrective actions to reduce the impact of the fault on the system, for instance by repairing or temporarily removing the faulty unit. Since in this paper we only tackle the first two steps, we will talk of fault detection and identification (FDI).
A FDI algorithm is implemented in a detector. The detector observes a particular quantity (the observable) and scans for deviations from the expected nominal behavior caused by the occurrence of a fault. Different faults act differently on the system, so we must choose the right observable (or a set thereof) to properly identify the different types of expected errors. In this work we use three observables, namely Kalman filter residuals, phase measurements, and the dynamic Allan variance (DAVAR) of differential clock measurements. For each of these we devise a dedicated detector based on the generalized likelihood ratio test (GLRT), which accounts for the statistical distribution of the observable in both nominal and faulty conditions. We derive a model-based test and a self-consistency test. In the model-based test the detector compares the current observation to the value predicted by a clock model: any deviation between the two can originate from a fault, but also from a mismodelling of the expected behavior. We apply the model-based test to the entire set of clock measurements (overall model test) to perform fault detection, and then to each measurement singularly to perform fault identification (w-test) [1, 2]. In the case of homogeneous ensembles, which only contain clocks of the same type, we construct the self-consistency test. This test does not require defining clock noise parameters: the availability of multiple clocks of the same type allows the estimation of the nominal clock behavior directly from the observations.

Table 1. Literature on fault detection in clocks and clock ensembles.

Method | Single clocks | Clock ensembles
Kalman filter residuals | [4-6, 29] | [3, 23-25]
GLRT on clock measurements | [7-11] | This work
Dynamic Allan variance | [12-16] | This work
Trend analysis, smoothing, recursive filters | [18-20] | [27, 28]
Optimal stopping method | [21] | —
Average, least-squares, standard deviation on sliding windows | [17] | [26]
Interferometry | [22] | —
Instead of detecting deviations between observed and modelled behaviors, we now detect deviations between one measurement and the average observed behavior of all measurements. After deriving the different types of statistical tests, we describe the expected performance of each in terms of probability of false alarm and probability of missed detection. Monte Carlo simulations are then used to validate the theoretical derivation. Finally, we perform an experimental assessment of all the tests devised by injecting different types of fault patterns in a laboratory clock ensemble, and we verify whether the tests can successfully detect and identify the faulty signal.

Table 1 shows a selection of literature on fault detection in clocks and clock ensembles. While most of the studies focus on detecting faults on single clock signals, some works specifically target clock ensembles. In [3] the authors propose the use of a Kalman filter for fault detection in a clock ensemble: by tracking the evolution of the Kalman filter residuals it is possible to detect discrepancies between model and observations. This method is further investigated in [4] and applied in [5, 6]. The use of the GLRT for detecting clock failures is described in [7], and further developed in [8-10]. In [11] the method is applied to the detection of anomalies on a GPS satellite clock. The DAVAR method, detailed in [12-14], makes use of the overlapping Allan variance (OAVAR) computed on sliding windows of given length, and compares the temporal changes in the data to predefined DAVAR values. The detector is triggered if the deviations exceed a statistically significant threshold. The application of this method to clocks onboard satellites of global navigation satellite systems (GNSSs) is shown in [15, 16].
In [17] the author describes three algorithms for the detection of jumps in frequency measurements: the BLKAVG algorithm computes the average of frequency points on non-overlapping moving windows, detecting a jump when the difference in averages between two windows is larger than a given threshold; the SEQAVG algorithm employs an averaging window only once a potential jump is detected, to confirm or reject the detection; finally, the CUSUM algorithm computes the cumulative sum of the differences between each frequency data point and the average value over the entire data set. The slope of the cumulative sum reflects the average behavior of the measurement and can be used to detect frequency jumps. Other techniques for fault detection in clock signals include trend analysis and filtering of the frequency signal [18-20], the optimal stopping method [21], and interferometric analysis [22]. Fault detection in clock ensembles is further discussed in [23-25] using Kalman filters. The methods in [26-28], instead, use least-squares fitting, standard deviation computed on sliding windows, and infinite impulse response filters.
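The CUSUM idea can be illustrated with a short numerical sketch on synthetic data. This is our own minimal version, not the implementation of [17]; the detection threshold and all names are hypothetical choices for illustration:

```python
import numpy as np

def cusum(y):
    """Cumulative sum of deviations from the full-record average.

    A frequency jump changes the slope of the cumulative sum; a simple
    detector flags the record when the absolute CUSUM exceeds a
    user-chosen threshold.
    """
    return np.cumsum(y - np.mean(y))

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1e-13, 1000)   # nominal fractional-frequency noise
y[500:] += 1e-12                   # injected frequency jump
s = cusum(y)
detected = bool(np.max(np.abs(s)) > 5e-12)   # hypothetical threshold
```

Note that the CUSUM always returns to zero at the end of the record by construction; it is the intermediate excursion, driven by the slope change at the jump, that reveals the anomaly.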

State of the art
In this work we integrate and extend some of these approaches for FDI. Expanding on our previous works [30, 31], we provide a theoretical framework for the design of fault detectors in clock ensembles. The techniques shown here can be directly applied to real-time dependable timing systems, greatly improving the robustness of the generated time scale. The paper shows how to take advantage of the GLRT for the observation of different quantities, so that the integrity layer of a system can be easily expanded with further detectors using the same approach. Furthermore, the GLRT allows us to predict the performance of the detectors in terms of probability of false alarm (P_fa), probability of missed detection (P_md), and minimum detectable bias (MDB). This enables tuning the detectors to fulfill the desired integrity requirements, namely which magnitude of fault is detectable and with which probability. Taking as an example a precise timing facility running a clock ensemble to provide timing services, these methods would benefit both the provider and its users. Firstly, the provider can guarantee a given integrity level, since the detection performance can be determined beforehand; then, in case of a faulty clock, warnings can be promptly provided to the users, while the fault is located in the system and corrective actions are put in place. An additional advantage of these methods is represented by the self-consistency test, which detects whether one of the clocks diverges from the ensemble's average behavior. Since no a priori clock noise model is assumed, this test is not triggered by any influence affecting all clocks in the same way. The evaluation of clocks on board GNSS satellites is an example of such a benefit: the estimation of the clocks' states is affected by the residuals of the orbit determination process, which cause an increase of the observed OAVAR for sampling intervals around half of the orbital period [32].
A model-based test would in this case trigger a detection since the observed behavior differs from the clock model, even if the misbehavior is not due to the clock itself. Instead, the self-consistency test would not trigger, since the additive bias is present on all clocks.

Clock ensemble
The system under study consists of a set of clocks, a measurement device, a Kalman filter and a number of detectors. While we keep the theoretical development as general as possible concerning number and type of clocks, the experimental verification is performed with an ensemble of 5 cesium frequency references. Figure 1 shows a schematic of the ensemble. The measurement device generates differential clock measurements, which are phase and/or frequency differences between two clocks. In our setup the first clock acts as reference against which we measure the remaining units, but different measurement topologies can be used. The clock measurements are fed to the Kalman filter, which first predicts the states of the clocks according to a specified clock model, and then updates the predicted states with the available observations. The Kalman filter is executed on a laboratory computer connected to the measurement device. This computer also runs the FDI algorithm. Furthermore, a dedicated device can inject a desired fault pattern into the signal of a clock, so that we can run experiments for assessing the performance of the detectors. The injected fault is indicated by ∇(t) in figure 1.

Clock model and ensemble model
We use a two-state clock model with constant frequency drift and Markov processes, as described in [33-35]. The state vector of each clock contains one state representing the phase deviation and one state for the frequency deviation. When required for more accurate modelling, the state vector of the single clock is extended with one or more additional states, related to additive Markov processes [34]. We combine the states of each clock to obtain the state vector x of the entire ensemble.

Figure 1. Functional representation of the clock ensemble under study, with measurement topology and processing units: the clocks, whose state vector is x(t), are measured by a device generating clock measurements z(t). A Kalman filter computes the estimates x̂(t) of the clock states. The FDI algorithm uses the clock measurements z(t), the state estimates x̂(t), and the dynamic Allan variance (DAVAR) σ²_y(t, τ) of the clock measurements to observe and evaluate the state of the system. For assessing the FDI algorithm, we inject a fault ∇(t) into the signal of the second clock. The computer executing the algorithm is not depicted.

In time-discrete notation we describe the dynamics of the system and its observation with

x_k = Φ x_{k−1} + d + n_k,    (1)
z_k = H x_k + v_k,    (2)

where the subscript k indicates the quantities at time t_k, τ_0 = t_k − t_{k−1} is the constant time step, Φ the state propagation matrix, d the drift vector, z the vector of clock measurements, and H the measurement matrix. The process noise n_k and the measurement noise v_k are assumed Gaussian:

n_k ∼ N(0, Q),    v_k ∼ N(0, R),
where Q is the process noise covariance matrix and R is the covariance matrix of the measurement noise. The matrices Φ and Q are block diagonal, where each block is the corresponding matrix for the single clock i, ^iΦ(τ_0) and ^iQ(τ_0), computed with the model parameters ^iσ_1 and ^iσ_2 of the respective clock:

^iΦ(τ_0) = [ 1  τ_0 ; 0  1 ],

^iQ(τ_0) = [ ^iσ_1² τ_0 + ^iσ_2² τ_0³/3   ^iσ_2² τ_0²/2 ; ^iσ_2² τ_0²/2   ^iσ_2² τ_0 ].

If Markov processes need to be added to model specific behaviors, we expand the matrices accordingly [34]. Denoting the constant frequency drift of the i-th clock as ^id, we generate the drift vector by stacking the vectors ^id(τ_0) = ^id [ τ_0²/2 ; τ_0 ] of the single clocks. The measurement matrix H describes how the clocks are measured with respect to each other. It depends on the measurement topology of the ensemble and on whether differential phase or frequency measurements are available. In our case we measure the phase of each clock with respect to the first clock.
Additionally, we define a second measurement matrix H̄, which maps the clock units to the measurements. It can be obtained from H by eliminating the null columns. For instance, the matrix corresponding to the aforementioned topology with four clocks and three differential measurements is

H̄ = [ −1  1  0  0 ; −1  0  1  0 ; −1  0  0  1 ].

This matrix represents the measurement topology without dependence on the number of states used to model the system.
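The ensemble matrices described above can be assembled in a few lines. The sketch below uses the standard two-state closed forms for the per-clock blocks and illustrative parameter values of our own choosing; the sign convention z_i = x_i − x_1 for the reduced measurement matrix is an assumption:

```python
import numpy as np
from scipy.linalg import block_diag

def phi_q_two_state(tau0, s1, s2):
    """Per-clock blocks of the standard two-state model (phase +
    frequency deviation); s1, s2 are the diffusion coefficients
    written as i-sigma_1, i-sigma_2 in the text."""
    Phi = np.array([[1.0, tau0],
                    [0.0, 1.0]])
    Q = np.array([[s1**2 * tau0 + s2**2 * tau0**3 / 3.0, s2**2 * tau0**2 / 2.0],
                  [s2**2 * tau0**2 / 2.0,                s2**2 * tau0]])
    return Phi, Q

# ensemble matrices for 4 clocks (illustrative parameter values)
tau0, params = 1.0, [(3e-12, 1e-16)] * 4
blocks = [phi_q_two_state(tau0, s1, s2) for s1, s2 in params]
Phi = block_diag(*(b[0] for b in blocks))
Q = block_diag(*(b[1] for b in blocks))

# reduced measurement matrix: phases of clocks 2..4 against clock 1
Hbar = np.hstack([-np.ones((3, 1)), np.eye(3)])
```

Each row of Hbar sums to zero, reflecting that only clock differences, not absolute phases, are observable.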

Kalman filter
We run a clock ensembling algorithm based on a Kalman filter, as described in [34, 36]. At each time step, the filter provides an estimate of the clock states by executing the following steps:

x⁻_k = Φ x̂_{k−1} + d,
P⁻_k = Φ P_{k−1} Φᵀ + Q,
K_k = P⁻_k Hᵀ (H P⁻_k Hᵀ + R)⁻¹,
x̂_k = x⁻_k + K_k (z_k − H x⁻_k),    (13)
P_k = (I − K_k H) P⁻_k.    (14)

Here x⁻ is the predicted estimate of the state vector x, with associated error covariance matrix P⁻, and K is the Kalman gain, used to generate the updated state estimate x̂, with associated error covariance P. To avoid the unbounded growth of the components in P, we employ the covariance reduction method (15) described in [36]. The reduced covariance matrix P_r is obtained via the diagonal matrix S, an identity matrix whose entries corresponding to the phase components in the state vector are set to zero. Finally, (16) ensures that the resulting reduced covariance matrix is diagonal. In (14), I denotes an identity matrix whose size is the number of states in x.
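One prediction/update cycle of the filter can be sketched as follows. This is a minimal version that returns the innovation used later by the detectors; the covariance reduction of [36] is omitted, and the toy matrices in the usage example are our own illustrative values:

```python
import numpy as np

def kf_step(x, P, z, Phi, Q, d, H, R):
    """One prediction/update cycle of the ensembling Kalman filter.

    Returns the updated state and covariance, plus the innovation
    (the Kalman filter residual used by the detectors) and its
    covariance. Covariance reduction is omitted in this sketch.
    """
    x_pred = Phi @ x + d                     # state prediction
    P_pred = Phi @ P @ Phi.T + Q             # covariance prediction
    rho = z - H @ x_pred                     # innovation (residual)
    Omega = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(Omega)  # Kalman gain
    x_new = x_pred + K @ rho                 # state update
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new, rho, Omega

# toy usage: two clocks (4 states), one differential phase measurement
Phi = np.kron(np.eye(2), np.array([[1.0, 1.0], [0.0, 1.0]]))
Q = 1e-24 * np.eye(4)
d = np.zeros(4)
H = np.array([[-1.0, 0.0, 1.0, 0.0]])   # phase of clock 2 minus clock 1
R = 1e-25 * np.eye(1)
x, P = np.zeros(4), 1e-22 * np.eye(4)
x, P, rho, Omega = kf_step(x, P, np.array([2e-11]), Phi, Q, d, H, R)
```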

Faults, observables and hypotheses
In this paper we consider two different classes of faults. The first kind consists of abrupt changes in the observed phase or frequency signals, such as jumps and steps, which present very fast dynamics. The second type of fault shows a slower behavior and includes phase or frequency drifts as well as oscillatory components in the signals. A drift can arise due to the aging of the clocks, while the oscillation can originate from periodic effects, such as daily temperature fluctuations or orbital-related dynamics. While abrupt faults are promptly reflected in the Kalman filter residuals, drifts cause a step-to-step phase and frequency deviation which might be too small to be detected by testing the filter residuals. Thus, to detect these faults we need a quantity which 'accumulates' the error, such as the phase measurements and their DAVAR. Clearly, the required accumulation of a sufficient number of samples introduces a delay between the fault onset and the time of first detection.
In the following sections we introduce the theoretical descriptions of the three observables: Kalman filter residuals, phase measurements, and DAVAR. For each observable we define a null hypothesis H 0 , which describes its statistical distribution when the system operates under nominal conditions. An alternative hypothesis H A describes instead the observable's distribution when a fault is occurring in the system. With these hypotheses we then derive a statistical test decision based on the GLRT.

Kalman filter residuals
The Kalman filter residuals track the deviations between the observed clock measurements z and the measurements predicted by the model. Recalling (13) and (14), we express the residuals and their covariance as:

ρ_k = z_k − H x⁻_k,
Ω_k = H P⁻_k Hᵀ + R.

In nominal conditions we expect the residuals to remain small, while their magnitude increases if the observed measurements differ from the modelled behavior. We therefore define the two hypotheses:

H_0: ρ_k ∼ N(0, Ω_k),    H_A: ρ_k ∼ N(C∇, Ω_k).    (19)

Here we introduce the vector of faults ∇ affecting the system, and the matrix C describing how the faults act on the observables. The size of vector ∇ depends on the shape of matrix C, whose free design allows the selection of a given fault mode to be tested. For example, setting C = I_M, an identity matrix of size M, enables us to perform fault detection by assuming a potential fault on all elements of the observable vector, with ∇ becoming a vector of M entries. When assigning C = c_i instead, with c_i an M-element vector with 1 at the i-th entry and zero elsewhere, we can scan all single measurements i = 1 . . . M to locate the faulty one. In this case ∇ is a scalar.

Phase measurements
We describe the time-varying distribution of the phase measurements of the clocks with the model equation (1), where x(t_0) contains the initial phase and frequency states of the ensemble. Then, from (2), the measurement vector distributes as

z_k ∼ N(μ_k, Σ_k),

where μ_k is the expected phase evolution obtained by propagating x(t_0) with the state propagation matrix and the drift vector, and Σ_k is the corresponding covariance, accumulating the process and measurement noise contributions. We define the phase residuals as the difference between observed and expected phase measurements, ρ^(z)_k = z_k − μ_k, so that the corresponding hypotheses can be defined:

H_0: ρ^(z)_k ∼ N(0, Σ_k),    H_A: ρ^(z)_k ∼ N(C∇, Σ_k).    (25)

These hypotheses have the same form as those derived for the Kalman filter residuals (19). While the latter track the evolution of the phase measurements with respect to the previous time step, the phase measurement residuals describe the overall phase evolution from the initial time instant.

Dynamic Allan variance
The third observable is the DAVAR of differential phase measurements computed on a sliding window of given length. In [12] the DAVAR value is centered in the middle of the sliding window, whereas in this paper the evaluation time corresponds to the right limit of the window. In this way the DAVAR can be used in a real-time scenario employing the latest available data samples. Thus, the DAVAR at time t_k uses the measurements in the window [t_k − W, t_k], where W is the length of the sliding window. We use the notation ^{ij}σ²_{k,τ,W} for the DAVAR of the measurement between clocks i and j, at time instant k, for sampling interval τ and window length W. The expected value ^{ij}σ̄²_τ and the variance Var[^{ij}σ²_{τ,W}] of the DAVAR are assumed constant in time and can be expressed using the clock noise parameters. Here, the noise parameters refer to the combined contribution of the two clocks included in each measurement. The function in (27) is explicitly derived in [37] for different noise components. Since an expression of the variance also considering Markov processes is not available yet, we use a simplified clock model without additional processes, whose parameters are reported in section 4.4, expression (73). The DAVAR at time t_k distributes according to a χ²-distribution [38], namely (in the following we drop the ij superscript):

p_{τ,W} σ²_{k,τ,W} / σ̄²_τ ∼ χ²(p_{τ,W}),    (28)

where p_{τ,W} are the degrees of freedom. By combining (28) and (29) we define the normalized DAVAR

ξ_{k,τ,W} = σ²_{k,τ,W} / σ̄²_τ.

We express the distribution of ξ using the gamma distribution, since it allows us to also express the faulty case:

ξ_{k,τ,W} ∼ Γ(p_{τ,W}/2, 2/p_{τ,W}),    (33)

with shape p_{τ,W}/2 and scale 2/p_{τ,W}. If a fault occurs in the system, we expect a change of the measured DAVAR, therefore we modify the theoretical value by the quantity ∇_{k,τ,W}. The quantity ξ refers to a single measurement. To design an overall model test, we need to express the distribution of a vector of observables ξ containing the observations of all measurements. However, expanding expression (33) to a multivariate gamma distribution is not trivial.
In this work we focus on the scalar case, leaving the extension to the multivariate case for future work. It is worth noting that we can derive an overall model test by approximating the gamma distribution (33) with a Gaussian distribution. This holds for a number of degrees of freedom p_{τ,W} sufficiently large, which occurs when the sampling interval τ is small compared to the window length W. Under this approximation we can follow the same procedure previously derived for tests on the phase measurements: compute the DAVAR residuals as the difference between observed and expected behaviors, and derive the null and alternative hypotheses, which would be in the same form as (19) and (25). However, in this paper we only focus on the w-test derived from the exact distribution (33) of ξ, which is valid for any sampling interval. The corresponding hypotheses compare the nominal distribution of ξ with a distribution whose expected value is shifted by the normalized fault magnitude ∇_{k,τ,W}/σ̄²_τ.
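The right-edge evaluation described above can be sketched numerically. The code below computes the overlapping Allan variance from phase data and restricts it to the last W seconds of the record; it is our own minimal implementation for illustration, not the paper's code:

```python
import numpy as np

def oavar(phase, tau0, m):
    """Overlapping Allan variance of phase data for tau = m * tau0."""
    x = np.asarray(phase, dtype=float)
    d2 = x[2 * m:] - 2 * x[m:-m] + x[:-2 * m]   # overlapping second differences
    return np.mean(d2**2) / (2.0 * (m * tau0)**2)

def davar_right_edge(phase, tau0, m, W):
    """DAVAR-style estimate using only the last W seconds (real-time use)."""
    n = int(W / tau0)
    return oavar(phase[-n:], tau0, m)
```

As a sanity check, a purely linear phase (constant frequency offset) gives zero Allan variance, while a quadratic phase x(t) = t² with τ_0 = 1 s gives second differences equal to 2 and hence an Allan variance of 2 at τ = 1 s.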

Fault detection and identification with the GLRT
In this section we derive the statistical tests based on the observables' distributions. In the GLRT framework we devise a model-based test, which can be in the form of an overall model test or a w-test, and a self-consistency test. The derivation of the detectors starts from the definition of the hypotheses on the observed distributions. In section 3 we showed that the hypotheses are of the same form for normally distributed observables (Kalman filter residuals and phase measurements), thus we need only one derivation of a test expression, which can then be applied to all of them. Additionally, we need a separate derivation of the w-test for gamma-distributed observables (normalized DAVAR). Table 2 gives an overview of the observables and the tests derived in this work. (Notes to table 2: (a) the self-consistency test is not applicable, since the Kalman filter residuals already include model information; (b) not derived in this work.)

Assume we have a vector of observables with given distribution y ∼ f(y|α), where the parameter α takes values in the set G. If we consider the hypotheses

H_0: α ∈ G_0,    H_A: α ∈ G \ G_0,

where G_0 ⊂ G, the GLRT for this problem is then defined as: reject H_0 if

sup_{α∈G} f(y|α) / sup_{α∈G_0} f(y|α) > a.

The threshold a defines the critical region K, i.e. the set of values of y leading to a rejection of H_0.
Since the measurements are affected by noise, the test can deliver a wrong result. We talk of a false alarm (or type I error) if H_0 is rejected when in fact H_0 is true. The probability of false alarm P_fa of the test (or size of the test) is

P_fa = P(y ∈ K | H_0).

A missed detection (type II error) occurs instead when H_0 is wrongly accepted, when in fact H_A is true. The probability of missed detection P_md is

P_md = P(y ∉ K | H_A).

Model-based GLRT for normally distributed observables
The Kalman filter residuals and the phase measurement residuals follow a normal distribution. Thus, we generalize the hypotheses (19) and (25) by removing all observable-specific annotations:

H_0: ρ ∼ N(0, Ω),    H_A: ρ ∼ N(C∇, Ω).

This allows us to derive a general expression of the test, which is then used for the different observables by inserting the respective matrices. By following the derivation shown in appendix A, we obtain the test

T = ρᵀ Ω⁻¹ C (Cᵀ Ω⁻¹ C)⁻¹ Cᵀ Ω⁻¹ ρ,

with estimated fault vector

∇̂ = (Cᵀ Ω⁻¹ C)⁻¹ Cᵀ Ω⁻¹ ρ.

If q is the number of entries in ∇, the test T is χ²-distributed with q degrees of freedom. The distribution is central under the null hypothesis and non-central under the alternative hypothesis:

T ∼ χ²(q) under H_0,    T ∼ χ²(q, λ) under H_A,

with non-centrality parameter

λ = ∇ᵀ Cᵀ Ω⁻¹ C ∇.

We design two versions of this test by varying the shape of matrix C. In the limit case of q = M we obtain the overall model test, while setting q = 1 yields the w-test. With C = I_M the overall model test reduces to T = ρᵀ Ω⁻¹ ρ. The test is χ²-distributed with M degrees of freedom, and we compute the threshold k_M from the expression

P_fa = P(χ²(M) > k_M).

The probability of missed detection for a given value of the non-centrality parameter λ is

P_md(λ) = P(χ²(M, λ) ⩽ k_M).    (47)

We employ the overall model test for the detection step, in which all the entries of the observation vector are tested as a whole.
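Numerically, the overall model test with C = I_M reduces to a χ² test on the weighted residual norm. A short scipy-based sketch (our own code, not the paper's implementation):

```python
import numpy as np
from scipy.stats import chi2

def overall_model_test(rho, Omega, p_fa):
    """Overall model test on a residual vector rho ~ N(0, Omega) under H0.

    T = rho' Omega^{-1} rho is chi-square with M degrees of freedom;
    the threshold k_M is the (1 - P_fa) quantile.
    """
    M = len(rho)
    T = float(rho @ np.linalg.solve(Omega, rho))
    k_M = chi2.ppf(1.0 - p_fa, df=M)
    return T, k_M, T > k_M
```

With M = 4 measurements and P_fa = 10⁻³ the threshold evaluates to k_M ≈ 18.5, matching the value quoted for the five-clock ensemble in section 4.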
The w-test is derived by setting q = 1. The matrix C becomes a column vector c_i, where the i-th entry is equal to one, and zero elsewhere. By iteratively applying the test to all measurements in the ensemble (varying i from 1 to M), we can sweep all the measurements and locate the one which is most probably faulty. This process is also known as data snooping. The test becomes

T_i = (c_iᵀ Ω⁻¹ ρ)² / (c_iᵀ Ω⁻¹ c_i),

where

∇̂ = (c_iᵀ Ω⁻¹ ρ) / (c_iᵀ Ω⁻¹ c_i)

is the estimated fault magnitude. The test is χ²-distributed with 1 degree of freedom, central under the null hypothesis, non-central under the alternative hypothesis. We compute the threshold k_1 from the expression

P_fa = P(χ²(1) > k_1),

while the non-centrality parameter λ = ∇² c_iᵀ Ω⁻¹ c_i is related to the P_md:

P_md(λ) = P(χ²(1, λ) ⩽ k_1).    (50)

The MDB represents the minimum magnitude of fault that can be detected by the w-test with a given value of P_md:

|∇_MDB| = √( λ(P_fa, P_md) / (c_iᵀ Ω⁻¹ c_i) ),

where λ(P_fa, P_md) is the non-centrality parameter corresponding to the required probabilities.
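The data-snooping sweep and the MDB computation can be sketched as follows. This is our own code: the non-centrality parameter λ(P_fa, P_md) is found by root-finding on the non-central χ² CDF, and the bracketing interval is an illustrative assumption:

```python
import numpy as np
from scipy.stats import chi2, ncx2
from scipy.optimize import brentq

def w_test_sweep(rho, Omega, p_fa):
    """Data snooping: apply the 1-dof w-test to each channel i.

    Returns the per-channel test values, the common threshold k_1,
    and the index of the largest value (candidate faulty channel).
    """
    W = np.linalg.inv(Omega)
    M = len(rho)
    T = np.empty(M)
    for i in range(M):
        c = np.zeros(M); c[i] = 1.0
        T[i] = (c @ W @ rho) ** 2 / (c @ W @ c)
    k1 = chi2.ppf(1.0 - p_fa, df=1)
    return T, k1, int(np.argmax(T))

def mdb(Omega, i, p_fa, p_md):
    """Minimum detectable bias on channel i for given P_fa, P_md."""
    W = np.linalg.inv(Omega)
    c = np.zeros(len(Omega)); c[i] = 1.0
    k1 = chi2.ppf(1.0 - p_fa, df=1)
    # non-centrality lambda giving the requested P_md (bracket is a guess)
    lam = brentq(lambda l: ncx2.cdf(k1, 1, l) - p_md, 1e-6, 1e4)
    return np.sqrt(lam / (c @ W @ c))
```

For P_fa = 10⁻³ the threshold k_1 ≈ 10.8 reproduces the value quoted in section 4.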

Self-consistency test for normally distributed observables
The model-based test requires knowing a priori how the clocks behave in their environment in order to determine the parameters of the clock model. This is problematic if the clocks cannot be directly controlled or accessed, for instance for clocks onboard satellites: in this case the application of a clock model not reproducing the actual nominal conditions of the system can lead to mismodelling, missed detections, and in general poor performance of the integrity monitoring techniques. This motivates the self-consistency test, in which we only set the structure of the ensemble and the distribution function of the measurements, while the model parameters are left free and are estimated from the observations [1]. In the case of homogeneous ensembles, where the clocks are of the same type, the test estimates the expected value and variance of the measurements, while assessing the occurrence of a fault in a measurement. While the model-based tests are computed from the residuals, defined as the difference between the observation and the modelled behavior, the self-consistency test is computed directly from the observations. We develop the self-consistency test for the phase measurements (21), while the derivation for gamma-distributed observables is left for future work.
In a homogeneous ensemble we expect all clock measurements to exhibit the same mean value and variance, thus we rewrite expression (21) as

z_k ∼ N(ζ_k u, v²_k Ψ),

where u is an M-sized vector of ones, ζ_k the magnitude of the expected value, Ψ dictates the shape of the covariance matrix, and v²_k controls the magnitude of the covariance matrix. The test directly estimates the scalars ζ_k and v²_k from the observations z_k. We set the matrix Ψ according to the measurement topology of the ensemble; for the common-reference topology used here, its diagonal entries are 2 and its off-diagonal entries are 1, reflecting the correlation introduced by the shared reference clock. By dropping the subscript for time dependence, the hypotheses read

H_0: z ∼ N(ζu, v²Ψ),    H_A: z ∼ N(ζu + C∇, v²Ψ).

The derivation in appendix B leads to the self-consistency test (55), where q is the number of columns in C; the test statistic is built from idempotent projection matrices whose explicit expressions are given in appendix B. The test (55) is central under the null hypothesis and non-central under the alternative hypothesis. The test threshold k_s can be found by inverting the expression for the P_fa. In the following we apply the self-consistency test in a manner similar to the w-test, thus we set C = c_i, q = 1, and we sweep the different measurements by changing the position i of the non-zero entry in the vector c_i. The MDB for the self-consistency test is derived analogously to the model-based case, where in this case R_∇∇ is a scalar.

w-test for gamma-distributed observables
As previously shown, the normalized DAVAR ξ_{k,τ,W} follows a gamma distribution, for which a dedicated test is developed here. In this work we focus on the scalar version, leading to the w-test. Dropping all subscripts, the hypotheses compare the nominal gamma distribution of ξ with the distribution whose expected value is shifted by the normalized fault magnitude. The derivation in appendix C leads to a test statistic which itself distributes according to a gamma distribution (68). Thus, the w-test on the DAVAR becomes: trigger if

ξ < k_l  or  ξ > k_u,    (69)

where k_l is the lower threshold and k_u is the upper threshold, which can be computed from the expression of the P_fa:

P_fa = P(ξ < k_l | H_0) + P(ξ > k_u | H_0).

The P_md can be found from:

P_md = P(k_l ⩽ ξ ⩽ k_u | H_A).

The test (69) is double-sided. One can argue that a fault in the system can only cause a degradation of the observed stability and thus an increase in the DAVAR, making the lower side of the test pointless. However, this test detects differences between observation and model, therefore we can use the lower side to detect mismodelling. If we select a model predicting a DAVAR higher than the real value, the test will fall below the lower threshold, and the triggering of the test can be marked accordingly.
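Under our reading of the normalized DAVAR distribution (ξ ∼ Γ(p/2, 2/p), i.e. p ξ ∼ χ²_p, with mean 1), the two thresholds can be computed as in the sketch below. Splitting P_fa evenly between the two tails is our assumption, not necessarily the paper's choice:

```python
from scipy.stats import gamma

def davar_thresholds(p, p_fa):
    """Two-sided thresholds for the normalized DAVAR xi.

    Assumes xi ~ Gamma(shape=p/2, scale=2/p) under H0 (mean 1) and
    splits P_fa evenly between the two tails (a design choice).
    """
    k_l = gamma.ppf(p_fa / 2.0, a=p / 2.0, scale=2.0 / p)
    k_u = gamma.ppf(1.0 - p_fa / 2.0, a=p / 2.0, scale=2.0 / p)
    return k_l, k_u

def davar_w_test(xi, p, p_fa):
    """Double-sided w-test on one normalized DAVAR sample."""
    k_l, k_u = davar_thresholds(p, p_fa)
    return xi < k_l or xi > k_u
```

Since the distribution has mean 1, a nominal observation ξ ≈ 1 falls between the thresholds, while both strongly degraded stability (ξ ≫ 1) and mismodelling (ξ ≪ 1) trigger the test.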

Monte Carlo analysis
We run a set of Monte Carlo simulations of an ensemble comprising 5 cesium references to verify the theoretical distributions of the devised tests, and to compare the observed values of P_fa and P_md to the expected ones. This is only possible in simulation, since we need a large number of runs to evaluate the statistical behavior. Two simulations are carried out: one for the nominal case, and one for the faulty case, where a fault is injected in the first measurement. For the cesium standards we select a set of model parameters in which ^iU_M and ^iR_M parametrize the Markov process. As discussed in section 3.3, we also need a model without Markov processes to obtain a prediction model for the DAVAR, for which we choose the parameters reported in (73). In both cases we assume a value for the measurement noise variance of R = I_M × 10⁻²⁵ s². These values are obtained by a laboratory characterization of the cesium frequency references used in this work: long-term measurements of the clocks are taken and their DAVAR computed, which is then fitted to derive the model parameters. We set the probability of false alarm to P_fa = 10⁻³, leading to the threshold values k_M = 18.5 for the overall model test, k_1 = 10.8 for the w-test, and k_s = 998.5 for the self-consistency test. The thresholds k_l and k_u for the DAVAR test depend on the length of the sliding window W and on the sampling interval; a representation is given in the following experimental results. In the different simulations, the P_md can be computed using the threshold value and the magnitude of the injected drift. In the faulty case we expect a time-dependent non-centrality parameter λ_k. The matrix Ω^(kf)_k depends on time, but it converges after some iterations of the Kalman filter, as long as the measurements do not show abrupt changes. For t_k = 100 s we obtain λ_100 = 5.2. From expressions (47) and (50) we expect a P_md for the overall model test of P_md^(om)(λ_100) = 0.94, and for the w-test of P_md^(wt)(λ_100) = 0.84.
Figure 2 shows the distributions of the overall model test and of the w-test in the nominal case H_0^(kf) and faulty case H_A^(kf), as well as the theoretical expected distributions. By counting the number of triggerings in the nominal case based on 10⁵ simulations, we compute P_fa = 9.6 × 10⁻⁴, which is in good agreement with the set value of 10⁻³.
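A miniature version of this false-alarm count can be reproduced with synthetic residuals. This sketch draws ρ ∼ N(0, I) directly instead of simulating the clock ensemble, so Ω = I and the overall model test reduces to a sum of squares:

```python
import numpy as np
from scipy.stats import chi2

# Monte Carlo check of the overall model test size (synthetic residuals)
rng = np.random.default_rng(1)
M, p_fa, runs = 4, 1e-3, 100_000
k_M = chi2.ppf(1.0 - p_fa, df=M)
rho = rng.standard_normal((runs, M))   # H0: rho ~ N(0, I)
T = np.sum(rho**2, axis=1)             # test statistic with Omega = I
p_fa_hat = np.mean(T > k_M)            # empirical false-alarm rate
```

The empirical rate fluctuates around the nominal 10⁻³ with a binomial standard deviation of about 10⁻⁴ for 10⁵ runs, consistent with the agreement reported above.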

Phase measurements test.
In this case we inject a fault with linearly increasing magnitude, with slope d∇/dt = 2.5 × 10⁻¹² s/s, in the first entry of the clock measurements vector z. While the distribution of the test under the null hypothesis is constant in time, in the faulty case the distribution is time dependent, since the injected fault is growing. Thus we verify the distribution for t_k = 500 s, leading to λ_500 = 57.2 for the overall model test and w-test, and λ^(s)_500 = 53.6 for the self-consistency test. Figure 3 shows the theoretical and observed distributions of the three tests in the nominal and faulty cases for t_k = 500 s, while in figure 4 the magnitude of the injected fault is compared to the MDB for the w-test and self-consistency test. Since the entries of the matrix Ω^(z)_k used to compute the non-centrality parameter are increasing, the MDB grows in time. A slowly increasing drift can be detected only when its magnitude becomes larger than the MDB. In figure 5 we plot a comparison of the expected and observed probabilities of missed detection for the overall model test, w-test and self-consistency test as a function of time. The theoretical values are computed with (47) for the overall model test, (50) for the w-test and (71) for the self-consistency test. At the beginning, the injected fault is smaller than the MDB and remains undetected. The detection occurs later, when the fault magnitude exceeds the MDB, which is reflected in the decrease in P_md.

DAVAR test.
To verify the distributions of the tests on the DAVAR, we simulated an ensemble in the nominal case using the clock parameters (73), while in the faulty case we increased the first parameter of the second clock by a factor 3, so that ²σ₁² = 1.35 × 10⁻²² s. In this way the first entry in the measurement vector shows a fault, whose magnitude ∇(τ) is computed by subtracting the theoretical DAVAR values of the first and second clocks. We employ a sliding window of 1000 s, and we compute the test for different values of the sampling interval τ. The upper plots in figure 6 show the distribution of the normalized DAVAR (66) for increasing sampling interval τ, while the lower plots show the distribution of the corresponding w-test developed using the gamma distribution (68). Figure 7 shows the observed and expected P_md for the DAVAR w-test at increasing values of τ, while figure 8 compares the magnitude of the injected fault ∇(τ) to the value of the MDB (∇_MDB). The error can be captured at shorter sampling intervals, while in the long term the statistical uncertainty in the determination of the DAVAR increases, thus weakening the power of detection.

Hardware setup and measurements
In our laboratory we run an ensemble of five cesium frequency references (Symmetricom 5071A), structured as in figure 1. With a SpectraDynamics high-resolution offset generator we inject a fault of the desired magnitude into the signal of the second cesium clock. Finally, a computer collects the measurements generated by a K+K FXE80 counter, and runs the ensembling and FDI algorithms in Matlab. We run three scenarios with different faults: in the first we inject a series of phase steps; the second scenario involves a phase and frequency drift; finally, a periodic phase component is injected in the third case. Here we set the integrity requirements to P_fa = 10⁻³ and P_md = 10⁻⁶. Figure 9 shows the phase measurements in the first scenario along with the injected phase steps. The phase pattern includes steps of different size and sign. For this scenario we only compute the tests on the Kalman filter residuals and on the phase measurements, but not on the DAVAR.

Tests on the Kalman filter residuals.
Applying the overall model test to the Kalman filter residuals generates the results shown in figure 10. All the injected steps cause a spike in the test value, whose level depends on the size of the respective step. Only the smallest steps around t_k = 270 s do not trigger the test, but this is expected, as their size is comparable with the MDB |∇_MDB⁽ᵏᶠ⁾| = 52.1 ps. To locate the faulty measurement once the overall model test is triggered, we perform the w-test shown in figure 10(b). Although the test on the first channel (using c₁) shows the highest values, the tests on the other channels trigger as well. This is due to the correlation between the differential measurements sharing a common clock, also encoded in the non-diagonal structure of the matrix Ω_k. Thus, we cannot yet unambiguously identify the faulty measurement. To perform identification we apply the following strategy:
(a) find which vector c_j generates the highest test value;
(b) create a reduced observables vector ρ_R by removing the entry j corresponding to c_j;
(c) perform the overall model test on the reduced observables ρ_R:
  1. if the test is triggered, ρ_R still contains the faulty measurement: restart from step (a) to remove one more entry from the reduced vector ρ_R;
  2. if the test is not triggered, the faulty measurement was correctly removed: measurement j is considered faulty;
(d) repeat the process until a faulty measurement is identified.
If all entries are removed from the observables vector ρ, the algorithm cannot identify a faulty measurement, and the identification process fails.
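The exclusion strategy above can be sketched as follows (a simplified illustration with hypothetical helper names, not the authors' Matlab implementation; the w-test is written in its standard normalized-residual form):

```python
# Sketch of greedy fault identification by exclusion: repeatedly drop the
# entry with the largest w-test value until the overall model test on the
# reduced residual vector passes, or nothing is left.
import numpy as np
from scipy.stats import chi2

def overall_model_test(r, cov, P_fa):
    """True if the chi-square overall model test rejects H0."""
    t = r @ np.linalg.solve(cov, r)
    return t > chi2.ppf(1.0 - P_fa, len(r))

def identify_faulty(r, cov, P_fa):
    """Return the index of the most probably faulty entry, or None."""
    idx = list(range(len(r)))
    while True:
        cci = np.linalg.inv(cov[np.ix_(idx, idx)])
        w = np.abs(cci @ r[idx]) / np.sqrt(np.diag(cci))  # w-test per entry
        j = idx[int(np.argmax(w))]
        idx.remove(j)                  # drop the most suspicious entry
        if not idx:
            return None                # everything removed: identification fails
        if not overall_model_test(r[idx], cov[np.ix_(idx, idx)], P_fa):
            return j                   # reduced test passes: j was the fault
```

With a correlated covariance (mimicking the common-clock structure of Ω_k) and a large residual in one entry, the function isolates that entry even though the raw w-tests trigger on several channels.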
The result of this process is the measurement which is most probably faulty, shown in figure 10(c). Here, we use the symbol ∅ when the fault identification process is unsuccessful. As expected, the first measurement is identified as faulty in all the detected steps. Figure 10(d) shows the overall model test computed on the reduced vector of observables. The reduced test is never triggered, meaning that the algorithm successfully identifies the faulty measurement. Removing entries from the vector of observables requires recomputing the test threshold with a smaller number of degrees of freedom, which causes the dips visible in the threshold value.

Tests on the phase measurements.
Although this scenario is not designed to assess the tests on the phase measurements, we show in figure 11 the performance of the overall model test. As expected, only the largest steps trigger the test, since the magnitude of the covariance matrix (and therefore the magnitude of the MDB) increases over time.

Scenario II: injection of a drift
In the second experiment we inject a phase and frequency drift, shown in figure 12. It consists of three parts: until t_k = 10⁵ s no drift is added; the frequency drift then increases linearly for 10⁵ s < t_k < 2 × 10⁵ s, reaching a maximum value of 1 mHz, which is kept constant until the end of the experiment. Correspondingly, the phase drift first shows a quadratic increase and finally a linear growth. Figure 13 shows the OAVAR computed over the entire measurement length and the OAVAR of the clock models: the dotted points describe the model without the Markov process, while the crossed points correspond to the model with the Markov process. Figure 14 shows the results of the tests applied to the phase measurement residuals. The injected drift is successfully detected about 4320 s after its onset at t_k = 10⁵ s. The overall model test in figure 14(a) shows a small increase at the beginning of the experiment. This is due to the nature of this test, which is more sensitive to small phase deviations during the initialization of the detection process. The w-test (figure 14(b)) correctly locates the drift on the first measurement, as seen in figure 14(c). The inset plots show that the w-tests computed using c₂, c₃, and c₄ grow more slowly than the one using c₁, and are triggered on average 5000 s later. Figure 14(d) shows the overall model test on the reduced vector of phase residuals. The self-consistency test yields the values in figure 14(e). This test identifies the drift with a delay of 21 945 s using c₁. Although the detection with this test occurs with a significant delay, one should consider that the injected drift is relatively small (100 ps/s), so a longer time is needed for the fault to exceed the detection threshold.

Test on the DAVAR.
We expect the test on the DAVAR to detect the fault at long sampling intervals, since the drift increases the measured DAVAR in the region above τ = 10² s (see figure 13). However, the successful detection of the drift depends on the length of the sliding window and on the current time step. The drift can be seen in the DAVAR only if the sliding window is long enough, in this case at least W > 10³ s. Figure 15 shows the DAVAR and the w-test (67) for three windows of length W = 10⁵ s starting at different time instants: before the drift onset, during the quadratic phase increase, and during the linear phase drift. The shaded area in the lower test plot represents the region where the null hypothesis is accepted, delimited by the lower threshold k_l and the upper threshold k_u. The DAVAR computed on the window during the quadratic drift clearly shows that the fault can be detected in the region above τ = 200 s. However, the fault remains undetected in the last data segment, since the DAVAR is invariant to linear phase drifts. In the lower plot of figure 15 we note that the test is also triggered for short sampling intervals. This triggering is due to mismodelling: in figure 13 a mismatch can be noted between the measured DAVAR and the values predicted by both clock models. In summary, we expect the DAVAR test to trigger due to the drift for τ > 200 s, W > 10³ s, and 10⁵ s < t_k < 2 × 10⁵ s. The rejections due to mismodelling should occur for all window lengths, at all times, and for τ < 10 s.
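The sliding-window computation underlying the DAVAR can be sketched as follows (our own simplified code, not the authors' implementation; the overlapping estimator, window length, and stride are illustrative assumptions):

```python
# Sketch: overlapping Allan variance on sliding windows of phase data.
import numpy as np

def oavar(x, m, tau0=1.0):
    """Overlapping Allan variance of phase data x at averaging factor m."""
    d = x[2*m:] - 2.0 * x[m:-m] + x[:-2*m]    # second differences at lag m
    return np.mean(d**2) / (2.0 * (m * tau0)**2)

def davar(x, W, m, stride, tau0=1.0):
    """DAVAR sketch: Allan variance evaluated on windows of W samples."""
    return np.array([oavar(x[k:k + W], m, tau0)
                     for k in range(0, len(x) - W + 1, stride)])

# White frequency noise: the phase is the integral of white frequency
# deviations, for which AVAR(tau) scales as 1/tau.
rng = np.random.default_rng(1)
x = np.cumsum(rng.normal(0.0, 1.0, 200_000))
```

For this noise type oavar(x, 1) / oavar(x, 10) is close to 10, reflecting the 1/τ law; a drift contained in one window inflates that window's DAVAR at long τ, which is exactly what the test detects.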
The value of the DAVAR test is a function of the time step t_k, the sampling interval τ, the length of the sliding window W, and the measurement channel. To show these dependencies, we introduce the compact visualization shown in figure 16. Here we plot the results of the w-test on the DAVAR for the first channel, for one value of the sliding window length W. This plot is obtained by reworking the three-dimensional surface of the DAVAR as a function of time and sampling interval, corresponding to the waterfall plots in [13]. Since we are mainly interested in knowing whether the test value lies within the thresholds (accept H₀⁽Γ⁾) or outside (reject H₀⁽Γ⁾), we can forgo plotting the test value on the z-axis, and instead use a color code to show test acceptance or rejection. Thus, we reduce the 3D surface to a plot in the (t_k, τ) plane: the black pixels show the pairs (t_k, τ) where the test is triggered, while we leave a white pixel when accepting H₀⁽Γ⁾. More information can be added to the visualization: for example, by cutting the three-dimensional DAVAR curve at a given time and drawing a projection onto the (T, τ) plane, we can visualize the test value as a function of the sampling interval, along with the test thresholds (upper plot). Conversely, if we cut the surface at a given sampling interval, we project onto the (T, t_k) plane, on the right side, the evolution of the test value as a function of time (the shaded area represents the injected drift). The hatched area covers points where the test value is not available. The first value is available at t_k = W, since we must wait to collect enough points for computing the DAVAR on the window [0, W]. In this example we see that the test detects the mismodelling at short values of τ, and the drift at longer sampling intervals. The detection delay depends on the window length and on the sampling interval.
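The reduction from the 3D test surface to the black-and-white map can be sketched as follows (entirely synthetic stand-in data; the thresholds, array shapes, and "fault" region are illustrative, not the paper's values):

```python
# Sketch: collapse a (t_k, tau) surface of test values to a boolean
# detection map, True where the test falls outside [k_l, k_u].
import numpy as np

t_k  = np.arange(0.0, 3e5, 1e3)        # time steps (synthetic grid)
taus = np.logspace(0, 3, 40)           # sampling intervals (synthetic grid)
rng  = np.random.default_rng(0)
test = rng.normal(1.0, 0.05, (t_k.size, taus.size))  # stand-in test values
test[100:200, 30:] += 1.0              # synthetic "fault" at long tau
k_l, k_u = 0.8, 1.2                    # illustrative thresholds

detect = (test < k_l) | (test > k_u)   # True -> reject H0 (black pixel)
# detect[i, j] is the pixel at (t_k[i], taus[j]); rendering it with e.g.
# matplotlib's imshow yields a black/white map in the style of figure 16.
```

The same boolean matrix can be sliced along either axis to obtain the two projections described above (test versus τ at fixed time, or test versus time at fixed τ).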
In figure 17(a) the w-test on the first measurement computed for W = 10⁴ s is plotted as a function of time: the different lines show the test computed for different sampling intervals. Higher values of τ result in higher test values, and the detection tends to occur earlier. This is also visualized in figure 17(b), where we plot the detection delay as a function of the sampling interval for the first measurement. The minima occur, for both window lengths, at τ = 2 × 10³ s, with detection delays of about 3000 s. Figure 18 shows the result of the w-test on the DAVAR (68) based on the gamma distribution, for the four measurement channels z_i and for different lengths of the sliding window. We note how shorter sliding windows provide results at earlier time instants, but only for shorter sampling intervals. Conversely, longer windows yield values of the DAVAR at longer sampling intervals, but only at later times. As expected, the test detects the drift only on the first measurement z₁, when using sufficiently long sliding windows (area 'D'). The drift is detected only during its quadratic part, while the mismodelling is detected at short τ at any time, for any window length, and on all the measurements (area 'M').

Scenario III: injection of a periodic component
In the third scenario we inject an oscillating phase component with a period of 90 minutes and an increasing amplitude. The amplitude increases similarly to the second scenario: for t_k < 10⁵ s the amplitude is zero; for 10⁵ s < t_k < 2 × 10⁵ s it increases linearly to reach a final amplitude of 1.38 ns (corresponding to 5° at a nominal frequency of 10 MHz). Figure 19 shows the injected fault. Figure 20 collects the results of the tests on the phase measurements. These tests do not detect any fault, since the injected phase oscillation is relatively small compared to the process and measurement noises, and it lies within the bounds of the expected statistical variance. The self-consistency test on the phase measurements in figure 20(b) shows some triggerings, which are not related to the fault injection: the number of activations falls within the expected probability of false alarm. In figure 21 we plot the DAVAR computed on data segments starting at three different time instants. The injection of the oscillation produces two 'bumps' in the DAVAR for sampling intervals around the period length, seen in both the second and third parts of the experiment. Figure 22 shows the w-test based on the gamma distribution applied to the four measurements, for increasing lengths of the sliding window. The test successfully detects the oscillation (area 'D'), but only for the longer windows, since for W = 10³ s we cannot use sampling intervals long enough to detect the bumps. Again, we can see the detection of the mismodelling at short values of τ (area 'M').

Conclusions
In this paper we devised and tested FDI in clock ensembles based on the GLRT. The detectors observe three quantities: the Kalman filter residuals, the phase measurements, and the DAVAR of the phase measurements. We designed a model-based test, which performs fault detection with the overall model test and identification with the w-test. Furthermore, we introduced a self-consistency test to identify the faulty unit by observing the phase measurements without assuming a clock model a priori.
The framework of the GLRT offers a structured way to develop new detectors and adapt them to the observation of further quantities, provided that they exhibit similar statistics. Since in the GLRT we define not only a nominal hypothesis but also a model for the faulty case, we can evaluate the probability of missed detection as a function of the size of the fault. This is an advantage of these tests compared to other methods where only the nominal behavior is defined. Thanks to this, we can estimate the performance of a detector beforehand, given the integrity requirements of the application.
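This a-priori performance evaluation can be sketched numerically (our own illustration with example degrees of freedom, not the paper's numbers): for a chi-square GLRT, one solves for the non-centrality parameter at which the test meets a given (P_fa, P_md) pair; the MDB then follows by mapping that parameter back through the fault signature and the residual covariance.

```python
# Sketch: the non-centrality parameter lambda that a chi-square overall
# model test with df degrees of freedom needs to satisfy given integrity
# requirements (P_fa, P_md).
from scipy.optimize import brentq
from scipy.stats import chi2, ncx2

def required_noncentrality(df, P_fa, P_md):
    """Smallest lambda with P(miss) <= P_md at the P_fa threshold."""
    k = chi2.ppf(1.0 - P_fa, df)
    # P_md(lambda) is monotonically decreasing, so a root bracket suffices.
    return brentq(lambda lam: ncx2.cdf(k, df, lam) - P_md, 1e-6, 1e4)

lam = required_noncentrality(df=4, P_fa=1e-3, P_md=1e-6)
```

Looser integrity requirements (a larger tolerated P_md) translate into a smaller required non-centrality, and hence a smaller minimum detectable bias.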
We performed three experiments to test the capability of the detectors in different scenarios. Considering the results in section 5.1, the test on the Kalman filter residuals is the first choice for detecting faults such as phase steps, since this test detects the occurrence of abrupt faults without delay. Phase jumps can also be detected using the tests on the phase measurements; however, the main disadvantage of this detector lies in the fact that the MDB increases in time with a square-root law. Thus, the detection performance decreases over time, until the phase measurements are zeroed and the process is restarted. Nevertheless, as shown in section 5.2, the test on the phase measurements successfully detects and identifies the drift affecting the clock. Naturally, the detection occurs with a certain delay after the fault onset: further analyses are required to understand how the delay changes as a function of the drift magnitude.
In both the second and third scenarios we evaluate the test on the DAVAR, which successfully detects the injected drift and oscillation. However, the χ²-distribution of the DAVAR complicates the derivation of the GLRT. In this paper we derived a scalar version in section 4.3: although this test performs well in detecting drifts, it is applied to the single measurements and thus does not take advantage of the structure of the clock ensemble. In future work we want to extend the statistical description of the DAVAR of the clock ensemble using a multivariate gamma distribution, and thus develop an exact overall model test on the DAVAR. The approximation with a Gaussian distribution allows us to develop an overall model test and a self-consistency test on the DAVAR; although this approximation works well for short sampling intervals, it degrades the test results in the long term, which is exactly the region of interest when detecting drifts. Thus, longer sliding windows must be employed, which in turn delays the detection.
The detection of the mismodelling in the DAVAR test at short sampling intervals leads us to an interesting discussion point. On the one hand, the model-based tests clearly require a precise clock model to provide efficient fault detection without triggering too many false alarms. On the other hand, we can consider this an opportunity to warn the user that a wrong clock model may be in use. Our examples showed exactly this case: since the DAVAR tests trigger at short sampling intervals at all times, on all the channels, and in different scenarios, we conclude that our clock model is imprecise in that region, which is confirmed by observing the difference between the modelled and observed DAVAR in figure 13. Clearly, the detector cannot autonomously discern between a faulty behavior and a wrong model, but an experienced user should be able to determine which one is the case.
Help in this direction comes from the self-consistency test. Since this test compares the measurements against each other, it is not influenced by mismodelling. Therefore, by comparing the results of the w-test and the self-consistency test, we are able to determine which triggerings are due to faulty clock behavior and which are due to mismodelling or other unmodelled effects. These effects include, for instance, daily temperature fluctuations, orbital effects, and in general everything that affects all the measurements equally and is not predicted by the clock model. However, the self-consistency test works only on homogeneous clock ensembles, containing clocks of the same type. Furthermore, the estimation of the expected value and of the magnitude of the covariance matrix requires at least three measurements. Finally, the derivation of the self-consistency test for χ²-distributed observables is not straightforward and requires further analyses.
The different tests developed in this work complement each other, and could be bundled together in an operational scenario to form an ensemble of powerful fault detectors.

Appendix B. Derivation of the self-consistency test for normally distributed observables
The hypotheses (54) can be joined, and in the resulting PDF the cross-term (ê₀ − ê_A)ᵀ Ψ⁻¹ ê_A is zero. In fact, this follows from the symmetry and idempotence of P_u^⊥ and P_C^⊥ P_u^⊥, and from the fact that the product P_u^⊥ P_u is zero. Since the two quadratic forms are thus independent, we can express the distribution of the ratio (B.41) with an F-distribution, which corresponds to the ratio of two independent χ²-distributed variables:

F(n₁, n₂, λ) = [n₁⁻¹ · χ²(n₁, λ)] / [n₂⁻¹ · χ²(n₂, 0)].