
Sensitivity of principal components to system changes in the presence of non-stationarity


Published 27 October 2023 © 2023 The Author(s). Published on behalf of SISSA Medialab srl by IOP Publishing Ltd
Citation: Henrik M Bette et al J. Stat. Mech. (2023) 103402. DOI: 10.1088/1742-5468/ad0033


Abstract

Non-stationarity affects the sensitivity of change detection in correlated systems described by sets of measurable variables. We study this by projecting onto different principal components. Non-stationarity is modeled as multiple normal states that exist in the system even before a change occurs. The studied changes occur in mean values, standard deviations or correlations of the variables. Monte Carlo simulations are performed to test the sensitivity for change detection with and without knowledge about non-stationarity for different system dimensions and numbers of normal states. A comparison clearly shows that knowledge about the non-stationarity of the system greatly improves change detection sensitivity for all principal components. This improvement is largest for those components that already provide the greatest possibility for change detection in the stationary case. We illustrate our results with an example using real traffic flow data, in which we detect a weekend and a bank holiday start as anomalies.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

The detection of changes, novelty, failures or faults is a crucial task in numerous real-world systems, such as industrial monitoring, energy generation, IT and road traffic, sensor networks, image processing and many more [1–4]. Early or even preemptive detection can help operations, avoid downtimes and reduce costs in general. Techniques for such analysis can be roughly categorized into three groups: model-based methods, knowledge-based methods and data-driven methods [5]. While a lot of research is being done in all directions, the first two groups require extensive knowledge about the monitored system and are therefore challenging and time-consuming. This has made data-driven methods especially interesting for researchers in all fields [6].

Principal component analysis (PCA) [1, 3, 7] is one of the data-driven methods. It is commonly used either as a direct detection method or in dimensionality reduction pre-processing. Here, data are projected into the space of the principal components, which are the eigenvectors of the covariance or correlation matrix for a system with multiple measured variables. Whether covariances or correlations are used depends on the system studied. The eigenvectors associated with the large eigenvalues are referred to as the major components. Without a change present, these carry most of the information on the behavior of the system. A projection onto the subspace of the major components includes the largest part of the variance of a data set. The other eigenvectors, associated with small eigenvalues, are referred to as minor components. For change detection one usually calibrates a normal parameter set of these projections. New data are then projected and compared with the normal space using, for example, Hotelling $T^2$ statistics or the squared prediction error together with a threshold [8, 9]. PCA has been applied to change detection in many different systems, including wind turbines [10], wastewater treatment plants [5], healthcare institutions [11] and traffic [12]. An important question is which of the principal components to look at. For the use case of dimensionality reduction one generally employs a projection onto the major components [7]. However, the residual subspace, i.e. the minor components, is often most useful for outlier and change detection [1, 13, 14]. Shyu et al [15] proposed a method combining the components relating to the largest and the smallest eigenvalues for novelty detection. Recently, Tveten [16] studied the sensitivity of different principal components in a more structured and complete way, confirming the high sensitivity of minor components to changes.

Many variants of PCA exist. They were developed to tackle different specific problems. Some better known ones include, but are not limited to, static and dynamic robust PCA [17, 18], kernel PCA [19], dynamic kernel PCA [20], moving window PCA [21], recursive PCA [22] and incremental PCA [6]. The latter three tackle the important problem of non-stationarity. Many studied systems are subject to external influences or internal processes that cause varying conditions. These lead to changes in the data which do not represent novelty or failure. We refer the reader to the review by Ketelaere et al [23] for an introduction to PCA on time-dependent data. A performance comparison is found in [24]. The problem of non-stationarity is also of ongoing interest for other novelty detection methods [25–30].

Non-stationarity exists in different forms. First, a system can slowly evolve due to influences such as climate [27] or the economic situation [31]. On the other hand, external conditions can vary on shorter time scales, causing non-stationarity without an evolutionary trend [29, 32, 33]. Some systems also exhibit periodic non-stationarity [28]. A second distinction can be made in the type of non-stationarity: changes can occur in the absolute values of measured variables, in their individual distributions or in the relations between different variables. In simple terms, these three categories represent novelty as changes in mean values, standard deviations or correlation structure, respectively. Of course, this does not capture all possibilities. For example, distribution changes can also affect higher moments only. Furthermore, relations can be non-linear. However, this simplification already provided valuable insights into the sensitivity of principal components in stationary systems [16]. We study the importance of non-stationarity accordingly.

Of those three categories, non-stationarity in the correlation structure is especially important for PCA (or any other method depending directly on the covariance or correlation matrix). The purpose of PCA is to find new coordinates that already incorporate the linear correlations in the system. If they change, i.e. the eigenvectors of the correlation matrix change their structure, the projections into PCA space are substantially different. Such non-stationarity in the correlation matrix has been thoroughly investigated, for example, in financial markets [31, 34], road traffic [32] and wind turbine data [33]. We assume that such a distinction into well-separated individual states is most likely to happen for a difference in correlation structure. At a certain point, internal or external causes lead to a behavioral change. Changes in mean or standard deviation of variables might also exhibit such distinct states, but are more likely than a correlation structure to change continually. Therefore, our analysis will focus on correlation changes, but we will nevertheless study all types of change.

Our aim with this study is a structured and comprehensive analysis of the sensitivity of principal components in the presence of non-stationarity, while keeping it simple enough that the mechanisms can be easily understood. Non-stationarity is modeled as multiple possible normal states of the system. We study all three types of possible changes, but do not mix them. This means that if the change occurring at a certain point in time is of a certain type (mean value, standard deviation, correlation structure), the normal states also differ in that same type. We base the study framework on the one used by Tveten [16] and extend it to incorporate non-stationarity. This facilitates comparison with the stationary case. We analyze the sensitivity of the principal components under the assumption that we know about the non-stationarity and compare it with the results obtained if we did not know about the non-stationary behavior. Thereby, we can study how important knowledge about such state-wise non-stationarity is for novelty detection with PCA.

The main focus of our study is a general analysis of the influence of non-stationarity on change detection with PCA, as well as the presentation of a method to account for this non-stationarity. Additionally, we illustrate our simulation results with an example using real traffic-flow data. Thereby, we show how the proposed methodology can be applied to measured data. In traffic, the non-stationarity is common knowledge: traffic volumes are not at all constant during the course of a day. Furthermore, traffic is very different on workdays compared with weekends or bank holidays. These well-known facts make traffic a good example for understanding the application of the method. We first classify the non-stationarity during single workdays. Then, we define weekends and the onset of a bank holiday as a change and try to detect these with and without taking the classified non-stationarity into account.

We describe the idea of the experiment with a metaphor in section 2. In section 3 we introduce the problem and notations and set up the simulation framework in a general form. Section 4 contains the results of our simulation studies divided into subsections for the different change types. In these we also present the detailed simulation setup for each case. We apply the method to real data by working out the traffic example in section 5. A summary of results and findings is found in section 6.

2. Analysis concept

To facilitate easy understanding of the concept of our analysis before going into detail, we outline the simulation experiment with a simple example. Imagine a lamp with a light bulb that is burning with a certain color. Let us say it emits blue light. At a certain point in time $t = \tau$ the light bulb changes its color slightly and we study from which viewing angle we can best detect this change. As a measure we take a color chart and estimate the difference h in points between the color before and after the change. Then we repeat this experiment time and again with different bulbs, different colors and different color changes while keeping track of the change detectability from each viewing angle. This is an easily imagined example for the stationary simulation experiment.

Now, let us assume that before $t = \tau$ the light bulb burns with S different colors. At random intervals it changes its color, but all colors are equally likely over time. This represents non-stationarity of the type we are going to analyze. At time $t = \tau$ a change occurs once again. However, the light bulb can be in any state $s \in \{1,\ldots,S\}$, i.e. burn with one of the possible colors. If we know of the non-stationarity and have a criterion for determining which color the lamp should have, we detect the change by comparing with the realized state $s^{*}$, i.e. its color. Once again we use our color scale and determine a point score h1 from different viewing angles. However, we want to analyze how much improvement in the detectability of that change is caused by our knowledge about this non-stationarity. Therefore, we also calculate a detectability without it: we compare the color after the change occurs with the average color before the change. So, if our lamp would normally burn with either blue or red, we compare with a violet light. Again we take the color chart and note the difference h2 in points between the changed color and the average color before the change from each angle.

To measure the increase in detectability with knowledge of the non-stationarity we have to compare the two measurements h1 and h2. The first one is easy enough, because it should produce the same results as the stationary case if our knowledge about the original state, i.e. color, of the light bulb is perfect. When measuring without knowledge about the non-stationarity we have to take into account that even without a change we will detect deviations from our assumed normal (the average state): at any given time the light will be red or blue, but we will compare it with violet and observe a difference $h_{s,\mathrm{norm}}$ between state s and the average state. This means that changes at $t = \tau$ are detectable only if they differ more strongly from the violet than our normal states (blue and red) do. We have to correct our measurement of difference by subtracting the maximum difference we measure without a change present: $h_2^{\mathrm{corr}} = h_2 - \max\limits_{s}( h_{s,\mathrm{norm}})$. As in the stationary case, we perform multiple Monte Carlo simulations. The importance of knowledge about non-stationarity for change detection is then measured as $h_1 - h_2^{\mathrm{corr}}$.

3. Theory and setup

In our analysis, the light bulb from the example above is a D-dimensional system with n observations $\boldsymbol{x}_t \in \mathbb{R}^D$ at times $t = 1,\dots,n$. We do not look at the time series themselves, but rather describe the system with mean values, standard deviations and Pearson correlations. This description is complete only for Gaussian systems. We assume this to be the case for our study. At the time $t = \tau$ a change occurs in the system that can influence either the mean, the standard deviation or the correlation structure of the observed variables. This is analogous to the change in color of the light bulb. For the stationary case the sensitivity to change in principal components has already been analyzed by Tveten [16]. We extend it by the inclusion of non-stationarity. We assume that the system has S possible states and at every time it is in one $s \in \{1, \dots, S\}$ of them. For each of these states s we have a vector of means $\boldsymbol{\mu}_0^{(s)}$, a vector of standard deviations $\boldsymbol{\sigma}_0^{(s)}$ and a correlation matrix $C_0^{(s)}$. The index 0 indicates that these values refer to the system before the change at $t = \tau$ occurs. In our example with the light bulb these three values together describe the color of the state. For further analysis, we standardize each state separately to a mean of 0 and a standard deviation of 1 and denote standardization with a hat. This results in $\boldsymbol{\hat{\mu}}_0^{(s)} = \boldsymbol{0}$ and $\boldsymbol{\hat{\sigma}}_0^{(s)} = \boldsymbol{1}$. Without knowledge about the S different states, two ways of measurement present themselves. One either measures just one set of parameters for the time span $t = 1,\dots,\tau$, as was assumed in [16], or one measures the parameters in epochs of length $\tau_{\mathrm{ep}} < \tau$ and takes the averages. The second option offers the advantage that new data, in which one wants to test for change, are most likely also only available for a time span $\tau_{\mathrm{new}} < \tau$. The comparability of new and old data is therefore increased if the epoch length and time span of the new data coincide, i.e. $\tau_{\mathrm{ep}}\approx\tau_{\mathrm{new}}$. Correlations, for example, might be largely positive on a large time scale due to a global trend in values, but show more structure when measured on shorter time spans. Therefore we define the average state (analogous to the violet light of the light bulb example) as averages over epochs. Without loss of generality, we assume that each state s occurs for the same amount of time before the change and define the parameters of the average state as element-wise averages

Equation (1):
$$\overline{\boldsymbol{\mu}}_0 = \frac{1}{S}\sum_{s = 1}^{S}\boldsymbol{\mu}_0^{(s)},\qquad \overline{\boldsymbol{\sigma}}_0^{2} = \frac{1}{S}\sum_{s = 1}^{S}\big(\boldsymbol{\sigma}_0^{(s)}\big)^{2},\qquad \overline{C}_0 = \frac{1}{S}\sum_{s = 1}^{S}C_0^{(s)}.$$

We average the variance instead of the standard deviation. The average over the correlation matrix is also meant element-wise and follows the centroid calculation used for clustering correlation matrices of real data in, for example, [31–34]. Standardization with the mean value and standard deviation of this average state will be denoted with a tilde: $\boldsymbol{\tilde{\overline{\mu}}}_0 = \boldsymbol{0}$ and $\boldsymbol{\tilde{\overline{\sigma}}}_0 = \boldsymbol{1}$.
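
As a minimal illustration, the element-wise averaging of equation (1) could be implemented as follows; this is a sketch only, and the function name and array layout are our own choices rather than part of the original analysis:

```python
import numpy as np

def average_state(mus, sigmas, cors):
    """Element-wise average state, cf. equation (1).

    mus    : array (S, D) of per-state mean vectors
    sigmas : array (S, D) of per-state standard deviations
    cors   : array (S, D, D) of per-state correlation matrices
    """
    mu_avg = mus.mean(axis=0)
    # variances are averaged, not the standard deviations themselves
    sigma_avg = np.sqrt((sigmas ** 2).mean(axis=0))
    C_avg = cors.mean(axis=0)
    return mu_avg, sigma_avg, C_avg
```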

When the change happens at time $t = \tau$ it will affect the values we currently measure. In the light bulb example this would be the slight change of color. We denote these new values with an index 1 as $\boldsymbol{\mu}_1$, $\boldsymbol{\sigma}_1$ and C1. Next, we need to decide how the data after the change should be standardized. In reality, the question is how the new (or possibly live) data of a system should be standardized before comparing them with the usual behavior. This is independent of whether or not non-stationarity has been accounted for. One can either standardize with the known pre-change values of the usual behavior or with the newly measured ones. If one measures only a single new data point, only the first option is feasible. Without loss of generality, we standardize with pre-change values. However, the correlation matrix after the change will then not be a well-defined correlation matrix. In particular, the entries on its diagonal will not be one. This does not present a problem for our analysis.

When non-stationarity is included in the form of multiple pre-change states, we need to know which normal state the system would have been in after the change at $t = \tau$ to make the right comparison. We will denote this state by $s^{*}$. In reality this means that, after having found multiple normal states, we need one criterion (or possibly several criteria) marking the states. For the current theoretical analysis we assume that we simply know the state $s^{*}$ in the epoch after the change. We can then standardize with the pre-change state parameters to get $\hat{\boldsymbol{\mu}}_1$, $\hat{\boldsymbol{\sigma}}_1$ and $\hat{C}_1$. The hat denotes standardization with state parameters. To compare the sensitivity of principal components with knowledge of non-stationarity to the case in which we do not know about the different states, we will also calculate $\tilde{\boldsymbol{\mu}}_1$, $\tilde{\boldsymbol{\sigma}}_1$ and $\tilde{C}_1$, where a tilde denotes standardization with the average state parameters.

To analyze the sensitivity of principal components, we study the projections of the system onto the eigenvectors before and after the change. This is analogous to looking at the light bulb from different angles. For a state s we calculate the eigenvalues $\lambda_j^{(s)}$ and normalized eigenvectors $\boldsymbol{v}_j^{(s)}$, $j = 1, \dots,D$ of the correlation matrix $C_0^{(s)}$. Ordering is assumed from largest to smallest eigenvalue, i.e. $\lambda_1^{(s)} \geqslant \dots \geqslant \lambda_D^{(s)}$. Note that the eigenvectors are equivalent to the principal components. The projection of a data point $\boldsymbol{x}_t$ onto the jth component is then calculated as $y_{j,t} = \boldsymbol{v}_j^{(s)T} \boldsymbol{x}_t$. Furthermore, we want to compare projections onto these components for epochs rather than a single data point. If we know the vector of means $\boldsymbol{\mu}$ and the correlation matrix C of said epoch, we calculate the mean $\mu_j^{{\prime}}$ and standard deviation $\sigma_j^{{\prime}}$ of the projection onto the jth eigencomponent by

Equation (2):
$$\mu_j^{\prime} = \boldsymbol{v}_j^{(s)T}\,\boldsymbol{\mu},\qquad \sigma_j^{\prime} = \sqrt{\boldsymbol{v}_j^{(s)T}\, C\, \boldsymbol{v}_j^{(s)}}.$$

We have dropped the state identification (s) on the left side. This is possible because projections are only ever needed into the average system and into the system of $s^{*}$, so that no confusion with other states is possible. This allows us to keep the superscript as an identifier of whether the quantity being projected is the mean vector or the correlation matrix of a state. For example, the mean of the projection onto component j of the original state mean $\boldsymbol{\mu}_0^{(s)}$ would read $\mu_{j,0}^{{\prime}(s)}$, whereas the projection of the changed state means $\boldsymbol{\mu}_1$ would simply be $\mu_{j,1}^{{\prime}}$. The overline of an average state likewise carries over to the projected notation, as does the standardization notation where necessary. For the comparison without knowledge about the non-stationarity, projections into the eigensystem $\overline{\lambda}_j$ and $\boldsymbol{\overline{v}}_j, ~j = 1,\dots,D$ of the average state are also necessary. We denote these in the same way with a double prime

Equation (3):
$$\mu_j^{\prime\prime} = \overline{\boldsymbol{v}}_j^{T}\,\boldsymbol{\mu},$$

Equation (4):
$$\sigma_j^{\prime\prime} = \sqrt{\overline{\boldsymbol{v}}_j^{T}\, C\, \overline{\boldsymbol{v}}_j}.$$

We now have all projections into the eigenvector system, i.e. onto the principal components, that we need. In short, we have for every component j the projections of

  • the original state into the correct state $\mu_{j,0}^{{\prime}(s^{*})}$, $\sigma_{j,0}^{{\prime}(s^{*})}$,
  • the changed state into the correct state $\mu_{j,1}^{{\prime}}$, $\sigma_{j,1}^{{\prime}}$,
  • the original state into the average state $\mu_{j,0}^{{\prime} ^{\prime}(s^{*})}$, $\sigma_{j,0}^{{\prime} ^{\prime}(s^{*})}$ and
  • the changed state into the average state $\mu_{j,1}^{{\prime} ^{\prime}}$, $\sigma_{j,1}^{{\prime} ^{\prime}}$.
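
A short sketch of how these projections can be computed: the eigendecomposition is taken from the reference correlation matrix (a state matrix $C_0^{(s)}$ or the average $\overline{C}_0$), while the projected quantities belong to whichever epoch is being examined. The function name and interface are illustrative only.

```python
import numpy as np

def project(mu, C, C_ref):
    """Projected means mu'_j and standard deviations sigma'_j of an epoch
    described by (mu, C) in the eigensystem of C_ref, cf. equations (2)-(4)."""
    eigvals, eigvecs = np.linalg.eigh(C_ref)
    order = np.argsort(eigvals)[::-1]          # major components first
    V = eigvecs[:, order]                      # columns are the eigenvectors v_j
    mu_proj = V.T @ mu                         # mu'_j = v_j^T mu
    sigma_proj = np.sqrt(np.einsum('ij,ik,kj->j', V, C, V))  # sqrt(v_j^T C v_j)
    return mu_proj, sigma_proj
```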

Having established the mean and standard deviation projections, we must now compare them. To do so, we define the sensitivity for change as the Hellinger distance [35] between the marginal distributions before and after the change. In the light bulb example this is the difference in points on the color chart. In general, the squared Hellinger distance between two probability measures P and Q on the Lebesgue-measurable space X with probability densities p(x) and q(x) is defined as

Equation (5):
$$H^{2}(P,Q) = \frac{1}{2}\int_{X}\left(\sqrt{p(x)}-\sqrt{q(x)}\right)^{2}\mathrm{d}x.$$

For our case, this is easily calculated for two normal distributions $\mathcal{N}(\mu_1, \sigma_1^2)$ and $\mathcal{N}(\mu_2, \sigma_2^2)$ as

Equation (6):
$$H^{2}(\mu_1,\sigma_1,\mu_2,\sigma_2) = 1-\sqrt{\frac{2\sigma_1\sigma_2}{\sigma_1^{2}+\sigma_2^{2}}}\,\exp\left(-\frac{(\mu_1-\mu_2)^{2}}{4\left(\sigma_1^{2}+\sigma_2^{2}\right)}\right).$$

For this theoretical study it is more feasible than a data distribution comparison such as Hotelling $T^2$ statistics. Under the assumption of normal distributions, the Hellinger distance does not require the actual distributions. This allows us to simulate mean values, standard deviations and correlation matrices without having to generate the underlying data.
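
For reference, the closed form of equation (6) translates directly into a small helper function; this is a sketch, the function name is ours, and it operates element-wise on arrays of projected means and standard deviations:

```python
import numpy as np

def hellinger_sq(mu1, sigma1, mu2, sigma2):
    """Squared Hellinger distance between N(mu1, sigma1^2) and N(mu2, sigma2^2),
    cf. equation (6); the distance itself is the square root of this value."""
    s1sq, s2sq = sigma1 ** 2, sigma2 ** 2
    return 1.0 - np.sqrt(2.0 * sigma1 * sigma2 / (s1sq + s2sq)) \
               * np.exp(-(mu1 - mu2) ** 2 / (4.0 * (s1sq + s2sq)))
```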

In equation (6) we have already adopted a notation that is easy to use for our application by writing the Hellinger distance as a function of the means and standard deviations of the normal distributions to be compared. The projected µ and σ are calculated as described above and fully describe the comparison. For example, the Hellinger distance between the pre-change projection of the state $s^{*}$ onto itself and the projection of the changed state into that same system with standardization with state pre-change parameters would read

Equation (7):
$$H\big(\hat{\mu}_{j,0}^{\prime(s^{*})},\hat{\sigma}_{j,0}^{\prime(s^{*})},\hat{\mu}_{j,1}^{\prime},\hat{\sigma}_{j,1}^{\prime}\big) = H_{j}\big(\hat{\mu}_{0}^{\prime(s^{*})},\hat{\sigma}_{0}^{\prime(s^{*})},\hat{\mu}_{1}^{\prime},\hat{\sigma}_{1}^{\prime}\big).$$

For the right side we have simply taken the index j, which must always be equal for all arguments in H, and moved it from the arguments to the function for better readability.

We carry out different Monte Carlo simulations for changes in the correlation structure, the mean and the standard deviation. Each time we will assume that the other factors stay constant and that the change type is the same for the change occurring at $t = \tau$ and in between the states. The detailed scheme for the simulation will always be given in the sections dealing with the results, as these differ for the different change types. In general we will simulate normal states, calculate the average state and then simulate a change at $t = \tau$. We will always perform this for a multitude of normal states and, for each of these, for a multitude of change scenarios. The after-change data will be standardized and projected into the corresponding state eigensystem as well as the average eigensystem, and the respective Hellinger distances are calculated. We then take the expectation value $\overline{H}_j$ over the Monte Carlo runs.

While in real data some fluctuations are always expected, which lead to non-zero Hellinger distances even without a change occurring, this is true for both the state and the average case and can therefore be neglected in a comparison. Remembering our light bulb example, however, we have to correct the result without knowledge of the non-stationarity. In a non-stationary system with states there is an additional non-zero part of the Hellinger distance, if we do not know about the states. This is because the system would be in a state $s^{*}$, which is unequal to the average state even without change, but projected into the average eigensystem and compared with the average state. This would be additional noise in our change detection that only occurs when we do not know about the non-stationarity. In our example this was the blue and red light already being different from the average violet color without a change. We therefore calculate the projections of the different states $s = 1,\dots,S$ into the eigensystem of the average state. Of course, to do this, we need to standardize the state variables with the pre-change average parameters. We then calculate the Hellinger distances this would cause even without a change. The maximum of these distances is the threshold distance up to which we have to assume that the system behaves normally. We can only detect a real change at $t = \tau$ if it causes a Hellinger distance larger than the maximum one the states themselves cause. This maximum is subtracted from the Hellinger distance between the system with a change and the average state, as only the difference between these two measures the detectability of a change in a non-stationary system without knowledge about this non-stationarity. As detectability can never be less than 0, we set negative values to 0. The subtraction is done inside the Monte Carlo runs. This ensures that the actual noise created by the state average comparison is taken as it can vary strongly depending on how different the original normal states are. We denote this corrected Hellinger distance with $\mathcal{H}_j$.
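
The correction described above amounts to a maximum subtraction with clipping at zero. A minimal sketch (array names are illustrative only):

```python
import numpy as np

def corrected_sensitivity(h_changed_vs_avg, h_states_vs_avg):
    """Corrected Hellinger distance without knowledge of the non-stationarity.

    h_changed_vs_avg : array (D,)   distances between the changed system and the average state
    h_states_vs_avg  : array (S, D) distances between each normal state and the
                                    average state (base noise), per component j
    """
    noise = h_states_vs_avg.max(axis=0)                    # maximum noise over the states
    return np.clip(h_changed_vs_avg - noise, 0.0, None)    # detectability cannot be < 0
```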

4. Results

The focus of our analysis lies in the difference between knowing and not knowing about the non-stationarity of the system. We present our results separately for the different types of change. This way it is easier to describe the simulation procedure including the necessary standardization. We have verified that our implementation reproduces the results given by Tveten [16] for the stationary case, i.e. if we assume the number of states S to be one. The results are shown in figure 1, which displays the sensitivity of the different eigenvectors to changes in the stationary system. As described in section 3, the sensitivity is given by the Hellinger distance between the distributions of data projections onto eigenvectors before and after the change. A higher value of the expectation value for the Hellinger distance on the y-axis therefore indicates a higher sensitivity to changes. Clearly, the minor components, i.e. the eigenvectors with small eigenvalues, display a higher sensitivity to change. This type of plot is used frequently in the following analysis including non-stationarity. The interpretation of the values on the y-axis stays the same. Here, we show results for the three different types of changes in one plot. The simulation procedures used for each change type are described in the following sections.

Figure 1. Monte Carlo estimates for the sensitivity to changes of the different eigenvectors in a stationary system. The sensitivities are the Hellinger distances between the distributions of data projections onto the eigenvectors before and after the change. Different types of changes are shown as different colors (cor, correlation; var, variance).

We present the simulation and its results for correlation structure, mean value and standard deviation in sections 4.1–4.3, respectively.

4.1. Change in correlation structure

We explore the sensitivity of principal components to a change in the correlation structure of the simulated variables. We think that this is the prominent use case for PCA, as changes in mean and standard deviation can also be detected by comparing their values directly. In contrast to the other scenarios, we do not have to worry about standardization issues as mean and standard deviation always stay the same. This is so because we assume that the change does not influence these parameters and the normal states only differ in correlation structure. Hence, the vector of means $\boldsymbol{\mu}_0^{(s)}$ and the vector of standard deviations $\boldsymbol{\sigma}_0^{(s)}$ are equal for all states. We will therefore refrain from using the standardization notation in this case for the sake of readability.

To obtain results for different combinations of dimension D, number of normal states S and change sparsity K, we perform Monte Carlo simulations with various change scenarios. Thereby we get the estimate $\overline{H}_j$ as an average over all simulation runs. We simulate two scenarios, which differ in the method of creating the different states s that exist before the change. For one scenario they are random and unrelated to each other. For the second we draw one random correlation matrix and obtain the other S − 1 states by changing the first state in the same way that change is introduced at $t = \tau$ later on.

In the case of unrelated random states, our simulation follows these steps:

  • 1.  
    Draw S random correlation matrices $C_0^{(s)}, ~ s = 1,\dots,S$ of dimension D using the method described in [36].
  • 2.  
    Calculate the element-wise average correlation matrix $\overline{C}_0$ before the change.
  • 3.  
    Calculate the Hellinger distances $H_j(\overline{\mu}_0^{{\prime\prime}}, \overline{\sigma}_0^{{\prime\prime}},\mu_0^{{\prime\prime}(s)}, \sigma_0^{{\prime\prime}(s)}), ~ j = 1,\dots,D ~,~s = 1,\dots,S$ between the occurring states and the average state, which gives the discussed base noise for detection.
  • 4.  
    Draw a change sparsity K uniformly as an integer number between 2 and D. This gives the number of change-affected dimensions.
  • 5.  
    Determine which dimensions are affected by randomly drawing K integer numbers uniformly between 1 and D; these form the set $\mathcal{D}$ of affected dimensions.
  • 6.  
    Randomly draw the normal state $s^{*}$ the system is in after the change at time $t = \tau$ from the S available states.
  • 7.  
    Draw a multiplicative change in correlation a uniformly between 0 and 1. Then multiply the correlations between all variables i and d from the affected dimensions $\mathcal{D}$ with this change a for i ≠ d.
  • 8.  
    Calculate $H_j(\mu_0^{{\prime}(s^{*})}, \sigma_0^{{\prime}(s^{*})},\mu_1^{{\prime}}, \sigma_1^{{\prime}}), ~ j = 1,\dots,D$ between changed state and state, which gives the sensitivity for change detection with knowledge about the non-stationarity.
  • 9.  
    Calculate $H_j(\overline{\mu}_0^{{\prime\prime}}, \overline{\sigma}_0^{{\prime\prime}},\mu_1^{{\prime\prime}}, \sigma_1^{{\prime\prime}}), ~ j = 1,\dots,D$ between changed state and average state.
  • 10.  
    Calculate the corrected Hellinger distance. This is done by subtracting the additional noise occurring (the maximum over all the distances between the states and the average state) from the Hellinger distance between the changed state and average state: $\mathcal{H}_j(\overline{\mu}_0^{{\prime\prime}}, \overline{\sigma}_0^{{\prime\prime}},\mu_1^{{\prime\prime}}, \sigma_1^{{\prime\prime}}) = \max( H_j(\overline{\mu}_0^{{\prime\prime}}, \overline{\sigma}_0^{{\prime\prime}},\mu_1^{{\prime\prime}}, \sigma_1^{{\prime\prime}}) - \max\limits_{s}(H_j(\overline{\mu}_0^{{\prime\prime}}, \overline{\sigma}_0^{{\prime\prime}},\mu_0^{{\prime\prime}(s)}, \sigma_0^{{\prime\prime}(s)})),0),$ $j = 1,\dots,D$. This gives the sensitivity for change detection without knowledge about the non-stationarity.
  • 11.  
    Repeat steps 4 to 10 $10^3$ times.
  • 12.  
    Repeat steps 1 to 11 $10^3$ times.

In step 7, we only apply decreases in correlation. We do this to avoid many indefinite changed matrices [16]. One can easily imagine that a multiplicative increase could often lead to correlation coefficients larger than one. If any indefinite matrices still occur, we use Higham's algorithm [37] to find the closest positive-definite one. The results of such a simulation for D = 20 and S = 3 are seen in figure 2. As mentioned before, correction of the Hellinger distance for change detection with the average state is done inside the Monte Carlo runs. To obtain the single green line representing the correction values, we average over the Monte Carlo simulations.
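
The following Python sketch illustrates a single Monte Carlo iteration of the unrelated-states scenario (steps 1–10). The random-correlation generator is only a simple stand-in for the method of [36], all function and variable names are ours, and the Higham repair step for rare indefinite matrices is omitted; since means and standard deviations are fixed in this scenario, only projected standard deviations enter the Hellinger distances.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_correlation(D):
    """Stand-in for the random correlation matrices of [36]:
    a Wishart-type matrix normalized to unit diagonal."""
    A = rng.standard_normal((D, D))
    cov = A @ A.T
    d = np.sqrt(np.diag(cov))
    return cov / np.outer(d, d)

def proj_sigma(C_ref, C):
    """Projected standard deviations sigma'_j of C in the eigensystem of C_ref."""
    vals, vecs = np.linalg.eigh(C_ref)
    V = vecs[:, np.argsort(vals)[::-1]]                  # major components first
    return np.sqrt(np.einsum('ij,ik,kj->j', V, C, V))

def hellinger_sq(sig1, sig2):
    """Squared Hellinger distance between N(0, sig1^2) and N(0, sig2^2)."""
    return 1.0 - np.sqrt(2.0 * sig1 * sig2 / (sig1 ** 2 + sig2 ** 2))

D, S = 20, 3
states = np.array([random_correlation(D) for _ in range(S)])     # step 1
C_avg = states.mean(axis=0)                                      # step 2

# step 3: base noise of each state against the average state
noise = np.array([hellinger_sq(proj_sigma(C_avg, C_avg), proj_sigma(C_avg, Cs))
                  for Cs in states]).max(axis=0)

# steps 4-7: sparse multiplicative change applied to a randomly chosen state
K = rng.integers(2, D + 1)                                       # change sparsity
dims = rng.choice(D, size=K, replace=False)                      # affected dimensions
s_star = rng.integers(S)                                         # realized state s*
a = rng.uniform(0.0, 1.0)                                        # correlation damping
mask = np.zeros(D, dtype=bool)
mask[dims] = True
off_diag = np.outer(mask, mask) & ~np.eye(D, dtype=bool)
C1 = states[s_star].copy()
C1[off_diag] *= a

# step 8: sensitivity with knowledge about the realized state
h_state = hellinger_sq(proj_sigma(states[s_star], states[s_star]),
                       proj_sigma(states[s_star], C1))
# steps 9-10: sensitivity without that knowledge, corrected by the base noise
h_avg = hellinger_sq(proj_sigma(C_avg, C_avg), proj_sigma(C_avg, C1))
h_corrected = np.clip(h_avg - noise, 0.0, None)
```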

Figure 2. Monte Carlo estimates for the Hellinger distances of projections onto the different eigenvectors in presence of a change in correlation structure for D = 20 and S = 3. The original system states are created randomly and are unrelated to each other. The curves show the results for change sensitivity with knowledge of the non-stationarity (red), the uncorrected change sensitivity without knowledge (blue), the additional base noise induced by the system being in a state but comparing to the average state (green) and the corrected change sensitivity without knowledge (black). They correspond to the calculation steps 8, 9, 3 and 10 in the simulation, respectively.

To compare states that emerged by applying changes to one random state, we simply change step 1 to the following substeps (the results are presented in figure 3); a short sketch of this construction is given after the list:

  • (a)  
    Draw one random correlation matrix $C_0^{(1)}$ of dimension D using the method described in [36].
  • (b)  
    Draw a change sparsity K uniformly as an integer number between 2 and D. This gives the number of change-affected dimensions.
  • (c)  
    Determine which dimensions are affected by randomly drawing K integer numbers uniformly between 1 and D.
  • (d)  
    Draw a multiplicative change in correlation a uniformly between 0 and 1. Then multiply the correlations between all variables i and d from the affected dimensions $\mathcal{D}$ with this change a for i ≠ d.
  • (e)  
    Repeat steps (b) to (d) S − 1 times and use the changed correlation matrices as $C_0^{(s)}, ~ s = 2,\dots,S$.
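
A short sketch of this construction, assuming $C_0^{(1)}$ comes from any random-correlation generator (for example the stand-in used in the previous sketch); the names are ours:

```python
import numpy as np

def sparse_damping(C, rng):
    """Apply a sparse multiplicative correlation change (substeps (b)-(d))."""
    D = C.shape[0]
    K = rng.integers(2, D + 1)                       # change sparsity
    dims = rng.choice(D, size=K, replace=False)      # affected dimensions
    a = rng.uniform(0.0, 1.0)                        # damping factor
    mask = np.zeros(D, dtype=bool)
    mask[dims] = True
    off_diag = np.outer(mask, mask) & ~np.eye(D, dtype=bool)
    C_new = C.copy()
    C_new[off_diag] *= a
    return C_new

def related_states(C0, S, rng):
    """State 1 is C0 itself; states 2, ..., S are sparse modifications of it (step (e))."""
    return np.array([C0] + [sparse_damping(C0, rng) for _ in range(S - 1)])
```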

Figure 3. Monte Carlo estimates for the Hellinger distances of projections onto the different eigenvectors in presence of a change in correlation structure for D = 20 and S = 3. In the original system, one state is created randomly and the others are obtained from it by applying a change. The curves show the results for change sensitivity with knowledge of the non-stationarity (red), the uncorrected change sensitivity without knowledge (blue), the additional base noise induced by the system being in a state but comparing to the average state (green) and the corrected change sensitivity without knowledge (black). They correspond to the calculation steps 8, 9, 3 and 10 in the simulation, respectively.

In both scenarios the sensitivity is greatest for the minor components in the changed-state-to-state comparison. This is, of course, in accordance with the stationary results (see figure 1), as we always compare with the correct normal state. The changed-state-to-average-state distance is larger than the one to the actual state for major components, and a crossing point between the two appears towards the minor components. This crossing point lies at larger j for the case of related original states. However, as pointed out before, we need to correct this Hellinger distance by the maximum distance between the actual states and the average state. This corrected Hellinger distance indicates the sensitivity of the change detection without knowledge about the non-stationarity. Its values are smaller than the ones with that knowledge for all principal components. In fact, they often drop to 0, indicating no possible detection at all. The knowledge about non-stationarity greatly increases the possibility of detecting changes.

For the case of unrelated states the blue and green lines are almost flat, indicating that all components possess the same sensitivity. This is an inherent feature of the averaging. If the correlation matrices of the states are all entirely random, the correlation structures tend to cancel each other out. Simply put, the off-diagonal elements of the average matrix tend towards 0. This results in meaningless eigenvector structures. This is further underlined by figure 4, which shows the same results for S = 7. With more states the average matrix is closer to 0s on the off-diagonal and the blue and green lines are even flatter. This is one of the reasons for the introduction of the scenario with related states. Another, more application-oriented, reason is that many systems will not change or reverse their entire behavior but rather change the behavior of certain groups of variables. In this case, we see in figures 4 and 5 that with an increased number of states detection without knowledge about the non-stationarity becomes impossible, i.e. the corrected Hellinger distances are always 0. For S = 3 some detectability was given for minor principal components knowing just the average state, but this vanishes completely for S = 7.

Figure 4. Monte Carlo estimates for the Hellinger distances of projections onto the different eigenvectors in presence of a change in correlation structure for D = 20 and S = 7. The original system states are created randomly and unrelated to each other. The curves show results for change sensitivity with knowledge of the non-stationarity (red), the uncorrected change sensitivity without knowledge (blue), the additional base noise induced by the system being in a state but comparing to the average state (green) and the corrected change sensitivity without knowledge (black). They correspond to the calculation steps 8, 9, 3 and 10 in the simulation, respectively.
Figure 5. Monte Carlo estimates for the Hellinger distances of projections onto the different eigenvectors in presence of a change in correlation structure for D = 20 and S = 7. In the original system, one state is created randomly and the others are obtained from it by applying a change. The curves show the results for change sensitivity with knowledge of the non-stationarity (red), the uncorrected change sensitivity without knowledge (blue), the additional base noise induced by the system being in a state but comparing to the average state (green) and the corrected change sensitivity without knowledge (black). They correspond to the calculation steps 8, 9, 3 and 10 in the simulation, respectively.

In general, the increase in change detectability with knowledge about non-stationarity compared to without it is measured by the difference between the changed-state-to-state distance and the corrected value for the average state. Results for this difference are shown in figures 6 and 7. With non-stationarity present in the analysis, the minor components remain the most sensitive. Knowledge about the non-stationarity enables a more sensitive change detection for different numbers of normal states and different dimensions. As expected, the increase in sensitivity is larger for more states. It is also larger for unrelated normal states. Basically, these states are more different from each other than in the related case, so it is reasonable that the knowledge is more helpful here. We see, however, that for S = 7 and the largest j, where detectability is generally highest, the increase is larger for related normal states. As seen in figures 4 and 5, for S = 7 the corrected state-to-average-state distance is always 0 for both scenarios. So any difference in the sensitivity increase between the scenarios can only stem from a difference in the changed-state-to-state distance. This difference exists purely due to a technicality: because correlation coefficients cannot be larger than one, the changes applied here are multiplicative between 0 and 1, i.e. they only reduce correlations. As this is also true for the changes performed to obtain the related states, the related normal states have weaker correlations on average than in the scenario with random states. This changes the sensitivity of the changed-state-to-state comparison in figures 3 and 5. Here, the results differ from the stationary case because the underlying set of matrices is changed. In general, however, as long as different normal states exist, the knowledge about non-stationarity increases change detection sensitivity. As this is the main interest of the current study we did not pursue this effect further, but it is interesting for future studies. The dimension D does not change the overall results. For small j the increase is a bit larger for smaller dimensions, whereas for large j it is the other way around. With the sensitivity for change being greatest for the minor components, the knowledge about non-stationarity is also very important in high-dimensional systems. It is noteworthy that an increase in the dimension of the matrix to D = 100 means that the relative change sparsity can be much higher, as the smallest number of changed dimensions remains at two independent of D. Our results therefore also hold for sparse changes.

Figure 6. Sensitivity increase for a change in the correlation structure when knowledge about state-wise non-stationarity is available compared to without that knowledge. Results are shown for different, color-coded dimensions D, S = 3 and unrelated random states (simulation scenario 1) as well as related states that were generated from one random state (simulation scenario 2).
Figure 7. Sensitivity increase for a change in the correlation structure when knowledge about the state-wise non-stationarity is available compared to without that knowledge. Results are shown for different, color-coded dimensions D, S = 7 and unrelated random states (simulation scenario 1) as well as related states that were generated from one random state (simulation scenario 2).

In summary, knowledge about state-wise non-stationarity is important for the detection of changes. The sensitivity increase with this knowledge is largest for the minor components, where change detection is already most sensitive in the stationary case. For the major components the increase is still detectable, but much smaller. We could speculate that PCA for dimensionality reduction is therefore still quite possible without knowing about the non-stationarity, but this cannot be tested or verified in our setup and is not within the scope of this paper.

4.2. Change in mean

We now analyze the effect of changes in the mean values of variables. Again, we perform Monte Carlo simulations with various change scenarios to obtain results. The general procedure is similar to the one used for changes in correlation, but the details differ. We again simulate two different scenarios of normal states: related and unrelated. In the case of unrelated, random states our simulation follows these steps (a compact code sketch is given after the list):

  • 1.  
    Draw a random correlation matrix C0 of dimension D using the method described in [36], which is the same for all states.
  • 2.  
    Draw S vectors of means $\boldsymbol{\mu}_0^{(s)}$, where each element is uniformly drawn as a non-integer value between −3 and 3.
  • 3.  
    Calculate the element-wise average vector of means $\boldsymbol{\overline{\mu}}_0$ before the change.
  • 4.  
    Assume standardization with average pre-change parameters and calculate $H_j(\tilde{\overline{\mu}}_0^{{\prime\prime}}, \tilde{\overline{\sigma}}_0^{{\prime\prime}},\tilde{\mu}_0^{{\prime\prime}(s)}, \tilde{\sigma}_0^{{\prime\prime}(s)}), ~ j = 1,\dots,D~,~s = 1,\dots,S$ between the states and the average state, which gives the discussed base noise for detection.
  • 5.  
    Draw a change sparsity K uniformly as an integer number between 2 and D. This gives the number of change-affected dimensions.
  • 6.  
    Determine which dimensions are affected by randomly drawing K integer numbers uniformly between 1 and D.
  • 7.  
    Randomly draw the normal state $s^{*}$ the system is in after the change at time $t = \tau$ from the available S states.
  • 8.  
    Draw an additive change in mean $\Delta \mu$ uniformly between −3 and 3. To obtain $\boldsymbol{\mu}_1$, i.e. the vector of means after the change, from the vector of means of the original state $\boldsymbol{\mu}_{0}^{(s^{*})}$ the change value $\Delta \mu$ is added to the elements of the affected dimensions.
  • 9.  
    Assume standardization with known state pre-change parameters, i.e. $\boldsymbol{\hat{\mu}}_{0}^{(s^{*})} = \boldsymbol{0}$ and $\boldsymbol{\hat{\sigma}}_{0}^{(s^{*})} = \boldsymbol{1}$, and calculate $H_j(\hat{\mu}_0^{{\prime}(s^{*})}, \hat{\sigma}_0^{{\prime}(s^{*})},\hat{\mu}_1^{{\prime}}, \hat{\sigma}_1^{{\prime}}), ~ j = 1,\dots,D$ between the changed state and state. This gives the sensitivity for change detection with knowledge about the non-stationarity.
  • 10.  
    Assume standardization with average pre-change parameters, i.e. $\boldsymbol{\tilde{\overline{\mu}}}_0 = \boldsymbol{0}$ and $\boldsymbol{\tilde{\overline{\sigma}}}_0 = \boldsymbol{1}$, and calculate $H_j(\tilde{\overline{\mu}}_0^{{\prime\prime}}, \tilde{\overline{\sigma}}_0^{{\prime\prime}},\tilde{\mu}_1^{{\prime\prime}}, \tilde{\sigma}_1^{{\prime\prime}}), ~ j = 1,\dots,D$ between the changed state and average state.
  • 11.  
    Calculate the corrected Hellinger distance. This is done by subtracting the occurring additional noise (the maximum over all the distances between the states and the average state) from the Hellinger distance between the changed state and average state: $\mathcal{H}_j(\tilde{\overline{\mu}}_0^{{\prime\prime}}, \tilde{\overline{\sigma}}_0^{{\prime\prime}},\tilde{\mu}_1^{{\prime\prime}}, \tilde{\sigma}_1^{{\prime\prime}}) = \max( H_j(\tilde{\overline{\mu}}_0^{{\prime\prime}}, \tilde{\overline{\sigma}}_0^{{\prime\prime}},\tilde{\mu}_1^{{\prime\prime}}, \tilde{\sigma}_1^{{\prime\prime}}) - \max\limits_{s}(H_j(\tilde{\overline{\mu}}_0^{{\prime\prime}}, \tilde{\overline{\sigma}}_0^{{\prime\prime}},\tilde{\mu}_0^{{\prime\prime}(s)}, \tilde{\sigma}_0^{{\prime\prime}(s)})),0),$ $j = 1,\dots,D$.
  • 12.  
    Repeat steps 5 to 11 $10^3$ times.
  • 13.  
    Repeat steps 1 to 12 $10^3$ times.
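
A compact sketch of one Monte Carlo iteration of this procedure. It assumes, as is implicit above, that all states share the correlation matrix and have unit standard deviations, so the projected standard deviations equal $\sqrt{\lambda_j}$ of $C_0$ in both eigensystems and standardization reduces to subtracting the respective pre-change mean vector; the generator for $C_0$ is again only a stand-in for [36] and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def hellinger_sq(mu1, sig1, mu2, sig2):
    """Squared Hellinger distance between two normals, cf. equation (6)."""
    s1sq, s2sq = sig1 ** 2, sig2 ** 2
    return 1.0 - np.sqrt(2.0 * sig1 * sig2 / (s1sq + s2sq)) \
               * np.exp(-(mu1 - mu2) ** 2 / (4.0 * (s1sq + s2sq)))

D, S = 20, 3
A = rng.standard_normal((D, D))                       # step 1: stand-in for [36]
C0 = A @ A.T
C0 /= np.outer(np.sqrt(np.diag(C0)), np.sqrt(np.diag(C0)))

mus = rng.uniform(-3.0, 3.0, size=(S, D))             # step 2: state mean vectors
mu_avg = mus.mean(axis=0)                             # step 3: average state

vals, vecs = np.linalg.eigh(C0)
order = np.argsort(vals)[::-1]
V = vecs[:, order]                                    # shared eigenvectors
sig_proj = np.sqrt(np.clip(vals[order], 0.0, None))   # sigma'_j = sqrt(lambda_j)
zero = np.zeros(D)

# step 4: base noise of the states against the average state
noise = np.array([hellinger_sq(V.T @ (mu - mu_avg), sig_proj, zero, sig_proj)
                  for mu in mus]).max(axis=0)

# steps 5-8: sparse additive mean change applied to a randomly chosen state
K = rng.integers(2, D + 1)
dims = rng.choice(D, size=K, replace=False)
s_star = rng.integers(S)
mu1 = mus[s_star].copy()
mu1[dims] += rng.uniform(-3.0, 3.0)

# step 9: with knowledge of the state (standardize with state parameters)
h_state = hellinger_sq(V.T @ (mu1 - mus[s_star]), sig_proj, zero, sig_proj)
# steps 10-11: without that knowledge (standardize with the average), then correct
h_avg = hellinger_sq(V.T @ (mu1 - mu_avg), sig_proj, zero, sig_proj)
h_corrected = np.clip(h_avg - noise, 0.0, None)
```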

To have states that emerged by applying changes to one random state, we simply change step 2 to the substeps:

  • (a)  
    Draw one vector of means $\boldsymbol{\mu}_0^{(1)}$, where each element is uniformly drawn as a non-integer value between −3 and 3.
  • (b)  
    Draw a change sparsity K uniformly as an integer number between 2 and D. This gives the number of change-affected dimensions.
  • (c)  
    Determine which dimensions are affected by randomly drawing K integer numbers uniformly between 1 and D.
  • (d)  
    Draw an additive change in mean $\Delta \mu$ uniformly between −3 and 3. To obtain $\boldsymbol{\mu}_0^{(s)}, ~ s \neq 1$, i.e. the vector of means of a state s, from the vector of means of state 1 $\boldsymbol{\mu}_{0}^{(1)}$ the change value $\Delta \mu$ is added to the elements of the affected dimensions.
  • (e)  
    Repeat steps (b) to (d) S − 1 times and use the changed vectors of means as $\boldsymbol{\mu}_0^{(s)}, ~ s = 2,\dots,S$.

As we saw in section 4.1 the most interesting result is the increase in sensitivity with knowledge about the non-stationarity compared with change detection without said knowledge. These simulation results are shown in figures 8 and 9. We see a clear increase in sensitivity for all simulated parameters. As before with changes in correlation structure, the knowledge about non-stationarity is slightly more important when the original normal states are related to each other and therefore not entirely different for S = 3. For S = 7 this difference is no longer visible. This is because the corrected Hellinger distances are always 0 here and the changed state to state distances are not substantially different between the two scenarios. There is no influential change in the underlying set of states in contrast to the one seen in section 4.1.

Figure 8. Sensitivity increase for a change in mean when knowledge about the state-wise non-stationarity is available compared to without that knowledge. Results are shown for different, color-coded dimensions D, S = 3 and unrelated random states (simulation scenario 1) as well as related states that were generated from one random state (simulation scenario 2).
Figure 9. Sensitivity increase for a change in mean when knowledge about the state-wise non-stationarity is available compared to without that knowledge. Results are shown for different, color-coded dimensions D, S = 7 and unrelated random states (simulation scenario 1) as well as related states that were generated from one random state (simulation scenario 2).

We again conclude that knowledge about non-stationarity is in general important for change detection. The increase seems to be more important for small j, i.e. the major components, compared with the results in section 4.1. However, this is most likely because the change sensitivity in the stationary case already exhibits the same dependence on j (see figure 1).

4.3. Change in standard deviation

Finally, we analyze the sensitivity for changes in standard deviation in the presence of non-stationarity. The basic Monte Carlo process is the same, but once more the details are different. We simulate related and unrelated normal states again. In the case of unrelated, random states our simulation follows these steps:

  • 1.  
    Draw a random correlation matrix C0 of dimension D using the method described in [36], which is the same for all states.
  • 2.  
    Draw S vectors of standard deviations $\boldsymbol{\sigma}_0^{(s)}$, where each element is uniformly drawn as a non-integer value between $1/3$ and 2.
  • 3.  
    Calculate the element-wise average vector of variances $\boldsymbol{\overline{\sigma}}_0^2$ before the change. Then take the square root to get the average standard deviations.
  • 4.  
    Assume standardization with average pre-change parameters. Then $\tilde{\overline{C}}_0$ is a well-defined correlation matrix with 1s on its diagonal. To express a state as if it had been normalized with the average parameters, i.e. to calculate $\tilde{C}_{0}^{(s)}$, we need the state covariance matrix. It is obtained by undoing the correct normalization of the state correlation matrix,
    $$\Sigma_0^{(s)} = \operatorname{diag}\big(\boldsymbol{\sigma}_0^{(s)}\big)\, C_0^{(s)}\, \operatorname{diag}\big(\boldsymbol{\sigma}_0^{(s)}\big).$$
    It is then wrongly normalized with the average pre-change parameters,
    $$\tilde{C}_{0}^{(s)} = \operatorname{diag}\big(\overline{\boldsymbol{\sigma}}_0\big)^{-1}\, \Sigma_0^{(s)}\, \operatorname{diag}\big(\overline{\boldsymbol{\sigma}}_0\big)^{-1}.$$
    $\tilde{C}_{0}^{(s)}$ is not a well-defined correlation matrix; a short code sketch of this renormalization is given after the list. We then calculate the Hellinger distance $H_j(\tilde{\overline{\mu}}_0^{{\prime\prime}}, \tilde{\overline{\sigma}}_0^{{\prime\prime}},\tilde{\mu}_0^{{\prime\prime}(s)}, \tilde{\sigma}_0^{{\prime\prime}(s)}), ~ j = 1,\dots,D~,~s = 1,\dots,S$ between the states and the average state, which gives the discussed base noise for detection.
  • 5.  
    Draw a change sparsity K uniformly as an integer number between 2 and D. This gives the number of change-affected dimensions.
  • 6.  
    Determine which dimensions are affected by randomly drawing K integer numbers uniformly between 1 and D.
  • 7.  
    Randomly draw the normal state $s^{*}$ the system is in after the change at time $t = \tau$ from the available S states.
  • 8.  
    Randomly decide if the change increases or decreases the standard deviation. Then, draw a multiplicative change in standard deviation $\Delta \sigma$ uniformly between 1 and 3 or between $1/3$ and 1, respectively. $\boldsymbol{\sigma}_1$ is obtained by multiplying the elements of $\boldsymbol{\sigma}_{0}^{(s^{*})}$ with $\Delta \sigma$ for the affected dimensions.
  • 9.  
    Assume standardization with known state pre-change parameters. Then $\hat{C}_0^{(s^{*})}$ is a well-defined correlation matrix with 1s on its diagonal and $\hat{C}_1$ is calculated analogous to the description in step 4. Then calculate the Hellinger distance $H_j(\hat{\mu}_0^{{\prime}(s^{*})}, \hat{\sigma}_0^{{\prime}(s^{*})},\hat{\mu}_1^{{\prime}}, \hat{\sigma}_1^{{\prime}}), ~ j = 1,\dots,D$ between the changed state and state. This gives the sensitivity for change detection with knowledge about the non-stationarity.
  • 10.  
    Assume standardization with average pre-change parameters. Then $\tilde{\overline{C}}_0$ is a well-defined correlation matrix and $\tilde{C}_1$ can be calculated analogous to the procedure in step 4. Then calculate the Hellinger distance $H_j(\tilde{\overline{\mu}}_0^{{\prime\prime}}, \tilde{\overline{\sigma}}_0^{{\prime\prime}},\tilde{\mu}_1^{{\prime\prime}}, \tilde{\sigma}_1^{{\prime\prime}}), ~ j = 1,\dots,D$ between the changed state and average state.
  • 11.  
    Calculate the corrected Hellinger distance. This is done by subtracting the occurring additional noise (the maximum over all the distances between the states and the average state) from the Hellinger distance between the changed state and average state: $\mathcal{H}_j(\tilde{\overline{\mu}}_0^{{\prime\prime}}, \tilde{\overline{\sigma}}_0^{{\prime\prime}},\tilde{\mu}_1^{{\prime\prime}}, \tilde{\sigma}_1^{{\prime\prime}}) = \max(H_j(\tilde{\overline{\mu}}_0^{{\prime\prime}}, \tilde{\overline{\sigma}}_0^{{\prime\prime}},\tilde{\mu}_1^{{\prime\prime}}, \tilde{\sigma}_1^{{\prime\prime}}) - \max\limits_{s}(H_j(\tilde{\overline{\mu}}_0^{{\prime\prime}}, \tilde{\overline{\sigma}}_0^{{\prime\prime}},\tilde{\mu}_0^{{\prime\prime}(s)},\tilde{\sigma}_0^{{\prime\prime}(s)})),0),$ $j = 1,\dots,D$.
  • 12.  
    Repeat steps 5 to 11 $10^3$ times.
  • 13.  
    Repeat steps 1 to 12 $10^3$ times.
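
As referenced in step 4, a minimal sketch of the renormalization that turns a state correlation matrix into the 'wrongly' normalized matrix $\tilde{C}_0^{(s)}$; applied to the post-change parameters, the same operation yields $\hat{C}_1$ and $\tilde{C}_1$ in steps 9 and 10. The function name is ours.

```python
import numpy as np

def renormalize(C_state, sigma_state, sigma_ref):
    """Undo the correct normalization of a state correlation matrix and
    re-normalize with reference standard deviations (cf. step 4). The result
    generally does not have 1s on its diagonal."""
    cov = C_state * np.outer(sigma_state, sigma_state)   # state covariance matrix
    return cov / np.outer(sigma_ref, sigma_ref)          # wrongly normalized matrix
```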

To have states that emerged by applying changes to one random state, we simply change step 2 to the substeps:

  • (a)  
    Draw one vector of standard deviations $\boldsymbol{\sigma}_0^{(1)}$, where each element is uniformly drawn as a non-integer value between $1/3$ and 2.
  • (b)  
    Draw a change sparsity K uniformly as an integer number between 2 and D. This gives the number of change-affected dimensions.
  • (c)  
    Determine which dimensions are affected by randomly drawing K integer numbers uniformly between 1 and D.
  • (d)  
    Randomly decide whether the new state has an increased or decreased standard deviation. Then, draw a multiplicative change in standard deviation $\Delta \sigma$ uniformly between 1 and 3 or between $1/3$ and 1, respectively. $\boldsymbol{\sigma}_0^{(s)}, ~s\neq1$ is obtained by multiplying the elements of $\boldsymbol{\sigma}_{0}^{(1)}$ with $\Delta \sigma$ for the affected dimensions.
  • (e)  
    Repeat steps (b) to (d) S − 1 times and use the changed vectors of standard deviations as $\boldsymbol{\sigma}_0^{(s)}, ~ s = 2,\dots,S$.

The results for the increase in detection sensitivity due to knowledge about the non-stationarity are shown in figures 10 and 11. The increase seems to be small for many components j, especially the major components. As was the case in the comparison between changes in mean and correlation structure, this is largely due to the different sensitivity in the stationary case (see figure 1). The curve of the increase always shows a similar behavior over j to the stationary sensitivity itself. This underlines the importance of knowledge about non-stationarity, as one would use these components for practical applications. Changes in standard deviation are almost impossible to detect for small j even in the stationary case. Therefore, our results cannot show a large increase in that regime. For large j we can again conclude that knowledge about the non-stationarity of the system is quite important and becomes even more so if the original states are similar to each other. Again, as with the results for changes in mean, the difference between related and unrelated states is smaller for S = 7 and even vanishes for large j in that case. This is again due to a vanishing corrected Hellinger distance between state and average state in this regime.

Figure 10. Sensitivity increase for a change in standard deviation when knowledge about the state-wise non-stationarity is available compared to without that knowledge. Results are shown for different, color-coded dimensions D, S = 3 and unrelated random states (simulation scenario 1) as well as related states that were generated from one random state (simulation scenario 2).
Figure 11.

Figure 11. Sensitivity increase for a change in standard deviation when knowledge about the state-wise non-stationarity is available compared to without that knowledge. Results are shown for different, color-coded dimensions D, S = 7 and unrelated random states (simulation scenario 1) as well as related states that were generated from one random state (simulation scenario 2).


5. Application to real data: traffic as an example

To demonstrate how the method works, we give an instructive example with real data collected on German motorways. We look at 35 cross-sections (detectors) on the Cologne orbital motorway in 2015. The measured values are flows of vehicles, i.e. the number of vehicles passing a detector per unit time, in the counterclockwise direction. The time resolution of the data is 1 min. As the orbital motorway consists of sections with different numbers of lanes, we do not look at a single lane but rather at the flow accumulated over all lanes. The flow at time t at cross-section k is denoted $q_k (t)$. The dataset is described in more detail in [32].

This example is chosen because it allows a simple definition of 'normal' and 'non-normal', i.e. changes in the system. These terms serve only as labels without further interpretation. We consider as normal the workdays Monday through Thursday as they present the typical rush hour behavior one expects from traffic. Fridays are excluded because the afternoon rush hour is stretched in time due to strongly varying end times of the working day. Furthermore, we exclude the North Rhine-Westphalia bank holidays. Accordingly, we can afterwards try to detect weekend or holiday behavior as non-normal. Another advantage of this example is that everybody is aware of the non-stationarity of the system: traffic differs greatly depending on the time of day. This means the non-stationarity in this system is not governed by a fluctuating variable but simply by the time of day.

The first thing one has to do to test the proposed method on real data is to classify the non-stationarity in the data. There is, of course, not only one way to do this. Here, we calculate correlation matrices for each hour of the normal workdays. This is done with non-overlapping time windows called epochs with a length $\tau_{\mathrm{ep}} = 1\,\mathrm{h}$. A day will thus contain 24 correlation matrices. The matrix for time $t = \tau$ is calculated from the data $q_k(t), ~ k = 1,\dots,35, ~ \tau-\tau_{\mathrm{ep}} \lt t \leqslant \tau$. We write $| \tau_{\mathrm{ep}} |$ to denote the number of data points during this epoch. We normalize for each detector k and each time window separately to 0 mean and a standard deviation of 1:

$\tilde{q}_k(t) = \dfrac{q_k(t) - \langle q_k(t) \rangle_{\tau}}{\sqrt{\langle q_k^2(t) \rangle_{\tau} - \langle q_k(t) \rangle_{\tau}^2}}$ (8)

Here, $\langle \dots \rangle_{\tau}$ denotes the average in the 1 h epoch. We arrange the data in a $35 \times |\tau_{\mathrm{ep}}|$ matrix

$M_{kt}(\tau) = \tilde{q}_k(t), \quad k = 1,\dots,35, \quad \tau-\tau_{\mathrm{ep}} \lt t \leqslant \tau.$ (9)

The correlation matrix is calculated as

$C(\tau) = \dfrac{1}{|\tau_{\mathrm{ep}}|} M(\tau) M^{\dagger}(\tau)$ (10)

This matrix contains as the element at position $i,j$ the Pearson correlation coefficient between $q_i (t)$ and $q_j (t)$ during epoch τ. In case of missing values in the data, we disregard all values of that time stamp to ensure the calculated matrix is a positive-definite correlation matrix. We define a Euclidean distance measure between two matrices in the epochs τ and $\tau^{{\prime}}$

$d(\tau, \tau^{\prime}) = \sqrt{\sum_{i,j} \left( C_{ij}(\tau) - C_{ij}(\tau^{\prime}) \right)^2}$ (11)

and apply k-means clustering [38] with k = 2 to all calculated matrices. This yields the cluster centers (element-wise mean matrices) shown in figure 12 and the distribution of the cluster appearances over the time of day shown in figure 13. Clearly, cluster 1 shows times where the traffic flow at all cross-sections is strongly correlated, whereas cluster 2 is mostly uncorrelated. Here, some correlations remain only close to the diagonal, which indicates similar flows on neighboring detectors. Splitting further yields only a separation of the strongly correlated cluster based on the strength of the correlations; no structural differences are detected. Furthermore, we also applied hierarchical k-means clustering [31], hierarchical clustering based on Ward's optimization criterion [39, 40] and complete-linkage and single-linkage clustering [41] to the matrices. We have tested all algorithms for solutions with two, three and four clusters. We studied the resulting cluster centers and also calculated the silhouette coefficient [42] for each matrix

$\mathrm{sil}(\tau) = \dfrac{b(\tau) - a(\tau)}{\max\left(a(\tau), b(\tau)\right)},$ (12)

where $n(\tau)$ gives the cluster corresponding to epoch τ and $|z_{n}|$ the number of elements in cluster n. $a(\tau)$ is the average distance to all other matrices in the same cluster

$a(\tau) = \dfrac{1}{|z_{n(\tau)}| - 1} \sum_{\tau^{\prime} \in z_{n(\tau)},\, \tau^{\prime} \neq \tau} d(\tau, \tau^{\prime})$ (13)

and $b(\tau)$ is the smallest average distance to a single other cluster

$b(\tau) = \min\limits_{n \neq n(\tau)} \dfrac{1}{|z_{n}|} \sum_{\tau^{\prime} \in z_{n}} d(\tau, \tau^{\prime}).$ (14)

Figure 12. Correlation matrix cluster centers calculated as element-wise means over all matrices sorted into a cluster. The x-axis and y-axis both show the different traffic detectors, but labels are removed on the x-axis for better readability of the figure. Each matrix element is the mean Pearson correlation coefficient between the traffic flow signals of two detectors and its value is color coded.

Figure 13. Histogram counts of the appearances of clusters 1 and 2 during 24 h of a day. The histogram is calculated over all used weekdays with a bin width of 1 h.


This coefficient takes values between −1 and +1, with larger positive values representing matrices that are well clustered and negative values showing matrices that are closer to another cluster than to their own. An indicator for the overall clustering is given by the average silhouette coefficient. This is shown in table 1 for all calculated clusterings. The hierarchical clustering with k-means splits and the one based on Ward's criterion yield very similar results to the standard k-means. This is not surprising as the optimization criteria of the algorithms are similar, but it highlights that the found solution provides a reasonable grouping. The complete-linkage clustering gives a similar solution for two clusters, but starts to split off very small groups for larger numbers of clusters. It is noteworthy that its two-cluster solution has a higher silhouette coefficient than the k-means solution. However, the matrices and the distribution over the time of day show no significant differences for two clusters. As one would expect, the single-linkage clustering gives very different results that do not separate the main groups as well, but rather lead to the detection of outliers. This is, of course, because its optimization criterion is very different from the others. We stick with the standard k-means solution for the following analysis. We have seen that its solution is reasonable and stable, and it does not diverge towards outlier characterization for more than two clusters, which makes it a good first choice for this sort of classification in other systems as well. In general, one has to tailor the identification of the non-stationarity to the problem at hand. We have found that correlation matrix clustering is useful in many systems [31–33].

Table 1. Silhouette coefficients for different clustering methods and solutions with two, three and four clusters.

Method                  Two clusters    Three clusters    Four clusters
k-means                 0.423           0.317             0.168
Hierarchical k-means    0.423           0.354             0.327
Ward's criterion        0.379           0.330             0.168
Complete linkage        0.451           0.452             0.345
Single linkage          0.446           0.235             0.219
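
For illustration, the pipeline described above (epoch-wise correlation matrices, the matrix distance of equation (11) and k-means clustering with k = 2) could be sketched as follows. This is a minimal sketch assuming the flow data are available as a NumPy array q_flows of shape (35, T) with one column per minute; scikit-learn's KMeans and silhouette_score stand in for the clustering algorithm and the silhouette coefficient of equations (12)–(14), and all names are illustrative.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def epoch_correlation_matrices(q, points_per_epoch=60):
    # One Pearson correlation matrix per non-overlapping 1 h epoch of 1 min data.
    n_det, n_t = q.shape
    mats = []
    for start in range(0, n_t - points_per_epoch + 1, points_per_epoch):
        window = q[:, start:start + points_per_epoch]
        window = window[:, ~np.isnan(window).any(axis=0)]   # drop time stamps with missing values
        mats.append(np.corrcoef(window))                    # normalization and correlation in one step
    return np.array(mats)

corr = epoch_correlation_matrices(q_flows)
flat = corr.reshape(len(corr), -1)                          # Euclidean distance on flattened
labels = KMeans(n_clusters=2, n_init=10).fit_predict(flat)  # matrices equals equation (11)
print('average silhouette coefficient:', silhouette_score(flat, labels))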

A closer look at figure 13 allows an interpretation of the clusters. The shown histogram counts reveal which cluster is dominant during which time of the day. The strongly correlated times in cluster 1 are caused by an overall increase (5 am–7 am) and an overall decrease (8 pm) in traffic volume. Another, less pronounced, decrease is found during the 0 am epoch. While the first two are caused by the main transition between low traffic flows during the night and high ones during the day, the latter is explained by the shift from late evening traffic with low volumes to night-time traffic with almost zero volume. As the 0 am change is not as strong, cluster 2 remains dominant here. Generally, the appearance of cluster 1 marks transition periods between times with low and high traffic volumes. Therefore, we further split the two correlation clusters: only consecutive times of a dominating correlation cluster are considered to be one cluster, and the next appearance of the same correlation cluster is taken to be a new cluster. Thereby we achieve a good separation of times with different mean values, standard deviations and correlation structures. We arrive at five clusters:

  • 1.  
    0 am–4 am,
  • 2.  
    5 am–7 am,
  • 3.  
    8 am–7 pm,
  • 4.  
    8 pm,
  • 5.  
    9 pm–11 pm.

Of course, we could merge the first and last cluster, but for simplicity we leave the division as it is.
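
The splitting into consecutive runs can be written compactly. The following is a minimal, illustrative helper operating on the sequence of hourly dominant correlation-cluster labels; it is a sketch, not part of the analysis itself.

def split_into_time_clusters(hourly_labels):
    # Split the 24 hourly correlation-cluster labels into consecutive runs,
    # so that each reappearance of a correlation cluster starts a new time cluster.
    clusters, current = [], [0]
    for h in range(1, len(hourly_labels)):
        if hourly_labels[h] == hourly_labels[h - 1]:
            current.append(h)
        else:
            clusters.append(current)
            current = [h]
    clusters.append(current)
    return clusters  # for the labels described above: five groups of hours, as listed in the text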

Next, we project all the normal workday data points into the eigenvector systems of the corresponding cluster state analogously to the mean projection in equation (2). This yields the reference normal distributions for the calculation of the Hellinger distance.

We then choose a random normal workday and a random Sunday. The workday is needed to establish which Hellinger distances appear purely due to fluctuations, even without any change. Hence, each hour of the two days is projected into the corresponding eigenvectors and the resulting distributions are compared with the reference normal distributions by means of the Hellinger distance. Within each cluster c we obtain a maximum Hellinger distance for the normal data, $H_{c}^{\mathrm{norm}}$. We define this as the threshold a test data point needs to exceed to be called a system change. This is a rather strict definition minimizing false positives for system changes or anomalies. For the weekend data we calculate $H_c^{\mathrm{anom}}$ once as the maximum (case (a)) and once as the mean (case (b)) over the daily Hellinger distances. Thus, we introduce the exceedance of the threshold

Equation (15)

as a useful measure for detectability in cluster c. The overall detectability is then taken as the maximum over all clusters. This yields a detectability performance for each eigenvector. To determine whether accounting for the non-stationarity is necessary, we perform the same analysis without clustering, i.e. with only one cluster. The results for these calculations are shown in figures 14 and 15. Detection is obviously easier with knowledge of the non-stationarity. Without it, some eigenvectors are not usable for detection at all (values below 0). This is especially true for case (b). With clusters and a careful choice of the correct eigenvectors, we detect the change not only in a single data point but in the average performance of at least one cluster. In general, all eigenvectors show a smaller detection performance without clusters. This comparison is not always perfectly fair, as there is no guarantee that eigenvector j in the no-cluster analysis corresponds to the same behavior as eigenvector j in the cluster analysis. This was neglected in the simulation study in the previous sections as it does not matter in the Monte Carlo average. In figures 16 and 17 we show the same type of analysis once more, but instead of a random weekend day we take the start of the summer bank holidays. Here, the mentioned fact is even more evident: for the system as a whole, detectabilities are higher with clusters, but for single eigenvectors this might not be the case. For example, in this case eigenvector 22 in the analysis with clusters and eigenvector 22 in the one without clusters do not have the same structure, i.e. they represent different behavior patterns of the multivariate system.
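
For concreteness, the per-cluster detection step described above can be sketched as follows. This is a minimal sketch in which the exceedance is taken simply as the difference between the test-day value and the threshold; the precise definition in equation (15) may differ, and all names and data layouts are illustrative.

import numpy as np

def detectability_per_eigenvector(h_normal, h_test, use_mean=False):
    # h_normal, h_test: dicts mapping cluster label -> array of shape
    # (hours in cluster, number of eigenvectors) with the hourly Hellinger
    # distances of the reference workday and of the test day.
    exceedances = []
    for c in h_normal:
        h_norm = h_normal[c].max(axis=0)       # detection threshold H_c^norm
        if use_mean:
            h_anom = h_test[c].mean(axis=0)    # case (b): mean over the day
        else:
            h_anom = h_test[c].max(axis=0)     # case (a): maximum over the day
        exceedances.append(h_anom - h_norm)    # exceedance of the threshold
    return np.max(exceedances, axis=0)         # best cluster per eigenvector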

Figure 14. Detectability of a weekend day in case (a), where detectability is defined via a single exceedance of the detection threshold during the day. The detectability is shown for each eigenvector (principal component) of the system. Results are shown with knowledge about the non-stationarity during the day, i.e. with clusters, and without.

Figure 15. Detectability of a weekend day in case (b), where detectability is defined as an average exceedance of the threshold in one cluster. The detectability is shown for each eigenvector (principal component) of the system. Results are shown with knowledge about the non-stationarity during the day, i.e. with clusters, and without.

Figure 16. Detectability of the summer bank holiday onset in case (a), where detectability is defined via a single exceedance of the detection threshold during the day. The detectability is shown for each eigenvector (principal component) of the system. Results are shown with knowledge about the non-stationarity during the day, i.e. with clusters, and without.

Figure 17. Detectability of the summer bank holiday onset in case (b), where detectability is defined as an average exceedance of the threshold in one cluster. The detectability is shown for each eigenvector (principal component) of the system. Results are shown with knowledge about the non-stationarity during the day, i.e. with clusters, and without.


Altogether, we see a much better change detection performance in this example when the non-stationarity is accounted for. In particular, if we want to detect the change not only in single data points (case (a)) but with a consistently high indicator (case (b)), detection is impossible without clusters. This follows from the fact that all detectability values without clusters are negative in case (b) (compare figures 15 and 17). While the case with clusters of course uses fewer data points to calculate the mean per cluster, the change can clearly be identified in one of the clusters. Without clusters there might be single points exceeding the detection threshold, but overall the Hellinger distances do not lie above it.

The trend that changes are detected more easily in projections onto eigenvectors with small eigenvalues cannot be confirmed by a single example system. Which eigenvector is most suitable for detection depends on the interaction between correlation structure and system change. The aforementioned trend therefore only holds on average over many systems.

6. Conclusion

We have studied the sensitivity for change detection of principal components in non-stationary, correlated systems with multiple time series measurements. The non-stationarity was defined as the possibility for the system to be in multiple, distinctly different normal states prior to the change. Our study was based on the one conducted by Tveten [16] for stationary time series. Accordingly, the changes to be detected were changes in the correlation structure, the mean values or the standard deviations of the time series. For simplicity, we restricted the study to scenarios in which the normal states differ in the same property in which the change also occurs. We analyzed how the detectability of the change varies for each principal component depending on the knowledge of the non-stationarity.

We found that, in general, knowledge about the non-stationarity increases the sensitivity for change detection. The increase depends on the principal component onto which the data are projected. This dependence is quite similar to the one already found for the pure change detection sensitivity in a stationary system. This means that where the sensitivity for changes is highest, the increase gained by knowledge about the non-stationarity is also highest. Usually this is the case for the minor components, i.e. the eigenvectors associated with small eigenvalues. This is reasonable, as they represent behavior that is entirely unusual for a system in its normal state. When a change occurs, it is seen very easily in those projections. If they are mixed up due to multiple normal states, this clear signature of changes is diminished. This underlines the importance of non-stationarity for direct uses of PCA for change detection and is confirmed in the traffic flow example.

The use of PCA for dimensionality reduction, keeping only projections onto the major components, will probably be less affected by this. This is an interesting aspect to study in the future. Other methods that depend on eigenvectors for change detection (e.g. the Mahalanobis distance) were not studied directly, but we think it likely that their sensitivity to changes could also be increased by considering multiple normal states. Our results hold for all three types of changes and states. We further analyzed two scenarios: unrelated and related normal states. We found that the sensitivity increase is usually greater for unrelated states, which is reasonable as non-stationarity has a stronger effect in this case. We want to point out once again that the use of multiple normal states for change detection in real applications is only possible if a criterion can be found to identify which normal states new data should be compared with. The purpose of the present study was mainly to develop the concepts and to provide the necessary tools. The traffic flow example shows that it is possible to transfer the idea to real world data. This opens up applicability in many systems where PCA and related methods are used for change detection. While traffic is among these systems [12], other prominent examples for fault detection are chemical plants [43] and industrial machinery [9, 44]. A first step towards including multiple operational conditions in PCA-based failure detection for heat pumps was already undertaken by Zhang et al [45]. We intend to test the proposed method on wind turbines, where multiple normal operational conditions based on correlation matrix clustering have already been found [33] and PCA is being used for fault detection [10].

Our results clearly show that non-stationarity should be taken into account if one undertakes change, novelty or failure detection using PCA.

Acknowledgment

We acknowledge fruitful conversations with Joachim Peinke, Matthias Wächter, Christian Phillipp, Timo Lichtenstein and Anton Heckens. This study was carried out in the project 'Wind farm virtual Site Assistant for O&M decision support—advanced methods for big data analysis' (WiSAbigdata) funded by the Federal Ministry for Economic Affairs and Energy, Germany (BMWi). One of us (H M B) is grateful for financial support within this project.

Footnotes

  • We discussed in section 3 that simply taking the average is probably closest to a real-use case. In the case of non-changing means between the states and with every state appearing for the same amount of time, this actually gives the correct variance over all states.
