The HLS-II alarm system optimization for removing nuisance alarms

Hefei Light Source II (HLS-II) is a vacuum ultraviolet synchrotron light source. The HLS-II alarm system is responsible for monitoring the alarms of the process variables and distributing the alarm events in time. Nuisance alarms reduce the functionality, credibility and trustworthiness of the alarm system. This paper proposes a method for designing the alarm deadband and delay timers so that nuisance alarm events are reduced to an expected ratio. An optimal deadband width is calculated to reduce the alarm events while balancing two effects: the decrease in the number of alarm events and the increase in their duration. If the expected ratio is not met with the optimal deadband width, delay timers are set in addition. The Bayesian estimation approach is used to estimate the probability that the delay timers eliminate the alarm events. The method is based on the statistical properties of the process variables and can effectively remove nuisance alarm events. Two examples from the HLS-II alarm system are provided to illustrate the proposed method. In the two examples, the optimal deadband removed 99.27% and 88.92% of the nuisance alarms, respectively.


Introduction
Hefei Light Source II (HLS-II) is a VUV and soft X-ray synchrotron light source. It consists of an 800 MeV linac, an 800 MeV storage ring and a transport line [1,2]. The Experimental Physics and Industrial Control System (EPICS) is the most widely used development platform for building distributed soft real-time control systems for large particle accelerators. The control system of HLS-II is a distributed system based on EPICS [3]. The alarm system is an indispensable component of the control system: it detects the alarms of the various devices and distributes the alarm events in time [4].
However, nuisance alarms are often observed due to noise and disturbance. They distract the operators from noticing real abnormalities and reduce the functionality of the alarm system. The HLS-II alarm system has been deployed since November 2021 and contains 1125 process variables [5]. As of June 2023, 1382740 alarms had been stored in the alarm historical database, including 1046638 threshold alarms. The average number of alarms generated per day is in the range of 1500-2000, exceeding the standard of 150 per day in "Recommended Alarm System Key Performance Indicators" [4]. These alarms are sent through the distributed messaging platform to multiple pieces of upper-layer software, including the real-time alarm GUIs, the alarm historical database and the message distributor, where they are processed, displayed and stored. Because of the large number of process variables and the interference of monitoring noise, many of these alarms are nuisance alarms, such as chattering alarms and fleeting alarms. It is therefore important to analyze the raw alarms based on the statistical properties of the process variables and to provide only the real alarms to the upper-layer software, in order to build a high-performance alarm system [6].
The alarm deadband is a hysteresis field of a process variable in EPICS. It reduces the excessive nuisance alarms caused by process variables hovering around alarm thresholds. As shown in figure 1(a), the alarm clears only if the process variable falls below the green deadband interval. "m-sample" refers to $m$ consecutive samples; the value of $m$ is 5 in figure 1(b). The m-sample delay timer raises (clears) alarms if and only if $m$ consecutive samples are in the alarm (non-alarm) state [7]. The alarm deadband and the delay timer have been widely used in practice to remove nuisance alarms in the alarm systems of industrial process facilities [8]. However, the alarm deadband field is not currently used in the HLS-II alarm system, and the delay timer is set empirically.
Some statistical methods and techniques have been developed to design the alarm deadbands and the delay timers in the alarm systems of industrial process facilities [9][10][11]. Figure 1(b) shows the alarm duration and alarm deviation for a continuous alarm. In [9], alarm duration and alarm deviation are normalized and then compared; the comparison reflects the effectiveness of the deadband in eliminating nuisance alarms. In [10], a generalized delay timer was designed that generates and clears an alarm based on $n_1$ out of $n$ consecutive samples and additional alarm thresholds. Whereas figure 1(b) illustrates alarms with a single alarm threshold and a fixed value of $m$, the generalized delay timer sets multiple alarm thresholds and variable delay values. In [11], a serial alarm system is composed of an alarm deadband serially followed by a delay timer to remove false alarms.
Despite the differences between industrial process facilities and accelerator facilities, the general principles and processes for the management of alarm systems are similar. The IEC 62682/ISA 18.2 industrial standard on Management of Alarm Systems for the Process Industries describes internationally recognized good engineering practice for control system alarm management [12,13]. Some particle accelerators embrace the industry standard in alarm management, such as the Canadian Light Source, the ISIS Neutron and Muon Source and the ALBA Synchrotron [14][15][16][17]. The alarm statistical methods can be applied to the HLS-II accelerator facility to improve the alarm system performance.
This paper proposes a method for designing the alarm deadband and delay timers so that nuisance alarm events are reduced to an expected ratio. Compared with the delay timer, the alarm deadband adds no latency to the detection of abnormal conditions, but it delays the time at which recovery from an abnormal condition is reported. The deadband design considers both effects and then obtains the optimal value. The users set an expected ratio of alarm events to be removed. If the ratio is not met with the optimal deadband width, delay timers are set in addition. The Bayesian estimation approach is used to estimate the probability that the delay timers eliminate the alarm events.
This paper establishes the mathematical model of the alarm systems based on EPICS, laying the foundation for the analysis of historical alarm data. The method is applicable to various types of process variables, such as pressure, flow, and temperature. In addition, it does not require the restrictive identically distributed assumption.
The rest of the paper is arranged as follows. Section 2 describes the problem to be addressed. Section 3 presents the details of the proposed method. Section 4 provides HLS-II alarm system examples as illustration. The paper ends with a conclusion in section 5.

Problem description
Consider an analog process variable with continuous values and its historical data samples $\{x(t)\}_{t=1}^{N}$, where $t$ is the sampling index and $h$ is the sampling period. The alarm variable $x_a(t)$ is set to '1' for the alarm state and '0' for the normal state. For a high-alarm threshold $x_{th}$, the alarm variable is
\[
x_a(t) = \begin{cases} 1, & x(t) \ge x_{th}, \\ 0, & x(t) < x_{th}. \end{cases} \tag{2.1}
\]
If $x_{th}$ is a low-alarm threshold, the counterpart of eq. (2.1) can be obtained analogously. In the sequel, only the high-alarm threshold is considered, for simplicity of notation.
The alarm deadband is a hysteresis field used to avoid generating too many nuisance alarms. If $x(t)$ is under normal condition and approaches the high-alarm threshold $x_{th}$, an alarm occurs when $x(t)$ reaches $x_{th}$. The alarm then clears only if $x(t)$ drops below $x_{th}$ by more than the alarm deadband width $d$. EPICS provides the HYST field to configure the deadband width. The alarm variable configured with the deadband width $d$ is
\[
x_{a,d}(t) = \begin{cases} 1, & x(t) \ge x_{th}, \\ 0, & x(t) < x_{th} - d, \\ x_{a,d}(t-1), & \text{otherwise}. \end{cases}
\]
Obviously, the alarm deadband delays the time at which $x_{a,d}(t)$ returns to '0'. The alarm duration is the time between the occurrence and the clearance of an alarm event. Therefore, the alarm deadband reduces the number of alarm events but results in longer nuisance alarm durations in $\{x(t)\}_{t=1}^{N}$.
The m-sample delay timer raises (clears) an alarm if and only if $m$ consecutive samples of $x_a(t)$ are in the alarm (non-alarm) state:
\[
x_{a,m}(t) = \begin{cases} 1, & x_a(t-m+1) = \dots = x_a(t) = 1, \\ 0, & x_a(t-m+1) = \dots = x_a(t) = 0, \\ x_{a,m}(t-1), & \text{otherwise}. \end{cases}
\]
The expected detection delay (EDD) is the latency of detecting abnormal conditions, and it increases with the delay timer value $m$. Thus, the urgency of the alarm should be considered when using the delay timer. The alarm deadband takes an analog process variable as input and produces a digital variable as output, while delay timers can handle both analog and digital variables as input. For an analog process variable, if the alarm deadband is insufficient for removing nuisance alarms, serially connected delay timers are adopted. For a digital variable, the delay timers can be used alone.
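The two mechanisms described above can be sketched in a few lines of Python. This is a minimal illustration rather than the HLS-II implementation: the function names `alarm_with_deadband` and `delay_timer`, the sample values, and the assumption that the variable starts in the normal state are all hypothetical.

```python
from typing import List, Optional, Sequence


def alarm_with_deadband(x: Sequence[float], x_th: float, d: float = 0.0) -> List[int]:
    """High-alarm state with hysteresis: raise at x >= x_th, clear at x < x_th - d."""
    state = 0  # assumption: the variable starts in the normal state
    out = []
    for value in x:
        if value >= x_th:
            state = 1
        elif value < x_th - d:
            state = 0
        # otherwise the value lies inside the deadband and the state is held
        out.append(state)
    return out


def delay_timer(x_a: Sequence[int], m: int) -> List[int]:
    """m-sample delay timer: switch state only after m consecutive equal samples."""
    state = 0  # assumption: the output starts in the normal state
    run_value: Optional[int] = None
    run_length = 0
    out = []
    for a in x_a:
        if a == run_value:
            run_length += 1
        else:
            run_value, run_length = a, 1
        if run_length >= m:
            state = a
        out.append(state)
    return out


# Hypothetical samples hovering around a threshold of 1.0
x = [0.8, 1.1, 0.95, 1.05, 0.97, 0.7, 1.2, 1.3, 1.25, 0.6]
print(alarm_with_deadband(x, x_th=1.0))            # no deadband
print(alarm_with_deadband(x, x_th=1.0, d=0.1))     # deadband width 0.1
print(delay_timer(alarm_with_deadband(x, x_th=1.0), m=3))
```

With these made-up samples, the deadband merges the chattering at the start into one longer alarm, while the 3-sample delay timer suppresses it entirely and raises the later alarm two samples late, which illustrates the trade-off between alarm duration and detection delay.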
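Operators usually need to acknowledge the occurrence of an alarm event rather than each data sample $x_a(t)$ in the same alarm event. Thus, the ratio of removed alarm events is the main concern of operators. Each alarm event corresponds to an alarm occurrence and an alarm clearance. The alarm occurrence $x_{a,o}(t)$ is the event that the alarm state $x_a(t)$ switches from '0' to '1', that is,
\[
x_{a,o}(t) = \begin{cases} 1, & x_a(t-1) = 0 \text{ and } x_a(t) = 1, \\ 0, & \text{otherwise}. \end{cases}
\]
As the counterpart of the alarm occurrence, the alarm clearance is
\[
x_{a,c}(t) = \begin{cases} 1, & x_a(t-1) = 1 \text{ and } x_a(t) = 0, \\ 0, & \text{otherwise}. \end{cases}
\]
The number of alarm events in $\{x(t)\}_{t=1}^{N}$ is $\sum_{t=1}^{N} x_{a,o}(t)$. When the deadband and the delay timers are used, the number of remaining alarm events is $\sum_{t=1}^{N} x_{a,o,d,m}(t)$. Thus, the left alarm events ratio in $\{x(t)\}_{t=1}^{N}$ is
\[
r(d, m) = \frac{\sum_{t=1}^{N} x_{a,o,d,m}(t)}{\sum_{t=1}^{N} x_{a,o}(t)}.
\]
The value of $r(d, m)$ is in the range [0, 1]. If $r(d, m)$ is close to 0, most of the nuisance alarm events are removed by the alarm deadband and delay timers. In this context, the data samples $\{x(t)\}_{t=1}^{N}$ are known to be under normal conditions, so all alarms are nuisance alarms.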
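As a hedged illustration of the event count and the left ratio defined above, the sketch below counts 0-to-1 transitions in an alarm sequence. The helper names `count_alarm_events` and `left_ratio` and the two short sequences are hypothetical, and the state before the first sample is assumed to be normal.

```python
from typing import Sequence


def count_alarm_events(x_a: Sequence[int]) -> int:
    """Count alarm occurrences, i.e. transitions of the alarm state from 0 to 1."""
    previous = 0  # assumption: the state before the first sample is normal
    count = 0
    for a in x_a:
        if previous == 0 and a == 1:
            count += 1
        previous = a
    return count


def left_ratio(raw: Sequence[int], processed: Sequence[int]) -> float:
    """Ratio of alarm events remaining after deadband and/or delay timers."""
    raw_events = count_alarm_events(raw)
    if raw_events == 0:
        return 0.0  # nothing to remove
    return count_alarm_events(processed) / raw_events


raw = [0, 1, 0, 1, 0, 0, 1, 1, 1, 0]        # alarm variable without any processing
processed = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # same data after a hypothetical delay timer
print(count_alarm_events(raw), left_ratio(raw, processed))
```

Counting transitions rather than samples matches the operator's view: a long alarm and a one-sample alarm each require one acknowledgement.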
For illustration, figure 2(a) shows the samples $\{x(t)\}_{t=1}^{50}$ configured with the high-alarm threshold $x_{th}$ and the alarm deadband width $d$. Figure 2(b) presents the 7 alarm occurrences in these data samples without the alarm deadband and the delay timers. Figure 2(c) shows that if the alarm deadband width $d$ is configured, 4 alarms are left. If the alarm deadband width $d$ and the delay timers with $m = 5$ are set together, 1 alarm is left in figure 2(d), and the left alarm events ratio is $r(d, m) = 1/7 = 14.29\%$. The objective of this paper is to reduce alarm events so that the left ratio of nuisance alarm events in $\{x(t)\}_{t=1}^{N}$ is lower than $r_0$, where $r_0$ is a user-selected bound; $r_0 = 0.05$ is used in this paper. To achieve this objective, the alarm deadband and delay timers will be designed. It is assumed that the statistical characteristics of the process variables remain consistent.

The proposed method
The main idea and the steps of the proposed method are presented in this section. Reducing EDD leaves more time for operators to handle abnormal conditions. Thus, the alarm deadband is the preferred method for removing nuisance alarms. A loss function is designed to balance the effect of reducing the number of alarm events against that of increasing the alarm duration. An optimal deadband width $d_{opt}$ is obtained from the loss function. The resulting ratio $r(d_{opt})$ is then checked against the user's expected ratio $r_0$. If $r_0$ is not met, delay timers are set in addition.

Optimal alarm deadband design
The alarm system usually provides a human-machine interface (HMI) that displays the process variables in an abnormal state so that the operators can track the alarms. The alarm deadband may increase the time for which such process variables are displayed on the HMI. We take this negative impact into account in the design of the loss function.
The performance index $D(d)$ is defined as the ratio of the duration of the alarm events to the total sampling duration. For the alarm variable without the alarm deadband in eq. (2.1), the initial ratio is
\[
D(0) = \frac{1}{N} \sum_{t=1}^{N} x_a(t).
\]
The counterpart ratio with the deadband width $d$ is
\[
D(d) = \frac{1}{N} \sum_{t=1}^{N} x_{a,d}(t).
\]
So $D(d)$ increases with $d$ and represents the negative effect brought by the alarm deadband. In order to search for $d$ within a reasonable range, we determine the maximum value of the alarm deadband, $d_0$, as follows. Sort the data set $\{x(t) - \bar{x}\}_{t=1}^{N}$ in ascending order and take its 99th percentile as $d_0$. The 99th percentile is used to avoid the influence of outliers. $d_0$ represents the maximum fluctuation of the process variable, and in theory the alarm deadband should not exceed this value.
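The two quantities in this paragraph are straightforward to compute. The sketch below is illustrative only: the names `duration_ratio` and `max_deadband` are hypothetical, the centering value is assumed to be the sample mean, and the nearest-rank percentile is one possible convention, since the text does not state which percentile definition is used.

```python
import math
from typing import Sequence


def duration_ratio(x_a: Sequence[int]) -> float:
    """Fraction of samples spent in the alarm state (alarm duration / total duration)."""
    return sum(x_a) / len(x_a)


def max_deadband(x: Sequence[float], percentile: float = 99.0) -> float:
    """Nearest-rank percentile of the deviations x(t) - mean(x), sorted ascending."""
    mean = sum(x) / len(x)  # assumption: deviations are taken from the sample mean
    deviations = sorted(value - mean for value in x)
    rank = math.ceil(len(deviations) * percentile / 100.0)  # 1-based nearest rank
    return deviations[max(rank, 1) - 1]


samples = [float(i) for i in range(1, 101)]  # hypothetical data: 1.0, 2.0, ..., 100.0
print(duration_ratio([0, 1, 1, 0]), max_deadband(samples))
```

Using a high percentile instead of the maximum keeps a single outlier from inflating the search range for the deadband width.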
The positive performance index $r(d)$ and the negative performance index $D(d)$ are considered together. The optimal deadband width is designed as
\[
d_{opt} = \arg\min_{d} L(d), \qquad L(d) = \omega \, r(d) + (1 - \omega) \, \frac{D(d)}{D(d_0)}. \tag{3.3}
\]
The notation $\arg\min_{d}$ returns the value of $d$ that minimizes $L(d)$. The physical meaning of eq. (3.3) is as follows. $r(d)$ is the left ratio of the number of alarm events when the deadband $d$ is used, so $r(d)$ decreases as $d$ increases and shows the positive effect. $D(d)$ is the ratio of the duration of the alarm events with the alarm deadband $d$ to the total sampling duration, so $D(d)/D(d_0)$ increases with $d$ and represents the negative effect brought by the alarm deadband. $\omega$ is the balance factor and its range is (0, 1). The value of $\omega$ depends on the user's demand scenario for the alarm system. In this context there are no preferences, so $\omega = 0.5$.
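A grid search over candidate deadband widths is one simple way to evaluate the loss function; the sketch below assumes that approach. The function names, the candidate list and the sample data are hypothetical, and the placement of the balance factor (`omega` on the event ratio, `1 - omega` on the normalized duration) is an assumption that makes no difference at the value 0.5 used in this paper.

```python
from typing import Iterable, List, Sequence, Tuple


def alarm_states(x: Sequence[float], x_th: float, d: float) -> List[int]:
    """Alarm variable with hysteresis: raise at x >= x_th, clear at x < x_th - d."""
    state, out = 0, []
    for value in x:
        if value >= x_th:
            state = 1
        elif value < x_th - d:
            state = 0
        out.append(state)
    return out


def count_events(x_a: Sequence[int]) -> int:
    """Number of 0 -> 1 transitions (alarm occurrences)."""
    previous, count = 0, 0
    for a in x_a:
        if previous == 0 and a == 1:
            count += 1
        previous = a
    return count


def optimal_deadband(x: Sequence[float], x_th: float, candidates: Iterable[float],
                     d_0: float, omega: float = 0.5) -> Tuple[float, float]:
    """Return (d_opt, loss) minimizing omega*r(d) + (1-omega)*D(d)/D(d_0) on a grid."""
    raw_events = count_events(alarm_states(x, x_th, 0.0))
    reference = sum(alarm_states(x, x_th, d_0)) / len(x)  # D(d_0)
    if raw_events == 0 or reference == 0:
        raise ValueError("no alarm events in the data")
    best_d, best_loss = None, float("inf")
    for d in candidates:
        x_ad = alarm_states(x, x_th, d)
        r = count_events(x_ad) / raw_events   # left ratio of alarm events
        duration = sum(x_ad) / len(x)         # D(d)
        loss = omega * r + (1.0 - omega) * duration / reference
        if loss < best_loss:                  # ties keep the smaller deadband
            best_d, best_loss = d, loss
    return best_d, best_loss


# Hypothetical chattering signal around a threshold of 1.0
x = [0.5, 1.02, 0.97, 1.01, 0.96, 1.03, 0.98, 1.02, 0.5, 0.5, 0.5, 0.5]
d_opt, loss = optimal_deadband(x, x_th=1.0, candidates=[0.0, 0.05, 0.6], d_0=0.6)
print(d_opt, round(loss, 3))
```

In this toy data a small deadband already merges the four chattering alarms into one, and a much wider deadband only lengthens the alarm without removing further events, so the loss selects the small width.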
Then, $r(d_{opt})$ is verified. If $r(d_{opt}) < r_0$, the user's expected ratio $r_0$ is met and the alarm deadband alone is effective in solving the nuisance alarm problem of the process variable. If $r(d_{opt}) \ge r_0$, the alarm deadband and the delay timers need to be designed jointly.

Joint design of alarm deadband and delay timers
With the alarm deadband $d_{opt}$ in use, the alarm variables $\{x_{a,d_{opt}}(t)\}_{t=1}^{N}$ are fed into the delay timers. The alarm events with durations shorter than $m$ are removed by the delay timers. The optimal delay timer value $m_{opt}$ is the one whose joint ratio $r(d_{opt}, m_{opt})$ is closest to $r_0$ while satisfying $r(d_{opt}, m_{opt}) < r_0$. Eq. (3.4) gives the probability $p(m)$ that the delay timers remove the alarms. It is an approximate value based on frequency. In order to verify its reliability, we use the Bayesian estimation method to estimate $p(m)$ in eq. (3.4) and its corresponding confidence interval.
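One way to read the selection rule is to take the smallest delay value whose joint ratio falls below the bound: because the number of surviving events cannot increase with the delay value, that choice is also the one closest to the bound from below. The sketch below follows this reading and the simplification stated above that events shorter than the delay value are removed. The names `event_durations` and `select_delay`, the cap `m_max` and the numbers are hypothetical.

```python
from typing import List, Optional, Sequence


def event_durations(x_a: Sequence[int]) -> List[int]:
    """Lengths, in samples, of each run of consecutive 1s in an alarm sequence."""
    durations, run = [], 0
    for a in x_a:
        if a == 1:
            run += 1
        elif run > 0:
            durations.append(run)
            run = 0
    if run > 0:
        durations.append(run)  # an event still active at the end of the data
    return durations


def select_delay(durations: Sequence[int], raw_events: int,
                 r_0: float, m_max: int) -> Optional[int]:
    """Smallest m in 1..m_max whose joint left ratio is below r_0, else None."""
    for m in range(1, m_max + 1):
        remaining = sum(1 for duration in durations if duration >= m)
        if remaining / raw_events < r_0:
            return m
    return None


# Hypothetical alarm variable after the deadband; 10 raw events before the deadband
x_ad = [1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0]
durations = event_durations(x_ad)
print(durations, select_delay(durations, raw_events=10, r_0=0.25, m_max=10))
```

Choosing the smallest admissible delay value keeps the added detection delay as low as the target ratio allows.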
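In Bayesian estimation, $p(m)$ is treated as a continuous random variable in the range [0, 1], denoted as $\Theta_m$. $K_m$ is defined as the number of alarm events remaining after the delay timer, and $n_e$ is the number of alarm events in $\{x_{a,d_{opt}}(t)\}_{t=1}^{N}$. We consider that the remaining number of alarm events follows a binomial distribution with success probability $\Theta_m$. Since the alarm variables are mutually independent, $K_m$ follows the binomial distribution with probability of success $\Theta_m$ among $n_e$ independent trials. Given the specific value $\theta_m$ of $\Theta_m$, the conditional probability of $K_m = k_m$ is
\[
P(K_m = k_m \mid \Theta_m = \theta_m) = \binom{n_e}{k_m} \theta_m^{k_m} (1 - \theta_m)^{n_e - k_m}.
\]
Assuming there is no additional prior information, the prior distribution of the continuous random variable $\Theta_m$ is taken to be uniform. Its probability density function is
\[
f_{\Theta_m}(\theta_m) = 1, \qquad 0 \le \theta_m \le 1.
\]
According to the chain rule of conditional probability, the joint probability distribution of $K_m$ and $\Theta_m$ is
\[
f(k_m, \theta_m) = P(K_m = k_m \mid \Theta_m = \theta_m) \, f_{\Theta_m}(\theta_m) = \binom{n_e}{k_m} \theta_m^{k_m} (1 - \theta_m)^{n_e - k_m}.
\]
For the given $k_m$, the Bayesian posterior distribution is
\[
f(\theta_m \mid k_m) = \frac{f(k_m, \theta_m)}{\int_0^1 f(k_m, \theta) \, d\theta} = \frac{(n_e + 1)!}{k_m! \, (n_e - k_m)!} \, \theta_m^{k_m} (1 - \theta_m)^{n_e - k_m},
\]
which is a Beta distribution with parameters $k_m + 1$ and $n_e - k_m + 1$. The estimated value of $\Theta_m$ is its expectation. According to the posterior distribution function, the estimated value of $\Theta_m$ is
\[
\hat{\theta}_m = E[\Theta_m \mid K_m = k_m] = \frac{k_m + 1}{n_e + 2}.
\]
For a given significance level of 0.05 (that is, a confidence level of 95%), the confidence interval $[\hat{\theta}^{-}, \hat{\theta}^{+}]$ can be decided from
\[
\int_{\hat{\theta}^{-}}^{\hat{\theta}^{+}} f(\theta_m \mid k_m) \, d\theta_m = 1 - 0.05.
\]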
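With a uniform prior and a binomial likelihood, the posterior is a Beta distribution, so the estimate and its interval can be evaluated numerically. The sketch below uses only the Python standard library and a midpoint-rule grid; the function name `posterior_summary`, the grid size and the equal-tailed choice of interval are assumptions made for illustration, not details taken from the paper.

```python
import math
from typing import Optional, Tuple


def posterior_summary(k: int, n: int, alpha: float = 0.05,
                      grid: int = 20000) -> Tuple[float, Optional[float], Optional[float]]:
    """Posterior mean and equal-tailed interval for a binomial probability.

    With a uniform prior, k successes in n trials give a Beta(k+1, n-k+1) posterior.
    The interval bounds are accurate to roughly the grid spacing 1/grid.
    """
    a, b = k + 1, n - k + 1
    mean = a / (a + b)  # equals (k + 1) / (n + 2)
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    step = 1.0 / grid
    thetas = [(i + 0.5) * step for i in range(grid)]  # cell midpoints avoid log(0)
    density = [
        math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1.0 - t))
        for t in thetas
    ]
    total = sum(density)
    cumulative, lower, upper = 0.0, None, None
    for theta, weight in zip(thetas, density):
        cumulative += weight / total
        if lower is None and cumulative >= alpha / 2:
            lower = theta
        if upper is None and cumulative >= 1 - alpha / 2:
            upper = theta
            break
    return mean, lower, upper


# Hypothetical counts: 5 events of the counted kind among 10 trials
print(posterior_summary(5, 10))
```

A narrow interval indicates that the frequency-based probability is well supported by the number of alarm events observed; with few events the interval widens accordingly.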

Application in HLS-II
This section provides two examples from HLS-II to illustrate the proposed method for designing the alarm deadband and delay timers. The data for the examples come from the HLS-II historical database [18].
The first example uses only alarm deadband and the second example adopts the joint design of serially-connected alarm deadband and delay timers.

Example 1
The process variable $x(t)$ is the vacuum pressure of the in-vacuum undulator of the HLS-II storage ring, in units of Pa. It is configured with a high-alarm threshold $x_{th}$ = 1e−07 Pa. Figure 3(a) shows some historical data samples of $x(t)$ on September 27, 2022. Due to noise and disturbance, $x(t)$ fluctuates above and below $x_{th}$, so many nuisance alarms occur in the alarm variable $x_a(t)$, as shown in figure 3(b).
The optimal deadband width $d_{opt}$ is calculated according to eq. (3.3). $d_0$ = 3.96e−08 is the maximum deviation value in the historical data samples. The calculation range of $d$ is [0, 7e−08], and the calculation step of $d$ is 1e−09. As shown in figure 4(a), the blue solid curve is $0.5 \cdot r(d)$, indicating the reduction of nuisance alarm events with $d$. The yellow solid curve is $0.5 \cdot D(d)/D(d_0)$, which shows the ratio of the duration of the alarm events to the total duration as a function of $d$. According to the loss function, the optimal value $d_{opt}$ = 3e−09 is determined, which is associated with a loss value of 0.316. The user's expected ratio is $r_0 = 0.05$ and $r(d_{opt}) = 0.033$, so $r_0 \ge r(d_{opt})$ and the alarm deadband is effective in solving the nuisance alarm problem.
In order to verify the reliability of $d_{opt}$, the proposed method is applied to $\{x(t)\}_{t=1}^{n}$ for different sample sizes $n$. $n$ starts from 0 and increases in steps of 200, with a maximum value of 10,000. The optimal deadband value $d_{opt}$ is calculated for each sample size $n$. As shown in figure 4(b), when $n$ exceeds 1200, the optimal deadband value stabilizes at 3e−09. Hence, the proposed method is reliable and $d_{opt}$ = 3e−09 is trustworthy. For comparison, the delay timers are applied to $x(t)$. $r(m = 38) = 0.048$ is close to $r_0$, and $r(m = 56) = 0.032$ is close to $r(d_{opt} = \text{3e−09}) = 0.033$. So, compared with the delay timers, the alarm deadband reduces EDD by 56 seconds.
The value $d_{opt}$ = 3e−09 is then applied to the HLS-II historical data samples for the 12 months of 2022. Table 1 compares the numbers of alarm events before and after configuring $d_{opt}$. The total number of alarm events in 2022 is 429652. With the alarm deadband $d_{opt}$ = 3e−09, 3132 alarm events are left, so about 99.27% of the nuisance alarms have been removed. The user's expected ratio $r_0 = 0.05$ is met.
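As a quick check of the figures quoted above, the left ratio and the removal ratio follow directly from the two event counts (the symbols $r$, $d_{opt}$ and $r_0$ are used here as notation for the left ratio, the optimal deadband width and the user bound):

```latex
\[
r(d_{opt}) = \frac{3132}{429652} \approx 0.0073 < r_0 = 0.05,
\qquad
1 - r(d_{opt}) \approx 0.9927 = 99.27\%.
\]
```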

Example 2
The process variable $x(t)$ is the temperature measurement value of the HLS-II short lens magnet, in units of °C.
First, the optimal deadband width $d_{opt}$ is calculated. There are 16268 raw alarm events in the historical data samples. With the parameters $d_0 = 0.73$, a calculation range of [0, 0.8] for $d$ and a calculation step of 0.01, $d_{opt} = 0.11$ is obtained from the loss function, as shown in figure 7(a). With the alarm deadband $d_{opt} = 0.11$, 1802 alarm events are left in the alarm variables, so about 88.92% of the nuisance alarms have been removed. $r(d_{opt}) = 1802/16268 = 11.08\%$ does not meet the user's expected ratio $r_0$, so the joint design method of the alarm deadband and delay timers is adopted.
The proposed method is also applied to different sample sizes $n$. $n$ starts from 0 and increases in steps of 200, with a maximum value of 20,000. As shown in figure 7

Conclusion
This paper proposed a new method for designing the alarm deadband and delay timers to reduce nuisance alarms to the expected ratio at HLS-II. A mathematical model was established to analyze the historical data of the HLS-II alarm system. An optimal deadband width was obtained to achieve the best balance between the number of alarm events and the duration of alarm events. The alarm deadband reduces nuisance alarm events without increasing EDD, which leaves more time for operators to handle abnormal conditions. If the optimal deadband value cannot meet the expected alarm removal ratio, the joint design of serially connected alarm deadband and delay timers is adopted. The proposed method is a system-level method that can be used in the EPICS-based alarm systems of particle accelerators.

Figure 1. Process variable and alarm variable: (a) configured with alarm deadband, (b) configured with delay timers.

Figure 2. (a) Process variable $x(t)$ with a high-alarm threshold $x_{th}$, (b) 7 alarm events, (c) 4 alarm events with the alarm deadband $d$, (d) 1 alarm event with the alarm deadband $d$ and the $m = 5$ delay timers.

Figure 5 repeats the samples of figure 3 and compares the alarm variables after configuring $d_{opt}$.

Figure 6(a) shows some historical data samples of $x(t)$ on June 27, 2022. Many nuisance alarms occur in the alarm variable $x_a(t)$, as shown in figure 6(b).

Table 1. Numbers of alarm events for the 12 months of 2022. Columns: Month; Alarm events; Alarm events ($d_{opt}$).