Cyberattack intensity forecasting on informatization objects of critical infrastructures

Recent regulatory documents in the field of information security pay much attention to information systems of critical infrastructures. This, in turn, justifies the need for scientific research on new methods of protecting such information systems against cyberattacks. For this task, interval forecasting based on a probabilistic neural network with dynamic updating of the smoothing parameter is recommended. The naive Bayesian model and the probabilistic cluster model were chosen as benchmarks for comparing the interval forecasting results.


Introduction
In recent years, in Russia and worldwide, much attention has been paid to the security of critical infrastructures. In accordance with the federal law «On the security of the critical information infrastructure of the Russian Federation» adopted in 2017 [1], information systems (IS) are important objects of protection. These objects fall under the Decree of the President of the Russian Federation of 15.01.2013 N31s «On creation of the state system of detection, prevention and liquidation of consequences of computer attacks on information resources of the Russian Federation». In furtherance of this Decree, in December 2014 the President of the Russian Federation approved a Concept of a state system for detecting and preventing computer attacks on Russia's information resources and mitigating their consequences. In accordance with this Concept, the main functions of the system are: to identify signs of computer attacks, to determine their sources and other related information, to forecast the situation in the field of information security of the Russian Federation, to collect and analyze information about computer attacks on information resources of the Russian Federation, and to react to attacks and eliminate their consequences [2].
In 2016, the «Information security doctrine of the Russian Federation» was adopted, which notes that «the state of information security in the field of state and public security is characterized by a constant increase in complexity, an increase in a scale and growth of cyberattacks to objects of critical infrastructures» [3]. The federal law «On the security of the critical information infrastructure of the Russian Federation» [1] makes it mandatory to implement a state system for detecting and preventing cyberattacks on IS of critical infrastructures and mitigating their consequences. This once again confirms the importance and relevance of the cybersecurity of critical infrastructures. One of the most promising research directions for protecting IS of critical infrastructures against cyberattacks is the creation of cyberattack intensity forecasting methods based on machine learning [4,5]. Here, the cyberattack intensity is the total number of attacks per unit time. If a forecast indicates that the cyberattack intensity on an IS will exceed a predetermined value, additional protection measures can be taken; for example, a more detailed analysis of traffic. Both the federal law [1] and the «Concept of the state system for detecting, preventing and eliminating of consequences of computer attacks on the information resources of Russian Federation» underline the need to use forecasts in the cybersecurity field. Thus, in research related to protection against cyberattacks, in addition to assessing different risks and using traditional protection systems, attention should be paid to cyberattack intensity forecasting [6].
In the past few years, researchers have shown increasing interest in probabilistic forecasting [7,8]. Probabilistic forecasts make it possible to obtain not only forecasts of future events, but also probabilistic estimates of these events. One type of probabilistic forecasting is interval forecasting (IF) [9][10][11]: forecasting which of two predetermined intervals will contain a future value of an indicator. Probability estimates are used for this purpose. The bound dividing these intervals is calculated from statistical characteristics of the indicator.
In this paper, for forecasting cyberattack intensity on IS of critical infrastructures, it is recommended to carry out IF based on a probabilistic neural network (PNN) with dynamic updating of the smoothing parameter value [10,11]. A naive Bayesian model (NBM) and a probabilistic cluster model (PCM) were selected as benchmarks for comparing the results of IF [12].

Description and formalization of a cyberattack intensity indicator
Given that information about the cyberattack intensity on IS is confidential, we have used a publicly available indicator of cyberattack intensity instead: the number of cyberattacks per day that occurred from 1998 to 2015 in South Korea [14]. This indicator has a large volume of values, suitable for constructing various machine learning models for IF. At the same time, the chosen indicator is non-stationary with respect to the location and scale parameters, which underlines its dynamic statistical characteristics [10]. Thus, if IF of cyberattack intensity shows good results for this indicator, similar conclusions can be drawn more confidently with regard to IS.
This indicator was formalized as the time series $x_t$, $t = 1, \dots, n$ (1). Let $x_\alpha$ be the threshold of the cyberattack intensity ($x_{\min} \le x_\alpha \le x_{\max}$). The threshold of the cyberattack intensity is a value for which the probability that $x_t \le x_\alpha$ equals $\alpha$. Thus, $x_\alpha$ is the quantile of the probability distribution function of (1) for a given probability $\alpha$.
Note that in practice, for the selected indicator, it is sufficient to take only the integer part of the threshold $x_\alpha$, since the cyberattack intensity is always an integer value.
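The threshold estimate can be illustrated with a minimal Python sketch. It assumes that the piecewise linear distribution function is obtained by linear interpolation between neighbouring order statistics of the sample; this interpolation convention, the function name threshold and the sample counts are illustrative assumptions, not taken from the paper.

```python
import math


def threshold(values, alpha):
    """Integer part of the alpha-quantile of a piecewise linear empirical CDF.

    Assumption: the quantile is interpolated linearly between neighbouring
    order statistics; the paper does not state its exact convention.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    xs = sorted(values)
    pos = alpha * (len(xs) - 1)        # fractional position in the sorted sample
    lo = math.floor(pos)
    hi = min(lo + 1, len(xs) - 1)
    q = xs[lo] + (pos - lo) * (xs[hi] - xs[lo])
    return math.floor(q)               # intensities are integers, keep the integer part


# Made-up daily attack counts, for illustration only.
daily_attacks = [3, 8, 1, 12, 5, 7, 40, 2, 9, 6]
x_alpha = threshold(daily_attacks, 0.5)
```

With these made-up counts the interpolated median is 6.5, so the threshold is taken as 6.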
Further, it is proposed to perform a completely reversible transformation of the original indicator (1) into the indicator (2). Here $x_t$ is the value of the indicator at the discrete moment of time $t$; $t \in T$; $T = \{1, \dots, n\}$; $n$ is the number of values; and $x_\alpha$ is the threshold of the cyberattack intensity.
This transformation is useful for several reasons: 1) The values of the initial indicator (1) lie in a very wide range, and some (extreme) values significantly exceed the others. A logarithmic transformation makes such data and their graphs easier to inspect visually; 2) The indicator (2) contains both positive and negative values, in contrast to the indicator (1). Some forecasting models (including PNN) are sensitive to the sign of predictors and demonstrate better IF accuracy after such transformations; 3) The equivalent of the threshold $x_\alpha$ for the indicator (2) is always 0. That is, the distribution of values above and below $x_\alpha$ for the indicator (1) is identical to the distribution of positive and negative values relative to 0 for the indicator (2), and this slightly simplifies the formalization of IF for the indicator (2) without distorting the essence and interpretation of the results. Figure 1 shows the graph of the obtained indicator (2). Thus, the transformation of the indicator (1) into the indicator (2) is an integral part of the implementation of IF.
To obtain some statistical characteristics of this indicator, its class was determined by the method described in [10]. It is an indicator of the first class, non-stationary in terms of the location and scale parameters, which distinguishes its statistical nature from that of indicators of other classes [9,10].
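The properties listed above can be checked on a small Python sketch. The exact form of the transformation is not reproduced here, so the sketch assumes the simplest logarithmic form with these properties, z = ln(x / threshold), which requires strictly positive values; the function names and sample numbers are hypothetical.

```python
import math


def to_log_ratio(values, x_alpha):
    """Assumed transformation: z = ln(x / x_alpha), for x > 0 and x_alpha > 0."""
    return [math.log(x / x_alpha) for x in values]


def from_log_ratio(zs, x_alpha):
    """Inverse of the assumed transformation: x = x_alpha * exp(z)."""
    return [x_alpha * math.exp(z) for z in zs]


# Made-up intensities around a threshold of 6.
x = [2, 6, 18, 600]
z = to_log_ratio(x, 6)          # below threshold -> negative, at threshold -> 0
x_back = from_log_ratio(z, 6)   # recovers the original values up to rounding
```

Under this assumed form, a value equal to the threshold maps to 0, values above it become positive, values below it become negative, and the extreme value 600 is compressed to about 4.6.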

Formalization of cyberattack intensity interval forecasting
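At the current time $t$ it is necessary to identify the interval, $I^+$ or $I^-$ (3), in which the future (unknown) value $z_{t+\tau}$ of the indicator (2) will be located. The following estimates of probabilities are required: $p^+_{t+\tau}$ and $p^-_{t+\tau}$, where $\tau = 1, 2, \dots$ is the look-ahead period, $p^+_{t+\tau}$ is the probability that the indicator future value $z_{t+\tau} \in I^+$, and $p^-_{t+\tau}$ is the probability that the indicator future value $z_{t+\tau} \in I^-$; $p^+_{t+\tau} + p^-_{t+\tau} = 1$. Let $\hat{p}^+_{t+\tau}$ and $\hat{p}^-_{t+\tau}$ be probability estimates of $p^+_{t+\tau}$ and $p^-_{t+\tau}$. The interval forecasting is carried out according to the following rules: the future value $z_{t+\tau} \in I^+$ if $\hat{p}^+_{t+\tau} > \hat{p}^-_{t+\tau}$; and the future value $z_{t+\tau} \in I^-$ otherwise.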

Formalization of training set for learning of probabilistic models
It is necessary to consider some features of the formation of a training set for the implementation of IF. Let the $p$ most recent values of the indicator, $(z_{t-p+1}, \dots, z_t)$, be the vector of predictors. Let $y_{t+\tau}$ be a dependent variable (or a response), the true value of which is unknown and which can take only two possible values: $y_{t+\tau} = 1$ if the future value $z_{t+\tau} \in I^+$ and $y_{t+\tau} = -1$ if $z_{t+\tau} \in I^-$. Performing IF using the predictors requires making a forecast of $y_{t+\tau}$ based on probability estimates that $z_{t+\tau} \in I^+$ and $z_{t+\tau} \in I^-$. Recall that if $\hat{p}^+_{t+\tau} > \hat{p}^-_{t+\tau}$, then $y_{t+\tau} = 1$, else $y_{t+\tau} = -1$. Next, a training set (4) is created based on the values of (1) for $i = 1, \dots, m$, where $m = n - p - \tau + 1$ (this value is chosen so that the responses' values can be calculated based on pre-history values of the indicator). Here $X$ is the matrix of dimensions $m \times p$ (a training set); $Y$ is a vector of responses of size $m$ (these responses are calculated based on the pre-history values of the indicator); and $m$ is the number of training samples.
Each row $X_i$ of the matrix $X$ corresponds to the response $y_i$ of (4): $X_i \to y_i$. Using the training set (4), it is possible to build and train a forecasting model and then implement the IF.
Often the matrix of predictors $X$ is used not in pure form, but in a transformed one. For example, for PNN each row of $X$ is scaled so that the sum of squares of its values is equal to 1. For NBM and PCM, this is not necessary.
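A minimal Python sketch of how such a training set might be assembled is given below. It assumes a sliding window of p consecutive values as predictors, a response taken tau steps after the last predictor, and a series already expressed relative to a threshold of 0, so that the response is 1 for a positive value and -1 otherwise. The function name make_training_set and the sample series are hypothetical.

```python
def make_training_set(z, p, tau):
    """Build predictor rows X and responses Y from a series z.

    Assumptions: row i holds p consecutive values z[i], ..., z[i + p - 1];
    its response is 1 if the value tau steps after the last predictor is
    positive and -1 otherwise. The number of rows is m = n - p - tau + 1.
    """
    n = len(z)
    m = n - p - tau + 1
    if p < 1 or tau < 1 or m < 1:
        raise ValueError("series too short for the chosen p and tau")
    X = [list(z[i:i + p]) for i in range(m)]
    Y = [1 if z[i + p - 1 + tau] > 0 else -1 for i in range(m)]
    return X, Y


# Made-up transformed series of n = 6 values, p = 2 predictors, tau = 1.
series = [-1.0, 0.5, 2.0, -0.3, 1.0, -2.0]
X, Y = make_training_set(series, p=2, tau=1)
```

With n = 6, p = 2 and tau = 1 this gives m = 4 rows, and the last response is computed from the last value of the series, so every observed value is used.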
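The row scaling for PNN, together with the forecasting rule, can be sketched in Python as follows. This is a plain Parzen-window PNN with a Gaussian kernel, a fixed smoothing parameter sigma and class priors taken implicitly from class frequencies; it does not reproduce the dynamic updating of the smoothing parameter used in the paper, and all names and numbers are illustrative assumptions.

```python
import math


def normalize_row(row):
    """Scale a row so that the sum of squares of its values equals 1."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm > 0 else list(row)


def pnn_probabilities(X, Y, x_new, sigma):
    """Estimate the probabilities of the responses +1 and -1 for x_new.

    Each training row contributes a Gaussian kernel value to the sum of its
    own class; the two sums are then normalized to add up to 1.
    """
    sums = {1: 0.0, -1: 0.0}
    for row, y in zip(X, Y):
        d2 = sum((a - b) ** 2 for a, b in zip(row, x_new))
        sums[y] += math.exp(-d2 / (2.0 * sigma ** 2))
    total = sums[1] + sums[-1]
    if total == 0.0:                   # kernels underflowed: no evidence either way
        return 0.5, 0.5
    return sums[1] / total, sums[-1] / total


def interval_forecast(X, Y, x_new, sigma=0.3):
    """Return 1 if the estimate for the upper interval is larger, else -1."""
    Xn = [normalize_row(r) for r in X]
    p_plus, p_minus = pnn_probabilities(Xn, Y, normalize_row(x_new), sigma)
    return 1 if p_plus > p_minus else -1


# Made-up training rows and responses.
X_train = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [-0.8, -0.2]]
Y_train = [1, 1, -1, -1]
forecast_up = interval_forecast(X_train, Y_train, [2.0, 0.1])
forecast_down = interval_forecast(X_train, Y_train, [-3.0, 0.0])
```

Because every row is scaled to unit length, only the direction of a predictor vector matters, which is one reason the sign of the predictors is important for PNN.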

General algorithm of interval forecasting of cyberattack intensity
The IF algorithm in its general form consists of the following stages:
• Prepare initial data: $x_t$ (1);
• Set the parameter: $\alpha$;
• Construct a piecewise linear probability distribution function of (1) and estimate $x_\alpha$ for the selected value of $\alpha$;
• Transform (1) to (2);
• Set the parameters: $\tau$, $p$;
• Create the training set (4);
• Select a forecasting model and set its parameters (parameter values can be optimized based on the training set; for example, by cross-validation methods [15]);
• Carry out IF.
Thus, this algorithm has three parameters: $\alpha$ is the probability with which the cyberattack intensity will be below the threshold cyberattack intensity $x_\alpha$; $\tau$ is the ahead time; and $p$ is the dimension of a training set.

Interval forecasting results and prospects of their practical application
Several scores were used to analyze the IF results for cyberattack intensity. They are considered below in more detail, together with the reasons for choosing each of them.
First of all, we are interested in the accuracy with which the events $z_{t+\tau} \in I^+$ are forecast. When such a forecast is obtained, additional measures must be taken to protect against increasing cyberattacks. The more accurate such forecasts, the fewer additional protection measures will be taken by mistake (i.e. false positives). The fewer false positives, the more effective the system of protection against cyberattacks will be. For estimating the accuracy of such forecasts, it is proposed to use the score $A^+ = k^+ / K^+$ (5), where $A^+$ is the estimate of the forecasting accuracy of events $z_{t+\tau} \in I^+$, $k^+$ is the number of justified forecasts that $z_{t+\tau} \in I^+$, and $K^+$ is the total number of forecasts that $z_{t+\tau} \in I^+$, $0 \le A^+ \le 1$.
We are also interested in the accuracy with which the events $z_{t+\tau} \in I^-$ are forecast. When a forecast that $z_{t+\tau} \in I^-$ is obtained, the system of protection against cyberattacks continues to work in its regular mode. The more accurate such forecasts, the less likely it is that additional protection measures were in fact required but were not taken (i.e. false negatives). This also affects the effectiveness of protection systems against cyberattacks. To estimate the accuracy of such forecasts, the following score was used: $A^- = k^- / K^-$ (6), where $A^-$ is the estimate of the forecasting accuracy of events $z_{t+\tau} \in I^-$, $k^-$ is the number of justified forecasts that $z_{t+\tau} \in I^-$, and $K^-$ is the total number of forecasts that $z_{t+\tau} \in I^-$, $0 \le A^- \le 1$.
Thus, the larger both values $A^+$ and $A^-$, the better. It should be noted that the forecasting model should forecast both variants of events: $z_{t+\tau} \in I^+$ and $z_{t+\tau} \in I^-$. For example, a model that gives $A^+ = 0.75$ and $A^- = 0.80$ is preferable to one that gives $A^+ = 0.55$ and $A^- = 0.95$. This makes it possible to determine the final score (7) characterizing the accuracy of the IF based on any selected model. Here $A^+$ is the estimate of the forecasting accuracy of events $z_{t+\tau} \in I^+$ (5), and $A^-$ is the estimate of the forecasting accuracy of events $z_{t+\tau} \in I^-$ (6). The larger the value of (7), the more accurate the IF.
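The two per-interval scores (5) and (6) can be computed as in the following minimal Python sketch. Responses are coded as 1 for the upper interval and -1 for the lower one; returning 0.0 when a model never forecasts one of the intervals is an assumption of this sketch, and the final score (7) is deliberately not implemented. The function name and data are hypothetical.

```python
def interval_scores(y_true, y_pred):
    """Return the share of justified forecasts for each interval.

    The first value is the share of forecasts of the upper interval (+1) that
    were justified, the second is the same share for the lower interval (-1).
    Assumption: a score is 0.0 if that interval was never forecast.
    """
    total_plus = sum(1 for f in y_pred if f == 1)
    total_minus = sum(1 for f in y_pred if f == -1)
    ok_plus = sum(1 for t, f in zip(y_true, y_pred) if f == 1 and t == 1)
    ok_minus = sum(1 for t, f in zip(y_true, y_pred) if f == -1 and t == -1)
    score_plus = ok_plus / total_plus if total_plus else 0.0
    score_minus = ok_minus / total_minus if total_minus else 0.0
    return score_plus, score_minus


# Made-up actual outcomes and forecasts.
actual = [1, 1, -1, -1, 1, -1]
forecast = [1, 1, -1, 1, 1, -1]
score_plus, score_minus = interval_scores(actual, forecast)
```

Here four forecasts point to the upper interval and three of them are justified, giving 0.75, while both forecasts of the lower interval are justified, giving 1.0.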
The selected models were tested as follows. The training set (4) was divided into two parts. The first part included the even rows of $X$ and the corresponding elements of $Y$. This part was used for training and optimizing the models. The second part, with the odd rows of $X$ and the corresponding elements of $Y$, was used for obtaining forecasts. Subsequently, the roles were swapped: the second part was used for training and optimizing the selected models, and the first part for obtaining forecasts. The values (5-7) were thus estimated by two-fold cross-validation [15].
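The even/odd splitting scheme can be sketched in Python as follows. Rows are numbered from 1, as in the text; the function name two_fold_parity_split and the sample data are hypothetical.

```python
def two_fold_parity_split(X, Y):
    """Return two (train, test) pairs built from even and odd rows.

    Rows are numbered from 1. In the first fold the even rows are used for
    training and the odd rows for forecasting; in the second fold the roles
    are swapped.
    """
    even = [i for i in range(len(X)) if (i + 1) % 2 == 0]
    odd = [i for i in range(len(X)) if (i + 1) % 2 == 1]
    part_even = ([X[i] for i in even], [Y[i] for i in even])
    part_odd = ([X[i] for i in odd], [Y[i] for i in odd])
    return [(part_even, part_odd), (part_odd, part_even)]


# Made-up training set with five rows.
X_all = [[1], [2], [3], [4], [5]]
Y_all = [1, -1, 1, -1, 1]
folds = two_fold_parity_split(X_all, Y_all)
```

Scores computed on the test part of each fold can then be pooled or averaged; which of the two was done in the experiments is not stated here.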
The scores (5-7) were estimated for different values of $\alpha$ from 0.20 to 0.80 with a step of 0.1. In all cases, the parameter $\tau$ was fixed and equal to 1. For each fixed value of $\alpha$, a sequential search over the parameter $p$ from 1 to 10 was carried out (this parameter is common to PNN, PCM, and NBM). For NBM, for each new value of $p$, the smoothing parameter of the nonparametric density function of the predictors was varied from 0.1 to 1 with a step of 0.1. Within each class of model, the variant with the maximal value of (7) was chosen. All algorithms were implemented using the R language [16][17][18]. Table 1 shows the results obtained. [...] can be done by experts. It should be noted that the range of $\alpha$ from 0.20 to 0.80 is quite sufficient for solving practical problems. It is not advisable to specify larger or smaller values of $\alpha$, as this will lead to a serious «imbalance» in the training set, and the results of IF can consequently be unstable and inadequate.
It is possible that the additional measures to protect against cyberattacks should be applied not at the first forecast that the future value falls in the interval $I^+$, but after several such forecasts. More research is needed in this direction.

Conclusion
As follows from the results of this work, interval forecasting of cyberattack intensity on IS of critical infrastructures is a necessary and important practical task. Experiments showed that, for the selected indicator, interval forecasting of cyberattack intensity based on a probabilistic neural network is more accurate than forecasting based on the other models considered.
Given that information about cyberattack intensity on IS of critical infrastructures is confidential, the number of cyberattacks per day that occurred from 1998 to 2015 in South Korea was considered as an alternative indicator in this work [14]. This indicator was chosen because it is publicly available and has a large volume of data, suitable for constructing various machine learning models for the purpose of IF. At the same time, the selected indicator is non-stationary in terms of the location and scale parameters, which underlines its dynamic statistical nature. Since interval forecasting of cyberattack intensity showed good results for this indicator, similar conclusions can be drawn with respect to the IS of critical infrastructures.