Using an ensemble of neural networks trained on an unbalanced sample to classify the state of Internet of Things devices

The paper considers an approach to identifying anomalous situations in network segments of the Internet of Things (IoT) and proposes an ensemble of classifying algorithms for detecting them. Classification objects are represented by tuples of multiple parameters. The classifying algorithms are tuned to different types of events and anomalies using training samples of different composition. Using an ensemble of algorithms increases the accuracy of the results through collective voting. An experiment using three neural networks of identical architecture is described. Diversity among the classifiers in the ensemble was achieved through the formation of the training sample. Results were obtained both for each classifier separately and for the ensemble. Although the training sample was imbalanced with respect to classes, the test results obtained by averaging the outputs of the ensemble of classifiers showed an accuracy of more than 99 %.


Introduction
The rapid development of network solutions based on the concept of the Internet of Things necessitates solving a number of issues related to ensuring information security of devices and nodes. In this regard, the tasks of detecting network attacks, recognizing abnormal traffic, analyzing the state of information security of the IoT devices themselves are being solved. The elements of the Internet of Things are mainly located outside the controlled area, providing the solution of a large number of domestic and industrial tasks in an automatic mode without human intervention. The growth in the number of nodes and devices raises the issues of supporting their life cycle, ensuring information and functional security, associated with the complexity of tracking their state.
There are problems of ensuring secure network access, receiving and transmitting messages, machine-to-machine communication, routing, and intelligent data processing in the face of constant changes in the network structure, software versions, and the integration of devices from different manufacturers into network segments. Most solutions to such problems are based on statistical data accumulated during operation. We consider network segments and corporate networks and search for traffic anomalies caused by destructive influences. Machine learning methods based on various classifying algorithms are used. For intrusion detection, decision trees and deep neural networks are used, and genetic algorithms can be built for detecting anomalies [1][2][3]. Incidents are prevented on the basis of models that assess the state of information security using tuples of features [3,4]. Information security events are detected using approaches that analyze the cause-and-effect relationships of state changes occurring in the system. The processes occurring in the system are considered and transition graphs are constructed, which makes it possible to predict the development of the situation and prevent events that precede negative consequences [5,6].
The huge variety of elements of the Internet of Things, constant changes in the network architecture, the arbitrary appearance and disconnection of devices by end users and the simultaneous use of different generations of machine-to-machine communication protocols necessitate constant improvement of information security systems, where one of the main areas is the use and adaptation of machine learning methods.
The quality indicators of classifying algorithms depend on the parameters of the training sample and the structure of the information it contains. Typically, the issues in forming training samples are associated with class imbalance, the separation and correct interpretation of background and significant patterns, the absence of training objects of a certain type or of elements of the feature system, inaccurate ranges of variable values, and the appearance of extraneous patterns associated with the conditions under which the training set was formed. Moreover, if the sample contains information about dissimilar events, a number of problems arise in interpreting the data structure [7]. In state classification problems, examples are associated with a set of labels $\{l_1, l_2, \ldots, l_n\} \subseteq L$. Several events may be grouped together as related, for example, as belonging to safe or dangerous states.
Classification algorithms identify events whose records are labeled with a label $l$ from a set of disjoint labels $L$, $|L| > 1$. If $|L| = 2$, binary classification is carried out; if $|L| > 2$, multi-class classification is implemented [8].
Thus, when constructing classifiers, two groups of methods are used: methods that transform a classification problem with several labels into either a single classification problem or several classification problems with one label, and methods that extend existing algorithms to process data with multiple labels.
The work considers a finite number of labels $l_i$ of set $L$. Each label $l_i$ belongs to subset $L_1$ or $L_2$ of set $L$, which define the dangerous class $C_1$ and the safe class $C_2$ of the set of states $Z$.
There is an initial labeled sample $X \times L$. Known pairs $(X_i, l_i)$ are defined on the training sample.
It is necessary to construct a classification algorithm $a$ that, given an input tuple of values $X_i \in X$, maps state $Z$ into the set $C$ of event classes: $a: X \times L \to C$.
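The grouping of labels into dangerous and safe classes can be illustrated with a short Python sketch. The label names, the feature tuples, and the function name `label_to_class` are hypothetical and serve only as an illustration; the source does not list concrete labels.

```python
# Hypothetical label subsets; the concrete labels are assumptions for illustration.
DANGEROUS_LABELS = frozenset({"dos_attack", "parasitic_traffic"})  # subset L1
SAFE_LABELS = frozenset({"normal", "firmware_update"})             # subset L2


def label_to_class(label: str) -> str:
    """Map an event label to the state class: C1 (dangerous) or C2 (safe)."""
    if label in DANGEROUS_LABELS:
        return "C1"
    if label in SAFE_LABELS:
        return "C2"
    raise ValueError(f"unknown label: {label!r}")


# A tiny labeled sample of (feature tuple, label) pairs with made-up values.
labeled_sample = [
    ((0.9, 120, 1), "dos_attack"),
    ((0.1, 8, 0), "normal"),
    ((0.7, 95, 1), "parasitic_traffic"),
]

# Reduce the multi-label problem to a binary one over classes C1 and C2.
binary_targets = [label_to_class(label) for _, label in labeled_sample]
```

With this reduction, several distinct labels share one target class, so a binary classifier can be trained on data that originally carried multiple event labels.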

Materials and Methods
Because IoT devices differ in design features, interfaces and communication protocols, individual events that are monitored on some nodes cannot be identified in other segments. This may be due to different device characteristics, different software versions, manufacturing peculiarities, or the use of incompatible interfaces. One approach to this problem is to train classifying algorithms independently on the data of particular types of devices and network segments of the Internet of Things and then combine them into ensembles. The choice of classification algorithms is determined by the need to achieve the required values of quality indicators. When classification and forecasting problems involve complex structures or high-dimensional data, various compositions of individual algorithms are used to achieve a given quality [9][10][11]. The use of compositions of classifying algorithms presupposes that a number of conditions are met to achieve diversity (difference) of the classifiers in an ensemble. Solutions to this issue primarily involve using algorithms of "different nature" or forming compositions of "weak" classifiers [11][12][13]. However, there is no unequivocal opinion on the issue of diversity (difference) of classifiers [14][15][16][17].
The proposed approach to achieving a variety of classifiers is associated with the formation of a training sample. Under the conditions of the functioning of the system, the appearance of destructive influences, the emergence of specific conditions of the external and internal environment, it is not always possible to know in advance all the events that may occur. Therefore, the training sample may not fully reflect the information about the analyzed events.
To neutralize this effect, it is proposed to form a training sample in such a way that it contains a limited subset of events for each classifier.
Each classifier specializes only in a certain part of the classification objects, marked with the appropriate labels. A trained classifier may not "know" about events of the same class that are marked with different labels. An under-training effect is thus created artificially, and it is compensated for by using the ensemble.
The approach comprises four stages. At the first stage, the classes of objects and their attribute space are determined. At the second stage, the parameters of the objects of each class are set. The set of parameters depends on the algorithm and on the constraints of the infrastructure segment in which it will act as the main algorithm for identifying the state. The third stage is the construction of a training sample for each classifier. The fourth stage is the formation of the aggregation function and the decision rule.
The approach differs in that each member of the ensemble is not merely a weak classifier but a strong classifier on a limited set of classes. Because destructive influences are varied and often unique, diversity is achieved by forming the ensemble from classifiers that are strong at determining the events initially assigned to them.

Results and Discussion
In practice, situations arise when a dataset record about an event can belong to several classes at the same time, for example, a DoS attack and parasitic traffic, which makes it necessary to combine quite different events into one class. To achieve diversity among the algorithms included in the model, they are trained independently of each other on randomly selected subsets of the training sample. The initial training set of tuples $X$ is split into subsamples, with the initial assumption that each classifier "specializes" in a certain class of events.
For the experiment, the MATLAB package was used; the basic algorithms were two-layer neural networks [12], described by expression (1), where $T$ is the number of neurons in the hidden layer, $w_{jt}$ is the connection weight between the $j$-th feature and the $t$-th neuron of the hidden layer, $w_t$ is the connection weight between the $t$-th neuron of the hidden layer and the output neuron, and $\delta$ is the activation function.
The ensemble of basic algorithms is described by expression (2) [12], where $\Phi$ is the decision rule.
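Expressions (1) and (2) can be written out from the symbols defined above. The following is a sketch consistent with those definitions rather than the exact notation of [12]: the feature values $x_j$, the number of features $n$, the names $b(x)$ and $a(x)$, the number of networks $N$, the threshold $\theta$, and the omission of bias terms are all assumptions introduced here.

```latex
% (1) A two-layer network b(x); bias terms are omitted (assumption).
b(x) = \delta\!\left( \sum_{t=1}^{T} w_t \, \delta\!\left( \sum_{j=1}^{n} w_{jt}\, x_j \right) \right)

% (2) The ensemble a(x) combines N basic algorithms through the decision rule \Phi.
a(x) = \Phi\bigl( b_1(x), b_2(x), \ldots, b_N(x) \bigr)

% One possible \Phi (assumption): average the outputs and compare with a threshold \theta.
% The square brackets denote the Iverson bracket (1 if the condition holds, 0 otherwise).
\Phi\bigl( b_1(x), \ldots, b_N(x) \bigr) = \left[ \frac{1}{N} \sum_{i=1}^{N} b_i(x) \ge \theta \right]
```

The averaging rule is shown only because the reported results were obtained by averaging the outputs of the ensemble; other decision rules, such as majority voting, fit the same general form (2).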
Let us form the training sample of each classifier in such a way that the algorithm "specializes" in certain classes.
To do this, we artificially divide the sample so that the set of training examples for each classifier contains records about some events, while information on others is absent or minimized.
We consider a finite number of labels $l_i$ of set $L$, the information on which is contained in tuples $X_i$. Known pairs $(X_i, l_i)$ are defined on the training sample. Each label $l_i$ is included in subset $L_1$ or $L_2$ of set $L$, which define the dangerous class $C_1$ and the safe class $C_2$ of states. The training sample is divided into subspaces containing equal numbers of training examples, one for each classifying algorithm. A single subspace may not contain tuples marked with specific labels from subset $L_1$ or $L_2$. The classifiers are trained independently, and each algorithm recognizes the labels of its own subset.
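The division of the training sample into equal, specialized parts could be sketched in Python as follows. This is one possible reading of the procedure, not the authors' implementation: the function name `specialized_subsets`, the greedy assignment, the truncation to a common size, and the label names are assumptions made for illustration.

```python
import random


def specialized_subsets(records, excluded_labels, seed=0):
    """Split (features, label) records into disjoint subsets of equal size.

    Subset k never receives a record whose label is in excluded_labels[k],
    so classifier k is trained without any examples of those events.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)

    subsets = [[] for _ in excluded_labels]
    for features, label in shuffled:
        # Subsets that are allowed to contain this label.
        eligible = [k for k, banned in enumerate(excluded_labels)
                    if label not in banned]
        if not eligible:
            continue  # no classifier accepts this label; the record is dropped
        # Greedy balancing: put the record into the least-filled eligible subset.
        target = min(eligible, key=lambda k: len(subsets[k]))
        subsets[target].append((features, label))

    # Truncate to a common size so every classifier sees equally many examples.
    size = min(len(s) for s in subsets)
    return [s[:size] for s in subsets]


# Synthetic records with hypothetical labels: 30 records per label.
records = [((i,), label)
           for label in ("dos_attack", "parasitic_traffic", "normal")
           for i in range(30)]
excluded = [{"dos_attack"}, {"parasitic_traffic"}, {"normal"}]
subsets = specialized_subsets(records, excluded)
```

Each resulting subset lacks one kind of event entirely, which reproduces the artificial under-training of individual classifiers that the ensemble is meant to compensate for.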
The test set is fed to an ensemble of classifiers consisting of networks of the same structure (Figure 1).

Figure 1. Ensemble of classifiers
Below are the results of classification by three identical two-layer neural networks after training on different training samples. A vector from the test set was fed to the input, and binary classification was carried out at the output. The ratio of the training sample size to the test sample size was 70:30. The training set was divided into three parts. The training samples for the first and second networks were formed by random selection. For the third network, 90 % of the entries belonged to class $C_1$.
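The sampling scheme of the experiment could be reproduced along the following lines. The sketch assumes that the three parts are disjoint and of equal size; the function names, the `part_size` parameter, and the synthetic records are hypothetical and are not taken from the source.

```python
import random


def split_train_test(records, train_fraction=0.7, seed=0):
    """Shuffle the records and split them into training and test samples."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def build_three_training_sets(train, part_size, c1_share=0.9, seed=0):
    """Return three disjoint training parts of equal size.

    Parts one and two are drawn at random; in part three a fixed share
    of the records (90 % by default) belongs to class C1.
    """
    rng = random.Random(seed)
    pool = list(train)
    rng.shuffle(pool)
    c1 = [r for r in pool if r[1] == "C1"]
    c2 = [r for r in pool if r[1] == "C2"]

    n_c1 = round(part_size * c1_share)
    n_c2 = part_size - n_c1
    if len(c1) < n_c1 or len(c2) < n_c2:
        raise ValueError("not enough records to build the imbalanced part")
    third = c1[:n_c1] + c2[:n_c2]

    # The remaining records are shuffled and cut into two random parts.
    rest = c1[n_c1:] + c2[n_c2:]
    rng.shuffle(rest)
    if len(rest) < 2 * part_size:
        raise ValueError("not enough records for the two random parts")
    return rest[:part_size], rest[part_size:2 * part_size], third


# Synthetic data: 600 records of class C1 and 400 of class C2.
records = [((i,), "C1") for i in range(600)] + [((i,), "C2") for i in range(400)]
train, test = split_train_test(records)
first, second, third = build_three_training_sets(train, part_size=200)
```

Drawing the imbalanced part first guarantees its class ratio; the two random parts then inherit whatever class proportions remain in the pool.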
The results obtained after processing the test set for each network and the entire ensemble are shown in Figure 2.
The test results obtained by averaging the outputs of the ensemble of classifiers show an accuracy of more than 99 %, even though the training sample was imbalanced with respect to classes.

Figure 2. Results of testing networks separately and the overall result of the classifier
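Averaging over the ensemble can be illustrated with a minimal Python sketch. The network outputs below are invented numbers chosen for illustration, not the experimental results, and the 0.5 threshold and the function names are assumptions.

```python
from statistics import mean


def ensemble_decision(scores, threshold=0.5):
    """Average the network outputs for one input and apply a threshold."""
    return "C1" if mean(scores) >= threshold else "C2"


def accuracy(predicted, actual):
    """Share of predictions that match the true classes."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


# Made-up outputs of three networks for four test vectors (scores for class C1).
net1 = [0.90, 0.20, 0.60, 0.10]
net2 = [0.80, 0.40, 0.30, 0.20]
net3 = [0.95, 0.70, 0.90, 0.60]  # trained mostly on C1, so biased towards it
truth = ["C1", "C2", "C1", "C2"]

# Ensemble prediction: average the three outputs for each test vector.
ensemble_pred = [ensemble_decision(scores) for scores in zip(net1, net2, net3)]

# Prediction of the biased third network on its own, for comparison.
net3_pred = [ensemble_decision([score]) for score in net3]

ensemble_accuracy = accuracy(ensemble_pred, truth)
net3_accuracy = accuracy(net3_pred, truth)
```

In this toy example the third network alone labels every vector as dangerous, while the averaged output is pulled back below the threshold by the other two networks for the safe vectors.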

Conclusion
The increase in the number of new types of attacks and destructive effects on IoT devices necessitates the analysis of a large number of parameters of the system's functioning. Most of the events that occur in the system under the influence of the external environment are unique and can manifest themselves in different ways in various devices, nodes and elements.
The article shows that the accuracy of class recognition can be improved by combining a number of classifying algorithms into an ensemble. An approach to forming an ensemble of classifiers is presented in which each classifying algorithm, through the formation of its training sample, "specializes" in certain events. As a result, to ensure the difference between the aggregated classifying algorithms, the ensemble is built not from weak classifiers but from classifiers that are strong in their own field.
The approach was evaluated using an ensemble of two-layer neural networks trained on samples containing selected event records, which increased the identification accuracy up to 99.9 %.