Poisoning attack on the training dataset for on-line signature authentication using a perceptron

This paper studies how the effectiveness of a poisoning attack depends on various factors, for the task of on-line signature authentication using a perceptron for decision-making. The dependences on the structure of the neural network, the volume of the poisoning sets, and the noise in the training sample were studied. We conclude that such attacks cannot be detected from changes in the overall accuracy of the system.


Introduction
Attacks on machine learning algorithms, and on neural networks in particular, can introduce various vulnerabilities into systems that use these algorithms: an attacker can deceive the recognition algorithms for certain images [1, 2], or system developers may leave open the possibility of unauthorized access to the system [3]. Given how important a component of the modern information space machine learning is [4], the study of attacks on machine learning algorithms is now more urgent than ever. Observing the development trends of modern information technologies, we can see a significant expansion of the areas in which machine learning systems are applied: speech recognition systems [5] and complex security systems [6] are now built on the basis of neural networks.
Examples of such attacks include adversarial attacks [7, 8] and attacks at the training stage [9].
The process of preparing a training sample is one of the key points of machine learning that determines its outcome. For this reason, counteracting attacks while ensuring the security of systems based on machine learning methods is relevant, and counteracting poisoning attacks on training datasets [10] is particularly important.
We analyze the poisoning attack on a dataset [11]. The essence of this attack is that an attacker injects his own data into the training set when a neural network is trained. Within the framework of the system under study, an attacker can embed his data sets among the data of legitimate users in order to increase the rate of statistical errors of the second kind and gain potential access to the system under the guise of a legitimate user [12].
IOP Publishing doi:10.1088/1757-899X/1069/1/012031
As part of a review aimed at identifying typical set-poisoning attacks and evaluating their effectiveness, we can distinguish works such as [13][14][15][16][17][18]. Below we consider them, the methods used, and the results obtained in more detail.
In [13], a study of poisoning the feature-selection stage of training is proposed. Selection methods such as LASSO, ridge regression, and the elastic net are considered. General machine learning methods are analyzed, and it is shown that poisoning 5% of a class's training dataset leads to complete disruption of the system.
The work [14] explores the robustness of federated learning systems against training-set poisoning attacks. It is shown that an attacker can generate new training examples and use them to poison a training sample, and that this approach allows the attacker to achieve an accuracy of up to 80%.
The vulnerability to poisoning attacks of the least-squares method in the uLSIF algorithm, which is used to estimate the importance ratio for covariate shift adaptation, was shown in [15]. The results demonstrate the applicability of training-dataset poisoning to the method under consideration.
The work [16] shows the possibility of a combined effect of a training-sample poisoning attack on both the integrity and the availability of information. The attack's effectiveness is demonstrated for logistic and linear regression methods.
In [17], the effectiveness of the attack against an intrusion detection system is considered. The Edge Pattern Detection (EPD) algorithm is analyzed to develop a new poisoning method that attacks several machine learning algorithms used in IDSs.
In [18], the effectiveness of training-set pollution attacks against intrusion detection systems, in particular for mobile applications, is shown. A solution is proposed that increases the detection rate of such attacks by at least 15%.
The analysis of these works confirms the relevance of studying training-set poisoning attacks, given the new research in this direction over the past two years. These attacks can be applied to various systems, in particular intrusion detection systems and identification systems. Various characteristics, such as voice [2, 19], keystroke dynamics [20], and off-line [21] and on-line [22] signatures, can be used as input for authentication systems based on biometric characteristics. This work is devoted to studying the effect of a training-dataset poisoning attack on a system not represented in existing works, namely an identification system based on signature dynamics.

Description of the existing on-line signature authentication system
The system without an attack is a script in the MATLAB mathematical package [23] that includes: 1) extraction of signature data sets from the database; 2) creation of a neural network with specified parameters; 3) splitting the signature datasets into three parts: for neural network training, for selecting the decision threshold, and for primary network testing with retraining; 4) random generation of new sets based on existing ones, with a deviation of 1%, to increase the amount of input data for training; 5) neural network training; 6) selection of the decision threshold; 7) testing of the trained neural network; 8) obtaining and correctly presenting the various statistical errors [24]. The input data for neural network training are signature datasets, several dozen for each user. Signature data sets are presented as 144 signature parameters [22] that provide complete information about the dynamics of signature writing.
The study was conducted on a dataset of 2445 parameterized signatures from 22 users. Each user is identified in the system by a unique number from 1 to 22. Thirteen users are legitimate, and the network is trained on their data. The remaining 9 users are not registered in the system and are used for testing it. For all users, the number of signatures is increased tenfold by generating nine signatures from each one, applying a random change to each of the 144 signature parameters not exceeding 1% of the parameter's value.
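The tenfold augmentation described above can be sketched as follows. The original scripts are in MATLAB and are not shown in the paper, so this is an illustrative Python sketch; the function name and the uniform choice of perturbation are assumptions, with only the per-parameter 1% deviation bound taken from the text.

```python
import numpy as np

def augment_signature(params, copies=9, noise=0.01, rng=None):
    """Generate noisy copies of one parameterised signature.

    Each of the 144 parameters is perturbed by a random amount
    not exceeding `noise` (1%) of its own value, as in the
    tenfold augmentation described in the text.
    """
    rng = np.random.default_rng(rng)
    params = np.asarray(params, dtype=float)
    # uniform relative perturbation in [-noise, +noise], per parameter
    deltas = rng.uniform(-noise, noise, size=(copies, params.size))
    return params * (1.0 + deltas)

# one original signature -> ten sets in total (the original + 9 copies)
original = np.linspace(1.0, 144.0, 144)
copies = augment_signature(original, rng=0)
```

Applied to every signature in the database, this turns the 2445 original sets into roughly ten times as many training examples.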
By default, the created neural network takes 144 parameters in its input layer, has 30 neurons in the intermediate layer, and returns a result vector. The number of elements in this vector equals the number of legitimate users in the system and indicates to whom the entered signature belongs (Figure 1).
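The 144-30-13 architecture can be sketched in plain Python/NumPy as follows. The paper does not specify activation functions or initialisation, so the tanh hidden layer, softmax output, and small random initial weights here are assumptions; only the layer sizes come from the text.

```python
import numpy as np

def make_perceptron(n_inputs=144, n_hidden=30, n_users=13, rng=None):
    """Initialise a one-hidden-layer perceptron matching the text:
    144 inputs, 30 hidden neurons, one output per legitimate user."""
    rng = np.random.default_rng(rng)
    return {
        "W1": rng.standard_normal((n_inputs, n_hidden)) * 0.01,
        "b1": np.zeros(n_hidden),
        "W2": rng.standard_normal((n_hidden, n_users)) * 0.01,
        "b2": np.zeros(n_users),
    }

def forward(net, x):
    """Forward pass; the output vector has one score per legitimate user."""
    h = np.tanh(x @ net["W1"] + net["b1"])   # intermediate layer
    z = h @ net["W2"] + net["b2"]            # raw user scores
    e = np.exp(z - z.max())                  # softmax over users
    return e / e.sum()

net = make_perceptron(rng=0)
scores = forward(net, np.zeros(144))
```

The index of the largest output, compared against the decision threshold, determines which legitimate user (if any) the entered signature is attributed to.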
Signatures from legitimate users are divided into three sets: a training set, a set for selecting the decision threshold, and a set for primary testing, in the ratio 60%, 20%, and 20%. Based on the results of testing the trained neural network, data are formed on the sum of errors of the first and second kind, the system error, and the total error of the first and second kind with a tenfold priority given to the error of the second kind.
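The "tenfold priority of the error of the second kind" can be read as a weighted combination of the two error rates. The paper does not give the exact formula, so the normalised weighted mean below is an assumption made for illustration only.

```python
def weighted_total_error(type1, type2, priority=10.0):
    """Combine the error of the first kind (false rejection) and the
    error of the second kind (false acceptance) with a tenfold
    priority on the second kind, as described in the text.

    The exact combination rule is not given in the paper; this
    normalised weighted mean is a hypothetical choice.
    """
    return (type1 + priority * type2) / (1.0 + priority)
```

Under this weighting, a given false-acceptance rate contributes ten times as much to the total error as the same false-rejection rate, reflecting that letting an attacker in is considered far more costly than rejecting a legitimate user.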

Variants of the investigated attack parameters
As part of research on machine learning with biometric authentication and attacks on it, we conducted studies in which various parameters of neural network training were gradually increased within predetermined intervals. In particular, we studied the effect on the output of the following parameters: the number of neurons in the intermediate layer; the percentage of alien sets included under the guise of a legitimate user; the number of users in the system; and the noise coefficient. As the graphs show, the percentage of inclusion does not influence the system's overall efficiency, which means a poisoning attack is difficult to detect by analyzing system performance or user data. We obtained the dependence of the error on the attacked user; the error value reflects the probability of an intruder penetrating the system as a legitimate user.
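The core of the studied attack, injecting the attacker's signature sets into a legitimate user's training data at a given inclusion percentage, can be sketched as follows. The function name, the choice of which attacker sets to inject, and the label convention are illustrative assumptions; the text specifies only that attacker sets are relabeled as the attacked user at a controlled percentage.

```python
import numpy as np

def poison_training_set(legit_sets, attacker_sets, percent, rng=None):
    """Mix a fraction of the attacker's signature sets into the
    attacked user's training data, relabeled as that user -- the
    training-set poisoning attack studied in the text.

    `percent` is the share of poisoned sets relative to the
    legitimate user's own set count.
    """
    rng = np.random.default_rng(rng)
    n_poison = max(1, round(len(legit_sets) * percent / 100.0))
    idx = rng.choice(len(attacker_sets), size=n_poison, replace=False)
    poisoned = np.vstack([legit_sets, attacker_sets[idx]])
    # after injection every row carries the attacked user's label
    labels = np.full(len(poisoned), fill_value=0)  # 0 = attacked user id
    return poisoned, labels

legit = np.ones((100, 144))       # the attacked user's genuine sets
attacker = np.zeros((50, 144))    # the attacker's own signature sets
data, labels = poison_training_set(legit, attacker, percent=5, rng=0)
```

Training on the returned data teaches the network to accept the attacker's signatures as the attacked user's, which is exactly what drives up the error of the second kind in the experiments.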

Dependence on the number of neurons in the intermediate layer
The dependence of the attack's effectiveness on the poisoning percentage is shown in Figure 6 (dependence of the change in the statistical error of the second kind on the inclusion percentage). As the dependency graphs show, the percentage of poisoning sets influences the user errors. Therefore, by using a set-poisoning attack, an attacker is theoretically able to gain access with minimal risk of being detected.

Influence of the number of users on the system operation
We changed the number of users in the training set of the attacked system and obtained graphs of the dependence of the error sum, with a tenfold priority of the second-kind error, and of the probability of a successful attack on the number of users.

Influence of the percentage of noise on the operation of the system
Increasing the training set by generating new, noised sets from the available ones is a typical method for enhancing the representativeness of the learning sample and the quality of the system [25].
We studied the influence of the noising multiplicity on the operation of the system, both with and without an attack on it.
The noise is generated from the training set with the introduction of 1% entropy. We varied the number of automatically generated sets for the neural network and obtained graphs of the dependence of the error sum, with a tenfold priority of the second-kind error, and of the attack efficiency on the noising multiplicity. The experiments were carried out both for the system without an attack and for the attacked system. It can be seen that increasing the noising multiplicity beyond 5 does not affect the error rates, which remain near a constant level.

Discussion of results
The obtained dependences were analyzed and an attempt was made to explain them. Summing up the study, we inferred an optimal number of neurons; deviating significantly from it leads to a decrease in system efficiency. A change in the number of neurons does not affect system efficiency for legitimate users, whether in the presence of an attack or in its absence. The extremum has blurred boundaries, and, compared with the previously recorded level, the error value and rate do not change. This shows how problematic attack detection is for the system: the attack cannot be detected by varying the system parameters. The efficiency of the attack also does not depend on the number of neurons, and the system efficiency does not change under the influence of the introduced sets. At the same time, the attack efficiency grows with the inclusion percentage: even the minimum level of inclusion (1%) increases the attack efficiency to 30%, which is significant for system efficiency, while an inclusion of 10-20% of the sets gives the attacker access to the system.
The study of the dependence of the errors on the number of users shows that the general index does not change in the investigated range, while the probability of a successful attack at a fixed level of poisoning decreases for a single user. This can be explained by two factors. First, an increase in the number of users increases the total number of sets, so the share of the included sets decreases. Second, the total number of classes increases, and any of them can appear at the system output; for a successful attack, the attacker must be identified specifically as the attacked user, so increasing the total number of classes decreases the probability of this match. Nevertheless, the output corresponding to the attacked user remains the most probable. Attacks similar in general concept, on a neural network created to recognize road signs, were carried out by Jiang, Wenbo, et al. [26]. They modeled a situation in which the attacker includes his own sets, similar to the sets of the attacked user, in the training set; the included sets are modified so as to be visually almost indistinguishable from legal ones. This inclusion prevents the neural network from correctly identifying the included sets. By varying the degree of inclusion, it was possible to obtain the dependence of the system accuracy on the ratio of the number of included sets to the total number of sets. As a result, the system accuracy decreased from 95% to 33% for one user's set and from 97% to 63% for another's when the poison rate was varied from 0% to 10%.
A study similar in its approach to poisoning, but differing in the method of generating the poisoning set, was carried out by Biggio, Battista, et al. [27]; there too, the poison rate was varied from 0% to 10%. Biggio et al. used an algorithm for generating the poisoning set different from the one proposed by Jiang et al. and observed an increase in error from 0-4% to 20-28% on different training sets. Unfortunately, the systems in these publications do not calculate statistical errors, but we can compare our values of the statistical second-kind error for a specific user with the system efficiency and test-error percentages reported there. Increasing the included-set rate from 1% to 10% changes the value of the statistical second-kind error in our system from 30% to 87%. We can conclude that the attack in this study has a high impact on the system; however, it may also be easier to detect, because it involves a more aggressive intervention in the system.
Another aspect of neural network operation, namely the effect of noise on network training, was studied by Y. Hayakawa, A. Marumoto, and Y. Sawada [28]. They studied the "white" noise method, in which the generated random variable has no correlation with the training set. This method is comparable to the one used in our system.

Conclusion
A study was presented of the effect of changing various parameters of neural network training on the final classification accuracy.
The study was carried out for the task of on-line signature authentication using a perceptron for decision-making.
The results of the study confirm the sensitivity to poisoning inclusions that is inherent in similar attacks on machine learning methods.
Even 1% of inclusion is enough to cause a multifold reduction in the quality of system operation with respect to detecting an attacking user. Further studies on countering attacks on machine learning methods are relevant for such areas of information security as authentication and intrusion detection [24].