Research on the Security Technology of Federated Learning Privacy Preserving

With the emergence of data islands and growing public awareness of privacy, federated learning, as an emerging model for data sharing and exchange, enables multi-party collaboration while protecting data privacy and security, because the data distributed across multiple devices never needs to leave those devices. Since it benefits all participants, it has been widely applied in fields such as finance, medical care, and education. However, federated learning (FL) also faces various security and privacy issues. Starting from an overview of federated learning, this article describes its threat model and existing security issues in detail, including reconstruction attacks, poisoning attacks, inference attacks, etc., and then analyzes FL privacy-preserving security technologies. Compared with secure multi-party computation (SMC) and homomorphic encryption (HE), differential privacy performs well in terms of efficiency. Finally, we discuss the challenges of privacy protection and security and future research directions.


Introduction
With the development and popularization of artificial intelligence technology, the rise of big data, the gradual improvement of policies and regulations, and the strengthening of public privacy-protection awareness, enterprises find it difficult to exchange original data and services across independent industries out of concern for user privacy and commercial secrets. As a result, cross-enterprise collaborative governance faces the problem of "data islands".
To address these problems, Google proposed federated learning in 2017: an encrypted distributed machine learning technique that, to a large extent, allows mobile devices to complete model training without infringing on user privacy [1]. The implementation strategy of federated learning today is to establish a virtual shared model that approximates the optimal model that would be built by aggregating all the data. Under this federated mechanism, every participant has equal identity and status, realizing cross-party data collaboration and shared benefit.
However, federated learning still carries significant security risks. This article analyzes its possible security problems in order to safeguard federated learning and promote its further development and adoption.

Definition
Federated learning refers to a security solution that meets the requirements of privacy protection and security supervision [1]. That is, it is a machine learning framework designed so that the original data distributed across multiple devices need not be shared: models are trained locally, and only encrypted training results are exchanged [2].
As shown in Fig.1 below, federated learning is defined as follows [3]. Assume there are N data owners {F1, …, FN}, all of whom wish to train a machine learning model by merging their respective data sets {D1, …, DN}. The traditional approach is to pool all the data and use D = D1 ∪ … ∪ DN to train a model MSUM. Federated learning is instead the process by which the data owners cooperatively train a model MFED such that, during training, no owner Fi needs to send its data Di to the other data holders. At the same time, the accuracy VFED of MFED should be very close to that of MSUM: for a small positive number δ, |VFED − VSUM| < δ, i.e., the accuracy loss of the federated learning algorithm is bounded by δ.
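The cooperative training of MFED described above can be sketched with a federated averaging strategy; the one-parameter model, synthetic data, and hyperparameters below are illustrative assumptions, not details from the paper.

```python
import random

def local_train(w, data, lr=0.1, epochs=5):
    """One owner's local update: plain SGD on the toy model y = w * x."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x  # d/dw of the squared error
            w -= lr * grad
    return w

def federated_averaging(client_datasets, rounds=20):
    """Server loop: broadcast w, collect locally trained models, and
    average them weighted by each owner's dataset size."""
    w = 0.0
    total = sum(len(d) for d in client_datasets)
    for _ in range(rounds):
        local_models = [local_train(w, d) for d in client_datasets]
        w = sum(wk * len(d) / total
                for wk, d in zip(local_models, client_datasets))
    return w

# Three owners whose data all follow y = 3x; the raw pairs never leave them.
random.seed(0)
clients = [[(x, 3 * x) for x in (random.random() for _ in range(20))]
           for _ in range(3)]
w_fed = federated_averaging(clients)  # converges close to the true slope 3
```

Only the model parameter travels between the owners and the server, matching the definition: each Fi keeps Di local, yet the averaged model approaches what pooled training would give.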

Classification
According to the distribution characteristics of the data (Fig.2), federated learning can be classified as follows: 1) Horizontal federated learning: suitable when the user features of party A and party B overlap heavily but their users overlap little. The data set is split along the user (sample) dimension, and the portions whose features are the same but whose users differ are used for training. 2) Vertical federated learning: suitable when the users of A and B overlap heavily but their user features overlap little. The data set is split along the vertical (feature) dimension, and for the users common to both parties, the portions whose features differ are used for training. 3) Federated transfer learning: suitable when both the users and the user features of A and B overlap little. The data is not segmented; instead, transfer learning [4] is used to help participants whose data or labels are insufficient.
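The horizontal (sample-wise) and vertical (feature-wise) splits above can be made concrete with a toy table; the users, features, and values are invented purely for illustration.

```python
# Toy "full" dataset: rows are users, columns are features.
users = ["u1", "u2", "u3", "u4"]
table = {
    "u1": {"age": 30, "income": 50, "clicks": 7},
    "u2": {"age": 41, "income": 80, "clicks": 2},
    "u3": {"age": 25, "income": 45, "clicks": 9},
    "u4": {"age": 52, "income": 90, "clicks": 1},
}

# Horizontal FL: both parties hold the SAME feature space but
# DIFFERENT users (the table is split by rows).
party_a_h = {u: table[u] for u in ["u1", "u2"]}
party_b_h = {u: table[u] for u in ["u3", "u4"]}

# Vertical FL: both parties hold the SAME users but DIFFERENT
# features (the table is split by columns).
party_a_v = {u: {"age": table[u]["age"]} for u in users}
party_b_v = {u: {k: table[u][k] for k in ["income", "clicks"]}
             for u in users}
```

In horizontal FL the user sets are disjoint while the feature names coincide; in vertical FL the user sets coincide while the feature names are disjoint, which is exactly the distinction drawn in cases 1) and 2).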

Security issues
Although federated learning is a new way of training neural network models, an adversary can track the shared gradients and extract participants' private information from them, so federated learning still faces various security and privacy threats.
Three main attack types of machine learning
1) Completeness: evading detection, so that an intrusion point is classified as normal.
2) Usability: causing classification errors (false negatives and false positives); this is a broader type of attack.
3) Confidentiality: leakage of sensitive information.

Reconstruction Attacks
In the local model training phase, if the data structures are consistent across participants, the shared gradient information can be exploited to leak additional information about the training data. Machine learning models that store and expose feature values should therefore be avoided [5].
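Why gradients leak training data can be seen in the simplest case: for a linear model with squared loss on a single sample, the weight gradient is a scalar multiple of the private input. The model, sample, and the check below are illustrative assumptions, not the attack from [5].

```python
# For f(x) = w . x with squared loss on ONE sample, the gradient
# w.r.t. w is 2 * (w.x - y) * x: a scalar multiple of the private
# input x, so the input's direction is revealed exactly.
def grad_single(w, x, y):
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [2 * err * xi for xi in x]

w = [0.5, -1.0, 2.0]           # current (public) model
x_private = [3.0, 1.0, 4.0]    # the data that should stay local
y_private = 7.0
g = grad_single(w, x_private, y_private)  # what the server observes

# The attacker recovers x up to an unknown scale by dividing out one
# component; the true scale is used here only to verify the leak.
scale = g[0] / x_private[0]
x_hat = [gi / scale for gi in g]
```

The reconstruction `x_hat` equals the private input exactly, which is why sharing raw gradients of per-sample losses is dangerous.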

Inference attack
Even if the data is trained independently and locally, a malicious participant can derive hidden information about the other partners from the known shared parameters. The literature [5] implements an attack that can infer the type of traffic used during model building. The training-set information inferred by a model inversion attack can be whether a certain member is included in the training set, or certain statistical characteristics of the training set.

Poisoning attack
According to the source of the poisoned model update, poisoning attacks can be divided into data poisoning and model poisoning. Model poisoning does not modify the training data but directly generates a poisoned model update according to predefined rules. For example, [7] shows that a poisoned model update can be sampled from a Gaussian distribution. An attacker can also manipulate a benign model update into a poisoned one; in [4], [8], and [9], the attacker uses a deliberately pre-designed corrupted model to produce the poisoned update.
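The Gaussian-sampling strategy attributed to [7] can be sketched as follows; the update dimension, client count, and noise scale are illustrative assumptions.

```python
import random
random.seed(1)

def honest_update():
    """Honest clients' updates cluster near the true direction (here, 1.0)."""
    return [1.0 + random.gauss(0, 0.05) for _ in range(4)]

def poisoned_update(sigma=10.0):
    """Model poisoning: the attacker uploads noise drawn from
    N(0, sigma^2) instead of a genuinely trained local update."""
    return [random.gauss(0, sigma) for _ in range(4)]

# Nine honest clients and one attacker; the server naively averages.
updates = [honest_update() for _ in range(9)] + [poisoned_update()]
avg = [sum(u[i] for u in updates) / len(updates) for i in range(4)]
honest_avg = [sum(u[i] for u in updates[:9]) / 9 for i in range(4)]
```

Because plain averaging has no robustness check, a single high-variance poisoned update can drag the aggregate away from the tight honest consensus, which motivates robust aggregation rules.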

Adversarial attack
Adversarial attacks refer to maliciously constructing input samples that cause the model to output wrong results with high confidence. An input generated by adding a small perturbation to an original sample is called an adversarial sample [10].
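One common way to construct such a perturbed sample is the fast gradient sign method, shown below for a tiny logistic model; FGSM, the model weights, and the epsilon value are assumptions for illustration, not methods named in this paper.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def loss_grad_x(w, b, x, y):
    """Gradient of the logistic (cross-entropy) loss w.r.t. the INPUT x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]

def fgsm(w, b, x, y, eps=0.6):
    """x_adv = x + eps * sign(dL/dx): one step that increases the loss."""
    g = loss_grad_x(w, b, x, y)
    sign = [1 if gi > 0 else -1 if gi < 0 else 0 for gi in g]
    return [xi + eps * si for xi, si in zip(x, sign)]

w, b = [2.0, -1.0], 0.0
x, y = [1.0, 0.5], 1                 # clean sample, correctly classified
x_adv = fgsm(w, b, x, y)
p_clean = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
p_adv = sigmoid(sum(wi * xi for wi, xi in zip(w, x_adv)) + b)
```

The small signed perturbation flips the prediction from the correct class (p_clean > 0.5) to the wrong one (p_adv < 0.5), exactly the high-confidence misclassification described above.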

Secure multi-party computation
Secure multi-party computation (SMC) addresses the problem of collaborative computing among a group of mutually distrusting parties: it protects each party's privacy without relying on a trusted third party. SMC must ensure the confidentiality, independence, and accuracy of every party's information throughout the computation.
The SMC security model naturally involves many parties, and each party learns nothing beyond its own input and output, providing zero-knowledge-style security guarantees [10].
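The "nothing beyond inputs and outputs" property can be illustrated with additive secret sharing, a basic SMC building block for secure summation; the modulus, party count, and values below are illustrative assumptions.

```python
import random
random.seed(2)
P = 2**31 - 1  # public modulus; all arithmetic is mod P

def share(secret, n):
    """Split `secret` into n additive shares: n-1 random values plus a
    correction term, so that the shares sum to the secret mod P."""
    shares = [random.randrange(P) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

# Three parties each hold one private value.
secrets = [12, 30, 7]
n = len(secrets)
all_shares = [share(s, n) for s in secrets]

# Party j receives the j-th share of every secret. Each share alone is
# uniformly random, so no single party learns any individual value.
partials = [sum(all_shares[i][j] for i in range(n)) % P for j in range(n)]

# Combining the partial sums reveals ONLY the total.
total = sum(partials) % P
```

This is the idea behind secure aggregation in FL: the server reconstructs only the sum of model updates, never any single client's update.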

Homomorphic encryption
In [11], Hardy, Henecka, et al. adopted a homomorphic encryption method to protect user data, exchanging parameters under an encryption mechanism. Similarly, [12] uses a homomorphic encryption scheme to aggregate gradients on an honest-but-curious server, and ensures that the system, trained to the corresponding depth on the joint data set of all participants, reaches the same accuracy as a centralized learning system.

Differential privacy
A randomized algorithm A satisfies ε-differential privacy if, for all data sets D and D′ that differ in a single record and for all sets S ⊆ R, where R is the range of A,
Pr[A(D) ∈ S] ≤ e^ε · Pr[A(D′) ∈ S],
where the privacy parameter ε is a non-negative number [13][14]. That is, the output distribution of the algorithm is affected only slightly by any single record in the data set. In federated learning, this is usually achieved by having a trusted third party add noise to the updated parameters, preventing curious participants from inferring the training data [15].
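A standard way to realize the ε-DP guarantee above is the Laplace mechanism: clip each contribution, then add noise scaled to sensitivity/ε. The clipping range, values, and ε below are illustrative assumptions.

```python
import math
import random
random.seed(3)

def laplace_noise(scale):
    """Sample Laplace(0, scale) by inverse-CDF, stdlib only."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_average(values, lo, hi, epsilon):
    """epsilon-DP mean: clip each value into [lo, hi] so that one
    record can shift the mean by at most (hi - lo) / n (the
    sensitivity), then add Laplace(sensitivity / epsilon) noise."""
    n = len(values)
    clipped = [min(max(v, lo), hi) for v in values]
    sensitivity = (hi - lo) / n
    return sum(clipped) / n + laplace_noise(sensitivity / epsilon)

# Toy parameter updates from participants; true mean is 1.0.
updates = [0.9, 1.1, 1.0, 0.95, 1.05, 1.0, 0.98, 1.02]
noisy_mean = dp_average(updates, 0.0, 2.0, epsilon=1.0)
```

Smaller ε means more noise and stronger privacy, which is exactly the privacy/accuracy trade-off discussed later in this article.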

Adversarial training
Both real samples and adversarial samples are used as the training set to train the final model. Adversarial training is applicable to a variety of supervised problems [16]. It lets the model learn the characteristics of adversarial samples during training and improves the model's robustness.
Adversarial attacks can help malware evade detection and can generate poisoned samples. By attack environment, adversarial attacks are divided into black-box and white-box attacks. By perturbation strength, they are divided into L-infinity-norm, L2-norm, and L0-norm attacks. Attack methods such as the Least-Likely-Class Iterative Method [17] and DeepFool [18] are now being actively studied.
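The norm-based taxonomy above measures the perturbation delta = x_adv − x in different ways; the vectors below are invented to illustrate what each norm captures.

```python
def perturbation_norms(x, x_adv):
    """Return (L-infinity, L2, L0) norms of the perturbation x_adv - x:
    max change, Euclidean size, and number of changed coordinates."""
    delta = [a - b for a, b in zip(x_adv, x)]
    linf = max(abs(d) for d in delta)
    l2 = sum(d * d for d in delta) ** 0.5
    l0 = sum(1 for d in delta if d != 0)
    return linf, l2, l0

x = [0.2, 0.5, 0.9, 0.0]
x_adv = [0.3, 0.5, 0.8, 0.0]   # perturbs only two coordinates, each by 0.1
linf, l2, l0 = perturbation_norms(x, x_adv)
```

An L-infinity attack bounds the largest per-coordinate change, an L2 attack bounds the overall Euclidean distortion, and an L0 attack bounds how many coordinates (e.g., pixels) may change at all.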

Future Directions
Federated learning can realize training and computation on feature data in the cloud while keeping each party's data confidential locally, and in the future can be widely applied in environments such as telemedicine [19].
Privacy-protection algorithms: privacy-preserving algorithms should be designed and analyzed both theoretically and empirically. In addition, the trade-off between privacy level and convergence speed needs further study [20].
Communication security: because of the exposed nature of wireless media, wireless federated learning is vulnerable to denial-of-service (DoS) and jamming attacks [21].
Heterogeneous models: model architectures differ across application scenarios and data volumes [22].

Conclusion
In this article, we outlined federated learning, a machine learning paradigm that learns directly from user data in a decentralized way while protecting the privacy of local data, with statistical models trained at the edge of a distributed network. However, data privacy protection issues remain. We described the security and technical issues of privacy protection in detail, and finally discussed the challenges of privacy protection and security and future research directions.
Federated learning is a new paradigm for big data applications and a new way to solve the problem of data privacy protection. With the advent of the 5G era, it can further unite financial institutions, the Internet, medical care, education, smart cities, and other sectors, highlighting the advantages of federated learning.