Encrypted machine learning of molecular quantum properties

Large machine learning models with improved predictions have become widely available in the chemical sciences. Unfortunately, these models do not protect the privacy necessary within commercial settings, prohibiting the use of potentially extremely valuable data by others. Encrypting the prediction process can solve this problem through double-blind model evaluation, which prohibits the extraction of training or query data. However, contemporary ML models based on fully homomorphic encryption or federated learning are either too expensive for practical use or have to trade security for speed. We have implemented secure and computationally feasible encrypted machine learning models using oblivious transfer, enabling secure predictions of molecular quantum properties across chemical compound space. However, we find that encrypted predictions using kernel ridge regression models are a million times more expensive than without encryption. This demonstrates a dire need for a compact machine learning model architecture, including molecular representation and kernel matrix size, that minimizes model evaluation costs.


I. INTRODUCTION
The global amount of information has grown exponentially over time. Seagate, a large data storage company, projects it to reach 181 zettabytes by 2025 1. Countless machine learning (ML) applications are based on this wealth of data, as reflected in a rapidly growing number of ML publications 2. Still, especially sensitive data is not publicly accessible, preventing ML innovations in these fields. The main issue is that the evaluation of ML models is not double-blind: The user querying the ML model can gather information about the training set and discloses all information about the query. The holder of the database can accumulate huge amounts of user query data, posing a threat if the server is attacked by a third party. This is especially relevant considering the fast-growing use of cloud computing 3 and the fast-growing number of cyber attacks. End-to-end encryption cannot solve this issue, as data is usually processed in unencrypted form.
* anatole.vonlilienfeld@utoronto.ca
Decisions based on knowledge derived from protected data without revealing any data would warrant immediate benefits: Potentially relevant fields include modeling health data or sharing predictions evaluated on protected databases. Furthermore, double-blind ML evaluation may reduce customers' hesitations to send sensitive medical data to the cloud, allowing, for instance, more personalized health recommendations without giving away private information. From the viewpoint of chemistry and medical sciences, a potential application is commercial data from pharmaceutical companies, since a considerable amount is invested in various screening approaches (in vivo and in vitro). While the collected data sets are relevant for developing new pharmaceuticals, they are generally not published. Currently, nondisclosure agreements are the only way for the chemical industry to provide academia with protected data. However, this comes with legal and economic risks as well as bureaucratic barriers. To give an idea of the value of privacy: A ballpark estimate of post-approval R&D costs of a new drug ranges from $1.8 to $2.8 billion, where much of these costs are clinical trials [4][5][6]. In particular, toxicological assays are critical for the development of drug candidates 7,8, new nanomaterials 9, or pesticides, but can take many years and millions of dollars 10,11 to complete. These time and cost constraints are a substantial bottleneck for the innovation of new substances. An additional aspect is that the emergence of ML has made modern research more dependent on access to high-quality datasets. Compiling new impactful datasets is only possible with the required funding to perform lab experiments. Public data often originates from several different sources, resulting in inconsistencies 12. Access to predictions based on secret but single-origin high-quality measurements could help mitigate these problems. Driven by this vision, a growing number of 
institutions is considering using ML models that allow hiding training data, as can be seen in recent projects such as the MELLODDY initiative 13. Multiple computational approaches for privacy-preserving ML have been developed; a popular example is federated learning [14][15][16][17][18][19][20], where several data sets from different data holders are used to compute a local gradient for subsequent updates of a global model. In particular, Zhu et al. 21 have implemented federated learning for molecular properties. Despite many advantages, if not properly addressed, federated learning can show several security risks. This is particularly the case when participants are allowed to deviate from the predefined machine learning protocol (in a malicious adversary setting). When training a federated learning model, each potentially malicious participant can send false data on purpose 22 to prevent learning of the global model 23-25. Furthermore, in an iterative procedure, any participant could compare the last global model with the previous state. This allows probing where the update of other data holders had the greatest impact to detect points that likely exist in the other data sets. In certain scenarios, federated learning models allow unnoticed extraction of training data 26.
FIG. 1. Bob the user B, working in the health, corporate, academic, or public sector, wants to predict properties of query molecules: First, Alice A, holder of a large database, and B agree on a molecular representation and the EML protocol. Next, A trains EML on her local node with unencrypted calculations. Subsequently, both parties jointly compute the query prediction by exchanging encrypted information without disclosure. Finally, the prediction is revealed to B while A will never see the query data. Conversely, B cannot extract any information about the hidden reference data owned by A. All protocol steps are privacy-preserving and both parties effectively form an encrypted double-blind ML oracle.
In this study we achieve double-blind ML prediction of molecular quantum properties by competitive cooperation, a.k.a. coopetition: Two competing parties that do not trust each other cooperate in exchanging encrypted pieces of information to evaluate the ML model, as illustrated in Fig. 1. Our encrypted ML (EML) protocol ensures that the data holder maintains access control to the model at all times. More specifically, we have considered a two-party setting with Alice A, the data holder, and Bob B, the user querying the machine learning model. We will keep this color highlighting consistent throughout the following. Neither A nor B reveals private data when querying the oracle.
To summarize the key properties and the threat model of the algorithm: No central server is needed, since oblivious transfer removes the need for a central entity as required in federated learning. Only the party that owns the data has access to the weights, and we do not provide the variance of predictions because we only use a single model split. The algorithm is based on encryption and is safe against a dishonest majority 27. The amount of data that can be recovered from a single prediction depends on what an adversary already knows about the individual whose privacy is at risk. Reconstruction attacks on training data have had limited success, and researchers have focused mainly on membership inference 28, which can be used as a basis for reconstruction attacks. It is also possible to extract memorized private information from deployed language models 29. Regarding our approach, we assume that there is a secure communication channel between the two parties. Given the security of the oblivious transfer protocol, the data owner cannot learn anything about the query. The querying party, however, might send designed queries in an attempt to reconstruct the decision boundary of the kernel ridge regression algorithm based only on the predicted values. While we have not found an example of such an attack in the literature, we cannot rule out that an attack can be constructed in this manner.
Our solution to double-blind evaluation consists of an encrypted ML oracle based on kernel ridge regression 30,31 and gives a single scalar value per query.Testing encrypted predictions for molecular properties reveals that the results of unencrypted calculations are exactly reproduced.We find that compact ML representations are superior in terms of cost per prediction and show higher numerical stability.

A. Oblivious transfer versus fully homomorphic encryption
Ideas of encrypted calculations of arbitrary functions were first hypothesized in the late 70s 35. The archetypal problem solved in this context was Yao's Millionaires' problem: Two wealthy individuals with amounts of money $x_1$ and $x_2$ want to know whether $x_1 > x_2$ is true or false without revealing the exact amounts $x_1$ and $x_2$ 32,33. However, the extension allowing encryption of any calculation took until 2009, when the first algorithm for a fully homomorphic encryption scheme was described. This allows fully encrypted addition and multiplication of any numbers and, as such, evaluation of any real function $f$. To explain what is meant by computation on encrypted data, we illustrate the addition of two numbers $x_1$ and $x_2$. The addition is performed with the encrypted representations, or ciphertexts, $c_1$ and $c_2$ of the numbers, obtained by encryption $E$ with a public key $p_k$. The first ciphertext is $c_1 = E(p_k, x_1)$ and the second $c_2 = E(p_k, x_2)$. The decryption $D$ of the addition using the secret key $s_k$ results in the correct number as follows,

$$D(s_k, c_1 + c_2) = x_1 + x_2.$$

Such encrypted calculations are called fully homomorphic encryption 36: fully because any function can be evaluated, and homomorphic, meaning same shape, because fully homomorphic encryption conserves relations between numbers in the encrypted space. A benefit of fully homomorphic encryption is that it does not require communication between the parties that own the private data. Computations are performed offline. This may also be viewed as a disadvantage, since parties cannot submit discovery-based requests where ad hoc access to results is necessary. A downside of fully homomorphic encryption is that computations are quite expensive. Furthermore, a central server is needed to perform the calculations, and only after receiving all encrypted information at once.

An alternative method for privacy-preserving function evaluation is multi-party computation 32,33,37. In the case of two parties, multi-party computation reduces to two-party computation (see Fig. 2). To perform an encrypted calculation with a public function $f$, the parties Alice and Bob exchange encrypted chunks of data without disclosing anything about their private inputs A and B.
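The relation $D(s_k, c_1 + c_2) = x_1 + x_2$ can be made concrete with a toy example. The sketch below uses the Paillier cryptosystem, which is only additively homomorphic (not fully homomorphic) and is not the scheme used in this work; it is chosen here purely because it fits in a few lines. The primes, the function names, and the fixed random seed are illustrative assumptions, and the key size is far too small to be secure. Note that in Paillier the abstract "addition" of ciphertexts is realized as a modular multiplication.

```python
import math
import random

def keygen(p, q):
    """Toy Paillier key pair from two small primes (insecure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse; valid for generator g = n + 1
    return n, (lam, mu)   # public key p_k = n, secret key s_k = (lam, mu)

def encrypt(n, x, rng):
    """c = E(p_k, x) = (n + 1)^x * r^n mod n^2 with random r coprime to n."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, x, n2) * pow(r, n, n2)) % n2

def decrypt(n, sk, c):
    """x = D(s_k, c) = L(c^lam mod n^2) * mu mod n, with L(u) = (u - 1) / n."""
    lam, mu = sk
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

def add_encrypted(n, c1, c2):
    """Homomorphic addition: multiplying ciphertexts adds the plaintexts."""
    return (c1 * c2) % (n * n)

rng = random.Random(0)            # fixed seed only for reproducibility
n, sk = keygen(61, 53)            # hypothetical toy primes
x1, x2 = 1200, 345                # x1 + x2 must stay below n
c1, c2 = encrypt(n, x1, rng), encrypt(n, x2, rng)
total = decrypt(n, sk, add_encrypted(n, c1, c2))
print(total)
```

The party performing `add_encrypted` never sees `x1`, `x2`, or the secret key; only the key holder can decrypt the sum.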
The exchange of data packages is conducted via oblivious transfer 34. An oblivious transfer protocol consists of at least one sender and a receiver. In each round of communication, the sender offers the receiver more information packages than the receiver actually obtains. The sender remains oblivious to which piece of information was obtained by the receiver.
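To illustrate the idea of a sender who does not learn which package was received, the following is a minimal sketch of a textbook 1-out-of-2 oblivious transfer built on RSA. It is not the oblivious transfer variant used by MASCOT, the RSA parameters are toy values that offer no security, and all function names are hypothetical. It assumes an honest-but-curious setting and messages smaller than the modulus.

```python
import random

# Sender's toy RSA key (insecure size, for illustration only)
P_PRIME, Q_PRIME = 61, 53
N = P_PRIME * Q_PRIME
E_PUB = 17
D_PRIV = pow(E_PUB, -1, (P_PRIME - 1) * (Q_PRIME - 1))

def sender_offer(rng):
    """Sender publishes two random values, one per message."""
    return rng.randrange(N), rng.randrange(N)

def receiver_choose(b, x0, x1, rng):
    """Receiver blinds the chosen random value with an encrypted secret k."""
    k = rng.randrange(N)
    v = ((x1 if b else x0) + pow(k, E_PUB, N)) % N
    return v, k

def sender_respond(m0, m1, x0, x1, v):
    """Sender derives two candidate keys; only one equals the receiver's k."""
    k0 = pow((v - x0) % N, D_PRIV, N)
    k1 = pow((v - x1) % N, D_PRIV, N)
    return (m0 + k0) % N, (m1 + k1) % N

def receiver_recover(b, c0, c1, k):
    """Receiver unmasks only the message matching the choice bit b."""
    return ((c1 if b else c0) - k) % N

def oblivious_transfer(m0, m1, b, rng):
    x0, x1 = sender_offer(rng)
    v, k = receiver_choose(b, x0, x1, rng)
    c0, c1 = sender_respond(m0, m1, x0, x1, v)
    return receiver_recover(b, c0, c1, k)

rng = random.Random(1)  # fixed seed only for reproducibility
print(oblivious_transfer(42, 99, 0, rng), oblivious_transfer(42, 99, 1, rng))
```

The sender sees only the blinded value `v`, which looks random for either choice bit, while the receiver can unmask just one of the two responses.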
During the oblivious transfer evaluation of the function, the roles of the receiver and sender are frequently interchanged, but at no point does any party have enough information to reconstruct intermediate results. Remarkably, this rather unintuitive way of exchanging information allows encrypted evaluation of any real function 38. The key advantage of two-party computation via oblivious transfer, and in particular of the protocol called malicious arithmetic secure computation with oblivious transfer 27 (MASCOT), is the small computational cost compared to fully homomorphic encryption and other implementations of multi-party computation. MASCOT provides security against a dishonest majority of attackers with malicious intent. As for all multi-party computation algorithms, continuous communication between all involved parties is needed, so the transfer of data is the main computing bottleneck. In MASCOT, floating-point numbers are translated into a finite integer representation. To avoid overflow errors, the numerical precision P (see the detailed explanation of P in SI Sec. C) can be increased to allow representing larger numbers with better resolution.
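The translation of floating-point numbers into a finite integer representation can be sketched as a generic fixed-point encoding. The snippet below is a simplified illustration and not the MASCOT implementation; `frac_bits` and `total_bits` are hypothetical names for the number of fractional bits and the total bit length. It shows both the finite resolution and the wrap-around (overflow) that occurs when the total bit length is too small for the value.

```python
def encode(x, frac_bits, total_bits):
    """Map a float to an integer in [0, 2**total_bits) (two's complement)."""
    return round(x * (1 << frac_bits)) % (1 << total_bits)

def decode(v, frac_bits, total_bits):
    """Invert encode(): interpret the upper half of the range as negative."""
    if v >= 1 << (total_bits - 1):
        v -= 1 << total_bits
    return v / (1 << frac_bits)

# Resolution: the round trip is accurate to half of 2**-frac_bits.
ok = decode(encode(3.14159, 16, 32), 16, 32)

# Overflow: 40000 needs more than the 15 integer bits left in a 32-bit word
# with 16 fractional bits, so the value wraps around to a negative number.
wrapped = decode(encode(40000.0, 16, 32), 16, 32)

# A larger total bit length represents the same number without overflow.
fixed = decode(encode(40000.0, 16, 64), 16, 64)
print(ok, wrapped, fixed)
```

Increasing the bit lengths removes the overflow and improves resolution, at the price of larger integers that have to be exchanged between the parties.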

B. Encrypted kernel ridge regression
Alice A holds secret training data and collaborates with Bob B, the user, by providing encrypted ML (EML) predictions for his queries. B should not be able to learn anything about the training set, and A should not learn anything about the query of B. Only the prediction is sent to B, while the calculations cannot be inspected or manipulated by either party. We address this problem by encrypting the ML predictions using the MASCOT protocol discussed in the previous section. All following mathematical expressions are colored according to access to the respective data before, during, or after the encrypted prediction. Setting up the ML oracle can be separated into three steps, shown in Fig. 3.

First, both parties agree on the same mathematical form to represent the data. In the case of movie preferences, this could be a vector that contains location and age. For cloud-based services, it could be private data such as blood pressure, blood sugar, or pulse. For secret new drug-like molecules, we use molecular representation vectors such as the Coulomb matrix 39 (CM) or FCHL19 40,41, which require three-dimensional nuclear coordinates and charges. Note that FCHL19 is a local representation that allows one to compare atomic environments between different molecules with each other. However, for demonstration purposes and because it allows direct timing benchmark comparisons, we treat FCHL19 as a flattened global representation vector like the CM.

Secondly, both parties agree on an ML protocol $f$, here kernel ridge regression 30,31. Kernel ridge regression is a supervised learning method in which, for each prediction, the features of the query instance are compared against all training instances and weighted by regression coefficients. In the next phase, A trains a hidden ML model on her local machine. A locally computes the input representation vectors $X_i$, which can correspond to any set of features that correlate well with the quantity of interest $y$. The kernel ridge regression weights $\alpha$ are obtained by solving a system of equations,

$$\alpha = (K + \lambda I)^{-1} y.$$

All quantities in the above equation are known to A, notably the values $y$ of the hidden data. The elements of the training kernel matrix $K$ are computed with Gaussian functions,

$$K_{ij} = \exp\left(-\frac{\lVert X_i - X_j \rVert_2^2}{2\sigma^2}\right),$$

where the elements $i, j$ are contained in the hidden training set and $\lVert \cdot \rVert_2$ denotes the Euclidean norm. The hyperparameter $\sigma$ is shared with B, while $\lambda$ is kept private. Next, B calculates the representation vector $X_Q$ of the query entities on his local machine. Then the training weights, the input representation vectors, and the query representations are encrypted to $E(\alpha)$, $E(X_i)$, and $E(X_Q)$ (recall that $E$ is encryption and $D$ decryption). This process takes place during a preprocessing phase after establishing a secure communication channel between the two parties. In practice, we perform all calculations using a virtual network on a single machine. Subsequently, the following expression for the encrypted kernel ridge regression prediction is evaluated,

$$E(y_Q) = \sum_{i=1}^{N} E(\alpha_i) \exp\left(-\frac{\lVert E(X_i) - E(X_Q) \rVert_2^2}{2\sigma^2}\right).$$

It is essential that the kernel values are not known to either party. Otherwise, participants could probe kernel elements $k$ by repeatedly querying the oracle to obtain the compound space spanned by the training molecules. For the same reason, the distances between the training set and the query molecule 42 are never disclosed. Next, B may evaluate a few encrypted samples to validate the consistency of the hidden predictions. If found to be necessary, A can increase the training set size or data diversity in the hope of improving the accuracy of the model.

In the prediction phase, $f$ is evaluated via oblivious transfer without disclosing $E(\alpha_i)$, $E(X_i)$, or $E(X_Q)$. Finally, the decrypted plaintext predictions are sent to B, while A could obtain a reward in the form of a payment for the prediction provided. Effectively, both parties are part of an ML oracle with a true black-box character.
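For reference, the unencrypted kernel ridge regression that the encrypted protocol is meant to reproduce can be sketched in a few lines of plain Python. This is a minimal illustration with a Gaussian kernel of the form $\exp(-\lVert X_i - X_j \rVert_2^2 / 2\sigma^2)$, not the quantum machine learning code used for the results; the one-dimensional toy data, the hyperparameter values, and all function names are assumptions made for the example.

```python
import math

def gaussian_kernel(x, y, sigma):
    """k(x, y) = exp(-||x - y||_2^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def train(X, y, sigma, lam):
    """Weights alpha from (K + lam * I) alpha = y (done locally by A)."""
    n = len(X)
    K = [[gaussian_kernel(X[i], X[j], sigma) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(K, y)

def predict(X_train, alpha, x_query, sigma):
    """y(X_Q) = sum_i alpha_i k(X_i, X_Q)."""
    return sum(a * gaussian_kernel(xi, x_query, sigma)
               for a, xi in zip(alpha, X_train))

# Hypothetical toy data: y = x^2 on four training points
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0.0, 1.0, 4.0, 9.0]
sigma, lam = 1.0, 1e-10
alpha = train(X_train, y_train, sigma, lam)
print(predict(X_train, alpha, [1.5], sigma))
```

In the encrypted setting only the `predict` step is evaluated jointly; training stays on the data holder's machine, so the cost per query is dominated by the N kernel evaluations over vectors of length L.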
We use learning curves to quantify the error of the predictions w.r.t. the reference values, measured as the mean absolute error (MAE), as a function of the size of the training set N. Learning curves are helpful to understand the efficiency of ML models and are generally found 30 to be linear on a log-log scale,

$$\log(\mathrm{MAE}) \approx I - S \log(N),$$

where I is the initial error and S is the slope indicating the improvement of the model given more training data.
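The two learning-curve parameters can be extracted with an ordinary least-squares fit on the log-log scale. The sketch below assumes the common form $\log(\mathrm{MAE}) \approx I - S \log(N)$ with natural logarithms; the training set sizes and the synthetic errors are made up for illustration and are not results from this work.

```python
import math

def fit_learning_curve(train_sizes, maes):
    """Least-squares fit of log(MAE) = I - S * log(N); returns (I, S)."""
    xs = [math.log(n) for n in train_sizes]
    ys = [math.log(m) for m in maes]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    slope = cov / var
    intercept = y_mean - slope * x_mean
    return intercept, -slope  # S is reported as a positive improvement rate

# Synthetic example: MAE = 10 * N^(-1/2)
sizes = [128, 256, 512, 1024, 2048]
synthetic_maes = [10.0 * n ** -0.5 for n in sizes]
I, S = fit_learning_curve(sizes, synthetic_maes)
print(I, S)
```

A larger S means that each doubling of the training set reduces the error by a larger factor, which matters here because the cost of an encrypted prediction grows with N.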

III. RESULTS AND DISCUSSION
A. Encrypted Kernel predictions: malicious security for computational chemistry
Next, we demonstrate encrypted ML predictions for fictitiously confidential chemical data. Predicting the stability of molecules is a key problem in computational chemistry. Stability is well described by atomization energies, the energy contained in all bonds of a molecule, which can be obtained by solving the Schrödinger equation. However, solving the Schrödinger equation comes at high computational costs: For instance, the cost of a density functional theory calculation scales with the cube of the number of atoms. To give a very rough estimate, computing a molecular dataset with ∼ 20000 molecules of the size of aspirin with coupled cluster singles and doubles 43 scaling with the seventh power of system size would consume 20000 CPU hours 44, even for a relatively small basis set such as def2-SVP 45. Such high computational costs underline the value of high-level computational data. As a potential scenario, we consider company A providing encrypted ML predictions and an industrial customer B with interest in 20 secret molecules but without the time, experience, or access to software to perform the calculations. We have computed learning curves of atomization energies using encrypted predictions with the QM9 database 46 of organic molecules with up to N = 8192 compounds. The resulting learning curves using the CM 39 and FCHL19 40,41 representations are shown in Fig. 4. The deviation from the unencrypted case only amounts to numerical noise and cannot be identified visually in the learning curves. Hence, we find that EML accurately reproduces unencrypted predictions.

As expected, time t, as well as data traffic D per prediction, increases linearly with the number of training points N (see Fig. 5). Furthermore, there is a striking difference between the representations: FCHL19 40,41 with L = 18720 entries takes more than twice as long (1 hour at N = 128) for a single prediction as the CM 39 (L = 351). In contrast to FCHL19, the CM representation contains less information, i.e. no angles or local environments, resulting in a larger MAE. Overall, data transfer D between parties is the main computational bottleneck for prediction 27, explaining the near-perfect correlation between D and t (see Fig. 5). Consequently, compact representations such as the CM reduce the prediction time by reducing D.

The role of compact representations becomes more evident when studying QM9 learning curves (see Fig. 4) for lower numerical precision settings corresponding to faster predictions. For high numerical precision (P = 42) there is hardly any visible difference between the EML and kernel ridge regression learning curves (as in Fig. 4). At P = 15, we find that the FCHL19 EML learning curve shows a dramatic deterioration for N ≥ 128, while the CM learning curve only begins to deviate at N > 2000. Although compact representations include less chemical information, they allow for larger training set sizes, given the same target accuracy, as well as high numerical stability. If using representation vectors such as FCHL19 cannot be avoided because predictions with high accuracy w.r.t. the test set are needed, the numerical precision P can be increased to avoid numerical instabilities. Fortunately, there exists an optimal P with minimal computational cost and sufficient numerical precision. This is because t increases only quadratically with P, while the numerical deviation decays exponentially (SI Fig. 3d).
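The reported scaling behavior (linear in N, linear in the representation length L, quadratic in P) can be turned into a rough back-of-the-envelope extrapolation from a single reference timing. The helper below is hypothetical and assumes that the three dependencies simply multiply, which is a simplifying assumption and not a fitted model from this work; actual timings depend on hardware and network.

```python
def extrapolate_time(t_ref, n_ref, n, l_ref=1.0, l=1.0, p_ref=1.0, p=1.0):
    """Scale a reference prediction time assuming t ~ N * L * P^2.

    t_ref is a measured time at training set size n_ref, representation
    length l_ref, and numerical precision p_ref (all user supplied).
    """
    return t_ref * (n / n_ref) * (l / l_ref) * (p / p_ref) ** 2

# Reference point quoted in the text: about 1 hour per FCHL19 prediction
# at N = 128. Extrapolated (not measured) time at N = 8192, same L and P:
hours_at_8192 = extrapolate_time(1.0, 128, 8192)

# Doubling the precision alone quadruples the estimate under this assumption.
factor_for_double_precision = extrapolate_time(1.0, 128, 128, p_ref=15, p=30)
print(hours_at_8192, factor_for_double_precision)
```

Such an estimate is only meant to convey orders of magnitude, for example when deciding whether a more compact representation is worth a loss in accuracy.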

B. Limitations and attack scenarios
The user B can query the oracle with points for which reference values are known. A small error for the predicted values would suggest that similar points exist in the hidden training set. This attack will probably not be a threat, as it may require more points than are contained in the hidden training set. On the other hand, this procedure can reassure B that the hidden model provides reasonable predictions and that A has not deliberately added incorrect training points. If B knew the scaling rule of the kernel ridge regression ML oracle and the time needed per prediction, B might be able to guess the number of hidden training molecules. To address this issue, A could artificially increase the training set by adding a random number of duplicate training points. It is important to note that the ML oracle can only be trained and evaluated using a single training set split. Otherwise, the evaluation of ML models with different splits would leak the variance in addition to the predictions, which would enable attacks 47,48.
An inherent problem of neural networks trained with hidden data is that the loss function gradient vanishes for training set points. In addition, generative adversarial networks are used to reverse engineer points in the training set 26,49,50. Although our approach guarantees safety, this comes with increased computational costs compared to unencrypted calculations. In turn, we find that honest-but-curious neural network predictions are orders of magnitude faster, since the prediction speed does not depend on the number of training points (see SI Sec. IV). However, the neural network protocol we have considered in the SI may not be safe against malicious attacks. Our MASCOT implementation of kernel ridge regression is the exact opposite in these two regards: Evaluation is relatively slow but secure. We find that encrypting ML predictions is a trade-off between security and computational efficiency.

IV. CONCLUSION
The main advantage of our protocol is its safety against attackers with malicious intent, as it is impossible to extract any molecular information, either from training or query instances, solely by evaluating encrypted predictions.
The protocol eliminates the need for a trusted third party or central server, as required by fully homomorphic encryption. Instead, it requires only a secure communication channel between the two parties. Since the protocol is online, no transfer of all the encrypted data to a single server is needed, contrary to fully homomorphic encryption. This also allows live predictions for new query molecules. The latter aspect is important for Bayesian exploration of chemical space, e.g. in the context of self-driving laboratories 51 that would require ad hoc predictions. We demonstrated that encrypted predictions of molecular properties based on EML are possible, cf. Fig. 4. EML can be adapted to various properties and chemistries with negligible adaptation of the encrypted kernel ridge regression protocol. Since EML does not require molecular representations as input, it may also be applied to pharmaceutical and private data from healthcare or finance. Our implementation was only possible thanks to recent developments in multi-party computation protocols 27. We note that added security comes at substantial additional computational costs, with data transfer being the main computational bottleneck. Consequently, the compactness of the ML model, in the case of EML the kernel and the molecular representation, plays a crucial role. More specifically, we have demonstrated that verbose molecular representation vectors such as FCHL19 40,41 allow for more accurate predictions than the more compact Coulomb matrix 39. As a result, users have to trade off the cost, accuracy, and security of the protocol. Our ballpark estimates indicate that a single molecular EML prediction is a million times more expensive than kernel ridge regression implemented in Python code (see Fig. 5). For instance, approximately 250 GB of network traffic is needed for a single prediction at a modest training set size of N = 512 using the extended-connectivity fingerprint 52,53, which is often used in cheminformatics. Since there is a growing interest in maintaining privacy in ML, it can be expected that future implementations of oblivious transfer will become much more efficient.
A goal of encrypted predictions is to enable decisions based on hidden data as if the knowledge leading to their actions was obtained by inspecting the secret data.However, since the prediction is encrypted, it is impossible to explain the actions that are solely based on the predictions.This lack of transparency may be problematic, as the model could have biases that cannot be explained by users unable to inspect the training data.It is an open question how encrypted predictions can be rationalized without inspecting the training set.One possible approach might be to understand the general behavior of the encrypted model without access to the underlying data, providing insight into the factors influencing the system's predictions.

V. SUPPLEMENTAL MATERIAL
See supplemental material (SI) for more information on the numerical stability of the EML protocol and how numerical noise can be mitigated (SI Fig. 1, 2). We also measure the quantitative influence of the representation length on the prediction time (SI Fig. 4). In the SI we also discuss encrypted neural network predictions of quantum properties based on weaker security (see SI Fig. 5, 6, 7, 8).

VI. DATA AND CODE AVAILABILITY
The EML code, a high-level interface to the multi-party computation protocol MASCOT 27,54, is available in the public GitHub repository at https://github.com/janweinreich/EML/. The repository also contains scripts for preparing the molecular quantum data for use with the EML protocol and an example of a neural network implementation of molecular property predictions using oblivious transfer. It also contains a permanent link to the input data to reproduce learning curves and encrypted predictions.

A. Dataset
To demonstrate the protocol for molecular property prediction, we use 30000 random molecules of the QM9 46 data set, randomly split into a training set and a test set of 20 molecules. We predict the atomization energies, which measure the total energy necessary to dissociate a molecular compound into individual atoms. Hyperparameters are optimized with five-fold cross-validation for the largest training set size using unencrypted calculations with the quantum machine learning code 55. The encrypted oblivious transfer calculations were performed in a local network with Intel(R) Xeon(R) E5-2650 v4 @ 2.20GHz CPUs. The reported timings may differ depending on the hardware.

VIII. THIRD PARTY MATERIAL
In Figs. 1, 2, and 3 we included icons and modified them with permission under the license https://fontawesome.com/license.

IX. DECLARATION OF CONFLICTING INTERESTS
We have no conflicting interests to declare.

X. SUPPLEMENTAL INFORMATION
A. Numerical Error and Precision
To control the numerical error, the precision of the encrypted calculations can be adjusted through the integer representation of floating-point numbers in the oblivious transfer protocol. The precision is defined by two parameters 56: The first is the bit length of the decimal part, P. The second is the whole bit length of the fixed-point number, M. For simplicity, we summarize both in a single numerical precision parameter. Generally, the average error over all encrypted predictions y w.r.t. the test labels must be minimized. Due to the additional numerical error of encrypted calculations, it must also be guaranteed that all predictions agree with Python kernel ridge regression predictions. To measure the numerical error only due to the encryption protocol, we define the average numerical deviation as follows,

$$\Delta = \frac{1}{N_\mathrm{test}} \sum_{i=1}^{N_\mathrm{test}} \left| y_i^{\mathrm{EML}} - y_i^{\mathrm{KRR}} \right|, \qquad (8)$$

where $y_i^{\mathrm{EML}}$ and $y_i^{\mathrm{KRR}}$ denote the encrypted and the plaintext kernel ridge regression prediction for test molecule $i$. In Fig. 6 we show the numerical error ∆ as a function of numerical precision P and in Fig. 7 for different training set sizes N, respectively. All other parameters are kept constant. We also show how the prediction time t scales with the numerical precision P in Fig. 8.

B. Length of Input Features
We show the prediction time as a function of representation length L (see Fig. 9) using truncated FCHL19 40,41 molecular representation vectors. We find a linear scaling between L and prediction time t. This is plausible given the element-wise distance evaluation of the Euclidean norm. The numerical error ∆, defined in Eq. (8), for different values of L is shown in Fig. 7.

Encrypted predictions
We consider encrypted predictions in an honest-but-curious security scenario with encrypted neural networks. In contrast to the MASCOT 27 kernel ridge regression implementation, here data privacy is only guaranteed if both parties strictly follow the appointed protocol $f$. We use a different oblivious transfer protocol based on CrypTen 59 and PyTorch 60. As before, we test the two-party case with A and B. The workflow is the same as for the encrypted kernel ridge regression discussed in the manuscript. The main difference is that the target function $f$ is predicted with a neural network. A neural network consists of a large number of so-called artificial neurons inspired by biological neural networks 61. Neurons are assembled in layers, and their connections are called edges, forwarding information from the input to the output layer. The weights determine the signal strength between neurons. Training of a neural network reduces the error of the neural network predictions w.r.t. the training labels. In this process, called backpropagation, the change in average error is traced back to the individual weights of the neural network. Subsequently, the weights are adjusted to reduce the overall error in an iterative process. The architecture of the neural network consists of eight fully connected rectified linear unit layers with $n_\mathrm{neurons} = 250$ neurons each and one linear output layer. Each feature of the BoB 57 molecule representation vector corresponds to one input neuron. Training is performed with unencrypted calculations. Subsequently, the neural network and the query molecules are encrypted, and the encrypted network is used to predict the properties of the encrypted query molecules. Due to the higher computational efficiency of this implementation, we calculated a learning curve for the complete QM9 46 dataset. We find excellent agreement between the learning curves resulting from the Python implementation and encrypted predictions (see Fig. 13). In this case, the neural network slightly outperforms the learning curve of an unencrypted kernel ridge regression model with a Gaussian kernel function. Each encrypted prediction takes about 1.3 seconds when using $n_\mathrm{neurons} = 250$ neurons per layer. Testing the prediction time as a function of the number of neurons for a fixed number of layers reveals (see SI Fig. 5a) quadratic scaling with a favorable scaling constant: A prediction with $n_\mathrm{neurons} = 3000$ neurons per layer (eight layers total) takes 10.4 seconds.
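One plausible way to rationalize the quadratic scaling with the layer width is to count the multiply-accumulate operations of a forward pass through the fully connected architecture described above. The sketch below is a simple counting argument and not a timing model; the input length of 1000 is a hypothetical placeholder, since the length of the BoB vector is not stated here, and biases and activation functions are ignored.

```python
def dense_macs(n_inputs, n_neurons, n_hidden_layers=8):
    """Multiply-accumulate count for one forward pass.

    Architecture: n_inputs -> n_hidden_layers fully connected layers of
    width n_neurons -> one linear output neuron.
    """
    first = n_inputs * n_neurons                    # input to first layer
    hidden = (n_hidden_layers - 1) * n_neurons ** 2  # layer-to-layer terms
    output = n_neurons                               # last layer to output
    return first + hidden + output

n_inputs = 1000  # hypothetical representation length
macs_250 = dense_macs(n_inputs, 250)
macs_3000 = dense_macs(n_inputs, 3000)
print(macs_250, macs_3000)
```

For wide layers the term proportional to the square of the layer width dominates, while the count is independent of the number of training points, in contrast to kernel ridge regression.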

Secret data comparison
A and B want to measure the chemical overlap between their respective secret data domains without revealing individual data points. While this might reveal some information about the chemical domain of interest, the degree of information leakage revealed by the overlap can be controlled by the number of function evaluations that both parties have to agree on. To implement the secret data comparison, we use an encrypted encoder. Autoencoders are artificial neural networks that learn dense encodings of input data and do not require data labels. Encodings usually have a much smaller dimension than the original input dimension, making autoencoders well-suited for dimensionality reduction. An autoencoder is made up of an encoder, whose layers map to latent space, and a decoder network that decodes the latent representation of the input data. The output dimension of the decoder is identical to the input dimension of the encoder. Training the autoencoder aims to reduce the decoder reconstruction error. An encoder with a single two-dimensional linear layer and the mean-squared error as a loss function will qualitatively resemble the first two components of principal component analysis 61, maximizing the variance of the encoding in two dimensions. To implement the encrypted encoder, we use the same honest-but-curious multi-party computation framework for neural networks based on CrypTen 59 as before. Apart from data comparison, another possible application of such an encoder would be to reduce the dimension of the previously studied representations, leading to higher numerical stability and faster predictions by encoding into two dimensions. Since the goal of the encoder-decoder network is to maximize the variance in two dimensions, the two-dimensional encoding (see Fig. 12a) results in a distribution of points similar to that resulting from principal component decomposition (see Fig. 12b).

FIG. 2. Parties Alice and Bob with private data A and B, respectively, compute the value of a publicly known function f that depends on their private inputs. Two scenarios are considered: In the first case, Alice and Bob submit all their data to a trusted third party Carol C, who subsequently performs the required calculations. In the second case, two-party computation 32,33 allows removing C. Instead, Alice and Bob exchange encrypted E information packages via oblivious transfer 34 to evaluate the function f.

FIG. 3. Three-stage process of encrypted machine learning (EML) predictions for two parties, the data holder A providing training data and B the user submitting the query: Preparation (1), training (2) and evaluation (3). The dashed arrow indicates the encrypted exchange via oblivious transfer for evaluation of f on the private inputs.

FIG. 4. Prediction error for a random subset of QM9 46 atomization energies as a function of training set size N: Mean absolute error (MAE) from encrypted machine learning (EML). Results are shown for two different molecular representations, the Coulomb matrix 39 (CM) and FCHL19 40,41. The numerical precision of the fixed-point representation of floating-point numbers is set to either P = 15 or 42. The dashed white lines, superimposed on the P = 42 EML results, show the learning curves computed with unencrypted Python predictions for comparison.
O.A.v.L. has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No. 772834). J.W. acknowledges support from the Faculty of Physics and supervision by C. Dellago at the University of Vienna. J.W. also acknowledges support from J.G. Brandenburg, in particular for proofreading the manuscript. O.A.v.L. has received support as the Ed Clark Chair of Advanced Materials and as a CIFAR AI chair.

FIG. 10. Two-dimensional encrypted neural network (NN) auto-encoding with components E1, E2 of Bag-of-Bonds 57 representation vectors for the first 20000 QM9 46 test molecules in (a). In (b) we show the first two principal components PC1, PC2 resulting from principal component analysis using the scikit-learn Python package 58. The color coding shows the corresponding formation energies E f.
Average numerical error ∆ between encrypted machine learning (EML) and plaintext Python predictions for various accuracy settings P (cf. Eq. 7) at a training set size of N = 4097. Results for two different molecular representations, the Coulomb matrix 39 (CM) and FCHL19 40,41.