Quantum Machine Learning Framework for Virtual Screening in Drug Discovery: a Prospective Quantum Advantage

Machine Learning (ML) for Ligand Based Virtual Screening (LB-VS) is an important in-silico tool for discovering new drugs in a faster and cost-effective manner, especially for emerging diseases such as COVID-19. In this paper, we propose a general-purpose framework combining a classical Support Vector Classifier (SVC) algorithm with quantum kernel estimation for LB-VS on real-world databases, and we argue in favor of its prospective quantum advantage. Indeed, we heuristically prove that our quantum integrated workflow can, at least in some relevant instances, provide a tangible advantage compared to state-of-art classical algorithms operating on the same datasets, showing strong dependence on target and features selection method. Finally, we test our algorithm on IBM Quantum processors using ADRB2 and COVID-19 datasets, showing that hardware simulations provide results in line with the predicted performances and can surpass classical equivalents.


I. INTRODUCTION
The early stages in a drug discovery process are notoriously expensive and time-consuming, usually coupling wet lab with in-silico technologies. Virtual Screening (VS) is an important computational technique used to provide a quick and economical method for the discovery of novel therapeutics. In practice, VS searches digital libraries of small molecules to identify compounds that are most likely to bind to a target (e.g., a protein) and therefore exert a pharmacological effect (i.e. activity). In an ideal scenario, the pharmacological binding site is known, and the structure of the target is well characterised. Therefore, the VS could -in principle -be performed directly using a digital representation of the binding site.
However, the structure of the target is generally unknown, especially for emerging diseases such as COVID-19, caused by SARS-CoV-2, the coronavirus that arose in December 2019. In such cases, the information available for known active molecules (ligands) against a given target is commonly used to drive the design process. Starting from the chemical structure of known ligands, it is possible to infer hundreds of cheminformatic features that can be coupled with experimental evidence, when available, and then used to perform VS of a digital database. Such methods are known as Ligand Based VS (LB-VS). A conventional LB-VS strategy consists in using machine learning (ML) algorithms to train a classifier that can identify potential drug candidates from a digitalised library of compounds. The latter are usually characterized by means of a combination of theoretical and experimental features. Thanks to their proven efficiency in identifying patterns in unstructured data, ML algorithms such as Support Vector Classifiers (SVC) be- A database of SMILES encoded molecules is used to extract a variety of molecular features using RDKit. Through different feature reduction and selection methods, we refine the feature vectors, which are then passed on to train and test a Support Vector Classifier algorithm. We compare the performance of the SVC algorithm trained using a classical and a quantum kernel on both classical and quantum hardware, and we find cases in which the simulation of quantum algorithms/hardware running on a reduced number of features outperforms classical equivalents. We name this Prospective algorithmic Quantum Advantage (PQA) in drug discovery. With the advent of larger and better performing near-term quantum computers, PQA may lead to full quantum advantage on relevant problem instances.
superconducting quantum processors, using an ADRB2 benchmarking dataset as well as a novel dataset containing COVID-19 inhibitors [20] and reporting results in line with the numerical predictions obtained with classical numerical simulations. The proposed approach is not tied to a specific use case and can be applied to screen any digital database of active/inactive compounds against a desired target, using any third-party cheminformatics package.
Inspired by these promising results and confident in the scalability of the quantum approach, we also introduce the new concept of "Prospective Quantum Advantage" (PQA) obtained when a quantum algorithm simulated on a classical computer or executed on state-of-the-art hardware (or any combination of the two) can provide at least in some relevant instances -a tangible advantage compared to the best known classical algorithm operating with the same data sets (input information). In addition, PQA implies no evident restrictions on its extension to larger problem sizes, for which the quantum algorithms cannot be efficiently simulated on classical computers, while keeping the same -or even extendthe observed advantage. When demonstrations on larger quantum processors will become feasible, PQA will lead to a full effective Quantum Advantage.
The demonstration of PQA for a drug discovery workflow based on LB-VS is one of the main outcomes of the results presented in this work.

II. METHODOLOGY
We design a target-agnostic computational workflow to provide heuristic algorithmic evidence of quantum advantage over classical methods in LB-VS. More specifically, we use classical data to perform VS using a quantum Support Vector Classifier algorithm, hence evaluating its performance with respect to its classical counterparts. A like-for-like comparison between classical and quantum methods is fundamental to be as fair as possible in assessing the actual potential for quantum advantage.

A. Dataset
The general nature and the purpose of our classification workflow both require a realistic benchmarking dataset specifically dedicated to applied machine learning in virtual screening. Specifically, the dataset should [21][22][23][24] (1) imitate real-world screening libraries and guide ML methods to discriminate active from inactive compounds; (2) contain molecules with unambiguous experimentally measured activity against a target, i.e. with either potent/moderate activity or inactivity; (3) have a realistic ratio between active and inactive molecules; (4) contain active and inactive compounds with comparable molecular properties; (5) be chemically unbiased.
The LIT-PCBA dataset [22] encompasses all the previous points for fifteen well-characterised biological targets, and it is a natural choice for this project. Furthermore, and with respect to prior efforts [21,[25][26][27], the LIT-PCBA dataset was created specifically to test the performance of ML tasks in VS of molecules. The dataset was prepared by removal of potential false positives as well as including experimentally confirmed inactive molecules rather than seeding the database with decoy molecules (i.e. molecules assumed to be inactive).
The targets in the LIT-PCBA datasets have a realworld active-to-inactive molecules ratio, meaning that in some cases the number of confirmed active molecules is strongly imbalanced. As ML methods can be severely affected by class imbalance in a dataset [28,29], we balanced active-to-inactive ratios of each target dataset. However, for some targets in the LIT-PCBA dataset, the number of active molecules is astoundingly small (less than 30), with thousands of confirmed inactive molecules. In such cases, we padded the dataset with a 1:6 active-toinactive ratio to balance the dataset (see Appendix A).
Following the above procedure and considerations, we also choose to assess the performance of our method by screening a novel dataset containing known active and inactive COVID-19 inhibitors [20]. This dataset is particularly challenging for ML classification tasks due to the fact that more than 30% of the active or inactive molecules do not share a common scaffold or other recurrent structural features, showing how different and diverse the active molecules are, making challenging the classification based on a purely structural basis (i.e. using a fingerprint-based approach).

B. Support Vector Classifier and Quantum Kernel
Support Vector Machines (SVMs) [1,[3][4][5] are among the most widely used supervised machine learning methods for LB-VS. In its basic version, a Support Vector Classifier (SVC) [30], is a SVM-based linear model that, given a training set of the form {( x i , y i )| x i ∈ R m , y i = ±1}, finds an optimal separating hyperplane in the data space to distinguish between two given classes, labelled by y i . Support Vector Machines can be turned into nonlinear classifiers by lifting the input data into a higher dimensional feature space F through a feature map φ( x). Indeed, while in its elementary form a SVC relies on the computation of similarities between data points through inner products x i , x j in R m , in the more general case this is replaced by the corresponding feature space quantity φ( x i ), φ( x j ) , thus allowing for complex non-linear representations of the original inputs. The SVC is usually trained by solving a constrained quadratic optimization problem, with a target function of the form subject to i α i y i = 0 and 0 ≤ α i ≤ 1/(2nC). Here n is the cardinality of the training set and C is a regularization parameter controlling the hardness of the margin. Moreover, the so-called kernel trick [31] allows one to replace all scalar products with a symmetric positive definite kernel function k( x i , x j ), such that, in general, the feature map φ is only specified in an implicit manner. Some popular choices for classical SVCs are represented by polynomial kernels k( x i , x j ) = ( x i · x j + r) d , which reduce to the linear case for d = 1, and the gaussian radial basis function (RBF) kernel k( The family of quantum kernel methods [13,14] employs a quantum computer to extend the class of available kernel functions, particularly to those instances which are believed to be hard to define and compute classically. Here, the complex-valued Hilbert space of quantum mechanical states plays the role of the feature space through a mapping of form x → |φ( x) φ( x)| and a natural kernel choice is provided by k( The specific nature and properties of the quantum kernel depend on the explicit choice of the embedding function |φ( x) , which is typically realized as |φ( x) = U ( x)|0 with a unitary operation U acting on a reference state |0 and depending parametrically on the classical input data.
Some important remarks are in order here: on the one hand, it is not difficult to see how carefully designed U can in principle lead to classically intractable kernels. On the other hand, it has also been argued that the mere computational hardness of U does not a priori guarantee quantum advantage in speed or accuracy for machine learning tasks. In fact, complexity considerations in this context are influenced by both the availability of training data [17] and the inherent structure of the data sets [16,32]. Although a rigorous quantum speed-up in supervised learning with quantum kernel methods has been proven theoretically in at least one specific example [16], the quest for quantum advantage on naturally occurring data sets remains open. Here we provide strong empirical evidence towards such a goal, working from the assumption that good candidate models for practical quantum advantage are those that, using a classically hard quantum kernel, exhibit superior performances with respect to all classical counterparts in at least some instances of the problem under study.
In the following, we will make use of a unitary template, known as the ZZ feature map with linear entanglement [14,33], to embed classical feature vectors into quantum states. For a 4-dimensional classical input (2) where P (θ) is the single-qubit phase gate,x i ≡ π − x i and is called the depth of the feature map. Notice that the number of qubits in the register matches the size of the classical inputs, and the generalization of the parametrized unitary transformation to larger classical inputs is straightforward. This design is derived from the one presented in Ref. [14] and yields computationally hard kernels under suitable complexity theory assumptions. In the following, we will set = 2.
For every pair of classical inputs x, x , the quantum kernel is evaluated on a quantum processor according to the definition In practice, after initializing the quantum register in the reference state |0 , one applies the unitary U ( x ) and the inverse U † ( x) = U −1 ( x), which is easily obtained by reversing the order of operations appearing in Eq.
(2) while also replacing P (θ) → P (−θ) in all single-qubit phase gates. Finally, the value of the quantum kernel corresponds to the probability of observing the register in the reference state |0 after a measure in the computational basis.

C. Molecular Descriptors
Rather than performing VS based on a similarity search of molecules in a database -using for example molecular fingerprints, like most conventional VS methods -we trained the classifier using generic features (chemical descriptors), which can be computed for every dataset, regardless of the target type and experimental features. Chemical descriptors exist in their thousands and can be quickly obtained using specific thirdparty cheminformatics software tools, both open-source and commercial. Our methodology can be coupled with any third-party package capable of producing such descriptors. Here, we use RDKit [34] to extract molecular properties starting from molecular structures in a SMILES format. As also described in Refs. [7,[35][36][37][38], the chosen features can be grouped as (i ) atomic species; (ii ) structural properties; (iii ) physical-chemical properties; (iv ) basic electronic information; (v ) molecular complexityi.e. graph descriptors. The full list of descriptors can be found in Appendix B.

D. Descriptors Selection Methods and Quantum Encoding
As described in Sec. II B, employing a ZZ feature map implies the use of a parametric type of encoding. Under this scheme, each classical data point must be represented as a vector of real numbers -in our case the chemical features or molecular descriptors -that, after a suitable normalization procedure, are used as angles in single-qubit quantum gates (see Eq. 2). These operations, in combination with 2-qubit entangling operations, eventually create a quantum superposition state, i.e. a linear combination of multi-qubit basis states or strings, whose complex coefficients have a non-trivial dependence on the classical input and represent its embedding into the feature Hilbert space.
While in principle very powerful and versatile, this data encoding scheme is rather qubit-intensive, with typically one assigned feature per qubit. As a matter of fact, and despite continued and steady progress in quantum computing technology, implementing a quantum LB-VS workflow featuring hundreds of descriptors is currently demanding even for the largest publicly available quantum processors (127 qubits, IBM Eagle -late 2021 [39]). Similarly, this precludes the use of conventional 2048-bit molecular fingerprints, although reduced representations such as the ones proposed by Batra and co-workers [40] may offer viable intermediate-scale solutions.
To achieve a meaningful yet manageable problem size, we therefore extracted, starting from SMILES representations, 47 chemical descriptors as described in Sec. II C. First, each descriptor was normalised using standardization methods. Second, the descriptors were compressed into N variables using Principal Component Analysis (PCA) [41] to match a set of N qubits used by the QSVC algorithm. Alternatively, the best N features were selected using the analysis of variance methodology (ANOVA) [42]. More in detail, PCA -which is widely adopted for feature reduction in drug discovery [36,43] was applied via singular value decomposition to obtain a lower-dimensional representation of the data as described in Ref. [44]. Conversely, the ANOVA f-test [44] enabled us to pick the most descriptive N features, as also suggested in Ref. [45], leveraging the numerical structure of the inputs and the binary categorical nature (i.e. active/inactive) of the classification output. Finally, the resulting feature vectors were used to parametrically initialise the corresponding quantum states on which the quantum kernel entries were evaluated.

E. Evaluation of Performance (ROC)
The most intuitive way of assessing the performances of binary classification algorithms in both classical and quantum cases is to calculate the AUC-ROC (Receiver Operating Characteristic) metric [44,[46][47][48]. The ROC score provides an indication of the capability of the method to distinguish accurately between true positives and false positives. The score ranges between 0 and 1, where 1 is a perfect classifier capable of retrieving true positives only.

III. RESULTS
To demonstrate the theoretical benefits of a quantuminspired LB-VS method, first we ran our experiments simulating noiseless quantum hardware with the pythonbased Qiskit [33] statevector simulator. Then, we performed simulations introducing statistical and gate noise of actual IBM processors (Qiskit qasm and mock hard-ware simulators). In parallel, we ran identical experiments using the best classical SVC (CSVC) kernel method with optimal hyperparameters (i.e. the regularization factor C and the RBF kernel non-linearity γ) obtained via a thorough grid search using the scikit-learn package (see Appendix A) [44]. To further substantiate our heuristic claims, we performed LB-VS of the same benchmarking datasets using two other state-ofthe-art classical methodologies whose performances for generalised VS tasks have been acknowledged in the literature [49][50][51][52][53][54]. Specifically, we used a Random Matrix Discriminant (RMD) algorithm [55], as implemented in PyRMD [56], and a Deep Learning (DL) approach called Directed Message Passing Neural Network (MPNN) [57]. We compared the results of the following methodologies "out-of-the-box" to have reproducible results: • D-MPNN with 2D CDF-normalized 200 additional features from RDKIT or with binary 2048-bit Morgan fingerprints [58]; • PyRMD with 2048-bit MHFP6 [59], ECFP6 [60] and RDKIT fingerprints.
Finally, we measured the performance of our QSVC method using actual quantum hardware (IBM Quantum Montreal and IBM Quantum Guadalupe) screening the ADBR2 and the COVID-19 datasets.

A. Results from Numerical Simulations
Following the methodology described in Section II, all targets of the LIT-PCBA and the COVID-19 datasets were screened using our algorithm. To assess the average method performance, each simulation involving both classical and quantum SVC algorithms was run ten times for each feature selection method and per number of features. Similarly, the RMD and DL classical simulations were also run 10 times. In the interest of fairness, we quantified the standard deviation of the results using 10fold cross-validation for all simulations. In addition, we used identical test and train subsets for all SVC methods, in order to have a true like-for-like comparison between quantum and classical algorithms. The full set of numerical results is reported in Tables II and III of Appendix C. Consistently with classical SVC methods, QSVC statevector results suggest that the overall performance of our method is influenced by i ) the class balance and the number of the actives in the dataset, ii ) the feature selection method, iii ) number of features and iv) the value of regularization parameter C. This trend is captured in Figs. 6-7 of Appendix C. The class balance of the target is the first major factor to affect the performance of our method, similarly to other classical ML/DL approaches. A small number of active molecules in a dataset (e.g. 17 actives vs 312'483 inactives for ADBR2) leads to large standard deviation values (up to ca. ±0.3 of AUC ROC). On the other hand, a near negligible standard deviation for target ALDH1, which has the largest number of actives (5'386 actives plus 103'474 inactives), confirms the importance of class balance in terms of stability and performance of all methods. The feature selection method is the second major factor to have an effect on the overall performance of all SVC methods described in this paper. Due to its very essence, feature selection involves the inevitable loss of information, and this may have a positive or negative effect depending on the dataset/target and the number of features selected. Throughout this work, we observe that our QSVC method tends to outperform CSVC and the other classical methods when then number of features is 8 or higher, which also corresponds to a higher number of qubits. In some cases, such as the ALDH1 target shown in Fig. 2a, there is a near-linear correlation between performance and the number of features/qubits used. Finally, the performance of QSVC method is also affected by the choice of the regularization factor C. This is expected as SVC based methods have an inherent dependency on the C value to determine the misclassification rate of the method. Our results suggest that a better QSVC performance is achieved when using a default value C = 1.
A more detailed overview is represented in Fig. 2, where the comparison of ANOVA and PCA QSVC(statevector)/CSVC, RMD and DL trends is reported for the ALDH1, COVID-19 (see Sec. II A) and GBA (166 active vs 296052 inactive molecules) targets. Notice that the trend lines of RMD and DL methods are flat since the methods do not make use of an explicit number of features or feature selection methods. Overall, the performances of both classical and quantum SVC classifiers tend to increase with an incremented number of features used for training, with the steepness of increase and maximum ROC values depending on the features selection method used. In other words, the amount of information used to train the algorithm has a major impact on determining the quality of the classifier. Fig. 2a shows that the QSVC tends to perform significantly better with 8+ features if compared to its classical equivalent (up to 13% better when using 16 features), suggesting that the QK method finds a better hyperplane separating active and inactive molecules in the dataset. As previously discussed, the stability of the method on datasets with optimal class balance such as ALDH1 is suggested by little standard deviation values. Conversely, class-imbalanced datasets such as COVID-19 and GBA (Fig. 2c-d) typically show higher standard deviations. Nevertheless, the QSVC method consistently keeps higher average accuracy with increasing number of features for at least one choice of the feature selection method (PCA for COVID-19 and ANOVA for GBA).
In Figure 2 we also report the mean ROC values obtained following VS of the ALDH1, COVID-19 and GBA targets using the RMD and MPNN methodologies, trained using various fingerprint encoding. The performance of our quantum classifier seems to outperform the classical VS methodology of RMD with up to ca. 20% better mean quantum ROC values when compared to the best value obtained with PyRMD (trained with RD-KIT or ECFP6 fingerprints). Conversely, the MPNN method generally provides better mean ROC values if compared to RMD, and it seems to be outperformed by our methodology only when using a larger number of features. The standard deviation values of both RMD and MPNN methodologies are in agreement with our CSVC and QSVC methods, confirming the role of class balance in defining the stability of the method. In Fig. 2c-d we include data points obtained for 8 features using two different quantum simulator backends as implemented in Qiskit, namely the noiseless qasm and the mock Montreal backend simulators. The former adds statistical noise to the simulation, mimicking the probabilistic quantum measurement process, while the latter simulates both the statistical and hardware noise that one would get on the actual IBM Quantum Montreal processor, whose calibration data are used as parameters in the noise model. Due to memory and time limitations, we run simulations with qasm and/or gate noise for 8 features/qubits. These results show that the addition of noise is not significantly detrimental to the performance of the QSVC method for the COVID-19 and GBA targets. Furthermore, the standard deviation for the simulations are also in line with the noiseless statevector case. Summarizing, the simulation of our quantum LB-VS methodology provides, in a like-for-like comparison, a performance that either compares to other existing classical methodologies or that significantly outperforms them for a sufficiently large number of selected features, sometimes showing evidence of potential linear scale-up. This implies that an increase in the number of features/qubits used to train the algorithm can, in principle, further improve the performance of the method. Finally, it is important to mention that the prospective quantum advantage suggested by this study is also supported by the geometric test recently introduced in Ref. [17]. In fact, the data sets shown in Fig. 2 noticeably passed such test, yielding g CQ ∝ √ n, where g CQ is the geometric difference between the quantum kernel and the best classical CSVC counterpart, and n is the number of the data points.

B. Results from Quantum Hardware
In this work, we have so far established that the QSVC method in some cases outperforms an equivalent binary classifier or other classical methods used in VS of databases of molecules, highlighting its dependencies from dataset size, number of features and so forth. We also showed that the addition of statistical and simulated noise to the quantum algorithm does not considerably reduce the performance of the method, and it lies within the standard deviation of statevector (i.e. noisefree) simulations. The next logical step is therefore to verify whether the quantum algorithm can indeed perform well on actual quantum devices and if the numerical performance is a reliable tool for assessing the potential benefits coming from our method applied to LB-VS on quantum computers. To this aim, we run simulations on IBM Quantum processors using 8 qubits for ADRB2 on IBM Quantum Montreal (Fig.3a) and for the COVID-19 dataset on IBM Quantum Guadalupe (Fig.3b).
The two datasets underwent the same preparation and simulation methodology as the other datasets described in the previous section. In Fig. 4 we report the best results obtained with the optimal feature selection methods (ANOVA for ADRB2, PCA for COVID-19) along with qasm, mock hardware (Montreal and Guadalupe backend) and hardware results. Also in this case, QSVC statevector numerical simulations for both ADRB2 and COVID-19 datasets provided better results if compared to classical equivalents, including RMD and MPNN methods. Remarkably, in our experiments we found that training and testing our QSVC algorithm on IBM Quantum Montreal provided an AUC ROC result for ADRB2 comparable to the results of all the other (quantum) numerical simulation methods, with outcomes lying well within the error bar of statevector simulations (see Table II). Similarly, running the QSVC algorithm on IBM Quantum Guadalupe for the COVID-19 dataset also provided an AUC ROC that is comparable to the best (quantum) numerical simulations. Most importantly, quantum hardware results for both targets greatly outperform the other classical methodologies, thus confirming that quantum kernel methods are indeed a well suited tool for potential active/inactive molecules classification against ADRB2 and COVID-19 targets.
We can conclude that, for these two instances, simulated backends have correctly predicted the respective hardware results, hence providing a truthful insight into what to expect from real quantum hardware. These results are encouraging, suggesting that although it might not yet be possible to perform large scale simulations on currently available quantum hardware, promising proofof-principle demonstrations can be achieved, confirming the reliability and predictive power of numerical simulations.

IV. DISCUSSION AND CONCLUSIONS
In this paper, we have shown that, when using the same data sets and training conditions, a quantum classifier can in some cases outperform the best available classical counterparts. This implies that the analysis of cheminformatics data can already benefit from the application of quantum technologies, which have the potential of enhancing the success rate of, e.g., structure classification for material design and drug discovery. We need to stress that for this particular investigation, we reduced the problem size (number of features) to a level that is currently affordable for quantum calculations, using both classical simulations of quantum algorithms and the execution on state-of-the-art quantum hardware. This implies for the moment the use of relatively small data sets and the selection of a limited number of representative features for the characterization of the compound properties. In doing that, the proposed quantum classifiers are still classically accessible in simulations, hence our approach cannot yet lead to direct quantum advantage -at best, it can be qualified as an interesting new quantuminspired classical algorithm. However, and this is the main message of this paper, there are currently no evident reasons to doubt that the same advantage observed in the proof-of-principle applications exposed in this work will not withstand the scaling-up to a larger number of descriptors (features). These will correspond to a number of qubits (more than 30-50) that will remain inaccessible to classical simulators while becoming manageable on the near-term quantum computers of the coming generations [39]. To refer to the class of algorithms featuring this property, we introduce the concept of Prospective Quantum Advantage (PQA), as an impactful intermediate step -with potential value for small and intermediate size instances -towards full quantum advantage in large scale applications. In this sense, we conclude that quantum information algorithms have already an attractive potential for the chemical and pharmaceutical industry, as they bring new value to research workflows and provide solutions superior to previously accessible classical ones.

ACKNOWLEDGMENTS
This work was supported by the Hartree National Centre for Digital Innovation, a UK Governmentfunded collaboration between STFC and IBM. IBM, the IBM logo, and ibm.com are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. The current list of IBM trademarks is available at https://www.ibm.com/legal/copytrade.