Paper • Open access

Scientific intuition inspired by machine learning-generated hypotheses

Pascal Friederich, Mario Krenn, Isaac Tamblyn and Alán Aspuru-Guzik

Published 14 April 2021 © 2021 The Author(s). Published by IOP Publishing Ltd
Citation: Pascal Friederich et al 2021 Mach. Learn.: Sci. Technol. 2 025027. DOI 10.1088/2632-2153/abda08


Abstract

Machine learning has become a widely used tool for questions in the physical sciences, successfully applied to classification, regression and optimization tasks in many areas. Research focus mostly lies in improving the accuracy of the machine learning models in numerical predictions, while scientific understanding is still almost exclusively generated by human researchers analysing numerical results and drawing conclusions. In this work, we shift the focus to the insights and the knowledge obtained by the machine learning models themselves. In particular, we study how this knowledge can be extracted and used to inspire human scientists to increase their intuition and understanding of natural systems. We apply gradient boosting with decision trees to extract human-interpretable insights from big data sets in chemistry and physics. In chemistry, we not only rediscover widely known rules of thumb but also find new, interesting motifs that tell us how to control the solubility and energy levels of organic molecules. At the same time, in quantum physics, we gain new understanding of experiments for quantum entanglement. The ability to go beyond numerics and to enter the realm of scientific insight and hypothesis generation opens the door to using machine learning to accelerate the discovery of conceptual understanding in some of the most challenging domains of science.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Machine learning (ML) has recently become a widely used tool with many applications in the physical sciences [1], ranging from chemistry (for example, prediction of quantum chemistry properties [2], solving Schrödinger's equation [3], predicting reactions [4], materials discovery [5] or inverse materials design [6, 7]) to physics (for example, identification of phases of matter [8], astronomical object recognition [9], or validation of quantum experiments [10]) and biology (for example, prediction of protein structures [11] or drug design [12, 13]). Open challenges in applying machine learning models in the natural sciences include the accessibility, homogeneity, amount and quality of available data, as well as a lack of machine learning models that inherently incorporate physical laws, which limits the interpretability of the models' predictions. While ML models are successfully used and optimized to accelerate numerical predictions or to recognize or generate patterns in existing data, it is rarely asked how the machine finds solutions, i.e. which patterns and correlations it detected and exploited. Thus, the scientific insight obtained by the model is not directly transferred to human scientists. First attempts to use artificial intelligence in the physical sciences aimed to directly answer scientific questions, e.g. to determine the location of protein encodings in the genome [14]. Further attempts to employ machine learning models to obtain insight and help scientists develop theories focused on rediscovering solutions to already solved problems, e.g. rediscovering the coordinate transformation in astrophysical [15] and non-linear dynamical systems [16], or detecting symmetries and conservation laws [17]. The methods used in these cases enforce information bottlenecks or interpretable transformations in the ML model that can then inspire scientific understanding [18].
However, to our knowledge, such methods have mostly been applied to solved problems and have not yet been used to obtain novel insights and answers to questions that are not well understood.

In this work, we propose to use machine learning and systematic data analysis to further automate the generation of interpretable scientific hypotheses. We demonstrate the applicability of the approach using two questions in the natural sciences — a rediscovery task of chemistry knowledge (hydrophobicity and molecular energy levels in simple as well as application-relevant molecules) and the discovery of new intuitions in physics (quantum optics). We show that our approach not only 'rediscovers' but also extends known chemical rules of thumb for the solubility and energy levels of organic molecules with applications in organic photovoltaics and organic light-emitting diodes, and helps us to better understand the entanglement created in quantum optical experiments.

Our model represents its findings in a graph representation which is directly related to chemical or physical instances in the specific scientific domain. The results are statements regarding distinct subgraphs that can easily be comprehended and therefore scientifically interpreted and understood by experts. This is in stark contrast to conventional machine learning models, where the internal representations are only indirectly connected with the real physical entities and thus hard or even impossible to interpret.

2. Method

2.1. Computer-generated hypotheses

We suggest an automated workflow for the ML-based generation of human-interpretable scientific hypotheses, as illustrated in figure 1(a). The workflow is based on a reference database of calculated (potentially also measured) data points with graph-based structure and corresponding target properties. A binary feature vector describing the presence or absence of automatically generated subgraphs [20] is used to train a tree ensemble method, e.g. gradient boosting [19] or random forest regression/classification [21, 22], that allows for the quantification of feature importances. Based on the features with the highest importance, a list of hypotheses is generated. Each hypothesis has the human-understandable form

Feature i leads to an increase/decrease of the target property with strength s


Figure 1. Workflow for automated hypothesis generation. (a) General workflow, starting with a database of graphs and respective properties, followed by training of a machine learning model that allows for the extraction of feature importances, e.g. gradient boosting regression. Features with high importance are combined and analysed in a way that facilitates interpretation in order to stimulate scientific insight. (b) Schematic illustration of the gradient boosting regression method [19], where multiple simple decision tree models are trained sequentially. Each new decision tree is trained to correct the residual errors (red lines) of the previous models, so the final prediction F(x) can be written as a sum of the mean label c0 and a weighted series of models hi(x), where each hi predicts the deviation of the previous i − 1 models from the ground truth. (c) Each decision tree is trained on samples that are represented using predefined input features (coloured squares) and uses their values to split the data set sequentially into smaller subsets which are used for the predictions. The subgraph-based input representation used in this work allows a direct interpretation of the feature importances (d), which are computed based on a quantification of how meaningful features are for the accuracy of the machine learning model.


where i is the index of the corresponding feature (subgraph) in the input and the strength s quantifies the degree of correlation between feature i and the target property. A high feature importance does not necessarily correspond to a high direct correlation with the target property. In many cases, multiple features have to be combined in order to become predictive, even if the single features individually do not help in predicting the target property. Therefore, important features are combined using logical operations (and, xor, ...) to automatically generate combined features which, especially in the presence of higher-order correlations, can be directly interpreted by researchers.
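As an illustration, the core of this workflow — training a tree ensemble on binary subgraph features and turning the most important features into signed hypotheses — can be sketched with scikit-learn. All data here are synthetic and the feature indices and coefficients are purely illustrative, not taken from the paper's data sets:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: each row is a binary vector
# encoding the presence/absence of automatically generated subgraphs.
n_samples, n_features = 500, 20
X = rng.integers(0, 2, size=(n_samples, n_features))

# Hypothetical ground truth: subgraph 3 raises the target property,
# subgraph 7 lowers it; everything else is noise.
y = 2.0 * X[:, 3] - 1.5 * X[:, 7] + 0.1 * rng.normal(size=n_samples)

model = GradientBoostingRegressor(n_estimators=200, max_depth=3, random_state=0)
model.fit(X, y)

# Turn the most important features into hypotheses: the sign comes from
# the mean property shift when the feature is present, the strength s
# from the size of that shift.
for i in np.argsort(model.feature_importances_)[::-1][:2]:
    s = y[X[:, i] == 1].mean() - y[X[:, i] == 0].mean()
    trend = "an increase" if s > 0 else "a decrease"
    print(f"Feature {i} leads to {trend} of the target property "
          f"with strength {abs(s):.2f}")
```

On this synthetic data the model recovers subgraphs 3 and 7 as the two most important features; on real data the same ranking step is applied to the fingerprint bits described in section 2.2.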

2.2. Input representation and experiments

In this work, we test this workflow on two experiments in chemistry and physics. The first experiment targets the automated generation of intuitive rules that determine molecular properties, whereas the second aims at hypothesis generation for entanglement properties of quantum optical experiments. In both cases, we can describe the data points as graphs (molecules and quantum optical experiments), where nodes are chemical elements or optical instruments while edges are chemical bonds or paths of photons travelling through the setup. This allows us to use fingerprinting techniques to generate input representations (bit-vectors), e.g. using the algorithm for circular extended-connectivity fingerprints [20]. This iterative algorithm generates a unique representation of each node, including its local environment. In each iteration, hashing functions are used to aggregate the information (predefined node and edge features) of the nearest neighbors of each node, thus implicitly integrating the information of one additional neighbor shell per iteration. In the end, a hashing function is used to map all subgraphs found in the graphs to bit-vectors. Each entry in these bit-vectors encodes the presence or absence of a certain subgraph. A similar approach was used by Lopez et al [23] to determine molecular substructures that lead to high power conversion efficiencies in organic solar cells. Other models that link the presence of subgraphs (or, more generally, features) in the input data to properties can potentially be employed in our workflow (see e.g. Duvenaud et al [24], where molecular fragments that correlate with toxicity are identified, the Grad-CAM method by Selvaraju et al [25] for convolutional neural networks, or the GNNExplainer by Ying et al [26]). In contrast to this work, some of these approaches depend on the analysis of single samples and thus only indirectly allow conclusions to be drawn about an entire data set.
Furthermore, these approaches assign importance indicators, which are not necessarily binary, to single nodes or edges of a graph, which complicates direct interpretation. Due to their general applicability to all graphs whose nodes and edges can be represented by one or multiple categorical features, we focused on automatically generated circular fingerprints in this work.
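The iterative neighbor-shell hashing can be sketched for a generic labelled graph. This is a minimal illustration of the idea, not the actual extended-connectivity fingerprint implementation of [20]; the node labels and graph below are hypothetical:

```python
from collections import defaultdict

def circular_fingerprint(node_labels, edges, radius=2, n_bits=64):
    """Minimal ECFP-style circular fingerprint for a labelled graph.

    node_labels: list of label strings, one per node (indexed 0..n-1).
    edges: list of (i, j) index pairs.
    Each iteration hashes a node's identifier together with its
    neighbors' identifiers, capturing one additional neighbor shell.
    Note: Python's string hash is randomized across interpreter runs,
    so the bits are only reproducible within a single run.
    """
    neighbors = defaultdict(list)
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    # Initial identifiers: hash of each node's own label.
    ids = {v: hash(lbl) for v, lbl in enumerate(node_labels)}
    substructures = set(ids.values())

    for _ in range(radius):
        new_ids = {}
        for v in ids:
            env = tuple(sorted(ids[u] for u in neighbors[v]))
            new_ids[v] = hash((ids[v], env))  # aggregate one neighbor shell
        ids = new_ids
        substructures.update(ids.values())

    # Fold all discovered substructure identifiers into a bit-vector:
    # each set bit encodes the presence of some subgraph.
    fp = [0] * n_bits
    for h in substructures:
        fp[h % n_bits] = 1
    return fp

# Example: a water-like graph O-H, O-H.
fp = circular_fingerprint(["O", "H", "H"], [(0, 1), (0, 2)])
```

Real implementations additionally include predefined node and edge features (element, charge, bond order, or, for optical setups, instrument type) in the initial identifiers and deduplicate equivalent subgraphs before folding.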

3. Results

To test the automated hypothesis generation workflow, we performed experiments in two scientific domains: molecular chemistry (section 3.1) and quantum optical experiments (section 3.2). We computed physical properties of these graphs and used the generated data sets and the workflow described in figure 1 to automatically generate hypotheses that can either be compared to a collection of widely known chemical rules of thumb or help to better understand entanglement in quantum optical experiments for designing future experiments.

3.1. Chemical intuition for solubility and energy levels

In the case of the chemistry experiment, we used two prototypical target properties — the water–octanol partition coefficient (log P), which describes the solubility of molecules in water (polar) vs octanol (non-polar), and the energy of the highest occupied molecular orbital (HOMO). Both properties are highly relevant for the application of molecules as pharmaceuticals or in electronic devices, e.g. organic solar cells, organic light-emitting diode (OLED) displays or organic flow batteries. We furthermore analysed existing application-specific data sets, namely a data set of thermally activated delayed fluorescence (TADF) emitter molecules for OLEDs [27], the Harvard Clean Energy Project data set [28, 29] and a data set of non-fullerene acceptor molecules for organic solar cells [23]. Solubility and energy levels are relatively well understood, and for both properties there exist several widely known rules of thumb, often described as chemical intuition, about how certain functional groups influence them. Our experiment tests whether the automated hypothesis generation method can 'rediscover' those rules and potentially add new or refined rules. For the frontier orbital gaps reported in the Harvard Clean Energy data set and the non-fullerene acceptor data set, as well as for the singlet–triplet energy splittings reported in the TADF data set, there exists less chemical intuition on how to influence and tune them.

Figure 2 shows two solubility-related hypotheses that were generated using our workflow. Without prior knowledge, the algorithm identifies two widely known chemical groups/motifs: one that increases solubility in polar solvents (the carbonyl group, figure 2(a)) and one that increases solubility in non-polar solvents (the conjugated carbon chain, figure 2(b)). Figure 3 shows an overview of molecular subgraphs that positively and negatively influence the HOMO energy of a molecule. To our surprise, five of the nine groups shown in the figure can be found directly in chemistry textbooks or on Wikipedia when searching for electrophilic aromatic directing groups, which change the energy levels of molecules through the inductive effect and the mesomeric effect. Specifically, the oxido (O) group, which shows the strongest positive influence on the HOMO, is well known for a strong resonance-donating and a strong inductive effect, both of which lead to an increase in HOMO energy. Furthermore, heterocycles that contain nitrogen, as well as amine (NH2) groups, are also known for lifting the HOMO level to higher energies. On the other hand, the nitrile group (C≡N) is one of the most widely known electron-withdrawing groups and lowers the HOMO energy of molecules due to its resonance-withdrawing and inductively withdrawing nature.


Figure 2. Hypotheses about molecular solubility. (a) Lower log P values (better solubility in water compared to octanol) can be achieved using carbonyl groups, while (b) conjugated carbon chains lead to higher log P values.


Figure 3. Hypotheses about molecular energy levels. Molecular subgraphs with a positive (left) and negative (right) influence on the HOMO energy. The groups 'discovered' by our automated workflow are widely known activating (resonance donating or electron donating) and deactivating groups, such as oxido/amino groups and nitrile groups.


The patterns found to be relevant for small HOMO–LUMO gaps in the Harvard Clean Energy data set as well as in the non-fullerene acceptor data set are mostly related to extended aromatic systems and fused aromatic rings (see figures 5(a) and S1(a) (available online at stacks.iop.org/MLST/2/025027/mmedia)). This finding is well understood by chemists due to the widely known relation between the size of an aromatic system (i.e. the degree of delocalization of π-electrons) and the frontier orbital gap [30]. In the limit of infinite delocalization (e.g. in graphene), the HOMO–LUMO gap closes completely. This relation was also exploited in the development of conductive polymers, which was awarded the Nobel Prize in Chemistry in 2000 and created the field of organic electronics [31].

However, we additionally found several interesting and surprising patterns, both in the photovoltaic data sets (figures 5(b) and (c)) and in the TADF data set (figure 4). In the case of the Harvard Clean Energy data set, we find that aromatic heterocycles with sulfur (e.g. thiophene rings) as well as with silicon heteroatoms (e.g. silole rings) significantly reduce the HOMO–LUMO gap. While the former are widely used in organic electronics to control energy levels and reduce HOMO–LUMO gaps, silole rings are more unusual.


Figure 4. Hypotheses about singlet–triplet splittings in the TADF data set [27]. The data-driven algorithm finds the well-known and widely exploited structure–property relation between triarylamines and small singlet–triplet gaps (<0.5 eV, upper panel). However, it finds an additional, less known motif of alternating single–double-bond bridges that is related to increased singlet–triplet gaps (>0.5 eV, lower panel).


In the non-fullerene acceptor data set (see figure 5(c)), we found that thiophene rings connected by double bonds (i.e. forming a quinoid structure instead of aromatic systems) also significantly reduce the HOMO–LUMO gap, which is a known relation first described by Brédas [32]. However, such systems require a specific functionalization in the periphery of the molecule to enforce the quinoid structure of the two thiophene rings, which is intrinsically less stable and thus higher in energy than the aromatic structure.


Figure 5. Hypotheses about HOMO–LUMO gaps in the Harvard Clean Energy data set [28, 29] and a non-fullerene acceptor data set [23]. (a) The automated hypothesis generation protocol rediscovers the widely known relation between extended aromatic systems (containing e.g. nitrogen heteroatoms) and reduced HOMO–LUMO gaps. (b) Thiophene but also more uncommon silole rings are found to correlate with small HOMO–LUMO gaps. (c) Thiophene rings bridged by double bonds (quinoid structures) are found to decrease the HOMO–LUMO gap in the non-fullerene acceptor data set. (Note the different scale in panel (c) compared to (a), (b), due to differences in the data sets.)


In the case of the TADF data set (see figure 4), we found expected patterns, such as triarylamines that correlate with decreased singlet–triplet gaps (S1–T1 gaps), as well as rather unexpected patterns (e.g. conjugated bridges) that are identified by our workflow as chemical groups that highly correlate with large singlet–triplet gaps. Low singlet–triplet splittings in TADF molecules are typically achieved by decoupling the electron-donating and electron-accepting parts of a molecule to reduce the exchange interaction between the frontier orbitals, which would otherwise lower the triplet state relative to the singlet state and open an undesired singlet–triplet splitting. The decoupling of the fragments can be achieved by introducing twist angles close to 90° between the fragments. One way to accomplish this is via triarylamine bridges between the fragments. We expect that conjugated bridges between fragments have precisely the opposite effect: they lead to a planar alignment of the adjacent fragments and thus to an enhanced exchange interaction, reduced triplet energies and, finally, increased singlet–triplet splittings.

3.2. Physical intuitions for quantum experiments

As a second example, we use quantum optical experiments for producing high-dimensional, multipartite quantum entanglement [33, 34]. These experiments are attracting growing interest as they allow the investigation of fundamental physical properties — such as local realism [35] — in laboratories. Furthermore, such quantum states are the key resources for large and complex quantum communication networks [36, 37], which are on the verge of commercial availability. The experimental setups that we consider consist of standard optical components used in labs, such as non-linear crystals for the creation of photon pairs, single-photon detectors, beam splitters, holograms or Dove prisms. Under approximations that experiments closely satisfy, the final emergent quantum state can be reliably calculated [38].

A key challenge lies in the design of experiments that create certain desired quantum systems. The difficulty arises from counter-intuitive quantum phenomena, which raises the question of whether human intuition is the best way to design new experiments. Several studies have therefore developed automated and machine-learning-augmented approaches for the design of experiments [39–44]. Our approach tackles this challenge in a completely different way, namely by improving the scientist's intuition about these systems.

Specifically, we investigate optical setups with three-photon entanglement in high dimensions, using a fourth photon as a trigger. The experimental setups can be represented as graphs where vertices represent optical elements, and edges correspond to the photon paths connecting these elements. Analogously to chemical elements, the optical elements can have one to four connections. For example, a beam splitter has four input-output modes, while a detector has only one input. As a measure of entanglement, we use the overall size of the involved Hilbert space in terms of involved qubits, $n_{\textrm{Q}} = \log_2(d_1 d_2 d_3)$, where di stands for the rank of the density matrix of the remaining photons after tracing out photon i [45, 46].
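For a pure state, this measure can be evaluated directly from the state vector. The following sketch (our own illustration with NumPy, not the code used for the paper's simulations) computes $n_{\textrm{Q}}$ for a three-photon state by tracing out each photon in turn and taking the rank of the remaining density matrix:

```python
import numpy as np

def n_qubits_entanglement(psi, dims):
    """n_Q = log2(d1*d2*d3), where d_i is the rank of the density
    matrix of the remaining photons after tracing out photon i.
    psi: pure-state vector of length prod(dims); dims: local dimensions."""
    nq = 0.0
    for i in range(len(dims)):
        t = np.moveaxis(psi.reshape(dims), i, 0)  # photon i's index first
        m = t.reshape(dims[i], -1)                # shape (d_i, rest)
        rho = m.T @ m.conj()                      # reduced density matrix of the rest
        nq += np.log2(np.linalg.matrix_rank(rho, tol=1e-10))
    return nq

# Three-photon GHZ state in local dimension 3: (|000> + |111> + |222>)/sqrt(3).
# Each two-photon reduction has rank 3, so n_Q = 3*log2(3) ≈ 4.75.
ghz = np.zeros(27)
ghz[[0, 13, 26]] = 1 / np.sqrt(3)
print(n_qubits_entanglement(ghz, (3, 3, 3)))
```

A fully separable state such as |000⟩ gives rank-1 reductions and hence $n_{\textrm{Q}} = 0$.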

We used the same fingerprint-based graph representation as in section 3.1 and trained a gradient boosting regression model to predict $n_{\textrm{Q}}$. Using the algorithm outlined in figure 1, we form a list of hypotheses about the subgraph features that influence $n_{\textrm{Q}}$ most. This computer-generated list was analysed and interpreted by a domain expert.

The two features which influence $n_{\textrm{Q}}$ most negatively contradict the intuition in the field; see figures 6(a), (b) and S2. Surprisingly, both of them represent subgraphs that are core elements of two experimental setups which have produced high-dimensional multipartite entanglement in the laboratory [47, 48]. Specifically, if the outputs of two non-linear crystals (both crystals produce entangled photon pairs in the same 3D mode space) are connected directly via a beam splitter or interferometer, the entanglement of the resulting state is predicted to be comparably low. This can be interpreted in the following way: the photons from the two different crystals need to combine at some point; otherwise, they remain bi-separable. However, if they combine directly after their generation, the equal mode spaces mix in such a way that it is difficult to increase their dimensionality subsequently. It is therefore particularly enlightening that several of the features that positively influence $n_{\textrm{Q}}$ correspond to elements which shift the entire mode space by plus or minus three before or after the beam splitters or non-linear crystals. The insight for a human researcher is to shift the mode space by three (as the local dimension is three) before combining photons from different non-linear crystals, in order to achieve a high $n_{\textrm{Q}}$. This leads to mode spaces of twice the original size and thereby increases the probability of large overall entanglement dimensionalities.


Figure 6. Hypotheses about quantum optical experiments. Experimental substructures leading to a decrease in the overall size of the Hilbert space of involved qubits (nQ ) are shown in (a) while substructures with positive influence are shown in (b).


A different feature which was used in the two experimental demonstrations, but which significantly negatively influences $n_{\textrm{Q}}$, is the following: one output of a non-linear crystal is directly connected to a detector. For human designers, this has the convenient effect of simplifying the initial state (as double emissions from one crystal can be ignored in this case). However, the entanglement of this photon with the other two photons can never be larger than three (as the local mode space is three). A similar, negatively influencing feature is a certain interferometer, which sorts the parity of the involved modes, directly connected to a detector. This acts as a filter, reducing the mode space of the incoming photon by half and thereby reducing the overall possible entanglement significantly.

3.3. Logically combined features

We can logically combine graph features, as described in section 2, and find the most significant macro-features for quantum experiments. In figure 7(a), two small sub-experiments are combined with a logical and, i.e. the feature is the combination of both structures. Individually, the presence of the first feature has a negative influence on $n_{\textrm{Q}}$. The second feature, a parity sorter followed by two detectors, influences $n_{\textrm{Q}}$ positively. Surprisingly, their combination has a significant negative influence on $n_{\textrm{Q}}$ and can be seen as an almost sufficient condition for $n_{\textrm{Q}}\approx 4$. This behaviour can be interpreted using the Klyshko advanced-wave picture for quantum correlations in quantum optics [49]. The detector after the photon pair creation heralds a specific quantum state in the other photonic path. If those photons deterministically split at the parity sorter, the ability to mix with the photons from the other input ports (and thus from the other crystal) vanishes. From this insight, the human designer can learn that a heralded single photon should be combined in a probabilistic way with the photons of the other crystal, using beam splitters instead of parity sorters.
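Why combined features can reveal relations that single features miss is easy to demonstrate on synthetic data: two binary features that are individually uncorrelated with a target can become strongly predictive once combined with xor. The features and target below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Two hypothetical binary subgraph features. Each is uniformly random,
# so neither correlates with the target on its own.
f1 = rng.integers(0, 2, n)
f2 = rng.integers(0, 2, n)

# The target depends only on their XOR (a higher-order correlation).
y = (f1 ^ f2).astype(float) + 0.1 * rng.normal(size=n)

def correlation(feature, target):
    return np.corrcoef(feature, target)[0, 1]

# Single features barely correlate with the target...
print(correlation(f1, y), correlation(f2, y))

# ...but the logically combined feature correlates strongly.
combined = f1 ^ f2
print(correlation(combined, y))
```

The same mechanism underlies figure 7(a): each sub-experiment alone carries one sign of influence, while their conjunction carries a much stronger, and here opposite, signal.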


Figure 7. Logically combined hypotheses about quantum optical experiments. Combining single-subgraph hypotheses with logical operations leads to intuitively interpretable relations, which is illustrated here with two examples. The upper panel shows the logically combined feature and its correlation with $n_{\textrm{Q}}$, while the lower panels show the correlations of the isolated subgraphs.


A second macro-feature, figure 7(b), combines two insights that we gained from figure 6. The macro-feature in figure 7(b) shows that the absence of both three positive and three negative mode shifters in front of a beam splitter has a very negative impact on $n_{\textrm{Q}}$. Thereby, the algorithm has discovered that either increasing or decreasing the mode number has a very positive influence on the final entanglement. It thereby suggests that one can be agnostic about the shift direction; what matters is the actual increase of the local Hilbert space before the mixing. This feature clearly shows how logical combinations can simplify the interpretation of scientific data.

4. Conclusion and outlook

We presented a data-driven machine learning workflow for the automated generation and verification of hypotheses about observations in the natural sciences. We presented examples from chemistry and physics, but our method is directly applicable to most domains where structures can be represented as graphs, e.g. DNA/RNA data in biology [50, 51], chemical reaction networks [52, 53] or graphs in the social sciences. In chemistry, the workflow 'rediscovers' widely known relations regarding the solubility and electronic properties of molecules (often referred to as chemical intuition). In physics, the algorithm discovers rules to generate highly entangled three-photon states in quantum optical experiments. These rules are interpretable by human experts in retrospect, yet were not known or postulated before, and some even contradict the field's current understanding. Finding such rules will not only help researchers to understand complex scientific relationships and thus design better experiments, but also reduce the unavoidable and often undetectable bias generated by prior knowledge and expectations.

4.1. Hypothesis testing

In addition to automated hypothesis generation, protocols for testing the postulated hypotheses would be beneficial. In the case of the chemistry experiment, a possible hypothesis testing protocol would generate mutations of each molecule in the training set to test the hypotheses on molecules with similar representations, where (ideally) only the relevant feature is changed. In the case of the quantum optical experiments, not all random mutations will lead to maximally entangled states between all photons, which is a requirement for computing the entanglement of the quantum state. We currently see two options for automated hypothesis verification, both of which we are currently implementing. The first follows the same procedure of mutation and computation as in the chemistry experiment, with the caveat that only a small fraction of the mutations will lead to useful results, potentially making the procedure computationally costly. The second option is based on finding other experimental setups within the whole database that are as similar to the reference experiment as possible, except for the feature that is currently analysed. This procedure is computationally costly as well, but does not require new computations.
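The second verification option — matching each setup that contains the analysed feature against the most similar setup in the database that lacks it — can be sketched as follows. This is our own illustrative implementation on synthetic binary data, not the paper's code; the feature effects are hypothetical:

```python
import numpy as np

def matched_pair_effect(X, y, f):
    """Estimate the effect of binary feature f on property y by matched
    pairs: for each sample with f present, find the most similar sample
    (Hamming distance over the remaining features) with f absent, and
    average the property differences."""
    has = np.where(X[:, f] == 1)[0]
    lacks = np.where(X[:, f] == 0)[0]
    others = [j for j in range(X.shape[1]) if j != f]
    diffs = []
    for i in has:
        # Hamming distance to every feature-absent sample, ignoring f.
        dists = (X[np.ix_(lacks, others)] != X[i, others]).sum(axis=1)
        partner = lacks[np.argmin(dists)]
        diffs.append(y[i] - y[partner])
    return float(np.mean(diffs))

# Synthetic check: feature 0 adds +2.0 to the property, feature 1 adds +0.5.
rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(300, 8))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + 0.05 * rng.normal(size=300)
print(matched_pair_effect(X, y, 0))  # close to 2.0
```

Imperfect matches (partners that also differ in other features) blur the estimate, which is why the text stresses finding setups that differ only in the analysed feature.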

Acknowledgments

PF acknowledges funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Grant Agreement No. 795206 (MolDesign). MK acknowledges support from the Austrian Science Fund (FWF) through the Erwin Schrödinger fellowship No. J4309. IT acknowledges NSERC and performed work at the NRC under the auspices of the AI4D and MCF Programs. AA-G thanks Anders G Frøseth for his generous support. AA-G acknowledges the generous support of Natural Resources Canada and the Canada 150 Research Chairs program.

Data availability statement

The data that support the findings of this study are available upon reasonable request from the authors.
