Semi-equivariant conditional normalizing flows, with applications to target-aware molecule generation

Learning over the domain of 3D graphs has applications in a number of scientific and engineering disciplines, including molecular chemistry, high energy physics, and computer vision. We consider a specific problem in this domain, namely: given one such 3D graph, dubbed the base graph, our goal is to learn a conditional distribution over another such graph, dubbed the complement graph. Due to the three-dimensional nature of the graphs in question, there are certain natural invariances such a distribution should satisfy: it should be invariant to rigid body transformations that act jointly on the base graph and the complement graph, and it should also be invariant to permutations of the vertices of either graph. We propose a general method for learning the conditional probabilistic model, the central part of which is a continuous normalizing flow. We establish semi-equivariance conditions on the flow which guarantee the aforementioned invariance conditions on the conditional distribution. Additionally, we propose a graph neural network architecture which implements this flow, and which is designed to learn effectively despite the typical differences in size between the base graph and the complement graph. We demonstrate the utility of our technique in the molecular setting by training a conditional generative model which, given a receptor, can generate ligands which may successfully bind to that receptor. The resulting model, which has potential applications in drug design, achieves strong performance on the key ∆Binding metric.


Introduction
Data consisting of sets of three-dimensional points, sometimes also referred to as point clouds, appear in a number of scientific and engineering settings. Examples include molecular chemistry (Beddell et al 1976, Lapatto et al 1989, Miller et al 1989, Davie et al 1991, Blundell 1996, Varghese 1999, Campbell 2000, Berman et al 2003, Hardy and Malikayil 2003, Congreve et al 2005, Zardecki et al 2016, Wu et al 2018, Jing et al 2020), in which the points are the atoms making up a molecule; high energy physics (Qu and Gouskos 2020, Kansal et al 2021, Pata et al 2021, Shlomi et al 2021, Duarte and Vlimant 2022, Qu et al 2022), in which the points represent locations of particles produced in a collider; and computer vision (Liu et al 2019b, Bello et al 2020, Guo et al 2020), in which the points are derived from the output of a three-dimensional sensor such as Lidar or Time-of-Flight. In many cases, these point sets can be effectively described by a 3D graph, that is, a graph whose vertices correspond to points in $\mathbb{R}^3$. A particular problem of interest can be framed as follows. We are given one 3D graph, which we refer to as the base graph, and which we denote Ĝ; given this base graph, we are then interested in the characteristics of a second 3D graph, which we refer to as the complement graph, and which we denote G. In particular, we are interested in learning a conditional distribution of the form p(G|Ĝ), which tells us which complement graphs have a high likelihood of being associated with a given base graph.
There are several interesting applications of the above problem in science and engineering. For example, in the molecular scenario, the base graph will represent a receptor / target and the complement graph will represent a ligand (Drotár et al 2021, Fu et al 2022, Peng et al 2022, Ragoza et al 2022, Schneuing et al 2022, Danel et al 2023). Another example is the setting of shape completion in computer vision: the base graph will represent the part of the point cloud that we have been given, and the complement graph will represent the completion of this point cloud (i.e. the missing part of the point cloud) (Pauly et al 2005, Kazhdan et al 2006, Zhao et al 2007, Kazhdan and Hoppe 2013, Sung et al 2015, Li et al 2016, Dai et al 2017). In general, the probabilistic aspect of our generative model is important, as it enables the generation of multiple potential candidates; diversity is useful in this context, as not all candidates will be equally suitable in practice. For example, in the molecular scenario some candidates may possess less toxicity than others (Gillette et al 1974, Swinney 2004). In this work, we present a general method for learning the conditional probabilistic model p(G|Ĝ). In particular, we implement the central part of this model as a continuous normalizing flow. The work of Satorras et al (2021a) pioneered this approach in the unconditional setting. In the conditional setting, a number of important changes must be made, and the conditioning variable (the base graph) must enter the flow in a very particular way. Specifically, we know that the probabilistic model must capture natural invariances, to rigid motions and permutations of the vertices. The invariance to rigid motions is a kind of 'conditional invariance', which is expressed jointly in terms of the base graph and the complement graph. Due to the fact that the base graph is a conditioning variable, this leads to a novel form of semi-equivariance of the flow, which we prove. An additional issue which arises is the vastly different sizes of the two graphs: in many applications, the base graph may be 1-2 orders of magnitude larger than the complement graph. We introduce an architecture for implementing the flow which takes this into account, and enables effective learning despite the size disparity.
We demonstrate the utility of our technique in the molecular setting. We train a conditional generative model which, given a receptor, can generate ligands which may successfully bind to that receptor. Alternatively, given a receptor and a set of potential ligands, the model is able to rank the ligands according to the quality of their binding with the receptor. This kind of conditional generative model is very useful in the context of drug design, in which one often has a target (receptor) in mind, and the goal is then to find drugs (ligands) which will bind to the target. We train our method on the CrossDocked2020 dataset (Francoeur et al 2020) and attain high quality performance.
In summary, our principal contributions are as follows:
• The formulation of the problem of learning p(G|Ĝ), the conditional generative model of a complement graph given a base graph.
• The derivation of conditions on a continuous normalizing flow which allow for joint invariance of the complement graph and the base graph to both rigid motions and permutations.
• The design of an architecture which implements the conditional distribution of the complement graph given the base graph, enabling effective learning despite the size disparity between the graphs.
• A demonstration of the effectiveness of the method on the problem of target-aware molecule generation.

Related work
Continuous normalizing flows. Normalizing flows transform an initial density distribution into a desired density distribution by applying a sequence of invertible transformations. They are easy to sample from and can be trained by maximum likelihood using the change of variables formula (Rippel and Adams 2013, Dinh et al 2014, 2016, Rezende and Mohamed 2015, Kingma and Dhariwal 2018). Chen et al (2018a) introduced a continuous flow based on ODEs, which was then extended in Grathwohl et al (2018) for maximum likelihood training with an unbiased stochastic estimator (Hutchinson 1990). Further works aimed to improve the accuracy of the model (Zhuang et al 2021, Zhang and …); deal with discrete data (Ho et al 2019, Hoogeboom et al 2021); and improve stability and deal with ODE stiffness issues by adding a regularization term (Finlay et al 2020, Kelly et al 2020, Onken et al 2021, Yang and Karniadakis 2022), using augmentation (Dupont et al 2019), or by randomly sampling the end time of the ODE during training (Ghosh et al 2020). Stiffness in ODEs relates to the presence of widely varying timescales or significant differences in magnitudes. This makes the ODEs difficult to solve using traditional numerical methods, which may become unstable or inaccurate, and requires specialized numerical methods to accurately capture the dynamics of both fast and slow processes while maintaining stability (Curtiss and Hirschfelder 1952, Byrne and Hindmarsh 1987). Köhler et al (2019) and Rezende et al (2019) presented equivariant normalizing flows to respect the natural symmetries of the probability density. Liu et al (2019a) introduced graph normalizing flows to obtain generative models of graph structures. Satorras et al (2021a) and Köhler et al (2020) constructed equivariant graph normalizing flows by incorporating equivariant graph neural networks (GNNs) into an ODE framework to obtain an invertible equivariant function.
Probabilistic graph-generative models. Probabilistic graph generation methods learn the distribution of the observed set of graphs, p(G), and allow new realistic graphs to be generated by sampling from the learned distribution. Sequential methods generate the nodes and edges one after another in a sequential decision-making process, while one-shot methods generate all nodes and edges at once. Examples of sequential architectures include generative recurrent neural networks (RNNs) (You et al 2018) and autoregressive flow-based models (Shi et al 2020), whereas examples of one-shot architectures include variational autoencoders (Simonovsky and Komodakis 2018), generative adversarial networks (Fan and Huang 2019), normalizing flows (Madhawa et al 2019) and diffusion-based models (Fan et al 2023) (for a comprehensive review, please refer to Zhao 2022, Zhu et al 2022). Sometimes, we would like the generative process to be conditioned on the presence of an additional graph, Ĝ, which we refer to as the base graph. Unlike graph transformation methods, which translate an input graph in the source domain to the corresponding output graph in the target domain (Anguelov et al 2004), the goal is to learn a conditional distribution over another such graph, G, which we refer to as the complement graph, in the presence of the base graph. The conditional distribution, p(G|Ĝ), is obtained by learning the probabilistic dependencies between the number of nodes, the set of nodes, the set of edges, and any global graph properties of a set of complement graphs, conditioned on the corresponding base graphs; the learning process should account for the intricate dependencies and relationships of the entire complex consisting of both graphs. The ability to capture graph-conditional distributions has many applications. In the particular family of 3D graphs, for instance, it allows the generation of new drug candidates conditioned on the structure of a given 3D graph representing the receptor binding site. This specific task has been addressed in a recent line of work, which we refer to below, though with varying mathematical formulations.
3D molecular design conditioned on a receptor binding site. GNNs (Wu et al 2020) are a type of neural network designed to process and analyze data that is structured as graphs, and are particularly effective for tasks that involve relational data. GNNs are well-suited to problems in molecular modeling, with atoms and bonds represented as vertices and edges, respectively (Chen et al 2018b, Yang et al 2019, Flam-Shepherd et al 2021, Jiang et al 2021). GNN methods were shown to be useful for detecting a receptor's binding site (pocket) and predicting receptor pocket-ligand or protein-protein interactions (Gainza et al 2020, Sverrisson et al 2021), as well as the drug binding structure or the rigid body protein-protein docking structure (Ganea et al 2021, Stärk et al 2022). Several generative methods to produce molecules were previously demonstrated for the unconditional setting (Gebauer et al 2019, Satorras et al 2021a, Hoogeboom et al 2022, Trippe et al 2022). The binding site-conditioned molecular design line of research is newer than the general (unconditioned) GNN-based methods described above. Ragoza et al (2022) is a pioneering work in this area, which uses a conditional VAE on an image-like discretized 3D atomic density grid; a post-sampling step converts this grid structure into molecules using an atom fitting algorithm. A more recent class of works (Drotár et al 2021, Luo et al 2021, Peng et al 2022) uses an auto-regressive approach by treating the generation of molecular structures as a sequential process, thereby learning the probability distribution of each atom given the previous atoms; thus, the molecule is sampled from the joint distribution step by step, with each step generating a new part of the molecule based on the previous parts inside the binding site. Luo et al (2021) derive a model which captures the probability that a point in 3D space is occupied by an atom of a particular chemical element. Liu et al (2022) propose the GraphBP framework, which eliminates the need to discretize space; to place a new atom, they generate its atom type and relative location while preserving the equivariance property. Peng et al (2022) include bond distributions and the corresponding bond prediction during the generation process. Lin et al (2022) and Schneuing et al (2022) employ a diffusion model, with Schneuing et al (2022) developing an equivariant approach. By contrast, Fialková et al (2021), Li et al (2021), Thomas et al (2021) and Fu et al (2022) use approaches based on reinforcement learning. Other works include fragment-based ligand generation, in which new molecular fragments are sequentially attached to the growing molecule (Powers et al 2022); abstraction of the geometric interaction features of the receptor-ligand complex to a latent space, for generative models such as Bayesian sampling (Wang et al 2022b) and RNNs (Zhang and Chen 2022); and the use of experimental electron densities as training data for the conditional generative model (Wang et al 2022a).

Semi-equivariant conditional normalizing flows
Objective. We refer to graphs whose vertices correspond to points in $\mathbb{R}^3$ as 3D graphs. We are given one 3D graph, which we refer to as the base graph, and which we denote Ĝ; given this base graph, we are then interested in the characteristics of a second 3D graph, which we refer to as the complement graph, and which we denote G. In particular, we are interested in learning a conditional distribution of the form p(G|Ĝ), which tells us which complement graphs have a high likelihood of being associated with a given base graph. In the molecular scenario, the base graph will represent a receptor / target and the complement graph will represent a ligand.
Note that to be physically plausible, the distribution p(G|Ĝ) must be invariant to certain groups of transformations: rigid motions applied jointly to both graphs, as well as permutations applied to each of the graphs separately. Enforcing these invariances provides a strong inductive bias to our learning algorithm.
Organization. This section proposes a method for learning such a conditional distribution p(G|Ĝ). Section 3.1 introduces notation. Section 3.2 decomposes the distribution into four subdistributions using a Markov decomposition, and shows how the invariance properties apply to each subdistribution. Section 3.3 is an intermezzo, describing two types of Equivariant Graph Neural Networks (EGNNs); these EGNNs are then used in sections 3.4 and 3.5 and in appendix A to propose forms for each of the subdistributions with the appropriate invariance properties. The main result appears in section 3.5, which shows that the vertex subdistribution can be implemented as a particular type of continuous normalizing flow.

Notation and goal
Notation. We use the following notation. A 3D graph is given by

$G = (N, V, E, A),$

where N is the number of vertices; V is the list of vertices; E is the list of edges; and A is the set of global graph properties, i.e. properties which apply to the entire graph. The vertex list is

$V = (v_1, \dots, v_N), \qquad v_i = (x_i, h_i),$

in which $x_i \in \mathbb{R}^3$ is the position of the vertex, and $h_i \in \mathbb{R}^{d_h}$ contains the properties of the vertex. More generally, this may include continuous properties, discrete ordinal properties, and discrete categorical properties (using the one-hot representation); $h_i$ may be thought of as a concatenation of all such properties. The edges in the graph G are undirected, and the edge list E is specified by a neighbourhood relationship. Specifically, if $\eta_i$ is the set of vertex i's neighbours, then we write $E = \{e_{ij}\}_{i<j:\, j \in \eta_i}$. The vector $e_{ij} \in \mathbb{R}^{d_e}$ contains the properties of the edge connecting vertices i and j; more generally, $e_{ij}$ may contain a concatenation of various properties in a manner analogous to the vertex properties $h_i$ as described above. Finally, the graph properties are given by K individual properties, i.e. $A = (a_1, \dots, a_K)$. A given property $a_k$ can be either continuous, categorical or ordinal.
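To make the notation concrete, the following is a minimal Python sketch of the 3D graph data structure G = (N, V, E, A); the class and field names are our own illustrative choices, not identifiers from any released code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple
import numpy as np

@dataclass
class Graph3D:
    """A 3D graph G = (N, V, E, A): vertices carry positions and properties."""
    x: np.ndarray                                  # vertex positions x_i, shape (N, 3)
    h: np.ndarray                                  # vertex properties h_i, shape (N, d_h)
    edges: Dict[Tuple[int, int], np.ndarray] = field(default_factory=dict)
    # edge properties e_ij in R^{d_e}, keyed by (i, j) with i < j and j in eta_i
    A: List = field(default_factory=list)          # global properties (a_1, ..., a_K)

    @property
    def N(self) -> int:                            # number of vertices
        return self.x.shape[0]

# Example: a two-vertex graph with a single edge.
g = Graph3D(x=np.zeros((2, 3)), h=np.zeros((2, 4)),
            edges={(0, 1): np.array([1.0])}, A=["some-global-property"])
assert g.N == 2
```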

Rigid transformations.
The action of a rigid transformation T ∈ E(3) on a graph G is given by

$TG = (TN, TV, TE, TA), \qquad TV = ((Tx_1, h_1), \dots, (Tx_N, h_N)), \qquad Tx_i = Rx_i + t, \tag{2}$

where R and t are the rotation and translation comprising T, and the other variables are unaffected by T; that is, TN = N, TE = E, and TA = A.
Permutations. The action of a permutation π ∈ S_N on a graph G with N(G) = N is given by

$\pi G = (\pi N, \pi V, \pi E, \pi A), \qquad \pi V = (v_{\pi^{-1}(1)}, \dots, v_{\pi^{-1}(N)}), \tag{3}$

with the edge indices in πE relabelled accordingly, and the other variables unaffected by π; that is, πN = N and πA = A.
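As a sanity check on these definitions, here is a small sketch of the two group actions on the vertex data: a rigid transformation moves only the positions (leaving h, E and A untouched, and hence pairwise distances unchanged), while a permutation merely reorders the vertices. The function names are illustrative.

```python
import numpy as np

def apply_rigid(x: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Action of T = (R, t) in E(3): only the positions move; h, E, A are unchanged."""
    return x @ R.T + t

def apply_permutation(x: np.ndarray, h: np.ndarray, perm: np.ndarray):
    """Action of pi in S_N: vertices (and hence edge indices) are reordered."""
    return x[perm], h[perm]

# A random orthogonal matrix (via QR) leaves all pairwise distances unchanged.
rng = np.random.default_rng(0)
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
x = rng.normal(size=(5, 3))
xt = apply_rigid(x, R, t=rng.normal(size=3))
d = np.linalg.norm(x[:, None] - x[None], axis=-1)
dt = np.linalg.norm(xt[:, None] - xt[None], axis=-1)
assert np.allclose(d, dt)
```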
Goal. We assume that we have both a base graph and a complement graph, each of which is specified by a 3D graph. We denote the base graph by $\hat G = (\hat N, \hat V, \hat E, \hat A)$ and the complement graph by $G = (N, V, E, A)$. (Note: if there are properties of the entire complex consisting of both the base graph and the complement graph, these are subsumed into the complement graph properties, A.) Our goal is to learn a conditional generative model: given the base graph, we would like to generate possible complement graphs. Formally, we want to learn p(G|Ĝ).
We want our generative model to observe two types of symmetries. First, if we transform both the complement graph and the base graph with the same rigid transformation, the probability should not change:

$p(TG\,|\,T\hat G) = p(G\,|\,\hat G) \quad \text{for all } T \in E(3). \tag{6}$

Second, permuting the order of either the complement graph or the base graph should not affect the probability:

$p(\pi G\,|\,\hat\pi \hat G) = p(G\,|\,\hat G) \quad \text{for all } \pi \in S_N,\ \hat\pi \in S_{\hat N}. \tag{7}$

Motivation. A natural question arises: what is the value of designing such an invariant distribution? It is clear that the distribution should be invariant with respect to the symmetries presented by the physics of the problem; that is, no particular rigid motion should be favored, as any given rigid motion is arbitrary. The same is true of permutations. Thus, building this kind of invariance directly into the architecture is an important inductive bias, which should facilitate effective learning. There is a long history of embedding such inductive biases into neural network architectures, beginning with convolutional neural networks (which are translation-invariant or -equivariant) and GNNs (which are permutation-invariant or -equivariant), and continuing on with methods which aim at more complex invariances or equivariances, such as Greydanus et al (2019).

Decomposition of the conditional distribution and invariance properties
Decomposition. We may break down the conditional generative model as follows:

$p(G|\hat G) = p(N|\hat G)\; p(V|N, \hat G)\; p(E|V, N, \hat G)\; p(A|E, V, N, \hat G). \tag{8}$

We refer to the four terms on the right-hand side of the equation as the Number Distribution, the Vertex Distribution, the Edge Distribution, and the Property Distribution, respectively. We will specify a model for each of these distributions in turn. First, however, we examine how the invariance properties affect the distributions.
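Operationally, the decomposition turns sampling into four ancestral stages. The sketch below shows only the control flow, with the four conditional samplers passed in as abstract callables (the argument names are placeholders, not the paper's API); the individual samplers are specified in the sections that follow.

```python
def sample_complement_graph(base_graph, sample_N, sample_V, sample_E, sample_A):
    """Ancestral sampling following the Markov decomposition (8):
    p(G|G^) = p(N|G^) p(V|N,G^) p(E|V,N,G^) p(A|E,V,N,G^)."""
    N = sample_N(base_graph)               # number distribution, section 3.4
    V = sample_V(N, base_graph)            # vertex distribution (continuous flow)
    E = sample_E(V, N, base_graph)         # edge distribution (appendix A)
    A = sample_A(E, V, N, base_graph)      # property distribution (appendix A)
    return N, V, E, A
```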
Invariance properties. Using the relations for rigid body transformations in (2), we have that

$p(TG|T\hat G) = p(N|T\hat G)\; p(TV|N, T\hat G)\; p(E|TV, N, T\hat G)\; p(A|E, TV, N, T\hat G). \tag{9}$

Similarly, using the relations for permutations in (3), we have that

$p(\pi G|\hat\pi \hat G) = p(N|\hat\pi \hat G)\; p(\pi V|N, \hat\pi \hat G)\; p(\pi E|\pi V, N, \hat\pi \hat G)\; p(A|\pi E, \pi V, N, \hat\pi \hat G). \tag{10}$

Comparing equation (8) with (9) and (10), the following conditions are sufficient for conditional rigid body invariance (6) and conditional permutation invariance (7):

$p(N|T\hat G) = p(N|\hat G) \quad \text{and} \quad p(N|\hat\pi \hat G) = p(N|\hat G), \tag{11}$

$p(TV|N, T\hat G) = p(V|N, \hat G) \quad \text{and} \quad p(\pi V|N, \hat\pi \hat G) = p(V|N, \hat G), \tag{12}$

together with the analogous conditions on the edge distribution (13) and the property distribution (14).

Figure 1. The conditional probabilistic approach. Top row: given a base graph (on the leftmost side), denoted as Ĝ, we wish to conditionally sample a second 3D graph, which we refer to as the complement graph (represented by the enlarged vertices and edges), denoted as G. The Markov decomposition of the conditional distribution, which ultimately allows us to obtain the conditional probability p(G|Ĝ), is illustrated by its successive stages from left to right. Specifically: N is the number of vertices; V is the list of vertices; E is the list of edges; and A is the set of global graph properties (light green marking). Bottom row: we illustrate the invariance to rigid motions applied jointly to both graphs and permutations applied to each of the graphs separately.

Intermezzo: two flavors of EGNNs
In order to incorporate the relevant invariance properties, it will be helpful to use Equivariant GNNs, also known as EGNNs (Satorras et al 2021b). EGNNs are an extension of GNNs (Wu et al 2020) and similarly propagate information across the graph by exchanging messages between connected vertices. The key idea behind EGNNs is the message passing mechanism, which is equivariant to rotations, translations, reflections and permutations. At each vertex, the input positions and properties of the vertex and its neighboring vertices are combined, along with any additional necessary properties of the graph, to compute an updated representation for the vertex positions and properties. This process is repeated iteratively, allowing the information to propagate throughout the entire graph. The updating step in an EGNN involves two main components: a message function ($\hat m^\ell_{ij}$/$m^\ell_{ij}$ in equations (15)/(17)), which determines how information is passed from a vertex to its neighbors; and an aggregation function ($\hat m^\ell_i$/$m^\ell_i$ in equations (15)/(17)), which combines the incoming messages at each vertex. The vertex positions and properties in equations (15)/(17) are then residually updated while preserving the relevant invariance properties. We now introduce two separate flavors of EGNNs, one which applies to the base graph alone, and a second which applies to the combination of the base graph and the complement graph.
Base Graph EGNN. This is the standard EGNN which is described in (Satorras et al 2021b), applied to the base graph. As we are dealing with the base graph, we use hatted variables:

$\hat m^\ell_{ij} = \hat\varphi_e\big(\hat h^\ell_i, \hat h^\ell_j, \|\hat x^\ell_i - \hat x^\ell_j\|^2, \hat e_{ij}\big), \qquad \hat m^\ell_i = \sum_{j \in \hat\eta_i} \hat\varphi_b(\hat m^\ell_{ij})\, \hat m^\ell_{ij},$

$\hat x^{\ell+1}_i = \hat x^\ell_i + \sum_{j \in \hat\eta_i} (\hat x^\ell_i - \hat x^\ell_j)\, \hat\varphi_x(\hat m^\ell_{ij}), \qquad \hat h^{\ell+1}_i = \hat\varphi_h\big(\hat h^\ell_i, \hat m^\ell_i\big). \tag{15}$

The particular Base Graph EGNN is thus specified by the functions $\hat\varphi_e, \hat\varphi_b, \hat\varphi_x, \hat\varphi_h$. In practice, prior to applying the EGNN one may apply: (i) an ActNorm layer to the vertex positions and properties, as well as the edge properties and graph properties; (ii) a linear transformation to the vertex properties.
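For concreteness, the following is a minimal PyTorch sketch of one such EGNN layer on a fully connected graph, following the standard message/coordinate/feature updates of Satorras et al (2021b); the exact inputs to each MLP, and the use of φ_b as a soft edge weight in the aggregation, are our simplifying assumptions.

```python
import torch
import torch.nn as nn

class EGNNLayer(nn.Module):
    """One E(3)-equivariant message-passing layer on a fully connected graph."""
    def __init__(self, d_h: int, d_m: int = 64):
        super().__init__()
        self.phi_e = nn.Sequential(nn.Linear(2 * d_h + 1, d_m), nn.SiLU(),
                                   nn.Linear(d_m, d_m), nn.SiLU())   # message fn
        self.phi_b = nn.Sequential(nn.Linear(d_m, 1), nn.Sigmoid())  # soft edge weight
        self.phi_x = nn.Sequential(nn.Linear(d_m, d_m), nn.SiLU(),
                                   nn.Linear(d_m, 1))                # coordinate fn
        self.phi_h = nn.Sequential(nn.Linear(d_h + d_m, d_m), nn.SiLU(),
                                   nn.Linear(d_m, d_h))              # feature fn

    def forward(self, x: torch.Tensor, h: torch.Tensor):
        n = x.shape[0]
        diff = x[:, None, :] - x[None, :, :]                  # (n, n, 3)
        dist2 = (diff ** 2).sum(-1, keepdim=True)             # invariant pairwise input
        pair = torch.cat([h[:, None, :].expand(n, n, -1),
                          h[None, :, :].expand(n, n, -1), dist2], dim=-1)
        m = self.phi_e(pair)                                  # messages m_ij
        off_diag = (1.0 - torch.eye(n, device=x.device))[..., None]
        x = x + (diff * self.phi_x(m) * off_diag).sum(dim=1)  # equivariant position update
        m_i = (self.phi_b(m) * m * off_diag).sum(dim=1)       # weighted aggregation
        h = h + self.phi_h(torch.cat([h, m_i], dim=-1))       # residual invariant update
        return x, h

layer = EGNNLayer(d_h=8)
x_out, h_out = layer(torch.randn(10, 3), torch.randn(10, 8))
```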

Conditional EGNN.
It is possible to design a joint EGNN on the base graph and the complement graph, by constructing a single graph to capture both. The main problem with this approach is that in many applications, the base graph can be much larger than the complement graph; for example, in the molecular scenario the receptor is 1-2 orders of magnitude larger than the ligand. As a result, this naive approach will lead to a situation in which the complement graph is 'drowned out' by the base graph, making it difficult to learn about the complement graph. We therefore take a different approach: we compute summary signatures of the base graph based on the Base Graph EGNN, and use these as input to a Complement Graph EGNN. Our signatures will be based on the embedding variables $\hat h^\ell_j$ from each layer $\ell = 1, \dots, L$. These variables are invariant to rigid body transformations by construction; furthermore, we can introduce permutation invariance by averaging, that is,

$\hat h^\ell_{av} = \frac{1}{\hat N} \sum_{j=1}^{\hat N} \hat h^\ell_j.$

In section 3.5, we will see that the vertex distribution is described by a continuous normalizing flow. In anticipation of this, we wish to introduce a dependence (crucial in practice) on the time variable t of the ODE corresponding to this flow. Thus, the base graph at layer ℓ of the complement graph's EGNN is summarized by the signature $\hat g^\ell$, which depends on both $\hat h^\ell_{av}$ and t:

$\hat g^\ell = \hat g^\ell(\hat h^\ell_{av}, t).$

These invariant base graph signatures $\{\hat g^\ell\}_{\ell=1}^{L}$ are then naturally incorporated into the Conditional EGNN as follows:

$m^\ell_{ij} = \varphi_e\big(h^\ell_i, h^\ell_j, \|x^\ell_i - x^\ell_j\|^2, \hat g^\ell\big), \qquad m^\ell_i = \sum_{j \in \eta_i} \varphi_b(m^\ell_{ij})\, m^\ell_{ij},$

$x^{\ell+1}_i = x^\ell_i + \sum_{j \in \eta_i} (x^\ell_i - x^\ell_j)\, \varphi_x(m^\ell_{ij}), \qquad h^{\ell+1}_i = \varphi_h\big(h^\ell_i, m^\ell_i\big). \tag{17}$

The particular Conditional EGNN is thus specified by the functions $\varphi_e, \varphi_b, \varphi_x, \varphi_h$.
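A sketch of how the invariant signatures ĝ^ℓ might be computed is shown below: averaging over vertices gives permutation invariance, using only the ĥ features gives invariance to rigid motions, and concatenating the flow time t is one plausible way to realize the time dependence (layer sizes are illustrative).

```python
import torch
import torch.nn as nn

class BaseGraphSignature(nn.Module):
    """Invariant signature g^l(t) summarizing the layer-l base-graph embeddings."""
    def __init__(self, d_h: int, d_g: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_h + 1, d_g), nn.SiLU(),
                                 nn.Linear(d_g, d_g))

    def forward(self, h_hat: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # h_hat: (N_hat, d_h) embeddings from layer l of the Base Graph EGNN;
        # averaging over vertices gives permutation invariance.
        h_av = h_hat.mean(dim=0)
        return self.mlp(torch.cat([h_av, t.reshape(1)]))

sig = BaseGraphSignature(d_h=16)
g_l = sig(torch.randn(200, 16), torch.tensor(0.3))   # one signature per layer l
```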

The number distribution: p(N|Ĝ)
Construction. Given the invariance conditions for the number distribution p(N|Ĝ) described in equation (11), we propose the following distribution. Let $\zeta_N$ indicate a one-hot vector, where the index corresponding to N is filled in with a 1. Based on the output of the base graph EGNN, compute

$p(N|\hat G) = \zeta_N^\top \cdot \mathrm{MLP}_{outer}\Big(\frac{1}{\hat N} \sum_{i=1}^{\hat N} \mathrm{MLP}_{inner}\big(\hat h^L_i\big)\Big),$

where the outer MLP's (multilayer perceptron) last layer is a softmax of size equal to the maximum number of vertices allowed. Due to the fact that we use the $\hat h^L_i$ vectors (and not the $\hat x^L_i$ vectors), we have the first invariance condition, as $T\hat h^L_i = \hat h^L_i$. Due to the fact that we use an average, we have the second invariance condition. Note that we can choose to make the inner MLP the identity, if we so desire.
Loss function. The loss function is straightforward here: it is simply the negative log-likelihood of the number distribution, i.e. $\mathcal{L}_N = -\log p(N|\hat G)$.
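A hedged sketch of this construction: an inner MLP applied per vertex, a permutation-invariant average, and an outer MLP ending in a softmax over candidate vertex counts. Layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class NumberDistribution(nn.Module):
    """p(N | G^) built from the final-layer base-graph embeddings h^L_i."""
    def __init__(self, d_h: int, n_max: int, d: int = 64):
        super().__init__()
        self.inner = nn.Sequential(nn.Linear(d_h, d), nn.SiLU())   # may be the identity
        self.outer = nn.Sequential(nn.Linear(d, d), nn.SiLU(), nn.Linear(d, n_max))

    def log_prob(self, h_hat_L: torch.Tensor, N: int) -> torch.Tensor:
        pooled = self.inner(h_hat_L).mean(dim=0)           # invariant pooling
        log_p = torch.log_softmax(self.outer(pooled), -1)  # softmax over 1..n_max
        return log_p[N - 1]                                # zeta_N selects entry N

    def nll(self, h_hat_L: torch.Tensor, N: int) -> torch.Tensor:
        return -self.log_prob(h_hat_L, N)                  # loss: -log p(N | G^)
```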
The vertex distribution: p(V|N,Ĝ) via continuous normalizing flows

General notation. Given the invariance conditions for the vertex distribution p(V|N,Ĝ) described in equation (12), we now outline a procedure for constructing such a distribution. We begin with some notation. Let us define a vectorization operation on the vertex list V, which produces a vector v of dimension $d^N_v = N(3 + d_h)$; we refer to this as a vertex vector. We denote the mapping from the vertex list V to the vertex vector v as the vectorization operation vec(·):

v = vec(V)
and its inverse, mapping the vertex vector back to the vertex list, as $V = \mathrm{vec}^{-1}(v)$. We have already described the action of rigid body transformations T and permutations π on the vertex list V in equations (2) and (3). It is easy to extend this to vertex vectors v using the vec operation; we have $Tv = \mathrm{vec}(T\,\mathrm{vec}^{-1}(v))$ and $\pi v = \mathrm{vec}(\pi\,\mathrm{vec}^{-1}(v))$.
Given the above, it is sufficient for us to describe the distribution $p_{vec}(v|\hat G)$, from which the vertex distribution p(V|N,Ĝ) follows directly: $p(V|N,\hat G) = p_{vec}(\mathrm{vec}(V)|\hat G)$. Note that we have suppressed N in the condition in $p_{vec}(\cdot)$, as v is a vector of dimension $d^N_v$, so the N dependence is already implicitly encoded.
Complex-to-complement mapping and semi-equivariance. Let γ be a function which takes as input the complex consisting of both the complement graph G and the base graph Ĝ, and outputs a new vertex list V′ for the complement graph G:

$V' = \gamma(G, \hat G).$

We refer to γ as a Complex-to-Complement Mapping. A rigid body transformation T ∈ E(3) consists of a rotation and translation; let the rotation be denoted as $T_{rot}$. Then we say that γ is rotation semi-equivariant if

$\gamma(TG, T\hat G) = T_{rot}\, \gamma(G, \hat G),$

where, as before, the action of T on a graph is given by equation (2). γ is said to be permutation semi-equivariant if

$\gamma(\pi G, \hat\pi \hat G) = \pi\, \gamma(G, \hat G),$

where, as before, the action of the permutation on a graph is given by equation (3). Note, in the definitions of both types of semi-equivariance, the differing roles played by the complement graph and the base graph; as the equivariant behavior only applies to the complement graph, we have used the term semi-equivariance.
Conditional flow. Given a vertex vector u, let $G_u$ denote the complement graph whose vertex list is $\mathrm{vec}^{-1}(u)$. Then the following ordinary differential equation is referred to as a Conditional Flow:

$\frac{du}{dt} = \mathrm{vec}\big(\gamma(G_{u(t)}, \hat G)\big), \qquad u(0) = z,$

where the initial condition z ∼ N(0, I) is a Gaussian random vector of dimension $d^N_v$, and the ODE is run until t = 1. u(1) is thus the output of the Conditional Flow.
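In code, the Conditional Flow can be integrated with an off-the-shelf ODE solver; the sketch below uses torchdiffeq's odeint, wrapping γ so that the solver sees a plain vector field with the base graph held fixed. Passing t explicitly to γ (through which the time-dependent signatures ĝ^ℓ enter) and the interface names are our assumptions.

```python
import torch
from torchdiffeq import odeint  # pip install torchdiffeq

class ConditionalFlow(torch.nn.Module):
    """du/dt = vec(gamma(G_u, G^)), u(0) = z ~ N(0, I), integrated to t = 1."""
    def __init__(self, gamma, base_graph):
        super().__init__()
        self.gamma = gamma            # complex-to-complement mapping (Conditional EGNN)
        self.base_graph = base_graph  # fixed conditioning input

    def forward(self, t, u):
        # gamma rebuilds the complement graph G_u from u internally; t feeds
        # the time-dependent base-graph signatures g^l.
        return self.gamma(u, self.base_graph, t)

def sample_vertex_vector(flow: ConditionalFlow, d_v_N: int) -> torch.Tensor:
    z = torch.randn(d_v_N)                         # Gaussian initial condition
    ts = torch.tensor([0.0, 1.0])
    u = odeint(flow, z, ts, rtol=1e-4, atol=1e-4)  # dopri5 is the default solver
    return u[-1]                                   # u(1): output of the flow
```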
Vertex distributions with appropriate invariance. We now have all of the necessary ingredients in order to construct a distribution $p_{vec}(v|\hat G)$ which yields a vertex distribution p(V|N,Ĝ) that satisfies the invariance conditions that we require. The following is our main result:

Theorem (Invariant vertex distribution). Let u(1) be the output of a Conditional Flow specified by the Complex-to-Complement Mapping γ. Let the mean position of the base graph be given by $\hat x_{av} = \frac{1}{\hat N} \sum_{i=1}^{\hat N} \hat x_i$, and define the vertex vector v from u(1) by leaving the property components unchanged and recentering the coordinate components using $\hat x_{av}$ (this mapping is equation (55); its inverse is denoted $\Gamma^1_{\hat G}$ in section 4). Suppose that γ is both rotation semi-equivariant and permutation semi-equivariant. Then the resulting distribution on v, that is $p_{vec}(v|\hat G)$, yields a vertex distribution $p(V|N,\hat G) = p_{vec}(\mathrm{vec}(V)|\hat G)$ that satisfies the invariance conditions in equation (12).

Proof. See section 4.
Designing the complex-to-complement mapping. Examining the theorem, we see that the one degree of freedom that we have is the Complex-to-Complement Mapping γ. For this, we choose to use the Conditional EGNN, where the output of the Complex-to-Complement Mapping V′ = γ(G,Ĝ) can simply be the final layer, i.e. the i-th vertex of V′ is given by $v'_i = (x^L_i, h^L_i)$. In practice, we use a slightly modified version of the foregoing, in which we displace the coordinates of the final layer by the initial input coordinates to the EGNN, that is, $v'_i = (x^L_i - x^0_i, h^L_i)$. We have found the latter to converge well empirically. The rotation semi-equivariance of γ (both versions - ordinary and displaced) follows straightforwardly from the rotation semi-equivariance of EGNNs and the rotation invariance of the base graph signatures $\{\hat g^\ell\}$. Similarly, the permutation semi-equivariance follows from the permutation semi-equivariance of EGNNs and the permutation invariance of the base graph signatures.
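Read literally, the displaced variant returns the net coordinate change of the EGNN rather than the absolute final coordinates; a one-line sketch, where we interpret 'displace by the initial input coordinates' as subtracting x^0:

```python
def complement_output(x0, xL, hL, displaced: bool = True):
    """V' from the Conditional EGNN: (x^L_i, h^L_i), or the displaced
    version (x^L_i - x^0_i, h^L_i) reported to converge well empirically."""
    return (xL - x0 if displaced else xL), hL
```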
Loss function. The vertex distribution is described by a continuous normalizing flow. The loss function and its optimization are implemented using standard techniques from this field (Grathwohl et al 2018, Chen et al 2018a, 2018c). As the vectors of vertex properties can contain discrete variables such as the atom type, techniques based on variational dequantization (Ho et al 2019) and argmax flows (Hoogeboom et al 2021) can be used, for ordinal and categorical properties respectively. This is parallel to the treatment in (Satorras et al 2021a).

Proof of the invariant vertex distribution theorem
In this section, we prove the invariant vertex distribution theorem presented in section 3.5. We begin with two lemmata, after which the proof of the theorem is presented.
Lemma 1. Consider the mapping from u(1) to v given in equation (55), where ⊗ indicates the Kronecker product in its matrix representation. Let the inverse mapping be denoted by $\Gamma^1_{\hat G}$, i.e. $u = \Gamma^1_{\hat G}(v)$.

For any rigid transformation T ∈ E(3), which consists of both a rotation and a translation, denote the transformation consisting only of the rotation of T as $T_{rot} \in O(3)$. Then

$\Gamma^1_{T\hat G}(Tv) = T_{rot}\, \Gamma^1_{\hat G}(v).$
Furthermore, for any permutations π ∈ S_N and π̂ ∈ S_N̂,

$\Gamma^1_{\hat\pi \hat G}(\pi v) = \pi\, \Gamma^1_{\hat G}(v).$

Proof. The mapping $u = \Gamma^1_{\hat G}(v)$ leaves the properties unchanged and centers the coordinates at the mean of the entire complex. Let us denote the parts of u corresponding to the coordinates and the properties as $x^u$ and $h^u$, respectively, and use similar notation for v. Then we have that

$h^u = h^v$

and

$x^u = x^v - \mathbf{1}_N \otimes \bar x^v.$

Breaking down this last equation by vertex gives

$x^u_i = x^v_i - \bar x^v, \qquad \bar x^v = \frac{N\, x^v_{av} + \hat N\, \hat x_{av}}{N + \hat N},$

where $x^v_{av}$ is the average coordinate position of $x^v$, and $\bar x^v$ indicates the average of all vertices in the entire complex, i.e. taking both the complement graph and the base graph together. Now, let us examine what happens when we apply the rigid transformation T to both v and the base graph Ĝ; that is, let us examine $\tilde u = \Gamma^1_{T\hat G}(Tv)$. In the case of the properties h, they are invariant by design; thus

$\tilde h^u = h^{Tv} = h^v = h^u,$

where the last equality follows from the relation above. In the case of the coordinates, the transformation is as follows:

$Tx = Rx + t,$

where R ∈ O(3) is the rotation matrix, and t ∈ $\mathbb{R}^3$ the translation vector, corresponding to the rigid motion T. As we apply T to the base graph Ĝ, this has the effect of applying this transformation to each of the base graph vertices, and hence to their mean and to the mean of the entire complex:

$T\hat x_{av} = R\, \hat x_{av} + t, \qquad T\bar x^v = R\, \bar x^v + t.$

Thus, substituting Tv and TĜ in place of v and Ĝ in the vertex-wise equation, we get

$\tilde x^u_i = (R\, x^v_i + t) - (R\, \bar x^v + t) = R\, x^u_i.$

Combining the property and coordinate relations, we have that $\tilde u = T_{rot}\, u$; since $u = \Gamma^1_{\hat G}(v)$, this is exactly $\Gamma^1_{T\hat G}(Tv) = T_{rot}\, \Gamma^1_{\hat G}(v)$, as desired. In the case of the permutations, let us now set $\tilde u = \Gamma^1_{\hat\pi \hat G}(\pi v)$. It is easy to see that π̂ has no effect; the only place the base graph enters is through the quantities N̂ and $\hat x_{av}$, both of which are permutation-invariant. For the properties, we now have

$\tilde h^u_i = h^v_{\pi^{-1}(i)}.$

That is, the properties are simply reordered according to π. With regard to the coordinates, we have that

$\tilde x^u_i = x^v_{\pi^{-1}(i)} - \bar x^v.$

The coordinates are also therefore simply reordered according to π. Summarizing, we have that $\tilde u = \pi\, u$. This is exactly the claim $\Gamma^1_{\hat\pi \hat G}(\pi v) = \pi\, \Gamma^1_{\hat G}(v)$, which concludes the proof.
Lemma 2. Let $\Gamma^2_{\hat G}$ denote the inverse of the Conditional Flow, i.e. $z = \Gamma^2_{\hat G}(u(1))$. Then $\Gamma^2_{T\hat G}(T_{rot}\, u) = T_{rot}\, \Gamma^2_{\hat G}(u)$ and, for any permutations π ∈ S_N and π̂ ∈ S_N̂, $\Gamma^2_{\hat\pi \hat G}(\pi u) = \pi\, \Gamma^2_{\hat G}(u)$.

Proof. Plugging the above three results into the flow for ũ in equation (45) yields a flow for $\breve u(t) = T_{rot}^{-1}\, \tilde u(t)$ that is precisely identical to the flow described in equation (44); thus, we have that $\breve u(t) = u(t)$. But $\tilde u(t) = T_{rot}\, \breve u(t)$, so that $\tilde u(t) = T_{rot}\, u(t)$, and in particular $\tilde u(1) = T_{rot}\, u(1)$. Comparing with equation (43) completes the rigid motion part of the proof. Let us now turn to permutations; the proof is similar, but we repeat it in full for completeness. Our goal is to show that $\Gamma^2_{\hat\pi \hat G}(\pi u) = \pi\, \Gamma^2_{\hat G}(u)$. Let us define $F_{\hat G}$ to be the inverse of $\Gamma^2_{\hat G}$, and let $u = F_{\hat G}(z)$. Then

$\Gamma^2_{\hat\pi \hat G}(\pi u) = \Gamma^2_{\hat\pi \hat G}\big(\pi\, F_{\hat G}(z)\big).$

Thus, it is sufficient to show that $F_{\hat\pi \hat G}(\pi z) = \pi\, F_{\hat G}(z)$. For convenience, we shall set $u(1) = F_{\hat G}(z)$ and $\tilde u(1) = F_{\hat\pi \hat G}(\pi z)$. In this case, u(1) is defined by the ODE

$\frac{du}{dt} = \mathrm{vec}\big(\gamma(G_u, \hat G)\big), \qquad u(0) = z, \tag{50}$

whereas ũ(1) is defined by the ODE

$\frac{d\tilde u}{dt} = \mathrm{vec}\big(\gamma(G_{\tilde u}, \hat\pi \hat G)\big), \qquad \tilde u(0) = \pi z. \tag{51}$
Plugging the above three results into the flow for ũ in equation (51) yields a flow for $\breve u(t) = \pi^{-1}\, \tilde u(t)$ that is precisely identical to the flow described in equation (50); thus, we have that $\breve u(t) = u(t)$. But $\tilde u(t) = \pi\, \breve u(t)$, so that $\tilde u(t) = \pi\, u(t)$, and in particular $\tilde u(1) = \pi\, u(1)$. Comparing with equation (49) completes the proof.

Theorem (invariant vertex distribution, restated). Suppose that γ is both rotation semi-equivariant and permutation semi-equivariant. Then the resulting distribution on v, that is $p_{vec}(v|\hat G)$, yields a vertex distribution $p(V|N,\hat G) = p_{vec}(\mathrm{vec}(V)|\hat G)$ that satisfies the invariance conditions in equation (12).
Proof. The Conditional Flow maps from the Gaussian random variable z to the variable u(1). As this flow is a normalizing flow, it is invertible, so let us denote the inverse mapping by Γ²:

$z = \Gamma^2_{\hat G}(u(1)).$

Note that the dependence on the base graph Ĝ is denoted using a subscript, as the invertibility does not apply to the base graph, but only to the complement graph. Equation (55) maps from the variable u(1) to the variable v; let us denote its inverse mapping by Γ¹:

$u(1) = \Gamma^1_{\hat G}(v).$

In this case, we have that

$z = \Gamma_{\hat G}(v), \qquad \Gamma_{\hat G} = \Gamma^2_{\hat G} \circ \Gamma^1_{\hat G}. \tag{58}$

Now, our goal is to show that the following condition holds:

$p(TV|N, T\hat G) = p(V|N, \hat G).$

Using the $p_{vec}$ notation, this translates to $p_{vec}(Tv|T\hat G) = p_{vec}(v|\hat G)$.
Now, from equation (58), the fact that Γ is invertible, and the change of variables formula, we have that

$p_{vec}(v|\hat G) = p_z\big(\Gamma_{\hat G}(v)\big)\, \big|\det J_{\Gamma_{\hat G}}(v)\big|, \tag{60}$

where $p_z(\cdot)$ is the Gaussian distribution from which z is sampled, and $J_{\Gamma_{\hat G}}(\cdot)$ is the Jacobian of $\Gamma_{\hat G}(\cdot)$. Since $\Gamma_{\hat G} = \Gamma^2_{\hat G} \circ \Gamma^1_{\hat G}$, this can be expanded as

$p_{vec}(v|\hat G) = p_z\big(\Gamma^2_{\hat G}(\Gamma^1_{\hat G}(v))\big)\, \big|\det J_{\Gamma^2_{\hat G}}(\Gamma^1_{\hat G}(v))\big|\, \big|\det J_{\Gamma^1_{\hat G}}(v)\big|,$

using the chain rule, and the fact that the determinant of a product is the product of determinants. Plugging this into equation (60), we must show that

$p_z\big(\Gamma^2_{T\hat G}(\Gamma^1_{T\hat G}(Tv))\big)\, \big|\det J_{\Gamma^2_{T\hat G}}(\Gamma^1_{T\hat G}(Tv))\big|\, \big|\det J_{\Gamma^1_{T\hat G}}(Tv)\big| = p_z\big(\Gamma^2_{\hat G}(\Gamma^1_{\hat G}(v))\big)\, \big|\det J_{\Gamma^2_{\hat G}}(\Gamma^1_{\hat G}(v))\big|\, \big|\det J_{\Gamma^1_{\hat G}}(v)\big|. \tag{63}$

A rigid transformation T ∈ E(3) consists of both a rotation and a translation. For brevity, denote the transformation consisting only of the rotation of T as $T_{rot} \in O(3)$. Now, from lemma 1, we have that

$\Gamma^1_{T\hat G}(Tv) = T_{rot}\, \Gamma^1_{\hat G}(v). \tag{64}$

From lemma 2, we have that

$\Gamma^2_{T\hat G}(T_{rot}\, u) = T_{rot}\, \Gamma^2_{\hat G}(u). \tag{65}$

Combining equations (64) and (65) gives that

$p_z\big(\Gamma^2_{T\hat G}(\Gamma^1_{T\hat G}(Tv))\big) = p_z\big(T_{rot}\, \Gamma^2_{\hat G}(\Gamma^1_{\hat G}(v))\big) = p_z\big(\Gamma^2_{\hat G}(\Gamma^1_{\hat G}(v))\big), \tag{66}$

where the last equality follows from the rotation invariance of the Gaussian distribution. Note that differentiating equation (64) gives

$J_{\Gamma^1_{T\hat G}}(Tv)\; T_{rot} = T_{rot}\; J_{\Gamma^1_{\hat G}}(v). \tag{67}$

Now, $T_{rot}$ can be represented by the $d^N_v \times d^N_v$ block diagonal matrix given by

$T_{rot} = \begin{pmatrix} I_N \otimes R & 0 \\ 0 & I_{N d_h} \end{pmatrix}, \tag{68}$

where R ∈ O(3); the top-left block corresponds to the coordinates x and the bottom-right block corresponds to the property vector h. Thus,

$\big|\det J_{\Gamma^1_{T\hat G}}(Tv)\big| = \big|\det T_{rot}\big|\, \big|\det J_{\Gamma^1_{\hat G}}(v)\big|\, \big|\det T_{rot}^{-1}\big| = \big|\det J_{\Gamma^1_{\hat G}}(v)\big|, \tag{69}$

which follows from the facts that the determinant of a product is the product of determinants; that the determinant of a block diagonal matrix is the product of the determinants of the blocks; and that the determinant of a rotation matrix is ±1.
To simplify $J_{\Gamma^2_{T\hat G}}(\Gamma^1_{T\hat G}(Tv))$, note that differentiating lemma 2's relation gives

$J_{\Gamma^2_{T\hat G}}(T_{rot}\, u)\; T_{rot} = T_{rot}\; J_{\Gamma^2_{\hat G}}(u),$

where we have used equation (65). Thus,

$J_{\Gamma^2_{T\hat G}}(T_{rot}\, u) = T_{rot}\; J_{\Gamma^2_{\hat G}}(u)\; T_{rot}^{-1}. \tag{71}$

We wish to plug in $u = \Gamma^1_{T\hat G}(Tv)$. Note that from equation (64), $\Gamma^1_{T\hat G}(Tv) = T_{rot}\, \Gamma^1_{\hat G}(v)$. Thus,

$J_{\Gamma^2_{T\hat G}}(\Gamma^1_{T\hat G}(Tv)) = J_{\Gamma^2_{T\hat G}}(T_{rot}\, \Gamma^1_{\hat G}(v)) = T_{rot}\; J_{\Gamma^2_{\hat G}}(\Gamma^1_{\hat G}(v))\; T_{rot}^{-1}, \tag{72}$

where in the second equality we substituted equation (71). Taking determinants gives

$\big|\det J_{\Gamma^2_{T\hat G}}(\Gamma^1_{T\hat G}(Tv))\big| = \big|\det J_{\Gamma^2_{\hat G}}(\Gamma^1_{\hat G}(v))\big|, \tag{73}$

where we used the fact that permuting the order of a matrix multiplication does not affect the determinant. Combining equations (66), (69) and (73), we finally arrive at exactly equation (63). Thus, we have shown that $p(TV|N, T\hat G) = p(V|N, \hat G)$, as desired. Let us turn now to the permutation case, which is quite similar. Similar to equation (63), we need to show

$p_z\big(\Gamma^2_{\hat\pi\hat G}(\Gamma^1_{\hat\pi\hat G}(\pi v))\big)\, \big|\det J_{\Gamma^2_{\hat\pi\hat G}}(\Gamma^1_{\hat\pi\hat G}(\pi v))\big|\, \big|\det J_{\Gamma^1_{\hat\pi\hat G}}(\pi v)\big| = p_z\big(\Gamma^2_{\hat G}(\Gamma^1_{\hat G}(v))\big)\, \big|\det J_{\Gamma^2_{\hat G}}(\Gamma^1_{\hat G}(v))\big|\, \big|\det J_{\Gamma^1_{\hat G}}(v)\big|. \tag{75}$

From lemmata 1 and 2, we have that

$\Gamma^1_{\hat\pi \hat G}(\pi v) = \pi\, \Gamma^1_{\hat G}(v), \qquad \Gamma^2_{\hat\pi \hat G}(\pi u) = \pi\, \Gamma^2_{\hat G}(u).$

Thus

$p_z\big(\Gamma^2_{\hat\pi\hat G}(\Gamma^1_{\hat\pi\hat G}(\pi v))\big) = p_z\big(\pi\, \Gamma^2_{\hat G}(\Gamma^1_{\hat G}(v))\big) = p_z\big(\Gamma^2_{\hat G}(\Gamma^1_{\hat G}(v))\big), \tag{77}$

where the last equality follows from the permutation-invariance of the Gaussian distribution $p_z(\cdot)$.
In a manner parallel to the derivation of equation (67), we can show that

$J_{\Gamma^1_{\hat\pi\hat G}}(\pi v)\; \pi = \pi\; J_{\Gamma^1_{\hat G}}(v),$

where π now indicates the permutation matrix associated with the permutation π. Thus, we have that

$\big|\det J_{\Gamma^1_{\hat\pi\hat G}}(\pi v)\big| = \big|\det J_{\Gamma^1_{\hat G}}(v)\big|, \tag{79}$

where we have used the fact that a permutation matrix has determinant ±1. Similarly, in a manner parallel to the derivation of equation (72), we can show that

$\big|\det J_{\Gamma^2_{\hat\pi\hat G}}(\Gamma^1_{\hat\pi\hat G}(\pi v))\big| = \big|\det J_{\Gamma^2_{\hat G}}(\Gamma^1_{\hat G}(v))\big|. \tag{81}$

Combining equations (77), (79) and (81) yields equation (75), completing the proof.

Setting
We now apply the theory we have developed to the problem of target-aware structure-based molecule generation. The design of new molecules is an important topic, with applications in medicine, biochemistry, and materials science. Recently there have been quite a number of promising directions for applying machine learning techniques to this problem, including those which produce a generative model of molecules. Many approaches have focused on the unconditional setting, in which the goal is simply to produce molecules without regard for a more specific purpose; such techniques can, for example, effectively produce 'drug-like' molecules. By contrast, we are interested in the conditional setting. In particular, we are interested in learning a conditional distribution of the form p(G|Ĝ), which tells us which complement graphs are more likely to be associated with a given base graph. In this setting, the base graph plays the role of the receptor, while the complement graph plays the role of the ligand; vertices are atoms and edges are bonds. Specifically, we assume that we are given a target receptor molecule; the aim is then to generate ligand molecules that may successfully bind to this receptor. This kind of conditional generative model is very useful in the context of drug design, in which one often has a target (receptor) in mind, and the goal is then to find drugs (ligands) which will bind to the target. Note that our approach of using a probabilistic generative model, from which one can generate multiple samples, is quite important from a practical perspective. This is because not all ligand candidates will be equally suitable experimentally, due to considerations such as toxicity; thus, it is important to generate multiple candidates. Furthermore, given both a receptor and a ligand, the conditional probability p(G|Ĝ) provides a useful indicator of the quality of the binding interaction between the two molecules, which can be used as a kind of scoring function. In what follows, we will explore both the generation and scoring aspects of the learned conditional distribution p(G|Ĝ).

Experimental setup Molecular dataset.
Datasets with a large number of receptor-ligand complexes are critical to our endeavor. Many models have relied on the high quality PDBbind dataset, which curates the Protein Data Bank (PDB) (Liu et al 2017); however, for the training of generative models, this dataset is relatively small. CrossDocked2020 (Francoeur et al 2020) is the first large-scale standardized dataset for training ML models with ligand poses cross-docked against non-cognate receptor structures, greatly expanding the number of poses available for training. The dataset is organized by clustering of similar binding pockets across the PDB; each cluster contains ligands cross-docked against all receptors in the pocket. Each receptor-ligand structure also contains information indicating the nature of the docked pair, such as the root mean squared deviation (RMSD) to the reference crystal pose and the Vina cross-docking score (Trott and Olson 2010) as implemented in Smina (Koes et al 2013). The dataset contains 22.5 million poses of ligands docked into multiple similar binding pockets across the PDB. We use the authors' suggested split into training and validation sets (Francoeur et al 2020); this split contains docked receptor-ligand pairs whose binding pose RMSD is lower than 2 Å. Based on considerations of training duration, we keep only those data points whose ligand has 30 atoms or fewer, with atom types in {C, N, O, F}. We also keep only data points with potentially valid ligand-protein binding properties, i.e. whose ligands do not contain duplicate vertices and whose predicted Vina scores (Trott and Olson 2010) are within distribution. The refined datasets consist of 132 863 training data points and 63 929 validation data points (Rozenberg and Freedman 2023).
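A sketch of these filters as one might apply them with RDKit, assuming each data point carries an RDKit molecule with a 3D conformer and a precomputed Vina score; the 'within distribution' check is shown as a simple range test for illustration only.

```python
import numpy as np
from rdkit import Chem

ALLOWED_ELEMENTS = {"C", "N", "O", "F"}

def keep_datapoint(ligand: Chem.Mol, vina_score: float,
                   vina_lo: float = 0.0, vina_hi: float = 15.0) -> bool:
    """Ligand filters: <= 30 atoms, allowed elements only, no duplicate
    vertices, and a Vina score inside the accepted range."""
    symbols = {atom.GetSymbol() for atom in ligand.GetAtoms()}
    if ligand.GetNumAtoms() > 30 or not symbols <= ALLOWED_ELEMENTS:
        return False
    xyz = ligand.GetConformer().GetPositions()          # (n_atoms, 3)
    if len({tuple(np.round(p, 3)) for p in xyz}) < len(xyz):
        return False                                    # duplicate vertices
    return vina_lo <= vina_score <= vina_hi             # illustrative range test
```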
Properties. The ligand properties that we wish to predict include the atom type ∈ {C, N, O, F} (categorical); the stereo parity ∈ {not stereo, odd, even} (categorical); and the charge ∈ {−1, 0, +1} (ordinal). The receptor properties that are used are computed with the Graphein library (Jamasb et al 2022). The vertex properties include: the atom type ∈ {C, N, O, S, 'other'}, where 'other' is a catch-all for less common atom types (categorical); and the Meiler embeddings (Meiler et al 2001) (continuous, ∈ $\mathbb{R}^7$). The bond (edge) properties include: the bond order ∈ {Single, Double, Triple} (categorical); and the covalent bond length (continuous, ∈ $\mathbb{R}$). The receptor overall graph properties (Â) contain the weight of all chains contained within a polypeptide structure; see Jamasb et al (2022).
Training. Training the model takes approximately 18 days using a single NVIDIA A100 GPU for 39 epochs. We proactively stopped the training procedure when the negative log-likelihood (NLL) term reached a low improvement rate between epochs. We train with the Adam optimizer, weight decay of $10^{-12}$, batch size of 128, and learning rate of $2 \times 10^{-4}$. To improve the stability of the continuous flow model and to deal with ODE stiffness issues, we use the ODE regularization term described in Finlay et al (2020) with a value of $10^{-3}$, and perform gradient clipping, where the clipping threshold is set by calculating a moving average over 50 steps of the normalizing flow's gradient norm. This is parallel to the treatment in Satorras et al (2021a). We use dopri5 (Runge-Kutta 4(5) of Dormand-Prince) as the ODE solver, with relative (rtol) and absolute (atol) error tolerance values of $10^{-4}$.
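The corresponding optimizer and solver configuration, sketched with PyTorch (the moving-average gradient clipping is shown in simplified form, and the stand-in model is a placeholder):

```python
import torch
from collections import deque

model = torch.nn.Linear(8, 8)   # stand-in for the conditional flow network
opt = torch.optim.Adam(model.parameters(), lr=2e-4, weight_decay=1e-12)
solver_opts = dict(method="dopri5", rtol=1e-4, atol=1e-4)  # passed to the ODE solver
reg_weight = 1e-3               # ODE regularization term (Finlay et al 2020)

grad_norms = deque(maxlen=50)   # moving window of recent gradient norms

def clip_by_moving_average(params):
    params = list(params)
    # Measure the current gradient norm without clipping (max_norm = inf) ...
    norm = torch.nn.utils.clip_grad_norm_(params, float("inf"))
    # ... then clip to the moving average of the last 50 steps, once available.
    if grad_norms:
        torch.nn.utils.clip_grad_norm_(params, sum(grad_norms) / len(grad_norms))
    grad_norms.append(float(norm))
```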

Results
We first evaluate the utility of our technique as a conditional generative model. Given a receptor, we generate ligands which may successfully bind to that receptor. To show that our method is competitive as a generative model, we compare against Liu et al (2022) - one of the recent works that showed promising results in similar settings. We then demonstrate the utility of our technique for indicating the nature of the interaction of ligands with their corresponding binding site. The trained model provides the conditional probability, p(G|Ĝ), which quantifies the likelihood of a given ligand in the presence of a specific target receptor, and hence of a binding event occurring.
Conditional generative model. We perform inference using the method suggested in the baseline (Liu et al 2022). Given a receptor, we sample from the learned distribution, which generates the ligands' vertices; we then apply OpenBabel (Hummell et al 2021) to construct bonds. Evaluation follows the standard procedure (Liu et al 2022, Ragoza et al 2022). First, the receptor target is computed by taking all of the atoms in the receptor that are less than 15 Å from the center of mass of the reference ligand. We then generate 100 ligands for each reference binding site in the evaluation set, and compute statistics (i.e. validity and ∆Binding, see below) on this set of samples. As in Liu et al (2022) and Ragoza et al (2022), we use ten target receptors for evaluation; each target receptor has multiple associated ligands, leading to 90 (receptor, reference-ligand) pairs. We train the baseline technique (Liu et al 2022) on our filtered dataset. More specifically, the training continues for 100 epochs, using the hyperparameters given in the paper, with one exception: we set the atom number range of the auto-regressive generative process according to the atom distribution of the filtered dataset.
Validity. The validity is defined as the percentage of molecules that are chemically valid among all generated molecules. A molecule is valid if it can be sanitized by RDKit; for an explanation of the sanitization procedure, see (Landrum 2016). As shown in table 1(a), our model produces ligands with a validity of 99.87%, surpassing the baseline, GraphBP. We also compute the distribution of bond distances for the two methods, and compare this to the distribution of the reference ligands; see figure 2. Our method's distribution is considerably closer to the reference distribution than GraphBP's; some non-trivial fraction of the time, GraphBP produces unusual, very high bond distances. (In fact, we have discarded values higher than 10 Å on the GraphBP plot so as to display the distributions on similar scales.) This impression is reinforced in table 1(a), which compares the mean and standard deviation of these distributions.
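Validity can be computed directly with RDKit's sanitization; a minimal sketch, assuming the generated ligands are already available as RDKit molecules:

```python
from rdkit import Chem

def validity(mols) -> float:
    """Percentage of generated molecules that RDKit can sanitize."""
    ok = 0
    for mol in mols:
        try:
            Chem.SanitizeMol(Chem.Mol(mol))  # raises on chemically invalid input
            ok += 1
        except (Chem.rdchem.MolSanitizeException, ValueError):
            pass
    return 100.0 * ok / max(len(mols), 1)
```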
Binding affinity. A more interesting measure than validity is ∆Binding, which measures the percentage of generated molecules that have higher predicted binding affinity to the target binding site than the corresponding reference molecule. To compute binding affinities, we follow the procedure used by GraphBP. Briefly, we refine the generated 3D molecules by Universal Force Field (UFF) minimization (Rappé et al 1992); then, Vina minimization and CNN scoring are applied to both generated and reference molecules by using gnina, a molecular docking program (McNutt et al 2021). As can be seen in table 1(b), our result improves significantly on the baseline. Raw GraphBP attains ∆Binding = 13.45%. By adjusting the minimum and the maximum atom number of the baseline auto-regressive model, we were able to improve this to 22.76%; however, note that this results in a reduction in validity from 99.75% to 99.54%. Our method attains ∆Binding = 35.7%, which is a relative improvement of 56.81% over the better of the two GraphBP scores.

Qualitative results.
We show examples of generated ligands in figure 3, along with their chemical structures. Note that the structures of the generated molecules differ substantially from the reference molecules, indicating that the model has indeed learned to generalize to interesting novel structures.
Ablation study on the number distribution. We are interested in showing how sampling from the number distribution to determine the number of atoms, N ∼ p(N|Ĝ) (section 3.4), as part of the conditional generative process pipeline, affects the quality of the target-aware molecule generation task. In this study, which is elaborated in detail in appendix B, we investigate the impact of alternative sampling methods for determining the number of atoms in the generative molecular process. Four alternative sampling methods are compared: the number of atoms given by the known reference molecule, uniform sampling over the range of possible atom numbers, uniform sampling of large molecules, and uniform sampling of small molecules. The generated molecules are evaluated using various metrics, including drug-likeness (QED), synthetic accessibility (SA), Lipinski's Rule, the octanol-water partition coefficient (LogP), and Diversity. The results in table 3 indicate that the sampling method for the number of atoms significantly affects the quality of the generated molecules. While sampling large molecules yields molecules with higher predicted binding affinity, it lacks good performance in other measures. Conversely, sampling small molecules produces molecules with favorable molecular properties but lower binding affinity. Uniform distribution sampling produces molecules with high validity and good properties but relatively poor binding affinity. However, when sampling from the number distribution, the generative process shows the most promising results, producing molecules with relatively high predicted binding affinity and good molecular properties. These findings highlight the importance of sampling from the number distribution to improve the generation of molecules associated with a specific receptor target while maintaining quality based on molecular statistical measures.
Binding likelihood. The CrossDocked2020 dataset (Francoeur et al 2020) includes quantitative measures indicating the quality of the binding of each docked receptor-ligand structure: (i) the RMSD to the reference crystal pose; (ii) the Vina cross-docking score (Trott and Olson 2010) as implemented in Smina (Koes et al 2013). Scoring functions that represent and predict ligand-protein interactions are important for applications in structure-based drug discovery, e.g. energy minimization, molecular dynamics simulations, and hit identification/lead optimization (DeWitte and Shakhnovich 1996, Charifson et al 1999, McInnes 2007), and particularly important for molecular docking, where one seeks to predict the probability that binding occurs given a specified orientation and conformation (i.e. pose) of a ligand with respect to a target receptor (Wang et al 2003, Kitchen et al 2004, Warren et al 2006, Cheng et al 2009, Trott and Olson 2010, Huang and Zou 2011, Su et al 2018). We show how the conditional probability of our model may also be used as such an indicator. As we have seen in section 3, our method is not trained on either of the binding quality indicators (Vina / RMSD), but instead is the result of learning a conditional probability distribution from ligand-receptor pairs; therefore it may be viewed as a complementary scoring method. Specifically, given the conditional probability p(G|Ĝ) from the trained model, its NLL may be used as a scoring method. Such an approach can be considered a member of the family of knowledge-based methods (Gohlke et al 2000, Muegge 2000, Ballester and Mitchell 2010, Durrant and McCammon 2011, Huang and Zou 2011, Hassan et al 2018, Wójcikowski et al 2019), which are constructed from entirely non-physical statistical potentials derived from known receptor-ligand complexes.

Vina and RMSD.
We split the 63 929 validation data points into three different groups according to their Vina scores: (a) Vina ∈ [0, 5]; (b) Vina ∈ (5, 9]; (c) Vina ∈ (9, 15]. We randomly sample 10 000 receptor-ligand complexes from each group. Using the trained model, we calculate the average NLL for each group separately, where a lower NLL indicates a better conditional likelihood. The results in table 2(a) show good correspondence between the Vina scores (higher values describe better affinity) and the binding likelihood outcomes of our model (lower NLL is better). We then repeat this experiment, but using RMSD values instead of Vina scores. Again, we divide the validation set into three groups according to their RMSD values: (a) RMSD ∈ [0, 1]; (b) RMSD ∈ (1, 1.5]; (c) RMSD ∈ (1.5, 2] (all values in Angstroms). Again, we see a good correspondence between the RMSD values (lower values describe a better connected complex) and the binding likelihood outcomes of our model (lower NLL is better).
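The grouping experiment reduces to bucketing complexes by score and averaging the model's NLL per bucket; a sketch with hypothetical score/NLL arrays:

```python
import numpy as np

def mean_nll_by_group(scores: np.ndarray, nlls: np.ndarray, edges=(0, 5, 9, 15)):
    """Average NLL within score bins, e.g. Vina groups [0,5], (5,9], (9,15]."""
    out = {}
    for k, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        mask = (scores >= lo if k == 0 else scores > lo) & (scores <= hi)
        out[(lo, hi)] = float(nlls[mask].mean()) if mask.any() else float("nan")
    return out

# Hypothetical usage: lower mean NLL in the high-Vina group would mirror table 2(a).
# groups = mean_nll_by_group(vina_scores, per_complex_nll)
```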

Rigid transformations.
The nature of the binding between a receptor and the corresponding ligand is affected by the position of the ligand in the binding site. The ligand pose for each ligand-receptor pair in the validation data points was refined using the UFF force field and then optimized with respect to the receptor structure using the Vina scoring function. We randomly select 10 000 validation data points and apply a rigid body transformation solely to the minimized ligand-graph pose. Any such changes to the ligand pose with respect to the binding site potentially affect and reduce the binding affinity. The results in table 2(b) show the NLL under varying degrees of rigid transformations. The more significant the transformation, the lower the binding likelihood (i.e. the higher the NLL).

Conclusions and directions for future work
We have presented a method for learning a conditional distribution of a complement graph given a base graph. The method, which is based on a continuous normalizing flow, has provable invariance properties based on semi-equivariance conditions on the flow. We have demonstrated the usefulness of the method on the application of generating ligands given a receptor; the method, which improves upon competing methods in the ∆Binding measure, offers the potential to generate previously undiscovered molecules with high binding affinity. We further show that the conditional probability model can be used as a kind of scoring function for ligand-receptor complexes. Such an approach can be considered a member of the family of knowledge-based methods which are constructed from entirely non-physical statistical potentials derived from known receptor-ligand complexes. This method can readily be extended to other scientific and engineering settings described by a data structure consisting of sets of three-dimensional points. For example, in high energy physics, the method can be adopted as a particle flow reconstruction algorithm that allows event-level quantities to be taken into account (Pata et al 2021), by learning a conditional distribution of the form p(G|Ĝ) which indicates which generator-level particle events match given detector inputs. In computer vision, the method may be adopted as a completion approach for 3D point clouds (Liu et al 2019b, Bello et al 2020, Guo et al 2020) generated from a wide variety of sensors (stereo, Lidar, Time-of-Flight, etc); the base graph represents the part of the point cloud that has been given, and the complement graph represents the completion of this point cloud.

Data availability statement
The data that support the findings of this study are openly available at the following URL/DOI: https:// github.com/EyalRozenberg1/se_conditional_flows.
Appendix A. The edge and property distributions

where the matrices $W_1, \dots, W_K$ all have the same number of rows. Then we set

$p(a_k \mid a_{1:k-1}, E, V, N, \hat G) = \zeta_{a_k}^\top \cdot \mathrm{softmax}(W_k\, \xi_{a,k})$

in the case of categorical properties; analogous expressions exist for ordinal or continuous properties. Note that the only item which changes for the different properties k is the vector $\xi_{a,k}$. It is easy to see that this distribution satisfies the invariance properties in equation (14). The corresponding loss function is a simple cross-entropy loss (or regression loss for non-categorical properties).

Appendix B. Ablation study on the number distribution for target-aware molecule generation
Given a base graph, denoted as Ĝ, we wish to conditionally sample a second 3D graph, which we refer to as the complement graph, denoted as G. The Markov decomposition of the conditional distribution ultimately allows us to obtain the conditional probability p(G|Ĝ), as described in equation (8) and illustrated by successive stages in figure 1. Specifically, as we sample complement graphs from the learned conditional distribution, we first sample from the number distribution to determine the number of vertices, N ∼ p(N|Ĝ) (section 3.4). In this section, we are interested in showing how sampling from the number distribution to determine the number of atoms, as part of the conditional generative process pipeline, affects the quality of the target-aware molecule generation task. We perform an ablation study, in which we substitute the number-distribution sampling method from the generative process, and use alternative sampling methods to determine the number of atoms in the generated molecule. Four sampling methods are compared: (1) using the number of atoms given by the known reference molecule, N ≡ N_ref; (2) uniform sampling over the range of possible numbers of atoms; (3) uniform sampling of large molecules; and (4) uniform sampling of small molecules. In each case, all other aspects of the molecule generation process are treated identically. The resultant molecular generative methods are compared under a set of common metrics to evaluate the quality and diversity of generated molecular structures (Polykovskiy et al 2020, Peng et al 2022, Schneuing et al 2022). In addition to the validity and ∆Binding measures (section 5.3), we use five further metrics to obtain the molecular properties and quantitatively compare the generated molecules: (i) QED: a quantitative estimation of drug-likeness, ∈ [0,1], which combines several desirable molecular properties (Bickerton et al 2012). Higher values indicate a better likelihood for molecules to be viable candidates for drugs. (ii) SA: a synthetic accessibility score which indicates how difficult it is to synthesize a given molecule, ∈ [0,1]. Higher values indicate easier synthesis (Ertl and Schuffenhauer 2009). (iii) Lipinski's Rule: an empirical law of drug-likeness, following Lipinski's Rule of Five (Lipinski et al 1997, Veber et al 2002), to assess the drug-likeness of molecules. (iv) LogP: the octanol-water partition coefficient. It is a measure of lipophilicity and is commonly reported for potential drug candidates. Molecules with LogP values ∈ [−0.4, 5.6] have better potential to be good drug candidates (Ghose et al 1999, Wildman and Crippen 1999). (v) Diversity: computed as the average pairwise dissimilarity between all generated molecules for each individual pocket, using the Tanimoto similarity for comparing chemical structures. Table 3 shows the results of a study comparing the properties of molecules generated by different sampling methods for the number of atoms, along with the evaluation set properties. For completeness, we included the baseline technique, GraphBP (Liu et al 2022). For the baseline, we use the two training procedures discussed in the main paper: (1) setting the atom number range of the autoregressive generative process according to the atom distribution of the dataset, N ∈ [N_min, N_max]; (2) fixing the minimum and the maximum atom number of the autoregressive generative process to improve the ∆Binding measure, N ∈ [N_low, N_high].
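Most of these metrics can be computed with RDKit; a sketch is given below (the SA score additionally requires the sascorer module from RDKit's Contrib directory, which we omit, and the Lipinski check shown is the common four-rule variant):

```python
from itertools import combinations
from rdkit import Chem, DataStructs
from rdkit.Chem import QED, Crippen, Lipinski, Descriptors, AllChem

def qed(mol: Chem.Mol) -> float:
    return QED.qed(mol)             # drug-likeness estimate in [0, 1]

def logp(mol: Chem.Mol) -> float:
    return Crippen.MolLogP(mol)     # octanol-water partition coefficient

def lipinski(mol: Chem.Mol) -> bool:
    """Rule of Five: MW <= 500, LogP <= 5, H-bond donors <= 5, acceptors <= 10."""
    return (Descriptors.MolWt(mol) <= 500 and Crippen.MolLogP(mol) <= 5
            and Lipinski.NumHDonors(mol) <= 5
            and Lipinski.NumHAcceptors(mol) <= 10)

def diversity(mols) -> float:
    """Average pairwise (1 - Tanimoto) over Morgan fingerprints, per pocket."""
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, 2048) for m in mols]
    dists = [1.0 - DataStructs.TanimotoSimilarity(a, b)
             for a, b in combinations(fps, 2)]
    return sum(dists) / max(len(dists), 1)
```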
Results suggest that the sampling method for the number of atoms has a significant impact on the generation of high-quality molecules. We see that the method that samples large molecules produces molecules with the highest predicted binding affinity to the target binding site, although it lacks good performance in all other measures (in particular, having the lowest QED, SA, Lipinski, and Diversity values) with a large deviation in LogP with respect to the evaluation set; the method that samples small molecules produces molecules with good molecular properties (in particular, having the highest SA, Lipinski, and Diversity values), although it has a large deviation in LogP with respect to the evaluation set and lacks performance in the predicted binding affinity to the target binding site, with the lowest ∆Binding. The method that samples the number of atoms from a uniform distribution produces molecules with the highest validity and with good molecular properties, although it obtains a relatively poor predicted binding affinity. The method that sets the number of atoms as in the reference molecule produces molecules with relatively poor values in all measures. In the case where the number-distribution sampling method is used, we see that the generative process produces the most promising results, with relatively high predicted binding affinity to the target binding site and with good molecular properties, under all measures. Overall, the results of this study suggest that by sampling from the number distribution to determine the number of atoms, as part of the conditional generative process pipeline, we improve the chance of getting molecules that are better associated with a specific protein while maintaining good quality under the statistical measures of the molecular structures.