Transfer operators on graphs: Spectral clustering and beyond

Graphs and networks play an important role in modeling and analyzing complex interconnected systems such as transportation networks, integrated circuits, power grids, citation graphs, and biological and artificial neural networks. Graph clustering algorithms can be used to detect groups of strongly connected vertices and to derive coarse-grained models. We define transfer operators such as the Koopman operator and the Perron-Frobenius operator on graphs, study their spectral properties, introduce Galerkin projections of these operators, and illustrate how reduced representations can be estimated from data. In particular, we show that spectral clustering of undirected graphs can be interpreted in terms of eigenfunctions of the Koopman operator and propose novel clustering algorithms for directed graphs based on generalized transfer operators. We demonstrate the efficacy of the resulting algorithms on several benchmark problems and provide different interpretations of clusters.


Introduction
Transfer operators are frequently used for the analysis of complex dynamical systems since their eigenvalues and eigenfunctions contain important information about global properties of the underlying processes [1,2,3,4]. While the eigenvalues are related to inherent relaxation timescales, the corresponding eigenfunctions can be used to identify metastable sets representing, for example, stable conformations of proteins or coherent sets describing, for instance, slowly mixing eddies in the ocean [5,6,7,8,9,10,11]. Other applications include model reduction, system identification, change-point detection, forecasting, control, and stability analysis [12,13,14,15,16]. Although transfer operators have been known for a long time, they recently gained renewed attention due to the availability of large data sets and powerful machine learning methods to estimate these operators and their spectral properties from simulation or measurement data.
Spectral clustering for graphs is a well-known and powerful technique for partitioning networks into groups of nodes that are well-connected internally and poorly connected to other groups of nodes [17]. It was first shown in [18] that graphs can be partitioned based on eigenvectors of the adjacency matrix, but more recently spectral clustering has become increasingly popular in the machine learning community. In particular, the spectral clustering algorithms proposed in [19,20] are commonly deployed and are capable of identifying near-optimal clusters in undirected static graphs [21]. Many extensions of these standard algorithms exist, including extensions to directed graphs [22] and time-evolving graphs [23,24]. Recent work includes scalable algorithms that can efficiently cluster large graphs [25], and a regularized method that is able to interpret heterogeneous structures in graphs [26,27].
The underlying technique can be derived in a number of ways, for example by viewing the problem as a graph partitioning or from a random walk perspective. In [17], it is shown that using a graph partition viewpoint allows the clustering problem to be reformulated as a relaxed mincut problem with constraints to balance the clusters. In particular, balancing the clusters by the number of vertices per cluster gives an objective function known as the RatioCut, and relaxing this problem is equivalent to standard spectral clustering using the unnormalized graph Laplacian. Relaxing the normalized cut objective function, which balances clusters by weight, leads instead to the spectral clustering formulation using the normalized graph Laplacian.
As mentioned above, random walks can also be used to reformulate the graph clustering problem. Intuitively, this corresponds to finding a graph partition such that a random walker will remain within a cluster for long periods of time and will rarely jump between clusters. This random-walk interpretation of spectral clustering methods for undirected graphs is well-known, see, e.g., [28,17]. Nevertheless, the dynamical systems perspective and relationships with transfer operators and their eigenvalues and eigenfunctions have to our knowledge neither been used to analyze existing methods in detail, nor to systematically generalize spectral clustering to directed or dynamic graphs. In [23], we defined transfer operators on graphs and showed that spectral clustering for undirected graphs is equivalent to detecting metastable sets in reversible stochastic dynamical systems. The eigenvalues of transfer operators associated with non-reversible systems, however, are in general complex-valued and the notion of metastability is not suitable for such problems anymore. This motivates the definition of so-called coherent sets, which can be regarded as time-dependent metastable sets [5,8,29]. Coherent sets can be detected by computing eigenvalues and eigenfunctions of generalized transfer operators based on the forward-backward dynamics of the system. By defining these operators on graphs, it is possible to develop spectral clustering techniques for directed and time-evolving graphs, thus establishing and exploiting links between dynamical systems theory and graph theory. Applying methods originally developed for the analysis of complex dynamical systems to graphs allows for a rigorous derivation and, at the same time, also an intuitive interpretation of spectral clustering techniques.
In terms of existing methods for clustering directed graphs, several proposed approaches are based on symmetrization techniques [22]. A simple example is to define a new graph with adjacency matrix A + A^⊤, where A is the adjacency matrix of the original directed graph. This is essentially equivalent to ignoring the directions of the edges. It is also possible to symmetrize the graph by defining an adjacency matrix in terms of the stationary distribution of a random walk [30]. Both of these proposals have a notable drawback: they fail to recognize similar nodes with common in-links or out-links, and can only cluster based on the existing edges in the graph with directional information removed. Our clustering approach is loosely related to these symmetrization techniques, in that it also leads to real-valued spectra since the resulting operators are self-adjoint (with respect to potentially weighted inner products). However, we do not directly symmetrize the adjacency matrix or other graph-related matrices, but exploit properties of random walk processes and associated transfer operators. A more detailed comparison and numerical results will be presented below.
In this work, we will generalize the definition of transfer operators on graphs, first introduced in [23], and show how eigenfunctions of these operators are related to classical spectral clustering approaches. The main contributions are:
• We study the spectral properties of graph transfer operators and their matrix representations and show that the eigenfunctions of these operators can also be obtained by solving associated optimization problems.
• We illustrate how transfer operators can be used to detect clusters or community structures in directed graphs and provide illustrative interpretations of the identified clusters in terms of coherent sets and block matrices.
• We define Galerkin projections of transfer operators, show how these coarse-grained operators can be estimated from random-walk data, and analyze the convergence of data-driven approximations.
• Furthermore, we highlight additional applications of the proposed methods such as graph drawing or network inference, apply our clustering approach to different types of benchmark problems, and compare it with other spectral clustering algorithms.

In Section 2, we will define transfer operators, analyze their properties, and highlight relationships with spectral clustering algorithms. Section 3 illustrates how transfer operators can be approximated and estimated from data. Additionally, we will point out different interpretations of clustering methods. Numerical results will be presented in Section 4, and concluding remarks and open questions in Section 5.

Graphs, transfer operators, and spectral clustering
In this section, we will introduce the mathematical tools required for the transfer operator-based identification of clusters in graphs.

Directed and undirected graphs
In what follows, we will mainly consider directed graphs, but the derived methods can, of course, also be applied to undirected graphs. Relationships with well-known clustering techniques for undirected graphs will be discussed below.
The weighted adjacency matrix A ∈ R^{n×n} associated with a graph G is defined by

    a_ij = w(v_i, v_j) if (v_i, v_j) ∈ E, and a_ij = 0 otherwise,

where w(v_i, v_j) > 0 is the weight of the corresponding edge. If the graph is undirected, then the adjacency matrix is symmetric. Additionally, we introduce the row-stochastic transition probability matrix S = D_O^{-1} A ∈ R^{n×n}, with D_O = diag(d_O(v_1), ..., d_O(v_n)) and out-degrees d_O(v_i) = Σ_{j=1}^n a_ij. That is, s_ij = a_ij / d_O(v_i) is the probability that a random walker starting in v_i will go to v_j in one step.
We refer to D_O as the matrix of out-degrees, and we will later also make use of the in-degree matrix D_I, which is defined in a similar way, with in-degrees d_I(v_j) = Σ_{i=1}^n a_ij.
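To make this concrete, here is a minimal numerical sketch (illustrative code, not from the paper; the toy graph is made up) of building the row-stochastic transition matrix from a weighted adjacency matrix:

```python
import numpy as np

# Weighted adjacency matrix of a made-up directed graph on 4 vertices:
# a_ij > 0 iff there is an edge from v_i to v_j.
A = np.array([
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [3.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])

d_out = A.sum(axis=1)       # out-degrees: row sums of A
S = A / d_out[:, None]      # transition matrix: divide each row by its out-degree

# s_ij is the probability of stepping from v_i to v_j; rows sum to 1.
```

Note that vertices without outgoing edges would lead to a division by zero here; adding self-loops, as discussed later in the text, avoids this.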

Benchmark problems
In order to generate benchmark problems with clearly defined clusters, we will use the so-called directed stochastic block model.
A graph G ∈ G(r_b, n_b, E) sampled from the DSBM is a random graph whose expected adjacency matrix is a block matrix whose blocks are determined by the probability matrix E ∈ [0, 1]^{r_b×r_b}: an edge from a vertex in block k to a vertex in block l exists with probability ε_kl. Here, r_b is the number and n_b the size of the blocks.
Our definition differs slightly from the DSBM described in [31], in that we allow different probabilities for all the blocks. Undirected graphs can be constructed in a similar way. While the DSBM is a frequently used model to generate benchmark problems, real-world graphs typically exhibit a more complicated structure. An algorithm to generate weighted directed graphs with heterogeneous node-degree distributions and cluster sizes for testing community detection methods is described in [32]. Briefly, the graphs in [32] are constructed by randomly assigning in-degrees to each node from a power law distribution and assigning out-degrees from a delta distribution. A mixing parameter is introduced for each of these degrees that is related to the quality of the graph partition: the lower the mixing parameter, the better the partitioning of the graph. Cluster sizes are also drawn from a power law distribution and a subgraph is constructed from each of these clusters. The subgraphs are finally joined randomly, with a rewiring process to ensure that the in-degree and out-degree distributions are preserved. We will use this algorithm to generate a set of graphs with different characteristics.
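The DSBM convention described above can be sketched in a few lines; this is an illustrative sampler (equal-sized blocks, edge probabilities given by E), not the exact generator used for the experiments:

```python
import numpy as np

def sample_dsbm(r_b, n_b, E, rng):
    """Sample a directed graph with r_b blocks of size n_b; an edge from a
    vertex in block k to a vertex in block l exists with probability E[k, l]."""
    n = r_b * n_b
    block = np.repeat(np.arange(r_b), n_b)        # block label of each vertex
    P = E[np.ix_(block, block)]                   # n x n matrix of edge probabilities
    A = (rng.random((n, n)) < P).astype(float)    # independent Bernoulli edges
    np.fill_diagonal(A, 0.0)                      # no self-loops
    return A

rng = np.random.default_rng(0)
E = np.array([[0.8, 0.1],
              [0.1, 0.8]])                        # two dense diagonal blocks
A = sample_dsbm(2, 50, E, rng)
```

With these parameters, the sampled adjacency matrix is much denser inside the diagonal blocks than between them.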

Coherent set illustration
The spectral clustering approach for directed and time-evolving graphs proposed in [23] is based on a generalized Laplacian that contains information about coherent sets. The detection of such coherent sets requires the notion of transfer operators, which describe the evolution of probability densities or observables. Before we introduce these operators, let us first illustrate the definition of coherence (introduced in [33,8] for continuous dynamical systems) for directed graphs.

Definition 2.3 (Coherent pair). Let S^τ be the flow associated with a dynamical system and τ a fixed lag time. Two sets A and B form a so-called coherent pair if S^τ(A) ≈ B and S^{-τ}(B) ≈ A.

That is, the set A is almost invariant under the forward-backward dynamics and we call it a finite-time coherent set. To avoid that (S^{-τ} ∘ S^τ)(A) = A for all sets A, a small random perturbation of the dynamics is required for deterministic systems, see [8]. In our setting, the dynamics are given by a random walk process on the graph and thus already non-deterministic.
Example 2.4. Figure 1 illustrates the difference between a coherent set (green) and a set that is dispersed by the random-walk dynamics (red). The example shows that coherent sets form weakly coupled clusters and can be regarded as a natural generalization of clusters in undirected graphs. △

This notion of coherence can be easily extended to time-evolving networks as shown in [23]. The structure of the clusters may then also change in time.

Transfer operators
Transfer operators on graphs can be either directly expressed in terms of graph-related matrices (such as the transition probability matrix) or estimated from random walk data. We will start with a formal definition of transfer operators. In what follows, U = {f : V → R} denotes the set of all real-valued functions defined on the set of vertices V of a graph G.

Definition 2.5 (Perron-Frobenius & Koopman operators).
Let S be the transition probability matrix associated with the graph G.
i) For a function ρ ∈ U, the Perron-Frobenius operator P : U → U is defined by

    (Pρ)(v_j) = Σ_{i=1}^n s_ij ρ(v_i).

ii) Similarly, for a function f ∈ U, the Koopman operator K : U → U is given by

    (Kf)(v_i) = Σ_{j=1}^n s_ij f(v_j).

The Koopman operator is the adjoint of the Perron-Frobenius operator with respect to the standard inner product, i.e., ⟨Pρ, f⟩ = ⟨ρ, Kf⟩.

Definition 2.6 (Probability density). A probability density is a function µ ∈ U with µ(v_i) ≥ 0 for all i and Σ_{i=1}^n µ(v_i) = 1.
This allows us to define the Perron-Frobenius operator with respect to different probability densities. We define the µ-weighted inner product by

    ⟨f, g⟩_µ = Σ_{i=1}^n µ(v_i) f(v_i) g(v_i).

Definition 2.7 (Reversibility). The system is called reversible if there exists a probability density π that satisfies the detailed balance condition π(v_i) s_ij = π(v_j) s_ji for all i, j.

Random walks on undirected graphs, for instance, are reversible [34,35]. We will use this property to analyze the relationships between conventional spectral clustering and our approach.
Definition 2.8 (Reweighted Perron-Frobenius operator). Let µ be the initial density and ν = Pµ the resulting image density, i.e., ν(v_j) = Σ_{i=1}^n s_ij µ(v_i). Furthermore, assume that µ and ν are strictly positive. Given a function u ∈ U, we define the reweighted Perron-Frobenius operator T : U → U by

    (T u)(v_j) = (1 / ν(v_j)) Σ_{i=1}^n s_ij µ(v_i) u(v_i).

The reweighted Perron-Frobenius operator describes the evolution of functions with respect to the initial and final densities µ and ν. The assumption that µ(v_i) > 0 implies that the probability that a random walker starts in v_i is non-zero. Correspondingly, ν(v_i) > 0 means that v_i is reachable from at least one vertex, which could be v_i itself. Adding self-loops thus regularizes the problem.

Definition 2.9 (Forward-backward & backward-forward operators). Let f, u ∈ U as before.
i) We define the forward-backward operator F : U → U by F f = (K ∘ T) f.
ii) Analogously, the backward-forward operator B : U → U is defined by B u = (T ∘ K) u.

In order to detect coherent sets, we want to find eigenfunctions φ_ℓ of the forward-backward operator F corresponding to eigenvalues λ_ℓ ≈ 1 such that Fφ_ℓ = λ_ℓ φ_ℓ ≈ φ_ℓ. These functions are almost invariant under the forward-backward dynamics. Defining ψ_ℓ = T φ_ℓ, we obtain Bψ_ℓ = T K T φ_ℓ = T F φ_ℓ = λ_ℓ ψ_ℓ. That is, the functions φ_ℓ and ψ_ℓ come in pairs.

Covariance and cross-covariance operators
In addition to these different types of transfer operators, we define (uncentered) covariance and cross-covariance operators.
Definition 2.10 (Covariance operators). Let S be the transition probability matrix and µ and ν the densities defined above. Given functions f, g ∈ U, we call C_xx, C_yy : U → U, with

    (C_xx f)(v_i) = µ(v_i) f(v_i)  and  (C_yy g)(v_j) = ν(v_j) g(v_j),

covariance operators and C_xy : U → U, with

    (C_xy g)(v_i) = µ(v_i) Σ_{j=1}^n s_ij g(v_j),

cross-covariance operator.
Let X ∼ µ and Y ∼ ν be random variables representing the initial position of the random walker and the position after one step, respectively. It follows that

    E[f(X) g(X)] = ⟨f, C_xx g⟩  and  E[f(Y) g(Y)] = ⟨f, C_yy g⟩

and, since the joint probability distribution is given by P[X = v_i, Y = v_j] = µ(v_i) s_ij, also

    E[f(X) g(Y)] = ⟨f, C_xy g⟩.

Matrix representations of operators
In order to simplify the notation, given any function g ∈ U, let g ∈ R^n be defined by (g)_i = g(v_i), i.e., ρ, f, u, µ, ν ∈ R^n are the vector representations of the functions ρ, f, u, µ, ν ∈ U, respectively. Additionally, we define the diagonal matrices D_µ = diag(µ(v_1), ..., µ(v_n)) and D_ν = diag(ν(v_1), ..., ν(v_n)) so that we can write

    P = S^⊤,  K = S,  T = D_ν^{-1} S^⊤ D_µ,  F = S D_ν^{-1} S^⊤ D_µ,  B = D_ν^{-1} S^⊤ D_µ S.

In what follows, we will use functions and their vector representations as well as operators and their matrix representations interchangeably.
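These matrix representations are easy to verify numerically. The sketch below (illustrative code with a randomly generated graph; the uniform initial density is an assumption made for the example) builds T = D_ν^{-1} S^⊤ D_µ, F = KT, and B = TK, and checks the self-adjointness properties that guarantee real spectra:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.random((n, n)) + np.eye(n)        # random positive weights, self-loops added
S = A / A.sum(axis=1, keepdims=True)      # row-stochastic transition matrix (K = S)

mu = np.full(n, 1.0 / n)                  # initial density (uniform, by assumption)
nu = S.T @ mu                             # image density nu = P mu = S^T mu
D_mu, D_nu = np.diag(mu), np.diag(nu)

T = np.linalg.inv(D_nu) @ S.T @ D_mu      # reweighted Perron-Frobenius operator
F = S @ T                                 # forward-backward operator (K T)
B = T @ S                                 # backward-forward operator (T K)

# F is self-adjoint w.r.t. <.,.>_mu and B w.r.t. <.,.>_nu, hence real spectra.
assert np.allclose(D_mu @ F, (D_mu @ F).T)
assert np.allclose(D_nu @ B, (D_nu @ B).T)
# The constant function is an eigenfunction of F with eigenvalue 1.
assert np.allclose(F @ np.ones(n), np.ones(n))
```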
Remark 2.11. Let 1 ∈ R^n be the vector of all ones. In [23], we assumed that the random walkers are initially uniformly distributed and defined the operators with respect to µ = (1/n) 1. For this special case, however, the general definitions above reduce to the ones used there.

Associated graph structures
How are the graph structures associated with the matrix representations F and B of the forward-backward and backward-forward operators F and B related to the original directed graph G ?
Lemma 2.12. It holds that the entry f_ij of the matrix F is nonzero if and only if there exists an index k such that s_ik > 0 and s_jk > 0. That is, a random walker can go forward from v_i to v_k and then backward to v_j. Similarly, the entry b_ij of the matrix B is nonzero if and only if there exists an index k such that s_ki > 0 and s_kj > 0.

The difference between the two matrix representations of the operators is illustrated in Figure 2. Note that the probability of each forward-backward random walk via v_k is divided by ν(v_k), which represents the sum of the probabilities to go from any vertex to v_k. This is intuitively clear since the more edges enter v_k, the more options the random walker has to go backward. On the other hand, the more forward-backward random walks from v_i to v_j exist, the larger the weight f_ij.

Remark 2.13. This is strongly related to the bibliometric graph clustering approach described in [22]. Here, the bibliographic coupling matrix AA^⊤ and the co-citation matrix A^⊤A represent the common out-links and in-links between a pair of nodes, respectively. Using the sum of these two matrices gives a symmetrization of the adjacency matrix that accounts for both of these types of common links. As in the forward-backward approach, the weights can be divided by the degrees of the nodes in order to give an effective measure of similarity for nodes that accounts for the number of possible routes or links into and out of the node. Alternatively, in [31] a complex-valued Hermitian matrix is constructed, whose square involves the bibliographic coupling matrix AA^⊤ and the co-citation matrix A^⊤A. This can then also be viewed as a symmetrization scheme that utilizes these two matrices.
The graphs associated with the matrices F and B for an example graph with self-loops added to every vertex are shown in Figures 3(b) and 3(c). The graph corresponding to the matrix F, for instance, now contains an edge connecting v_4 and v_8 since a random walker could move forward from v_4 to v_5 and backward to v_8 or the other way around. The probability of this forward-backward random walk, however, is low due to the small edge weight of (v_4, v_5). Note that without the self-loops all vertices except v_4, v_8, and v_12 would become disconnected. Every forward-backward random walk starting in one of these vertices would then end up in the initial position (since these vertices have only one outgoing edge) and thus form clusters of size one. This illustrates that adding self-loops can be regarded as a form of regularization. △

Spectral clustering of directed graphs
Spectral clustering methods for undirected graphs are based on the eigenvectors corresponding to the smallest eigenvalues of an associated graph Laplacian. The random-walk normalized Laplacian is defined by

    L_rw = I - D_O^{-1} A = I - K,

where K is the matrix representation of the Koopman operator. Equivalently, we can also compute the eigenvectors corresponding to the largest eigenvalues of K. This shows that conventional spectral clustering computes the dominant eigenfunctions of the Koopman operator, which contain information about the metastable sets of the system. The eigenvalues of the Koopman operator associated with directed graphs, however, are in general complex-valued and standard spectral clustering techniques typically fail, see [23] for a detailed analysis. For directed (and also time-evolving) graphs we thus propose to compute the eigenvectors associated with the largest eigenvalues of the forward-backward or backward-forward operators in order to detect coherent sets. The eigenvalues of these operators are real-valued even if the graph is directed, see Appendix A.
Algorithm 2.15 (Spectral clustering of directed graphs).
1. Compute the k largest eigenvalues λ_ℓ and associated eigenvectors φ_ℓ of F.
2. Define the matrix Φ = [φ_1, ..., φ_k] ∈ R^{n×k} and let r_i denote the i-th row of Φ.
3. Apply, e.g., k-means to the points {r_i}_{i=1}^n and assign vertex v_i to the resulting cluster of r_i.
The number of eigenvalues close to 1 indicates the number of coherent sets. That is, k should be chosen in such a way that there is a spectral gap between λ_k and λ_{k+1}. In order to illustrate the spectral clustering algorithm, we first consider a simple graph that has a clearly defined cluster structure.
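A compact sketch of this procedure (illustrative code; the forward-backward matrix is formed as S D_ν^{-1} S^⊤ D_µ following the operator definitions, scipy's kmeans2 stands in for a generic k-means routine, and the benchmark graph is a two-block DSBM sample):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def cluster_directed(S, mu, k, seed=0):
    """Cluster a directed graph via the forward-backward operator."""
    nu = S.T @ mu                                   # image density nu = P mu
    F = S @ np.diag(1.0 / nu) @ S.T @ np.diag(mu)   # forward-backward matrix
    w, V = np.linalg.eig(F)
    order = np.argsort(-w.real)                     # eigenvalues are real in theory
    Phi = V[:, order[:k]].real                      # k dominant eigenvectors as columns
    _, labels = kmeans2(Phi, k, minit="++", seed=seed)
    return labels

# Benchmark: two blocks that are only weakly coupled.
rng = np.random.default_rng(2)
p, q, n_b = 0.9, 0.05, 30
E = np.kron(np.array([[p, q], [q, p]]), np.ones((n_b, n_b)))
A = (rng.random(E.shape) < E).astype(float) + np.eye(2 * n_b)  # self-loops regularize
S = A / A.sum(axis=1, keepdims=True)
mu = np.full(2 * n_b, 1.0 / (2 * n_b))
labels = cluster_directed(S, mu, k=2)
```

For this strongly clustered example, the labels of the first 30 and the last 30 vertices each form (essentially) one cluster.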
Example 2.16. We sample a benchmark problem G ∈ G(4, 100, E) from the DSBM, with

    E = [ q p q q
          q q q p
          q q p q
          p q q q ],

where p = 0.8 and q = 0.1. The resulting adjacency matrix is shown in Figure 4(a). Note that there are two different types of clusters, namely dense blocks on the diagonal, representing groups of vertices that are tightly coupled to other vertices contained in the same cluster, and dense off-diagonal blocks, which have the property that the contained vertices are all tightly coupled (unidirectionally) to vertices in another cluster. For both types of clusters, however, the probability that a random walker going forward and then backward ends up in the same cluster is large compared to the escape probability. Applying Algorithm 2.15 yields the results shown in Figure 4(c) and (d). The dominant eigenvalues are λ_1 = 1, λ_2 = 0.72, λ_3 = 0.70, and λ_4 = 0.69, followed by a spectral gap. The eigenfunctions of the forward-backward operator can also be used for visualizing directed graphs as shown in Figure 4(b). By defining the position of the vertex v_i to be (φ_2(v_i), φ_3(v_i)), we obtain a two-dimensional embedding of the graph. The edges are colored according to the source nodes. The blue vertices form a cluster in the conventional sense since they are strongly connected but only weakly coupled to the other clusters. The remaining sets of vertices have the property that many directed edges are pointing to one of the other clusters but not vice versa. These clusters correspond to non-sparse off-diagonal blocks. △

Spectral clustering of undirected graphs as a special case
A natural question now is whether conventional spectral clustering methods for undirected graphs can be regarded as a special case of the algorithm derived above.
Lemma 2.17. Assume that the system is reversible with respect to π. Then, setting µ = π, it holds that T = K and F = B = K².

Proof. First, we rewrite the detailed balance condition as s_ji = (π(v_i)/π(v_j)) s_ij. Since ν = Pπ = π, we obtain

    (T u)(v_j) = (1/π(v_j)) Σ_{i=1}^n s_ij π(v_i) u(v_i) = Σ_{i=1}^n s_ji u(v_i) = (K u)(v_j).

The second part follows immediately from the definitions of the operators F and B.

Each function pair (φ_ℓ, ψ_ℓ) is associated with the block of the adjacency matrix marked in the corresponding color. These blocks all have in common that they contain many nonzero entries (i.e., incoming or outgoing edges). Note that since we compute two functions, i.e., φ_ℓ and ψ_ℓ, it is now possible to detect dense off-diagonal blocks.
If π satisfies the detailed balance condition, then it is also automatically an invariant density. This can be seen as follows:

    (Pπ)(v_j) = Σ_{i=1}^n s_ij π(v_i) = Σ_{i=1}^n s_ji π(v_j) = π(v_j) Σ_{i=1}^n s_ji = π(v_j).

Lemma 2.18. Let G now be an undirected graph with unique invariant density π. Assume that K has only distinct positive eigenvalues. Then conventional spectral clustering and Algorithm 2.15 with µ = π compute the same clusters.
Proof. Let κ_ℓ be the eigenvalues of K and λ_ℓ the eigenvalues of F. Using Lemma 2.17, we know that λ_ℓ = κ_ℓ² and the eigenvectors are identical. Since all eigenvalues of K are by assumption positive and distinct, the ordering of the eigenvalues and eigenvectors remains unchanged.
Negative eigenvalues of K might affect the ordering of the eigenvectors and thus the spectral clustering results. However, positive eigenvalues can be enforced by turning the transition probability matrix into a lazy Markov chain [37].
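Both statements can be checked numerically; the sketch below (random undirected example graph, illustrative only) verifies that with µ = π the eigenvalues of the forward-backward matrix are the squared eigenvalues of K, and that the lazy chain (I + S)/2 has only positive eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
W = rng.random((n, n))
A = (W + W.T) / 2                         # symmetric weights -> undirected graph
d = A.sum(axis=1)
S = A / d[:, None]                        # random walk on the graph, K = S
pi = d / d.sum()                          # invariant density; detailed balance holds

nu = S.T @ pi                             # equals pi by invariance
F = S @ np.diag(1.0 / nu) @ S.T @ np.diag(pi)   # forward-backward with mu = pi

kappa = np.linalg.eigvals(S).real         # spectrum of K (real by reversibility)
lam = np.linalg.eigvals(F).real           # spectrum of F
assert np.allclose(np.sort(kappa**2), np.sort(lam))   # lambda_l = kappa_l^2

S_lazy = (np.eye(n) + S) / 2              # lazy chain: eigenvalues (1 + kappa)/2
assert np.linalg.eigvals(S_lazy).real.min() > 0
```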

Analysis and approximation of transfer operators
In this section, we will derive reduced representations of transfer operators, show how they can be estimated from random walk data, and present alternative interpretations of the proposed spectral clustering approach.

Galerkin approximation
Given an operator L : U → U with matrix representation L, our goal now is to compute its Galerkin projection L_r onto the r-dimensional subspace U_r ⊂ U with basis {ϕ_i}_{i=1}^r. Here, r could potentially be much smaller than n. The matrix representation L_r ∈ R^{r×r} of L_r is given by L_r = (G^(0))^{-1} G^(1), with entries

    g^(0)_ij = ⟨ϕ_i, ϕ_j⟩_µ  and  g^(1)_ij = ⟨ϕ_i, L ϕ_j⟩_µ.

By defining the vector-valued function ϕ(v) = [ϕ_1(v), ..., ϕ_r(v)]^⊤ and the matrix Φ_V = [ϕ(v_1), ..., ϕ(v_n)] ∈ R^{r×n}, the matrices G^(0) and G^(1) can be compactly written as

    G^(0) = Φ_V D_µ Φ_V^⊤  and  G^(1) = Φ_V D_µ L Φ_V^⊤.

Any function f ∈ U_r is given by f = Σ_{i=1}^r c_i ϕ_i = c^⊤ ϕ for a coefficient vector c ∈ R^r. Suppose now ξ_ℓ is an eigenvector of L_r with associated eigenvalue λ_ℓ, i.e., L_r ξ_ℓ = λ_ℓ ξ_ℓ; then the function ξ_ℓ^⊤ ϕ, with vector representation Φ_V^⊤ ξ_ℓ, is an approximation of the corresponding eigenfunction. Alternatively, assuming the matrix G^(0) is invertible, we can also solve the generalized eigenvalue problem G^(1) ξ_ℓ = λ_ℓ G^(0) ξ_ℓ. This shows that eigenfunctions of L can be approximated by eigenfunctions of L_r.
Example 3.1. Choosing r = n and indicator functions for each vertex, i.e., ϕ_i = 1_{{v_i}}, we obtain Φ_V = I and hence G^(0) = D_µ and G^(1) = D_µ L so that L_r = L. That is, in this case we obtain the full matrix representation of the operator L. △

Theoretically, we could choose any set of basis functions. Our goal, however, is to define the basis functions in such a way that the reduced operator L_r has essentially the same dominant eigenvalues as the full operator L and thus retains the metastability (and cluster structure) of the system.

Remark 3.2. The optimal basis is given by the eigenvectors φ_ℓ corresponding to the largest eigenvalues λ_ℓ of L since choosing ϕ_ℓ = φ_ℓ yields g^(1)_ij = ⟨φ_i, L φ_j⟩_µ = λ_j ⟨φ_i, φ_j⟩_µ = λ_j g^(0)_ij. That is, L_r = diag(λ_1, ..., λ_r) and the reduced operator has exactly the same dominant eigenvalues as the full operator. However, if we already knew the eigenvectors of L, then we could immediately compute the clusters by applying, e.g., k-means. Also, the number of metastable sets is in general not known in advance. The question then is whether it is possible to derive near-optimal basis functions without having to compute the eigenvectors in the first place. One possibility would be to construct basis functions based on random walks started in different parts of the graph since the random walkers would with a high probability first explore nodes within the clusters before moving to other clusters. The generation of suitable basis functions, however, is beyond the scope of this paper.
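As an illustration of a block-indicator basis (a sketch with made-up parameters; taking L = K = S and a uniform µ are assumptions of this example), the Galerkin projection onto two indicator functions reproduces the dominant eigenvalues of a two-block graph:

```python
import numpy as np

rng = np.random.default_rng(4)
n_b, n = 40, 80
E = np.kron(np.array([[0.9, 0.05], [0.05, 0.9]]), np.ones((n_b, n_b)))
U = np.triu((rng.random((n, n)) < E).astype(float), 1)
A = U + U.T + np.eye(n)                    # undirected two-block graph with self-loops
S = A / A.sum(axis=1, keepdims=True)       # Koopman matrix K = S

mu = np.full(n, 1.0 / n)                   # uniform density (assumption)
Phi = np.zeros((2, n))                     # indicator basis: one function per block
Phi[0, :n_b] = 1.0
Phi[1, n_b:] = 1.0

G0 = Phi @ np.diag(mu) @ Phi.T             # entries <phi_i, phi_j>_mu
G1 = Phi @ np.diag(mu) @ S @ Phi.T         # entries <phi_i, K phi_j>_mu
K_r = np.linalg.inv(G0) @ G1               # 2 x 2 coarse-grained Koopman matrix

lam_r = np.sort(np.linalg.eigvals(K_r).real)[::-1]
lam = np.sort(np.linalg.eigvals(S).real)[::-1]
assert abs(lam_r[0] - 1.0) < 1e-10         # 1 is always an eigenvalue
assert abs(lam_r[1] - lam[1]) < 0.1        # second eigenvalue well approximated
```

The 2 × 2 reduced matrix is a coarse-grained transition matrix between the two blocks, so its subdominant eigenvalue reflects the metastability of the full system.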

Data-driven approximation
The operators introduced above can also be estimated from random walk data. Given m random walkers x^(i) sampled from the distribution µ, each random walker takes a single step forward according to the probabilities given by the transition probability matrix S. Let the final positions of the random walkers be denoted by y^(i). Alternatively, instead of considering many random walkers, the data can be extracted from one long random walk x^(1), ..., x^(m+1) by defining y^(i) = x^(i+1), for i = 1, ..., m. The density µ is then determined by the random walk. If the graph admits a unique invariant density π, then µ will converge to π for m → ∞.
Further, let G_xx, G_yy, G_xy ∈ R^{r×r} be defined by

    G_xx = (1/m) Σ_{i=1}^m ϕ(x^(i)) ϕ(x^(i))^⊤,  G_yy = (1/m) Σ_{i=1}^m ϕ(y^(i)) ϕ(y^(i))^⊤,  G_xy = (1/m) Σ_{i=1}^m ϕ(x^(i)) ϕ(y^(i))^⊤.

Thus, the entries of the matrices are Monte Carlo estimates of the inner products ⟨ϕ_i, ϕ_j⟩_µ, ⟨ϕ_i, ϕ_j⟩_ν, and ⟨ϕ_i, Kϕ_j⟩_µ, respectively.

Remark 3.3. Note that the matrices G_xx and G_xy are not the Gram matrices used in kernel-based methods. Here, G stands for Galerkin.

Proposition 3.4. For functions f, g ∈ U_r with coefficient vectors c_x and c_y, it holds in the infinite-data limit m → ∞ that c_y^⊤ G_xx c_x → ⟨g, f⟩_µ and c_y^⊤ G_xy c_x → ⟨g, K f⟩_µ. In particular, G_xx^{-1} G_xy converges to the Galerkin projection K_r of the Koopman operator.

Proof. For the first part, compare G_xx and G_xy with the matrices G^(0) and G^(1) introduced above. Then, using Proposition A.1, we have K_r = (G^(0))^{-1} G^(1), which proves the second part.
If µ is the uniform distribution, then ⟨ϕ_i, Kϕ_j⟩ = ⟨Pϕ_i, ϕ_j⟩, and G_xx^{-1} G_yx is an approximation of the operator P with respect to the standard inner product ⟨·, ·⟩. As the forward-backward and backward-forward operators are simply compositions of these operators (see also Proposition A.3), all the transfer operators introduced above can be estimated from data. This, however, implies that we do not directly compute Galerkin approximations of F and B but represent them as compositions of Galerkin projections, which introduces additional errors.

Figure 5: (a) Convergence of the eigenvalues of the data-driven approximation; shown are the second, the third, and the fourth eigenvalue (the first is not shown since it is always 1). The shaded area in the corresponding color represents the standard deviation, the dotted line the infinite-data limit, and the dashed line the eigenvalues of the Galerkin projection of the operator. Recall that the data-driven approximation is a composition of two Galerkin approximations and thus slightly underestimates the correct eigenvalues. The eigenvalues of the Galerkin approximation are also marginally smaller than the eigenvalues of the full operator computed in Example 2.16. (b) Dominant eigenfunctions φ_ℓ (top) and ψ_ℓ (bottom) of the reduced matrix (solid lines) and the full matrix (dotted lines). The chosen indicator function basis approximates the correct eigenfunctions well since they are essentially almost constant within the clusters.
Example 3.5. For the graph introduced in Example 2.16, we define sets A_1 = {v_1, ..., v_100}, A_2 = {v_101, ..., v_200}, A_3 = {v_201, ..., v_300}, and A_4 = {v_301, ..., v_400} and basis functions ϕ_j(v_i) = 1_{A_j}(v_i), with j = 1, ..., 4, where 1_B denotes the indicator function for the set B. The representation of the forward-backward operator F projected onto the space spanned by the four basis functions is then approximately a diagonal matrix and can be regarded as a coarse-grained counterpart of the full operator, where the off-diagonal entries correspond to transition probabilities between the clusters. The resulting eigenvalues and eigenvectors are good approximations of the dominant eigenvalues and eigenvectors of the full operator as shown in Figure 5. △

The example illustrates that it is possible to approximate the dominant eigenvalues and eigenvectors by computing eigendecompositions of potentially much smaller matrices, provided that a suitable set of basis functions is chosen and that the graph exhibits metastability.
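A data-driven sketch along these lines (illustrative parameters; uniform sampling of the initial positions is assumed) estimates G_xx and G_xy from m one-step transitions and compares G_xx^{-1} G_xy with the exact Galerkin projection of the Koopman operator:

```python
import numpy as np

rng = np.random.default_rng(5)
n_b, n = 50, 100
E = np.kron(np.array([[0.9, 0.05], [0.05, 0.9]]), np.ones((n_b, n_b)))
A = (rng.random((n, n)) < E).astype(float) + np.eye(n)
S = A / A.sum(axis=1, keepdims=True)

Phi = np.zeros((2, n))                                 # block indicator basis
Phi[0, :n_b] = 1.0
Phi[1, n_b:] = 1.0

m = 20000
x = rng.integers(0, n, size=m)                         # x^(i) ~ mu (uniform)
y = np.array([rng.choice(n, p=S[i]) for i in x])       # one forward step each
Phi_x, Phi_y = Phi[:, x], Phi[:, y]                    # basis evaluated at samples

G_xx = (Phi_x @ Phi_x.T) / m                           # Monte Carlo Galerkin matrices
G_xy = (Phi_x @ Phi_y.T) / m
K_r_est = np.linalg.inv(G_xx) @ G_xy                   # estimated reduced Koopman matrix

mu = np.full(n, 1.0 / n)                               # exact counterpart for comparison
K_r = np.linalg.inv(Phi @ np.diag(mu) @ Phi.T) @ (Phi @ np.diag(mu) @ S @ Phi.T)
assert np.abs(K_r_est - K_r).max() < 0.02              # close for large m
```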
Remark 3.6. Assuming that only random-walk data is available, but not the underlying graph itself, the data-driven approach can thus also be used for inferring the network structure and the transition probabilities. This will be further investigated in future work.

Optimization problem formulation
We will now show that it is possible to rewrite the clustering approach for directed graphs as a constrained optimization problem. This gives rise to a different interpretation of clusters. Canonical correlation analysis (CCA) [38,39,40] was originally developed to maximize the correlation between random variables, potentially in infinite-dimensional feature spaces if the kernel-based formulation is used. It has been shown in [9] that CCA, when applied to time-series data, can be used to detect coherent sets. In order to illustrate the relationships with the forward-backward operator F (or its matrix representation F), we will briefly outline the derivation of CCA, which can be written as

    max_{f, g ∈ U} E[f(X) g(Y)]  s.t.  E[f(X)²] = E[g(Y)²] = 1.    (1)

Restricted to the subspace U_r, defining f = c_x^⊤ ϕ and g = c_y^⊤ ϕ, this becomes

    max_{c_x, c_y ∈ R^r} c_x^⊤ G_xy c_y  s.t.  c_x^⊤ G_xx c_x = c_y^⊤ G_yy c_y = 1.

The Lagrangian function corresponding to this optimization problem is

    𝔏(c_x, c_y) = c_x^⊤ G_xy c_y − (κ_x/2)(c_x^⊤ G_xx c_x − 1) − (κ_y/2)(c_y^⊤ G_yy c_y − 1),

where κ_x and κ_y are Lagrange multipliers. The derivatives with respect to c_x and c_y are

    ∇_{c_x} 𝔏 = G_xy c_y − κ_x G_xx c_x  and  ∇_{c_y} 𝔏 = G_yx c_x − κ_y G_yy c_y.

Multiplying the first equation by c_x^⊤ and the second by c_y^⊤, it follows that κ_x = κ_y =: κ_ℓ. We thus obtain the (generalized) eigenvalue problem

    G_xy c_y = κ_ℓ G_xx c_x,  G_yx c_x = κ_ℓ G_yy c_y.

If we now solve one of the equations for c_x or c_y and insert the resulting expression into the other, we have

    G_xx^{-1} G_xy G_yy^{-1} G_yx c_x = κ_ℓ² c_x  and  G_yy^{-1} G_yx G_xx^{-1} G_xy c_y = κ_ℓ² c_y.

Since G_xx^{-1} G_xy is a Galerkin approximation of K and G_yy^{-1} G_yx a Galerkin approximation of T, the first expression can be used to compute eigenfunctions of F and the second to compute eigenfunctions of B, where λ_ℓ = κ_ℓ² is the corresponding eigenvalue. This shows that the functions maximizing the correlation are the eigenfunctions of the forward-backward and backward-forward operators. We can again use the dominant eigenfunctions to compute coherent sets.

Figure 6: Clustering results for graphs sampled from the DSBM with p = 0.99 and q = 0.01, p = 0.7 and q = 0.3, p = 0.3 and q = 0.7, and p = 0.01 and q = 0.99, corresponding to the dots of the same color in (a) and (b). Although the correlation decreases due to the "noisier" eigenvectors if p approaches q, the spectral clustering algorithm still works for most cases unless p ≈ q. Note that the sign of ψ_2 flips once q > p (i.e., the off-diagonal blocks are now non-sparse), but the correlation increases again.
Example 3.7. We now generate graphs G ∈ G(2, 100, E) with E = [p q; q p] for varying p and q using the DSBM. There are two dense diagonal blocks if p is large and q is small and two dense off-diagonal blocks if p is small and q is large. In order to analyze the quality of the clustering for different values of p and q, we plot the second-largest eigenvalue κ_2 (i.e., the correlation between φ_2 and ψ_2) and the adjusted Rand index [41]. The results are shown in Figure 6. The clustering works for both scenarios, but, as expected, fails when the values of p and q are too similar since there are in that case no clearly defined clusters. △

This demonstrates that the spectral clustering approach for directed graphs works for a wide range of parameter settings. Alternatively, we can rewrite (1) as

    max_{c̄_x, c̄_y ∈ R^r} c̄_x^⊤ G_xx^{-1/2} G_xy G_yy^{-1/2} c̄_y  s.t.  c̄_x^⊤ c̄_x = c̄_y^⊤ c̄_y = 1,    (2)

where c̄_x = G_xx^{1/2} c_x and c̄_y = G_yy^{1/2} c_y.
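The equivalence between the CCA eigenvalue problem and the forward-backward operator can also be checked numerically. In the sketch below (random example graph; the full indicator basis is used, so that G_xx = D_µ, G_yy = D_ν, and G_xy = D_µ S, which follows from the covariance operator definitions), the matrix G_xx^{-1} G_xy G_yy^{-1} G_yx coincides with F:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 60
A = (rng.random((n, n)) < 0.2).astype(float) + np.eye(n)
S = A / A.sum(axis=1, keepdims=True)

mu = np.full(n, 1.0 / n)
nu = S.T @ mu

# Full indicator basis: G_xx = D_mu, G_yy = D_nu, G_xy = D_mu S.
G_xx, G_yy, G_xy = np.diag(mu), np.diag(nu), np.diag(mu) @ S
G_yx = G_xy.T

M = np.linalg.inv(G_xx) @ G_xy @ np.linalg.inv(G_yy) @ G_yx   # CCA matrix
F = S @ np.diag(1.0 / nu) @ S.T @ np.diag(mu)                 # forward-backward matrix

kappa2 = np.sort(np.linalg.eigvals(M).real)[::-1]             # kappa_l^2
lam = np.sort(np.linalg.eigvals(F).real)[::-1]                # lambda_l
assert np.allclose(kappa2, lam)                               # identical spectra
assert abs(lam[0] - 1.0) < 1e-8                               # dominant eigenvalue is 1
```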

Optimization problem formulation for undirected graphs
In order to write spectral clustering algorithms for undirected graphs as the solution of a constrained optimization problem, set $\mu = \pi$. We then solve

$$\max_{c} \; c^\top G_{xy}\, c \quad \text{s.t.} \quad c^\top G_{xx}\, c = 1, \tag{3}$$

which is maximized by the dominant eigenvector of the matrix $G_{xx}^{-1} G_{xy}$. We then again obtain the eigenvalue problem $G_{xx}^{-1} G_{xy}\, c = \kappa\, c$.
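For undirected graphs this reduces to a standard eigenvalue computation. The following sketch (the two-triangle test graph is our own hypothetical example) solves $G_{xx}^{-1} G_{xy}\, c = \kappa\, c$ and recovers the two metastable sets from the sign structure of the subdominant eigenvector:

```python
import numpy as np

# hypothetical test graph: two triangles connected by one low-weight edge
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1.0
A[2, 3] = A[3, 2] = 0.01                 # weak coupling between the clusters

d = A.sum(axis=1)
pi = d / d.sum()                         # stationary distribution (mu = pi)
S = A / d[:, None]                       # row-stochastic random-walk matrix
G_xx = np.diag(pi)
G_xy = np.diag(pi) @ S                   # symmetric due to detailed balance
assert np.allclose(G_xy, G_xy.T)

# Galerkin-projected Koopman eigenvalue problem G_xx^{-1} G_xy c = kappa c
kappa, C = np.linalg.eig(np.linalg.solve(G_xx, G_xy))
order = np.argsort(-kappa.real)
c2 = C[:, order[1]].real                 # subdominant eigenvector

labels = (c2 > 0).astype(int)            # sign structure yields the clusters
assert labels[:3].tolist() in ([0, 0, 0], [1, 1, 1])
assert labels[0] != labels[3]
```

Clustering the entries of the subdominant eigenvector (here simply by sign, in general via k-means) assigns each triangle to its own metastable set.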

Comparison
We have shown in Section 2 that our transfer operator-based spectral clustering algorithm, obtained by replacing the Koopman operator with the forward-backward operator, can be interpreted as a straightforward generalization of conventional spectral clustering algorithms for undirected graphs. This is also corroborated by the optimization problem formulation. Additionally, comparing the constrained optimization problems (1) and (3) for directed and undirected graphs, we see that the former is clearly more flexible and powerful since we are allowed to choose two functions, whereas the latter is restricted to one, which makes it impossible to detect dense off-diagonal blocks.

Numerical results
We will present numerical results for a set of benchmark problems. Additional examples, including extensions to time-evolving networks, can be found in [23].

Weighted graphs with inhomogeneous degree distributions
We generate 300 weighted graphs with heterogeneous node-degree distributions and community sizes as described in [32] using their publicly available C++ code (Package 4). All the graphs comprise 500 vertices and have between four and six clusters. We consider three different test cases: (1) only dense blocks on the diagonal, (2) only off-diagonal dense blocks, and (3) a mix of both. A typical adjacency matrix (case 3) is shown in Figure 7.

Table 1: Comparison of the degree-discounted bibliometric symmetrization (DDBS), the Hermitian clustering approach (Herm-RW), and Algorithm 2.15 for graphs generated using a generalization of the model proposed in [32]. For each method and test case, we compute the average adjusted Rand index (ARI) and the average number of misclassified vertices (NMV) in per cent.

We apply the degree-discounted bibliometric symmetrization [22], the Hermitian clustering method proposed in [31] (both described in Remark 2.13), and Algorithm 2.15 to the three different test cases. We run each algorithm ten times (using the same k-means implementation and settings) for each graph and compute the average adjusted Rand index and the average number of misclassified vertices. The results, listed in Table 1, show that the Hermitian clustering approach works well for graphs with dense off-diagonal blocks, but fails to identify "conventional" clusters, i.e., groups of nodes that are strongly coupled by directed edges, since the method was designed to detect only imbalances in the orientation of the edges. Instead of the random-walk normalized matrix $A_{rw} = D_o^{-1} A$ as proposed in [31], we use the normalized adjacency matrix, which seems to improve the results. Although the reweighting parameters for the degree-discounted bibliometric symmetrization were determined empirically, the method works for all test cases. The transfer operator-based clustering also works well for all considered graph types and leads to marginally better results. The advantage of the dynamical systems approach is that it offers a rigorous interpretation of clusters in terms of coherent sets (or canonical correlation analysis) and that it can be easily extended to dynamic graphs. For more complicated benchmark problems, it might also be beneficial to tune the probability density µ or to use
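The adjusted Rand index used throughout this section can be computed in a few lines; the following pair-counting implementation is a self-contained sketch (library implementations, e.g. scikit-learn's `adjusted_rand_score`, behave the same way):

```python
import numpy as np

def adjusted_rand_index(u, v):
    """Adjusted Rand index [41] between two clusterings (pure-NumPy sketch)."""
    u, v = np.asarray(u), np.asarray(v)
    n = len(u)
    # contingency table counting co-occurrences of labels
    _, ui = np.unique(u, return_inverse=True)
    _, vi = np.unique(v, return_inverse=True)
    cont = np.zeros((ui.max() + 1, vi.max() + 1))
    np.add.at(cont, (ui, vi), 1)
    comb = lambda x: x * (x - 1) / 2.0           # number of pairs
    sum_ij = comb(cont).sum()
    sum_a = comb(cont.sum(axis=1)).sum()
    sum_b = comb(cont.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb(n)           # chance-adjustment term
    max_index = (sum_a + sum_b) / 2.0
    return (sum_ij - expected) / (max_index - expected)

# identical clusterings (up to label permutation) give ARI = 1
assert adjusted_rand_index([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]) == 1.0
```

A value of 1 indicates perfect agreement with the ground-truth partition; values near 0 correspond to a random assignment.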

32-bit adder
As a last example, we consider a 32-bit adder. The data set can be found on the Matrix Market website. Since the matrix contains negative values, we shift the entries so that all weights are positive. The circuit has a fairly simple structure and the spectral clustering algorithm detects 32 blocks of nearly the same size. Reordering the adjacency matrix based on the cluster numbers, we obtain a block matrix with very few nonzero entries in the off-diagonal blocks, as shown in Figure 8. The bandwidth of the matrix could be minimized by applying generalizations of the Cuthill-McKee algorithm to the block structure [43].
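Two steps of this procedure lend themselves to compact code: choosing the number of clusters k from the spectral gap and permuting the adjacency matrix by cluster labels. The eigenvalues and labels below are made up for illustration (for the adder itself, the gap sits between the 32nd and 33rd eigenvalue):

```python
import numpy as np

# hypothetical spectrum of the forward-backward operator with a clear gap
lam = np.array([1.0, 0.98, 0.96, 0.95, 0.4, 0.35, 0.3])
k = int(np.argmax(lam[:-1] - lam[1:])) + 1    # index of the largest gap
assert k == 4

# reorder a (random) adjacency matrix by hypothetical cluster labels
rng = np.random.default_rng(1)
A = rng.random((8, 8))
labels = np.array([2, 0, 1, 0, 2, 1, 0, 1])   # e.g. k-means on eigenvectors
perm = np.argsort(labels, kind="stable")
A_perm = A[np.ix_(perm, perm)]                # exposes the block structure
assert (labels[perm] == np.sort(labels)).all()
```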

Conclusion
We have defined different types of transfer operators associated with random walks on graphs and shown that their eigenfunctions contain information about metastable or coherent sets. The main advantages of the dynamical systems perspective are that the derived algorithms can be interpreted as generalizations of conventional spectral clustering methods for undirected graphs and that they can easily be extended to time-evolving graphs, as shown in [23]. Future work includes a thorough analysis of the proposed coherent set-based spectral clustering approach for dynamic graphs. Of particular interest are graphs comprising clusters whose structure changes in time, e.g., splitting and merging as well as shrinking and growing clusters. First results presented in [23] illustrate that it is possible to detect time-dependent clusters in dynamic graphs that are based on discretized dynamical systems with coherent sets. However, finding such clusters in real-world graphs, which may have a more intricate and less homogeneous coupling structure, might require tuning and improving the algorithms. Also, the properties of transfer operators associated with time-evolving graphs need to be studied in more detail. An open question is whether it is possible to derive related optimization problems that take each layer of the network into account.
Furthermore, it would be interesting to interpret the optimization problems derived above as relaxed versions of discrete counterparts. How is this related to normalized cuts, and can these problems be efficiently solved using quantum annealers? Another research problem is the derivation of suitable basis functions for the Galerkin approximation that would allow us to coarse-grain the system without losing essential information about the metastable or coherent sets. While it is theoretically possible to reduce the size of the eigenvalue problem significantly, computing the projected operator and solving the resulting less sparse eigenvalue problem numerically might not be advantageous in practice.
For the benchmark problems in Section 4, we simply computed the forward-backward operator with respect to the uniform density. However, it might be beneficial to choose different densities µ to counterbalance inhomogeneous degree distributions, similar to the degree-discounted symmetrization proposed in [22]. Using standard k-means, we assign each vertex to exactly one cluster. However, some vertices might not belong to any cluster or might belong to multiple clusters at the same time. Soft clustering methods might be able to detect overlapping communities or transition regions.
Proof. By Proposition A.1, $F$ and $B$ are row-stochastic. If µ is the uniform distribution, then $F$ is symmetric. Similarly, $B$ is symmetric if ν is the uniform distribution.
All the transfer operators introduced above can also be represented as compositions of covariance and cross-covariance operators.

Definition 2.2 (Directed stochastic block model). The graph G is said to be sampled from the directed stochastic block model (DSBM) $\mathcal{G}(r_b, n_b, E)$, with $E \in [0, 1]^{r_b \times r_b}$, if the adjacency matrix
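A DSBM adjacency matrix can be sampled directly from this definition. The following sketch (the function name and sampling details, e.g. the treatment of self-loops, are our own choices) draws each directed edge independently with the corresponding block probability $E_{kl}$:

```python
import numpy as np

def sample_dsbm(r_b, n_b, E, rng=None):
    """Sample the adjacency matrix of a DSBM graph G(r_b, n_b, E):
    r_b blocks of n_b vertices each; a directed edge from a vertex in
    block k to a vertex in block l exists with probability E[k, l]."""
    rng = rng or np.random.default_rng()
    n = r_b * n_b
    block = np.repeat(np.arange(r_b), n_b)   # block index of each vertex
    P = E[np.ix_(block, block)]              # n x n matrix of edge probabilities
    return (rng.random((n, n)) < P).astype(float)

# two dense diagonal blocks, as in Example 3.7 with p = 0.9 and q = 0.1
E = np.array([[0.9, 0.1], [0.1, 0.9]])
A = sample_dsbm(2, 100, E, np.random.default_rng(0))
assert A.shape == (200, 200)
```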

Figure 1: (a) Two different sets of random walkers starting in $\{v_1, \dots, v_4\}$ and $\{v_7, \dots, v_{10}\}$, marked in green and red, respectively. The weight of the solid edges is 1 and the weight of the dashed edges is 0.01. (b) Positions after one step forward. Only one green random walker leaves the set. (c) Positions after one step forward and one step backward. The green set is coherent (only two random walkers escaped), whereas the red set is clearly less invariant under the forward-backward dynamics.
where the operators (with a slight abuse of notation) are applied component-wise. That is, $P, K, T, F, B$ are the matrix representations of the corresponding operators $\mathcal{P}, \mathcal{K}, \mathcal{T}, \mathcal{F}, \mathcal{B}$, respectively. The matrices $D_\mu$ and $D_\nu$ are invertible since we assumed µ and ν to be strictly positive. For the covariance and cross-covariance operators $\mathcal{C}_{xx}$, $\mathcal{C}_{yy}$, and $\mathcal{C}_{xy}$, we obtain the matrix representations $C_{xx} = D_\mu$, $C_{yy} = D_\nu$, and $C_{xy} = D_\mu S$ and can thus write $\langle f, \mathcal{C}_{xx} f\rangle = f^\top D_\mu f$, $\langle g, \mathcal{C}_{yy}\, g\rangle = g^\top D_\nu\, g$, and $\langle f, \mathcal{C}_{xy}\, g\rangle = f^\top D_\mu S\, g$.
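These matrix representations can be assembled directly from a transition matrix and a density. The following minimal sketch (random $S$ and µ, so purely illustrative) checks two of the stated properties numerically:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
S = rng.random((n, n))
S /= S.sum(axis=1, keepdims=True)        # row-stochastic transition matrix
mu = rng.random(n) + 0.1
mu /= mu.sum()                           # strictly positive, so D_mu invertible
nu = S.T @ mu
D_mu, D_nu = np.diag(mu), np.diag(nu)

K = S                                    # Koopman matrix
T = np.linalg.solve(D_nu, S.T @ D_mu)    # transfer operator w.r.t. mu and nu
F, B = K @ T, T @ K                      # forward-backward, backward-forward

# <f, C_xy g> = f^T D_mu S g equals the weighted inner product <f, K g>_mu
f, g = rng.random(n), rng.random(n)
assert np.isclose(f @ D_mu @ S @ g, np.sum(mu * f * (K @ g)))

# F and B are row-stochastic, so constant functions are eigenfunctions
assert np.allclose(F @ np.ones(n), 1.0)
assert np.allclose(B @ np.ones(n), 1.0)
```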
and the definitions are equivalent. The properties of transfer operators are analyzed in detail in Appendix A. We in particular show that the eigenvalues of the operators $F$ and $B$ are contained in the interval [0, 1].

Figure 2: (a) Forward-backward edge between $v_i$ and $v_j$. (b) Backward-forward edge between $v_i$ and $v_j$.

Figure 3: (a) Directed graph with three clusters. Self-loops are omitted for the sake of clarity. (b) Graph structure of the matrix $F$ for uniform µ. (c) Graph structure of the matrix $B$ for uniform ν. The dashed edges again have a low weight. Clustering the dominant eigenvectors of the matrices $F$ or $B$ results in three coherent sets; see also Example 2.4.

Figure 4: (a) Adjacency matrix of a directed graph comprising four clusters. The blocks are colored according to the cluster numbers. (b) Visualization of the clustered graph using the eigenfunctions $\phi_2$ and $\phi_3$ as coordinates. (c) Four dominant eigenfunctions $\phi_\ell$ (top) and $\psi_\ell$ (bottom), where the different colors denote the first, second, third, and fourth eigenfunctions. The functions are roughly constant within the clusters. (d) Clusters extracted from the functions $\phi_\ell$ (top) and $\psi_\ell$ (bottom) using k-means. The blue indicator function picks row 3 and column 3 of the block matrix, the red function row 1 and column 2, the yellow function row 2 and column 4, and the purple function row 4 and column 1. That is, each function pair is associated with the block of the adjacency matrix marked in the corresponding color. These blocks all have in common that they contain many nonzero entries (i.e., incoming or outgoing edges). Note that since we compute two functions, i.e., $\phi_\ell$ and $\psi_\ell$, it is now possible to detect dense off-diagonal blocks.
and $\langle f, \mathcal{C}_{xy}\, g\rangle = c_x^\top G_{xy}\, c_y$. If we choose the basis functions defined in Example 3.1, then $G_{xx} = C_{xx} = D_\mu$ and $G_{xy} = C_{xy} = D_\mu S$.

Lemma 3.4. It follows that:

Figure 5: (a) Mean of the eigenvalues estimated from random walk data (averaged over 1000 runs) as a function of m, where the different colors denote the first (not shown since it is always 1), second, third, and fourth eigenvalues. The shaded area in the corresponding color represents the standard deviation, the dotted line the infinite-data limit, and the dashed line the eigenvalues of the Galerkin projection of the operator. Recall that the data-driven approximation is a composition of two Galerkin approximations and thus slightly underestimates the correct eigenvalues. The eigenvalues of the Galerkin approximation are also marginally smaller than the eigenvalues of the full operator computed in Example 2.16. (b) Dominant eigenfunctions $\phi_\ell$ (top) and $\psi_\ell$ (bottom) of the reduced matrix (solid lines) and the full matrix (dotted lines). The chosen indicator function basis approximates the correct eigenfunctions well since they are essentially constant within the clusters.

Figure 6: (a) Correlation between $\phi_2$ and $\psi_2$. (b) Adjusted Rand index of the obtained clustering. A value of 1 implies that the algorithm identified the correct clusters. (c) Eigenfunctions $\phi_2$ (top) and $\psi_2$ (bottom) for different values of $p$ and $q$, where the different colors denote $p = 0.99$ and $q = 0.01$, $p = 0.7$ and $q = 0.3$, $p = 0.3$ and $q = 0.7$, and $p = 0.01$ and $q = 0.99$, corresponding to the dots of the same color in (a) and (b). Although the correlation decreases due to the "noisier" eigenvectors as $p$ approaches $q$, the spectral clustering algorithm still works in most cases unless $p \approx q$. Note that the sign of $\psi_2$ flips once $q > p$ (i.e., the off-diagonal blocks are now non-sparse), but the correlation increases again.

and $\bar c_y = G_{yy}^{1/2} c_y$. The maximum of the cost function (2) is the largest singular value $\sigma_1$ of the matrix $G_{xx}^{-1/2} G_{xy}\, G_{yy}^{-1/2}$, attained at $\bar c_x = u_1$ and $\bar c_y = v_1$, where $u_1$ and $v_1$ are the corresponding left and right singular vectors [42]. Thus, setting $c_x = G_{xx}^{-1/2} u_1$ and $c_y = G_{yy}^{-1/2} v_1$ solves the CCA problem. This can be extended to the subsequent singular values and vectors.

Remark 3.8. Introducing a Lagrange multiplier for the constraint in (3) results in the Lagrangian function

$$L(c, \kappa) = c^\top G_{xy}\, c - \kappa \left(c^\top G_{xx}\, c - 1\right)$$

and hence

$$\nabla_c L(c, \kappa) = 2\, G_{xy}\, c - 2\, \kappa\, G_{xx}\, c \overset{!}{=} 0,$$

where we used that $G_{xy}$ is symmetric for $\mu = \pi$ since, due to the detailed balance condition, $C_{xy} = D_\pi S = S^\top D_\pi = C_{xy}^\top$ and hence also $G_{xy} = G_{xy}^\top$. Thus, we obtain the eigenvalue problem $G_{xx}^{-1} G_{xy}\, c = \kappa\, c$. This is a Galerkin approximation of the Koopman operator, and the cost function is maximized by the eigenfunction associated with the largest eigenvalue. Subdominant eigenfunctions can again be used for detecting metastable sets. If the matrix $G_{xy}$ is not symmetric, we can still solve the optimization problem, but would then obtain the eigenvalue problem $\frac{1}{2} G_{xx}^{-1} \left(G_{xy} + G_{yx}\right) c = \kappa\, c$, which can be regarded as a simple form of symmetrization. Defining $\bar c = G_{xx}^{1/2} c$, we can also write (3) as

$$\max_{\|\bar c\| = 1} \; \bar c^\top\, G_{xx}^{-1/2} G_{xy}\, G_{xx}^{-1/2}\, \bar c.$$

Figure 7: (a) Adjacency matrix of a graph with six clusters. (b) Permuted adjacency matrix, where the vertices are reordered according to the identified clusters.

Figure 8: (a) Structure of the original adjacency matrix of the 32-bit adder. (b) Dominant eigenvalues of the forward-backward operator. There is a spectral gap between the 32nd and 33rd eigenvalue; we thus set k = 32. (c) Permuted adjacency matrix, where we again reorder the vertices according to the identified clusters.

Proposition A.3. Let $\mathcal{C}_{yx} = \mathcal{C}_{xy}^*$. Then it holds that:

i) $\mathcal{K} = \mathcal{C}_{xx}^{-1} \mathcal{C}_{xy}$,
ii) $\mathcal{P} = \mathcal{C}_{xx}^{-1} \mathcal{C}_{yx}$, provided that µ is the uniform distribution,
iii) $\mathcal{T} = \mathcal{C}_{yy}^{-1} \mathcal{C}_{yx}$,
iv) $\mathcal{F} = \mathcal{C}_{xx}^{-1} \mathcal{C}_{xy}\, \mathcal{C}_{yy}^{-1} \mathcal{C}_{yx}$,
v) $\mathcal{B} = \mathcal{C}_{yy}^{-1} \mathcal{C}_{yx}\, \mathcal{C}_{xx}^{-1} \mathcal{C}_{xy}$.

Proof. We can either use properties of the operators or their matrix representations:

i) This follows from $\langle f, \mathcal{C}_{xy}\, g\rangle = \langle f, \mathcal{K} g\rangle_\mu = \langle f, \mathcal{C}_{xx} \mathcal{K} g\rangle$ or $C_{xx}^{-1} C_{xy} = S = K$.

ii) Using the duality between $\mathcal{P}$ and $\mathcal{K}$, we write $\langle \mathcal{P}\rho, f\rangle = \langle \rho, \mathcal{C}_{xx}^{-1} \mathcal{C}_{xy} f\rangle = \langle \mathcal{C}_{yx} \mathcal{C}_{xx}^{-1} \rho, f\rangle = \langle \mathcal{C}_{xx}^{-1} \mathcal{C}_{yx}\, \rho, f\rangle$, where $\mathcal{C}_{xx}^{-1}$ and $\mathcal{C}_{yx}$ commute if µ is the uniform distribution. Alternatively, this can be seen as follows: $C_{xx}^{-1} C_{yx} = D_\mu^{-1} S^\top D_\mu = n I\, S^\top \frac{1}{n} I = S^\top = P$.

iii) By Proposition A.1, we have $\langle \mathcal{T} u, f\rangle_\nu = \langle u, \mathcal{K} f\rangle_\mu = \langle u, \mathcal{C}_{xx}^{-1} \mathcal{C}_{xy} f\rangle_\mu = \langle u, \mathcal{C}_{xy} f\rangle = \langle \mathcal{C}_{yx} u, f\rangle = \langle \mathcal{C}_{yy}^{-1} \mathcal{C}_{yx} u, f\rangle_\nu$. Similarly, $C_{yy}^{-1} C_{yx} = D_\nu^{-1} S^\top D_\mu = T$.

iv) and v) These follow immediately from the definitions of $\mathcal{F}$ and $\mathcal{B}$.
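The identities of Proposition A.3 can also be checked numerically. The following sketch uses a random row-stochastic matrix with a uniform density µ (so that ii) applies); it is a sanity check, not a proof:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5
S = rng.random((n, n))
S /= S.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
mu = np.full(n, 1.0 / n)            # uniform distribution
nu = S.T @ mu

C_xx, C_yy = np.diag(mu), np.diag(nu)
C_xy = np.diag(mu) @ S              # matrix representation of C_xy
C_yx = C_xy.T

K = np.linalg.solve(C_xx, C_xy)     # i)   K = C_xx^{-1} C_xy
P = np.linalg.solve(C_xx, C_yx)     # ii)  P = C_xx^{-1} C_yx
T = np.linalg.solve(C_yy, C_yx)     # iii) T = C_yy^{-1} C_yx
F = K @ T                           # iv)  forward-backward operator
B = T @ K                           # v)   backward-forward operator

assert np.allclose(K, S)            # Koopman matrix is S
assert np.allclose(P, S.T)          # Perron-Frobenius matrix is S^T (mu uniform)
assert np.allclose(F.sum(axis=1), 1.0)   # F is row-stochastic (Prop. A.1)
assert np.allclose(B.sum(axis=1), 1.0)   # B is row-stochastic (Prop. A.1)
```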