Module representatives for refining gene co-expression modules

This paper concerns the identification of gene co-expression modules in transcriptomics data, i.e. collections of genes which are highly co-expressed and potentially linked to a biological mechanism. Weighted gene co-expression network analysis (WGCNA) is a widely used method for module detection based on the computation of eigengenes: for each module, the eigengene is the vector of weights of the first principal component of the module gene expression matrix. The eigengene has been used as a centroid in a k-means algorithm to improve module memberships. In this paper, we present four new module representatives: the eigengene subspace, flag mean, flag median and module expression vector. The eigengene subspace, flag mean and flag median are subspace module representatives which capture more variance of the gene expression within a module. The module expression vector is a weighted centroid of the module which leverages the structure of the module gene co-expression network. We use these module representatives in Linde–Buzo–Gray (LBG) clustering algorithms to refine WGCNA module membership. We evaluate these methodologies on two transcriptomics data sets. We find that most of our module refinement techniques improve upon the WGCNA modules according to two statistics: (1) module-based classification between phenotypes and (2) module biological significance according to Gene Ontology terms.


Introduction
Modules of genes are useful structures for grouping genes by their downstream biological functions. Among their many applications, gene modules have been used as tools for finding prognostic markers by classifying subtypes of cancer and for identifying groups of genes related to heart failure [1,2]. A gene module is generally defined as a set of 'co-expressed genes.' Gene expression levels are used to determine gene co-expression and are measured using transcriptomics data collected with either microarrays or the newer technology, RNA sequencing (RNA-seq).
Early work in the detection of co-expression modules used clustering of biological co-expression networks [3][4][5][6]. Edge weights in these networks are found either using known biological connections between genes or statistical methods to measure similarity between gene expression levels, most commonly calculating correlation. In 2005, Zhang and Horvath proposed weighted gene co-expression network analysis (WGCNA). It uses a branch cut of an agglomerative hierarchical clustering tree of a correlation gene co-expression network to detect modules [7]. In 2008, Langfelder and Horvath released a comprehensive R package to perform WGCNA [8]. In the same year, these methodologies were interpreted geometrically by Horvath et al [9]. The R package provided a stable platform for the WGCNA workflow. In 2016, Bailey et al used WGCNA to detect modules and provide a detailed analysis of pancreatic cancer [10]. Other approaches for module detection include techniques like Fuzzy clustering by Local Approximation of Memberships (FLAME) and independent component analysis [11,12]. A 2016 survey of module detection methods provides a comparison of more than 40 methods and highlights some of the shortcomings of WGCNA [13]. A more recent survey paper of gene co-expression analysis by Van Dam et al referenced 9 methods for co-expression module detection and underlined the significance of the WGCNA workflow for co-expression module detection [14]. Even the simple correlation edge weight techniques from WGCNA are used in the popular protein-protein interaction network database STRING [15]. In summary, module detection using co-expression networks has at least a 20 year history in bioinformatics and the WGCNA algorithm has played a central role in applications.
WGCNA also addresses the question of module representation by using the 'eigengene.' Zhang and Horvath use the eigengene as a single vector representative of the genes in a module [7]. Correlation between eigengenes is used by Zhang and Horvath to compare modules to one another. These eigengenes are also used to define 'fuzzy' module membership by calculating similarities between gene expression vectors and each eigengene. Langfelder and Horvath use eigengenes to compare different biological groups (e.g. tissue types, species and other biological delineations) using three steps: (1) 'consensus module' detection, (2) hierarchically clustering the eigengenes of each module, and (3) comparing these clusterings and eigengenes to one another [16]. An overview of additional methods for module level analysis is provided by Wang et al [17], including module comparison methods that incorporate protein-protein interaction and transcriptomics data to generate module networks.
More recently, researchers have been building upon hierarchical clustering for module detection using module refinement. These methods use WGCNA modules as an initialization for another clustering algorithm to generate refined modules. For example, k-means is one such module refinement algorithm. k-means clustering is sensitive to center initialization, so using the WGCNA hierarchical clustering result as the initialization provides a significant head start for the clustering. Refining WGCNA module membership via k-means clustering with eigengenes as centroids was introduced in 2017 and was shown to increase the biological significance of the modules through functional enrichment analysis [18]. In 2021, Hou et al modified the module refinement algorithm by clustering using module connectivity, which is measured by 'correlation distance,' while eliminating the eigengene [19].
The focus of this paper is on developing new methods for the representation of modules, analyzing their properties, and implementing them in a k-means-type Linde-Buzo-Gray (LBG) clustering algorithm for gene co-expression module refinement. We present a methodology which seeks to improve upon the WGCNA method for module representation by introducing eigengene subspaces, the flag mean [20], the flag median [21], and module expression vectors [22]. We seek to develop methods which improve the biological significance of the resulting modules, at the possible expense of machine learning classification rates. The main contributions of this paper are listed below.
• A generalization of the eigengene for module representation.
• The application of LBG subspace clustering schemes with the flag mean, flag median, eigengene subspace and module expression vector to refine WGCNA modules.
• A demonstration that the proposed refinements to WGCNA may lead to enhanced biological significance of the resulting modules.

Methods
Our workflow is detailed in figure 1 and summarized in the following three bullets:
• Calculate initial modules by running WGCNA.
• Find refined modules by running LBG clustering with different central prototypes and distances, using the WGCNA modules as initial clusters.
• Evaluate the biological significance of the refined modules.
In this section, we describe how to use the new module representatives along with LBG clustering to compute refined modules. Then we explain how we use a classification problem and Gene Ontology (GO) significance to evaluate these modules. Section 2.2 describes the flag mean, flag median and eigengene subspace module representatives. The module expression representative is introduced in section 2.3. Our LBG clustering implementation is explained in section 2.4. The final section, section 2.5, outlines our methods for determining the viability of the modules we find.

Introduction to the Grassmann manifold
The Grassmann manifold, denoted Gr(k, n), is the manifold whose points parameterize the k-dimensional subspaces of n-dimensional space. Let [X] ∈ Gr(k, n) denote a k-dimensional subspace of n-dimensional space that is spanned by the columns of the matrix X. We restrict our matrix representatives for points in Gr(k, n) to be X ∈ R^{n×k} with k orthonormal columns.
Suppose we have a gene expression matrix for a module, G ∈ R^{n×m}, with n samples of m genes. G holds m gene expression vectors, G = [g_1, g_2, . . . , g_m]. Then we can transform these m gene expression vectors into m points on Gr(k, n) using k-nearest neighbors with cosine distance. Given two gene expression vectors g_i, g_j ∈ R^n, we measure the cosine distance between g_i and g_j as 1 minus the square of their cosine similarity,

$$d(g_i, g_j) = 1 - \left( \frac{g_i^\top g_j}{\|g_i\|_2 \|g_j\|_2} \right)^2 .$$

Note: this is in fact the square of the sine of the principal angle between the one-dimensional subspaces spanned by g_i and g_j respectively. This is equivalent to the squared chordal distance on Gr(1, n) that we use in LBG module refinement with one-dimensional data and center representations.
Then we take the subspace representative for gene i as the span of its k-nearest neighbors. Suppose h_1, h_2, . . . , h_k ∈ R^n are the k nearest neighbors of gene i. Note: this means that g_i = h_j for some j ∈ {1, 2, . . . , k}, since a gene is its own nearest neighbor. Now we construct the matrix H_i = [h_1, h_2, . . . , h_k] ∈ R^{n×k}. We generate a representative for a point on Gr(k, n) by taking the first k columns of the Q factor of the QR-decomposition of H_i. We denote this representative X_i, so that [X_i] ∈ Gr(k, n). Note: in our experiments we take k = 1, 2 dimensional subspaces for each gene. On the other hand, we calculate r = 1, 2, 4, 8 dimensional module representatives. We refer to the subspace dimension k as the data dimension and to r as the center dimension.
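To make this construction concrete, here is a minimal numpy sketch (the function name and implementation details are ours, not taken from the paper's code) that maps each gene of a module to a k-dimensional subspace representative:

```python
import numpy as np

def gene_subspaces(G, k):
    """Map each column (gene) of G (n samples x m genes) to a point on
    Gr(k, n) spanned by the gene and its nearest neighbors in cosine
    distance. A minimal sketch; details are illustrative."""
    n, m = G.shape
    U = G / np.linalg.norm(G, axis=0)       # unit-normalize each gene
    dist = 1.0 - (U.T @ U) ** 2             # squared-sine distances between spans
    reps = []
    for i in range(m):
        nbrs = np.argsort(dist[i])[:k]      # gene i is its own nearest neighbor
        H = G[:, nbrs]                      # n x k neighbor matrix H_i
        Q, _ = np.linalg.qr(H)              # orthonormal basis for span(H_i)
        reps.append(Q)                      # representative X_i with [X_i] in Gr(k, n)
    return reps
```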
Principal angles between subspaces are a common dissimilarity measure that is invariant to orthogonal transformations [20,23]. Take [X] ∈ Gr(k, n) and [Y] ∈ Gr(r, n). We denote the vector of principal angles between [X] and [Y] as θ([X], [Y]) ∈ R^{min{k,r}}. The entries of this vector are between 0 and π/2. For details on the calculation of principal angles, see [23].
The vector of cosines of the principal angles between [X] and [Y] is used to measure similarity between subspaces and can be computed via

$$\|\cos(\theta([X], [Y]))\|_2^2 = \mathrm{tr}(X^\top Y Y^\top X).$$

When r = k = 1, namely when X and Y are just unit vectors, this similarity amounts to the squared correlation between X and Y.
The chordal distance on Gr(k, n) is ‖sin(θ([X], [Y]))‖_2 and the geodesic distance is ‖θ([X], [Y])‖_2 [24]. We can calculate the squared chordal distance via

$$\|\sin(\theta([X], [Y]))\|_2^2 = \min\{k, r\} - \mathrm{tr}(X^\top Y Y^\top X).$$

This is preferred to the canonical Riemannian geodesic distance since it can be calculated without needing to find the actual principal angles. In addition, the algorithms for chordal distance prototypes (e.g. SVD and FlagIRLS [20,21]) are generally faster to compute than algorithms for geodesic distance prototypes (e.g. Karcher and Riemannian-Weiszfeld [25,26]).
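Both quantities above are a few lines of numpy; the following sketch (helper names are ours) recovers the principal angles from the SVD of X^T Y and the chordal distance from the trace identity:

```python
import numpy as np

def principal_angles(X, Y):
    """Principal angles between [X] and [Y] for matrices with
    orthonormal columns; the singular values of X^T Y are the cosines."""
    cos = np.linalg.svd(X.T @ Y, compute_uv=False)
    return np.arccos(np.clip(cos, 0.0, 1.0))

def chordal_distance(X, Y):
    """||sin(theta([X],[Y]))||_2 from the trace identity, avoiding an
    explicit principal angle computation."""
    q = min(X.shape[1], Y.shape[1])
    M = X.T @ Y
    return np.sqrt(max(q - np.trace(M @ M.T), 0.0))
```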

Subspace prototypes
In order to allow more flexibility in our generalization of optimization problems to subspaces, we work with points that are not all necessarily on the same Grassmannian but are in the same ambient space. Suppose we have a set of k-dimensional subspaces of R^n, {[X_1], [X_2], . . . , [X_m]} ⊂ Gr(k, n). We want to find an r-dimensional subspace of R^n, [Y*] ∈ Gr(r, n), that is the center of these points with respect to a dissimilarity d : Gr(k, n) × Gr(r, n) → R.

The r-dimensional flag mean is the subspace central prototype that solves

$$[Y^*] = \operatorname*{arg\,min}_{[Y] \in Gr(r,n)} \sum_{i=1}^{m} \|\sin(\theta([X_i], [Y]))\|_2^2 .$$

The flag mean is just the r left singular vectors of the block matrix [X_1, X_2, . . . , X_m] associated with the r largest singular values [20]. The complexity of this computation is that of a truncated SVD of an n × mk matrix.

The r-dimensional flag median is the subspace central prototype that solves

$$[Y^*] = \operatorname*{arg\,min}_{[Y] \in Gr(r,n)} \sum_{i=1}^{m} \|\sin(\theta([X_i], [Y]))\|_2 .$$

The flag median is found by iteratively re-weighted flag means in the algorithm FlagIRLS [21]. The complexity of this algorithm depends on N_δ, the number of iterations, and δ, the convergence parameter; each iteration costs roughly as much as one flag mean computation.

The eigengene comes from the principal component analysis (PCA) of the gene expression matrix. Instead of representing the columns of G as subspaces, we only mean-center the rows of G. Let Ḡ denote G with mean-centered rows. The eigengene y* is a maximizer of

$$\max_{y \in \mathbb{R}^n, \; y^\top y = 1} \; y^\top \bar{G} \bar{G}^\top y .$$
We define the eigengene subspace as the generalization of the eigengene to higher dimensional subspaces. Specifically, the eigengene subspace consists of the weights of the first r principal components of the PCA of the gene expression matrix. The eigengene subspace is the solution to

$$\max_{Y \in \mathbb{R}^{n \times r}, \; Y^\top Y = I} \; \mathrm{tr}(Y^\top \bar{G} \bar{G}^\top Y). \qquad (6)$$

The solution to this problem is exactly the r right singular vectors of the matrix Ḡ^⊤ that are associated with the r largest singular values. This is the same as the r left singular vectors of the matrix Ḡ that are associated with the r largest singular values. The complexity of computing the eigengene subspace is the same as that of the flag mean: O(nm²).
There is a direct link between the flag mean and the eigengene subspace, which can be seen by rewriting (6) as

$$\max_{Y \in \mathbb{R}^{n \times r}, \; Y^\top Y = I} \; \sum_{i=1}^{m} \|\bar{g}_i\|_2^2 \cos^2(\theta([\bar{g}_i], [Y])),$$

where ḡ_i denotes the ith column of Ḡ. Notice, this is the r-dimensional flag mean optimization problem for a set of one-dimensional subspaces if we require each ḡ_i to be a unit vector.
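Both prototypes reduce to truncated SVDs, as the following minimal sketch illustrates (function names are ours; the flag median is omitted since FlagIRLS [21] is more involved):

```python
import numpy as np

def flag_mean(subspaces, r):
    """r-dimensional flag mean of a list of orthonormal-column matrices:
    the top-r left singular vectors of the block matrix [X_1, ..., X_m]."""
    B = np.hstack(subspaces)                  # n x (m*k) block matrix
    U, _, _ = np.linalg.svd(B, full_matrices=False)
    return U[:, :r]

def eigengene_subspace(G, r):
    """Weights of the first r principal components of the module gene
    expression matrix G (n samples x m genes) with mean-centered rows;
    the top-r left singular vectors of G-bar solve problem (6)."""
    Gbar = G - G.mean(axis=1, keepdims=True)  # mean-center each row
    U, _, _ = np.linalg.svd(Gbar, full_matrices=False)
    return U[:, :r]
```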

Module expression
Pathway expression is a methodology for dimensionality reduction of gene expression data which translates the gene expression feature space into a pathway expression feature space [22]. Linear pathway expression methods generate a pathway expression value by averaging gene expression levels for the genes in the pathway. Centrality pathway expression also calculates an average, but weights the genes in the pathway by their centralities in a gene network for the pathway.
In this paper we translate the notion of pathway expression to module expression by using the collections of genes in modules rather than pathways. Suppose we have a gene expression matrix for a module G ∈ R^{n×m} with n subjects and m genes. In a gene expression matrix, each gene is assigned a gene index and each subject is assigned a subject index. We can map G to a 'module expression vector' y ∈ R^n. We define such a linear mapping

$$y = G\mathbf{m} \qquad (8)$$

using the module transition vector m ∈ R^m. For linear module expression, we take m_i = 1 for all i. For centrality module expression, we take m_i = c_i, where c_i ∈ R is the centrality of gene i in a network of the genes in the module.
In this paper, we will use correlation between gene expression levels to generate edges for our module gene network and PageRank centrality [27] for our centrality measure. We choose this because PageRank centrality with a correlation network produced a higher inter-quartile range of balanced success rate (BSR) than degree centrality in the initial pathway expression analysis paper [22]. With these parameters, the computational complexity of module expression for one module is the same as the computational complexity of PageRank. PageRank essentially has the complexity of computing the largest eigenvalue of an m × m matrix, which we can do using the power iteration algorithm. Let N_δ be the number of iterations and δ be the convergence parameter. Then, the computational complexity is O(m² N_δ).
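A hedged sketch of the centrality module expression computation is below; the absolute-correlation edge weights and the damping factor alpha = 0.85 are our illustrative assumptions, not parameters stated in the text:

```python
import numpy as np

def module_expression(G, centrality=True, alpha=0.85, n_iter=100, tol=1e-8):
    """Module expression vector y = G m from equation (8).
    Linear: m_i = 1. Centrality: m_i = PageRank of gene i in a
    correlation network of the module's genes (a sketch)."""
    n, m_genes = G.shape
    if not centrality:
        return G @ np.ones(m_genes)            # linear module expression
    A = np.abs(np.corrcoef(G.T))               # gene-gene correlation weights (assumed)
    np.fill_diagonal(A, 0.0)
    P = A / np.maximum(A.sum(axis=0, keepdims=True), 1e-12)  # column-stochastic
    c = np.full(m_genes, 1.0 / m_genes)
    for _ in range(n_iter):                    # power iteration for PageRank
        c_new = alpha * (P @ c) + (1.0 - alpha) / m_genes
        if np.abs(c_new - c).sum() < tol:
            break
        c = c_new
    return G @ c                               # centrality module expression
```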

Module refinement with LBG clustering
We mathematically formalize module refinement as a mapping between sets of modules. Let F be the set of all features for a data set and {W_1, W_2, . . . , W_s} be a set of WGCNA modules. Note that W_i ⊆ F for each i and ⋃_{i=1}^{s} W_i = F. So module refinement is a mapping between two set partitions,

$$\{W_1, W_2, \ldots, W_s\} \mapsto \{R_1, R_2, \ldots, R_{s'}\}.$$

We call {R_1, R_2, . . . , R_{s′}} the refined modules and R_i the ith refined module. Note: the number of modules can change, since s′ and s are not necessarily equal.
Given the training data for one fold, we use PyWGCNA [28] to compute WGCNA modules. Then we run LBG clustering with each of the four types of centers (eigengene subspace, flag mean, flag median and module expression) initialized with these WGCNA modules. See figure 6 for our data partitioning scheme and figure 1 for a visual of our general workflow. We define a set of refined modules as the modules output from a run of LBG clustering that was initialized with WGCNA. Specifically, each set of features found using this methodology is a refined module.
The centers (i.e. module representatives) in our LBG implementations are the eigengene subspace, flag mean, flag median and module expression. We run LBG clustering with r = 1, 2, 4, 8 dimensional eigengene subspaces, flag means, and flag medians. Eigengene subspaces and module expression are only run with k = 1 dimensional subspace representations for the features (e.g. genes or probe identifiers). We use k = 1 and k = 2 dimensional representations of the features for the flag mean and flag median LBG clustering implementations. With all of these module representatives and their corresponding subspace dimensions, we have a total of 21 methods for module refinement.
For our LBG implementation, we calculate distances between subspaces using the square of the sine of a principal angle between them. For a data dimension of 1 or a module representative dimension of 1, we use the only principal angle. For a data dimension of 2 (and a center dimension of at least 2), we use the second smallest principal angle. A detailed explanation of principal angles can be found in [23]. Let [X] be a feature representative and [Y] a center representative. When min{k, r} = 1, we use

$$d([X], [Y]) = \sin^2(\theta_1([X], [Y])) \qquad (9)$$

for distance in our LBG clustering. Otherwise, we measure distances for clustering using

$$d([X], [Y]) = \sin^2(\theta_2([X], [Y])), \qquad (10)$$

where θ_1 ⩽ θ_2 are the two smallest principal angles. Note: we also ran our experiments on gene expression data with the first principal angle and found comparable and sometimes worse results.
Let d_k be the cluster distortion at iteration k of LBG clustering. We terminate LBG clustering when |d_k − d_{k−1}|/d_{k−1} < 10^{−11} or when we have reached 20 iterations. If we have reached 20 iterations and |d_k − d_{k−1}|/d_{k−1} ⩾ 10^{−11}, then we consider this a failed clustering and do not report the results of this module refinement.
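The following minimal sketch, under our own naming and simplifications, combines the distances (9) and (10) with the LBG loop and the stopping rule above:

```python
import numpy as np

def lbg_distance(X, Y):
    """Squared sine of a principal angle between [X] and [Y]: the only
    angle when min(k, r) = 1 (eq. (9)), else the second smallest (eq. (10))."""
    cos = np.linalg.svd(X.T @ Y, compute_uv=False)   # cosines, decreasing order
    idx = 0 if min(X.shape[1], Y.shape[1]) == 1 else 1
    return 1.0 - cos[idx] ** 2

def lbg_refine(data, centers, center_fn, max_iter=20, tol=1e-11):
    """A minimal LBG sketch: `data` are feature subspace representatives,
    `centers` are initialized from WGCNA modules, and `center_fn`
    recomputes a central prototype from a cluster's members."""
    prev = None
    for _ in range(max_iter):
        D = np.array([[lbg_distance(X, Y) for Y in centers] for X in data])
        labels = D.argmin(axis=1)                    # assign to nearest center
        distortion = D[np.arange(len(data)), labels].sum()
        if prev is not None and abs(distortion - prev) / prev < tol:
            return labels, centers                   # converged
        prev = distortion
        centers = [center_fn([data[i] for i in np.where(labels == j)[0]])
                   for j in sorted(set(labels))]     # drop empty clusters
    return None                                      # failed clustering
```

Here `center_fn` would be, for instance, `lambda members: flag_mean(members, r)` for an r-dimensional flag mean prototype; a run that exits the loop without meeting the tolerance is treated as a failed clustering, as described above.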

Module evaluation
We compare the refined modules found with LBG clustering to the modules that were detected using just WGCNA. We evaluate these modules by (1) comparing their ability to discriminate between phenotypes and (2) comparing the GO significance of the modules.
We begin by partitioning our data into five stratified folds. Then we detect modules on the training data for each fold. We choose to run this cross validation scheme in order to examine module robustness across folds. We only calculate modules on the training data so that we can transfer these features over to the test data to see how well they discriminate between phenotypes.
First, we describe our methods for determining the ability of a set of modules to discriminate between phenotypes. Since we do five-fold cross validation for our module detection, we have a set of modules found on the training data and a sequestered test data set for each of the five folds. So, for each fold and using only the features in one module, we train a support vector machine (SVM) classifier to classify between phenotypes, test it on the sequestered test set, and record the BSR. This gives us a BSR for each module found by each module refinement method and by WGCNA for each fold.
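As a hedged sketch of this evaluation (the text does not state the SVM kernel or hyperparameters, so scikit-learn defaults are assumed, and per-fold module detection is abstracted away into a fixed feature list):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import balanced_accuracy_score

def module_bsr(X, y, module_features, n_splits=5, seed=0):
    """Per-fold BSR of an SVM restricted to one module's features.
    X: samples x features array; y: phenotype labels;
    module_features: column indices of the module's features."""
    bsrs = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        clf = SVC().fit(X[np.ix_(train_idx, module_features)], y[train_idx])
        pred = clf.predict(X[np.ix_(test_idx, module_features)])
        bsrs.append(balanced_accuracy_score(y[test_idx], pred))  # BSR per fold
    return bsrs
```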
We now explain how to compare the GO significance between a refined module detection scheme and WGCNA. This methodology mirrors what was done by Botia et al [18]. For this method, we use the Python package gprofiler to determine the p-value of the overlap between the genes in the module and each GO term. Let A be the set of GO terms. We denote the p-value for the ith module (set of features) with the GO term a ∈ A as p(i, a). Now define the piecewise function σ : R → R, which maps insignificant p-values to 1 so that they contribute nothing to the sum below, as

$$\sigma(p) = \begin{cases} p, & p < 0.05,\\ 1, & p \geqslant 0.05. \end{cases}$$

Then we compute the ontological significance of the ith module as

$$\sum_{a \in A} -\log_{10} \sigma(p(i, a)).$$
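In code, the ontological significance of one module is a short reduction over its vector of GO-term p-values; a sketch, with the 0.05 significance threshold as our assumption:

```python
import numpy as np

def go_significance(pvals, threshold=0.05):
    """Ontological significance of a module: sum of -log10(p) over
    significant GO terms. sigma maps insignificant p-values to 1 so
    they contribute zero; the threshold is an assumed convention."""
    p = np.asarray(pvals, dtype=float)
    sigma = np.where(p < threshold, p, 1.0)
    return float(np.sum(-np.log10(sigma)))
```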
Suppose we have m_j modules in fold j. Given a module detection method, we denote the score (BSR or ontological significance) for the ith module of fold j by s(i, j) ∈ R. Then the score for this method at fold j is the mean over its modules,

$$s(j) = \frac{1}{m_j} \sum_{i=1}^{m_j} s(i, j).$$

Then we can compare the relative gain RG(j) of the score of a refined module detection method, s_r(j), to the score of the original WGCNA modules, s_w(j), by computing

$$RG(j) = \frac{s_r(j) - s_w(j)}{s_w(j) + \epsilon},$$

where ϵ = 0.0001. We do this for each fold to get five different values of RG(j) and estimate the relative gain in score of a given module refinement technique over WGCNA. That is to say, the score of a module refinement technique is better than WGCNA if RG(j) > 0 for each j = 1, 2, . . . , 5. Now we can calculate the mean relative gain for a given module refinement technique and data set as the mean of the relative gains across all five folds, for both BSR and GO significance. We use this mean to rank the module refinement techniques by BSR and GO significance.
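A sketch of the fold-wise relative gain computation (taking the per-fold score to be the mean module score, as reconstructed above):

```python
import numpy as np

def relative_gain(refined_scores, wgcna_scores, eps=1e-4):
    """RG(j) per fold: (s_r(j) - s_w(j)) / (s_w(j) + eps).
    Each argument is a list of per-fold lists of module scores
    (BSR or GO significance); eps placement follows our reconstruction."""
    s_r = np.array([np.mean(s) for s in refined_scores])  # one entry per fold
    s_w = np.array([np.mean(s) for s in wgcna_scores])
    return (s_r - s_w) / (s_w + eps)
```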

Experiments
We begin with some synthetic experiments to gain intuition for the different properties of module expression, the eigengene subspace, the flag mean and the flag median with respect to the distances used in our LBG clustering algorithm. Then we introduce the GSE73072 data set [29] and the Salmonella collaborative cross mice data set from Texas A&M [30,31]. We then go on to evaluate our module refinement techniques with respect to their ability to classify between phenotypes and their significance with respect to GO terms. For the sake of simplicity, we will refer to both probe identifiers and gene identifiers as 'features' for the rest of this section.
For figures 12, 16-18, and table 2 we denote a module refinement method by its central prototype along with its dimensions. The format for this identification is {central prototype}_{center dimension}_{data dimension}. For example, the flag median with a center dimension of 8 and a data dimension of 2 is flag_median_8_2.

Synthetic data
We generate a synthetic module of 'genes' using sklearn.datasets.make_blobs [32] with one blob, 50 features (we will call them genes), and 50 or 10 samples. For example, with 10 samples, this gives us a gene expression matrix for a synthetic module G ∈ R^{10×50}. Using G, we compute the module expression, the eigengene subspace, the flag mean, and the flag median of the genes in the module. We do this with data (i.e. gene) dimensions of 1 and 2 and subspace module representative dimensions (a.k.a. center dimensions) of 1, 2, 4, and 8. We compute a distance matrix between these module representatives and the genes using the methods from section 2.4.
Then we perform multidimensional scaling (MDS) [33] using this distance matrix to visualize our genes and module representatives in figures 2 and 3. For one- and two-dimensional data representations with center dimensions of 1, we see that the module representatives cluster around the center of the gene distribution. As we increase the center dimension, the module representatives pull away from the center of the data. For two-dimensional data representations, we see that the genes are more uniformly spread apart in comparison to one-dimensional data representations.
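A sketch of this synthetic setup (the random seeds and the restriction to pairwise gene distances are our simplifications):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import MDS

# One blob of 50 "genes" observed over 10 samples.
X, _ = make_blobs(n_samples=50, n_features=10, centers=1, random_state=0)
G = X.T                                   # gene expression matrix, 10 samples x 50 genes

# Pairwise squared-sine distances between the spans of the unit-normalized genes.
U = G / np.linalg.norm(G, axis=0)
D = 1.0 - (U.T @ U) ** 2

# Embed the genes (and, in the paper, the module representatives) in 2-D.
embedding = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = embedding.fit_transform(D)
```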
We also compute the sum of the distances (i.e. (9) and (10)), a.k.a. the cost, between each module representative and the 50 genes in the module. We do this for data sets with 10 samples and 50 samples in figures 4 and 5. The flag median and flag mean have the lowest costs. The eigengene's cost exceeds that of the flag mean and flag median by a larger margin for 50 samples than for 10 samples. In general, the module representative costs separate more at higher dimensions. Also, higher dimensional subspace module representations result in lower cost because higher dimensional subspaces are closer together.
Overall, we say the best module representative should have a small cost and be close to the center of the genes in our MDS plots. We will provide a further analysis of these results in section 4.

Gene expression data
The GSE73072 data set [29] is a microarray gene expression data set with 22 277 probe identifiers and 148 human subjects. The human subjects in GSE73072 are infected with various respiratory viruses and are sampled at irregular time intervals from 38 h before infection to 680 h after infection. The data are available on the NCBI Gene Expression Omnibus. We normalize the entire GSE73072 data set with the robust multi-array average method [34]. Our experiments are performed on two studies from the data set that only have subjects with human rhinovirus (HRV). Since this is a microarray data set, we do all our analysis on the probe identifiers. For these data, we classify between controls and shedders. Controls are subjects before infection (i.e. 38 to 0 h before infection) and shedders are subjects shedding HRV between 48 and 64 h after infection. This gives us an HRV data set with 138 samples from human subjects, with 93 controls and 45 shedders.
The second data set is an RNA-sequencing (RNA-seq) data set of collaborative cross mice [31]. A collaborative cross population consists of mice that are bred to simulate the genetic diversity of human populations. An initial study with these data was done by Scoggin et al [30]. All the mice in this data set are infected with Salmonella enterica serotype Typhimurium for 21 d, and RNA-seq samples are taken from the liver and the spleen. The phenotypes in this data set are 'delayed susceptible,' 'tolerant' and 'resistant.' A tolerant mouse is one that has few signs of infection even with high pathogen loads. The 'delayed susceptible' phenotype comprises mice that lived 7 d in the initial experiments but died quickly in the second set of experiments (21 d). For this paper we refer to the 'delayed susceptible' phenotype as just 'susceptible.' There are 26 332 gene identifiers in this data set. For our experiments, we restrict ourselves to working with only the data which are sampled from the liver and only those samples which are labeled tolerant or susceptible. With this restriction to liver samples and only tolerant and susceptible mice, we are left with a total of 29 samples, with 19 labeled tolerant and 10 labeled susceptible. We use the data partitioning methodology in figure 6.

Figure 6. The method for partitioning data for module detection. Here we use the GSE73072 labels, but the same scheme is used for the Salmonella data with susceptible and tolerant as labels rather than control and shedder. We use five-fold cross validation. For each fold, we divide the training data into three sets depending on class and then use these data for module generation. The ellipses indicate that we perform the same process for each fold.
The identifiers for these data partitions are shown in table 1.
A summary of the cross validation methodology in figure 6 is as follows.
(i) Separate the data into five stratified folds.
(ii) Break the training data in each fold into three sets. The first set contains all the training data. The other two sets contain only the data from the subjects in a single class (e.g. for the GSE73072 data, shedders or controls).
(iii) Detect modules on the training data from each of the three sets for each of the five folds.
Using five-fold cross validation in step (i) allows us to analyze module robustness across folds. Since we are detecting modules for each fold, we are able to test each module set on a sequestered test set. Additionally, a stratified split corrects for the class imbalance. In step (ii), we detect modules using each of the three different training data sets for each fold (e.g. controls only, shedders only, and controls and shedders together) to examine the effect of phenotype on module generation.

Module sizes
Since LBG clustering can change the number of modules, there are some experiments where our module refinement techniques detected only one enormous module. Specifically, there are seven modules, across all module refinement methods and folds, with more than 20 000 features. Since we tested 21 module refinement techniques on 30 partitions of our two data sets, such a large module containing nearly all the features arises in only about 1% of the method-partition combinations. These modules are listed in table 2 and are not included in figure 7.
Note: the eight-dimensional subspace prototypes cannot be employed with the salmonella_Liver_susceptible data set given that there are only ten subjects.
In figure 7, we see that our module refinement methods increase the variance in module size while reducing the size of extremely large modules. The only data set where module refinement significantly decreases module size is salmonella_Liver_susceptible. Module size is maintained for two-dimensional subspace data representations with the flag mean and the flag median across all data sets other than salmonella_Liver_susceptible.

Eigengene subspace captures more variance
We plot the explained variance ratio (EVR) for the modules on the training data in figure 8. As we increase the dimension of the eigengene subspace, we see a logarithmic increase in EVR for modules across all data sets and folds. For the salmonella_Liver and the salmonella_Liver_tolerant data sets, the interquartile range of the EVR decreases as the subspace dimension increases. The EVR of the modules on the GSE73072 data sets is generally lower than that of the modules for the Salmonella data sets.
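For reference, the EVR of an r-dimensional eigengene subspace can be computed from the singular values of the mean-centered module matrix; a sketch under the row-centering convention of section 2.2:

```python
import numpy as np

def explained_variance_ratio(G, r):
    """Fraction of the variance of a module's gene expression matrix G
    (n samples x m genes) captured by an r-dimensional eigengene
    subspace, via the singular values of the row-centered matrix."""
    Gbar = G - G.mean(axis=1, keepdims=True)
    s = np.linalg.svd(Gbar, compute_uv=False)
    return np.sum(s[:r] ** 2) / np.sum(s ** 2)
```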

Classification with modules
We evaluate the modules using the BSR of an SVM classifier on the test data for each fold, restricted to the features in a module. In terms of data normalization, we subtract the mean and divide by the variance. We apply an additional log_2 normalization to the GSE73072 data sets only.
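A sketch of this normalization (note that the text says variance rather than standard deviation, and we follow it here; whether that is the intended convention is an assumption):

```python
import numpy as np

def normalize(X, log2_first=False):
    """Feature-wise normalization before classification: optional log2
    (GSE73072 only), then subtract the mean and divide by the variance,
    following the text above."""
    if log2_first:
        X = np.log2(X)
    mu, var = X.mean(axis=0), X.var(axis=0)
    return (X - mu) / var
```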
First we plot the test BSR from each module across data sets for WGCNA in figure 9. The BSR of the modules detected on the different GSE73072 data sets is essentially the same regardless of the data used to generate the modules. The Salmonella data set with all the data has the highest median BSR followed by the modules from the tolerant and the susceptible Salmonella data sets.
We plot the relative gain in classification BSR for each method in figure 10. A robust module detection technique will have low variance in relative gain across folds. Figure 10 is used to determine which dataset to use for a closer analysis of the different module refinement methods. Most methods have a small variance in their relative gain for each dataset. However, the salmonella_Liver_susceptible dataset has a high variability in relative gain across module refinement methods and folds. All the GSE73072 data sets and the salmonella_Liver_susceptible data set have a small negative median relative gain in BSR.
We investigate the relative gain of the module refinement methods on the data sets which produce the highest median relative gain in figure 11. For the GSE73072 data set, the only methods which result in a positive relative gain in BSR are the flag mean and median with two-dimensional subspace data representations and center subspaces with dimension higher than 1. Only the four-dimensional flag median with a data dimension of 2 results in a positive relative gain for each fold.
The mean relative gain in BSR for each method on each data set is in figure 12. In general, two dimensional subspace data representations result in module refinement with the highest mean relative gain in BSR across data sets.

GO significance
Before we consider the relative gains in GO significance, we will consider the GO significance of the WGCNA modules in figure 13. Now we plot the relative gain in GO significance for each method in figure 14. There are at most five points for each method, which correspond to the relative gain over each fold. A module refinement method improves GO significance over WGCNA when it has a positive relative gain, and it is robust when it has a low variance in relative gain across folds.
In figure 14 we see that there is little difference in the relative gain in GO significance across the GSE73072 data sets, but the highest median gain is with the modules generated using data from both shedders and controls. On the other hand, there is some variance in the relative gain in GO significance across the different Salmonella data sets. There are some extremely high outlier relative gains with the modules detected with only the susceptible subjects in the Salmonella data set. These are likely due to the large modules listed in table 2.
We use figure 14 to identify the data sets that produce the modules with the highest median relative gains. Then we plot the relative gain of the refined modules from these data sets in figure 15.
For GO significance with the eigengene subspace, we see that using higher dimensional eigengene subspaces results in a small increase in relative gain on the GSE73072 data set and a substantial decrease in relative gain on the Salmonella data set. This suggests that higher dimensional eigengene subspaces are not optimal module representatives for increasing GO significance with LBG module refinement. For one-dimensional data representations with the flag mean and flag median we see a decrease in GO significance. Only module expression and the one-dimensional flag mean and flag median with one-dimensional data representations result in a positive relative gain in GO significance on the GSE73072 data set. All of the one-dimensional subspace prototypes and the module expression vector result in much higher relative gains in GO significance on the Salmonella data set.

Figure 13. GO significance as a function of data set for WGCNA modules over all folds. We remove the one module that has a GO significance higher than 10 000, which was detected on the salmonella_Liver_susceptible data set.
We summarize the relative gains in GO significance for each data set and central prototype in figure 16. The values of the pixels in this heat map correspond to the mean relative gain in GO significance across folds. Overall, the module refinement techniques that represent the data with two-dimensional subspaces result in negative mean relative gains in GO significance. In contrast, one-dimensional module representatives generally result in positive mean relative gains in GO significance.

Results summary
We rank these module refinement techniques across all data sets using the mean relative gain. The top three methods with the highest relative gains for GSE73072 and Salmonella are in figures 17 and 18. Notice that flag_median_1_1 and module_expression appear in the top three for GO mean relative gain for both data sets. This suggests that these are among the better module refinement techniques for GO significance. For BSR, we see that two-dimensional subspace data representations result in the best relative gains.
Certain module representatives result in modules with increased average GO significance and classification BSR for both the GSE73072 and Salmonella data sets. For example, module expression representation with LBG clustering on the Salmonella data set with all subjects results in modules with an average 5% relative gain in BSR and an average 25% relative gain in GO significance over WGCNA. Figure 19 highlights all such methods which produce positive mean relative gains in both classification BSR and GO significance.

Discussion
In this work we have introduced multiple new methods for gene co-expression module representation and used these module representatives to refine WGCNA gene co-expression modules. We evaluated these refined gene co-expression modules on synthetic data and, on real data, by their relative gain over WGCNA modules in the BSR of an SVM classifier on a sequestered test set and in GO significance according to a statistical test. Our examples with the GSE73072 and Salmonella mice data sets provided insight into the optimal data set for calculating modules along with the ideal module representation technique. Given the proper data set and module representative, we were able to increase classification rates and GO significance over WGCNA modules, even when averaged across five folds of modules.
We do recognize the limitations of five-fold cross validation with these data. For example, we have different samples of the same subjects in training and test. This is acceptable since we are not seeking state-of-the-art accuracy. Instead, we are using this metric as a statistic for module viability. We leave a search for state-of-the-art classification rates using features from refined modules to future work.
Our choice of data set for module detection had a small effect on the relative gains of the refined modules in terms of classification BSR and GO significance. We found that the refined modules computed using subjects of both phenotypes, in either the GSE73072 or Salmonella data sets, produced the highest classification BSRs. On the other hand, using only the tolerant subjects of the Salmonella liver data set produced the highest mean relative gains in GO significance. So perhaps there is more GO signal to be found by detecting refined modules using only data from tolerant mice.

Now we turn to discussing the best module refinement techniques in our experiments. First we visit insights about our module representatives from our synthetic experiments. As we increased the subspace dimension of our module representatives, we saw two things happen. Firstly, this decreased the distance between the module representative and all the genes in the module. Secondly, in our MDS plots, the relative distance between the module representatives and the apparent center of the data gradually decreased. This suggests that we should prefer lower dimensional subspace module representatives if we want our module refinement to find modules with little variance. On the other hand, higher dimensional subspace module representatives introduce more variance into our modules. The MDS plots in our synthetic experiments also suggest that two-dimensional data representation causes the genes in a module to be more uniformly distributed, which may cause our LBG module refinement to detect modules with greater signal variance. Finally, the number of samples affects the cost of the module representatives. The eigengene, flag mean and flag median all produced similar costs for 10 samples. But, when we used 50 samples, the flag mean and flag median produced much lower costs than the eigengene, especially for high center dimensions. This indicates that the eigengene is expected to perform worse in LBG module refinement for higher center dimensions.
The best module refinement methods for our gene expression data sets had an increased relative gain in BSR and GO significance across both data sets. This means the modules from these methods needed to capture enough variance in signal to discriminate between phenotypes while still maintaining enough of a uniform biological signal to result in high GO significance. These methods are colored red in figure 19. The modules computed on data sets with both phenotypes of subjects had positive relative gains in both classification BSR and GO significance (e.g. gse73072_hrv_48_64). The module representatives for these ideal module refinement methods were the one-dimensional flag mean and one-dimensional flag median, with one-dimensional subspace data representations, and the module expression vector.
For both the Salmonella and GSE73072 data sets, the methods which had the highest relative gains in BSR did not also have the highest gains in GO significance. Specifically, in figures 17 and 18, there was no central prototype in the top three relative gains for both BSR and GO significance. The one-dimensional flag mean and flag median with one-dimensional subspace data representations, along with module expression, were the best LBG cluster refinement methods for increasing GO significance. On the other hand, two-, four- and eight-dimensional subspace prototypes, all with two-dimensional data representations, produced the highest relative gains in BSR. In general, we understand that lower dimensional subspace prototypes and module expression result in modules with uniform biological signal, whereas higher dimensional subspace prototypes result in modules with higher variance signal, which improves classification rates.
We also notice that the two-dimensional flag mean and median, along with the eight-dimensional flag median, resulted in LBG runs which did not converge within 20 iterations. Due to the scope of our study, we chose not to include these experiments. Future work could increase the number of iterations of LBG clustering for these methods in the hope of eventual convergence. Even better, one could work on proving convergence results for LBG clustering with subspace prototypes (e.g. the flag mean and flag median). Then, using this criterion, one could determine the optimal central prototype and subspace dimension for module refinement using LBG clustering.
We now compare our findings and algorithms to those of Botia et al [18]. They suggested a method for module refinement using k-means clustering with correlation distance and one-dimensional eigengene module representation. Our LBG module refinement with one-dimensional center and data representations mirrors what was done by Botia et al, since correlation distance is just the square of the sine of the principal angle between one-dimensional subspaces and LBG clustering is a cousin of k-means clustering. What we have done is essentially provide a generalization of k-means module refinement using different options for module representatives (e.g. the flag mean), while leveraging a generalization of correlation distance to higher dimensional subspaces, to improve upon the results of Botia et al. In our results on gene expression data, we found that the best module representatives for LBG module refinement are never the one-dimensional eigengene suggested by Botia et al.
There are numerous directions for potential future work with these new module representatives. Overall, our work points toward module expression, the flag mean, and the flag median with one-dimensional center and data representations as the ideal parameters for LBG module refinement. However, for future work with a new data set, we suggest the use of a validation partition to determine the preferred module representative. One could also investigate the utility of LBG module refinement with different initial clustering algorithms for co-expression modules, like FLAME [11], or conduct an in-depth comparison between our methods and other module refinement techniques on various data sets. This could lead to a comprehensive survey summarizing prominent methods for detecting and refining gene co-expression modules that would provide even more insight into which module refinement technique should be used for a given data set.

Data availability statement
The data that support the findings of this study (the Salmonella mice data and the human respiratory virus data, GSE73072) are available upon reasonable request from the authors. The data will be publicly deposited at a later date, along with a functional biology focused manuscript that is in preparation.