scGMM-VGAE: a Gaussian mixture model-based variational graph autoencoder algorithm for clustering single-cell RNA-seq data

Cell type identification using single-cell RNA sequencing data is critical for understanding disease mechanisms and drug discovery. Cell clustering analysis has been widely studied in health research for rare tumor cell detection. In this study, we propose a Gaussian mixture model-based variational graph autoencoder for scRNA-seq data (scGMM-VGAE) that integrates a statistical clustering model into a deep learning algorithm to significantly improve cell clustering performance. This model feeds a cell-cell graph adjacency matrix and a gene feature matrix into a variational graph autoencoder (VGAE) to generate latent data. These data are then used for cell clustering by the Gaussian mixture model (GMM) module. To optimize the algorithm, a tailored loss function is derived by combining parameter estimates from the GMM and the VGAE. We test the proposed method on four publicly available and three simulated datasets, which contain many biological and technical zeros. The scGMM-VGAE outperforms four selected baseline methods on three evaluation metrics in cell clustering. By successfully incorporating the GMM into the deep learning VGAE, the proposed method achieves higher accuracy in cell clustering on scRNA-seq data. This improvement has a significant impact on detecting rare cell types in health research. All source codes used in this study can be found at https://github.com/ericlin1230/scGMM-VGAE.


Introduction
Single-cell RNA-sequencing (scRNA-seq) methods for measuring gene expression at the single-cell level have advanced significantly and have been widely applied to research in microbiology, biomedical sciences, neuroscience, and oncology (Wei et al 2022). In contrast to traditional bulk RNA-sequencing, scRNA-seq begins with cell isolation before RNA extraction (Chaudhry et al 2019), which is useful for small populations of target cells and for comparing the various cell types in a sample. Due to their heterogeneity, high dimensionality, scale, and noise, scRNA-seq data are more challenging to analyze than RNA sequencing data from bulk cell populations (Wang et al 2022). scRNA-seq data offer high resolution; thus, effectively analyzing and mining the complex relationships and potential regulation patterns between cells is critical to the success of downstream research and personalized medicine (Su et al 2021).
Clustering analysis has received significant attention from bioinformatics researchers in analyzing gene expression data (Govek et al 2019), cancer research, and personalized medicine. Cell clustering is a fundamental step in scRNA-seq data computational analysis, in which clustering is performed to classify a set of cells from unknown cell groups into meaningful cell types based on transcriptome similarity or existing biological knowledge (Krzak et al 2019, Li et al 2020, Shiga et al 2021). The results of cell clustering are important for a variety of downstream analyses, including cell type differential expression, cellular composition estimation, and rare cell type discovery (Chen et al 2019). Thus, methods for cell clustering analysis have been in high demand for answering research questions using scRNA-seq data. However, due to the gene expression nature of these data, issues such as high noise and a large proportion of missing values (zeros) arising from non-sequenced transcripts pose a challenge to clustering accuracy (Haque et al 2017). Since scRNA-seq data are high-dimensional, traditional clustering methods are limited in accuracy due to the 'curse of dimensionality' (Wang et al 2022).
As a result, several state-of-the-art clustering methods in machine learning have been recently developed to cluster scRNA-seq data to improve accuracy. Seurat (Butler et al 2018, Stuart et al 2019, Hao et al 2021) includes a graph-based clustering analysis tool for scRNA-seq data that uses a shared nearest neighbor (SNN) modularity optimization-based clustering algorithm to identify cell clusters. Lopez et al (2018) presented the single-cell variational inference (scVI) algorithm, which groups similar cells by first generating conditional distributions with a deep neural network and then applying them to a hierarchical Bayesian model. Deep count autoencoder (DCA) networks (Eraslan et al 2019) use a negative binomial noise model with or without zero-inflation to capture nonlinear gene-gene dependencies and denoise scRNA-seq datasets while accounting for count distribution, over-dispersion, and sparsity. For graph-structured data, the variational graph auto-encoder (VGAE) (Kipf and Welling 2016), an unsupervised learning framework, is a latent variable model that minimizes the reconstruction error of the data by optimizing the variational lower bound using a single Gaussian distribution. Buterez et al (2021) introduced the cell variational graph autoencoder (CellVGAE), which applies VGAE to scRNA-seq data.
Although these promising deep learning approaches outperform traditional scRNA-seq clustering methods in terms of clustering performance, cell clustering analysis using an extendable framework that integrates statistical clustering with powerful deep learning approaches is still not extensively studied. The Gaussian mixture model (GMM) has been combined with deep learning models to overcome limitations in distribution and data structure. Jiang et al (2017) applied GMM to variational deep embedding using images for both input and output. Guo et al (2020) presented a variational autoencoder (VAE) model that optimizes GMM priors to overcome the underfitting issue in VAE caused by using a standard normal distribution; the performance of the model was verified on the MNIST, Omniglot, and Frey Face image datasets. Hui et al (2020) later developed a collaborative graph convolutional network (GCN) in which a Gaussian mixture model-based variational graph autoencoder (GMM-VGAE) component was used for attributed graph clustering. This unsupervised generative clustering framework combines a statistical model (GMM) and VGAE for clustering tasks, where the single Gaussian prior distribution is replaced by a mixture of Gaussian prior distributions. The GMM employs the expectation maximization (EM) algorithm (Do and Batzoglou 2008) to account for initial cluster noise, sensitivity, and outliers. As a result, this probabilistic clustering model has also been widely incorporated into dimension-reduction deep learning frameworks to deliver more robust cell clustering results in single-cell sequencing data. Xiong et al (2019) proposed single-cell assay for transposase-accessible chromatin with sequencing (ATAC-seq) analysis via latent feature extraction, which combines the VAE framework and a probabilistic GMM to learn latent features for analyzing scATAC-seq data.
Yu et al (2021) proposed scGMAI, a GMM based on autoencoder networks that can identify cell types and cluster cells in scRNA-Seq data by reducing the impact of missing values. However, scGMAI is a multistage pipeline in which the GMM parameters are not used in the learning process of autoencoder networks.
In this study, with a focus on high-dimensional scRNA-seq data, we consider a single-cell Gaussian mixture model-based variational graph autoencoder (scGMM-VGAE) framework which primarily applies to these data for cell clustering. In this framework, the GMM module is used to cluster cell types from latent data encoded by the VGAE. GMM parameters contribute to the VGAE learning procedure, so that the probabilistic GMM takes the distribution of the low-dimensional representation generated by the VGAE into account. The latent features that capture the essential characteristics of scRNA-seq data are then used to cluster cells. As benchmark methods, we consider Seurat, DCA + Leiden, scVI, and CellVGAE. The clustering analysis results on real and simulated data show that the proposed algorithm outperforms the baseline models in cell clustering performance on scRNA-seq data.

Real datasets
We consider three gold standard real labeled scRNA-seq datasets in this study: Baron3, Baron4 (Baron et al 2016), and Darmanis (Darmanis et al 2015). The true cell labels are included along with the cells × genes count matrix. The sequenced cells in both the Baron3 and Baron4 datasets are from human pancreatic islets. Specifically, Baron3 data are collected from a male donor aged 38 with a BMI of 27.5 and no diabetes, while Baron4 data are collected from a female donor aged 59 with a BMI of 29.9 and type 2 diabetes mellitus. Baron3 and Baron4 share the same scRNA experimental protocol, sequenced genes, and numbers of clusters but have different numbers of cells. In this study, these two datasets are named Baron3 data and Baron4 data, respectively. Darmanis et al (2015) generated a human brain scRNA-seq dataset capturing the cellular complexity of the fetal and adult human brain at the whole-transcriptome level. This dataset is named Darmanis data. The cells are divided into eight categories, which correspond to eight major brain cell types (e.g. astrocytes, oligodendrocytes, oligodendrocyte precursor cells, neurons, microglia, and vascular cells). In addition, we consider a dataset without true labels, which includes liver cells (hepatoblasts, hepatocytes, and cholangiocytes) of E10.5 to E17.5 mouse embryos (Yang et al 2017). We denote this dataset as Yang data. A summary of the number of cells, genes, and clusters for both the labeled and the unlabeled datasets is described in table 1. These publicly available scRNA-seq datasets are available on the Gene Expression Omnibus (GEO) database. The accession numbers of Baron3 and Baron4 data are GSM2230759 and GSM2230760, respectively. Darmanis data can be downloaded at https://github.com/ciortanmadalina/single-cell-sota/tree/master/input/brainCIDR. Raw Darmanis data can also be found at GEO (accession no. GSE67835). The accession number for the raw data files of Yang data reported in this study is GSE90047.
The R-package Seurat v3 (Stuart et al 2019) is used for pre-processing scRNA-seq data. Since scRNA-seq data are usually highly variable (Cui et al 2021), to stabilize the variation within the feature matrix, we apply log-normalization, a commonly used pre-processing procedure (Booeshaghi and Pachter 2021). In this research, we consider the top 1200 genes, which are selected by the variance stabilizing transformation (VST) selection method implemented in the FindVariableFeatures function of the Seurat v3 package. VST chooses the top variable genes using the relationship between variance and mean (Buterez et al 2021).
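The pre-processing steps above (per-cell log-normalization followed by variable-gene selection) can be sketched in Python. This is a simplified illustration, not the paper's Seurat pipeline: the `top_variable_genes` helper ranks genes by the variance of their log-normalized expression, whereas Seurat's VST additionally fits a variance-mean trend before standardizing.

```python
import numpy as np

def log_normalize(counts, scale=1e4):
    """Seurat-style LogNormalize: scale each cell's counts by its library
    size, multiply by a scale factor, then apply log1p.

    counts: (n_cells, n_genes) raw count matrix.
    """
    lib = counts.sum(axis=1, keepdims=True)
    return np.log1p(counts / lib * scale)

def top_variable_genes(counts, n_top=1200):
    """Simplified stand-in for Seurat's VST selection: rank genes by the
    variance of their log-normalized expression and keep the top n_top."""
    x = log_normalize(counts)
    var = x.var(axis=0)
    return np.argsort(var)[::-1][:n_top]

# Toy count matrix: 50 cells x 2000 genes.
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(50, 2000)).astype(float)
idx = top_variable_genes(counts, n_top=1200)
X = log_normalize(counts)[:, idx]   # processed feature matrix, cells x genes
```

The resulting matrix X corresponds to the processed n × p feature matrix used as input throughout the paper.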

Simulated datasets
To further assess the performance of scGMM-VGAE in terms of stability and time efficiency for cell clustering tasks, we conduct a simulation analysis using the proposed method on simulated datasets. These datasets are generated using the state-of-the-art simulation algorithm SPARSim (Baruzzo et al 2020). In the simulation, SPARSim first estimates the gene expression level intensities (Z), gene expression level variabilities (Φ), and library sizes (L) from a given template to simulate a dataset with similar properties. The expression level X_ij of gene i in cell j is then modeled using a gamma distribution parameterized by the estimated intensity and variability. To represent systematic and sample-dependent bias, SPARSim simulates technical variability in the observed counts Y_j for cell j, where Y_j is modeled using a multivariate hypergeometric distribution parameterized with n = L_j and m = X_j.
In this study, to simulate single-cell datasets, we consider the Zheng preset template implemented in SPARSim. The parameters in the preset, i.e. intensities, variabilities, and library sizes, are estimated from a peripheral blood mononuclear cells sample; the original template dataset contains 3388 cells, 15 041 genes, and 4 labeled cell types. The parameters estimated from the Zheng dataset are preset in the algorithm. To evaluate the clustering stability of scGMM-VGAE on datasets of different sizes, we simulate two additional datasets with 1694 cells and 6776 cells, respectively. These datasets are generated using the Zheng preset by halving and doubling the cell numbers of each cell type. As for the simulated genes, we select the top 1200 genes from the original templates using VST, to be consistent with the real datasets. We consider ten replications for each simulation setting and calculate the mean and standard deviation of each selected evaluation metric to evaluate the clustering performance of the proposed method. Details of the simulated datasets are summarized in table 2.
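The SPARSim generative scheme described above can be mimicked in a few lines of numpy: gamma-distributed expression levels for biological variability, then a multivariate hypergeometric draw per cell for technical variability. This is an illustrative sketch only; the parameter mapping (lognormal intensities, uniform variabilities) is invented for the example and is not SPARSim's estimated Zheng preset.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_cells = 500, 20

# Biological variability: expression level X_ij ~ Gamma, with gene-wise
# intensity z_i (mean) and variability phi_i (illustrative parameterization).
z = rng.lognormal(mean=1.0, sigma=1.0, size=n_genes)   # intensities Z
phi = rng.uniform(0.1, 1.0, size=n_genes)              # variabilities Phi
shape = 1.0 / phi
X = rng.gamma(shape[:, None], (z / shape)[:, None], size=(n_genes, n_cells))

# Technical variability: observed counts Y_j for cell j are a multivariate
# hypergeometric draw with n = L_j (library size) and m = X_j.
L = rng.integers(2000, 4000, size=n_cells)             # library sizes L
X_int = np.maximum(np.rint(X).astype(np.int64), 0)
Y = np.column_stack([
    rng.multivariate_hypergeometric(
        X_int[:, j], min(int(L[j]), int(X_int[:, j].sum())))
    for j in range(n_cells)
])
# Y is the simulated genes x cells count matrix with sequencing-style bias.
```

The downsampling step guarantees Y_ij ≤ X_ij element-wise, which is how SPARSim injects library-size-dependent technical zeros.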

Architecture of scGMM-VGAE
The scGMM-VGAE is a deep learning framework utilizing the benefits of GMM in discovering the distribution of scRNA-seq data to deliver more precise clustering results. Given the filtered and normalized cells × genes matrix as input, we use the R-package SingleCellExperiment 3.14 to build a K-nearest neighbor-based cell-cell graph for each dataset. The optimal K to produce the best results for our selected datasets was reported as 5 (Booeshaghi and Pachter 2021). The scGMM-VGAE framework takes a processed feature matrix X of dimension n × p (i.e. cells × genes) and an adjacency matrix A of the cell-cell graph of dimension n × n as inputs. The inputs are then passed through a variational autoencoder network with a two-layer GCN encoder, integrated with the GMM. The scGMM-VGAE model has two stages: a pre-train stage and a training stage. The actual GMM is fitted in the pre-train stage, where its parameters are introduced as 'guidance' for the VGAE component. In the training stage, the GMM-learned parameters (means, variances, and weights of clusters) are used in the loss function of the VGAE. The loss function is modified based on the variational evidence lower bound (ELBO) (Kingma and Welling 2014). The unsupervised learning algorithm scGMM-VGAE aims to partition the cells captured in a graph into K clusters without using cell labels. The algorithm then returns as outputs a reconstructed adjacency matrix of the cell-cell graph and cell clustering results. However, in this study, we focus on the cell clustering task of the algorithm. The overview of scGMM-VGAE in clustering scRNA-seq data is illustrated in figure 1. To define the scGMM-VGAE structure and clustering procedure for scRNA-seq data, we first define some basic notations and the GMM.
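The paper builds the K-nearest-neighbor cell-cell graph with the R-package SingleCellExperiment; a rough Python equivalent, assuming scikit-learn's `kneighbors_graph` as a stand-in, looks like this. The symmetrization rule (keep an edge if either cell lists the other as a neighbor) is one common convention, not necessarily the one used in the original pipeline.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def build_cell_graph(X, k=5):
    """Build a symmetric K-nearest-neighbor cell-cell adjacency matrix A
    (n x n) from the processed feature matrix X (cells x genes).
    K = 5 follows the setting reported in the paper."""
    A = kneighbors_graph(X, n_neighbors=k, mode="connectivity",
                         include_self=False)
    A = A.maximum(A.T)   # symmetrize: edge if either cell lists the other
    return A.toarray()

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 1200))   # toy processed feature matrix
A = build_cell_graph(X, k=5)      # binary n x n adjacency matrix
```

Together, X and A form the two inputs fed into the scGMM-VGAE encoder.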

Background and notations
Consider an undirected graph G = (V, E), where V is a set of n nodes, E is a set of edges between nodes, and X ∈ R^{n×p} is the attribute feature matrix of all nodes. In the context of scRNA-seq data, n and p represent the number of cells and the number of genes, respectively. To better understand the structure of scGMM-VGAE, which is also a GCN, we consider a graph convolutional layer, a layer-wise convolution transforming the lth hidden layer H^{(l)} to H^{(l+1)} (Kipf and Welling 2017). Given the adjacency matrix A of a cell-cell graph and a processed feature matrix X, A = [a_{ij}] ∈ R^{n×n} is the topological structure of G, where a_{ij} = 1 if nodes v_i and v_j are connected by an edge e_{ij}, and a_{ij} = 0 otherwise. The spectral convolution function f_φ is defined as

$$H^{(l+1)} = f_{\phi}\big(H^{(l)}, A \mid W^{(l)}\big) = \phi\big(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\,H^{(l)}\,W^{(l)}\big),$$

where H^{(0)} = X, Ã = A + I_n, and D̃_{ii} = Σ_j Ã_{ij}. Here W^{(l)} denotes the weight parameters, φ is an activation function (e.g. rectified linear unit (ReLU), sigmoid, tanh), and I_n is the identity matrix.
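A single graph-convolutional layer of this form is only a few lines of numpy; the sketch below implements the symmetric normalization D̃^{-1/2} Ã D̃^{-1/2} followed by the linear map and a ReLU activation.

```python
import numpy as np

def gcn_layer(A, H, W, act=lambda x: np.maximum(x, 0.0)):
    """One graph-convolutional layer:
    H_next = act( D~^{-1/2} (A + I) D~^{-1/2} @ H @ W ),
    where D~ is the degree matrix of A~ = A + I (self-loops added)."""
    n = A.shape[0]
    A_tilde = A + np.eye(n)                    # add self-loops
    d = A_tilde.sum(axis=1)                    # degrees D~_ii
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return act(D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W)

rng = np.random.default_rng(0)
A = (rng.random((6, 6)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                 # symmetric, zero diagonal
H0 = rng.normal(size=(6, 4))                   # H^(0) = X (6 cells, 4 features)
W0 = rng.normal(size=(4, 3))                   # layer weights W^(0)
H1 = gcn_layer(A, H0, W0)                      # hidden layer H^(1), ReLU
```

Stacking two such layers, with the second producing the Gaussian parameters, yields the encoder described next.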

Encoder and decoder modules
The autoencoder is used for embedding the input scRNA-seq gene expression data X ∈ R^{n×p} into a low-dimensional space, where n and p are the number of cells and the number of genes, respectively. The autoencoder is an unsupervised neural network consisting of encoder and decoder modules. Kipf and Welling (2016) proposed a simple inference model parameterized by a two-layer GCN, defined as

$$q(H \mid X, A) = \prod_{i=1}^{n} q(h_i \mid X, A), \qquad q(h_i \mid X, A) = \mathcal{N}\big(h_i \mid \mu_i, \operatorname{diag}(\sigma_i^2)\big),$$

where h_i is the stochastic latent variable, and μ_i and σ_i are the vectors of GCN means and standard deviations, respectively. The encoder takes the adjacency matrix A and the processed feature matrix X and generates the latent variable matrix H.
For simplicity, we define β_gcn ∈ {μ_gcn, σ_gcn}, where μ_gcn and σ_gcn are the matrices of the μ_i and σ_i, respectively. Letting W^{(0)} be the weight of the first layer and W_β^{(1)} be the weight for β_gcn, the architecture of the encoder is as follows:

$$H^{(1)} = f_{\mathrm{ReLU}}\big(X, A \mid W^{(0)}\big), \qquad H^{(2)}_{\beta_{gcn}} = f_{\beta}\big(H^{(1)}, A \mid W^{(1)}_{\beta}\big),$$

where the two output layers, producing μ_gcn = H^{(2)}_{μ_gcn} and log σ_gcn = H^{(2)}_{σ_gcn}, share the first-layer weights W^{(0)}.
The decoder reconstructs the structure of the adjacency matrix A using a generative model defined as an inner product between latent variables:

$$p\big(\hat{A} \mid H\big) = \prod_{i=1}^{n}\prod_{j=1}^{n} p\big(\hat{A}_{ij} \mid h_i, h_j\big), \qquad p\big(\hat{A}_{ij} = 1 \mid h_i, h_j\big) = \phi_D\big(h_i^{\top} h_j\big),$$

where φ_D is an activation function (typically the sigmoid) and Â is the reconstructed adjacency matrix from the decoder.
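The full encoder-decoder pass, including the reparameterization trick h = μ + σ ⊙ ε used to make sampling differentiable, can be sketched in numpy as follows. This is a forward pass only with random weights, intended to show the shapes and the data flow, not a trained model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def normalize_adj(A):
    """Symmetrically normalized adjacency D~^{-1/2} (A + I) D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def vgae_forward(X, A, W0, W_mu, W_sigma, rng):
    """Two-layer GCN encoder (shared first-layer weights W0) and
    inner-product decoder, in the style of Kipf & Welling's VGAE."""
    A_n = normalize_adj(A)
    H1 = np.maximum(A_n @ X @ W0, 0.0)        # first GCN layer, ReLU
    mu = A_n @ H1 @ W_mu                      # GCN mean mu_gcn
    log_sigma = A_n @ H1 @ W_sigma            # GCN log-std sigma_gcn
    eps = rng.normal(size=mu.shape)
    H = mu + np.exp(log_sigma) * eps          # reparameterization trick
    A_rec = sigmoid(H @ H.T)                  # p(A_ij = 1 | h_i, h_j)
    return H, A_rec, mu, log_sigma

rng = np.random.default_rng(0)
n, p, h1, pe = 8, 10, 6, 3                    # cells, genes, hidden, embedding
X = rng.normal(size=(n, p))
A = (rng.random((n, n)) < 0.3).astype(float)
A = np.triu(A, 1); A = A + A.T                # symmetric toy adjacency
H, A_rec, mu, log_sigma = vgae_forward(
    X, A, rng.normal(size=(p, h1)), rng.normal(size=(h1, pe)),
    rng.normal(size=(h1, pe)), rng)
```

The latent matrix H produced here is exactly what the GMM module, described next, clusters into K components.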

GMM module
GMM has been widely used for unsupervised model-based clustering of complex data (Reynolds 2015, McLachlan et al 2019). In this study, we consider the latent matrix H = [h_{id}]_{n×p_e}, where i = 1, 2, …, n and d = 1, 2, …, p_e. Each row h_i represents one cell, and p_e is the latent embedding size. The goal of using GMM is to cluster the h_i into K components or clusters. The probability density function of a K-component GMM is given by

$$p(h_i) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}\big(h_i \mid \mu_k, \Sigma_k\big),$$

where μ_k and Σ_k are the mean vectors and covariance matrices of the mixture components, respectively. For k = 1, 2, …, K, π_k > 0 is the weight of the kth mixture component, restricted to Σ_{k=1}^{K} π_k = 1. In the proposed method, we consider Σ_k to be a diagonal covariance.
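Fitting a diagonal-covariance GMM to a latent matrix H takes one call with scikit-learn; the toy example below uses synthetic well-separated embeddings in place of VGAE output.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy latent matrix H (n cells x p_e embedding dims): three separated blobs
# standing in for the VGAE latent representation.
H = np.vstack([rng.normal(c, 0.2, size=(40, 3)) for c in (0.0, 3.0, 6.0)])

# Diagonal-covariance K-component GMM; EM estimates the weights pi_k,
# means mu_k, and diagonal covariances Sigma_k.
gmm = GaussianMixture(n_components=3, covariance_type="diag", random_state=0)
labels = gmm.fit_predict(H)   # hard cluster assignment per cell
```

In scGMM-VGAE these fitted parameters are not discarded after clustering: they feed back into the VGAE loss, as described in the next section.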

Clustering optimization
Hui et al (2020) proposed GMM-VGAE as an attributed graph clustering model. We extend this model to cluster cells in scRNA-seq data, which contain a significant number of biological and technical zeros. In this algorithm, the number of cell clusters K is either pre-determined based on prior information (e.g. the number of labels in labeled data) or estimated from the unlabeled data. Given K, an adjacency matrix A derived from a cell-cell graph, and a processed feature matrix X, an initial clustering assignment is conducted using GMM to obtain the estimates of the cluster parameters π_k, μ_k, σ_k, which are the weight, mean, and standard deviation of cluster k, such that π ∈ R_+^K and Σ_{k=1}^{K} π_k = 1. These parameters are then merged to obtain the combined loss function for the framework. The scGMM-VGAE is trained and optimized by maximizing the modified loss function, which can be defined as

$$\mathcal{L}_{CL} = \mathbb{E}_{q(h,k \mid x,\alpha)}\left[\log \frac{p(\alpha, h, k)}{q(h, k \mid x, \alpha)}\right],$$

where p(α, h, k) = p(α|h) p(h|k) p(k) and q(h, k|x, α) = q(h|x, α) q(k|x, α) is the variational posterior. To optimize the clustering (CL) loss L_CL using standard stochastic gradient methods, we utilize the stochastic gradient variational Bayes estimator and the reparameterization trick, as proposed for the ELBO loss in the VAE (Kingma and Welling 2014). The L_CL can also be divided into a reconstruction loss L_REC and a Kullback-Leibler (KL) divergence loss L_KL. Once training is completed by maximizing L_CL, the latent representation h can be obtained from q(h|x, α) = N(h; μ, σ²I), where μ and σ² are derived from H^{(2)}_{μ_gcn} and H^{(2)}_{σ_gcn}. The clustering assignment of cell i can then be obtained as

$$k_i = \arg\max_{k} \; q(k \mid x, \alpha),$$

i.e. each cell is assigned to the mixture component with the highest posterior probability. The general approach of scGMM-VGAE clustering is presented in algorithm 1.
Algorithm 1. scGMM-VGAE clustering.
Input: number of clusters K, processed feature matrix X, adjacency matrix A, and maximum number of iterations T_max.
Output: cell cluster labels.
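The decomposition of the clustering loss into a reconstruction term and a KL term can be sketched numerically. The sketch below is a simplification of the full mixture ELBO: L_REC is the binary cross-entropy between the observed and reconstructed adjacency matrices, and the KL term is computed in closed form between the diagonal-Gaussian posterior and a single stand-in GMM component (mu_k, var_k), which in the full algorithm would be the responsible component for each cell.

```python
import numpy as np

def recon_loss(A, A_rec, eps=1e-10):
    """Reconstruction term L_REC: element-wise binary cross-entropy
    between observed adjacency A and reconstructed A_rec."""
    return -np.mean(A * np.log(A_rec + eps)
                    + (1 - A) * np.log(1 - A_rec + eps))

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """Closed-form KL( N(mu_q, diag var_q) || N(mu_p, diag var_p) ) per cell;
    mu_p, var_p play the role of one GMM component's parameters."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0,
        axis=1)

rng = np.random.default_rng(0)
n, pe = 8, 3
A = (rng.random((n, n)) < 0.3).astype(float)                 # toy adjacency
A_rec = 1.0 / (1.0 + np.exp(-rng.normal(size=(n, n))))       # toy decoder output
mu_q = rng.normal(size=(n, pe))                              # posterior means
var_q = rng.uniform(0.5, 1.5, size=(n, pe))                  # posterior variances
mu_k, var_k = np.zeros(pe), np.ones(pe)                      # one GMM component
L_CL = recon_loss(A, A_rec) + np.mean(kl_diag_gauss(mu_q, var_q, mu_k, var_k))
```

Minimizing this combined quantity (equivalently, maximizing its negative as an ELBO) is what couples the GMM parameters to the VGAE gradients during training.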

Hyper-parameters setting
In the scGMM-VGAE framework, the number of neurons, the embedding size, and the optimizer are the required hyper-parameters. We consider numbers of neurons from 24 to 64 and embedding sizes from 12 to 32, as values outside these ranges either produced frequent errors or created unstable results. Candidate activation functions include the sigmoid, tanh, and ReLU functions; we select the sigmoid, since the other functions show worse results in our experiments, which may be due to the sparse nature of the data. Among several available optimizers (e.g. Adam, stochastic gradient descent (SGD), and RMSProp), the Adam optimizer shows the best results. The final parameters are applied to all datasets for the scGMM-VGAE algorithm, and the optimal parameters corresponding to each dataset are reported in tables S1 and S2 of the supplementary information. All source codes used in this study can be found at https://github.com/ericlin1230/scGMM-VGAE.

Benchmark methods
To compare the clustering results with the scGMM-VGAE method, we consider four baseline methods. The same pre-processed cells × genes feature matrix is used in all the baseline methods. While scVI requires only the feature matrix as input, CellVGAE requires both the feature matrix and the cell-cell adjacency matrix of the graph.

scVI
scVI is a deep neural network, which first generates conditional distributions and applies them to a hierarchical Bayesian model (Lopez et al 2018). While the encoder of the network encodes cell information using a non-linear transformation, the decoder uses another non-linear transformation to create a posterior estimate of the genes. This allows similar cells to be grouped.

CellVGAE
This method uses VGAE with encoders and decoders to learn latent variables (Buterez et al 2021). Cell-VGAE encoders employ graph attention networks which are a neural framework for graph data. The graph is later reconstructed from the latent variables, which are used to create clusters.

Seurat
This clustering method includes a graph-based clustering analysis tool for scRNA-seq data, which applies an SNN modularity optimization-based clustering algorithm to identify the cell clusters (Stuart et al 2019).

DCA + Leiden
DCA networks use a negative binomial noise model with or without zero-inflation to capture the nonlinear gene-gene dependencies (Eraslan et al 2019). The network first denoises scRNA-seq datasets and then clusters the cells by Leiden clustering.

Evaluation metrics and visualization for clustering
To evaluate the clustering performance of scGMM-VGAE in comparison to four baseline models, we consider three metrics: adjusted Rand index (ARI), normalized mutual information (NMI), and Silhouette score. However, for unlabeled datasets, only the Silhouette metric can be considered since this metric does not require truth labels. Furthermore, to visualize the clustering results, we consider uniform manifold approximation and projection (UMAP).

ARI
ARI calculates the similarity between the true labels and the predicted labels of a given clustering. The Rand index ranges between 0 and 1, where 0 represents a random result and 1 represents complete agreement between the clusterings (Yeung and Ruzzo 2001); the adjusted version, ARI, can yield negative values when the index is smaller than its expected value. Let N_{ij} denote the number of cells with true label i assigned to cluster j, N_{i·} the number of cells with true label i, N_{·j} the number of cells assigned to cluster j, and N the total number of cells. The ARI is then defined as

$$\mathrm{ARI} = \frac{\sum_{ij}\binom{N_{ij}}{2} - \left[\sum_i \binom{N_{i\cdot}}{2}\sum_j \binom{N_{\cdot j}}{2}\right]\Big/\binom{N}{2}}{\frac{1}{2}\left[\sum_i \binom{N_{i\cdot}}{2} + \sum_j \binom{N_{\cdot j}}{2}\right] - \left[\sum_i \binom{N_{i\cdot}}{2}\sum_j \binom{N_{\cdot j}}{2}\right]\Big/\binom{N}{2}}.$$
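ARI is available directly in scikit-learn; the small example below shows its invariance to label permutation and its sensitivity to a single misassigned cell.

```python
from sklearn.metrics import adjusted_rand_score

true = [0, 0, 0, 1, 1, 1, 2, 2]
pred = [1, 1, 1, 0, 0, 2, 2, 2]   # same partition up to one misassigned cell

perfect = adjusted_rand_score(true, true)   # identical partitions -> 1.0
score = adjusted_rand_score(true, pred)     # high, but below 1 due to the
                                            # single misassignment
```

Because ARI compares partitions rather than label values, any relabeling of a perfect clustering still scores 1.0.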

NMI
The mutual information (MI) score is used to measure similarity and exploits the grouping property (Kraskov and Grassberger 2009). NMI is a normalization of MI in which MI is divided by the generalized mean of the entropies of the true and predicted labels:

$$\mathrm{NMI}(Y, C) = \frac{\mathrm{MI}(Y, C)}{\operatorname{mean}\big(H(Y), H(C)\big)},$$

where Y is the predicted cell labels, C is the true cell labels, and H(Y) and H(C) are the entropies of the predicted and true cell labels, respectively. Thus, NMI scales the result from 0 (no mutual information) to 1 (perfect correlation).
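As with ARI, scikit-learn provides NMI directly; by default it normalizes MI by the arithmetic mean of the two entropies (`average_method="arithmetic"`), one choice of the generalized mean mentioned above.

```python
from sklearn.metrics import normalized_mutual_info_score

true = [0, 0, 0, 1, 1, 1, 2, 2]
pred = [2, 2, 2, 0, 0, 0, 1, 1]   # a pure relabeling of the true partition

# NMI is invariant to label permutation, so a relabeled perfect clustering
# still scores 1.
nmi = normalized_mutual_info_score(true, pred)
```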

Silhouette coefficient
The silhouette coefficient (SC) (Zhao et al 2018) was developed as a graphical tool for validating clustering results. This metric evaluates the degree of separation between clusters by measuring cluster tightness and separation. For a cell i, SC is given as

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},$$

where a(i) is the average dissimilarity of cell i to the other cells within its cluster and b(i) is the average dissimilarity of cell i to the cells in its nearest neighboring cluster. The score ranges from −1 to 1, where a larger value indicates a higher degree of separation among clusters.
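Unlike ARI and NMI, the silhouette coefficient needs only the data and the predicted labels, which is why it is the one metric usable on the unlabeled Yang data. A small sanity check with scikit-learn:

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two well-separated blobs: correct labels give a score near 1,
# shuffled labels give a score near 0.
X = np.vstack([rng.normal(0, 0.3, size=(30, 2)),
               rng.normal(5, 0.3, size=(30, 2))])
labels = np.repeat([0, 1], 30)

good = silhouette_score(X, labels)                  # tight, separated clusters
bad = silhouette_score(X, rng.permutation(labels))  # random assignment
```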

UMAP
UMAP is a visualization method for clustering results (McInnes et al 2018). This method performs dimension reduction on the feature matrix and represents the cells in a two-dimensional setting. Since UMAP uses a framework based on Riemannian geometry, the structure of the original data in a reduced dimension can be preserved. In UMAP visualization, cells are visualized in a two-dimensional setting and labels are plotted in different colors to present the clustering results.

Clustering performance in real datasets
As discussed in section 2.3, we apply scGMM-VGAE and four baseline clustering methods (i.e. Seurat, DCA + Leiden, scVI, and CellVGAE) to three real labeled datasets (i.e. Baron3 data, Baron4 data, and Darmanis data) and one real unlabeled dataset (i.e. Yang data). To evaluate the performance of scGMM-VGAE, we compare the results to the four baseline clustering methods based on the two true-label-based metrics (i.e. ARI and NMI) and the internal metric silhouette coefficient (SC). Figure 2 illustrates the clustering results of the three real labeled datasets (panels (A)-(C)) and one unlabeled dataset (panel (D)). For scGMM-VGAE, we use all six parameter settings (table 3) for each measurement. For the labeled data, scGMM-VGAE outperforms the other four baseline methods on all three evaluation metrics, although these labeled datasets contain many zero values. The second-best results come from either CellVGAE or Seurat, depending on the dataset and metric. Specifically, the ARI and NMI results of scGMM-VGAE surpass those of the second-ranked method by 0.02-0.06 and 0.01-0.1, respectively. These findings show that scGMM-VGAE produces the best results although a significant number of zeros are present in the scRNA-seq data.
On the other hand, to compare the clustering performance of scGMM-VGAE with four baseline clustering methods on unlabeled Yang data, only SC metric is used for evaluation. The clustering results for Yang data are illustrated in figure 2(D). As shown in the histogram, scGMM-VGAE is the best candidate for cell clustering in comparison to four other methods on Yang data.

UMAP visualization
We consider UMAP visualization to demonstrate the clustering results of the proposed scGMM-VGAE and the four selected baseline methods in comparison to the true labels (i.e. Truth). The UMAP visualizations of Baron3 data for each method (figure 3) show that DCA + Leiden overestimates the number of clusters: compared to the true labels, it places many extra clusters at the top of the UMAP plot, a region that the other methods cluster correctly. As seen in the bottom-right clusters, the Seurat and DCA + Leiden methods produce over-clustered results with more clusters than the truth labels, while the remaining methods align well with the truth.
Similar results can be observed in the Baron4 dataset. Over-clustering occurs with both the DCA + Leiden and Seurat methods, with DCA + Leiden showing more over-clustering in both the top and bottom sections. UMAP visualizations of the clustering results for Baron4 data and Darmanis data can be found in figures S1 and S2.

Clustering performance in simulated datasets
To further evaluate the performance and stability of scGMM-VGAE in cell clustering, we apply the internal metric SC along with the label-based ARI and NMI metrics on the simulated datasets with three different cell numbers generated by SPARSim using the Zheng preset (section 2.1), as summarized in table 2. For each size (i.e. number of cells) of the simulated data, we generate ten copies of the dataset and calculate the mean and standard deviation of each metric. The histograms illustrating the clustering performance of the five methods on the three simulated datasets are presented in figure 4. Similar to the previous results on real labeled and unlabeled datasets, scGMM-VGAE consistently shows the best results among the five methods on all simulated data for all three metrics. CellVGAE achieves the second-best results for ARI, NMI, and SC. Specifically, for the 1694-cell data, the ARI, NMI, and SC of scGMM-VGAE surpass those of CellVGAE by 0.02, 0.04, and 0.02. For the 3388-cell data, scGMM-VGAE produces results similar to CellVGAE; the difference in each metric score between the two methods is within 0.005. As for the 6776-cell data, scGMM-VGAE surpasses CellVGAE by 0.02 for ARI, 0.04 for NMI, and 0.25 for SC. To test the clustering stability of scGMM-VGAE across datasets of different sizes, we group all the simulation results of scGMM-VGAE by evaluation metric, as shown in figure 5. The results for the three different sizes of simulated data are almost the same for all three metrics. Thus, scGMM-VGAE shows great stability.

Time efficiency
When analyzing the largest simulated dataset (i.e. the 6776-cell data), we record the time per run for each algorithm, as summarized in table 3. The scGMM-VGAE achieves 2.75 min per run, approximately 2.5 times faster than the other VGAE-based algorithm, CellVGAE (7 min per run). Seurat is the fastest of the five algorithms at less than 10 s per run, since it does not require a deep neural network. However, among the deep learning-based methods, scGMM-VGAE is one of the fastest.
The results were collected using CUDA to shorten the run time. All runs in this study were conducted on an NVIDIA GTX 1050 GPU with 2 GB of VRAM. The Python and R versions are 3.7 and 4.1.2, respectively.

Discussion and conclusion
Based on the clustering results of the labeled, unlabeled (section 3.1), and simulated data (section 3.3), we can conclude that scGMM-VGAE shows better performance than the other selected baseline methods for all selected datasets. Specifically, scGMM-VGAE outperforms the other VGAE-based clustering method, CellVGAE, which achieves the second-best results for most datasets. Thus, this study confirms that scGMM-VGAE can be effectively used for clustering scRNA-seq data containing a large number of biological and technical zeros, since the algorithm incorporates a GMM into the latent data generated by the VGAE encoder to further enhance clustering performance. From the stability and time efficiency tests, we also observe that scGMM-VGAE shows excellent running speed. In conclusion, scGMM-VGAE successfully combines an unsupervised deep learning model (VGAE) with a statistical model (GMM) to enhance the clustering performance on scRNA-seq data.
The algorithm requires a predefined number of clusters for initialization. However, knowing the exact number of cell types in advance is difficult, and experienced determination is not always reliable. For the unlabeled data, we tried different numbers of clusters as input, and chose the final number of clusters that generated the largest silhouette score value. There are many other approaches that have been developed to estimate the number of clusters in scRNA-seq data, we will further evaluate them in future research.
Another limitation is that only the top 1200 genes were selected in our study. Since scRNA-seq data are high-dimensional and some features may be quite noisy, using different numbers of top genes in cell clustering produces different feature matrices, which may affect the final clustering results. In addition to the top 1200 genes, we also evaluate the performance of the proposed method using the top 800, 1000, 1400, and 1600 genes on the Darmanis data, measured by ARI, NMI, and SC. The ARI and NMI values for the top 1200 genes are the highest, at 0.7981 and 0.7923, respectively. Thus, we use the top 1200 genes as the cutoff for analyzing all the datasets. Detailed results are reported in table S3 of the supplementary information. In future applications, we will consider more parameter settings and numbers of selected genes, and aim to extend scGMM-VGAE so that a predefined number of clusters is not required. We will further evaluate model stability using other scRNA-seq simulators. We also aim to explore other possibilities for utilizing the scGMM-VGAE model in biological data analysis (e.g. clustering microbiome data).

Data availability statement
All data that support the findings of this study are included within the article (and any supplementary files).

Funding
This work has been supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Canada Research Chairs Tier II Program. P H is the holder of Manitoba Medical Services Foundation (MMSF) Allen Rouse Basic Science Career Development Research Award.